eMpTy Pages

2025 in Review and the Year Ahead

2025-12-31T11:22:00.000-08:00

Looking back at 2025, the 'AI Revolution' often felt a lot more like an 'AI Science Fair' than real progress. We saw many interesting experiments, but if we’re being honest, production-ready deployments were surprisingly hard to find.

The primary reason for this gap is the "Chatbot Trap." While AI tools are easy to start using, that simplicity is deceptive. Achieving real business impact requires more than a chat interface; it requires transforming core business workflows with the same engineering rigor and discipline applied to any mission-critical automation.

Close examination of the lack of success revealed at least four contributing factors for the high number of failed pilot programs. It’s easy to get a bot to talk, but it’s an entirely different beast to make it work. According to experts, here’s what’s actually holding things back:

1. Focus on the Wrong Problem: AI is suitable for some but not all business challenges. If data is not available to reenvision and enhance business processes, AI is unlikely to deliver successful outcomes.

2. Lack of Engineering Discipline: Treating AI as a "plug-and-play" tool rather than a complex system that requires careful design by technical experts, and ongoing evolutionary investments.

3. Superficial Technical Knowledge: A failure to deeply understand the tools and their limitations.

4. Unrealistic Executive Expectations: Expecting instant results without doing the necessary groundwork to ensure that all the pieces align.

True success requires deliberate alignment among the business problem, available data, friction in current processes, and the technical expertise of the development team. Most importantly, high-value automation comes from redesigning processes from the ground up, rather than simply "lifting and shifting" manual tasks into a digital format.

Some of the key themes that stood out in the industry in 2025 include:

1) LLM MT Outperforms NMT (In Research, But Not Yet in Production)

Leading industry research, most notably from WMT25, has established that LLM-based translation (using models such as Gemini, Claude, and OpenAI's GPT family) consistently outperforms traditional NMT, something we also see with Lara Translate. Despite this clear technical superiority, the industry has been slow to switch to LLM-only production. Why the lag?

Industry adoption is lagging, not because the tech isn't better (it is), but because we're staring down massive technical debt. Retrofitting 20-year-old workflows for LLMs is expensive, complex, and, frankly, a bit of a headache for LSPs, localization teams, and IT teams. The familiar data, processes, and workflows do not align with an LLM-first approach.

Thus, instead of a full transition, many organizations have settled on "hybrid" systems, where an LLM further refines NMT output. While intended as a functional and reliable compromise, this approach has created significant issues:

Operational Heaviness: Combining Translation Memory (TM), NMT, Quality Estimation (QE), and Post-Editing (PE) creates an overly complex production environment.
Diminishing Returns: This complexity adds significant management costs and technical debt without necessarily delivering the tangible business value, increased speed, or lower costs that marketing and product leaders expect.

2) Will Language AI Eliminate or Reduce Professional Translation Opportunities?

As Large Language Model (LLM) translation quality continues to improve, professionals are understandably concerned about the future of the industry. While AI handles general business content exceptionally well, the landscape of professional translation is shifting rather than disappearing.

The Current Limits of AI

Despite the hype, human expertise remains essential in at least three specific areas:

Domain Specialization: Highly technical, legal, or creative content still requires human nuance and deep subject-matter expertise.
Low-Resource Languages: Most LLMs only excel in the top 30 global languages where training data is abundant. For the thousands of other languages, AI performance remains unreliable.
Emerging Use Cases: Human expertise in analysis, research, and guidance remains essential for implementing automated translation in specialized domains.

The Opportunity in "Latent Demand"

A common mistake is viewing the translation market as a "fixed pie." In reality, there is a massive amount of latent demand for content that needs to be, or could be, translated but currently isn't.

Statistics from CSA Research show that the sheer volume of content that could be translated is staggering. CSA states that 11.36 exabytes of textual content are generated globally every single day. Of what is translated, 99% is handled by machines; humans handle less than 1%. The truth is that only a teeny tiny portion (0.00000389%) of the world's daily text is currently translated at all.
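To make the scale of these figures concrete, here is a rough back-of-the-envelope calculation. It uses only the two numbers quoted above and assumes a decimal exabyte (10^18 bytes); the variable and function names are purely illustrative.

```python
# Figures quoted in the text (attributed to CSA Research).
DAILY_TEXT_EXABYTES = 11.36
TRANSLATED_SHARE_PERCENT = 0.00000389

# Assumption: decimal (SI) units, 1 EB = 10**18 bytes and 1 GB = 10**9 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_GIGABYTE = 10**9


def translated_bytes_per_day(daily_exabytes: float, share_percent: float) -> float:
    """Bytes of text translated per day, given the daily volume and the translated share."""
    return daily_exabytes * BYTES_PER_EXABYTE * share_percent / 100


translated_gb = (
    translated_bytes_per_day(DAILY_TEXT_EXABYTES, TRANSLATED_SHARE_PERCENT)
    / BYTES_PER_GIGABYTE
)
# How many bytes of text are generated for every byte that gets translated.
generated_per_translated = 100 / TRANSLATED_SHARE_PERCENT

print(f"Translated per day: about {translated_gb:.0f} GB")
print(f"Generated for every byte translated: about {generated_per_translated:,.0f} bytes")
```

Under these assumptions, the translated share works out to a few hundred gigabytes a day against more than eleven exabytes generated, which is why the "fixed pie" view misses most of the picture.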

The Future Outlook

From Translators to Architects: We’re likely looking at a 100x explosion in translation demand. As we start making more content visible in high-resource languages and addressing hundreds of "low-resource" languages, the job description is going to change. We won't be "word-for-word" translators anymore. We’re becoming Strategic Language Architects: the ones who design the systems and oversee the flows that keep this massive amount of information accurate and culturally on-point.

3) The Evolution of Translation Memory: Moving Beyond String Matching

For over 45 years, Translation Memory (TM) has been the backbone of the industry. It is a database technology that matches text strings, storing human translations as isolated segments for reuse later. While TM was essential for developing Statistical and Neural MT (NMT), it is increasingly viewed as an outdated approach when paired with modern Large Language Models (LLMs) like Lara.

Why TM is No Longer Enough

The traditional practice of relying on "100% TM matches" is becoming suboptimal. Here is why the industry is shifting:

Context Over Matches: We now have clear evidence from the large-scale use of Lara that providing an LLM with richer context (the surrounding text, tone, and intent) produces far better results and higher efficiency than simply inserting a pre-translated string from a database.
Segment Isolation: TM stores segments in isolation. LLMs, however, excel when they can "understand" the relationship between sentences and paragraphs and other in-use context that a standard TM cannot provide.
Arcane Architecture: Using a 45-year-old string-matching tool to power a cutting-edge LLM MT model limits the system's potential.
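As a rough illustration of "context over matches," the sketch below assembles a prompt that gives a model the sentences around the one being translated, plus an optional tone hint. It is not how Lara or any specific product works; the function name, the prompt wording, and the `window` parameter are all assumptions made for this example, and no model is actually called.

```python
from typing import List, Optional


def build_context_prompt(
    segments: List[str],
    index: int,
    target_lang: str,
    tone: Optional[str] = None,
    window: int = 2,
) -> str:
    """Build a translation prompt for segments[index] that includes neighboring text.

    A TM lookup would see only segments[index]; here the model also sees up to
    `window` sentences on either side (hypothetical prompt layout).
    """
    before = segments[max(0, index - window):index]
    after = segments[index + 1:index + 1 + window]

    lines = [f"Translate the marked sentence into {target_lang}."]
    if tone:
        lines.append(f"Tone: {tone}")
    if before:
        lines.append("Preceding text: " + " ".join(before))
    lines.append("Sentence to translate: " + segments[index])
    if after:
        lines.append("Following text: " + " ".join(after))
    return "\n".join(lines)


document = [
    "The bank was steep.",
    "We sat on it for an hour.",
    "Then we walked home.",
]
print(build_context_prompt(document, 1, "French", tone="informal"))
```

Translated in isolation, "We sat on it for an hour." gives no clue what "it" is; with the preceding sentence included, the model has a chance to resolve the reference.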

Looking Toward 2026: A New Data Architecture

The industry is reaching a consensus: while TM still has its uses, we need a more sophisticated, context- and metadata-rich data architecture.

To unlock the full power of LLMs, we must move toward systems that store not just "what" was translated, but "how" and "why," including style guides, situational metadata, and document-level context. Expect this transition to be a major topic of debate and innovation throughout 2026.
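One way to picture such a record is sketched below: a translated segment stored together with its "how" and "why." This is only an illustration of the idea; the class and every field name are hypothetical and do not describe any existing TM or TMS format.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class TranslationRecord:
    """A context-rich alternative to a bare TM segment pair (hypothetical schema)."""

    # The "what": the classic TM payload.
    source: str
    target: str
    source_lang: str
    target_lang: str
    # The "how": document-level context and style guidance.
    document_id: Optional[str] = None
    preceding_source: Optional[str] = None
    style_guide: Optional[str] = None
    # The "why": the reason a particular rendering was chosen.
    rationale: Optional[str] = None
    # Situational metadata such as audience, channel, or product.
    metadata: dict = field(default_factory=dict)


record = TranslationRecord(
    source="Submit",
    target="Envoyer",
    source_lang="en",
    target_lang="fr",
    document_id="checkout-ui",
    preceding_source="Review your order.",
    style_guide="Use the infinitive for button labels.",
    rationale="Button label, not a verb in running text.",
    metadata={"channel": "mobile app", "audience": "consumers"},
)
print(asdict(record))
```

A plain TM entry would keep only the first four fields; the remaining ones are what a downstream model or reviewer could use to reproduce the same choice in a new context.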

4) The Reality of Translation AI – ChatGPT has Not “Solved” the Translation Problem

It’s easy to look at generic AI and think the "translation problem" is a thing of the past. It isn’t. Even with data-rich languages like French or Spanish, a quick stress test reveals that we still have a long way to go. While generic models work well for a quick email, they often stumble when tasked with complex enterprise material, specialized scientific data, or esoteric knowledge. They lack the precision required for high-stakes, technical, or highly niche content.

The reality is that generic LLM translation capabilities lack the robustness and adaptability required for high-stakes business environments. To bridge this gap, we need specialized, translation-optimized solutions like Lara Translate. These tools don't just provide a "basic translation"; they offer the personalization and precision that professionals actually need to do their jobs.

What Makes Specialized AI Like Lara Translate Different?

Professionals require more than just "good enough" text. They need a system that acts as a sophisticated assistant, capable of the following:

Deep Customization: Leveraging your existing linguistic assets (like Translation Memories) to fine-tune results at a high level.
Domain Expertise: Learning the specific terminology and unique stylistic "voice" of your business. The ability to improve with ongoing use and experience is a highly valued attribute for such a system.
File Versatility: Processing everything from PDFs and slide decks to spreadsheets, social media posts, and internal chats without breaking the formatting.
Dynamic Learning: Evolving rapidly as you provide corrective feedback, ensuring the AI learns your personal stylistic and domain preferences over time.
Quality Transparency: Providing instant feedback on translation quality to ensure fidelity in shared multilingual communications and allowing for "on-the-fly" modifications based on the specific intent of the message.
Creative Alternatives: Offering multiple ways to phrase critical sentences, which is essential for properly tuning high-value content that might have a high communication impact.

Looking Ahead

Translation AI will continue to evolve rapidly. In the coming year, we should expect products like Lara Translate to become even more intuitive. These tools aren't here to replace the human touch; they are here to enhance and amplify it. By removing the friction of language barriers, they allow hundreds of millions of business professionals to become effectively multilingual with minimal effort.

Merry Xmas, Happy Holidays, and a Happy New Year to all.

A View from the GALA 2025 Conference

2025-04-16T21:14:00.000-07:00

These are uncertain times for many in the language services and localization industry. There was a palpable air of concern and angst in Montreal. This is to be expected given all the changes that we face from so many directions:

· Disruption of established government and trade policies
· AI hype in general is threatening many white-collar jobs
· Unrealistic expectations from C-suite leaders about what AI technology can actually deliver
· An emerging global economic slowdown after an already tough business year
· High levels of economic and business uncertainty

The day after the conference, I saw the following in my inbox from CSA Research:

I also saw an announcement for an upcoming webinar from Women In Localization with the theme "Maintaining motivation during disruption," which added the tagline, "with constant change, staying motivated can be hard." There is concern in the industry far beyond the community present at GALA.

However, the keynote presentation by Daniel Lamarre, CEO of the Cirque du Soleil Entertainment Group, provided a memorable, uplifting, and inspiring message to the attendees. I rate it as one of the best, if not THE best, keynotes in all the years I have been attending localization conferences. His message was relevant, authentic, and realistically optimistic while speaking to the heart.

He is uniquely qualified to speak to a doomy, gloomy audience, as he has also faced challenges and has risen from what seemed insurmountable odds. In response to pandemic shutdowns in March 2020, Cirque du Soleil suspended all 44 active shows worldwide and temporarily laid off 4,679 employees, 95% of its workforce. Annualized revenue dropped from over $1 billion to zero almost overnight. And today, Cirque has to work to remain relevant to a digitally obsessed world where many young people have never experienced a circus.

He engineered a recovery, and by early 2023/2024, revenue had climbed back to the pre-pandemic level of approximately $1 billion, though growth is expected to moderate around this level for the next couple of years. According to financial market observers, leadership stated that the recovery exceeded expectations.

For someone whose primary focus is to find outstanding artists from around the world, provide them with a regular living, and curate entertainment that leaves the audience enthralled and inspired, he had a clear understanding of the challenges that business translation professionals might have in this age of AI madness. These challenges are somewhat similar to what his organization faced during the pandemic, when gathering large audiences to watch a magical, musical, circus-like performance in 45 cities across the world was impossible.

The heart of his message was about building the right mindset to break through the challenges we face, which he said begins with continual investment in research and development and a strong focus on creativity. This is very much the ethos of Cirque, and it pervades their overall approach and culture. A summary of the highlights of his message follows:

· Creativity is foundational since it leads to innovation which in turn often results in market leadership.
· Ongoing and regular reflection is essential to building creativity.
· Deep curiosity and the questions that it generates are a building block to discovering successful outcomes.
· While it is important to focus on the problem to get a clear definition of the challenge, it is even more important to focus creatively on possible solutions.
· Investment in research and development (e.g., AI impact on translation-related processes) and organizational creativity is essential to finding your value-add in challenging times.
· Debate is a foundation of evolutionary creativity, and a culture that encourages debate is most likely to find the best outcomes and the best ideas that are not possible with hierarchical mandates.
· Finding a purpose is more likely to create successful outcomes than goals and objectives.
· The best ideas will come from a clear mandate, debate, and a defined sandbox rather than through unstructured, frequent meetings with rambling, unfocused discussions.
· He encouraged the audience to focus much more on communicating the value-add of the business.

I’m not sure I have captured the essence effectively, and as they say, “you had to be there,” as he was successful in lifting everyone’s spirits. His ability to build connections and empathy was indeed unique. He closed by encouraging the GALA community to make more concerted and active efforts to raise their profile and communicate LSP/localization value creation characteristics in the marketplace aggressively, given the unrelenting AI hype.

Raising the LSP Industry Profile

This keynote triggered a recurring theme that attendees raised across several sessions about finding a better way to describe the value of service/product offerings to the marketplace. There were different opinions and views on whether translation, localization, langops, or something else would be the most effective professional self-descriptor to build a value-oriented communication message.

I decided to do some basic research on this subject, via LLM, and I have mixed feelings about the utility of the output, as it lacks insight and understanding. I summarize the unedited responses from three different LLMs (which all had very similar results) below:

GPT 4.1 Summary

Normalized Google Trends Frequency (2022-2025)

Term	Avg. Absolute Monthly Search Volume	Normalized Monthly Searches (0-100 Scale, 3-Year Mean)
Translation	1,000,000	79.9
Localization	200,000	40.1
LangOps	2,000	5.2

Which term provides the greatest reach?

Translation is by far the most recognized and searched term globally. It is used by major platforms like Google, DeepL, and Microsoft for their consumer-facing services, which reinforces its dominance and public familiarity.
Localization is important for industry professionals and clients needing cultural adaptation and more sophisticated services, but it has a narrower audience.
LangOps is emerging in industry circles as a concept for scalable, AI-driven language operations, but its search volume and public awareness remain very low.

Key Points:

If your goal is maximum visibility and broad customer acquisition, “translation” is the most effective term. It captures the widest audience, aligns with consumer expectations, and is the default for everyday users seeking language services.
“Localization” is valuable for targeting clients who require cultural adaptation and market-specific solutions, but it should be used as a supporting term rather than the primary one.
“LangOps” is best reserved for thought leadership, technical blogs, or when targeting enterprise clients already familiar with advanced localization operations.

“Translation” is the term with the greatest and widest reach for LSPs seeking to increase visibility and attract a broad customer base. It is the industry standard, the most searched, and the most recognized by both consumers and businesses. Using “translation” as your primary keyword will maximize your discoverability and support value-driven messaging for the widest possible audience.

Claude Sonnet 3.7 Summary

Search Term Frequency Analysis

The Google Trends data shows significant differences in search interest across the three terms. "Translation" consistently maintains the highest search interest, with values ranging from approximately 80-85 on the relative search interest scale (0-100). "Localization" shows moderate interest, with values between 30 and 35, while "LangOps" demonstrates minimal interest, with values hovering around 2-3.

Translation

"Translation" consistently dominates the search landscape with approximately 2.5 times more search interest than "localization" and about 40 times more interest than "LangOps." This term has maintained its strong position throughout the three years with a slight upward trend.

Localization

"Localization" maintains a consistent moderate interest level, showing some fluctuation but generally staying within the 30-35 range on Google's scale. While significantly lower than "translation," it still demonstrates substantial search volume that has remained stable over the three years.

LangOps

"LangOps" shows minimal search interest, consistently remaining below 5 on Google's scale. Despite being positioned as "the future of scalable, efficient, AI-powered localization" and "the next evolution of localization," it has not gained significant traction in search behavior.

Building Value-Oriented Messaging

To build strong value-oriented messaging:

1. Lead with "translation" in primary marketing materials, website headers, and SEO strategy to maximize visibility and reach.

2. Differentiate through "localization" by explaining how your services go beyond basic translation to include cultural adaptation, which adds significant value.

3. Introduce "LangOps" as an advanced concept for enterprise clients, positioning it as the strategic integration of language services into business operations.

4. Bridge the terminology gap by explaining how translation evolves into localization and ultimately into LangOps as part of a maturity model for global businesses.

Perplexity Pro Summary

This output is essentially the same as the data already provided above, as it probably draws its answers from these models. However, this summary seems to be more accommodating of the potential of LangOps, quite possibly because of the extensive use of the term AI in its descriptive materials. It also analyzed each term and its implications in more detail.

Strategic Term Selection for LSPs
Which Term Provides the Greatest Reach?

For an LSP seeking to increase visibility and reach a broader customer base, the choice of terminology is strategically important:

Translation: Maximum Reach, Minimum Differentiation

"Translation" would provide the widest potential reach due to its universal recognition. It's the term used by major platforms like Google, DeepL, and Microsoft, indicating its mainstream appeal.

Advantages:

Highest search volume and recognition
Understood by both technical and non-technical audiences
Already associated with established tools and services

Disadvantages:

Highly competitive space
May position services as commoditized
Doesn't communicate advanced capabilities beyond basic translation

Localization: Balanced Reach with Differentiation

"Localization" represents a middle ground, with growing recognition and the ability to communicate more sophisticated services.

Advantages:

Communicates a more comprehensive approach than basic translation
Established industry term with growing recognition
Aligns with current industry trends for 2025
Suggests cultural adaptation and market-specific expertise

Disadvantages:

Lower search volume than "translation"
May not capture the attention of clients seeking basic translation services

LangOps: Niche Appeal with Maximum Differentiation

"LangOps" offers cutting-edge positioning but with significantly limited reach.

Advantages:

Positions the company at the forefront of industry evolution
Communicates AI integration and advanced capabilities
Potential to attract forward-thinking enterprise clients
Lower competition for the term itself

Disadvantages:

Very limited search volume and recognition
Requires explanation to most potential clients
May alienate clients seeking traditional service

Perplexity Conclusion

For LSPs seeking to maximize market reach and build value-oriented messaging, "translation" remains the most effective foundational term due to its universal recognition and substantial search volume. However, a strategic approach that leverages all three terms at appropriate stages of the customer journey will likely yield the best results.

As the industry continues to evolve, LSPs should monitor the relative popularity of these terms and adjust their terminology strategies accordingly. The rapid growth of "LangOps," despite its currently smaller search volume, suggests that forward-thinking providers may benefit from early adoption of this terminology when targeting enterprise clients seeking comprehensive language solutions.

Ultimately, the most successful terminology strategy will align with both current search behaviors and the evolving direction of the industry, positioning LSPs at the intersection of what clients are looking for today and what they'll need tomorrow.

AI in Enterprise Localization Panel

I was also involved in an interactive panel organized by Johan Sporre with Britta Aagaard, Gaëtan Chrétiennot, Georg Kirchner, and Konstantin Savenkov, who auto-summarized the session with GPT. We discussed misconceptions, opportunities, and the changing role of humans.

Here is the auto-summary:

🔹 AI is not just a better translation tool. It’s a set of technologies that require the right setup, people, and processes to work.
🔹 Many AI deployments in the enterprise are not delivering ROI. Localization is one of the few areas where AI shows clear value—but only when applied with care.
🔹 Clients now care about language in a new way. That opens the door to conversations we couldn’t have before—across IT, marketing, and other teams.
🔹 The real work is not about chasing new buzzwords. It’s about understanding complexity and helping others navigate it.
🔹 Our role is changing—from translation providers to solution architects, guiding AI through data, process, and purpose.

Also, a shoutout to Marina Pantcheva, who gave an instructive and entertaining presentation, which somehow managed to make Cleaning Dirty TM sound fun.

Congratulations to Allison Ferch and the GALA team for holding a successful and substantial conference in such difficult and tumultuous times.

The Evolving LLM Era and its Potential Impact

2024-12-18T10:21:00.004-08:00

With the advent of Large Language Models (LLMs), there are exciting new possibilities available. However, we also see a large volume of mostly vague and poorly defined claims of "using AI" by practitioners with little or no experience with machine learning technology and algorithms.

The noise-to-signal (hype-to-reality) ratio has never been higher, and much of the hype fails to meet real business production use case requirements. Aside from the data privacy issues, copyright problems, and potential misuse of LLMs by bad actors, hallucinations and reliability issues also continue to plague LLMs.

Enterprise users expect production IT infrastructure output to be reliable, consistent, and predictable on an ongoing basis, but there are very few use cases where this is currently possible with LLM output. The situation is evolving, and many expect that the expert use of LLMs could have a dramatic and favorable impact on current translation production processes.

There are several areas in and around the machine translation task where LLMs can add considerable value to the overall language translation process. These include the following:

LLM translations tend to be more fluent and capture more contextual information, albeit in a smaller set of languages
Source text can be improved and enhanced before translation to produce better-quality translations
LLMs can carry out quality assessments on translated output and identify different types of errors
LLMs can be trained to take corrective actions on translated output to raise overall quality
LLM MT is easier to adapt dynamically and can avoid the large-scale retraining that typical static NMT models require

At Translated, we have been carrying out extensive research and development over the past 18 months into these very areas, and the initial results are extremely promising, as outlined in our recent whitepaper.

The chart below shows some evidence of our progress with LLM MT. It compares Google (static), DeepL (static), Lara RAG-tuned LLM MT, GPT-4o (5-shot), and ModernMT (TM access) for nine high-resource languages. These results for Lara are expected to improve further.

One approach involves using independent LLM modules to handle each category separately. The other approach is to integrate these modules into a unified workflow, allowing users to simply submit their content and receive the best possible translation. This integrated process includes MTQE as well as automated review and post-editing.

While managing these tasks separately can offer more control, most users prefer a streamlined workflow that focuses on delivering optimal results with minimal effort, with the different technology components working efficiently behind the scenes.

LLM-based machine translation will need to be secure, reliable, consistent, predictable, and efficient for it to be a serious contender to replace state-of-the-art (SOTA) NMT models.

This transition is underway but will need more time to evolve and mature.

Thus, SOTA Neural MT models may continue to dominate MT use in enterprise production scenarios for the next 12-15 months, except where the highest quality automated translation is required.

Currently, LLM MT makes the most sense in settings where high throughput, high volume, and a high degree of automation are not a requirement and where high quality can be achieved with reduced human review costs enabled by language AI.

Translators are already using LLMs for high-resource languages for all the translation-related tasks previously outlined. It is the author’s opinion that there is a transition period where it is quite plausible that both NMT and LLM MT might be used together or separately for different tasks in new LLM-enriched workflows. NMT will likely perform high-volume, time-critical production work as shown in the chart below.

In the scenario shown above, information triage is at work. High-volume content is initially processed by an adaptive NMT model, followed by an efficient MTQE process that sends a smaller subset to an LLM for cleanup and refinement. These corrections can be sent back to improve the MT model and increase the quality of the MTQE (not shown in the diagram above).
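The triage flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Translated's implementation: the three callables stand in for an adaptive NMT engine, an MTQE model, and an LLM refinement step, and the 0.8 threshold is an arbitrary placeholder.

```python
from typing import Callable, List, Tuple


def triage(
    segments: List[str],
    translate: Callable[[str], str],
    estimate_quality: Callable[[str, str], float],
    refine: Callable[[str, str], str],
    threshold: float = 0.8,
) -> List[Tuple[str, str, str]]:
    """Translate everything with NMT, then send only low-scoring output to the LLM.

    Returns (source, final_translation, route) tuples, where route is
    "nmt" or "llm-refined".
    """
    results = []
    for source in segments:
        draft = translate(source)
        score = estimate_quality(source, draft)
        if score >= threshold:
            results.append((source, draft, "nmt"))
        else:
            results.append((source, refine(source, draft), "llm-refined"))
    return results


# Toy stand-ins so the sketch runs without any real engines.
def toy_nmt(source: str) -> str:
    return source.upper()

def toy_mtqe(source: str, draft: str) -> float:
    return 0.9 if len(source) < 20 else 0.5

def toy_llm_refine(source: str, draft: str) -> str:
    return draft + " [refined]"


for row in triage(
    ["Hello.", "This sentence is long enough to be flagged."],
    toy_nmt, toy_mtqe, toy_llm_refine,
):
    print(row)
```

The feedback loop mentioned in the text, where corrections flow back into the MT model and the MTQE, is left out here; in a real system the "llm-refined" rows would be the natural source of that feedback.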

However, as LLMs get faster and it is easier to automate sequences of tasks, it may be possible to embed both an initial quality assessment and an automated post-editing step together for an LLM-based process to manage.

An emerging trend among LLM experts is the use of agents. Agentic AI and the use of agents in large language models (LLMs) represent a significant evolution in artificial intelligence, moving beyond simple text generation to create autonomous, goal-driven systems capable of complex reasoning and task execution.

AI agents are systems that use LLMs as their core controller to autonomously pursue complex goals and workflows with minimal human supervision.

They potentially combine several key components:

An LLM core for language understanding and generation
Memory modules for short-term and long-term information retention
Planning capabilities for breaking down tasks and setting goals
An ability to iterate toward a goal
Tools for accessing external information and executing actions
Interfaces for interacting with users or other systems
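The components listed above can be seen working together in a toy loop like the one below. It is a deliberately minimal sketch: a hand-written `plan` function stands in for the LLM core, the memory is a plain list, and the tools are ordinary Python functions. All names are hypothetical and no real agent framework is implied.

```python
from typing import Any, Callable, Dict, List, Tuple

Memory = List[Tuple[str, Any]]


def run_agent(
    goal: str,
    plan: Callable[[str, Memory], Tuple[str, Any]],
    tools: Dict[str, Callable[[Any], Any]],
    is_done: Callable[[str, Memory], bool],
    max_steps: int = 5,
) -> Memory:
    """Iterate toward a goal: plan a step, call a tool, remember the result."""
    memory: Memory = []  # short-term memory of (tool_name, result) pairs
    for _ in range(max_steps):  # bounded iteration toward the goal
        if is_done(goal, memory):
            break
        tool_name, argument = plan(goal, memory)  # stand-in for the LLM core
        result = tools[tool_name](argument)       # tool use
        memory.append((tool_name, result))
    return memory


# Toy setup: clean up a string, then count its words.
toy_tools = {
    "strip": lambda text: text.strip(),
    "count_words": lambda text: len(text.split()),
}

def toy_plan(goal: str, memory: Memory) -> Tuple[str, Any]:
    if not memory:
        return ("strip", goal)
    return ("count_words", memory[-1][1])

def toy_is_done(goal: str, memory: Memory) -> bool:
    return len(memory) >= 2


print(run_agent("  hello world  ", toy_plan, toy_tools, toy_is_done))
```

In an actual agent, `plan` would be a model call that reads the goal and the memory and decides which tool to use next; the `max_steps` cap is one simple way to keep an autonomous loop from running indefinitely.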

One approach involves using independent LLM agents to address each of the categories below as separate and discrete steps.

The other approach is to integrate these steps into a unified and robust workflow, allowing users to simply submit content and receive the best possible output through an AI-managed process. This integrated workflow would include source cleanup, MTQE, and automated post-editing. Translated is currently evaluating both approaches to identify the best path forward in different production scenarios.

Agentic AI systems offer several advanced capabilities, including:

Autonomy: Ability to take goal-directed actions with minimal oversight
Reasoning: Contextual decision-making and weighing tradeoffs
Adaptive planning: Dynamically adjusting goals and plans as conditions change
Natural language understanding: Comprehending and following complex instructions
Workflow optimization: Efficiently moving between subtasks to complete processes

A thriving and vibrant open-source community will be a key requirement for ongoing progress. The open-source community has been continually improving the capabilities of smaller models and challenging the notion that scale is all you need. We see an increase in recent models that are smaller and more efficient but still capable and are thus often preferred for deployment.

All signs point to an exciting future where the capabilities of technology to enhance and improve human communication and understanding get better, and we are likely to see major advances in bringing an increasing portion of humanity into the digital sphere for productive, positive engagement and interaction.

The Evolution of AI Translation Technology

2024-12-17T15:12:00.004-08:00

Translated Srl is a pioneer in using MT in professional translation settings at a production scale. The company has a long history of innovation in the effective use of MT technology (an early form of AI) in production settings. It has deployed MT extensively across much of its professional translation workload for over 15 years and has acquired considerable expertise in doing this efficiently and reliably.

Machine Translation
IS Artificial Intelligence

One of the main drivers behind language AI has been the ever-increasing content volumes needed in global enterprise settings to deliver exceptional global customer experience. The rationale behind the use of language AI in the translation context has always been to amplify the ability of stakeholders to produce higher volumes of multilingual content more efficiently and at increasingly higher quality levels.

Consequently, we are witnessing a progressive human-machine partnership where an increasing portion of the production workload is being transferred to machines as technology advances.

Research analysts have pointed out that, even as recently as 2022-23, LSPs and localization departments struggled with using generic (static) MT systems in enterprises for the following reasons:

Inability to produce MT output at the required quality levels, most often due to a lack of the training data needed to see meaningful improvement.
Inability to properly estimate the effort and cost of deploying MT in production.
Static MT cannot adapt easily to the ever-changing needs and requirements of different projects, which creates a mismatch of skills, data, and competencies.

The Adaptive MT Innovation

In contrast to much of the industry, Translated has been a first mover in the production use of adaptive MT, dating back to the Statistical MT era. The adaptive MT approach is an agile and highly responsive way to deploy MT in enterprise settings, as it is particularly well-suited to rapidly changing enterprise use case scenarios.

From the earliest days, ModernMT was designed to be a useful assistant to professional translators to reduce the tedium of the typical post-editing (MTPE) work process. This focus on building a productive and symbiotic human-machine relationship has resulted in a long-term trend of continued improvement and efficiency.

ModernMT is an adaptive MT technology solution designed from the ground up to enable and encourage immediate and continuous adaptation to changing business needs. It is designed to support and enhance the professional translator's work process and increase translation leverage and productivity beyond what translation memory alone can. It is a continuous learning system that improves with ongoing corrective feedback. This is the fundamental difference between an adaptive MT solution like ModernMT and static generic MT systems.

The ModernMT approach to MT model adaptation is to bring the encoding and decoding phases of model deployment much closer together, allowing dynamic and active human-in-the-loop corrective feedback, which is not so different from the in-context corrections and prompt modifications we are seeing being used with large language models today.

It is now common knowledge that machine learning-based AI systems are only as good as the data they use. One of the keys to long-term success with MT is to build a virtuous data collection system that refines MT performance and ensures continuous improvement. This high-value data collection effort has been underway at Translated for over 15 years and is a primary reason why ModernMT outperforms competitive alternatives.

This is also a reason why it makes sense to channel translation-related work through a single vendor so that an end-to-end monitoring system can be built and enhanced over time. This is much more challenging to implement and deploy in multi-vendor scenarios.

The existence of such a system encourages more widespread adoption of automated translation and enables the enterprise to become efficiently multilingual at scale. Such a technological foundation allows the enterprise to break down language as a barrier to global business success.

The MT Quality Estimation & Integrated Human-In-The-Loop Innovation

As MT content volumes rapidly increase in the enterprise, it becomes more important to make the quality management process more efficient, as human review methods do not scale easily. It is useful for any multilingual-at-scale initiative to rapidly identify the MT output that most needs correction and to focus critical corrective feedback primarily on these lower-quality outputs. This enables the MT system to improve continually and raises overall quality across a large content volume.

The basic idea is to enable the improvement process to be more efficient by immediately focusing 80% of the human corrective effort on the 20% lowest-scoring segments. Essentially, the 80:20 rule is a principle that helps individuals and companies prioritize their efforts to achieve maximum impact with the least amount of work. This leveraged approach allows overall MT quality, especially in very large-scale or real-time deployments, to improve rapidly.

Human review at a global content scale is costly and probably a physical impossibility because of the ever-increasing volumes. As the use of MT expands across the enterprise to drive international business momentum and as more automated language technology is used, MTQE technology offers enterprises a way to identify which content needs the least and which needs the most human review and attention before it is released into the wild.

When a million sentences of customer-relevant content need to be published using MT, MTQE is a means to identify the ~10,000 sentences that most need human corrective attention to ensure that global customers receive acceptable quality across the board.
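The selection step described above can be sketched in a few lines. This is a minimal illustration, assuming each segment already has a numeric QE score where lower means worse; the function name `select_for_review`, the 1% review fraction, and the synthetic scores are hypothetical and are not part of any ModernMT or Translated API.

```python
import heapq
import random


def select_for_review(qe_scores, review_fraction=0.01):
    """Return indices of the lowest-scoring segments, worst first."""
    if not 0.0 < review_fraction <= 1.0:
        raise ValueError("review_fraction must be in (0, 1]")
    budget = max(1, round(len(qe_scores) * review_fraction))
    # nsmallest avoids fully sorting a very large corpus.
    worst = heapq.nsmallest(budget, enumerate(qe_scores), key=lambda pair: pair[1])
    return [index for index, _ in worst]


# Synthetic scores stand in for real MTQE output (assumption for illustration).
rng = random.Random(0)
scores = [rng.random() for _ in range(100_000)]
review_queue = select_for_review(scores, review_fraction=0.01)
print(len(review_queue))  # number of segments routed to human editors
```

With a corpus of one million segments and the same 1% fraction, the review queue would hold the roughly 10,000 segments mentioned above.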

This informed identification of problems that need to be submitted for human attention is essential to allow for a more efficient allocation of resources and improved productivity. This process enables much more content to be published without risking brand reputation, while ensuring that desired quality levels are achieved. In summary, MTQE is a useful risk management strategy as volumes climb.

Pairing content with lower MTQE scores into a workflow that connects a responsive, continuously learning adaptive MT system like ModernMT with expert human editors creates a powerful translation engine. This combination allows for handling large volumes of content while maintaining high translation quality.

When a responsive adaptive MT system is integrated with a robust MTQE system and a tightly connected human feedback loop, enterprises can significantly increase the volume of published multilingual content.

The conventional method, involving various vendors with different and distinct processes, is typically slow and prone to errors. However, this sluggish and inefficient method is frequently employed to enhance the quality of MT output, as shown below.

MTQE technology aims to pinpoint errors quickly and concentrate on minimizing the size of the data set requiring corrective feedback. The business goal centers on swiftly identifying and rectifying the most problematic segments.

Speed and guaranteed quality at scale are highly valued deliverables. Innovations that decrease the volume of data requiring review and reduce the risk of translation errors are crucial to the business mission.

The additional benefit of an adaptive rather than a generic MTQE process further extends the benefit of this technology by reducing the amount of content that needs careful review.

The traditional model of post-editing everything is now outdated.

The new approach entails translating everything and then only revising the worst and most erroneous parts to ensure an acceptable level of quality.

For example, suppose that reviewing the 40% of sentences with the lowest scores under a generic MTQE model identifies 60% of the major problems in a corpus. An adaptive MTQE model informed by customer data can identify 90% of the "major" translation problems in the same corpus by reviewing only the 20% of sentences with the lowest scores.
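One way to compare two QE models on this basis is to measure what share of the known major errors falls inside a given review budget. The sketch below assumes a small labeled sample where each segment has a QE score and a flag for major errors; the function name `major_error_recall` and all the numbers are made up for illustration and do not reproduce the percentages quoted above.

```python
def major_error_recall(qe_scores, has_major_error, review_fraction):
    """Share of all major errors found by reviewing the lowest-scoring fraction."""
    total_errors = sum(has_major_error)
    if total_errors == 0:
        return 0.0
    budget = round(len(qe_scores) * review_fraction)
    # Rank segment indices from lowest (worst) to highest QE score.
    ranked = sorted(range(len(qe_scores)), key=lambda i: qe_scores[i])
    found = sum(has_major_error[i] for i in ranked[:budget])
    return found / total_errors


# Toy corpus of 10 segments; True marks a segment with a major error.
has_major_error = [True, False, True, False, False, False, True, False, False, False]
# Hypothetical scores from a generic and an adaptive QE model.
generic_scores = [0.20, 0.30, 0.60, 0.10, 0.70, 0.80, 0.90, 0.50, 0.40, 0.95]
adaptive_scores = [0.10, 0.60, 0.20, 0.70, 0.80, 0.90, 0.30, 0.50, 0.40, 0.95]

generic_recall = major_error_recall(generic_scores, has_major_error, 0.4)
adaptive_recall = major_error_recall(adaptive_scores, has_major_error, 0.2)
print(f"generic @40%: {generic_recall:.2f}, adaptive @20%: {adaptive_recall:.2f}")
```

In this toy data the adaptive scores rank the problem segments lower, so a smaller review budget finds a larger share of the errors.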

This innovation greatly enhances the overall efficiency. The chart below shows how a process that integrates adaptive MT, MTQE, and focused human-in-the-loop (HITL) work together to build a continuously improving translation production platform.

The capability to enhance the overall quality of translation in a large, published corpus by analyzing less data significantly boosts the efficiency and utility of automated translation. An improvement process based on Machine Translation Quality Estimation (MTQE) is a form of technological leverage that benefits large-scale translation production.

The Evolving LLM Era and Potential Impact

The emergence of Large Language Models (LLMs) has opened up thrilling new opportunities. However, there is also a significant number of vague and ill-defined claims of "using AI" by individuals with minimal experience in machine learning technologies and algorithms. The disparity between hype and reality is at an all-time high, with much of the excitement not living up to the practical requirements of real business use cases. Beyond concerns of data privacy, copyright, and the potential for misuse by malicious actors, issues of hallucinations and reliability persistently challenge the deployment of LLMs in production environments.

Enterprise users expect their IT infrastructure to consistently deliver reliable and predictable outcomes. However, this level of consistency is not currently easily achievable with LLM output. As the technology evolves, many believe that expert use of LLMs could significantly and positively impact current translation production processes.

Comparing MT System Performance

2024-12-17T14:11:00.001-08:00

The advantages of a dynamic adaptive MT system are clarified in this post. Most static MT systems need significant upfront investment to enable adaptation. Adaptive systems like ModernMT have a natural advantage since the system is so easily adapted to customer domain and data.

Machine Translation (MT) system evaluation is necessary for enterprises that are considering expanding the use of automated translation to meet the growing information and communication needs of global customers. Managers need to understand which MT system is best for their specific use case and language combination, and which MT system will improve the fastest with their data and with the least effort to perform best for the intended use case.

What is the best MT system for my specific use case, and this language combination?

The comparative evaluation of the quality performance of MT systems has been problematic and often misleading because the typical research approach has been to assume that all MT systems work in the same way.

Thus, comparisons by “independent” third parties are generally made at the lowest common denominator level, i.e., the static or baseline version of the system. Focusing on the static baseline makes it easier for a researcher to line up and rank different systems, but it penalizes highly responsive MT systems that are designed to respond immediately to the user's focus and requirements and to optimize around user content.

Which MT system is going to improve the fastest with my unique data and require the least amount of effort to get the best performance for my intended use case?

Ideally, a meaningful evaluation would test a model on its potential capabilities with new and unseen data as it is expected that a model should do well on data it has been trained on and knows.

However, many third-party evaluations use generic test data that is scoured from the web and slightly modified. Thus, data leakage is always possible as shown in the center diagram below.

Issues like data leakage and sampling bias can cause AI to give faulty predictions or produce misleading rankings. Since there is no reliable way to exclude test data contained in the training data, this problem is not easily solved. Data leakage will cause overly optimistic results (high scores) that will not be validated or seen in production use.

This issue is also a challenge when comparing LLMs, especially since much of what LLMs are tested on is data that these systems have already seen and trained on. Some key examples of the problems that data leakage causes in machine translation evaluations include:

Overly optimistic performance estimates: Because the model has already seen some of the test data during training, its scores give a false impression of how well it will perform on real, unseen data.
Poor real-world performance: Models that suffer from data leakage often fail to achieve anywhere near the same level of performance when deployed on real-world data. The high scores do not translate to the real world.
Misleading comparisons between models: If some models evaluated on a dataset have data leakage while others do not, it prevents fair comparisons and identifying the best approaches. The leaky models will seem superior but not legitimately so.
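For data that an evaluator does control, such as an in-house translation memory and a test set drawn from it, a basic overlap check is straightforward. The sketch below flags test segments whose normalized source text also appears in the training data. The function names and the normalization rules are assumptions chosen for illustration, and this does not solve the harder case described above, where the training data of a third-party system is not visible.

```python
import re
import unicodedata


def normalize(segment):
    """Lowercase, unify Unicode forms, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", segment).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def leaked_segments(training_sources, test_sources):
    """Return test segments whose normalized form also appears in training data."""
    seen = {normalize(s) for s in training_sources}
    return [s for s in test_sources if normalize(s) in seen]


# Tiny made-up example: the first test segment differs only in case and punctuation.
training = ["Click the Save button.", "Open the file menu."]
test = ["click the save button", "Select a drawing template."]
leaks = leaked_segments(training, test)
print(f"{len(leaks)} of {len(test)} test segments overlap with training data")
```

Exact matching after normalization only catches near-verbatim duplicates; lightly paraphrased or "slightly modified" web data would need fuzzier matching.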

In addition, the evaluation and ranking of MT systems done by third parties is typically done using an undisclosed and confidential "test data" set that attempts to cover a broad range of generic subject matter. This approach may be useful for users who intend to use the MT system as a generic, one-size-fits-all tool but is less useful for enterprise users who want to understand how different MT systems might perform on their subject domain and content in different use cases.

Rankings on generic test data are unlikely to be useful for predicting actual performance in the enterprise domain. If the test data is not transparent, how can an enterprise buyer be confident that the rankings are valid for their use cases? These often irrelevant scores are used to select an MT system for production work, and thus the selections are often sub-optimal.

Unfortunately, enterprises looking for the ideal MT solution have been limited to third-party rankings that focus primarily on comparing generic (static) versions of public MT systems, using undisclosed, confidential test data sets that are irrelevant or unrelated to enterprise subject matter.

With the proliferation of MT systems in the market, translation buyers are often bewildered by the range of MT system options and thus resort to using these rankings to make MT system selections without understanding the limitations of the evaluation and ranking process.

What is the value of scores that provide no insight or detail on what the scores and rankings are based on?

Best practices suggest that users have visibility on what data is used to calculate the score for it to be meaningful or relevant.

Thus, Translated recently undertook some MT comparison research to answer the following questions:

What is the quality performance of an easily tuned and agile adaptive MT system compared to generic MT systems that require special adaptation efforts to accommodate and tune to typical enterprise content?
Can a comparative analysis be done using public-domain enterprise data so that a realistic enterprise case can be evaluated, and so that others can replicate, reproduce, and verify the results?
Can this evaluation be done transparently, by making test scripts publicly available so other interested parties can replicate and reproduce the results?
Additionally, can the evaluation process be easily modified so that comparative performance on other data sets can also be tested?
Can we provide a better, more accurate comparison of ModernMT's out-of-the-box capabilities against the major MT alternatives available in the market?

This evaluation further validates and reinforces what Gartner, IDC, and Common Sense Advisory have already said about ModernMT being a leader in enterprise MT.

The evaluation described in this post provides a deeper technical foundation to illustrate ModernMT's responsiveness and ability to quickly adapt to enterprise subject matter and content.

Evaluation Methodology Overview

Translated SRL commissioned Achim Ruopp of Polyglot Technology LLC and asked him to find viable evaluation data and establish an easily reproducible process that could be used to periodically update the evaluation and/or enable others to replicate, reproduce, or otherwise modify the evaluation. He chose the data and developed the procedural outline for the evaluation. This is a typical enterprise use case where MT performance on specialized corporate domain material needs to be understood before deployment in a production setting. It is understood that some of the systems can potentially be further customized with specialized training efforts but this analysis provides a perspective when no effort is made on any of the systems under review.

The process followed by Achim Ruopp in his analysis is shown below:

Identify evaluation data and extract the available data for the languages that were of primary interest and that had approximately the same volume of data. The 3D Design, Engineering, and Construction software company Autodesk provides high-quality software UI and documentation translations created via post-editing machine translations.
- US English → German,
- US English → Italian,
- US English → Spanish,
- US English → Brazilian Portuguese, and
- US English → Simplified Chinese
Clean and prepare data into two data sets:
- 1) ~10,000 segments of TM data for each language pair and,
- 2) a Test Set with 1,000 segments that had no overlap with the TM data
The evaluation aimed to measure the accuracy and speed of the out-of-the-box adaptation of ModernMT to the IT domain and contrast this with generic translations from four major online MT services (Amazon Translate, DeepL, Google Translate, and Microsoft Translator). This is representative of many translation projects in enterprise settings. A zero-shot output score for GPT-4 was also added to show how the leading LLM scores against leading NMT solutions. Thus, the “Test Set” was run through all these systems and through three versions of ModernMT (static baseline, adaptive, and adaptive with dynamic access to reference TM). Please note that many “independent evaluations” that compare multiple MT systems focus ONLY on the static version of ModernMT, a configuration that would rarely be used in practice.
The MT output was scored using three widely used MT output quality indicators that are based on a reference Test Set. These include:
- COMET – A measure of semantic similarity that achieves state-of-the-art levels of correlation with human judgment and is the most commonly used metric in current expert evaluations.
- SacreBLEU – A measure of syntactic similarity that compares the token-based similarity of the MT output with the reference segment and averages it over the whole corpus. Despite many shortcomings, it is possibly the most popular metric used in MT evaluation.
- TER – A measure of syntactic similarity that measures the number of edits (insertions, deletions, shifts, and substitutions) required to transform a machine translation into a reference translation. This is a measurement that is popular in the localization industry.
The results and scores produced are presented in detail in this report in a series of charts with some limited commentary. The summary is shown below. The objective was to understand how ModernMT performs relative to the other alternatives and provide a more accurate out-of-the-box picture, thus the focus of this evaluation remains on how systems perform without any training or customization effort. It is representative of the results if the user were to make virtually no effort beyond pointing to a translation memory.
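To make the edit-based idea behind TER concrete, the sketch below computes a simplified, corpus-level edit rate: word-level insertions, deletions, and substitutions divided by the number of reference words. It deliberately omits the shift operation that real TER includes, so it is an approximation written for illustration and not the scoring code used in this evaluation. The function names and sample sentences are hypothetical.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance (insert, delete, substitute; no shifts)."""
    hyp, ref = hypothesis.split(), reference.split()
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def simplified_ter(hypotheses, references):
    """Total word edits divided by total reference words; lower is better."""
    edits = sum(word_edit_distance(h, r) for h, r in zip(hypotheses, references))
    ref_words = sum(len(r.split()) for r in references)
    return edits / ref_words if ref_words else 0.0


# Made-up MT output and reference translations.
hyps = ["click the save button", "open file menu"]
refs = ["click the save button", "open the file menu"]
score = simplified_ter(hyps, refs)
print(score)  # one missing word over eight reference words
```

Because shifts are not modeled, a moved phrase is counted as several separate edits here, so this sketch can report a higher edit rate than true TER would. An established metric implementation should be used for any real comparison.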

Summary Results

This is the first proper evaluation and comparison of ModernMT's out-of-the-box adaptive MT model (with access to a small translation memory, but not trained) against leading generic (or static) public MT systems.
The comparison shows that ModernMT outperforms generic public MT systems using data from an Autodesk public dataset, where translation performance was measured for translation from US English to German, Italian, Spanish, Brazilian Portuguese, and Simplified Chinese using COMET, SacreBLEU, and TER scoring.
ModernMT achieves these results without any overt training effort, simply by dynamically using and referencing relevant translation memory (TM) when available.
A state-of-the-art LLM (GPT-4) failed to outperform the production NMT systems in most of the tests in this evaluation.
The evaluation and comparison tools and research data are in the public domain. Interested observers can replicate the research with their own data.

The effortless improvements in ModernMT show why comparisons to the static version of the system are meaningless

Why is MT evaluation so difficult?

Language is one of the most nuanced, elaborate, and sophisticated mediums used by humans to communicate, share, and gather knowledge. It is filled with unwritten and unspoken context, emotion, and intention that is not easily contained in the data used to train machines on how to understand and translate human language. Thus, machines can only approach language at a literal textual string level and will likely always struggle with finesse, insinuation, and contextual subtleties that require world knowledge and common sense. Machines have neither.

Thus, while it is difficult to do this kind of evaluation with absolute certainty, it is still useful to get a general idea. MT systems will tend to do well on material that is exactly like the material they train on and function almost like translation memory in this case. Both MT system developers and enterprise users need to have some sense of what system might perform best for their purposes.

It is common practice to test MT system performance on material it has not already memorized to get a sense of what system performance will be in real-life situations. Thus quick and dirty quality evaluations provided by BLEU, COMET, and TER can be useful even though they are never as good as expert, objective human assessments. These metrics are used because human assessment is expensive and slow and also difficult to do consistently and objectively over time.

To get an accurate sense of how an MT system might perform on new and unseen data it is worth considering how these factors could undermine any absolute indication of any one system being “better” or “worse” than any other.

Language translation for any single sentence does not have a single correct answer. Many different translations could be useful and adequate and correct for the purpose at hand.
It is usually recommended that a varied but representative set of 1,000 to 2,000 segments/sentences be used in an evaluation. Since MT systems will be compared and scored against this “gold standard”, the Test Set should be professionally translated. This can cost $1,500 to $2,500 per language, so 20 languages can cost up to $50,000 just to create the Test Set. To reduce this cost, MT is often used to produce the Test Set, which builds in a bias toward the MT system (typically Google) used to produce this data.
There is no definitive way to ensure that there is no overlap between the training data and the test data so data leakage can often undermine the accuracy of the results.
It is easier to use generic tests but the most useful performance indicators in production settings will always be with carefully constructed test sentences of actual enterprise content (that are not contained in the training set).

Automated quality evaluation metrics like COMET are indeed useful but the experts in the community now realize that these scores have to be used together with competent human assessments to get an accurate picture of the relative quality of different MT systems. Using automated scores alone is not advised.

What matters most?

This post explores some broader business issues that should also be considered when evaluating MT quality.

While much attention is given to comparative rankings of different MT systems, one should ask how useful this is in understanding how any particular MT system will perform on any enterprise-specific use case. Scores on generic test sets do not accurately predict how a system will perform on enterprise content in a highly automated production setting.

The rate at which an MT system improves for specific enterprise content with the least effort possible is possibly the most important criterion for MT system selection.

Ideally, improvement should be seen on a daily or at least weekly basis.

So instead of asking what COMET score System A has on its EN > FR system, it is important to ask other questions that are more likely to ensure successful outcomes. The answers to the following questions will likely lead to much better MT system selections.

How quickly will this system adapt to my unique customer content?
How much data will I need to provide to see it perform better on my content and use case?
How easy is it to integrate the system with my production environment?
How easy or difficult is it to set up a continuously improving system that continues to improve and learn from ongoing corrective feedback?
How easy or difficult is it to manage and maintain my optimized systems on an ongoing basis?
Can I automate the ongoing MT model improvement process?
Ongoing improvements are driven both by technology enhancements and by expert human feedback. Are both of these available from this vendor?

Please follow this link for a detailed report on this evaluation and more detailed analysis and commentary on understanding MT evaluation from a more practical and business-success-focused perspective.

ModernMT Introduces Adaptive Quality Estimation (MTQE)

2024-12-16T17:10:00.000-08:00

As MT quality improves, MT use expands to publishing millions of words monthly to improve global customer experience. MTQE can quickly identify potential problems to focus MTPE only on the most problematic sections and quickly publish large volumes of global CX-enhancing content safely.

Historically, the path to achieving quality in professional language translation work has been to involve multiple humans in the creation and validation of every translated segment. This multi-human translation production process is known as TEP or Translate > Edit > Proof. The way to guarantee the best translation quality has always been to provide a quality review by a second and sometimes a third person. When this process works well it produces “good quality” translation, but this approach also has serious limitations:

1) it is an ad-hoc process with constantly changing humans that can result in the same mistakes happening again, and,

2) it is a time-consuming, miscommunication-prone, and costly process that is difficult to scale as volumes increase.

The TEP model has been the foundation for much of the professional translation work done over the last 20 years and is still the production model used for much of the translation work managed by localization groups. While this is a historical fact, the landscape for professional business translation has been changing in two primary ways:

1) The volumes of content that need to be translated to be successful in international business settings are continually increasing,

2) The need for and use of machine translation and more automation are increasing, both to cope with the ever-increasing demand and to deliver much faster turnaround on translation projects.

One solution to this problem is to increase the use of machine translation and post-edit the output (MTPE or PEMT). This is an attempt to reproduce part of the entirely human TEP process described above with a machine starting the process. This approach has met with limited success, and many LSPs and localization managers struggle to find an optimal MT process due to the following issues:

Uneven or poor machine translation quality: The automation can only be successful when the MT provides a useful and preferably continuously improving first draft submitted for human approval or refinement. MT quality varies by language, and few LSPs and localization managers know how to engineer and optimize MT systems to perform optimally for their specific needs. Recent surveys by researchers show that LSPs (and localization managers) still struggle to meet quality expectations and to estimate cost and effort when using MT.

Translator resistance: As MTPE is a machine output-driven process, and typically paid at lower unit rates, many translators are loath to do this kind of work without assurances that the MT will be of adequate quality to assure fair overall compensation. Low-quality MT is much more demanding to correct, and thus translators find that their compensation is negatively impacted when they work with it. The converse is also true: many translators have found that high-quality adaptive MT work results in higher-than-expected compensation due to the continuous improvement in the MT output and overall system responsiveness.

Lack of standardization: there is currently no standardization in the post-editing process, which can lead to inconsistencies in the quality of the final translation.

Training and experience: Post-editing MT requires a different skill set than traditional translation, and post-editors need to be trained accordingly. The translator versus post-editing task remains a source of friction in an industry that depends heavily on skillful human input, largely due to improper work specification, and compensation-related concerns.

Cost: Post-editing can be expensive, especially for large volumes of text. This can be a significant obstacle for companies that need to translate large amounts of content since it is often assumed that all the MT output must be reviewed and edited.

MT Quality Evaluation vs MT Quality Estimation

But as we move forward and expand the use of machine translation to make ever-increasing volumes of content multilingual, we see the need for two kinds of quality assessment tools that can be useful to any enterprise that seeks to be multilingual at scale.

1) Quality Evaluation scores provide a quality assessment of multiple versions of an MT system that may be used by the MT system developers to better understand the impact of changing development strategies. Commonly used evaluation metrics include BLEU, COMET, TER, and ChrF, which all use a human reference test set (the gold standard) to calculate a quality score for each MT system’s performance. These scores are well understood by developers.

These scores are useful to developers to find optimal strategies in the system development process but unfortunately, these scores are also used by “independent” researchers who seek to sell aggregation software to less informed buyers and localization managers who usually have limited understanding of the scores, the test sets, and the opaque process used to generate the scores. Thus, buyers will often make sub-optimal and naïve choices in MT system selection.

2) Quality Estimation scores, on the other hand, are quality assessments made by the machine without using reference translations or actively requiring humans in the loop. It is an assessment of quality made by a machine itself on how good or bad a machine-translated output segment is. MTQE can serve as a valuable tool for risk management in high-volume translation scenarios where human intervention is limited or impractical due to the volume of translations or speed of delivery. MTQE enables efficiency and minimizes potential risks associated with using raw MT because it directs attention to the most likely problematic translations, and reduces the need to look at all the automated translations.

Interest in MTQE has gained momentum as the use of MT has increased, as it allows rapid error detection in large volumes of MT output, thus enabling rapid and focused error correction strategies to be implemented.

Another way to understand MTQE is to more closely examine the difference in training data used in developing an MT engine versus the data used in building a QE model. An MT system is trained on large volumes of source and target sentence pairs or segments or what is generally called translation memory.

An MTQE system is trained on the original MT output and corrected sentence pairs, which are also compared to the original source (ground truth) to identify error patterns. The MTQE validation process seeks to confirm that there is a high level of agreement between a machine's quality prediction of machine output and human quality assessment of that same output.

Quality estimation is a method for predicting the quality without having to compare it to a human reference set. Quality estimation uses machine learning methods to assign quality scores to machine-translated segments and since it works through machine learning it can be used in dynamic, live situations. Quality estimation can predict quality at various levels of text, including at the level of the word, phrase, sentence, or even document but is used most commonly at a segment level.

What is T-QE?

The current or traditional process used to improve adaptive machine translation quality uses one of two methods:

1) random segments are selected and reviewed by professional translators or,

2) every segment has to be reviewed by a translator to ensure the required quality.

However, as MT content volumes rapidly increase in the enterprise, it becomes more important to make this process more efficient, as these human review methods do not scale easily. It is useful to the production process to rapidly identify those segments that most need human attention, and focus critical corrective feedback primarily on these problem segments to enable the MT system to continually improve and ensure overall improved quality on a large content volume.

The MT Quality Estimator (T-QE) streamlines the system improvement process by providing a quality score for each segment, thus identifying those segments that most need human review, rather than depending only on random segment selection, or requiring that each segment be reviewed.

The basic idea is to enable the improvement process to be more efficient by immediately focusing 80% of the human corrective effort on the 20% lowest-scoring segments. Essentially, the 80:20 rule is a principle that helps individuals and companies prioritize their efforts to achieve maximum impact with the least amount of work. This approach allows overall MT quality, especially in very large-scale or real-time deployments, to improve rapidly.
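As a rough illustration of this prioritization, the sketch below sorts segments by score and keeps the lowest-scoring 20% for human review. The segment IDs, scores, and the `lowest_scoring` helper are all hypothetical. This is not the T-QE API, only the selection logic the paragraph describes.

```python
def lowest_scoring(segments, percent=20):
    """Return the `percent` of segments with the lowest QE scores, worst first."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point surprises.
    count = (len(segments) * percent + 99) // 100
    return sorted(segments, key=lambda seg: seg["score"])[:count]


# Hypothetical QE scores for ten translated segments.
segments = [
    {"id": 1, "score": 0.92}, {"id": 2, "score": 0.81},
    {"id": 3, "score": 0.34}, {"id": 4, "score": 0.77},
    {"id": 5, "score": 0.95}, {"id": 6, "score": 0.66},
    {"id": 7, "score": 0.12}, {"id": 8, "score": 0.88},
    {"id": 9, "score": 0.59}, {"id": 10, "score": 0.73},
]

review_queue = lowest_scoring(segments)
print([seg["id"] for seg in review_queue])
```

Reviewers would then work through `review_queue` first, rather than sampling at random or reading every segment.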

The MT Quality Estimator assists in solving this challenge by providing an MT quality score for each translated segment, directly within Matecat or via an API.

The MT Quality Estimator at Translated was validated by taking many samples (billions of segments) of different types of content of varying source quality and comparing the correlation between the T-QE scores and human quality assessments.

The initial tests conducted by the ModernMT team suggest that the T-QE scores are more accurate predictors on high-quality segments but it was noted that lower-quality segments contained more UGC, had longer sentences, and were in general noisier.

The Key Benefits of MT Quality Estimation

Human review at a global content scale is unthinkable, costly, and probably a physical impossibility because of the ever-increasing volumes. As the use of MT expands across the enterprise to drive international business momentum and as more automated technology is used, MTQE offers enterprises a way to identify both the content that needs the least attention and the content that needs the most, before it is released into the wild.

MTQE is an effective means to manage risk when an enterprise wishes to go multilingual at scale. Quality estimation can predict the quality of a given machine translation, allowing for corrections to be made before the final translation is published. MTQE identifies high-quality MT output that does not require human post-editing and thus makes it easier to focus attention on the lower-quality content, allowing for faster turnaround times and increased efficiency.

This informed identification of problems that need to be submitted for human attention is essential to allow for a more efficient allocation of resources and improved productivity. This process enables much more content to be released to global customers without risking brand reputation, while ensuring that desired quality levels are achieved.

When MTQE is paired and combined with a highly responsive MT system, like ModernMT, it can accelerate the rate at which large volumes of customer-relevant content can be released and published for a growing global customer base.

MTQE provides great value in identifying the content that needs more attention and also identifying the content that can be used in its raw MT form, thus speeding up the rate at which new content can be shared with a global customer base.

“We believe that localization value comes from offering the right balance between quality and velocity,” says Conchita Laguardia, Senior Technical Program Manager at Citrix, and “the main benefit QE gives is the ability to release content faster and more often.”

Other ways that MTQE ratings can be used include:

Informing an end user or a localization manager about the overall estimated quality of translated content at a corpus level,
Identifying different kinds of matches in translation memory, e.g., an In-Context Exact (ICE) match is a type of translation match that guarantees a high level of appropriateness by the match having been previously translated in the same context. It is an exact match that occurs in exactly the same context, that is, the same location in a paragraph, which is better than a 100% match and better than fuzzy matches of 80% or less. These different types of TM matches can be processed in differently optimized localization workflows to maximize efficiency and productivity and are useful even in traditional localization work.
Deciding if a translation is ready for publishing or if it requires human post-editing,
Highlighting problematic content that needs to be revised and changed.
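Two of the uses listed above, the corpus-level quality summary and the publish-or-post-edit decision, can be sketched in a few lines. The threshold value, the sample sentences, and the function names below are assumptions made for illustration. A real deployment would tune the cut-off per content type and language.

```python
from statistics import mean

PUBLISH_THRESHOLD = 0.80  # hypothetical cut-off, not a recommended value


def route(scored_segments, threshold=PUBLISH_THRESHOLD):
    """Split (text, score) pairs into publish-as-is and needs-post-editing."""
    publish = [seg for seg in scored_segments if seg[1] >= threshold]
    post_edit = [seg for seg in scored_segments if seg[1] < threshold]
    return publish, post_edit


def corpus_quality(scored_segments):
    """Average segment score, as a simple corpus-level quality summary."""
    return mean(score for _, score in scored_segments)


# Hypothetical MT output with QE scores attached.
scored = [
    ("Welcome back!", 0.93),
    ("Reset your password", 0.85),
    ("Terms apply, see store", 0.41),
]

publish, post_edit = route(scored)
print(len(publish), "segments ready,", len(post_edit), "sent to post-editing")
print(f"corpus-level estimate: {corpus_quality(scored):.2f}")
```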

Routing content with lower MTQE scores into a workflow that links to a responsive, continuously learning, adaptive MT system like ModernMT makes for a powerful translation engine that can make large volumes of content multilingual without compromising overall translation quality.

Effective MTQE systems allow the enterprise to produce higher-quality translations quickly and at low cost, and to safely increase the use of “raw MT”.

The MT Quality Estimator at Translated has been trained on a dataset comprising over 5 billion sentences from parallel corpora (source, MT output, and corrected output) and professional translations in various fields and languages. The AI identifies and learns the error correction patterns by training on these billions of sentences, and provides a reliable prediction of which segments are most likely to need no correction, thus efficiently directing translators to those low-scoring segments that are most likely to need correction. MTQE can be combined with ModernMT, to automatically provide an overall MT quality score for a custom adaptive model, as well as a quality score for MT suggestions within Matecat.

When combined with a highly responsive MT system like ModernMT, it is also possible to improve the overall output quality of a custom MT model by focusing human review only on those sentences that fall below a certain quality score.

Salvo Giammarresi, Head of Localization at Airbnb, a company that has been beta-testing the service, says:

“Thanks to T-QE, Airbnb can systematically supervise the quality of content generated by users, which is processed through our custom MT models. This allows us to actively solicit professional translator reviews for critical content within crucial areas. This is vital to ensure that we are providing our clients with superior quality translations where it truly matters”.

Ongoing Evolution: Adaptive Quality Estimation

The ability to quickly identify errors and focus on reducing the size of the overall data set that needs to receive corrective feedback is an important goal of the MTQE technology. The focus is on identifying the most problematic segments and correcting them quickly.

Any innovation that reduces the amount of data that needs to be reviewed to improve a larger corpus is valuable.

Thus, while the original MTQE error identification process uses the most common error patterns learned from the 5 billion-sentence generic dataset, the ModernMT team is also exploring the benefits of applying the adaptive approach to MTQE segment prediction.

The impact of this innovation is significant. The following hypothetical example illustrates the potential impact and reflects the experience of early testing. (This will, of course, vary depending on the dataset and data volume.)

For example, suppose an initial review of the 40% of sentences with the lowest MTQE scores, using the generic MTQE model, identifies 60% of the major problems in a corpus. Using the adaptive MTQE model with customer data, a review of only the 20% of sentences with the lowest scores can identify 90% of the major problems in the same corpus.
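The leverage in the hypothetical example above can be made explicit with a little arithmetic: divide the share of problems found by the share of the corpus reviewed. The function name is made up for this sketch, and the inputs are simply the illustrative percentages from the example, not measured results.

```python
def review_yield(percent_reviewed, percent_problems_found):
    """Problems found (in % of all problems) per 1% of the corpus reviewed."""
    return percent_problems_found / percent_reviewed


generic_yield = review_yield(40, 60)    # generic model: review 40%, find 60%
adaptive_yield = review_yield(20, 90)   # adaptive model: review 20%, find 90%

print(generic_yield)                    # 1.5
print(adaptive_yield)                   # 4.5
print(adaptive_yield / generic_yield)   # 3.0
```

In this illustration, each unit of review effort surfaces three times as many major problems with the adaptive model as with the generic one.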

This ability to improve the overall quality of the published corpus by looking at less data dramatically increases the efficiency of the MTQE-based improvement process.

This is technological leverage that benefits large-scale translation production.

T-QE is primarily designed and intended for high-volume enterprise users, but it is also available to translators in Matecat and to enterprises via API.

Please contact info@modernmt.com for more information.

The Importance of User-Generated Content (UGC) and Listening to the Customer

2024-12-16T16:47:00.000-08:00

As the importance of establishing an ever-expanding digital corporate presence to build, enhance, and improve the customer experience for both B2C and B2B customers has gained momentum, companies are realizing the growing importance of what is known as User Generated Content (UGC).

Consumers trust authentic, unpaid recommendations from real customers more than any other type of content.

UGC consists of content such as text, videos, images, and reviews that are generated by real customers, influencers, and independent individuals rather than by the brands themselves. It is important to note that any modifications made to this content should only aim to enhance clarity, conciseness, or formality without altering the original message or quotes. This content focuses on customer experiences and includes reviews, testimonials, case studies, guest posts, comments in online communities and forums, collaborative webinars, podcasts, hosted events, social media posts, and PR campaigns, as well as partner, distributor, and vendor promotions. It can be utilized in numerous ways to educate both new and current customers about the potential brand experience.

UGC is clear evidence of direct customer feedback, often unsolicited. It is the voice of the customer in its purest form. The value and impact of UGC are even greater in eCommerce settings where this content is widely understood to be a primary driver for conversions and purchase motivation.

In the B2B context, UGC is more than just reviews and case studies, and should be considered to be "any content others create related to your business".

UGC is important in modern digital marketing for many reasons, as summarized below:

Authenticity: UGC is a more authentic and experiential form of content than corporate content because it is created by customers, free from artificial embellishments or supervision by brands. Consumers tend to trust UGC more than traditional advertising, and it serves as a contemporary variation of word-of-mouth marketing, a force that has always played a significant role in influencing consumer purchasing decisions.
Social Proof: UGC offers social proof that impacts the buyer's journey. It builds consumer confidence and is an extremely efficient strategy for a brand to influence its audience and convert them into customers. In simpler terms, social proof is the equivalent of a reference in a B2B setting or someone else's stamp of approval. UGC also facilitates community-building, which can result in greater loyalty and advocacy.
Unlimited Authentic and Unfiltered Content: UGC offers brands unrestricted, genuine, and unedited content to improve brand awareness and strengthen brand reputation. Brands that implement UGC show their willingness to engage in a two-way discussion, fostering more trusted and engaged relationships with consumers.
Cost-Effective: Generating marketing content can be a time-consuming and expensive process for an enterprise, which is why UGC is quickly becoming a critical component of digital marketing campaigns.
Increased Engagement and Conversions: User engagement increases due to user-generated content, which is directly correlated with conversions. User-generated content validates and legitimizes your marketing message, leading to an increased likelihood of user conversion and higher sales.

While some marketers still believe that branded content is more trustworthy or preferable to user-generated content, research suggests otherwise. Customers consider authentic user-generated content (UGC) the most trustworthy content in both B2C and B2B contexts.

UGC has many benefits for businesses. Authentic and uncensored content can establish trust and credibility, as customers are more likely to believe and engage with content from peers and independent observers than from the brand itself.

Today, most customers are cautious of claims of superiority made by brands and actively seek information from like-minded customers and independent observers to better understand the product or service during the buyer and customer journey.

Additionally, it is a cost-effective way for a business to create trusted content that can favorably influence engagement and build stronger relationships with customers at various stages in the buyer and customer journeys.

Furthermore, UGC provides valuable insights into customers' experiences and perspectives and enables the enterprise to engage with customers more deeply and effectively. Statistics show that consumers find UGC 9.8x more impactful than influencer content, and 79% of people say UGC highly impacts their purchasing decisions. Some of the most recent research also confirms that consumers rank authentic UGC as the most trustworthy content in their buyer journey.

Here are some recent statistics from reputable sources on the value and impact of UGC:

64% of consumers agree that when a brand they like and use re-shares content by customers, they are more likely to share content about the brand or its products.
76% of consumers have purchased a product because of someone else’s recommendation before.
72% of consumers believe that reviews and testimonials submitted by customers are more credible than the brand talking about their products.
A study by Bazaarvoice showed that websites with UGC can see an increase of 29% in web conversions, a 20% increase in return visitors, and a 90% increase in time spent on-site.
Research by BrightLocal indicated that 79.69% of consumers look at ratings and reviews before making a purchase.
6 in 10 marketers report that their audience engages more with UGC in marketing and communications channels than branded content.
75.78% of consumers have used social media to search for or discover products, brands, and experiences.
Three-quarters or more of travelers were active on at least one social media platform in 2019.
Cost-per-click has been seen to decrease by 50% with the addition of user-generated content in social media ads.
The majority of millennials, 66%, book their travel trips using their smartphone. A higher majority, 74%, said that they use their smartphone for research related to their travels. Again the most trusted content tends to be UGC and peer commentary on travel experience.

These statistics show that User Generated Content (UGC) is a valuable tool for marketers to establish trust, engagement, and loyalty with their audiences. Engaging with UGC helps marketers listen to their customers, understand their needs, and collaborate with them as co-marketers to create more compelling content. This engagement strategy enables marketers to attract new customers, foster brand loyalty, and increase customer satisfaction.

However, research indicates that many businesses still struggle to comprehend, utilize, and harness the potential of fast-moving, high-impact UGC. Furthermore, most marketing organizations remain focused on developing and disseminating brand messages, rather than actively monitoring and engaging with the ongoing stream of customer feedback across social media and the internet.

The Translation Challenge & Perspective

As can be expected, the volume of user-generated data is constantly increasing in the modern era, and the challenge for the modern enterprise is to track it in all its most relevant variants and to set up translation production processes for the most important and relevant content.

According to World Economic Forum estimations, by 2025, the amount of data created by humans each day will be about 463 exabytes (one exabyte is equal to one billion gigabytes). As of 2021, we produce over 500 million tweets, ~300 billion emails, and 4 million gigabytes of Facebook data every single day.

While this data has primarily focused on G7 economies in the past, it is expected to shift significantly as economic growth continues to surge in the Global South and South Asia over the next two decades. As a result, global business leaders must master the skills to listen, share, communicate, translate, and comprehend various content streams in an expanding array of languages. The languages that hold the utmost relevance at present may not retain the same level of significance in the upcoming decades.

This will require leading global businesses to be capable of operating multilingually along all of the following content dimensions:

Social Media Content: As social media grows into a better search engine, it’s up to marketers to create searchable content. Many buyers request user-generated content along their buying journey and this should be easily accessible as they peruse and investigate your site. Here are some examples of B2B use of social media as a digital marketing channel.

Multilingual Email Content: Personalized email content that enables quick and effortless retrieval of User Generated Content (UGC) and reviews, and prompts customers to share their feedback for future content development.

Digital Advertising: There is a clear trend towards more video/audio content, along with a strong preference for access to genuine user-generated reviews, forums, and discussions.

Web Content: Customers crave reviews from others with similar needs. The inclusion of visual reviews on your website and product pages, in addition to user-generated content, can create the feedback loop necessary to satisfy your audience's desires.

Brand Content: Branded content mixed with relevant and specific user-generated content addressing evaluation issues raised by many customers is crucial. However, numerous consumers only consult it after they have already satisfied themselves with other customer opinion data. While consumers often consult other customer opinions before turning to branded content, buyers are 4-6 times more likely to purchase from purpose-driven companies that they advocate for through UGC and word-of-mouth referrals. Moreover, the addition of UGC in social media ads has been shown to decrease cost-per-click by 50%. 6 out of 10 marketers report that their audience more frequently engages with user-generated content (UGC) in marketing and communications channels than with branded content.

The truth is that today, the #1 marketing channel used by most companies is social media and the brand's website is the second most used marketing channel, especially in B2C settings.

Measuring the success of a UGC campaign involves tracking key performance indicators (KPIs) that align with overall business goals. These can vary by language and can thus help to identify the most and least receptive markets. Here are some KPIs and metrics to consider when evaluating the success of a UGC campaign:

Engagement Metrics: Monitor likes, comments, shares, and clicks to understand the impact of UGC on audience engagement.
Reach and Impressions: Measure the number of people who see your UGC and the total number of times it's displayed.
UGC Volume: Track the total number of user-generated posts, reviews, or other content forms associated with your brand.
Conversion Rates: Analyze how UGC influences customer behavior, such as driving traffic to your website, increasing sales, or prompting sign-ups for newsletters.
Content Performance Metrics: Track metrics tied to specific goals, pieces of content, or distribution channels, such as impressions, reach, engagement, clicks, conversions, sales, revenue, or customer loyalty.
ROI Calculation: Consider factors like content creation costs, revenue spent on paid social ads, the value of your visual content library, cost per click (CPC), and overall conversions when calculating the ROI of your UGC campaign.
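To make the per-language comparison concrete, here is a minimal sketch that computes a conversion rate and a simple ROI for each market and ranks the markets by conversion rate. All figures, language codes, and field names are invented for illustration, and ROI is taken in its simplest form, (revenue - cost) / cost.

```python
def conversion_rate(conversions, visits):
    """Share of visits that ended in a conversion."""
    return conversions / visits if visits else 0.0


def roi(revenue, cost):
    """Simple return on investment: (revenue - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (revenue - cost) / cost


# Hypothetical UGC campaign results per language market.
campaigns = {
    "de": {"visits": 2000, "conversions": 90, "revenue": 5400, "cost": 1800},
    "ja": {"visits": 1500, "conversions": 30, "revenue": 1500, "cost": 1500},
    "pt-BR": {"visits": 1000, "conversions": 60, "revenue": 3000, "cost": 1000},
}

# Most receptive market first.
ranked = sorted(
    campaigns,
    key=lambda lang: conversion_rate(
        campaigns[lang]["conversions"], campaigns[lang]["visits"]
    ),
    reverse=True,
)

for lang in ranked:
    c = campaigns[lang]
    rate = conversion_rate(c["conversions"], c["visits"])
    print(f"{lang}: conversion {rate:.1%}, ROI {roi(c['revenue'], c['cost']):.1f}")
```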

To be able to participate effectively in the global market, an enterprise will need not only the most streamlined and efficient translation production capabilities but also infrastructure and processes that continually improve and adapt to changing customer requirements.

This is precisely the solution that Translated has developed to enable any global enterprise to meet this content deluge challenge successfully. The solution and its technology have been developed in close collaboration with clients focused on serving customers who have expressed a preference for multilingual content access at scale, particularly for more dynamic, real-time UGC, which informs evaluation and purchase decisions.

Unveiling Hyper Adaptive ModernMT

Translated recently announced a new model of ModernMT, its adaptive machine translation (MT) system. The new model, called Hyper Adaptive, enables companies to translate billions of words at ultra-fast speeds without compromising quality. It is domain-specific and designed for use cases such as translating user-generated content, datasets for multilingual large language models, and web content for data mining activities.

In recent years, companies have approached Translated with requests to leverage the accuracy of ModernMT's adaptive MT system to quickly translate specialized, unique content and high volumes of ongoing content. While a generic adaptive MT model can handle the request to some extent, it is not designed to translate millions of words per minute in a specific domain.

Hyper Adaptive solves this issue by using sophisticated compression techniques and training the MT model for specific use cases based on the customer's previous translations and translation memories (TMs) to ensure high-quality performance even at a scale of many billions of words a month.

The resulting highly specialized MT model is much smaller and more efficient than a generic adaptive model and can process content at ultra-fast speeds, in as little as 50ms for a typical sentence. An example to clarify the performance capability at Translated's dedicated data centers: it can translate the entire English Wikipedia (4.4 billion words) into another language in less than a day (3 million words per minute). By training directly using customer data, the Hyper Adaptive model achieves translation accuracy equal to or better than state-of-the-art custom adaptive MT models.
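A quick back-of-the-envelope check shows how these figures relate. Translating 4.4 billion words within 24 hours implies a sustained rate of a little over 3 million words per minute, in line with the rounded figure quoted. The second half of the sketch is purely an assumption for illustration: it supposes an average of 20 words per sentence and asks how many sentences would have to be in flight at once at 50 ms each. The source does not describe how the throughput is actually achieved.

```python
import math

WIKIPEDIA_WORDS = 4_400_000_000   # figure quoted in the text
MINUTES_PER_DAY = 24 * 60

# Minimum sustained rate needed to finish within one day.
min_words_per_minute = WIKIPEDIA_WORDS / MINUTES_PER_DAY
print(f"{min_words_per_minute:,.0f} words/minute")

SENTENCE_LATENCY_MS = 50          # "as little as 50ms" from the text
WORDS_PER_SENTENCE = 20           # assumption, not from the source

# Words per minute handled by one strictly sequential stream of sentences.
words_per_minute_per_stream = (60_000 / SENTENCE_LATENCY_MS) * WORDS_PER_SENTENCE

# Parallel streams needed to reach the minimum rate under these assumptions.
streams_needed = math.ceil(min_words_per_minute / words_per_minute_per_stream)
print(streams_needed, "parallel streams")
```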

Often, when very high throughput is required, MT systems will need to make compromises on output quality. Typically there is a trade-off between quality and throughput. In contrast, this solution helps companies maintain high quality even when translating massive volumes of content at ultra-high speeds.

In some specific use cases, such as dynamically changing user-generated content, combining the dynamically learning adaptive MT model with ongoing professional translator corrective feedback can further improve the quality of the MT output over time.

Even though the model is optimized for throughput speed, it is still adaptive, and thus it continues to improve after initial training through ongoing corrective feedback and the addition of new TMs delivered to match the company's style.

As the demands on agile global enterprises scale to translating billions of words a month, solutions like Hyper Adaptive ModernMT allow continuous daily improvement while still translating billions of words of relevant UGC into over 200 languages every day.

We designed the Hyper Adaptive model to enable the translation of content that has never been translated before. Its language coverage allows companies to reach over 99% of the world's population in their own language. Hyper Adaptive is one more step towards global understanding.

Marco Trombetti – Translated CEO

Integration and Costs

Like all other ModernMT models, the Hyper Adaptive model can be integrated into the translation workflow via API. Costs vary depending on the use case, the amount of data to be translated, and the amount and quality of existing translations and TMs. Existing Translated customers can contact their account manager to get a new service quote.

Thanks to the Hyper Adaptive model, user-generated content on Airbnb has reached an unprecedented level of quality, greatly improving the experience for our user base. The real-time, high-quality translation of UGC has helped Airbnb foster a stronger sense of community among our hosts and guests, which has had a tremendous impact on our business.

Salvo Giammarresi – Head of Localization at Airbnb

An Overview of ModernMT V7

2023-12-08T13:33:00.000-08:00

Serious MT technology development requires ongoing efforts and research to continually improve the performance of systems and to address important emerging requirements as the use of MT expands. Researchers have been working on MT for over 70 years and success requires a sustained and continuing effort.

These efforts approach the goal of producing MT output that is as close as possible to human quality in multiple ways, and these improvement strategies can be summarized as follows:

Acquire better and higher volumes of relevant training data. Any AI initiative is highly dependent on the quality and volume of the training data that is used to teach the machine to properly perform the task.
Evaluate new algorithms that may be more effective in extracting improved performance from available training data. We have seen the data-driven MT technology evolve from Statistical MT (SMT) to various forms of Neural MT (NMT) using different forms of deep learning. The Transformer algorithm which also powers LLMs like GPT-4 is the state-of-the-art in NMT today.
Use more powerful computing resources to dig deeper into the data to extract more learning. As the demand for translation grows with the massive increases in content and ever-expanding volumes of user-created content (UGC) it becomes increasingly important for MT to handle massive scale. Today there are global enterprises that are translating billions of words a month into a growing portfolio of languages and thus scalability and scale are now key requirements for enterprise MT solutions. Some researchers use more computing during the training phase of the MT model development process as there can be quality advantages gained at inference from doing this extra-intensive training.
Build more responsive and integrated human-machine collaboration processes to ensure that expert human feedback is rapidly incorporated into the core data used to tune and improve these MT engines. While the benefits gained from more and better data, improved algorithms, and more computing resources are useful, the integration of expert human feedback into the MT model's continuous learning is a distinctive advantage that allows an MT model to significantly outperform models where only data, algorithms, and compute are used.
Add special features that address the unique needs of large groups of users, or use cases that are being deployed. As the use of MT continues to build momentum with the enterprise many specialized requirements also emerge e.g. enforcement of specific terminology for brand integrity, profanity filters to avoid egregious MT errors, and improvement of document-specific content awareness.

All these different approaches have the goal of producing improved MT output quality and it will require progress along all of these different fronts to achieve the best results.

The ModernMT development team pursues improvements along all these fronts on an ongoing basis, and ModernMT V7 is the result of several measured improvements on many of these dimensions to provide improved performance.

As machine translation (MT) continues to evolve and expand beyond the traditional use case areas such as e-commerce, global collaboration, and customer care, those interested in the expanding future of localization are now also looking to use generative artificial intelligence (AI) and, in particular, large language models (LLMs) such as OpenAI’s GPT.

Unlike typical Neural MT, LLMs prioritize fluency over accuracy. But while LLMs show promising results in improving the fluency of translations, they can also produce confabulations (hallucinations), i.e. output that is inaccurate or unrelated to the input data and thus require careful monitoring and oversight to ensure accuracy.

With the latest release of ModernMT (V7), Translated has introduced a novel technique to increase the accuracy of neural MT models, called “Trust Attention,” which can also be used to address reliability within generative AI models.

The design and implementation of Trust Attention was inspired by how the human brain prioritizes trusted sources in the learning process, linking the origin of data to its impact on translation quality.

ModernMT V7 preferentially uses the most trusted data (identified by users) and thus the highest quality and most valuable training data has the greatest influence on how a model performs. This is in stark contrast to most MT models which have no discernment of data quality and thus tend to perform using only statistical density as the primary driver of model performance.

The Trust Attention capability prioritizes its learning based on data value and importance, much as humans sift through multiple sources of information to identify the most trustworthy and reliable ones. Data extracted from translations performed and reviewed by professional translators is always preferred over other data, especially unverified translation memory content acquired from web crawling, which is typically used by most MT systems today.
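The internals of Trust Attention are not described here, so the following is only a generic sketch of the underlying idea: weighting training examples by how much their origin is trusted, so that errors on trusted data count for more. The origin labels, the weight values, and the loss figures are all hypothetical, and this should not be read as ModernMT's actual mechanism.

```python
# Hypothetical trust weights per data origin (higher = more trusted).
TRUST_WEIGHTS = {
    "professional_reviewed": 1.0,
    "customer_tm": 0.7,
    "web_crawled": 0.2,
}


def trust_weighted_loss(examples):
    """Weighted average of per-example losses.

    `examples` is a list of (loss, origin) pairs; each loss is scaled by
    the trust weight of its origin before averaging.
    """
    total_weight = sum(TRUST_WEIGHTS[origin] for _, origin in examples)
    if total_weight == 0:
        raise ValueError("no weighted examples to average")
    weighted_sum = sum(loss * TRUST_WEIGHTS[origin] for loss, origin in examples)
    return weighted_sum / total_weight


# A small batch: one trusted example with low loss, one crawled with high loss.
batch = [(0.2, "professional_reviewed"), (0.9, "web_crawled")]

plain_average = sum(loss for loss, _ in batch) / len(batch)
print(f"unweighted: {plain_average:.3f}")
print(f"trust-weighted: {trust_weighted_loss(batch):.3f}")
```

With these weights, the noisy crawled example pulls the training signal far less than it would in a plain average, which is the intuition the paragraph above describes.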

The development team at ModernMT considers Trust Attention to be as significant an innovation as Dynamic Adaptive MT engines. It is the kind of feature that can dramatically improve MT system performance for different use cases when properly used.

According to an evaluation by professional translators, done to validate the beneficial impact, Trust Attention alone improves MT quality by up to 42%, and by an average of 16.5% in cases across the top 50 languages. Interestingly, even many high-resource languages, such as Italian and Spanish, showed significant improvements (in the 30% range) in human evaluations.

ModernMT V7 New Features: Up to 60% Better MT Quality

ModernMT V7 is the evolution of Translated’s renowned adaptive MT system, recognized as a leader in the Machine Translation Software Vendor Assessment for enterprises by IDC MarketScape 2022, and as “the most advanced implementation of responsive MT for enterprise use” in CSA Research’s 2023 Vendor Briefing.

In addition to Trust Attention, ModernMT V7 includes several other new features that further enhance the reliability and dependability of MT output. Here are the most impactful:

Advanced Terminology Control: Along with its ability to learn the client’s terminology from past translations, ModernMT now provides companies with self-managed glossary control to ensure brand and context-specific terminology consistency. This ability to enforce terminology has not been needed in the past because the dynamic adaptive MT technology acquires terminology very effectively even without this feature.
DataClean AI: V7 relies on a new sanitization algorithm that identifies and removes poor-quality data to refine the training material and reduce the likelihood of hallucinations. The close examination of errors over many years has provided clues on the root causes of strange output from MT engines. This learning and related benefits also transfer to LLM-based MT engines should they become more viable in the future.
Expanded Context: ModernMT can now leverage up to 100,000 words of document content (four times more than GPT-4) to preserve style and terminology preferences, providing unparalleled document-specific accuracy in MT suggestions and controls to solve persistent problems such as gender bias and inconsistent terminology.
Profanity Filter: V7 masks words in translation suggestions that could be regarded as inappropriate in the target language, minimizing the possibility of cultural offenses.

The combined effect of all the improvements and innovations described above has a significant impact on the overall performance and capabilities of ModernMT.

The MT quality is now considered to be 45% to 60% better than the previous version according to systematic human evaluations.

These improvements have greatly reduced the Time to Edit (TTE) for MT suggestions. At the end of July, the aggregate TTE measured across tens of thousands of samples showed a 20% reduction, reaching a record low of 1.74 seconds. This milestone indicates an acceleration towards singularity in translation, a trend further supported by preliminary TTE data collected continuously since the 1.74 seconds record was established.
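As a rough cross-check of these figures, the short sketch below back-computes the TTE that a 20% reduction to 1.74 seconds would imply. It assumes the 20% is measured against the aggregate TTE immediately before the improvement; the post does not state the baseline, and the function name is purely illustrative.

```python
def baseline_before_reduction(current: float, reduction: float) -> float:
    """Return the value that, reduced by `reduction` (a fraction), gives `current`."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be a fraction in [0, 1)")
    return current / (1 - reduction)


# Assumption: the 20% reduction is relative to the preceding aggregate TTE.
implied_baseline = baseline_before_reduction(current=1.74, reduction=0.20)
print(f"Implied earlier TTE: {implied_baseline:.3f} seconds")
```

Under that assumption, the earlier aggregate would have been a little under 2.2 seconds.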

The Hallmark of the Symbiosis Between Translators and MT

ModernMT V7 is available in 200 languages and covers all the fastest-growing economies likely to emerge over the next 20 years. Its hallmark is the ability of the MT model to learn from corrections in real time, enabling a powerful collaboration between the expertise of professional translators and the speed and capacity of MT.

Thanks to this unique approach, combined with Translated’s vast community of professional translators and leading AI-enabled localization solutions (Gartner 2022), Airbnb was able to ditch the translate button and simply make multilingual content pervasive and comprehensive across the platform and become one of the top 3 global brands (Global by Design 2023).

Success stories like that of Airbnb and others, along with market research that shows the ever-growing demand for more multilingual content, have led Translated to estimate that once MT reaches what is commonly referred to as “parity with human translation” (singularity in translation), we can expect a 100-fold increase in MT requests alongside a 10-fold growth in demand for professional translations.

We are entering a new era in which significantly larger volumes of content will be translated automatically. In this scenario, professional translators play an increasingly important role, not only in guiding the MT through the adaptive process but also in ensuring that the key messages are appropriately conveyed. By engaging the best translators with the best adaptive MT, companies can now take on projects that simply weren’t feasible before.

Moving Towards LLMs for Translation

Recently, Translated conducted a large-scale study to compare the performance of the most advanced MT systems with LLMs in terms of enterprise readiness. The findings showed real potential for LLMs, particularly in terms of more fluent translation quality, and also revealed areas where improvements are needed. Based on this research, Translated believes elements of both MT systems and LLMs will be critical as we move forward, and plans to provide in-depth insights into using LLMs in translation in the coming weeks and months.

Comments by John Tinsley of Translated SRL on LLM-based Translation in November 2023:

❗ LLMs - the new default for machine translation ❗

I've seen a lot of commentary along these lines over the past few months. I've also seen a lot of well-articulated commentary, not strictly opposing this line, but with added nuance and context (a challenge on the internet!)

I wanted to offer my two cents, from being at the forefront of these developments through actually building the software, and from having many conversations with clients.

In summary, today, LLMs are not fit for purpose as a drop-in replacement for MT for enterprises.

More broadly, any general-purpose GPT application will find it super challenging to outperform a purpose-built enterprise solution that considers an entire workflow in a holistic way (note, the purpose-built solution could be GPT-based itself, but with a much narrower scope).

🧠 As a concrete example, at Translated, we've built a version of ModernMT that uses GPT-4 as a drop-in replacement for our Transformer model (while retaining the framework in ModernMT that allows us to do real-time adaptation). We've also built, and continue to test, a version of ModernMT with other open source LLMs fine-tuned for translation.

While we find that they perform well in terms of quality on some content types and some languages, it's far from unanimous across the board. And that's just quality. Other critical enterprise factors such as speed, cost, and importantly, information security, are just not there yet. Similarly, language coverage for LLMs is a challenge as there are large discrepancies in performance, particularly for content generation.

I appreciate there's a lot of downward pressure today to use AI across workflows, particularly in localization teams for translation and content creation. Let me hop on my soapbox to give you some information that might help with those conversations...

📣 If you're using MT, you're already using very advanced AI! 📣

You probably already know that the T in GPT stands for Transformer. But did you know that the Transformer was invented at Google in 2017...specifically for machine translation!? So what we're seeing today is a repurposing of that technology for a different application (generative AI) other than translation.

There will come a day, possibly soon, when it's better across the board to use LLMs for translation. When that happens, it will become the standard and people will stop talking about it. Just like when Neural MT came on the scene ~6 years ago.

When it happens, Translated will have already deployed it in ModernMT and worked out the best way for you to adapt it to your business. We already have a lot of ideas. We already have a lot of data from the testing I mentioned earlier. And in the meantime, we still have what I believe to be the most complete enterprise translation solution available.

Prioritization of Trustworthy Data in NMT Model Development

2023-12-07T17:38:00.000-08:00

ModernMT: A History of Innovation and Evolution

Neural machine translation (NMT) has made impressive evolutionary progress over the last five years, showing continually improving accuracy. This progress is especially clear with dynamically adaptive NMT models like ModernMT, where small amounts of ongoing corrective expert feedback result in continuously improving MT output quality.

The historical track record with ModernMT has been so impressive that it did not seem unreasonable to point out that ModernMT's performance across billions of samples and many languages was approaching singularity in production-use scenarios. This is a point at which human editors are unable to tell whether the sample is coming from a human or machine since they are so close in quality and style.

NMT technology continues to evolve and improve with recent updates that provide much richer and more granular document-level contextual awareness. Document-level adaptation in machine translation has been a core design intention with ModernMT from the outset. This originally involved referencing similar sentences in translation memories and using these to influence new translation requests.
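The idea of referencing similar sentences in a translation memory can be illustrated with a small sketch. This is not ModernMT's implementation: it simply ranks translation memory entries by string similarity using Python's standard `difflib`, and the function name, threshold, and sample entries are all made up for illustration. In an adaptive system, the retrieved pairs would then be used to bias the model toward the client's past choices.

```python
from difflib import SequenceMatcher


def retrieve_similar(source, memory, top_k=2, min_score=0.5):
    """Return up to top_k (score, tm_source, tm_target) tuples, best match first."""
    scored = []
    for tm_source, tm_target in memory:
        # Character-level similarity ratio in [0, 1]; a stand-in for real fuzzy matching.
        score = SequenceMatcher(None, source.lower(), tm_source.lower()).ratio()
        if score >= min_score:
            scored.append((score, tm_source, tm_target))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical translation memory (English to Italian), for illustration only.
memory = [
    ("Click the Save button to store your changes.",
     "Fai clic sul pulsante Salva per memorizzare le modifiche."),
    ("The weather is nice today.",
     "Oggi il tempo è bello."),
    ("Click the Cancel button to discard your changes.",
     "Fai clic sul pulsante Annulla per annullare le modifiche."),
]

matches = retrieve_similar("Click the Save button to keep your changes.", memory)
for score, tm_source, tm_target in matches:
    print(f"{score:.2f}  {tm_source}  ->  {tm_target}")
```

The closest entry (the "Save button" sentence) ranks first, which is the kind of context an adaptive engine can use when translating the new request.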

Despite the success and pioneering nature of this approach, early implementations faced challenges: translators struggled with issues such as gender bias and inconsistent terminology due to the distance between the segment they were working on and its related context.

By taking into account all edits within an individual document, even those in completely different or distant segments, the MT model is now able to provide document-specific translation suggestions. This development significantly reduces the need for repeated corrections of elements such as pronouns. This has greatly eased the amount of corrective work needed to address gender bias errors and modify incorrect terminology.

The Emergence of LLM-Based Translation Models

In the summer of 2023, we are at an interesting juncture in the development of AI-based language translation technology, where Large Language Models (LLMs) are emerging as another technological approach to machine translation. LLMs are particularly impressive in handling idioms and enhancing the fluency of machine translations.

However, at this point, Generative AI models like GPT-4 still have serious latency issues, high training and inference costs, and, most importantly, trustworthiness issues with their output. These issues will need to be addressed for Gen AI models to be viable in production-use translation settings. There is also the issue of poor performance in low-resource languages and a bias toward better performance when translating into English.

The AI product team at Translated continues to research and investigate the possibilities for continued improvement of pure NMT models, hybrid NMT and Gen AI models, as well as pure Gen AI models. Special consideration is given to ensure that any major improvements made in existing NMT model technology can also be leveraged in the future with potentially production-use capable Gen AI translation models.

AI systems are trained on large datasets found on the internet, data that can be of varied quality and reliability. If the data used for training is biased or of poor quality, it can lead to biased or unreliable AI outputs. One of the biggest obstacles to the widespread use of Gen AI in mission-critical applications has been the high level of fluent but untrustworthy output.

Better data validation and verification can indeed improve the trustworthiness of AI output. Data validation involves ensuring that the data used to train and evaluate AI models is accurate, consistent, and representative of the real-world scenarios the AI system will encounter. This can be done through data cleaning, data preprocessing techniques, and careful selection of training data.

The Importance of Data Quality

With this in mind, ModernMT Version 7 introduces a significant upgrade to its core adaptive machine translation (MT) system. This new version introduces Trust Attention, a novel technique inspired by how human researchers prioritize information from trusted sources. The V7 model preferentially uses identified trustworthy data both in training and inference.

This innovation is the first in a long-term effort at Translated focused on improving data quality, to ensure that data quality and trustworthiness are pervasive attributes of all new translation AI initiatives.

Translated has learned, from a large number of independent evaluations and internal tests over the years, that this focus on data quality enables ModernMT to compare favorably in quality evaluations with many better-funded public generic MT engines produced by Google, Microsoft, and others.

Over the last decade, Translated has developed a robust data governance framework to define data quality standards, processes, and roles. This helps create a culture of data quality and ensures that data management practices are aligned with organizational efficiency goals and technology improvements.

This culture, together with close long-term collaboration with translators, ensures that ongoing data replenishment is of the highest quality and that lower-quality data is systematically identified and removed. Finally, regularly measuring and monitoring data quality metrics helps to identify and address potential issues before they impact AI performance.

Trust Attention is possible because of the long-term investment in developing a data-quality culture that produces the right data to feed innovation in new AI technologies.

While it is common practice in the industry to use automated algorithm-driven methods to drive data validation and verification practices, Translated’s 20 years of experience working with human translators show that human-verified data is the most trustworthy data available to drive the learning of language AI models.

This human-verified data foundation is precisely the most influential driver of preferential learning in the ModernMT Version 7 models. Automated cleaning and verification are valid ways to enhance data quality in machine learning applications, but 10 years of experience show that human-verified data provide a performance edge that is not easily matched by large-scale automated cleaning and verification methods.

Human quality assessments comparing ModernMT V6 output with V7 output show that the use of Trust Attention improves translation quality by as much as 42%. It is interesting to note that many high-resource languages like Spanish, Chinese, and Italian also saw major improvements near the 30% range in human evaluations.

Human evaluations and judgments are corroborated by concurrent BLEU and COMET score measurements which are also used to ensure that conclusions being drawn by introducing new technology are accurate and trustworthy.

The following is a sample of MT output from the ModernMT V7 system compared to the previous V6. Three independent professional reviewers were shown two randomized samples of a translation of the same source segment and asked to judge if one was better, no different, or worse. The chart above shows how often the V7 translation was preferred by a majority of the reviewers by language.
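The review protocol described here (three reviewers per segment, each judging "better", "no different", or "worse") can be tallied with a few lines of code. The sketch below is only an illustration of majority-vote counting; the function names and the sample judgments are invented and do not come from the actual evaluation.

```python
from collections import Counter


def majority_verdict(votes):
    """Return the verdict chosen by a strict majority of reviewers, else None."""
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict if count > len(votes) / 2 else None


def preference_rate(judgments, verdict="better"):
    """Share of segments whose majority verdict equals `verdict`."""
    if not judgments:
        return 0.0
    wins = sum(1 for votes in judgments if majority_verdict(votes) == verdict)
    return wins / len(judgments)


# Invented judgments for four segments: each tuple holds three reviewers'
# verdicts on the V7 translation relative to V6.
sample = [
    ("better", "better", "no different"),
    ("better", "worse", "no different"),   # no majority
    ("worse", "worse", "better"),
    ("better", "better", "better"),
]

rate = preference_rate(sample)
print(f"V7 preferred by a majority in {rate:.0%} of segments")
```

Computing this rate separately for each language pair would give one value per language, which is the kind of per-language comparison the evaluation describes.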

Examples below show sample sentences from English to Brazilian Portuguese and Simplified Chinese.

“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.”

Andrew Ng, Professor of AI at Stanford University and founder of DeepLearning.AI

How is Trust Attention Different?

“Garbage in, garbage out” (GIGO) is a concept in computing and artificial intelligence (AI) that highlights the importance of input data quality. It means that if the input data to a system, such as an AI model or algorithm, is of poor quality, inaccurate, or irrelevant, the system’s output will also be of poor quality, inaccurate, or irrelevant.

This concept is particularly significant for AI models built on machine learning and deep learning, which rely heavily on the data used for training and validation. If the training data is biased, incomplete, or contains errors, the AI model will likely produce unreliable or biased results.

All Data Is Not Equally Important

Traditional MT systems generally are not able to distinguish between trustworthy data and lower-quality training material during the training process, and typically all the data has equal weight. Thus, high-quality data and high-volume noisy data can have essentially the same amount of impact on how a translation model will perform.

Trust Attention allows an engine to prioritize more trustworthy data and have this data influence ongoing model behavior more heavily.

ModernMT now uses a first-of-its-kind weighting system to enable primary learning from high-quality, trusted, and verified data – translations performed and/or reviewed by professional translators – over unverified data that is acquired from the Web.
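The details of the weighting system are not described in this post, but the general idea of letting trusted data count for more during training can be sketched as a weighted loss. In the minimal example below, the provenance labels, the weight values, and the function name are all assumptions made for illustration, not ModernMT's actual method.

```python
import math

# Hypothetical trust weights by data provenance; the real values are not public.
TRUST_WEIGHTS = {"human_verified": 3.0, "web_crawled": 1.0}


def weighted_nll(examples):
    """Trust-weighted average negative log-likelihood.

    Each example is (probability the model assigns to the reference translation,
    provenance label). Higher-trust examples contribute more to the loss.
    """
    total_loss = 0.0
    total_weight = 0.0
    for prob, provenance in examples:
        weight = TRUST_WEIGHTS[provenance]
        total_loss += weight * -math.log(prob)
        total_weight += weight
    return total_loss / total_weight


# The model fits the web-crawled example well but the human-verified one poorly.
batch = [(0.9, "web_crawled"), (0.5, "human_verified")]

uniform = sum(-math.log(p) for p, _ in batch) / len(batch)
trusted = weighted_nll(batch)
print(f"uniform loss: {uniform:.3f}, trust-weighted loss: {trusted:.3f}")
```

With equal weights, the well-fitted noisy example masks the error on the verified one; with trust weighting, the loss rises, so training would push harder toward matching the verified translation.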

As with adaptive MT, Translated looked to established human practices to develop this new technique. In any serious research, humans collect and sift through multiple information sources to identify and assign preferential status to the most trustworthy and reliable data sources.

ModernMT V7 models this human behavior: it identifies the most valuable training data and prioritizes learning from certified and verified data. This certification and verification is not an automated, machine-led process. Rather, it is an expert human validation that raises the trustworthiness of the data.

This focus on prioritizing the use of trusted, verified data is a major step forward in the development of enterprise-focused MT technology.

The efforts made to identify and build repositories of high-quality data will also be useful in the future if there is indeed a shift to Gen AI-based language translation models.

Today, there is considerable discussion regarding the application of large language models in translation. Traditional NMT models seem to perform much better on accuracy, though they can be less fluent than humans. LLMs tend to emphasize and often win on fluency, even though they often produce misleading output due to hallucinations (generative fabrication).

The Trust Attention methodology, when deployed in LLMs, is also expected to enhance the accuracy of generative models, reducing the chances of random fabrication and confabulation errors. This could set the stage for an emerging era of new machine translation methodologies, one that combines the accuracy of dynamic adaptive NMT with the fluency of Gen AI models.

ModernMT Version 7 also introduces a data-cleaning AI that minimizes the likelihood of hallucinations, making it valuable for companies seeking greater accuracy in high-volume automated translation use cases, and is also useful for translators integrating MT into their workflow.

John Tinsley, VP of AI Solutions at Translated, added, "We are confident that these new data validation and verification techniques can also improve accuracy in generative AI systems, paving the way for the next generation of machine translation."

The introduction of this new approach is a major step forward for companies seeking greater accuracy in the translation of large volumes of content or requiring a high degree of customization of the MT engine, as well as for translators integrating MT into their workflow.

The combined impact of these multiple innovations provides global enterprises with a superior platform to rapidly transform generic engines into highly tuned enterprise-specific translation engines.


