<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kseniya Autukhovich</title>
    <description>The latest articles on DEV Community by Kseniya Autukhovich (@kseniya_autukhovich).</description>
    <link>https://dev.to/kseniya_autukhovich</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3245389%2F884ceb43-c1d5-4761-bbe7-86a3f4d4d9ce.png</url>
      <title>DEV Community: Kseniya Autukhovich</title>
      <link>https://dev.to/kseniya_autukhovich</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kseniya_autukhovich"/>
    <language>en</language>
    <item>
      <title>Alconost Announces MQM Tool API for Automating Project Management in Translation Quality Evaluation</title>
      <dc:creator>Kseniya Autukhovich</dc:creator>
      <pubDate>Tue, 17 Feb 2026 05:11:17 +0000</pubDate>
      <link>https://dev.to/kseniya_autukhovich/alconost-announces-public-release-of-the-mqm-tool-api-for-manual-translation-quality-evaluation-2gdk</link>
      <guid>https://dev.to/kseniya_autukhovich/alconost-announces-public-release-of-the-mqm-tool-api-for-manual-translation-quality-evaluation-2gdk</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcom8b5w1u0dtjc5z4qfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcom8b5w1u0dtjc5z4qfw.png" alt=" " width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;New API enables programmatic project setup, linguist coordination, progress tracking, and structured export — while keeping all linguistic evaluation fully human-performed&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Alconost today announced the public release of the MQM Tool API, extending its free web-based MQM Tool with automation capabilities for project orchestration, data import, progress tracking, and export of structured translation quality evaluation results.&lt;/p&gt;

&lt;p&gt;The MQM Tool is designed for manual quality evaluation of translations — including human translations, machine translation (MT) outputs, LLM-generated content, and vendor-delivered localization — using the industry-standard MQM (Multidimensional Quality Metrics) framework.&lt;/p&gt;

&lt;p&gt;While translation assessment inside the MQM Tool remains fully manual and linguist-driven, the new API enables organizations to automate project setup and results management — making it easier to coordinate distributed reviewers and scale structured translation quality evaluation across teams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API documentation: alconost.mt/mqm-tool/api-guide&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;Automating the Operational Side of Human Evaluation&lt;/h2&gt;

&lt;p&gt;The MQM Tool API does not automate linguistic judgment. Instead, it automates the administrative and technical processes surrounding evaluation workflows.&lt;/p&gt;

&lt;p&gt;With the API, organizations can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create MQM projects programmatically&lt;/li&gt;
&lt;li&gt;Upload large batches of source and target segments&lt;/li&gt;
&lt;li&gt;Manage distributed linguist assignments&lt;/li&gt;
&lt;li&gt;Monitor annotation progress in real time&lt;/li&gt;
&lt;li&gt;Export structured evaluation results automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach preserves expert-driven quality assessment while eliminating manual project setup and reporting overhead.&lt;/p&gt;
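
&lt;p&gt;As a rough illustration of what this automation could look like, here is a minimal Python sketch of project creation and segment upload. The base URL, routes, field names, and auth scheme are hypothetical placeholders rather than the published contract; see alconost.mt/mqm-tool/api-guide for the actual API.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import requests

# All endpoint paths and payload fields below are illustrative assumptions,
# not the documented API contract.
BASE_URL = "https://alconost.mt/mqm-tool/api"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Create an MQM project programmatically
project = requests.post(
    f"{BASE_URL}/projects",
    headers=HEADERS,
    json={"name": "Q1 vendor audit", "source_lang": "en", "target_lang": "de"},
).json()

# Upload a batch of source/target segments for linguists to evaluate
segments = [{"source": "Save changes?", "target": "Änderungen speichern?"}]
requests.post(
    f"{BASE_URL}/projects/{project['id']}/segments",
    headers=HEADERS,
    json={"segments": segments},
)
&lt;/code&gt;&lt;/pre&gt;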

&lt;h2&gt;Typical Workflow with the MQM Tool API&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Project Creation&lt;/strong&gt;&lt;br&gt;
Localization engineers or QA managers create projects via API, configure language pairs and evaluation settings, and upload translation segments (e.g., TSV or JSON). This step can be triggered from a TMS, MT evaluation pipeline, sampling system, or CI/CD workflow.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manual Annotation by Linguists&lt;/strong&gt;&lt;br&gt;
Assigned linguists log into the MQM Tool web interface to review translations and annotate errors according to MQM categories and severity levels. All evaluation remains fully human-performed and structured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Programmatic Progress Monitoring&lt;/strong&gt;&lt;br&gt;
Project owners retrieve project status and completion metrics via API, enabling efficient coordination of distributed reviewers across vendors or internal teams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated Export and Integration&lt;/strong&gt;&lt;br&gt;
Once evaluation is complete, structured annotation data and reports can be exported via API and integrated into dashboards, vendor scorecards, business intelligence tools, or machine translation benchmarking systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
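
&lt;p&gt;Steps 3 and 4 lend themselves to simple scripting. The sketch below shows one hedged way to poll progress and pull results; the routes and response fields are assumptions for illustration, and the real ones are defined in the published API guide.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import requests

def wait_and_export(base_url, headers, project_id):
    # Poll annotation progress until every segment has been evaluated
    # (route and field names are hypothetical)
    while True:
        status = requests.get(
            f"{base_url}/projects/{project_id}/status", headers=headers
        ).json()
        if status["completed_segments"] == status["total_segments"]:
            break
        time.sleep(60)  # check once a minute

    # Export structured annotation data, e.g. for a dashboard or scorecard
    return requests.get(
        f"{base_url}/projects/{project_id}/export",
        headers=headers,
        params={"format": "json"},
    ).json()
&lt;/code&gt;&lt;/pre&gt;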

&lt;h2&gt;Designed for Scalable, Human-in-the-Loop Quality Programs&lt;/h2&gt;

&lt;p&gt;The MQM Tool API is particularly suited for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enterprise localization teams running ongoing LQA programs&lt;/li&gt;
&lt;li&gt;Language service providers managing vendor quality audits&lt;/li&gt;
&lt;li&gt;AI and MT research teams benchmarking translation models&lt;/li&gt;
&lt;li&gt;Organizations implementing structured quality governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;"Manual MQM evaluation remains the gold standard for measuring translation quality. With the MQM Tool API, we're enabling teams to scale that gold standard efficiently across distributed reviewers and modern localization workflows."&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;strong&gt;— Alexander Murauski, CEO, Alconost&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About the MQM Tool&lt;/strong&gt;&lt;br&gt;
The MQM Tool is a free web-based platform for structured manual translation quality evaluation based on the MQM framework. It enables systematic error annotation, standardized scoring, and detailed reporting across translation workflows.&lt;/p&gt;

&lt;p&gt;Learn more and access the API guide: alconost.mt/mqm-tool/api-guide&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Alconost&lt;/strong&gt;&lt;br&gt;
Alconost is a language services company specializing in translation, localization, and quality evaluation. With a network of 3,500+ professional linguists covering 100+ language pairs, Alconost serves technology companies, game developers, and enterprises requiring high-quality multilingual content. The company offers on-demand MQM annotation services and custom dataset creation, with methodology aligned to Workshop on Machine Translation (WMT) standards.&lt;/p&gt;

&lt;p&gt;MEDIA CONTACT&lt;br&gt;
Alconost MQM Tool Team&lt;br&gt;
&lt;a href="mailto:mqm-tool@alconost.com"&gt;mqm-tool@alconost.com&lt;/a&gt;&lt;br&gt;
alconost.mt/mqm-tool&lt;/p&gt;

</description>
      <category>news</category>
      <category>localization</category>
      <category>translationquality</category>
      <category>api</category>
    </item>
    <item>
      <title>Alconost Launches Free MQM Annotation Tool for MQM-Based Translation Quality Analysis</title>
      <dc:creator>Kseniya Autukhovich</dc:creator>
      <pubDate>Wed, 21 Jan 2026 14:38:05 +0000</pubDate>
      <link>https://dev.to/kseniya_autukhovich/alconost-launches-free-mqm-annotation-tool-for-mqm-based-quality-analysis-b71</link>
      <guid>https://dev.to/kseniya_autukhovich/alconost-launches-free-mqm-annotation-tool-for-mqm-based-quality-analysis-b71</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd89m6g0uz6a1y3eqkmvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd89m6g0uz6a1y3eqkmvd.png" alt="Free MQM Annotation Tool" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Free web-based tool for evaluating translation quality — no registration required&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;ALEXANDRIA, Va., January 2026&lt;/em&gt; — Alconost, a localization company working with a network of 3,500+ professional linguists across more than 120 language pairs, has launched the Alconost MQM Annotation Tool — a free, web-based tool for annotating translation errors using the MQM framework to support translation quality evaluation.&lt;br&gt;
The tool is available at alconost.mt/mqm-tool and can be used immediately, with no registration required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MQM (Multidimensional Quality Metrics)&lt;/strong&gt; is an industry-standard framework for assessing translation quality through human error annotation. It works by identifying and classifying translation errors by type (such as accuracy, fluency, or terminology) and by severity. MQM is widely used in the localization industry and is the methodology used in evaluations conducted by the Workshop on Machine Translation (WMT).&lt;/p&gt;

&lt;p&gt;With MQM, trained reviewers annotate errors in a translation, creating a consistent, human-labeled reference (“ground truth”) for quality analysis. These annotations make it possible to calculate quality scores, compare results across machine translation systems or vendors, and document quality issues in a structured and reproducible way.&lt;/p&gt;

&lt;p&gt;In practice, however, MQM annotation has often been difficult to adopt. Existing solutions typically require technical setup, expensive enterprise platforms, or manual work in spreadsheets, making consistent MQM use hard to scale in real-world projects.&lt;/p&gt;

&lt;p&gt;The Alconost MQM Annotation Tool addresses these challenges with a simple, browser-based workspace designed specifically for MQM error annotation. It makes professional MQM-based quality analysis more accessible for linguists, localization teams, and researchers.&lt;br&gt;
Users can annotate translations themselves, assign annotation tasks to linguists, or request professional MQM annotation services from Alconost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features include:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Start immediately&lt;/strong&gt; — no account creation or registration required&lt;br&gt;
&lt;strong&gt;WMT-compatible MQM error typology&lt;/strong&gt; — annotate using the full MQM taxonomy aligned with international standards&lt;br&gt;
&lt;strong&gt;Customizable error severity weights&lt;/strong&gt; — control how annotated errors contribute to quality scores&lt;br&gt;
&lt;strong&gt;Automatic score calculation&lt;/strong&gt; — segment-level and overall document scores derived from annotations&lt;br&gt;
&lt;strong&gt;Pass/fail results&lt;/strong&gt; — quality decisions based on configurable thresholds&lt;br&gt;
&lt;strong&gt;TSV and JSON import/export&lt;/strong&gt; — easy integration with existing tools and workflows&lt;br&gt;
&lt;strong&gt;Detailed reports (online and PDF)&lt;/strong&gt; — including error breakdowns by category, severity distribution, segment-level scores, overall weighted score, pass/fail outcome, and annotation time&lt;/p&gt;
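
&lt;p&gt;To make the scoring mechanics concrete, here is a hedged sketch of how an MQM-style score and pass/fail decision can be derived from annotations. The severity weights (minor = 1, major = 5, critical = 10) are common MQM conventions and the per-word normalization is one typical choice; the tool’s configurable defaults may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hedged sketch of MQM-style scoring; weights, normalization, and the
# pass threshold are common conventions, not necessarily the tool's defaults.
SEVERITY_WEIGHTS = {"neutral": 0, "minor": 1, "major": 5, "critical": 10}

def mqm_score(errors, word_count, pass_threshold=95.0):
    """errors: list of (category, severity) annotations for a document."""
    penalty = sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)
    # Normalize the penalty per word and subtract from a perfect 100
    score = 100.0 - (penalty / word_count) * 100.0
    return {"score": round(score, 2), "passed": score &amp;gt;= pass_threshold}

mqm_score([("accuracy", "major"), ("style", "minor")], word_count=250)
# {'score': 97.6, 'passed': True}
&lt;/code&gt;&lt;/pre&gt;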

&lt;p&gt;“Most existing MQM tools are built for research and require a lot of setup, which makes them hard to use in everyday localization work,” &lt;em&gt;said Alex Murauski, CEO of Alconost&lt;/em&gt;. “We wanted something simpler. The result is a practical tool: you open it, paste your segments, annotate errors, and then export the data or generate a report. That report can be used internally or shared with a linguist, vendor manager, or customer to clearly explain quality issues.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who the tool is for&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Localization managers can annotate translations themselves or assign the task to linguists to evaluate vendor quality, compare machine translation systems, and create clear documentation for quality discussions — without purchasing specialized software.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;ML researchers and engineers can create MQM annotation data compatible with WMT methodology, build gold-standard evaluation datasets, and produce reproducible quality benchmarks based on human annotation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Language service providers (LSPs) can standardize how linguists annotate translation errors, demonstrate quality to clients using professional reports, and train teams on MQM methodology.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For teams that need expert support, Alconost also offers on-demand MQM annotation services performed by trained professional linguists. Custom MQM datasets are available as well, particularly for ML teams building evaluation benchmarks or researchers who require gold-standard annotation data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Start annotating now&lt;/strong&gt;: alconost.mt/mqm-tool&lt;br&gt;
On-demand annotation services and custom MQM datasets: &lt;a href="mailto:mqm-tool@alconost.com"&gt;mqm-tool@alconost.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About Alconost&lt;/strong&gt;&lt;br&gt;
Alconost handles every aspect of localization – from translation and cultural adaptation to continuous multilingual content updates and engineering – letting you focus on the top priority of growing your business in international markets.&lt;/p&gt;

&lt;p&gt;In addition to localization services, Alconost provides MQM-based annotation services and custom MQM dataset creation, following methodology aligned with the Workshop on Machine Translation (WMT).&lt;/p&gt;

&lt;p&gt;MEDIA CONTACT&lt;br&gt;
Alconost MQM Tool Team&lt;br&gt;
&lt;a href="mailto:mqm-tool@alconost.com"&gt;mqm-tool@alconost.com&lt;/a&gt;&lt;br&gt;
alconost.mt/mqm-tool&lt;/p&gt;

</description>
      <category>annotationtool</category>
      <category>mqm</category>
    </item>
    <item>
      <title>AI-Powered Translation Quality Assessment: Alconost Demonstrates LLM Application for Translation Evaluation</title>
      <dc:creator>Kseniya Autukhovich</dc:creator>
      <pubDate>Thu, 26 Jun 2025 07:13:03 +0000</pubDate>
      <link>https://dev.to/kseniya_autukhovich/ai-powered-translation-quality-assessment-alconost-demonstrates-llm-application-for-translation-3n0p</link>
      <guid>https://dev.to/kseniya_autukhovich/ai-powered-translation-quality-assessment-alconost-demonstrates-llm-application-for-translation-3n0p</guid>
      <description>&lt;p&gt;&lt;strong&gt;A new experimental tool showcases the practical application of top-tier large language models (LLMs) for automated translation quality scoring and correction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alexandria, June 26, 2025&lt;/strong&gt; – Alconost, a localization services provider, has released Alconost.MT/Evaluate, an experimental, lightweight web tool that demonstrates how large language models can be effectively deployed for automated translation quality assessment and correction.&lt;/p&gt;

&lt;p&gt;The tool addresses a key challenge in applying LLMs to professional translation evaluation: while single-segment assessment is straightforward, batch processing with consistent criteria, custom context injection, and structured output requires sophisticated prompt engineering and workflow orchestration.&lt;/p&gt;

&lt;h2&gt;Technical Implementation and AI Methodology&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Multi-Model Architecture:&lt;/strong&gt; The tool currently integrates OpenAI GPT-4 and Anthropic Claude 3, featuring a modular design that enables the rapid addition of new models. This multi-model approach supports comparative analysis of how different LLMs perform on translation evaluation tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Structured Evaluation Framework:&lt;/strong&gt; The system implements a comprehensive 100-point scoring algorithm based on the GEMBA-MQM framework, where LLMs identify specific error types and assess their severity. The AI models analyze four key dimensions: accuracy, fluency, terminology consistency, and stylistic appropriateness, outputting structured assessments across standardized quality bands (Publish-Ready: 91-100, Acceptable: 70-90, Fair: 50-69, Unusable: 1-49).&lt;/p&gt;
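
&lt;p&gt;The banding logic itself is straightforward to express in code. The sketch below maps a 100-point score to the bands listed above; it simply restates the published thresholds and is not the tool’s actual implementation.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import bisect

# Quality bands stated above: Publish-Ready 91-100, Acceptable 70-90,
# Fair 50-69, Unusable 1-49
FLOORS = [1, 50, 70, 91]
LABELS = ["Unusable", "Fair", "Acceptable", "Publish-Ready"]

def band_for(score):
    """Map a GEMBA-MQM-style 100-point score to its quality band."""
    return LABELS[bisect.bisect_right(FLOORS, score) - 1]

band_for(84)  # 'Acceptable'
&lt;/code&gt;&lt;/pre&gt;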

&lt;p&gt;&lt;strong&gt;Dynamic Context Injection:&lt;/strong&gt; Beyond base evaluation prompts, the tool supports custom guideline injection, enabling users to augment the LLM context with project-specific terminology, style guides, and quality criteria. The system incorporates glossary terms directly into the evaluation context, demonstrating practical application of in-context learning for domain-specific evaluation tasks.&lt;/p&gt;
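
&lt;p&gt;In practice, context injection of this kind amounts to assembling the evaluation prompt from project assets. The sketch below shows one hedged way to do it; the tool’s real prompt templates are not public, and every name here is illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def build_eval_prompt(source, target, glossary, style_notes):
    # Fold project terminology and style guidance into the evaluation
    # context (a simple form of in-context learning)
    terms = "\n".join(f"- {src} = {tgt}" for src, tgt in glossary.items())
    return (
        "Evaluate this translation using MQM error types "
        "(accuracy, fluency, terminology, style) and severity levels.\n"
        f"Project terminology to enforce:\n{terms}\n"
        f"Style guidance: {style_notes}\n"
        f"Source: {source}\nTranslation: {target}\n"
        "Return structured JSON with errors, severities, and a score."
    )
&lt;/code&gt;&lt;/pre&gt;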

&lt;p&gt;&lt;strong&gt;Automated Correction Generation:&lt;/strong&gt; The system extends its capabilities beyond assessment to generate corrected translations accompanied by detailed error explanations. Edit highlighting functionality provides transparency into the AI's decision-making process, making the corrections interpretable and actionable.&lt;/p&gt;
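
&lt;p&gt;Edit highlighting of this kind can be approximated with standard diffing. The sketch below marks the word-level edits that turn the original translation into the corrected one using Python’s difflib; it illustrates the general idea, not the tool’s actual renderer.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import difflib

def highlight_edits(original, corrected):
    """Mark edits with [-deleted-] and {+inserted+} word-level markers."""
    a, b = original.split(), corrected.split()
    out = []
    for op, a1, a2, b1, b2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":
            out.extend(a[a1:a2])
        if op in ("replace", "delete"):
            out.append("[-" + " ".join(a[a1:a2]) + "-]")
        if op in ("replace", "insert"):
            out.append("{+" + " ".join(b[b1:b2]) + "+}")
    return " ".join(out)

highlight_edits("He go to school", "He goes to school")
# 'He [-go-] {+goes+} to school'
&lt;/code&gt;&lt;/pre&gt;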

&lt;p&gt;&lt;strong&gt;Batch Processing Capabilities:&lt;/strong&gt; The tool handles file-based input (XLIFF 1.2/2.0, CSV) and implements batch processing workflows for up to 100 segments, demonstrating how LLMs can be practically deployed for professional translation evaluation scenarios.&lt;/p&gt;
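
&lt;p&gt;On the input side, a hedged sketch of batch loading: reading segment pairs from a CSV file and capping them at the stated 100-segment limit. The column names are assumptions, and the tool’s expected schema may differ.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import csv

MAX_SEGMENTS = 100  # the tool's stated per-batch evaluation limit

def load_segments(csv_path):
    """Read source/target pairs from CSV, capped at the batch limit.
    The 'source'/'target' column names are illustrative assumptions."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return [
        {"source": row["source"], "target": row["target"]}
        for row in rows[:MAX_SEGMENTS]
    ]
&lt;/code&gt;&lt;/pre&gt;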

&lt;h2&gt;AI Performance and Validation&lt;/h2&gt;

&lt;p&gt;The tool originated from Alconost's internal research into large language model (LLM) capabilities for assessing translation quality. The focus is on demonstrating how modern LLMs can provide consistent, granular translation evaluation when properly prompted and contextualized; however, the company emphasizes that this remains experimental technology requiring validation against human expert judgment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardized Scoring:&lt;/strong&gt; The implementation provides reproducible quality metrics that enable benchmarking across different translation providers, demonstrating potential for AI-driven quality assurance in professional translation workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explainable AI Output:&lt;/strong&gt; All assessments include detailed explanations of identified errors and scoring rationale, addressing the interpretability challenge common in AI evaluation systems.&lt;/p&gt;

&lt;h2&gt;Research and Industry Implications&lt;/h2&gt;

&lt;p&gt;The tool represents a practical exploration of several key AI research areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Engineering:&lt;/strong&gt; Optimization of evaluation prompts for consistent, professional-grade translation assessment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Model Integration:&lt;/strong&gt; Implementation of different state-of-the-art language models for specialized evaluation tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Window Utilization:&lt;/strong&gt; Effective use of extended context capabilities for incorporating custom guidelines and terminology&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured Output Generation:&lt;/strong&gt; Reliable extraction of formatted evaluation data from LLM responses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Open Access and Community Testing&lt;/h2&gt;

&lt;p&gt;Alconost.MT/Evaluate is available for free experimentation at alconost.mt with a 100-segment evaluation limit. No registration is required, enabling immediate testing by AI researchers and practitioners interested in translation evaluation applications.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The localization industry is experiencing fundamental disruption as GenAI becomes the top priority for translation professionals," said Alexander Murauski, CEO of Alconost. "We're pivoting from traditional people warehouses to becoming AI workflow implementors for clients, with humans evolving into human-in-the-loop scenarios. Fortunately, we had a strong product development culture within the company, so we were among the first to implement large language models (LLMs) in localization. We're at the cutting edge of GenAI adoption - it's vital for business survival, so we're excited to adopt and experiment as fast as possible."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tool complements Alconost's traditional human-based translation services while providing a testbed for the development of AI evaluation methodologies.&lt;/p&gt;

&lt;h2&gt;Technical Availability&lt;/h2&gt;

&lt;p&gt;The experimental tool is accessible at alconost.mt with API-based model integration for GPT-4 and Claude 3. Additional LLM integrations are planned based on community feedback and model availability.&lt;/p&gt;

&lt;h2&gt;About Alconost&lt;/h2&gt;

&lt;p&gt;Founded in 2004, Alconost provides professional localization services and has been actively researching AI applications in translation workflows. The company maintains a network of over 3,000 linguists while exploring how AI can augment rather than replace human expertise in language services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For technical details about Alconost.MT/Evaluate or collaboration opportunities, visit &lt;a href="https://alconost.com" rel="noopener noreferrer"&gt;https://alconost.com&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Alconost.MT/Evaluate is experimental research software. While demonstrating promising AI evaluation capabilities, professional translation quality assurance should continue to include human expert validation.&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>NMT vs. LLM: Who Wins the Translation Battle?</title>
      <dc:creator>Kseniya Autukhovich</dc:creator>
      <pubDate>Thu, 05 Jun 2025 10:12:05 +0000</pubDate>
      <link>https://dev.to/kseniya_autukhovich/nmt-vs-llm-who-wins-the-translation-battle-11f9</link>
      <guid>https://dev.to/kseniya_autukhovich/nmt-vs-llm-who-wins-the-translation-battle-11f9</guid>
      <description>&lt;p&gt;In the recent few years, the localization industry has been buzzing about AI translations. At Alconost, the team has gained valuable insights and want to share, so they are starting a series of AI translation know-how. &lt;/p&gt;

&lt;p&gt;This first article in the series covers the basics as a quick start before we dive deeper. Let’s compare the two main AI technologies for translation tasks: &lt;br&gt;
&lt;strong&gt;LLM (Large Language Models)&lt;/strong&gt;, referring to GenAI models like ChatGPT, Gemini, and Claude, and&lt;br&gt;
&lt;strong&gt;NMT (Neural Machine Translation)&lt;/strong&gt;, referring to engines like DeepL, Google Translate, and Microsoft Translator.&lt;/p&gt;

&lt;p&gt;By default, NMT engines are considered a good fit for standardized content like documentation, and LLMs are preferred for more creative and informal language. However, the choice gets complicated once fine-tuning, diverse language pairs and other aspects come into play.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To help you make informed decisions, we’ll discuss:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What technology stands behind NMT and LLM;&lt;/li&gt;
&lt;li&gt;What strengths and weaknesses NMT and LLM models have;&lt;/li&gt;
&lt;li&gt;Customizing NMTs and LLMs for your needs;&lt;/li&gt;
&lt;li&gt;Data privacy issues and how to avoid them;&lt;/li&gt;
&lt;li&gt;A must-do step when choosing the right AI model for your content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How do NMT and LLM models work?&lt;/strong&gt;&lt;br&gt;
Both NMT and LLM models are based on complex neural networks and trained on massive datasets. What sets them apart? &lt;/p&gt;

&lt;p&gt;NMTs were &lt;strong&gt;created specifically for translation&lt;/strong&gt;, while for LLMs, &lt;strong&gt;translation is just one of many text generation tasks&lt;/strong&gt; they can perform.&lt;/p&gt;

&lt;p&gt;As they were built for different purposes, they function differently. NMTs don’t ‘understand’ the text; they just translate it. LLMs, by contrast, can ‘understand’ longer passages, which enables them to analyze the content and generate culturally appropriate translations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv8onqvbj4lb21n5vxv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv8onqvbj4lb21n5vxv6.png" alt="Image description" width="540" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key differences between NMT and LLM translations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffev312z5cqtyper0htq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffev312z5cqtyper0htq4.png" alt="Image description" width="541" height="676"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s take a closer look at these differences.&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Output&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;NMT-based models are quite predictable:&lt;/strong&gt; for popular language pairs and domains, they normally produce accurate, neutral-styled translations. Such translations still need style adjustments and error corrections, but that is expected.&lt;/p&gt;

&lt;p&gt;However, their precision can be a double-edged sword: sometimes translations sound too literal and stiff. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NMT models struggle with:&lt;/strong&gt;&lt;br&gt;
⛔ Maintaining the same context across a piece of text, which leads to errors in consistency, style, and terminology;&lt;br&gt;
⛔ Adjusting to a certain style (DeepL offers a formal/informal toggle, but that’s it);&lt;br&gt;
⛔ Translating figurative language (metaphors, idioms, puns, etc.)&lt;/p&gt;

&lt;p&gt;💡 Some of these problems can be overcome by fine-tuning (fluency, consistency, terminology, style — to some extent), but unfortunately it’s not a silver bullet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs are much trickier.&lt;/strong&gt; On one hand, they beat NMT models with:&lt;br&gt;
✅ Fluent, natural-sounding translations;&lt;br&gt;
✅ Handling figurative language;&lt;br&gt;
✅ Maintaining context and consistency across longer texts;&lt;br&gt;
✅ Catching and mimicking style and tone of voice;&lt;br&gt;
✅ Handling translations even without training on parallel texts, and adapting quickly from just a few examples.&lt;/p&gt;

&lt;p&gt;On the other hand, &lt;strong&gt;the creative side of LLMs can be a problem:&lt;/strong&gt;&lt;br&gt;
⛔ Random results without the right prompt, like mixing Simplified and Traditional Chinese in one translation;&lt;br&gt;
⛔ Inconsistent handling of tags and formatting;&lt;br&gt;
⛔ Randomly leaving words untranslated;&lt;br&gt;
⛔ Hallucinations and made-up, non-existent words in low-resource languages;&lt;br&gt;
⛔ Inconsistent style and terminology without fine-tuning or carefully chosen prompts.&lt;/p&gt;

&lt;p&gt;As you can see, without proper guidance LLMs may take too many liberties with your content. &lt;strong&gt;The best way to ‘tame’ them is to use well-crafted prompts and additional context&lt;/strong&gt; (and sometimes fine-tuning).&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Pricing&lt;/strong&gt;&lt;br&gt;
NMTs offer a simple and transparent pricing model: they &lt;strong&gt;charge only for the input content&lt;/strong&gt;, and the input is calculated in characters. That’s it! When an NMT model goes through fine-tuning, the input is calculated in segments.&lt;/p&gt;

&lt;p&gt;LLMs have a more complex pricing structure — they use tokens and charge for the input and output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidfvimlb4kaa3bkxg6no.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidfvimlb4kaa3bkxg6no.png" alt="Image description" width="541" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are many intricacies involved in LLM pricing, so you may end up paying more with LLMs even though NMT per-unit rates are generally higher.&lt;/p&gt;
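
&lt;p&gt;A back-of-the-envelope comparison makes this concrete. All rates below are made-up placeholders, not actual vendor prices; the point is that LLM costs depend on input tokens, output tokens, and prompt overhead, so which option is cheaper depends entirely on the real rates and your workflow.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Placeholder rates for illustration only, not actual vendor pricing
doc_chars = 50_000                  # source document size in characters
nmt_rate = 20.00 / 1_000_000        # $ per character, input only

tokens_in = doc_chars / 4           # rough rule of thumb: ~4 chars per token
tokens_out = tokens_in              # the translation is roughly as long
prompt_overhead = 2_000 * 25        # instructions/glossary resent per batch
llm_rate_in = 3.00 / 1_000_000      # $ per input token
llm_rate_out = 15.00 / 1_000_000    # $ per output token

nmt_cost = doc_chars * nmt_rate
llm_cost = (tokens_in + prompt_overhead) * llm_rate_in + tokens_out * llm_rate_out
print(f"NMT: ${nmt_cost:.2f}   LLM: ${llm_cost:.2f}")
&lt;/code&gt;&lt;/pre&gt;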

&lt;p&gt;⭐ &lt;strong&gt;Speed&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;NMT models are fast&lt;/strong&gt; and can be used for real-time translations, while &lt;strong&gt;LLMs can be 100-500 times slower&lt;/strong&gt;! For example, if an NMT engine takes a few seconds to process a piece of content, an LLM may need several minutes.&lt;/p&gt;

&lt;p&gt;This difference becomes especially noticeable and annoying &lt;strong&gt;with larger projects&lt;/strong&gt;, typically content with over 2,000 words.&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Customization&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;NMTs have limited customization capabilities:&lt;/strong&gt; mainly translation memories (TMs) and glossaries. Still, NMT models fine-tuned with project-specific data can produce impressive results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to customize NMT:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;&lt;br&gt;
You can train an NMT model on previous translations (TM). To make a real difference in quality, &lt;strong&gt;you’ll need a very large TM&lt;/strong&gt; (at least 100K high-quality segments). This adds to the cost and takes a long time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Adaptive NMT&lt;/strong&gt;&lt;br&gt;
This translation approach means you provide the model with a dataset of reference sentences, and the model accesses them in real time to customize its translations. &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;NMT models that support this approach include ModernMT, Amazon Translate, and Google AutoML Translation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9f106h0db2bh9rk6beg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh9f106h0db2bh9rk6beg.png" alt="Image description" width="541" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs, on the contrary, offer flexible customization opportunities.&lt;/strong&gt; The most popular practices are prompt engineering and RAG.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;strong&gt;prompting&lt;/strong&gt;, LLM models can understand and adjust to the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tone of voice&lt;/li&gt;
&lt;li&gt;Audience&lt;/li&gt;
&lt;li&gt;Domain and content type&lt;/li&gt;
&lt;li&gt;Consistency and terminology&lt;/li&gt;
&lt;li&gt;Style&lt;/li&gt;
&lt;li&gt;Cultural adaptation, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It might be tempting to send one super long prompt to the LLM and think it’s done, but be careful — it may produce poor results. &lt;/p&gt;
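
&lt;p&gt;For illustration, a compact, focused prompt might look like the hedged sketch below. Everything in it (the language pair, tone instructions, message structure) is a made-up example of the technique rather than a recommended template.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;SYSTEM_PROMPT = (
    "You translate mobile-game UI strings from English into German. "
    "Audience: casual players. Tone: friendly and informal (du-form). "
    "Keep placeholders like {count} untouched and follow the glossary."
)

def translate_request(segment, glossary_snippet):
    # Inject only the glossary entries relevant to this segment,
    # keeping the effective context short and focused
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"Glossary:\n{glossary_snippet}\n\nTranslate: {segment}",
        },
    ]
&lt;/code&gt;&lt;/pre&gt;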

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsczgszwxdtdez1gvmcsf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsczgszwxdtdez1gvmcsf.png" alt="Image description" width="541" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Studies &lt;a href="https://arxiv.org/pdf/2502.05167" rel="noopener noreferrer"&gt;have shown&lt;/a&gt; that &lt;strong&gt;the effective context length of LLMs does not exceed 4K-8K tokens&lt;/strong&gt;, and their performance declines dramatically at 32K tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; is one of the techniques that helps LLM look for relevant context in TMs (translation memories), reference translations, glossaries, etc. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does it work?&lt;/strong&gt; We instruct the model how to interact with this additional context: for example, we can ask it to look for similar content in the knowledge base and pick up the style.&lt;/p&gt;

&lt;p&gt;🤔 &lt;em&gt;Why use RAG if we have prompts?&lt;/em&gt;&lt;br&gt;
We already put a lot of information into the context window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text for translation&lt;/li&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;li&gt;TM&lt;/li&gt;
&lt;li&gt;Glossary&lt;/li&gt;
&lt;li&gt;Additional context (images, reference translations, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If LLMs get &lt;strong&gt;too much content to process at a time&lt;/strong&gt;, their short-term memory gets overwhelmed, so to speak, and they start ignoring some of this context. RAG is a workaround for this problem.&lt;/p&gt;

&lt;p&gt;The RAG framework gives LLMs access to an external knowledge base when needed, without affecting their performance. Also, it provides the model with fresh, updated information without fine-tuning.&lt;/p&gt;
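
&lt;p&gt;Here is a minimal sketch of the idea using only the standard library: instead of packing the whole TM into the prompt, retrieve just the closest matches per segment and inject them as few-shot examples. A production setup would typically use embedding-based search instead of string similarity, and the TM entries here are made up.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import difflib

# Toy translation memory (made-up entries)
TM = {
    "Save changes?": "Änderungen speichern?",
    "Delete this file?": "Diese Datei löschen?",
    "Settings saved.": "Einstellungen gespeichert.",
}

def retrieve_examples(source_segment, k=2):
    """Fetch the k most similar TM entries for few-shot prompting."""
    matches = difflib.get_close_matches(source_segment, TM.keys(), n=k, cutoff=0.3)
    return [(src, TM[src]) for src in matches]

retrieve_examples("Save all changes?")
# closest match first: ('Save changes?', 'Änderungen speichern?')
&lt;/code&gt;&lt;/pre&gt;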

&lt;p&gt;⭐ &lt;strong&gt;Efficiency across languages&lt;/strong&gt;&lt;br&gt;
Both NMT and LLM models show &lt;strong&gt;good results for high-resource languages&lt;/strong&gt; like English, French, German, Spanish, etc. — that is, all languages with a vast amount of digital data available. &lt;/p&gt;

&lt;p&gt;However, both struggle with &lt;strong&gt;low-resource languages&lt;/strong&gt; — those with a limited digital presence, such as Finnish, Kazakh, Macedonian, Hindi dialects, African languages, etc. Non-English language pairs are especially challenging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnirhtiur982f0egn5moo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnirhtiur982f0egn5moo.png" alt="Image description" width="541" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With some rare language pairs (like Japanese-to-Thai translation) or uncommon combinations of domain and language, &lt;strong&gt;you may find that no AI model can produce decent translations&lt;/strong&gt; because of the lack of data. In this case, we try to find parallel data and fine-tune the model.&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Security risks&lt;/strong&gt;&lt;br&gt;
Most NMT and LLM models &lt;strong&gt;are public services trained on public data&lt;/strong&gt;, and the ways they use the information you submit are not always transparent.&lt;/p&gt;

&lt;p&gt;The good news is that you can &lt;strong&gt;protect your data by running an LLM model on a private infrastructure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobq81yarb6j8ueyctm4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobq81yarb6j8ueyctm4m.png" alt="Image description" width="541" height="675"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is best for translation, NMT or LLM?&lt;/strong&gt;&lt;br&gt;
From our experience, &lt;strong&gt;LLMs generally perform better&lt;/strong&gt;. It’s important to note, though, that our conclusions are based on the verticals and types of content we specialize in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Games&lt;/li&gt;
&lt;li&gt;Apps&lt;/li&gt;
&lt;li&gt;Software&lt;/li&gt;
&lt;li&gt;Helpdesks&lt;/li&gt;
&lt;li&gt;Marketing&lt;/li&gt;
&lt;li&gt;eCommerce&lt;/li&gt;
&lt;li&gt;Fintech&lt;/li&gt;
&lt;li&gt;eLearning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That doesn’t mean we totally dismiss NMT models — they may be more efficient not only for very large volumes, but also for certain combinations of domains and language pairs. &lt;strong&gt;That’s why we always run tests on the client’s content to select the best-suited models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, UI content is conventionally considered a poor fit for NMT models. But models are constantly evolving, and our tests show that in some cases NMT models can tackle UI content successfully.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The bottom line is:&lt;br&gt;
There is no ready answer to which AI model will do the job. Only tests can show which specific model(s) will work best for your particular content.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>nmt</category>
      <category>localization</category>
    </item>
  </channel>
</rss>
