
eternalsix

Posted on • Originally published at eternalsix.com

AI translation: post-editing best practices

AI Translation Post-Editing: What Nobody Tells You Until You've Burned a Client

Last year I watched a senior developer ship a localized SaaS product to Japan after running every string through GPT-4 and doing a 20-minute "sanity check." Three weeks post-launch, a native Japanese user filed a support ticket pointing out that the onboarding flow's CTA translated literally to "Please insert your email address into the hole." The model had chosen 穴 (hole/cavity) over 欄 (field/blank). Technically defensible. Catastrophically wrong. This is the gap that post-editing is supposed to close — and most AI workflows treat it like a formality rather than a discipline.


The Real Problem Isn't Accuracy, It's Confidence Miscalibration

Every developer who has shipped AI-translated content thinks the hard part is catching wrong translations. It isn't. Modern frontier models translate accurately at the sentence level 90%+ of the time across major language pairs. The hard part is that the remaining errors are distributed in a way that defeats normal review strategies.

AI translation errors cluster in specific zones: idiomatic expressions, domain-specific terminology with register ambiguity (formal vs. casual in Japanese, tu/vous in French for UI copy), numbers and units, and anything where the source text has intentional ambiguity (marketing copy, product names, taglines). These are also the zones your 20-minute reviewer skims fastest because everything looks fluent.

The fix isn't "review more carefully." It's building a triage system that surfaces high-risk segments before human attention gets wasted on segments the model nailed. If you're post-editing without risk scoring, you're applying equal effort to "Click Save" and "By using this service you agree to our Terms."


Build a Segment Risk Model Before You Post-Edit Anything

Before any human touches translated output, classify each segment by failure probability. This doesn't require a separate ML model — a rule-based classifier gets you 80% of the value:

High-risk signals:

  • Proper nouns the model wasn't trained to recognize (your product name, competitor names, internal jargon)
  • Segments where the source text is under 5 tokens (context-starved, so the model has to guess the register)
  • Segments containing numbers, currencies, dates, or units
  • Marketing or emotional language (superlatives, humor, metaphor)
  • UI strings with embedded variables or format strings ({username}, %d items)

Low-risk signals:

  • Procedural instructional text ("Click the button," "Enter your password")
  • Error messages following standard patterns
  • Boilerplate legal text with established translations in your TM

Route high-risk segments to a qualified human reviewer. Route low-risk segments to an automated consistency check against your glossary and translation memory. You've just made your post-editing workload 60% smaller without sacrificing quality where it counts.
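
A minimal sketch of the rule-based classifier described above, in Python. Everything named here is an assumption for illustration: the `PROTECTED_TERMS` set, the tiny marketing word list, the `score_segment` function, and the whitespace token count (which only makes sense for space-delimited source languages such as English).

```python
import re
from dataclasses import dataclass, field

# Hypothetical protected terms: swap in your own product names and jargon.
PROTECTED_TERMS = {"acme hub", "workspace"}

# Matches {username}-style variables and printf-style specifiers such as %d or %1$s.
PLACEHOLDER_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}|%(?:\d+\$)?[sdif]")
# Any digit or a common currency symbol.
NUMERIC_RE = re.compile(r"\d|[$€£¥]")
# A deliberately tiny list of marketing cues, for illustration only.
MARKETING_RE = re.compile(r"\b(best|amazing|fastest|love|effortless|ultimate)\b", re.IGNORECASE)


@dataclass
class RiskResult:
    segment: str
    reasons: list[str] = field(default_factory=list)

    @property
    def high_risk(self) -> bool:
        # Any single signal is enough to route the segment to a human.
        return bool(self.reasons)


def score_segment(source: str, min_tokens: int = 5) -> RiskResult:
    """Collect the high-risk signals present in one source segment."""
    result = RiskResult(segment=source)
    lowered = source.lower()
    if len(source.split()) < min_tokens:
        result.reasons.append("short_segment")
    if PLACEHOLDER_RE.search(source):
        result.reasons.append("placeholder")
    if NUMERIC_RE.search(source):
        result.reasons.append("numeric")
    if MARKETING_RE.search(source):
        result.reasons.append("marketing")
    if any(term in lowered for term in PROTECTED_TERMS):
        result.reasons.append("protected_term")
    return result
```

Segments with an empty `reasons` list go to the automated glossary and TM check; anything else goes to a reviewer along with the reasons it was flagged. The sketch treats every signal as equally important. Weighting them, or letting a low-risk signal override a high-risk one, is a design choice left open here.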


Glossary Enforcement Is Infrastructure, Not a Style Guide

Here's a pattern I've seen destroy otherwise solid AI translation pipelines: the team builds a glossary, puts it in a Google Doc, and tells translators to "refer to it." This works for human translators who internalize it over time. It doesn't work for AI workflows where the model is stateless per request and your post-editors have thirty seconds per segment.

Glossary enforcement needs to be machine-readable and checked automatically. Concretely:

  1. Pre-translation injection: Feed your glossary as a system prompt or structured context block on every translation call. Not as prose. As a structured term list the model can pattern-match against.
  2. Post-translation verification: Run a regex/NLP check on output to confirm that every source-language glossary term maps to its approved target-language equivalent. Flag mismatches before human review, not during.
  3. Version your glossary: When a term changes (you rebrand "workspace" to "hub"), you need to know which translated assets are stale. Treat glossary entries like database records with timestamps, not like a living document.
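
The three steps above can be sketched in Python roughly as follows. The `GlossaryEntry` record, the two English-to-French entries, and all function names are hypothetical. The verification step uses plain case-insensitive substring matching, which is an assumption that will miss inflected forms (plurals, conjugations) in the target language.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GlossaryEntry:
    source: str    # source-language term
    target: str    # approved target-language equivalent
    updated: date  # when this entry last changed


# Illustrative entries only, not taken from a real glossary.
GLOSSARY = [
    GlossaryEntry("workspace", "espace de travail", date(2024, 1, 10)),
    GlossaryEntry("dashboard", "tableau de bord", date(2024, 3, 2)),
]


def glossary_block(entries):
    """Step 1: render the glossary as a structured term list for the prompt context."""
    lines = ["GLOSSARY (source => required target):"]
    lines += [f"- {e.source} => {e.target}" for e in entries]
    return "\n".join(lines)


def glossary_violations(source_text, target_text, entries):
    """Step 2: entries whose source term appears but whose approved target does not."""
    src, tgt = source_text.casefold(), target_text.casefold()
    return [
        e for e in entries
        if e.source.casefold() in src and e.target.casefold() not in tgt
    ]


def stale_assets(translated_on, entries):
    """Step 3: asset ids translated before the most recent glossary change.

    `translated_on` maps asset id -> date the asset was translated. This is
    conservative: it flags an asset even if the changed term is not used in it.
    """
    newest = max(e.updated for e in entries)
    return sorted(a for a, d in translated_on.items() if d < newest)
```

The output of `glossary_block` is what you would prepend to every translation call, and a non-empty result from `glossary_violations` is what blocks a segment before it reaches a human. For morphologically rich target languages, a lemmatizer in place of the substring test is probably needed.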

The teams shipping clean localization at scale aren't reviewing more carefully. They've made violations structurally impossible to miss.


The Edit Distance Trap

There's a tempting metric in post-editing workflows: track how much editors change the raw MT output. Low edit distance = good MT quality = less human work. This is right in aggregate but dangerous at the segment level.

Editors learn to leave things that are wrong-but-passable because fixing them costs effort and the segment will "do." Over time, wrong-but-passable accumulates into a product that reads like it was translated by someone who speaks the language as a third language. Native users feel this before they can articulate it.

The counter-move: periodically sample segments with zero edit distance and run them past a native speaker specifically asking "does this feel natural?" Don't ask if it's correct. Correct and natural are different questions. You want to catch the category of errors where the model chose the dictionary-correct word that no native speaker would use in this context.
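
One way to pull that sample, sketched in Python. The `(segment_id, raw_mt, post_edited)` tuple layout and the function names are assumptions for illustration. The distance is character-level Levenshtein, though for the zero-distance case a plain string comparison would give the same result.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def sample_untouched(segments, k, seed=None):
    """Pick up to k segments the editor left unchanged (edit distance 0).

    `segments` is a list of (segment_id, raw_mt, post_edited) tuples.
    """
    untouched = [s for s in segments if levenshtein(s[1], s[2]) == 0]
    rng = random.Random(seed)
    return rng.sample(untouched, min(k, len(untouched)))
```

Passing a fixed `seed` makes the sample reproducible, which helps when you want two native speakers to rate the same batch. Keeping `levenshtein` around also lets you report per-segment distances for the aggregate metric.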

I've started calling these "invisible errors" because they pass automated QA, they pass tired reviewers, and they only surface when someone who actually speaks the language uses the product.


Post-Editing Checklist for AI-Translated Content

Before signing off on a translated asset, run through this in order:

Automated checks (should be blocking)

  • [ ] All glossary terms verified against approved target-language equivalents
  • [ ] Format strings and variables intact ({name}, %s, etc.)
  • [ ] Numbers, currencies, dates match source (or are correctly localized per locale rules)
  • [ ] No untranslated source-language strings in output
  • [ ] Character limits respected for UI strings (if applicable)
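
Most of these blocking checks are cheap to automate. Below is a rough Python sketch covering placeholders, numbers, untranslated strings, and character limits (the glossary check is omitted). The `check_segment` name and the problem labels are made up for illustration. The number check only compares which digits appear, so it tolerates locale formatting such as "1,000" vs "1.000" but will not catch two digits being swapped.

```python
import re
from collections import Counter

# Matches {name}-style variables and printf-style specifiers such as %s or %1$s.
PLACEHOLDER_RE = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}|%(?:\d+\$)?[sdif]")


def _digits(text):
    """Multiset of digit values, ignoring digits inside placeholders like %1$s.

    int() normalizes full-width digits, so a full-width 3 counts as 3.
    """
    stripped = PLACEHOLDER_RE.sub("", text)
    return Counter(int(ch) for ch in stripped if ch.isdecimal())


def check_segment(source, target, max_chars=None):
    """Return a list of blocking problems for one source/target pair."""
    problems = []
    if sorted(PLACEHOLDER_RE.findall(source)) != sorted(PLACEHOLDER_RE.findall(target)):
        problems.append("placeholders_changed")
    if _digits(source) != _digits(target):
        problems.append("numbers_changed")
    # Heuristic: identical text containing letters was probably never translated.
    if source.strip() == target.strip() and any(c.isalpha() for c in source):
        problems.append("untranslated")
    if max_chars is not None and len(target) > max_chars:
        problems.append("too_long")
    return problems
```

An empty list means the segment passes. The untranslated check will produce false positives for strings that are legitimately identical in both languages (brand names, "OK"), so an allowlist is a likely follow-up.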

Human review (high-risk segments only)

  • [ ] Proper nouns and brand names correctly handled
  • [ ] Register/formality consistent with target market conventions
  • [ ] Idiomatic expressions resolve to natural target-language equivalents, not literal calques
  • [ ] CTAs and emotional/marketing copy reviewed by a native speaker, not just a bilingual one
  • [ ] Zero-edit-distance sample spot-check for naturalness

Final

  • [ ] Changes back-propagated to translation memory for future segments
  • [ ] Anomalous segments (high edit distance, unusual errors) flagged for model prompt improvement

This isn't exhaustive. It's the minimum that prevents the categories of errors that actually reach production.


How AI Handler Approaches This

Building AI Handler has forced me to think about translation workflows as a first-class use case, not an afterthought. What I kept running into was that the standard advice — "use a glossary, review your output, hire a translator for sensitive content" — is correct but unactionable inside a real development workflow where translation is one of twenty AI tasks running in parallel.

AI Handler's approach is to treat post-editing as a structured pipeline stage, not a manual review step. That means: risk scoring happens automatically before any segment reaches a human reviewer, glossary enforcement is a compiled rule set that runs on every translation output before it's committed, and edit distance anomalies surface as workflow alerts rather than silent quality degradation.

The specific thing I'm building that I haven't seen elsewhere is a segment-level confidence audit trail — every translated segment carries metadata about why it was flagged or cleared, what glossary terms were checked, and what the model's instruction context was. When something goes wrong in production (and it will), you can trace it back to the exact point in the pipeline where the decision was made, rather than staring at a finished translation trying to figure out what happened.
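
To make the idea concrete, here is a hypothetical shape such a per-segment audit record could take, sketched in Python. This is not AI Handler's actual schema. The `SegmentAudit` class and every field name in it are assumptions chosen to mirror the metadata listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class SegmentAudit:
    segment_id: str
    source: str
    target: str
    risk_reasons: list[str]            # why it was flagged; empty means cleared
    glossary_terms_checked: list[str]  # glossary terms verified for this segment
    prompt_context_id: str             # reference to the instruction context used
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def status(self) -> str:
        return "flagged" if self.risk_reasons else "cleared"

    def to_json(self) -> str:
        """Serialize the record, including the derived status, as one JSON line."""
        record = asdict(self)
        record["status"] = self.status
        return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Writing one JSON line per segment to an append-only log is enough to answer "why did this string ship?" later, since each record shows the risk reasons, the glossary terms checked, and which prompt context produced the translation.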

The goal isn't to eliminate human judgment from translation. It's to make sure human judgment gets spent on the segments where it actually moves the needle, not on verifying that "Click Save" was translated correctly for the fourteenth time.


AI Handler is the unified AI workflow tool I am building. Launching June 2026. Email ceo@eternalsix.com for beta access.
