
Auke de Haan

Posted on • Originally published at aitoolhub.nl

What I learned testing AI translation tools in 2026 (DeepL is still good, but LLMs caught up)

Translation used to be a settled question. DeepL was best at European languages, Google Translate was best at coverage, and you picked whichever fit your stack. In 2026 the picture is more interesting, because the big LLMs (Claude, ChatGPT, Gemini, Mistral) now match or beat the specialised tools on the same prompt, with one important catch.

I spent a couple of weeks pushing translation tasks through six tools in five language pairs (Dutch, German, French, Spanish and Polish, each paired with English). Here is what actually shifted.

DeepL still wins for raw translation, but the gap has shrunk

For a clean source-to-target translation with no instructions, DeepL still produces the most native-sounding output in European languages. It gets compound nouns, idioms and register choice right more often than the LLMs. That has not changed.

What has changed: the gap is now small. On EN to NL, EN to DE and EN to FR, Claude and GPT-5 are close behind. For most marketing copy you could publish the LLM output without editing.

The thing LLMs do that DeepL cannot

DeepL translates. It does not adapt. If you ask DeepL to translate a US blog post for a Dutch audience, you get a literal Dutch version, references to American cultural artefacts intact, dollars not euros, the lot.

Ask Claude or GPT-5 the same thing with a one-line instruction ("adapt for a Dutch audience, replace US-specific references, keep tone friendly but direct") and the output is a Dutch blog post, not a Dutch translation. References get localised, examples are swapped, tone shifts.
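As a rough illustration, the one-line instruction can be kept in a small helper so that every document gets the same adaptation brief. This is a minimal sketch: the function name, its parameters and the separator line are assumptions made for the example, and the resulting string would still have to be sent to whichever LLM you use.

```python
def build_adaptation_prompt(
    source_text: str,
    audience: str = "Dutch",
    tone: str = "friendly but direct",
) -> str:
    """Prefix the source text with a one-line adaptation instruction.

    Hypothetical helper: it only builds the prompt string and does not
    call any model or vendor API.
    """
    instruction = (
        f"Adapt for a {audience} audience, replace US-specific references, "
        f"keep tone {tone}."
    )
    # A separator line keeps the instruction visually apart from the text.
    return f"{instruction}\n\n---\n{source_text}"


if __name__ == "__main__":
    print(build_adaptation_prompt("Save $50 this Thanksgiving weekend."))
```

Keeping audience and tone as parameters makes it easy to reuse the same brief for another language pair.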

For anything you actually publish, that is a much bigger productivity win than the small quality edge DeepL holds on raw fidelity.

When to use which

My current default after the testing:

  1. Short, neutral, high-fidelity translation (legal text, product descriptions, UI strings). DeepL.
  2. Long-form content that needs to feel native and adapted (blogs, emails, marketing). Claude with explicit instructions about audience and tone.
  3. Bulk volume on a tight budget. Mistral Le Chat. Cheaper than the US LLMs, surprisingly competitive on FR and IT.
  4. Rare or low-resource language pairs. Gemini, which seems to have benefited most from large-scale multilingual training in the last year.
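If you want that default as something a script can apply, a tiny routing function is enough. The sketch below is only one way to encode the list: the `Job` fields and the precedence (low-resource pair first, then budget, then content type) are assumptions, since the list itself does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical description of a translation job."""
    kind: str                        # "short_neutral" or "long_form"
    tight_budget: bool = False       # bulk volume on a tight budget
    low_resource_pair: bool = False  # rare or low-resource language pair


def pick_tool(job: Job) -> str:
    """Return the default tool for a job.

    The order of the checks is an assumption for this example.
    """
    if job.low_resource_pair:
        return "Gemini"
    if job.tight_budget:
        return "Mistral Le Chat"
    if job.kind == "long_form":
        return "Claude"
    # Short, neutral, high-fidelity text (legal, product, UI strings).
    return "DeepL"


if __name__ == "__main__":
    print(pick_tool(Job(kind="long_form")))
    print(pick_tool(Job(kind="short_neutral")))
```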

Two non-obvious gotchas

A pair of things that bit me during testing:

  • LLMs hallucinate brand names. If your source mentions a specific product, an LLM will sometimes "correct" it to a more famous similar product. DeepL never does this. Worth catching with a glossary or a post-translation diff.
  • Privacy matters more than people remember. Pasting a client contract into a US LLM is not the same as running it through an EU-hosted DeepL Pro account. For regulated content (legal, HR, medical) the hosting story still matters in 2026.
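For the brand-name problem, a simple post-translation check goes a long way. The sketch below assumes you keep a list of protected terms that must survive verbatim; the function name and the product names in the example are made up for illustration. It uses plain case-sensitive substring matching, so a brand name that is legitimately inflected or re-cased in the target language would show up as a false positive.

```python
def dropped_terms(source: str, translation: str, protected: list[str]) -> list[str]:
    """Return protected terms that occur in the source but not in the translation.

    Hypothetical helper using exact, case-sensitive substring matching.
    """
    return [
        term
        for term in protected
        if term in source and term not in translation
    ]


if __name__ == "__main__":
    # "Notarella", "Plannix" and "FamousNotes" are invented product names.
    source = "Sync your notes with Notarella and Plannix."
    translation = "Synchroniseer je notities met FamousNotes en Plannix."
    print(dropped_terms(source, translation, ["Notarella", "Plannix"]))
    # prints ['Notarella']
```

Anything the check returns is a candidate for manual review rather than proof of an error.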

I compiled the full hands-on comparison, including how each tool handles Dutch specifically and what the real GDPR position looks like, in this 2026 AI translation tool guide. Curious how others here are mixing LLMs and dedicated translation tools, especially for less-common language pairs. What is winning in your stack?
