Alexey D

Posted on Jul 1

Detecting Text Language in Production — What the Libraries Don't Tell You

#api #nlp #python #javascript

langdetect, langid, and fastText are all fine libraries, and they share the same weakness: they struggle with short text, mixed-language input, and transliterated text. When a user submits a 6-word product review or a support ticket written half in English and half in their native language, accuracy drops fast.

There's also the question of what you do with the detection result. Routing support tickets to the right team, serving the right content, auto-tagging user-generated content — these need a reliable signal, not a guess with no confidence metric.

What production language detection actually needs

Confidence score — "I'm 94% sure this is Russian" is useful. "It's Russian" with no probability is not.
Script detection — Latin, Cyrillic, Arabic, CJK, Devanagari. Sometimes you need the script, not the language. A Russian text in Latin transliteration is still Russian content.
Dialect hints — Portuguese from Brazil vs Portugal, Chinese Simplified vs Traditional, American vs British English.
Reliability flag — when the text is too short or too ambiguous, you need to know that, not get a confident wrong answer.
Batch processing — running per-text HTTP calls at scale is wasteful.

The API

curl -X POST https://protection-meeting-what-elvis.trycloudflare.com/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "Не могу войти в аккаунт, помогите пожалуйста"}'

Response:

{
  "language": "ru",
  "language_name": "Russian",
  "confidence": 0.97,
  "script": "Cyrillic",
  "dialect_hint": null,
  "reliability": "high",
  "alternatives": [
    {"language": "uk", "language_name": "Ukrainian", "confidence": 0.02},
    {"language": "bg", "language_name": "Bulgarian", "confidence": 0.01}
  ]
}

reliability is high when confidence is above 90%, medium for 70–90%, and low below 70%. For routing and tagging workflows, I'd suggest acting only on high and medium results and sending low ones to human review or fallback logic.
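If you want to apply the same bands to your own thresholds or log them client-side, here is a minimal sketch. The function names are mine, not part of the API, and how the API treats values exactly at 0.90 or 0.70 is an assumption.

```python
def reliability_from_confidence(confidence: float) -> str:
    """Map a 0-1 confidence score to the bands described above.

    Assumption: exactly 0.90 counts as "medium" and exactly 0.70 as "medium";
    the API's own boundary handling is not documented in this post.
    """
    if confidence > 0.90:
        return "high"
    if confidence >= 0.70:
        return "medium"
    return "low"


def should_act(result: dict) -> bool:
    """Act automatically only on high/medium results; leave the rest for review."""
    return result.get("reliability") in ("high", "medium")
```

In practice you would trust the reliability field the API returns and keep a helper like this only for cases where you want stricter cutoffs than the defaults.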

Python — routing support tickets

import requests

def detect_language(text: str) -> dict:
    r = requests.post(
        "https://protection-meeting-what-elvis.trycloudflare.com/detect",
        json={"text": text},
        timeout=5
    )
    r.raise_for_status()
    return r.json()

def route_ticket(ticket_text: str, ticket_id: str):
    # assign_to_queue is a placeholder for your own ticketing-system hook; it is not defined here.
    result = detect_language(ticket_text)

    lang = result["language"]
    confidence = result["confidence"]
    reliability = result["reliability"]

    if reliability == "low":
        assign_to_queue("multilingual-review", ticket_id)
        return

    routing_map = {
        "en": "support-en",
        "de": "support-de",
        "fr": "support-fr",
        "es": "support-es",
        "ru": "support-ru",
        "zh": "support-zh",
    }

    queue = routing_map.get(lang, "support-general")
    assign_to_queue(queue, ticket_id, metadata={
        "detected_lang": lang,
        "confidence": confidence,
        "script": result["script"]
    })

Batch processing

Doing one HTTP call per document is wasteful. The batch endpoint takes up to 20 texts:

import requests

texts = [
    "Hello, I need help with my order",
    "Bonjour, j'ai un problème avec ma commande",
    "Здравствуйте, у меня проблема с заказом",
    "こんにちは、注文に問題があります",
    "مرحبا، لدي مشكلة في طلبي",
]

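r = requests.post(
    "https://protection-meeting-what-elvis.trycloudflare.com/batch",
    json={"texts": texts},
    timeout=10,  # requests has no default timeout, so set one explicitly
)
r.raise_for_status()  # fail on HTTP errors before parsing the body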

for item in r.json()["results"]:
    print(f"{item['language_name']:15} ({item['confidence']:.0%}) [{item['script']}] — {item['reliability']}")

Output:

English         (99%) [Latin]    — high
French          (98%) [Latin]    — high
Russian         (97%) [Cyrillic] — high
Japanese        (99%) [CJK]      — high
Arabic          (98%) [Arabic]   — high
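With more than 20 texts you need to split the input yourself. Below is a rough sketch of that chunking. `chunked`, `detect_many`, and the `post_batch` callable are hypothetical names; `post_batch` stands in for whatever function sends one list to the batch endpoint and returns its "results" list. I'm also assuming results come back in input order, which the post doesn't state.

```python
from typing import Callable, Iterator, List

BATCH_LIMIT = 20  # maximum texts per batch request, per the limit stated above


def chunked(items: List[str], size: int = BATCH_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def detect_many(
    texts: List[str],
    post_batch: Callable[[List[str]], List[dict]],
) -> List[dict]:
    """Send texts in batch-sized chunks and concatenate the results."""
    results: List[dict] = []
    for chunk in chunked(texts):
        results.extend(post_batch(chunk))
    return results
```

Passing the sender in as a function keeps the chunking logic testable without network access; in real use it would wrap the requests.post call to the batch endpoint.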

Script vs Language — when you need the script

Sometimes you don't care which Slavic language it is. You just need to know whether the text is in Cyrillic so you can pick the right font, keyboard layout, or content policy.

result = detect_language(user_bio)

if result["script"] == "Arabic":
    enable_rtl_layout()
elif result["script"] in ("CJK", "Hiragana", "Katakana"):
    enable_cjk_typography()

Supported scripts: Latin, Cyrillic, Arabic, CJK (Chinese/Japanese/Korean), Devanagari, Hebrew, Greek, Thai, Georgian, Armenian, and others.
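For a rough local pre-check (say, before spending an API call), the dominant script can be approximated from Unicode character names using only the standard library. This is not how the API determines the script; it is a sketch, the label names are chosen to mirror the ones above, and `dominant_script` is a hypothetical helper.

```python
import unicodedata
from collections import Counter
from typing import Optional

# First word of the Unicode character name, mapped to a script label.
_NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "CJK": "CJK",
    "HIRAGANA": "Hiragana",
    "KATAKANA": "Katakana",
    "HANGUL": "Hangul",
    "DEVANAGARI": "Devanagari",
    "HEBREW": "Hebrew",
    "GREEK": "Greek",
    "THAI": "Thai",
}


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent known script among the letters, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = _NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script:
            counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None
```

A per-character majority vote like this says nothing about language, and it will label kana-heavy Japanese text by its kana rather than as CJK, so treat it as a cheap hint only.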

Dialect hints

result = detect_language("I was queueing outside the colour shop")
# {"language": "en", "dialect_hint": "en-GB", "confidence": 0.89}

result = detect_language("Eu moro em São Paulo há cinco anos")
# {"language": "pt", "dialect_hint": "pt-BR", "confidence": 0.94}

This is useful for content localization — even when the language is the same, en-US and en-GB users expect different date formats, spelling, and sometimes terminology.
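One way to use the hint for localization is to fall back from dialect to language to a default. This is a sketch: `pick_locale`, `date_format_for`, and the `DATE_FORMATS` table are illustrative names and values of my own, not something the API provides.

```python
def pick_locale(result: dict, default: str = "en-US") -> str:
    """Prefer the dialect hint, then the bare language code, then a default."""
    return result.get("dialect_hint") or result.get("language") or default


# Hypothetical per-locale settings; extend with whatever your app supports.
DATE_FORMATS = {
    "en-US": "%m/%d/%Y",
    "en-GB": "%d/%m/%Y",
    "pt-BR": "%d/%m/%Y",
}


def date_format_for(result: dict) -> str:
    """Look up a strftime pattern for the detected locale, ISO-style otherwise."""
    return DATE_FORMATS.get(pick_locale(result), "%Y-%m-%d")
```

Because dialect_hint can be null, as in the Russian example earlier, the fallback chain matters more than the lookup table itself.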

What it doesn't do well

Short text under 10–15 characters is genuinely hard. "OK", "Thanks", "Hola" — these are ambiguous by nature. The API will still return a result, but reliability will be low. Plan for that in your logic.
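"Plan for that" can be as simple as a guard in front of your routing logic. The sketch below assumes a 15-character cutoff (the upper end of the range above) and uses hypothetical helper names; tune the cutoff against your own data.

```python
from typing import Optional

MIN_CHARS = 15  # assumed cutoff; the post only gives a 10-15 character range


def is_too_short(text: str, min_chars: int = MIN_CHARS) -> bool:
    """True when the text has fewer than min_chars non-whitespace characters."""
    return len("".join(text.split())) < min_chars


def needs_fallback(text: str, result: Optional[dict]) -> bool:
    """Decide whether to skip automatic routing for this text."""
    if is_too_short(text):
        return True  # too little signal, whatever the detector says
    return result is None or result.get("reliability") == "low"
```

Note that this deliberately overrides a high reliability value on very short input. If that is too conservative for your traffic, drop the length check and rely on the reliability field alone.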

Mixed-language text (code-switching like "Привет, let me знаешь explain this") is also tricky — the API picks the dominant language, not both. If you need mixed-language detection, the alternatives array helps, but this isn't a use case the API optimizes for.
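If you want a crude "might be mixed" signal from the alternatives array, one option is to check how strong the runner-up is. This is a heuristic sketch, not an API feature: `possibly_mixed` is a hypothetical name and the 0.20 threshold is an arbitrary starting point.

```python
def possibly_mixed(result: dict, min_runner_up: float = 0.20) -> bool:
    """Flag results whose strongest alternative is itself fairly likely."""
    alternatives = result.get("alternatives") or []
    runner_up = max(
        (alt.get("confidence", 0.0) for alt in alternatives),
        default=0.0,  # no alternatives means nothing competes with the top pick
    )
    return runner_up >= min_runner_up
```

A strong runner-up can also mean two closely related languages rather than real code-switching, so this is better suited to flagging text for review than to making decisions.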

Languages supported

55+ languages are supported. The GET /languages endpoint returns the complete map with ISO codes.

Common ones: English, Russian, German, French, Spanish, Italian, Portuguese, Dutch, Polish, Turkish, Arabic, Chinese (Simplified/Traditional), Japanese, Korean, Ukrainian, Hindi, Persian, Swedish, Danish, Finnish, Norwegian, Czech, Romanian, Hungarian, and more.

Try it

There is a free tier on RapidAPI: search for "Language Detection API". The test endpoint works without authentication.
