Ahmed Mahmoud

I Built an AI Language Tutor — Here's What I Learned About NLP

Building a conversational language tutor sounds straightforward until you actually do it. You imagine a sleek interface, a model that listens and responds, and users happily chatting their way to fluency. What you get instead is a humbling education in the gap between demo and production NLP.

Here's an honest technical breakdown of what I built, what broke, and what I'd do differently.

The Core Problem: Language Learning Needs More Than a Chatbot

A raw large language model is fluent. That's the problem. You're trying to teach someone Italian, and your AI responds with flawless, complex sentences that immediately overwhelm an A2 learner. The first engineering challenge isn't getting the model to speak — it's getting it to speak badly on purpose.

Vocabulary Grading

CEFR (the Common European Framework of Reference for Languages) defines language proficiency in six levels: A1, A2, B1, B2, C1, C2. Each level has a corresponding vocabulary band: A1 covers roughly 500–700 words, while C2 expands to 16,000+.
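Level comparisons come up constantly in the grading code, so it helps to map the CEFR labels to ordinals up front. A minimal sketch — the helper name is mine, not from the production code:

```python
# Map CEFR labels to ordinal ranks so levels compare numerically
# instead of relying on string ordering.
CEFR_ORDER = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

def within_level(word_level: str, target_level: str) -> bool:
    """True if a word's CEFR band is at or below the learner's target band."""
    return CEFR_ORDER[word_level] <= CEFR_ORDER[target_level]
```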

To grade output, I built a vocabulary filter that:

  1. Tokenises the model's response using a language-specific tokenizer (spaCy for European languages, MeCab for Japanese).
  2. Lemmatises each token to its base form.
  3. Checks each lemma against a CEFR word list (freely available from EVP — English Vocabulary Profile for English, ELP for other languages).
  4. Flags any word above the user's target CEFR band.
  5. Rewrites the prompt to the model, instructing it to replace flagged vocabulary with simpler alternatives.

This two-pass approach — generate then simplify — adds latency (roughly 300–600 ms on GPT-4o-mini) but produces dramatically more appropriate output.

# Assumes nlp_models maps language codes to preloaded spaCy pipelines and
# cefr_level() returns the word's CEFR label. CEFR labels happen to sort
# correctly as strings ("A1" < "A2" < ... < "C2"), which is what makes the
# comparison below work; an explicit ordinal mapping is more robust.
def grade_vocabulary(text: str, target_level: str, lang: str) -> dict:
    doc = nlp_models[lang](text)
    lemmas = [token.lemma_.lower() for token in doc if token.is_alpha]
    above_level = [
        w for w in lemmas
        if cefr_level(w, lang) > target_level
    ]
    return {
        "flagged": above_level,
        "needs_rewrite": len(above_level) > 0,
    }
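The rewrite pass (step 5) is just a second model call with the flagged words spelled out in the prompt. A sketch of the prompt builder — the wording is illustrative, not the exact production prompt:

```python
def build_simplify_prompt(response: str, flagged: list[str], target_level: str) -> str:
    """Build the second-pass instruction asking the model to simplify flagged words."""
    word_list = ", ".join(flagged)
    return (
        f"Rewrite the following reply for a {target_level} learner. "
        f"Replace these words with simpler alternatives: {word_list}. "
        f"Keep the meaning and tone unchanged.\n\n{response}"
    )
```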

Intent Classification

A language tutor needs to handle multiple conversation modes:

  • Free conversation — user just chats, AI responds naturally
  • Correction mode — AI corrects grammar errors
  • Vocabulary drill — spaced repetition flashcard loop
  • Pronunciation practice — AI evaluates user speech (more on this below)
  • Translation check — user submits a translation, AI grades it

I initially tried to detect intent from the user's message alone. This worked about 70% of the time and failed spectacularly the other 30%. A user saying "how do you say 'dog'?" looks like a translation question, but in context might be a free conversation turn where they forgot a word.

The fix was maintaining a session state machine — a small enum that tracks which mode the session is currently in, and only transitions based on explicit user signals (tapping a mode button) or unambiguous intent patterns (a message that matches a known vocabulary-query pattern with 90%+ confidence).

from enum import Enum

class SessionMode(Enum):
    FREE_CONVERSATION = "free"
    CORRECTION = "correction"
    VOCAB_DRILL = "vocab_drill"
    TRANSLATION = "translation"
    PRONUNCIATION = "pronunciation"

State transitions are logged per-session and stored with the conversation history, which lets the model use few-shot context to stay coherent across mode switches.

The Correction Problem: How Do You Correct Without Killing Motivation?

This is where NLP meets pedagogy. Immediate, constant correction is psychologically harmful to language learners — it creates anxiety and suppresses output. But zero correction means fossilisation (permanent bad habits).

Research from Krashen's input hypothesis and subsequent work suggests delayed, selective correction is most effective. Specifically:

  • Correct only errors that impede comprehension, not stylistic differences
  • Use recasts (repeating the correct form naturally) rather than explicit metalinguistic feedback
  • Correct no more than 2–3 errors per conversational turn

I implemented this with a two-model pipeline:

  1. Error detection model: A fine-tuned classifier that labels errors by type (morphological, syntactic, lexical, pragmatic) and severity (comprehension-blocking vs. minor).
  2. Correction strategy model: Given the detected errors and the learner's level, decides which to correct and how.

For the error detection step, I initially tried prompting GPT-4 with a structured output schema. It worked but was expensive at scale. I switched to a smaller fine-tuned model (DistilBERT fine-tuned on the NUCLE corpus for English, with similar datasets for Spanish and French) that runs locally and costs nothing per inference.
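Between detection and strategy, the classifier's raw labels have to be mapped into the error-dict shape the strategy step consumes. A hypothetical glue function — the `type:severity` label scheme is my own illustration, not the actual model's output format:

```python
def parse_error_label(label: str, span: str) -> dict:
    """Split a 'type:severity' classifier label into the error dict
    consumed by the correction-strategy step."""
    err_type, severity = label.split(":")
    return {"type": err_type, "severity": severity, "span": span}
```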

def select_corrections(errors: list[dict], level: str) -> list[dict]:
    # Only return comprehension-blocking errors for A1/A2
    if level in ("A1", "A2"):
        return [e for e in errors if e["severity"] == "blocking"][:2]
    # For B1+, also surface morphological errors (a type, not a severity)
    return [
        e for e in errors
        if e["severity"] == "blocking" or e["type"] == "morphological"
    ][:3]

Handling 20+ Languages: The Tokenisation Nightmare

Supporting multiple languages isn't just a UI translation problem. Every language has fundamentally different tokenisation requirements:

| Language | Tokenisation challenge |
| --- | --- |
| English | Relatively easy — whitespace + punctuation |
| German | Compound words (Donaudampfschifffahrtsgesellschaft) need decompounding |
| Japanese | No word boundaries — requires morphological analysis (MeCab, SudachiPy) |
| Arabic | Right-to-left script, root-based morphology, heavy inflection |
| Chinese | Word segmentation (jieba, pkuseg) required |
| Turkish | Agglutinative — one word can express a full English sentence |

I ended up with a language-router pattern:

# Preloaded pipelines: en_nlp / de_nlp are spaCy models; the other helpers
# wrap MeCab, jieba, and camel-tools and each return list[str]. The spaCy
# calls are wrapped so every route returns plain strings.
TOKENISERS = {
    "en": lambda text: [t.text for t in en_nlp(text)],
    "de": lambda text: [t.text for t in de_nlp(text)],  # spaCy de_core_news_sm
    "ja": lambda text: mecab_tokenise(text),
    "zh": lambda text: jieba_tokenise(text),
    "ar": lambda text: cameltools_tokenise(text),
}

def tokenise(text: str, lang: str) -> list[str]:
    # Fall back to naive whitespace splitting for unrouted languages.
    tokeniser = TOKENISERS.get(lang, str.split)
    return tokeniser(text)

This adds a dependency per language, but there's no general solution. Trying to use a single tokeniser across language families will produce garbage results for CJK and Arabic.

Latency: The Real UX Killer

In a conversational app, users tolerate roughly 800–1200 ms of latency before it feels broken. My initial pipeline — tokenise, check vocabulary, call LLM, validate response — was running at 2.4s average. That's a broken app.

The optimisations that actually moved the needle:

  1. Stream the LLM response: Use server-sent events to start rendering the AI response before it's complete. Perceived latency drops by 60%+ even with identical total generation time.
  2. Cache vocabulary grade results: Vocabulary checks on the input (not the output) can be cached with a short TTL. Most users repeat similar vocabulary within a session.
  3. Run CEFR grading async on a separate thread: Don't block the main response path. If the grade check returns before the response is done, you can still intercept; if not, let it through and grade the next turn.
  4. Move error detection to a smaller local model: 8ms on DistilBERT vs 400ms on GPT-4. Not suitable for all tasks but fine for binary error flagging.
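Optimisation 2 — caching input-side vocabulary grades — needs nothing fancier than a dict keyed on (text, level, language) with timestamps. A minimal sketch; the 10-minute TTL is an assumption, not the production value:

```python
import time

CACHE_TTL_SECONDS = 600  # assumed TTL; tune to typical session length
_grade_cache: dict[tuple[str, str, str], tuple[float, dict]] = {}

def cached_grade(text: str, target_level: str, lang: str, grade_fn) -> dict:
    """Return a cached vocabulary grade if still fresh, else recompute and store."""
    key = (text, target_level, lang)
    hit = _grade_cache.get(key)
    if hit is not None and time.monotonic() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]
    result = grade_fn(text, target_level, lang)
    _grade_cache[key] = (time.monotonic(), result)
    return result
```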

After these changes, p50 latency dropped to 680ms and p95 to 1.1s — comfortably within the threshold.

What I'd Do Differently

  1. Start with a state machine from day one. I retrofitted it after the intent classification failures. Every conversational app needs one.
  2. Invest in evaluation datasets early. Without labeled examples of good vs. bad corrections for each language level, you're flying blind. NUCLE, BEA-2019, and Lang-8 are good starting points.
  3. Separate the LLM call from the grading logic. Mixing them makes both harder to test. A clean pipeline — generate → validate → rewrite if needed — is worth the extra roundtrip.
  4. Don't underestimate language-specific engineering costs. Adding Japanese support took 3x longer than adding Spanish. Budget accordingly.

Building a language tutor is one of the most rewarding NLP projects you can take on — every part of the stack from tokenisation to pedagogy shows up in the product. The challenge is exactly what makes it worth doing.


I'm building Pocket Linguist, an AI-powered language tutor for iOS. It uses spaced repetition, camera translation, and conversational AI to help you reach conversational fluency faster. Try it free.
