
Andrew


What we learned designing an AI Tutor for language learners

When we started building an AI Tutor at Promova, I scoped the hard problem as model selection. Pick a capable LLM, put a speech-to-text and text-to-speech layer around it, and you have a tutor. That scoping was wrong. The model turned out to be the least interesting part of the system.

We shipped in August 2025. Over the first six months, daily active users grew 30x to a peak of more than 850 unique learners a day, and more than 70,000 people have started at least one voice session. Here is what the build actually taught us, mostly about the parts that sit around the model rather than the model itself.

The model is a commodity; the loop is the product
Off-the-shelf models cleared our quality bar early. Swapping providers moved our retention numbers far less than tuning the pipeline did. The thing that separated a tutor people returned to from one they abandoned was the conversational loop: speech-to-text, inference, text-to-speech, and the orchestration between them, especially turn detection and barge-in.

If you are building voice, budget your engineering time accordingly. We spent most of ours on the span between "user stopped speaking" and "first audio plays back," not on prompt cleverness. End-to-end perceived latency is the system, and the system is what users feel.
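One way to make that span concrete is to timestamp each stage of a turn and compare it against a per-stage budget. The sketch below is illustrative only: the `TurnTimings` fields, the stage names, and the budget numbers are assumptions made for the example, not figures from our pipeline.

```python
from dataclasses import dataclass


@dataclass
class TurnTimings:
    """Timestamps in seconds for one spoken turn (hypothetical field names)."""
    user_stopped: float    # last speech frame from the learner
    endpoint_fired: float  # turn detector decided the learner was done
    first_token: float     # first LLM token arrived
    first_audio: float     # first TTS audio started playing


# Illustrative per-stage budget in seconds; the numbers are assumptions.
BUDGET = {"endpointing": 0.40, "inference": 0.50, "tts": 0.30}


def stage_latencies(t: TurnTimings) -> dict[str, float]:
    """Split the gap between end of speech and first audio into stages."""
    return {
        "endpointing": t.endpoint_fired - t.user_stopped,
        "inference": t.first_token - t.endpoint_fired,
        "tts": t.first_audio - t.first_token,
    }


def over_budget(t: TurnTimings, budget: dict[str, float] = BUDGET) -> list[str]:
    """Return the names of the stages that exceeded their budget, sorted."""
    stages = stage_latencies(t)
    return sorted(name for name, spent in stages.items() if spent > budget[name])


if __name__ == "__main__":
    turn = TurnTimings(user_stopped=0.0, endpoint_fired=0.7,
                       first_token=1.0, first_audio=1.2)
    print(stage_latencies(turn))
    print(over_budget(turn))
```

Splitting the total this way shows which stage to attack. In the sample turn, inference and TTS are within budget and the time is lost waiting on the endpointing timeout.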

Latency is a UX requirement, not an optimization
A two-second pause in a text app is fine. Two seconds of dead air in a spoken turn is brutal. It reads as hesitation, and a hesitating partner makes an anxious learner more anxious. Several users described the tutor as "judgmental" when the actual fault was a slow time-to-first-token.

So we treated latency as a hard requirement with a budget, not a later optimization. We optimized time-to-first-audio over total response quality: stream tokens out of the LLM, start TTS on the first sentence chunk instead of waiting for the full completion, and tighten endpointing so we are not sitting on a voice-activity timeout after the user clearly finished. A fast, slightly-less-polished turn beat a slow, perfect one in every cohort we measured. That was counterintuitive for the engineers, me included. We wanted the better completion. Users wanted the one that arrived before the silence got awkward.
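The "start TTS on the first sentence chunk" step can be sketched as a generator that buffers streamed tokens and yields each sentence as soon as it is complete. This is a minimal sketch under assumptions: `sentence_chunks` is a hypothetical helper, the token stream is any iterable of strings, and the sentence boundary rule (terminal punctuation followed by whitespace) is a deliberate simplification.

```python
import re
from typing import Iterable, Iterator

# Simplified boundary rule: ., ! or ? followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # One token may complete more than one sentence, so keep scanning.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            cut = match.end()
            sentence = buffer[:cut].strip()
            if sentence:
                yield sentence
            buffer = buffer[cut:]
    # Flush whatever is left when the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    stream = ["Hi", " there", ".", " How", " are", " you", "?", " Good"]
    for chunk in sentence_chunks(stream):
        # In a real pipeline each chunk would be handed to TTS here.
        print(chunk)
```

Each yielded chunk can go to the TTS layer while the model is still generating the rest of the reply. The boundary rule waits for whitespace after the punctuation, so it does not split "3.5", but it will split after abbreviations such as "Dr." and would need a smarter rule in production.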

The drop-off was fear, not friction
The most instructive metric we have: about one in four active users opens the AI Tutor tab on any given day, but a meaningful share of them never starts a session. They load the screen and leave without speaking.

We initially modeled that as a funnel problem and threw funnel fixes at it. Wrong frame. The drop-off between "tab opened" and "call started" is not friction. It is fear. Speaking a foreign language out loud leaves you exposed in a way reading and writing do not. Text lets you edit; reading lets you stay silent; speech forces real-time output with no undo, and most people avoid it.

That reframed the design work. The lever was not another CTA; it was lowering the activation cost of the first utterance: seed the session with a concrete low-stakes prompt instead of an open-ended "what do you want to talk about," make retries free and obvious, and signal clearly that the audio is private. We were optimizing for nerve, not capability.

Correction is a policy problem
The naive implementation corrects every error the model detects. We built it. It was unusable. Continuous correction turns a conversation into an exam, which is the exact stressor our users were already avoiding.

So error handling became a policy, not a default. When do you interrupt, and when do you suppress a correction to keep the learner talking? The right behavior is conditional on proficiency level and session goal. A beginner ordering coffee needs the completion to succeed; an interview-prep user wants every mistake flagged. The transferable lesson for any feedback system: correction recall is not the objective. We got better outcomes by lowering correction frequency and tuning timing than by surfacing every detected error.
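Treating correction as a policy means the decision can live in one small, testable function instead of being spread across prompts. The sketch below is an assumption-heavy illustration: the `Level`, `Goal`, and `Action` enums, the `blocks_meaning` flag, and the per-session cap of three are all hypothetical, chosen only to show the shape of such a rule.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    BEGINNER = 1
    INTERMEDIATE = 2
    ADVANCED = 3


class Goal(Enum):
    CASUAL_CONVERSATION = "casual"
    INTERVIEW_PREP = "interview_prep"


class Action(Enum):
    SUPPRESS = "suppress"        # say nothing, keep the learner talking
    DEFER = "defer"              # save it for an end-of-session recap
    CORRECT_NOW = "correct_now"  # flag the mistake in the next turn


@dataclass
class DetectedError:
    blocks_meaning: bool  # hypothetical flag: would a listener misunderstand?


def decide(error: DetectedError, level: Level, goal: Goal,
           corrections_this_session: int, max_corrections: int = 3) -> Action:
    """Pick an action for one detected error (illustrative rules only)."""
    # Interview-prep learners asked for every mistake to be flagged.
    if goal is Goal.INTERVIEW_PREP:
        return Action.CORRECT_NOW
    # Minor slips: ignore for beginners, recap later for everyone else.
    if not error.blocks_meaning:
        return Action.SUPPRESS if level is Level.BEGINNER else Action.DEFER
    # Cap in-conversation corrections so the session does not become an exam.
    if corrections_this_session >= max_corrections:
        return Action.DEFER
    return Action.CORRECT_NOW


if __name__ == "__main__":
    slip = DetectedError(blocks_meaning=False)
    print(decide(slip, Level.BEGINNER, Goal.CASUAL_CONVERSATION, 0))
    print(decide(slip, Level.BEGINNER, Goal.INTERVIEW_PREP, 0))
```

Keeping the rule in code makes the cap and the timing tunable per cohort without touching the detection step.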

Instrument retention, not adoption
Like most teams, we first optimized adoption: installs, sign-ups, sessions started. Those metrics are easy to move and weakly correlated with whether the product works.

The signals that actually predicted value were unglamorous.

Completion: learners average about 2.5 lessons per sitting, and once a call starts, at least half finish the full lesson (roughly 50 to 57 percent in early 2026).

Unprompted return: in a consumer app there is no teacher, grade, or assignment forcing a second session, so repeat usage is a clean signal that the product earned it.

We promoted completion and cohort retention to the top of our dashboards and demoted raw sign-ups. I would reprioritize the instrumentation that way earlier next time.
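Both signals are cheap to compute from a session log. The sketch below assumes a hypothetical `Session` record with a user id, a calendar day, and a completion flag. `return_rate` is a simplified stand-in for cohort retention: it counts users who came back on at least one other day.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Session:
    """One started call (hypothetical schema)."""
    user_id: str
    day: date
    completed: bool  # learner finished the full lesson


def completion_rate(sessions: list[Session]) -> float:
    """Share of started sessions that were finished."""
    if not sessions:
        return 0.0
    return sum(s.completed for s in sessions) / len(sessions)


def return_rate(sessions: list[Session]) -> float:
    """Share of users with sessions on at least two distinct days."""
    days_by_user: dict[str, set[date]] = {}
    for s in sessions:
        days_by_user.setdefault(s.user_id, set()).add(s.day)
    if not days_by_user:
        return 0.0
    returned = sum(1 for days in days_by_user.values() if len(days) >= 2)
    return returned / len(days_by_user)


if __name__ == "__main__":
    log = [
        Session("a", date(2026, 1, 5), True),
        Session("a", date(2026, 1, 6), False),
        Session("b", date(2026, 1, 5), True),
        Session("c", date(2026, 1, 5), False),
    ]
    print(completion_rate(log))
    print(return_rate(log))
```

A real retention dashboard would bucket users by first-session week and track each cohort over time; the point here is only that neither metric depends on installs or sign-ups.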

Build for the boring sessions
The demo build is impressive: wide range, multilingual switching, jokes. The production build has to hold up at 7 a.m. before work, at 11 p.m. after the kids are asleep, for five minutes on a bus. Those targets pull against each other. The demo rewards range; the daily habit rewards low p95 latency, fast cold starts, and minimal friction. We optimized for the second one.

That is also why we never modeled this as teacher replacement. The tutor is good at the repetitive, low-stakes reps, which frees a human to do what the system cannot: read the room, push someone through a confidence wall, diagnose why a learner keeps churning. The design question was never "can the model replace the teacher." It was "can it get a learner to the next human conversation less afraid to speak."

The model is not your moat; the loop is. Treat latency as a feeling with a budget. Profile where users hesitate and design for that, because the hardest constraint is usually emotional, not technical. Correct less. Instrument return, not adoption. Any time you put an AI in front of a nervous human and ask them to perform in real time, the same constraints apply.
