CHIVOX AI

Posted on Jul 1

I Built an AI Mandarin Voice Coach with Gemini Live API — Here's What I Learned

#voiceai #mcp #languagelearning #asr

I use Duolingo every day for Japanese. I genuinely like it. But three things drive me crazy.

The progression is one-size-fits-all. The learning sequence is the same for everyone. Even if you've mastered certain vocabulary, you can't skip it. And if something is genuinely hard for you, you don't get significantly more practice on it. The "adaptive" difficulty doesn't feel adaptive.

The speech scoring is unreliable. I'll pronounce a word the same way twice: one time it passes me, the next time it fails me. When you're genuinely trying to improve your pronunciation, inconsistent feedback is worse than no feedback. It's demoralizing.

The conversation practice is too casual (why is it always a phone call?), and it reminds me of talking to Siri. This is the big one. Most people learn a language because they want to talk to other people. But every app I've tried focuses on vocabulary, grammar, and reading: everything except actual conversation. I couldn't find a single app that puts you in a realistic scenario and says: "Order tea from this person. In Mandarin. Right now. With specific tasks."

That's how I came up with the idea of task-driven speaking scenarios. And that's what ChiChat is.

What ChiChat Does

ChiChat is an open-source AI Mandarin voice coach. You have real voice conversations with AI NPCs in everyday Chinese scenarios — a tea house (茶馆), a hotel (酒店), a wet market (菜市场). Each NPC speaks only Mandarin and adapts to your level.

Here's the flow:

User speaks Mandarin → Mic → Gemini Live (STT + LLM + TTS) → NPC responds
↓
Slot tracking + Speech correction
↓
[End conversation] → Pronunciation scoring
↓
LLM evaluation (tier, feedback, drills)
↓
Coaching report + targeted pronunciation drills

A typical session: you walk into a virtual tea house. The NPC, 小王 (a tea server), greets you. Your task might be: order 龙井茶 (Longjing tea) in a 小壶 (small pot), pick 花生 (peanuts) as a snack, sit by the window. You speak. He responds. It's a real conversation — you can interrupt him mid-sentence, change your mind, ask questions.

After the conversation ends, you get three things:

- Word-level pronunciation scores: each word you said is scored 0-100 by a professional assessment API (not ASR confidence)
- AI coaching feedback: a tier rating, dimension-by-dimension analysis, and expression upgrade suggestions
- Targeted drills: pronunciation practice on your weakest words and phrases

This is a six-phase learning loop: Select a scenario → Briefing (see your task) → Dialogue (talk to the NPC) → Review (transcript with scores) → Coaching (AI feedback) → Drill (practice weak spots).
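As a rough illustration, the six phases can be modelled as an ordered list with a single transition function. This is a sketch, not ChiChat's actual code: the phase identifiers and the `nextPhase` helper are names made up here, and wrapping from the drill back to scenario selection is an assumption based on the word "loop".

```typescript
// The six phases named above, in order. Identifiers are illustrative.
const PHASES = ["select", "briefing", "dialogue", "review", "coaching", "drill"] as const;
type Phase = (typeof PHASES)[number];

// Advance to the next phase. After the drill we wrap around to scenario
// selection (an assumption: the article only calls it a "loop").
function nextPhase(current: Phase): Phase {
  const index = PHASES.indexOf(current);
  return PHASES[(index + 1) % PHASES.length];
}
```

Keeping the order in one array means the UI can derive both the progress indicator and the transitions from the same source.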

Three Problems I Set Out to Solve

Adaptive difficulty that actually adapts

ChiChat has four difficulty tiers. At Tier 1, the NPC speaks very slowly and offers binary choices: "你要绿茶还是红茶?" (Green tea or black tea?). At Tier 4, the NPC chats naturally about tea origins, brewing methods, and uses colloquial expressions.

The trick is simple: the difficulty knob is just a system instruction modifier. Same model, same voice — different NPC persona.

Here's what Tier 1 looks like for the tea house:

Speak very slowly and clearly. Offer multiple choice, like "Do you want green tea or black tea?" Accept single-word answers. If they don't know what to say, give them options.

And Tier 4:

Completely natural conversation. Chat about tea origins, brewing methods, use colloquial expressions. Expect natural, fluent dialogue.

Promotion requires 2 consecutive evaluations above your current tier — not just one lucky run. And the tier assessment is independent: your history doesn't anchor your score. A Tier 1 learner who suddenly performs at Tier 3 gets rated Tier 3.
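The promotion rule can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the project's real progression code: `resolveTier` is a hypothetical name, promotion is assumed to move one tier at a time, and demotion is left out because the article does not describe it.

```typescript
type Tier = 1 | 2 | 3 | 4;

// `assessed` holds the tier the evaluator assigned in each past session,
// oldest first. Each assessment is independent of the learner's current tier.
function resolveTier(current: Tier, assessed: Tier[]): Tier {
  const lastTwo = assessed.slice(-2);
  const earnedPromotion =
    lastTwo.length === 2 && lastTwo.every((tier) => tier > current);
  if (!earnedPromotion || current === 4) return current;
  // Assumption: move up a single tier per promotion.
  return (current + 1) as Tier;
}
```

A Tier 1 learner with one Tier 3 session stays at Tier 1; a second consecutive session above Tier 1 triggers the promotion.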

Pronunciation scoring you can trust

ASR (speech recognition) tells you what you said. Pronunciation scoring tells you how well you said it. These are fundamentally different things. ASR can perfectly transcribe your speech while you butcher every tone.

ChiChat uses a professional pronunciation assessment API that I came across on X. I'm really happy there are smart people working on such a niche technology! The service is exposed over MCP (Model Context Protocol) and supports both English and Mandarin Chinese. Each word gets a 0-100 score. If you say "茉莉花茶" (jasmine tea) and score 45, that means your tones need work, and the score will be consistent the next time you say it the same way.
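Because every word carries a numeric score, choosing what to drill becomes a simple filter and sort. The sketch below is illustrative only: the `WordScore` shape, the threshold of 60, and the limit of 3 are assumptions, not values taken from ChiChat or from the scoring API.

```typescript
// Assumed shape of one scored word; the real API response may differ.
interface WordScore {
  word: string;
  score: number; // 0-100
}

// Return the lowest-scoring words below `threshold`, weakest first.
function pickDrillWords(scores: WordScore[], threshold = 60, limit = 3): string[] {
  return scores
    .filter((entry) => entry.score < threshold) // filter() copies, so sort() is safe
    .sort((a, b) => a.score - b.score)
    .slice(0, limit)
    .map((entry) => entry.word);
}
```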

Here's the core of the MCP integration — it's a standard JSON-RPC call:

```typescript
// One JSON-RPC 2.0 request to the MCP server's "tools/call" method.
const res = await fetch(MCP_URL, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${apiKey}`,
    "Content-Type": "application/json",
    "Mcp-Session-Id": sessionId,
  },
  body: JSON.stringify({
    jsonrpc: "2.0",
    id: Date.now(),
    method: "tools/call",
    params: {
      name: "cn_sentence_eval",
      arguments: { ref_text: refText, audio_base64: audioBase64, rank: 100 },
    },
  }),
});
```

The MCP session has a ~3-minute timeout. We cache the session ID and retry with a fresh session if it goes stale — transparent to the user.
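The cache-and-retry behaviour might look roughly like the wrapper below. It is a sketch under assumptions: `callWithSession`, `openSession`, `call`, and `isStale` are hypothetical names, and how a stale session is actually detected (status code, error body) depends on the server and is not covered in this article.

```typescript
// Holder for the cached MCP session ID (null means "no session yet").
interface SessionCache {
  id: string | null;
}

async function callWithSession<T>(
  cache: SessionCache,
  openSession: () => Promise<string>,      // hypothetical: opens a new MCP session
  call: (sessionId: string) => Promise<T>, // hypothetical: one tools/call request
  isStale: (err: unknown) => boolean,      // hypothetical: recognizes an expired session
): Promise<T> {
  let sessionId = cache.id ?? (await openSession());
  cache.id = sessionId;
  try {
    return await call(sessionId);
  } catch (err) {
    if (!isStale(err)) throw err;
    // The cached session expired: open a fresh one and retry exactly once.
    sessionId = await openSession();
    cache.id = sessionId;
    return call(sessionId);
  }
}
```

Retrying only once keeps a genuinely broken server from turning into an endless loop.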

Conversation-first, not vocabulary-first

Most language apps start with flashcards. ChiChat starts with a task.

Each scenario has slots — the key decisions you need to make during the conversation. For the tea house: tea type, pot size, snack, seating. For the hotel: room type, number of nights, breakfast, payment method. The app randomly generates a task from these slots, and your goal is to communicate your choices to the NPC through natural conversation.
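Random task generation from slots can be sketched in a few lines. The slot values below are the tea house options used elsewhere in this article; `generateTask` and the injectable `random` parameter are illustrative assumptions, not the project's actual implementation.

```typescript
type SlotOptions = Record<string, readonly string[]>;
type Task = Record<string, string>;

const TEA_HOUSE_SLOTS: SlotOptions = {
  tea_type: ["绿茶", "红茶", "乌龙茶", "茉莉花茶", "普洱茶"],
  size: ["小壶", "中壶", "大壶"],
  snack: ["花生", "瓜子", "绿豆糕", "不要"],
  seating: ["窗边", "包间", "大厅"],
};

// Pick one option per slot. `random` is injectable so tests can be deterministic.
function generateTask(slots: SlotOptions, random: () => number = Math.random): Task {
  const task: Task = {};
  for (const [slot, options] of Object.entries(slots)) {
    task[slot] = options[Math.floor(random() * options.length)];
  }
  return task;
}
```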

You don't memorize "大床房 means king-size room" on a flashcard. You check into a hotel and figure it out when the NPC asks what you want. The NPC tracks your slot progress with Gemini's function calling:

```typescript
// Function declaration for the tea house scenario's slot tracker.
{
  name: "update_slots",
  description: "Call whenever the customer specifies any relevant info.",
  parameters: {
    type: "OBJECT",
    properties: {
      tea_type: { type: "STRING", enum: ["绿茶", "红茶", "乌龙茶", "茉莉花茶", "普洱茶"] },
      size: { type: "STRING", enum: ["小壶", "中壶", "大壶"] },
      snack: { type: "STRING", enum: ["花生", "瓜子", "绿豆糕", "不要"] },
      seating: { type: "STRING", enum: ["窗边", "包间", "大厅"] },
    },
  },
}
```

This is the non-obvious trick: function calling works inside a live voice session. The model calls update_slots mid-conversation without interrupting the audio stream. It also calls correct_user_speech after each turn to fix ASR misrecognitions (like "绿茶" being misheard as "旅差") — all happening in real time, inside the same WebSocket connection.
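On the client side, each `update_slots` call only needs to be merged into a running state and compared with the task. The helpers below are a hedged sketch: `applySlotUpdate` and `remainingSlots` are hypothetical names, and the real app may track progress differently.

```typescript
type Slots = Record<string, string>;

// Merge the arguments of one update_slots call into the running state.
// Later calls overwrite earlier ones, so the learner can change their mind.
function applySlotUpdate(state: Slots, args: Slots): Slots {
  return { ...state, ...args };
}

// Slots from the task that the learner has not yet communicated correctly.
function remainingSlots(task: Slots, state: Slots): string[] {
  return Object.keys(task).filter((slot) => state[slot] !== task[slot]);
}
```

When `remainingSlots` returns an empty array, the task is complete and the conversation can be wrapped up.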

Under the Hood: Gemini Live API

The core of ChiChat is Google's Gemini Live API. This is genuinely different from the typical voice AI pipeline of "record → Whisper → GPT → TTS → play." Gemini Live gives you a single WebSocket connection with bidirectional audio streaming. STT, LLM inference, and TTS all happen in one round trip. The latency is sub-1-second.

Here's the session setup:

```typescript
import { Modality } from "@google/genai";

// `ai` is a GoogleGenAI client; the callbacks argument (onmessage, etc.) is omitted here.
const session = await ai.live.connect({
  model: "gemini-3.1-flash-live-preview",
  config: {
    responseModalities: [Modality.AUDIO],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: scenario.voiceName },
      },
    },
    systemInstruction: buildSystemInstruction(scenario, task, tier),
    tools: [buildSlotTool(scenario), buildCorrectionTool()],
    inputAudioTranscription: {},
    outputAudioTranscription: {},
  },
});
```

Three things to notice:

- `responseModalities: [Modality.AUDIO]` makes the model respond with audio, not text. But we also get text transcripts for both input and output via `inputAudioTranscription` and `outputAudioTranscription`. This gives us the transcript for the review phase without running a separate ASR step.
- `tools` enables function calling inside live sessions. The model can call `update_slots` and `correct_user_speech` mid-conversation. This is how we do structured data extraction on top of a voice conversation without interrupting the flow.
- `voiceName` gives each scenario a different voice: Puck (male) for the tea house server, Leda (young female) for the hotel receptionist, Kore (mature female) for the market vendor 张阿姨.
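Collecting the review transcript from those transcription events could look something like this. It is a sketch with assumptions: `LiveMessage` re-declares only the few server message fields used here (`serverContent.inputTranscription`, `outputTranscription`, `turnComplete`) so the example runs standalone, and `TranscriptBuilder` is a hypothetical helper rather than ChiChat's actual turn-capture code.

```typescript
// Minimal subset of a Live API server message (assumed field names).
interface LiveMessage {
  serverContent?: {
    turnComplete?: boolean;
    inputTranscription?: { text?: string };
    outputTranscription?: { text?: string };
  };
}

interface TranscriptLine {
  speaker: "user" | "npc";
  text: string;
}

class TranscriptBuilder {
  private userText = "";
  private npcText = "";
  readonly lines: TranscriptLine[] = [];

  // Transcription arrives in fragments; buffer them until the turn completes.
  handle(message: LiveMessage): void {
    const content = message.serverContent;
    if (!content) return;
    this.userText += content.inputTranscription?.text ?? "";
    this.npcText += content.outputTranscription?.text ?? "";
    if (content.turnComplete) this.flush();
  }

  private flush(): void {
    if (this.userText) this.lines.push({ speaker: "user", text: this.userText });
    if (this.npcText) this.lines.push({ speaker: "npc", text: this.npcText });
    this.userText = "";
    this.npcText = "";
  }
}
```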

Barge-in is built in. The user can interrupt the NPC mid-sentence — just like a real conversation. No special handling needed; the Gemini Live API's VAD (voice activity detection) handles it natively.

Progressive JSON Streaming

After the conversation ends, we send the transcript to an LLM for coaching evaluation. The response is a large JSON object — tier assessment, dimension feedback, pronunciation details, expression suggestions, targeted drills. This takes 3-20 seconds depending on the model.

Our first approach: stream the raw JSON to the user as it arrives. Terrible idea. Users saw partial JSON fragments scrolling past and had no idea what was happening.

Our second approach: show a spinner until the full response arrives. Better, but the wait felt long.

Final solution: progressive JSON parsing. We stream the LLM response via SSE, accumulate the chunks, and on every chunk, try to close the incomplete JSON with various bracket combinations. If any combination parses, we render the structured coaching UI with whatever data is available.

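```typescript
// T is the shape of the complete evaluation object.
function tryParsePartialJson<T extends object>(text: string): Partial<T> | null {
  // Strip a leading/trailing Markdown code fence if the LLM wrapped its JSON in one.
  const cleaned = text
    .replace(/^`{3}(?:json)?\s*\n?/i, "")
    .replace(/\n?`{3}\s*$/i, "")
    .trim();
  if (!cleaned.startsWith("{")) return null;
  // "" comes first so an already-complete object parses as-is.
  const closers = ["", "}", "]}", "]}}", "]}]}}", '"}}}', '"}}', '"}'];
  for (const closer of closers) {
    try {
      const obj = JSON.parse(cleaned + closer);
      if (obj && typeof obj === "object") return obj;
    } catch { /* try next */ }
  }
  return null;
}
```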

It's hacky. It works. The user sees the coaching sections appear one by one as the LLM generates them — tier rating first, then dimension feedback, then suggestions, then drills. The UI feels alive instead of frozen.
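If the fixed list of closers ever stops being enough, one alternative is to compute the closing suffix by scanning the text and tracking open strings, arrays, and objects. To be clear, this is a different technique from the one ChiChat uses, sketched here under my own naming (`closingSuffix`, `parseSnapshot`). It still returns `null` when the text is cut somewhere that cannot be closed (for example right after a colon), in which case the caller keeps the previous snapshot.

```typescript
// Return the suffix that closes every string, array, and object still open in `text`.
function closingSuffix(text: string): string {
  const stack: string[] = [];
  let inString = false;
  let escaped = false;
  for (const ch of text) {
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
    } else if (ch === '"') inString = true;
    else if (ch === "{") stack.push("}");
    else if (ch === "[") stack.push("]");
    else if (ch === "}" || ch === "]") stack.pop();
  }
  // Close the innermost structure first.
  return (inString ? '"' : "") + stack.reverse().join("");
}

// Parse a partial JSON document, or return null if it cannot be closed yet.
function parseSnapshot(text: string): Record<string, unknown> | null {
  try {
    return JSON.parse(text + closingSuffix(text));
  } catch {
    return null;
  }
}
```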

The Fork Story: English → Mandarin

ChiChat started as an internal English speaking coach I built over about 3 weeks (obviously, that's what I'm good at). Coffee shop, hotel, farmer's market: English NPCs for learners of English. When it was working well, I wanted to flip it and build a Mandarin coach for people learning Chinese.

The fork took 7 commits. Here's what changed and what didn't:

| Changed | Stayed the same |
| --- | --- |
| 3 scenarios (tea house, hotel, wet market) | Audio pipeline (PCM capture, ring buffer, playback) |
| NPC system instructions (all Mandarin) | Turn-capture state machine |
| Evaluation rubrics (tones, measure words, particles) | Progression system |
| Pronunciation API (unified `cn_sentence_eval`) | Drill UI component |
| Default eval LLM (Gemini Flash Lite, free tier) | SSE streaming evaluation |
| API keys: 4 required → 2 required | Hook architecture |

A few things broke in interesting ways:

- Wrong voice gender. The male tea house server had a female voice. Voice assignment is one line of config, but you don't notice until you actually listen.
- `cn_word_eval` doesn't exist. The English version used separate API calls for word-level and sentence-level scoring. For Chinese, only `cn_sentence_eval` exists, and it handles both. It took a while to figure out why single-word drills were failing silently.
- English leaking into the Chinese UI. The hotel scenario still showed "without breakfast" instead of "不含早餐" deep in the slot values.
- 4 API keys → 2. The English version required a separate LLM endpoint for evaluation. For the open-source Mandarin version, I switched the default to Gemini 3.1 Flash Lite: it's on the free tier (500 req/day) and uses the same Google API key that powers the voice dialogue. One key for everything.

The point: if your architecture separates content (scenarios, prompts, rubrics) from mechanics (audio pipeline, state machine, evaluation flow), porting to a new language is mostly a content job.
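One way to picture that separation: a scenario is plain data, and the function-calling schema is derived from it. The sketch below is an illustration, not the repository's code. The `Scenario` shape is an assumption, and this `buildSlotTool` is a guess at what a helper of that name might do; the real one is not shown in this article.

```typescript
// Content: a scenario is just data (assumed shape).
interface Scenario {
  id: string;
  slots: Record<string, readonly string[]>;
}

const teaHouse: Scenario = {
  id: "tea_house",
  slots: {
    tea_type: ["绿茶", "红茶", "乌龙茶", "茉莉花茶", "普洱茶"],
    size: ["小壶", "中壶", "大壶"],
    snack: ["花生", "瓜子", "绿豆糕", "不要"],
    seating: ["窗边", "包间", "大厅"],
  },
};

// Mechanics: derive the update_slots declaration from whatever slots the scenario defines.
function buildSlotTool(scenario: Scenario) {
  const properties: Record<string, { type: "STRING"; enum: string[] }> = {};
  for (const [name, values] of Object.entries(scenario.slots)) {
    properties[name] = { type: "STRING", enum: [...values] };
  }
  return {
    name: "update_slots",
    description: "Call whenever the customer specifies any relevant info.",
    parameters: { type: "OBJECT" as const, properties },
  };
}
```

With this split, adding a scenario in another language means writing a new data object; the function above stays untouched.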

Try It Yourself

ChiChat is MIT licensed. You need two API keys:

- Google AI Studio API key: powers both Gemini Live dialogue and conversation evaluation (get one here)
- Pronunciation scoring API key: powers word-level pronunciation assessment (get one here; they offer 600 free trial calls, which was enough for my development anyway)

```bash
git clone https://github.com/lindun1979/ChiChat.git
cd ChiChat
cp .env.example .env.local
# Add your API keys to .env.local
npm install
npm run dev
```

Open `http://localhost:3000` in Chrome (mic access required).

What's Next

I have a lot more ideas. More scenarios — a restaurant, a taxi, a doctor's visit. More languages — the architecture supports it. Better audio (migrating from the deprecated `ScriptProcessorNode` to `AudioWorklet`). Mobile optimization.

I believe this journey will be fruitful — not just for the project, but for deepening my understanding of both technology and language learning. Every time I test a new scenario, I learn something about what makes conversation practice effective. Every time I debug the audio pipeline, I learn something about real-time voice systems.

If you're interested in voice AI, language learning, or both — [join us on GitHub](https://github.com/lindun1979/ChiChat). Open an issue, suggest a scenario, or just try it and tell me what you think.

Let's Discuss

I'm genuinely curious about a few things:

- Real-time vs. post-conversation feedback. Right now, pronunciation scoring happens after the conversation ends. Should it happen during? Real-time correction would help, but it might break the conversational flow.
- Voice quality for teaching. The built-in Gemini voices are multilingual, but they're not optimized for pedagogical clarity. Has anyone experimented with fine-tuning TTS voices specifically for language teaching?
- What scenario would you add first? I went with tea house, hotel, and wet market because they're everyday situations with clear task structures. What's missing?

Drop a comment. I'd love to hear from anyone building in this space.

---

ChiChat is open source under the MIT license. GitHub: [github.com/lindun1979/ChiChat](https://github.com/lindun1979/ChiChat)

Tech stack: Next.js 16, React 19, TypeScript, Gemini Live API, Tailwind CSS 4, Vitest
