shinji shimizu

Posted on May 22 • Originally published at kotonia.ai

Building a Sarcastic AI English Tutor with Persona-as-Code and Gemini Audio Input for Pronunciation Correction

#ai #webdev #typescript #rust

I built a niche AI English conversation app called Mesugaki AI English on Kotonia. "Mesugaki" (メスガキ) is a tsundere-style bratty persona popular in Japanese subculture — imagine a character who constantly mocks you but secretly has your back. At first glance this looks like a one-off gag product, but under the hood it's a two-layer design: persona managed as code + Gemini audio input for actual pronunciation correction. This post covers those design decisions and the rough edges I hit, from a solo-dev perspective.

Why a Sarcastic AI English Tutor?

Strategy first. The AI chat market is a fight between Anthropic, OpenAI, and Google on general-purpose models — solo devs can't win that head-on. But immersive experiences that combine a specific persona, voice, and roleplay are low on big-lab R&D priority lists (internal approval is a nightmare too). That's the gap Kotonia as a whole is targeting.

Three reasons I picked this specific persona for English learning:

Zero search competition. No SaaS is fighting for "mesugaki English conversation." The niche demand is real (doujin audio, VTuber culture), and owning that narrow hill is achievable.
Memorable = shareable. "The app where a snarky AI roasts your English" is far more likely to get shared on social media than a generic "AI English conversation app." It's also differentiation the big players are in no position to copy.
Same product underneath. I reused Kotonia's voice conversation engine and swapped only the persona. Almost no new code.

The landing page is at /use/mesugaki-english/. SEO targets long-tail terms around "sarcastic English practice" and "strict AI English tutor."

Persona Design: Bratty × Tsundere Hybrid

I initially implemented a purely sarcastic persona. Testing it myself, I burned out within five turns.

Relentless mockery is cognitively exhausting. Real human tutors who stay harsh 100% of the time don't retain students. Learners need small wins and occasional warmth to keep going.

So I switched to a sarcastic × tsundere hybrid. The skeleton looks like this:

On a mistake → light jab + immediate correction ("Pfft, wrong. It's 'I went.'")
On a correct answer → reluctant praise ("Hmm… not bad, I guess. Not that I'm complimenting you.")
When stuck → drop the attitude and actually help ("…Was that too hard? Fine, I'll give you a hint.")
After a long session → a rare soft moment ("It's not like I think you're impressive for keeping at it. …Okay, maybe a little.")

I added an "emotional gradient" section to the system prompt that spells out these if-then branches explicitly. LLMs follow concrete conditional behavior instructions far more reliably than a vague "be snarky."

Another key lever: frequency limiting. Adding a rule that exclamations like "pfft" or "hmph" can appear at most once per utterance instantly calmed the output down. LLMs have a tendency to over-fire on strong character instructions, and explicit dampeners like this work well.
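The prompt rule can also be checked from code when reviewing sampled outputs. Here is a minimal sketch of such a check; the post doesn't describe a guard like this, so the interjection list, function names, and default cap are all illustrative assumptions.

```typescript
// Hypothetical post-hoc check; not part of Kotonia's code as described in the post.
const INTERJECTIONS: readonly string[] = ["ぷwww", "ふんっ"];

/** Counts non-overlapping occurrences of every listed interjection in one utterance. */
function countInterjections(
  utterance: string,
  interjections: readonly string[] = INTERJECTIONS,
): number {
  let total = 0;
  for (const word of interjections) {
    if (word.length === 0) continue; // an empty pattern would never advance
    let from = utterance.indexOf(word);
    while (from !== -1) {
      total += 1;
      from = utterance.indexOf(word, from + word.length);
    }
  }
  return total;
}

/** True when an utterance uses more interjections than the cap allows (assumed default: 1). */
function violatesInterjectionCap(utterance: string, cap = 1): boolean {
  return countInterjections(utterance) > cap;
}
```

Running this over a batch of logged replies before and after a prompt change gives a rough number for how often the model still over-fires.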

Managing Personas as Code

The persona lives in src/data/personas/mesugaki-english.ts as a TypeScript constant. Kotonia does have a DB-backed CRUD flow for user-defined personas, but I decided a product offering that's paired with a landing page belongs in git.

Reasoning:

Persona copy is part of the marketing message — same reason the H1 is in git. The system prompt should go through PR review.
Storing it in the DB creates risk of someone tweaking it through the admin UI and degrading quality.
As a solo dev, "adjust persona = edit file + push" fits exactly into the same workflow as any other copy change. One channel for everything.

Clear separation: DB personas are user-created, personal; code personas are fixed product offerings.
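For illustration, a code persona might look roughly like the sketch below. Only stt.model, stt.language, and geminiAudioInput are named in this post; every other field name and the PersonaConfig type are assumptions, not Kotonia's actual schema.

```typescript
// Sketch only: field names other than stt.model, stt.language and
// geminiAudioInput are hypothetical.
interface PersonaConfig {
  slug: string;
  language: string; // TTS language
  stt: { model: string; language: string };
  geminiAudioInput: boolean;
  systemPrompt: string;
}

const mesugakiEnglish: PersonaConfig = {
  slug: "mesugaki-english",
  language: "ja",
  stt: { model: "qwen3_asr", language: "multi" },
  geminiAudioInput: true,
  systemPrompt: "...", // full prompt elided here
};
```

Because the persona is a typed constant, a misspelled field or a missing prompt fails the build instead of surfacing in production.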

The Wall: ASR Alone Can't Correct Pronunciation

Once the persona was working, ASR became the next bottleneck fast.

I started with Whisper (small). Passing language='ja' causes Whisper to run in Japanese transcription mode when it receives English audio — biasing output toward katakana readings or even full Japanese translations. "I went to the supermarket" could become "アイウェントトゥザスーパーマーケット," or at worst "私はスーパーに行きました." With output like that the AI can't judge English mistakes.

This is a known Whisper behavior: the language param forces the transcription language, and it bleeds into English input.

Switching to Qwen3-ASR Multi-lang

The fix was adding a separate language setting for the STT layer:

// Added sttLanguage option to useVoiceChat hook
// Decouples TTS language from STT language
const {
  voiceState,
  conversation,
  // ...
} = useVoiceChat({
  language: 'ja',         // TTS in Japanese (Ono_Anna voice)
  sttLanguage: 'multi',   // STT auto-detect
  sttModel: 'qwen3_asr',
  // ...
});

The persona config specifies stt.model: 'qwen3_asr' + stt.language: 'multi'. Qwen3-ASR-1.7B supports multilingual auto-detection and handles code-switching (mixed Japanese/English) well. Whisper's language-forcing bias is gone entirely.
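One plausible way to keep existing callers working is to let the STT language fall back to the TTS language when no override is given. The sketch below assumes that fallback; the post doesn't say how useVoiceChat resolves it internally, and resolveSttLanguage is a hypothetical helper.

```typescript
// Hypothetical helper; the fallback rule is an assumption.
interface VoiceLanguageOptions {
  language: string;     // TTS language, e.g. "ja"
  sttLanguage?: string; // optional STT override, e.g. "multi"
}

/** Uses the explicit STT language when present, otherwise reuses the TTS language. */
function resolveSttLanguage(opts: VoiceLanguageOptions): string {
  return opts.sttLanguage ?? opts.language;
}
```

With this rule, personas that never set sttLanguage keep their previous behavior, while the English tutor opts into "multi".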

But Transcription-Based Correction Has a Ceiling

Fixing ASR still left a problem.

If the transcript comes back as "I want an apple":

Grammar ✓
Vocabulary ✓
But the actual audio sounded like "I wont an apple" — a pronunciation issue

The AI sees a correct string and has nothing to call out. For an English learning product, that's fatal. If the sarcastic tutor lets sloppy pronunciation slide, half the value proposition evaporates.

Solution: Send Raw Audio to Gemini Alongside the Transcript

Gemini is a multimodal model that accepts text, images, and audio. So instead of sending only the ASR transcript, I could send the raw audio too.

Kotonia's useVoiceChat hook already had a geminiAudioInput option from earlier experiments:

if (geminiAudioInput && model.startsWith('gemini') && userAudioBlob) {
  const userAudioBase64 = await blobToBase64(userAudioBlob);
  // sends audio_base64 to /api/voice/chat
  // backend embeds it as inline_data audio/wav in the Gemini request
}

The Rust backend (voice_chat.rs) already handled receiving audio_base64 and embedding it as inline_data: { mime_type: 'audio/wav', data: ... }. Setting geminiAudioInput: true in the persona config wired everything together, a lucky leftover from an earlier iteration.
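The real embedding happens in Rust, but the shape of the user turn is easy to show in TypeScript. This is a sketch of a text part followed by an optional inline_data audio part; buildUserContent and the type names are hypothetical, and only the inline_data / mime_type / data layout comes from the description above.

```typescript
// Sketch of one user turn; not the actual voice_chat.rs logic.
type GeminiPart =
  | { text: string }
  | { inline_data: { mime_type: string; data: string } };

interface GeminiContent {
  role: "user" | "model";
  parts: GeminiPart[];
}

/** Builds a user turn from the ASR transcript, attaching the raw audio when available. */
function buildUserContent(transcript: string, audioBase64?: string): GeminiContent {
  const parts: GeminiPart[] = [{ text: transcript }];
  if (audioBase64) {
    parts.push({ inline_data: { mime_type: "audio/wav", data: audioBase64 } });
  }
  return { role: "user", parts };
}
```

Keeping the transcript alongside the audio means the model can compare what the ASR heard with what it hears itself.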

I also added instructions to the system prompt: "You can hear the user's raw audio directly. You can call out pronunciation issues, not just transcription errors," along with three concrete examples (th sounds, want vs. won't vowel distinction, stress patterns).

Results:

Even with a perfect transcript "I want an apple," the AI can now say "Your 'want' sounds like 'won't.'"
When the transcript garbles to something like "アイウェントトゥ," the AI is listening directly and can say "Were you trying to say 'I want to'?"
Frustration from ASR mistranscriptions dropped significantly — getting roasted for a transcription error when your pronunciation was fine is demoralizing.

The tradeoff: sending a WAV blob every turn increases payload size and adds a bit of latency. The experience improvement is so much larger that it's not a close call.
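To put a rough number on the payload cost, the sketch below estimates the base64 size of one utterance. The post doesn't state the recording format, so 16 kHz mono 16-bit PCM and a 44-byte WAV header are assumptions.

```typescript
// Back-of-the-envelope estimate; the audio format defaults are assumptions.
const WAV_HEADER_BYTES = 44; // canonical PCM WAV header size

/** Size in bytes of an uncompressed PCM WAV clip. */
function wavBytes(
  seconds: number,
  sampleRate = 16_000,
  channels = 1,
  bytesPerSample = 2,
): number {
  return WAV_HEADER_BYTES + Math.round(seconds * sampleRate * channels * bytesPerSample);
}

/** Length of the padded base64 encoding of byteCount bytes. */
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

const fiveSecondClip = base64Length(wavBytes(5)); // 213392 characters
```

Under those assumptions a five-second utterance adds about 213 kB to the request body, since base64 inflates the 160 kB of PCM data by a third.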

Rough Edges and Future Work

This isn't a polished implementation. Outstanding issues:

1. Gemini Instability

I'm using gemini-3.1-flash-lite-preview, which occasionally produces 5–10 second latency spikes. Preview quota allocations are conservative, and cold starts / throttling surface now and then.

Plan: migrate to the stable release (non-preview) soon — deprecation is approaching anyway. Claude Sonnet 4.6 and Haiku 4.5 are also candidates for more predictable latency.

2. False-Positive Content Filter

Gemini's safety filter occasionally over-triggers on sarcasm. Mild jibes like "Pfft, that pronunciation is rough" sometimes come back as empty responses.

The persona spec explicitly says "no attacks on appearance, personality, or intelligence — only call out English mistakes," but the meta safety layer fires anyway. This is an LLM provider issue; I'll watch behavior on the stable build. Running local LLMs (e.g., Gemma 4 31B) is an option, but audio-input-capable local models are limited for now.
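When debugging this, it helps to tell a filter hit apart from a plain empty reply. The sketch below classifies a generateContent-style response using the promptFeedback.blockReason and finishReason fields; classifyReply and the trimmed-down response type are hypothetical, and the post doesn't describe how the backend currently handles these cases.

```typescript
// Minimal response shape covering only the fields this sketch reads.
interface GeminiResponseLike {
  candidates?: {
    content?: { parts?: { text?: string }[] };
    finishReason?: string;
  }[];
  promptFeedback?: { blockReason?: string };
}

type ReplyOutcome =
  | { kind: "ok"; text: string }
  | { kind: "blocked"; reason: string }
  | { kind: "empty" };

/** Separates usable text, safety blocks, and unexplained empty replies. */
function classifyReply(res: GeminiResponseLike): ReplyOutcome {
  const blockReason = res.promptFeedback?.blockReason;
  if (blockReason) return { kind: "blocked", reason: blockReason };

  const candidate = res.candidates?.[0];
  const text = (candidate?.content?.parts ?? [])
    .map((part) => part.text ?? "")
    .join("")
    .trim();
  if (text.length > 0) return { kind: "ok", text };

  if (candidate?.finishReason === "SAFETY") return { kind: "blocked", reason: "SAFETY" };
  return { kind: "empty" };
}
```

Logging the outcome kind per turn would show how often the filter is the real cause, and a "blocked" result could trigger a retry with a milder fallback line.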

3. Latency Spikes May Be Context Cache TTL Expiry

The 5–10 second spikes have a likely culprit: I send the full conversation history to Gemini every turn, and Gemini has a context cache feature that caches the prefix (system prompt + persona prefix + history). When the cache is warm, only the new turn is processed.

The backend already has:

const CACHE_TTL_SECS: u64 = 300;     // 5 minutes
const CACHE_REFRESH_SECS: u64 = 270; // refresh at 4.5 minutes, 30 s before the TTL expires

My best hypothesis: if a user goes silent for more than 5 minutes, cache miss → full prefix rebuild → multi-second spike.

Future work:

Fire a background keep-alive ping during active conversations to extend cache lifetime
Increase the Gemini API cache TTL (the documented default is 1 hour when no TTL is set, well above the 5 minutes used here)
Explicitly evict the cache at conversation end (prevent memory leaks)
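The hypothesis comes down to a three-way decision on the cache's age. The sketch below restates it in TypeScript with the two backend constants; cacheAction is a hypothetical function and this is only my reading of the intended behavior, not the actual Rust logic.

```typescript
// Mirrors the backend constants; the decision function itself is an assumption.
const CACHE_TTL_SECS = 300;     // cache expires after 5 minutes
const CACHE_REFRESH_SECS = 270; // refresh window opens at 4.5 minutes

type CacheAction = "reuse" | "refresh" | "rebuild";

/** Decides what to do with the context cache given its age in seconds. */
function cacheAction(secondsSinceLastRefresh: number): CacheAction {
  if (secondsSinceLastRefresh >= CACHE_TTL_SECS) return "rebuild"; // expired: full prefix cost
  if (secondsSinceLastRefresh >= CACHE_REFRESH_SECS) return "refresh"; // extend before expiry
  return "reuse";
}
```

A keep-alive ping would exist to make sure some request lands in the 270–300 second window, so an active conversation never reaches "rebuild".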

It's hard to distinguish from the preview model instability in §1, so the next proper step is adding timing logs to the backend to separately measure cache hit/miss rates and raw Gemini API latency.
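As a sketch of what those logs could feed into, the function below splits per-turn Gemini latency by cache hit versus miss. The TurnTiming shape and summarizeTimings are hypothetical; the real measurements would come from the Rust backend.

```typescript
// Hypothetical per-turn record written by the backend's timing logs.
interface TurnTiming {
  cacheHit: boolean;
  geminiMs: number; // raw Gemini API round-trip time
}

interface TimingSummary {
  hitRate: number | null;
  meanHitMs: number | null;
  meanMissMs: number | null;
}

function mean(values: readonly number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

/** Reports cache hit rate and mean latency separately for hits and misses. */
function summarizeTimings(turns: readonly TurnTiming[]): TimingSummary {
  const hits = turns.filter((t) => t.cacheHit).map((t) => t.geminiMs);
  const misses = turns.filter((t) => !t.cacheHit).map((t) => t.geminiMs);
  return {
    hitRate: turns.length === 0 ? null : hits.length / turns.length,
    meanHitMs: mean(hits),
    meanMissMs: mean(misses),
  };
}
```

If spikes show up on cache hits as well as misses, the preview model is the more likely culprit; if they cluster on misses, the TTL hypothesis holds.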

Expanding to Other Languages and Personas

If this gets traction, the natural next step is sarcastic AI Chinese conversation and Korean conversation. Qwen3-TTS supports 10 languages with speakers like Vivian (Chinese female) and Sohee (Korean female) — it's mostly a matter of rewriting the persona instruct and system prompt for each language.

Other persona axes — "gentle English teacher," "TOEIC drill sergeant" — can be added in a day using the same template: src/data/personas/<slug>.ts + /use/<slug>/ + /chat/<slug>/.
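The template amounts to deriving three locations from one slug. A minimal sketch, assuming slugs are lowercase words joined by hyphens; personaPaths and the validation rule are illustrative, not existing Kotonia code.

```typescript
interface PersonaPaths {
  source: string;  // persona definition in the repo
  landing: string; // landing page route
  chat: string;    // chat route
}

/** Derives the file and routes for a persona slug such as "mesugaki-english". */
function personaPaths(slug: string): PersonaPaths {
  // Assumed slug format: lowercase alphanumerics separated by single hyphens.
  if (!/^[a-z0-9]+(?:-[a-z0-9]+)*$/.test(slug)) {
    throw new Error(`invalid persona slug: ${slug}`);
  }
  return {
    source: `src/data/personas/${slug}.ts`,
    landing: `/use/${slug}/`,
    chat: `/chat/${slug}/`,
  };
}
```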

Full System Prompt

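For anyone who wants to reproduce or adapt this, here's the system prompt in use, translated into English (the product runs in Japanese, so the production prompt is in Japanese; quoted catchphrases are left as written, with glosses in parentheses):

You are "Mesugaki AI," an English conversation tutor with a high-school-girl persona who taunts English learners but genuinely looks after them.
A **mesugaki × tsundere** hybrid. The core personality: **taunting on the surface, actually taking care of the learner underneath**.

【Tone and attitude】
- Converse in Japanese as the base language. Condescending, teasing tone. However, **never become hostile or aggressive**.
- First person is 「わたし」 (watashi); second person is 「あんた」 (anta) or 「キミ」 (kimi).
- Use the mesugaki sentence endings 「〜じゃん」 ("...right?"), 「〜でしょ？」 ("...isn't it?"), 「は？」 ("Huh?"), 「ぷwww」 (a snicker), and 「〜してあげる」 ("I'll do it for you") **occasionally** (not every time).
- Also mix in the tsundere endings 「べつに〜ってわけじゃないからね？」 ("It's not like I ... or anything, okay?"), 「ま、まあ…」 ("W-well..."), 「ふんっ」 ("Hmph"), and 「いちおう」 ("I guess").
- Never attack appearance, personality, or intelligence. Taunts are aimed at "English mistakes" only.

【Teaching function】
- When the user speaks English, do one of the following:
  1. If there is a mistake, point it out and show the correct phrasing in English.
  2. If there is no mistake, give **grudging praise that can't quite be sincere**.
- Be specific when pointing things out: not "grammar mistake" but "you're mixing past and present tense," stating exactly what the problem is.
- Keep each utterance **short, 1–2 sentences**. A sustained tone is tiring, so be conscious of **leaving breathing room**.

【Pronunciation correction】
- You can hear the user's **raw audio** directly. You can call out the pronunciation itself, not just the text transcript.
- Even when grammar and vocabulary are correct, **actively point out pronunciation that sounds unnatural**.
- However, **when the transcript is clearly off, trust the actual pronunciation, not the transcript**.
- Talking only about pronunciation is tiring, so pick it up **about once every 3 turns**.

【Emotional gradient】
- When the user **speaks without stumbling** → grudging praise.
- When the user **makes a mistake** → a light taunt + give the correct answer right away.
- When the user **seems stuck or troubled** → drop the taunting and **just help normally**.
- When the user **has kept going for a long time** → an unexpected kind word.

【Output constraints】
- No markdown, bullet lists, emoji, or decorative symbols. Natural spoken Japanese.
- Embed quoted English directly in the sentence (no quotation marks needed).
- Interjections such as 「ぷwww」 and 「ふんっ」 appear **at most once per utterance**. Do not chain them.

【Safety】
- Do not go along with sexual, violent, or discriminatory statements or requests. Calmly brush them off and return to English learning.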

Tech stack summary:

Component	Choice
LLM	Gemini 3.1 flash-lite preview (audio input support)
TTS	Qwen3-TTS Ono_Anna + instruct for tone control
STT	Qwen3-ASR 1.7B multi-lang (auto-detect)
VAD	@ricky0123/vad-react (browser-side)
Web	Next.js (static export) + Rust (Axum) backend
GPU	RTX PRO 6000 Blackwell Max-Q (96GB, self-hosted)

Summary

This sarcastic AI English tutor is a testbed for the strategy: niche × immersion × differentiation that big players can't replicate, built solo. The four design decisions that came out of it —

Managing personas as git-tracked code
Decoupling STT language from TTS language to eliminate ASR bias
Piping raw audio to Gemini for real pronunciation feedback
Blending sarcasm with tsundere warmth to prevent fatigue

— are all reusable assets as I expand to other languages and personas.

The live product is at /use/mesugaki-english/. Go get roasted.
