This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem: Language Learning Feels Disconnected
We've all downloaded Duolingo, done the first week religiously, and then... stopped. Why? Because language learning apps are fundamentally disconnected from real life. You're matching words on a screen when what you actually need is someone patient sitting next to you, pointing at things, and helping you build vocabulary from your own world.
Immersion is widely regarded as one of the most effective approaches to language acquisition — but it typically requires expensive human tutors or living abroad. What if AI could bring that immersive experience to everyone?
The Idea: Point Your Camera, Learn a Language
LinguaLive is a real-time AI language tutor named Luna that:
- Sees through your camera and teaches you words for objects in your environment
- Hears your pronunciation and gives specific, actionable feedback
- Speaks back with native-sounding voices via the Gemini Live API
- Generates custom visual flashcards using Imagen 3
- Adapts to the learner's pace — the system prompt instructs Luna to simplify when the learner struggles and increase difficulty when they're doing well
No text boxes. No multiple choice. Just a natural conversation where you point your camera at your kitchen and Luna teaches you cooking vocabulary in Spanish.
The Tech Stack
Here's what powers LinguaLive:
| Component | Technology |
|---|---|
| Real-time AI | Gemini 2.0 Flash Live API (bidirectional audio/video streaming) |
| Agent Definition | Google ADK (Agent Development Kit) for agent structure and tool registration |
| Live Streaming | Google GenAI SDK (`client.aio.live.connect()`) for real-time bidirectional streaming |
| Image Generation | Imagen 3 on Vertex AI |
| Backend | Python 3.11, FastAPI, WebSocket |
| Data Persistence | Cloud Firestore |
| Asset Storage | Cloud Storage |
| Hosting | Cloud Run (auto-scaling, session affinity) |
| CI/CD | Cloud Build |
| Frontend | Vanilla HTML/JS, Web Audio API, MediaDevices API |
How It Works: The Multimodal Loop
The core of LinguaLive is a multimodal streaming loop:
- Voice In → User speaks in their target language (PCM 16kHz via Web Audio API)
- Camera In → Browser captures JPEG frames at ~1fps and sends to Gemini
- Voice Out → Gemini responds with native audio (PCM 24kHz)
- Image Out → Imagen 3 generates flashcard illustrations for key vocabulary on demand
This happens over a single WebSocket connection. The browser captures audio via AudioWorklet (with a ScriptProcessor fallback) and camera frames via getUserMedia(). The FastAPI backend bridges these to the Gemini Live API's bidirectional streaming endpoint.
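To make the interleaving concrete, here is a minimal sketch of a message envelope that could carry both media types over one socket. The envelope fields are an assumption for illustration, not the repo's actual wire format:

```python
import base64
import json

def encode_audio_chunk(pcm_bytes: bytes) -> str:
    """Wrap a raw PCM chunk in a JSON envelope for the WebSocket."""
    return json.dumps({
        "type": "audio",                      # distinguishes audio from video frames
        "mime_type": "audio/pcm;rate=16000",  # 16 kHz mono PCM, as the Live API expects
        "data": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def encode_video_frame(jpeg_bytes: bytes) -> str:
    """Wrap a JPEG camera frame in the same envelope format."""
    return json.dumps({
        "type": "video",
        "mime_type": "image/jpeg",
        "data": base64.b64encode(jpeg_bytes).decode("ascii"),
    })
```

Tagging each message lets the backend route audio to the Live session and frames to the vision input without opening a second connection.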
Key Technical Decisions
Why WebSocket Instead of REST?
The Gemini Live API uses bidiGenerateContent — a bidirectional streaming endpoint. REST would add significant latency for real-time conversation. Our WebSocket carries audio chunks (~250ms each) and video frames interleaved, keeping the conversation feeling natural and responsive.
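The ~250 ms figure maps to a fixed byte count per message, a quick sanity check:

```python
SAMPLE_RATE = 16_000   # Hz, mono input for the Live API
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 250         # roughly 250 ms of audio per WebSocket message

def chunk_size_bytes(ms: int = CHUNK_MS) -> int:
    """Bytes of 16 kHz 16-bit mono PCM covering `ms` milliseconds."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000
```

So each audio message carries about 8 KB, small enough to keep per-message latency low.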
ADK + GenAI SDK: Why Both?
We use ADK to define the agent — Luna's persona, system instruction, and 7 registered tools. But for the actual Live API streaming, ADK's standard runner doesn't support real-time bidirectional audio/video, so we use the GenAI SDK's `client.aio.live.connect()` directly. This gives us:
- Real-time PCM audio streaming in both directions
- Live video frame ingestion
- Function calling mid-stream
- Input and output audio transcription
Luna's 5 active Gemini tools (the ones declared to the model):
- `get_session_progress` — returns real-time learning stats
- `get_vocabulary_quiz` — generates adaptive quizzes from learned words
- `detect_scene` — identifies environments for themed vocabulary lessons
- `identify_objects_in_view` — processes camera object detection
- `generate_flashcard_image` — creates Imagen 3 visual flashcards
(Vocabulary and pronunciation tracking happen automatically via output transcription to avoid interrupting the audio stream with tool calls.)
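For a sense of what declaring one of these tools looks like, here is a hypothetical declaration in the JSON-schema style the Live API accepts for function calling. The parameter names are illustrative, not the project's actual schema:

```python
# Illustrative function declaration for one of Luna's tools.
generate_flashcard_image = {
    "name": "generate_flashcard_image",
    "description": "Create an Imagen 3 flashcard illustration for a vocabulary word.",
    "parameters": {
        "type": "object",
        "properties": {
            "word": {
                "type": "string",
                "description": "Target-language word to illustrate",
            },
            "translation": {
                "type": "string",
                "description": "English translation shown on the card",
            },
        },
        "required": ["word"],
    },
}
```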
Grounding to Reduce Hallucinations
A language tutor that invents translations is worse than no tutor at all. We added explicit grounding rules to Luna's system prompt: only teach words she's confident about, only identify camera objects she can clearly see, and acknowledge uncertainty rather than guessing. This doesn't eliminate hallucination entirely, but it significantly reduces it in practice.
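The rules read roughly like this (an illustrative excerpt, not the verbatim prompt):

```
GROUNDING RULES:
- Only teach a word if you are confident of its translation.
- Only name objects you can clearly see in the camera frame.
- If unsure, say so ("I can't see that clearly") rather than guessing.
```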
Firestore for Returning Learners
Session data persists to Cloud Firestore, enabling a "welcome back" experience. When a learner returns, Luna knows what words they learned last time and builds on that foundation rather than starting over.
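A sketch of what that persisted session might look like, with field names that are assumptions rather than the project's actual Firestore schema:

```python
# Illustrative shape of a persisted session document.
session_doc = {
    "learner_id": "demo-user",
    "target_language": "es",
    "words_learned": [
        {"word": "cuchara", "translation": "spoon", "times_practiced": 3},
    ],
    "last_seen": "2026-01-15T18:30:00Z",
}

def welcome_back_line(doc: dict) -> str:
    """One way Luna could open a session for a returning learner."""
    words = ", ".join(w["word"] for w in doc["words_learned"])
    return f"Welcome back! Last time you learned: {words}."
```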
Keeping the Audio Stream Smooth
In a real-time voice app, anything that blocks the event loop causes audible stuttering. Two patterns were key to keeping audio smooth:
Async-safe Firestore initialization. Multiple WebSocket connections can arrive simultaneously at startup. Without protection, each could try to create a Firestore client at the same time. We used `asyncio.Lock()` with a double-check pattern inside `_init_firestore()` to ensure the client is created exactly once, without blocking the event loop.
Background flashcard generation. Imagen 3 takes 3–8 seconds to generate an image. If we awaited that inside the receive loop, audio would freeze. Instead, we respond to Gemini immediately with a "pending" status and spin up the actual generation as a background task via `asyncio.create_task()`. When the image is ready, it's pushed to the client over the WebSocket independently of the audio stream.
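A minimal sketch of that fire-and-forget pattern, with stand-ins for the Imagen call and the WebSocket push:

```python
import asyncio

_background_tasks: set = set()
results: list = []

async def generate_image(word: str) -> bytes:
    """Stand-in for the 3-8 second Imagen 3 call."""
    await asyncio.sleep(0.01)  # shortened for the sketch
    return f"<flashcard:{word}>".encode()

async def push_to_client(image: bytes) -> None:
    """Stand-in for pushing the finished image over the WebSocket."""
    results.append(image)

def handle_flashcard_call(word: str) -> dict:
    """Answer the tool call immediately; do the slow work in the background."""
    async def _generate_and_push() -> None:
        await push_to_client(await generate_image(word))

    task = asyncio.create_task(_generate_and_push())
    _background_tasks.add(task)                 # keep a strong reference
    task.add_done_callback(_background_tasks.discard)
    return {"status": "pending", "word": word}  # Gemini gets this right away
```

Keeping a strong reference to the task matters: `asyncio` only holds a weak reference, so an unreferenced background task can be garbage-collected mid-flight.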
The Hardest Part: Audio Reliability
Getting real-time audio working reliably across browsers was the biggest challenge. Key issues we solved:
- AudioWorklet vs ScriptProcessor — AudioWorklet runs off the main thread for better performance. We use it as the primary approach with a ScriptProcessor fallback for broader compatibility.
- Sample Rate Mismatch — Requesting 16kHz from the browser doesn't guarantee it. We added runtime resampling in the AudioWorklet to ensure Gemini always receives 16kHz PCM regardless of the device's native sample rate.
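The resampler itself lives in the JS AudioWorklet, but the core idea is plain linear interpolation. A simplified Python version of the same logic, for illustration:

```python
def resample_to_16k(samples: list, src_rate: int) -> list:
    """Linear-interpolation resample of mono float samples to 16 kHz."""
    dst_rate = 16_000
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio                 # fractional position in the source
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

A device capturing at 48 kHz, for example, yields exactly one output sample for every three input samples.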
- Barge-in Handling — When the user interrupts Luna mid-speech, we immediately stop audio playback, clear the queue, and let the new response stream through. We also suppress mic forwarding while the model is speaking to prevent speaker echo from causing false interruptions.
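The queue-clearing half of barge-in can be sketched with a plain `asyncio.Queue`; stopping the in-flight chunk and muting the mic are separate steps in the real app:

```python
import asyncio

def clear_playback_queue(queue: asyncio.Queue) -> int:
    """Drop any queued audio chunks when the user barges in.

    Returns how many chunks were discarded.
    """
    dropped = 0
    while True:
        try:
            queue.get_nowait()
            dropped += 1
        except asyncio.QueueEmpty:
            return dropped
```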
- Receive Loop Re-entry — We discovered that the Live API's `receive()` generator completes after each model turn. The fix is to re-enter it in a `while True` loop for multi-turn conversations.
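The re-entry pattern looks roughly like this, with a stand-in session object in place of the real Live API session:

```python
import asyncio

class FakeSession:
    """Stand-in for a Live session whose receive() ends at each turn."""
    def __init__(self, turns):
        self._turns = list(turns)

    async def receive(self):
        # Yields one turn's messages, then the generator completes,
        # mimicking the per-turn boundary of the real receive().
        if self._turns:
            for msg in self._turns.pop(0):
                yield msg

async def run_receive_loop(session, handle, max_turns=10):
    """Re-enter receive() so the session survives multiple turns."""
    for _ in range(max_turns):          # `while True` in the real app
        got_any = False
        async for msg in session.receive():
            got_any = True
            handle(msg)
        if not got_any:                 # no more turns: session is done
            break
```

Without the outer loop, the first completed turn looks exactly like a dropped session, which is why this bug was so hard to spot.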
What I Learned
- The Gemini Live API is remarkably capable — bidirectional audio + video + function calling in a single streaming session opens up experiences that weren't possible before.
- Grounding matters more for educational AI — users trust a tutor implicitly. Teaching a wrong translation erodes that trust fast, so explicit anti-hallucination prompting is essential.
- Imagen 3 adds a visual dimension — generated flashcard illustrations make vocabulary tangible and give learners something to revisit later.
- Cloud Run with session affinity works well for WebSocket-based apps — the session affinity flag ensures long-lived WebSocket connections stick to the same instance. One thing to watch: the in-memory session cache works perfectly with sticky sessions, but if you ever scale to multiple instances without affinity, you'd need to handle cache coherence with Firestore.
- The Live API's `receive()` generator ending per turn was the most subtle bug — it looked like sessions were dropping after one exchange until we figured out the re-entry pattern.
Try It Yourself
The code is open source: github.com/kumarsparkz/lingualive
```bash
git clone https://github.com/kumarsparkz/lingualive.git
cd lingualive
pip install -r requirements.txt
gcloud auth application-default login
python -m app.main
```
Or deploy to Cloud Run with the automated script:
```bash
export GCP_PROJECT_ID=your-project-id
./deploy.sh
```
Built for the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge