This blog post was created for the purposes of entering the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem: Language Learning Feels Disconnected
We've all downloaded Duolingo, done the first week religiously, and then... stopped. Why? Because language learning apps are fundamentally disconnected from real life. You're matching words on a screen when what you actually need is someone patient sitting next to you, pointing at things, and helping you build vocabulary from your own world.
Immersion is widely regarded as one of the most effective approaches to language acquisition — but it typically requires expensive human tutors or living abroad. What if AI could bring that immersive experience to everyone?
The Idea: Point Your Camera, Learn a Language
LinguaLive is a real-time AI language tutor named Luna that:
- Sees through your camera and teaches you words for objects in your environment
- Hears your pronunciation and gives specific, actionable feedback
- Speaks back with native-sounding voices via the Gemini Live API
- Generates custom visual flashcards using Imagen 3
- Adapts to the learner's pace — the system prompt instructs Luna to simplify when the learner struggles and increase difficulty when they're doing well
No text boxes. No multiple choice. Just a natural conversation where you point your camera at your kitchen and Luna teaches you cooking vocabulary in Spanish.
The Tech Stack
Here's what powers LinguaLive:
| Component | Technology |
|---|---|
| Real-time AI | Gemini 2.0 Flash Live API (bidirectional audio/video streaming) |
| Agent Definition | Google ADK (Agent Development Kit) for agent structure and tool registration |
| Live Streaming | Google GenAI SDK (`client.aio.live.connect()`) for real-time bidirectional streaming |
| Image Generation | Imagen 3 on Vertex AI |
| Backend | Python 3.11, FastAPI, WebSocket |
| Data Persistence | Cloud Firestore |
| Asset Storage | Cloud Storage |
| Hosting | Cloud Run (auto-scaling, session affinity) |
| CI/CD | Cloud Build |
| Frontend | Vanilla HTML/JS, Web Audio API, MediaDevices API |
How It Works: The Multimodal Loop
The core of LinguaLive is a multimodal streaming loop:
- Voice In → User speaks in their target language (PCM 16kHz via Web Audio API)
- Camera In → Browser captures JPEG frames at ~1fps and sends to Gemini
- Voice Out → Gemini responds with native audio (PCM 24kHz)
- Image Out → Imagen 3 generates flashcard illustrations for key vocabulary on demand
This happens over a single WebSocket connection. The browser captures audio via AudioWorklet (with a ScriptProcessor fallback) and camera frames via getUserMedia(). The FastAPI backend bridges these to the Gemini Live API's bidirectional streaming endpoint.
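To make the interleaving concrete, here is a minimal sketch of a message envelope that could carry both media types over one socket. The envelope fields are an assumption for illustration, not the repo's actual wire format:

```python
import base64
import json

def encode_audio_chunk(pcm_bytes: bytes) -> str:
    """Wrap a raw PCM chunk in a JSON envelope for the WebSocket."""
    return json.dumps({
        "type": "audio",                      # distinguishes audio from video frames
        "mime_type": "audio/pcm;rate=16000",  # 16 kHz mono PCM, as the Live API expects
        "data": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def encode_video_frame(jpeg_bytes: bytes) -> str:
    """Wrap a JPEG camera frame in the same envelope format."""
    return json.dumps({
        "type": "video",
        "mime_type": "image/jpeg",
        "data": base64.b64encode(jpeg_bytes).decode("ascii"),
    })
```

Tagging each message lets the backend route audio to the Live session and frames to the vision input without opening a second connection.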
Key Technical Decisions
Why WebSocket Instead of REST?
The Gemini Live API uses bidiGenerateContent — a bidirectional streaming endpoint. REST would add significant latency for real-time conversation. Our WebSocket carries audio chunks (~250ms each) and video frames interleaved, keeping the conversation feeling natural and responsive.
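The ~250 ms figure maps to a fixed byte count per message, a quick sanity check:

```python
SAMPLE_RATE = 16_000   # Hz, mono input for the Live API
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHUNK_MS = 250         # roughly 250 ms of audio per WebSocket message

def chunk_size_bytes(ms: int = CHUNK_MS) -> int:
    """Bytes of 16 kHz 16-bit mono PCM covering `ms` milliseconds."""
    return SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000
```

So each audio message carries about 8 KB, small enough to keep per-message latency low.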
ADK + GenAI SDK: Why Both?
We use ADK to define the agent — Luna's persona, system instruction, and 7 registered tools. But for the actual Live API streaming, ADK's standard runner doesn't support real-time bidirectional audio/video, so we use the GenAI SDK's `client.aio.live.connect()` directly. This gives us:
- Real-time PCM audio streaming in both directions
- Live video frame ingestion
- Function calling mid-stream
- Input and output audio transcription
Luna's 5 active Gemini tools (the ones declared to the model):
- `get_session_progress` — returns real-time learning stats
- `get_vocabulary_quiz` — generates adaptive quizzes from learned words
- `detect_scene` — identifies environments for themed vocabulary lessons
- `identify_objects_in_view` — processes camera object detection
- `generate_flashcard_image` — creates Imagen 3 visual flashcards
(Vocabulary and pronunciation tracking happen automatically via output transcription to avoid interrupting the audio stream with tool calls.)
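For a sense of what declaring one of these tools looks like, here is a hypothetical declaration in the JSON-schema style the Live API accepts for function calling. The parameter names are illustrative, not the project's actual schema:

```python
# Illustrative function declaration for one of Luna's tools.
generate_flashcard_image = {
    "name": "generate_flashcard_image",
    "description": "Create an Imagen 3 flashcard illustration for a vocabulary word.",
    "parameters": {
        "type": "object",
        "properties": {
            "word": {
                "type": "string",
                "description": "Target-language word to illustrate",
            },
            "translation": {
                "type": "string",
                "description": "English translation shown on the card",
            },
        },
        "required": ["word"],
    },
}
```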
Grounding to Reduce Hallucinations
A language tutor that invents translations is worse than no tutor at all. We added explicit grounding rules to Luna's system prompt: only teach words she's confident about, only identify camera objects she can clearly see, and acknowledge uncertainty rather than guessing. This doesn't eliminate hallucination entirely, but it significantly reduces it in practice.
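The rules read roughly like this (an illustrative excerpt, not the verbatim prompt):

```
GROUNDING RULES:
- Only teach a word if you are confident of its translation.
- Only name objects you can clearly see in the camera frame.
- If unsure, say so ("I can't see that clearly") rather than guessing.
```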
Firestore for Returning Learners
Session data persists to Cloud Firestore, enabling a "welcome back" experience. When a learner returns, Luna knows what words they learned last time and builds on that foundation rather than starting over.
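A sketch of what that persisted session might look like, with field names that are assumptions rather than the project's actual Firestore schema:

```python
# Illustrative shape of a persisted session document.
session_doc = {
    "learner_id": "demo-user",
    "target_language": "es",
    "words_learned": [
        {"word": "cuchara", "translation": "spoon", "times_practiced": 3},
    ],
    "last_seen": "2026-01-15T18:30:00Z",
}

def welcome_back_line(doc: dict) -> str:
    """One way Luna could open a session for a returning learner."""
    words = ", ".join(w["word"] for w in doc["words_learned"])
    return f"Welcome back! Last time you learned: {words}."
```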
Keeping the Audio Stream Smooth
In a real-time voice app, anything that blocks the event loop causes audible stuttering. Two patterns were key to keeping audio smooth:
Async-safe Firestore initialization. Multiple WebSocket connections can arrive simultaneously at startup. Without protection, each could try to create a Firestore client at the same time. We used `asyncio.Lock()` with a double-check pattern inside `_init_firestore()` to ensure the client is created exactly once, without blocking the event loop.
Background flashcard generation. Imagen 3 takes 3–8 seconds to generate an image. If we awaited that inside the receive loop, audio would freeze. Instead, we respond to Gemini immediately with a "pending" status and spin up the actual generation as a background task via `asyncio.create_task()`. When the image is ready, it's pushed to the client over the WebSocket independently of the audio stream.
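A minimal sketch of that fire-and-forget pattern, with stand-ins for the Imagen call and the WebSocket push:

```python
import asyncio

_background_tasks: set = set()
results: list = []

async def generate_image(word: str) -> bytes:
    """Stand-in for the 3-8 second Imagen 3 call."""
    await asyncio.sleep(0.01)  # shortened for the sketch
    return f"<flashcard:{word}>".encode()

async def push_to_client(image: bytes) -> None:
    """Stand-in for pushing the finished image over the WebSocket."""
    results.append(image)

def handle_flashcard_call(word: str) -> dict:
    """Answer the tool call immediately; do the slow work in the background."""
    async def _generate_and_push() -> None:
        await push_to_client(await generate_image(word))

    task = asyncio.create_task(_generate_and_push())
    _background_tasks.add(task)                 # keep a strong reference
    task.add_done_callback(_background_tasks.discard)
    return {"status": "pending", "word": word}  # Gemini gets this right away
```

Keeping a strong reference to the task matters: `asyncio` only holds a weak reference, so an unreferenced background task can be garbage-collected mid-flight.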
The Hardest Part: Audio Reliability
Getting real-time audio working reliably across browsers was the biggest challenge. Key issues we solved:
- AudioWorklet vs ScriptProcessor — AudioWorklet runs off the main thread for better performance. We use it as the primary approach with a ScriptProcessor fallback for broader compatibility.
- Sample Rate Mismatch — Requesting 16kHz from the browser doesn't guarantee it. We added runtime resampling in the AudioWorklet to ensure Gemini always receives 16kHz PCM regardless of the device's native sample rate.
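The resampler itself lives in the JS AudioWorklet, but the core idea is plain linear interpolation. A simplified Python version of the same logic, for illustration:

```python
def resample_to_16k(samples: list, src_rate: int) -> list:
    """Linear-interpolation resample of mono float samples to 16 kHz."""
    dst_rate = 16_000
    if src_rate == dst_rate:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio                 # fractional position in the source
        j = int(pos)
        frac = pos - j
        nxt = samples[j + 1] if j + 1 < len(samples) else samples[j]
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

A device capturing at 48 kHz, for example, yields exactly one output sample for every three input samples.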
- Barge-in Handling — When the user interrupts Luna mid-speech, we immediately stop audio playback, clear the queue, and let the new response stream through. We also suppress mic forwarding while the model is speaking to prevent speaker echo from causing false interruptions.
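The queue-clearing half of barge-in can be sketched with a plain `asyncio.Queue`; stopping the in-flight chunk and muting the mic are separate steps in the real app:

```python
import asyncio

def clear_playback_queue(queue: asyncio.Queue) -> int:
    """Drop any queued audio chunks when the user barges in.

    Returns how many chunks were discarded.
    """
    dropped = 0
    while True:
        try:
            queue.get_nowait()
            dropped += 1
        except asyncio.QueueEmpty:
            return dropped
```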
- Receive Loop Re-entry — We discovered that the Live API's `receive()` generator completes after each model turn. The fix is to re-enter it in a `while True` loop for multi-turn conversations.
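The re-entry pattern looks roughly like this, with a stand-in session object in place of the real Live API session:

```python
import asyncio

class FakeSession:
    """Stand-in for a Live session whose receive() ends at each turn."""
    def __init__(self, turns):
        self._turns = list(turns)

    async def receive(self):
        # Yields one turn's messages, then the generator completes,
        # mimicking the per-turn boundary of the real receive().
        if self._turns:
            for msg in self._turns.pop(0):
                yield msg

async def run_receive_loop(session, handle, max_turns=10):
    """Re-enter receive() so the session survives multiple turns."""
    for _ in range(max_turns):          # `while True` in the real app
        got_any = False
        async for msg in session.receive():
            got_any = True
            handle(msg)
        if not got_any:                 # no more turns: session is done
            break
```

Without the outer loop, the first completed turn looks exactly like a dropped session, which is why this bug was so hard to spot.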
What I Learned
- The Gemini Live API is remarkably capable — bidirectional audio + video + function calling in a single streaming session opens up experiences that weren't possible before.
- Grounding matters more for educational AI — users trust a tutor implicitly. Teaching a wrong translation erodes that trust fast, so explicit anti-hallucination prompting is essential.
- Imagen 3 adds a visual dimension — generated flashcard illustrations make vocabulary tangible and give learners something to revisit later.
- Cloud Run with session affinity works well for WebSocket-based apps — the session affinity flag ensures long-lived WebSocket connections stick to the same instance. One thing to watch: the in-memory session cache works perfectly with sticky sessions, but if you ever scale to multiple instances without affinity, you'd need to handle cache coherence with Firestore.
- The Live API's `receive()` generator ending per turn was the most subtle bug — it looked like sessions were dropping after one exchange until we figured out the re-entry pattern.
Try It Yourself
The code is open source: github.com/kumarsparkz/lingualive
```bash
git clone https://github.com/kumarsparkz/lingualive.git
cd lingualive
pip install -r requirements.txt
gcloud auth application-default login
python -m app.main
```
Or deploy to Cloud Run with the automated script:
```bash
export GCP_PROJECT_ID=your-project-id
./deploy.sh
```
Built for the Gemini Live Agent Challenge 2026. #GeminiLiveAgentChallenge