The Idea
Most AI tutors are still built around one-way explanation. They deliver information, but they do not really force the learner to explain, defend, or retrieve what they know.
TeachBack flips that dynamic. It is a real-time voice learning app where the user chooses a study topic, a learning mode, and an AI persona, then enters a live conversation with the tutor. Instead of passively consuming answers, the user has to talk through concepts out loud while the AI listens, challenges weak reasoning, asks follow-up questions, and scores the session at the end.
The goal is to make learning active rather than passive.
The Stack
TeachBack is built around Google's AI and cloud tooling.
For the live conversation layer, I used the Gemini Live API with gemini-2.5-flash-native-audio-preview-12-2025. That powers the real-time voice session: the user speaks, Gemini responds with audio, and both sides are transcribed live during the session. This is what gives the app its conversational feel.
For the non-live tasks, I used gemini-2.5-flash. That model handles things like topic generation, study material preparation, and fallback scoring when I need a structured evaluation outside the live loop.
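For the structured-evaluation path, the non-live model is asked to return JSON rather than free text, which the backend then has to validate before trusting. A minimal sketch of that parsing step, where the field names (accuracy, depth, feedback) are my own illustration rather than the app's actual schema:

```python
import json
from dataclasses import dataclass

@dataclass
class SessionScore:
    accuracy: int   # 0-100, factual correctness of the learner's explanation
    depth: int      # 0-100, how far past surface recall the learner went
    feedback: str   # short narrative feedback for the learner

def parse_score(raw: str) -> SessionScore:
    """Validate the JSON payload a structured scoring call might return."""
    data = json.loads(raw)
    score = SessionScore(int(data["accuracy"]), int(data["depth"]), str(data["feedback"]))
    for value in (score.accuracy, score.depth):
        if not 0 <= value <= 100:
            raise ValueError("scores must be in [0, 100]")
    return score
```

Validating model output like this matters because a fallback scorer runs outside the live loop, where there is no user in the conversation to catch a malformed result.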
The backend is built with FastAPI and the Google GenAI SDK (google-genai). The frontend is built with React + Vite, using the Web Audio API to capture microphone input and stream PCM audio over WebSocket to the backend. On the cloud side, the app is deployed on Google Cloud Run, preset study content is stored in Google Cloud Storage, and deployment is automated through Cloud Build and a small deploy.sh script.
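The PCM step is worth making concrete. In the browser it happens in JavaScript, but the transform itself is simple: clamp each Float32 sample from the Web Audio API to [-1, 1] and pack it as little-endian 16-bit PCM before it goes over the WebSocket. The same transform, sketched in Python:

```python
import struct

def floats_to_pcm16(samples):
    """Convert Float32 audio samples in [-1.0, 1.0] to 16-bit
    little-endian PCM, the wire format streamed to the backend."""
    out = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))        # clamp out-of-range samples
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)
```

Each sample doubles in byte width (4-byte float in, 2 bytes out), which is also roughly the bandwidth math for the audio stream.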
How It Works
The app starts with a preset study trail. Each trail contains prepared grounding material plus the original source PDFs. When a user selects a trail, the backend prepares a session and the frontend opens a live WebSocket connection for the tutoring run.
From there, the browser captures mic audio, encodes it as PCM, and sends it to the backend. The backend acts as the bridge between the browser and Gemini Live, forwarding user audio, receiving model audio/transcripts, handling session events, and returning everything back to the UI in real time.
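The bridge is essentially two concurrent forwarding loops: one pushing user audio toward the model, one pushing model audio and transcripts back toward the UI. A stripped-down sketch of that shape using plain asyncio queues in place of the real WebSocket and Gemini Live session (the names and the None end-of-stream sentinel are my simplification):

```python
import asyncio

async def relay(mic, to_model, from_model, to_ui):
    """Bridge loop: forward mic PCM chunks to the model side and model
    events (audio/transcripts) back to the browser. None ends a stream."""
    async def upstream():
        while (chunk := await mic.get()) is not None:
            await to_model.put(chunk)       # user audio -> Gemini Live
        await to_model.put(None)
    async def downstream():
        while (event := await from_model.get()) is not None:
            await to_ui.put(event)          # model audio/transcript -> UI
        await to_ui.put(None)
    await asyncio.gather(upstream(), downstream())
```

Running both directions as independent tasks is what keeps the session full-duplex: the model can keep speaking while the user's next chunk of audio is already in flight.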
I also built several tutoring behaviors on top of that live loop:
Four learning modes
Explain Mode for Feynman-style explanation
Socratic Mode for guided questioning
Recall Mode for conversational retrieval practice
Teach Mode for interactive instruction
Three personas with distinct voices and behaviors
Curious Kid
Skeptical Peer
Tough Professor
Each persona changes not just the wording of the tutor, but the tone, pacing, and conversational style of the session.
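Modes and personas compose into a single system instruction for the live session. The prompt fragments below are illustrative placeholders, not the app's real instructions, but they show the shape of the composition:

```python
# Illustrative prompt fragments -- the app's actual instructions are richer.
MODES = {
    "explain":  "Ask the learner to explain the topic from scratch, Feynman-style.",
    "socratic": "Never state answers directly; guide with probing questions.",
    "recall":   "Quiz the learner conversationally; withhold hints at first.",
    "teach":    "Teach interactively, checking understanding after each point.",
}

PERSONAS = {
    "curious_kid":     "Speak simply and interrupt with eager why-questions.",
    "skeptical_peer":  "Push back on weak reasoning and demand evidence.",
    "tough_professor": "Be terse and exacting; hold a high bar for precision.",
}

def system_prompt(topic: str, mode: str, persona: str) -> str:
    """Compose one system instruction for a live tutoring session."""
    return (f"You are a voice tutor for the topic: {topic}.\n"
            f"Mode: {MODES[mode]}\n"
            f"Persona: {PERSONAS[persona]}")
```

Keeping modes and personas as orthogonal dictionaries is what makes all twelve combinations cheap to support.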
The session booth includes two types of grounding:
an expandable prepared-material panel showing the normalized source-of-truth text the tutor is using
an in-browser PDF viewer so the original source documents can also be inspected directly during the session
That second piece turned out to matter: when users can check the tutor's claims against the original PDF, the system becomes much easier to trust.
The Most Interesting Feature
One of the most interesting parts of the project is the interruption sidecar.
During a live session, the main tutoring flow can pause while a focused correction flow takes over. That correction path can gather clarification, resolve a misunderstanding, and then feed the result back into the main tutoring session so the lesson continues with the right context.
What made this technically interesting was not just showing a new UI window. The hard part was preserving state cleanly across the pause and resume boundary: stopping the main agent at the right moment, disabling the main mic path, collecting the correction context, relaying that information back into the active session, and then resuming without making the whole conversation feel broken.
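One way to keep that pause/resume boundary honest is to make the legal transitions explicit, so the session can never, say, resume tutoring without having collected the correction context. A minimal sketch of that idea (the state names and transition table are my own, not the app's internals):

```python
from enum import Enum, auto

class SessionState(Enum):
    TUTORING = auto()    # main live loop active, mic routed to main agent
    CORRECTING = auto()  # sidecar owns the mic; main agent paused
    RESUMING = auto()    # correction context being injected back

# The only transitions the booth is allowed to make.
VALID = {
    (SessionState.TUTORING, SessionState.CORRECTING),
    (SessionState.CORRECTING, SessionState.RESUMING),
    (SessionState.RESUMING, SessionState.TUTORING),
}

class Booth:
    def __init__(self):
        self.state = SessionState.TUTORING
        self.correction_context = None

    def transition(self, new: SessionState):
        if (self.state, new) not in VALID:
            raise RuntimeError(f"illegal transition {self.state} -> {new}")
        self.state = new
```

An explicit transition table turns "the conversation feels broken" bugs into loud, immediate errors at the exact moment the state machine is misused.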
That interruption-and-recovery behavior is a big part of what I think real-time AI tutors need in order to feel genuinely useful.
What Was Hard
The hardest part was making the real-time session architecture stable.
Once you move beyond static prompt/response interactions, the hard problems change. It becomes much more about:
audio transport
sample-rate handling
turn boundaries
transcript accuracy
playback coordination
session lifecycle
interruption state
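Sample-rate handling is a good example of how unglamorous this layer is: browser capture typically runs at 44.1 or 48 kHz, while the model side expects a fixed lower input rate (16 kHz PCM for Gemini Live, as I understand it). A naive linear-interpolation resampler, sketched for illustration (production code would use a proper filtered resampler to avoid aliasing):

```python
def resample_linear(samples, src_rate, dst_rate):
    """Linearly interpolate mono float samples from src_rate to dst_rate.
    Illustrative only: no anti-aliasing filter is applied."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = (len(samples) - 1) / (n_out - 1) if n_out > 1 else 0.0
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

Getting this wrong does not crash anything; it just makes the tutor mishear you, which is far harder to debug.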
One of the biggest lessons from this project was that the model is only part of the challenge. The surrounding systems are where most of the engineering complexity shows up.
I spent a lot of time making sure the app could survive real multi-turn conversation, keep the transcript faithful to what was actually being answered, and recover cleanly when the session needed to pause or redirect.
Why I Built It This Way
I wanted the project to feel like a real learning product rather than just a technical demo of voice AI.
That meant focusing on things like:
grounded study material instead of vague free-form chatting
learning modes based on real pedagogical techniques
personas that feel different to interact with
source visibility so the user can inspect what the tutor is grounded on
correction and interruption behavior so the session can adapt instead of continuing blindly
In other words, I wanted the app to show what a voice-native learning companion could actually feel like when built around understanding rather than just answer generation.
Try It
The app is live here:
https://teachback-ig3hrbcina-uc.a.run.app
Tested in Chrome and Safari.
Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge