A few weeks ago I was watching my younger cousin struggle through a physics worksheet.
She kept typing questions into ChatGPT, getting a wall of text back, and still looking confused. It hit me: why couldn't she just show the problem to an AI and have it talk her through it like a real tutor would?
That question became VisionSolve (SolveTutor), and I built it for the Gemini Live Agent Challenge.
The Idea: What if AI Could See and Speak?
Most AI tutoring tools work through text boxes. You type your question, you get a text response. But that's not how tutoring works in real life. A real tutor looks at your paper, listens to your confusion, and talks you through it step by step. They notice when you're lost and adjust.
So I wanted to build exactly that — an AI tutor that:
- Sees your homework through your camera
- Listens to your questions through your microphone
- Speaks explanations back to you, naturally
No typing required. Just point your phone at a math problem and start talking.
Why Gemini Live API Was Perfect for This
I'd been playing around with different LLM APIs, and when I found the Gemini Live API, it clicked immediately. Most APIs are request-response — you send text, you get text back. Gemini's Live API is completely different. It opens a persistent bidirectional stream where you can send audio and video frames continuously, and the model responds in real-time audio.
The killer feature for me was native audio output. The model doesn't generate text that gets piped through a TTS engine — it produces audio directly. The voice sounds natural, with proper pacing and intonation. When Sol (my tutor agent) explains a math concept, it actually sounds like someone talking to you, not a robot reading a script.
The other thing that sold me was barge-in support. Students interrupt. That's normal. They'll say "wait, what?" in the middle of an explanation. With other APIs you'd have to manage complex state to handle that. With Gemini Live, the student can just... talk. The model handles it gracefully.
The Stack
Here's what I ended up using:
Backend:
- Google ADK (Agent Development Kit) — this was huge. Instead of wiring up raw API calls, I defined my agent with a system instruction, gave it tools, and ADK handled the session management. The agent framework made it easy to add Google Search grounding so Sol can verify facts on the fly.
- FastAPI + WebSockets — the frontend connects over a WebSocket, and the backend proxies audio/video to Gemini Live and streams audio responses back.
- Firebase Firestore — for storing session transcripts so students can review past sessions.
Frontend:
- Next.js + TypeScript — nothing fancy here, just a clean mobile-responsive interface with a webcam feed, audio visualizer, and chat transcript.
- Firebase Auth — Google Sign-In for authentication.
Model: gemini-2.5-flash-native-audio — the latest native audio model. Fast enough for real-time conversation, capable enough to understand handwritten math from a shaky phone camera.
Things That Surprised Me
The vision capabilities are seriously good
I expected the model to struggle with handwritten math, especially messy student handwriting. It doesn't. I tested it with all kinds of problems: my cousin's scribbled algebra homework, a printed calculus worksheet, even a badly drawn chemistry diagram. It identified them correctly almost every time, and it copes when the camera is slightly angled or the lighting isn't great.
Natural interruptions just work
This was the feature I was most nervous about. In my head, I had this complex state machine planned out for handling when a student interrupts. Turns out, I didn't need any of it. The Live API's barge-in support means when the student starts talking, the model stops, listens, and responds. It's seamless.
The hardest part wasn't the AI
Honestly, the AI side was smoother than I expected, thanks to ADK and the Live API.
The hardest part was WebSocket audio streaming.
Browsers hate it. Microphone permissions break. Safari behaves weirdly. Classic web dev pain.
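One concrete piece of that pain: the browser's Web Audio API hands you Float32 samples at the hardware rate (often 48 kHz), while Gemini Live wants 16-bit PCM at 16 kHz. Here's a stdlib-only sketch of the conversion I mean — naive decimation with no low-pass filter, so treat it as illustrative rather than production audio code:

```python
import struct

def float32_to_pcm16k(samples, src_rate=48000, dst_rate=16000):
    """Convert Float32 samples in [-1.0, 1.0] to little-endian 16-bit
    PCM at dst_rate by naive decimation (keep every Nth sample).
    Real code should low-pass filter first to avoid aliasing."""
    step = src_rate // dst_rate  # 3 for 48 kHz -> 16 kHz
    out = bytearray()
    for i in range(0, len(samples), step):
        # clamp, then scale to the signed 16-bit range
        s = max(-1.0, min(1.0, samples[i]))
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)
```

In practice this runs in an AudioWorklet on the frontend before the bytes ever hit the WebSocket, which is exactly where the browser quirks live.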
Google Cloud Deployment
I deployed the whole thing on Google Cloud:
- Backend runs on Cloud Run with Vertex AI integration
- Frontend is on Firebase Hosting
- CI/CD pipeline through GitHub Actions — push a tag and everything deploys automatically
The deploy workflow builds a Docker image, pushes it to GCR, deploys to Cloud Run, then builds the frontend with the new backend URL injected and deploys it to Firebase. The whole pipeline takes about 4 minutes, which is pretty nice for not having to think about deployments at all.
You can check out the full pipeline in the repo's GitHub Actions workflow.
What I'd Do Differently
If I had more time, I'd add:
- Drawing/annotation support — let Sol highlight parts of the image while explaining
- Progress tracking — track which topics the student struggles with over time
Try It Out
The project is open source: github.com/dev-phantom/VisionSolve
The README has full setup instructions if you want to run it locally. You'll need a Gemini API key (free from Google AI Studio) and a Firebase project.
If you're thinking about building something with the Gemini Live API, I'd say go for it. The combination of real-time audio + vision is genuinely different from anything else available right now. It opens up use cases that just weren't possible with traditional request-response APIs.
Built for the #GeminiLiveAgentChallenge using Google Gemini, ADK, Firebase, and Cloud Run.