A few weeks ago I was watching my younger cousin struggle through a physics worksheet.
She kept typing questions into ChatGPT, getting a wall of text back, and still looking confused. It hit me: why couldn't she just show the problem to an AI and have it talk her through it like a real tutor would?
That question became VisionSolve (SolveTutor), and I built it for the Gemini Live Agent Challenge.
The Idea: What if AI Could See and Speak?
Most AI tutoring tools work through text boxes. You type your question, you get a text response. But that's not how tutoring works in real life. A real tutor looks at your paper, listens to your confusion, and talks you through it step by step. They notice when you're lost and adjust.
So I wanted to build exactly that — an AI tutor that:
- Sees your homework through your camera
- Listens to your questions through your microphone
- Speaks explanations back to you, naturally
No typing required. Just point your phone at a math problem and start talking.
Why Gemini Live API Was Perfect for This
I'd been playing around with different LLM APIs, and when I found the Gemini Live API, it clicked immediately. Most APIs are request-response — you send text, you get text back. Gemini's Live API is completely different. It opens a persistent bidirectional stream where you can send audio and video frames continuously, and the model responds in real-time audio.
The killer feature for me was native audio output. The model doesn't generate text that gets piped through a TTS engine — it produces audio directly. The voice sounds natural, with proper pacing and intonation. When Sol (my tutor agent) explains a math concept, it actually sounds like someone talking to you, not a robot reading a script.
The other thing that sold me was barge-in support. Students interrupt. That's normal. They'll say "wait, what?" in the middle of an explanation. With other APIs you'd have to manage complex state to handle that. With Gemini Live, the student can just... talk. The model handles it gracefully.
The Stack
Here's what I ended up using:
Backend:
- Google ADK (Agent Development Kit) — this was huge. Instead of wiring up raw API calls, I defined my agent with a system instruction, gave it tools, and ADK handled the session management. The agent framework made it easy to add Google Search grounding so Sol can verify facts on the fly.
- FastAPI + WebSockets — the frontend connects over a WebSocket, and the backend proxies audio/video to Gemini Live and streams audio responses back.
- Firebase Firestore — for storing session transcripts so students can review past sessions.
Frontend:
- Next.js + TypeScript — nothing fancy here, just a clean mobile-responsive interface with a webcam feed, audio visualizer, and chat transcript.
- Firebase Auth — Google Sign-In for authentication.
Model: gemini-2.5-flash-native-audio — the latest native audio model. Fast enough for real-time conversation, capable enough to understand handwritten math from a shaky phone camera.
Things That Surprised Me
The vision capabilities are seriously good
I expected the model to struggle with handwritten math, especially messy student handwriting. It doesn't. I tested it with all kinds of problems: my cousin's scribbled algebra homework, a printed calculus worksheet, even a badly drawn chemistry diagram. It identified them correctly almost every time, and it copes when the camera is slightly angled or the lighting isn't great.
Natural interruptions just work
This was the feature I was most nervous about. In my head, I had this complex state machine planned out for handling when a student interrupts. Turns out, I didn't need any of it. The Live API's barge-in support means when the student starts talking, the model stops, listens, and responds. It's seamless.
The hardest part wasn't the AI
Honestly, the AI side was smoother than I expected, thanks to ADK and the Live API.
The hardest part was WebSocket audio streaming.
Browsers hate it. Microphone permissions break. Safari behaves weirdly. Classic web dev pain.
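One concrete piece of that pain: the browser's Web Audio API hands you Float32 samples at the hardware rate (often 48 kHz), while Gemini Live wants 16-bit PCM at 16 kHz. Here's a stdlib-only sketch of the conversion I mean — naive decimation with no low-pass filter, so treat it as illustrative rather than production audio code:

```python
import struct

def float32_to_pcm16k(samples, src_rate=48000, dst_rate=16000):
    """Convert Float32 samples in [-1.0, 1.0] to little-endian 16-bit
    PCM at dst_rate by naive decimation (keep every Nth sample).
    Real code should low-pass filter first to avoid aliasing."""
    step = src_rate // dst_rate  # 3 for 48 kHz -> 16 kHz
    out = bytearray()
    for i in range(0, len(samples), step):
        # clamp, then scale to the signed 16-bit range
        s = max(-1.0, min(1.0, samples[i]))
        out += struct.pack("<h", int(s * 32767))
    return bytes(out)
```

In practice this runs in an AudioWorklet on the frontend before the bytes ever hit the WebSocket, which is exactly where the browser quirks live.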
Google Cloud Deployment
I deployed the whole thing on Google Cloud:
- Backend runs on Cloud Run with Vertex AI integration
- Frontend is on Firebase Hosting
- CI/CD pipeline through GitHub Actions — push a tag and everything deploys automatically
The deploy workflow builds a Docker image, pushes it to GCR, deploys to Cloud Run, then builds the frontend with the new backend URL injected and deploys it to Firebase. The whole pipeline takes about 4 minutes, which is pretty nice for not having to think about deployments at all.
You can check out the full pipeline in the repo's GitHub Actions workflow.
What I'd Do Differently
If I had more time, I'd add:
- Drawing/annotation support — let Sol highlight parts of the image while explaining
- Progress tracking — track which topics the student struggles with over time
Try It Out
The project is open source: github.com/dev-phantom/VisionSolve
The README has full setup instructions if you want to run it locally. You'll need a Gemini API key (free from Google AI Studio) and a Firebase project.
If you're thinking about building something with the Gemini Live API, I'd say go for it. The combination of real-time audio + vision is genuinely different from anything else available right now. It opens up use cases that just weren't possible with traditional request-response APIs.
Built for the #GeminiLiveAgentChallenge using Google Gemini, ADK, Firebase, and Cloud Run.