This article was created as an entry to the **Gemini Live Agent Challenge** hackathon. #GeminiLiveAgentChallenge
The Problem Worth Solving
Every student deserves a patient, always-available tutor. But private tutoring costs $50–$150/hour and is completely out of reach for most families worldwide.
I kept asking myself: what if AI could replicate the experience of sitting next to a real tutor? Not a chatbot you type at — but one that sees your notebook, talks you through the problem, and responds in your own language.
When I discovered the Gemini Live API's native audio capabilities, I knew I could finally build it. That's how EduNova was born.
What EduNova Does
EduNova is a real-time, multimodal AI tutor where students can:
- 🗣️ Speak naturally and get spoken responses — no text-to-speech lag, native audio via Gemini Live API
- 📸 Point their camera at homework or upload an image — the tutor sees the problem and talks through it
- 🌐 Learn in 20+ languages — Hindi, Spanish, French, and more
- ⚡ Interrupt anytime — just like a real conversation
- 📚 Get structured help — practice problems, concept explanations, step-by-step walkthroughs
Subjects covered: Math, Physics, Chemistry, Biology, CS, Language Arts, and History.
Architecture: The "Sees & Speaks" Pipeline
The core insight was building a bidirectional streaming bridge that fuses voice and vision:
```
Browser (Mic + Camera)
        │  WebSocket (wss://)
        ▼
FastAPI + WebSocket Server (Cloud Run)
        │
        ├─► Gemini 2.5 Flash Native Audio  ◄── Voice in/out (Live API)
        │
        └─► Gemini 2.5 Flash Vision        ◄── Image analysis → injected as context
```
Here's the key architectural decision: the native audio model doesn't accept image input directly. So I built a hybrid pipeline:
- Audio flows through the Live API's native audio model for low-latency real-time conversation
- Camera frames go to a separate Gemini 2.5 Flash vision call
- The vision result is injected back into the live session as context text
- The student just sees a tutor that can both hear and see — seamlessly
```python
# Simplified hybrid vision injection
async def analyze_image_and_inject(session, image_bytes):
    # Vision model analyzes the image
    vision_result = await gemini_flash.generate_content([
        "Describe this homework problem in detail:",
        Part.from_bytes(image_bytes, mime_type="image/jpeg"),
    ])
    # Inject into the live audio session as context
    await session.send(
        f"[Student just showed their homework: {vision_result.text}]"
    )
```
Tech Stack
| Layer | Technology |
|---|---|
| AI Voice | Gemini 2.5 Flash Native Audio (Live API) |
| AI Vision | Gemini 2.5 Flash |
| Agent Framework | Google ADK (Agent Development Kit) |
| SDK | Google GenAI SDK (google-genai v1.x) |
| Backend | Python 3.12, FastAPI, uvicorn, WebSockets |
| Database | Google Cloud Firestore |
| Frontend | Vanilla HTML/CSS/JS |
| Infra | Cloud Run + Terraform + Cloud Build |
The Hardest Challenges
1. Audio Format Wrangling
Browsers capture PCM audio at 48 kHz as Float32 samples. Gemini's Live API expects 16 kHz, 16-bit (Int16) PCM. Getting this wrong gives you garbled audio or complete silence.
The resampling ratio is 48000 / 16000 = 3x downsampling. In practice this meant carefully converting the Float32 PCM stream from the browser's AudioWorklet, resampling to 16kHz, converting to Int16, and forwarding in real time over the WebSocket.
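The conversion described above can be sketched as follows. This is a minimal illustration (naive decimation without an anti-aliasing filter, which a production pipeline would add before downsampling); the function name is mine, not from the EduNova codebase:

```python
import numpy as np

def downsample_float32_to_int16(samples: np.ndarray, ratio: int = 3) -> bytes:
    """Convert 48 kHz Float32 PCM (browser) to 16 kHz Int16 PCM (Gemini).

    Naive sketch: keep every `ratio`-th sample (48000 / 16000 = 3),
    clamp to [-1, 1], and scale to the Int16 range.
    """
    decimated = samples[::ratio]              # 48 kHz -> 16 kHz
    clipped = np.clip(decimated, -1.0, 1.0)   # guard against overflow
    return (clipped * 32767).astype(np.int16).tobytes()
```

Each resulting `bytes` chunk can then be forwarded over the WebSocket as it is produced, keeping the pipeline real-time.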
2. WebSocket Lifecycle Management
There are two async WebSocket connections to manage simultaneously:
- Client ↔ Server: Browser's WebSocket to the FastAPI backend
- Server ↔ Gemini: Live API session (a persistent streaming connection)
When either side disconnects, the other must be cleaned up gracefully — without leaking sessions or leaving Gemini sessions dangling. Getting the async teardown right with Python's asyncio took significant iteration.
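One asyncio pattern for this kind of paired teardown looks like the sketch below. It is illustrative, not the actual EduNova code: the two pump arguments stand in for the client→Gemini and Gemini→client forwarding loops, and `cleanup` stands in for closing both connections:

```python
import asyncio

async def bridge(pump_client_to_gemini, pump_gemini_to_client, cleanup):
    """Run both pump directions concurrently; when either finishes or
    fails, cancel the survivor and run cleanup exactly once."""
    tasks = [
        asyncio.create_task(pump_client_to_gemini),
        asyncio.create_task(pump_gemini_to_client),
    ]
    try:
        # The first side to disconnect (or raise) ends the bridge.
        _, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()  # stop the surviving direction
        # Drain cancellations so nothing is left dangling.
        await asyncio.gather(*tasks, return_exceptions=True)
    finally:
        await cleanup()  # close both connections exactly once
```

The `finally` block is what guarantees neither the browser socket nor the Gemini session is leaked, even if a pump raises.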
3. Interruption Handling
When a student starts speaking while the tutor is mid-sentence, the experience must feel natural. This required:
- Detecting incoming audio while outgoing audio is still streaming
- Flushing the audio output buffer
- Sending an interrupt signal to the Gemini Live session
- Resuming in a coherent conversational state
Gemini's Live API handles much of this natively, but wiring it correctly through the WebSocket bridge took careful work.
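The server-side buffer flush from the steps above can be sketched like this. This is an assumed design for illustration (class and method names are mine): queued tutor audio is dropped the moment student audio arrives, so stale speech never reaches the browser, while the Live API handles the model-side interrupt:

```python
import asyncio

class OutputAudioBuffer:
    """Sketch of barge-in handling: outgoing tutor audio is queued,
    and the queue is flushed when the student starts speaking."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    def enqueue(self, chunk: bytes) -> None:
        """Queue a chunk of tutor speech for delivery to the browser."""
        self._queue.put_nowait(chunk)

    def flush(self) -> int:
        """Drop all pending chunks; returns how many were discarded."""
        dropped = 0
        while not self._queue.empty():
            self._queue.get_nowait()
            dropped += 1
        return dropped
```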
ADK Agent Tools
Beyond free-form conversation, I used Google ADK to give the tutor structured capabilities it can invoke mid-conversation:
```python
@tool
def generate_practice_problem(subject: str, topic: str, difficulty: str) -> dict:
    """Generate a practice problem for the student."""
    ...

@tool
def create_study_plan(subject: str, weak_areas: list[str], days: int) -> dict:
    """Create a personalized study plan."""
    ...

@tool
def check_solution(problem: str, student_answer: str) -> dict:
    """Evaluate the student's answer with detailed feedback."""
    ...
```
This means the tutor doesn't just chat — it can proactively generate targeted practice, build study plans, and evaluate solutions in a structured way.
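To make "structured" concrete, here is a hypothetical sketch of what a tool like `check_solution` might return. It is not the real implementation (which would ask Gemini for feedback), and it adds a `correct_answer` parameter the real tool doesn't have, purely so the example is self-contained:

```python
def check_solution_demo(problem: str, student_answer: str, correct_answer: str) -> dict:
    """Hypothetical grading sketch: compare numerically when possible,
    otherwise case-insensitively, and return a structured result the
    agent can speak from."""
    try:
        is_correct = abs(float(student_answer) - float(correct_answer)) < 1e-6
    except ValueError:
        is_correct = student_answer.strip().lower() == correct_answer.strip().lower()
    return {
        "problem": problem,
        "correct": is_correct,
        "feedback": "Great work!" if is_correct
                    else "Not quite — walk back through your last step.",
    }
```

The point is the shape of the return value: because the tool yields a dict rather than free text, the agent can reliably branch on `correct` and weave `feedback` into its spoken response.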
What Worked Remarkably Well
Gemini's native audio quality was the biggest surprise. The latency is low enough that it genuinely feels conversational — not like talking to a voice assistant, but like talking to a person. The Socratic teaching style in the system prompt ("guide first, answer second") made the tutor feel pedagogically sound, not just a homework answer machine.
The hybrid vision approach works seamlessly from the student's perspective. They point the camera, the tutor says "I can see you have a quadratic equation here — let's work through it step by step." They have no idea two models are collaborating behind the scenes.
Deployment: One Command to Cloud Run
The entire deployment is automated via Terraform + Cloud Build:
```bash
# One-command deploy
./deploy/deploy.sh YOUR_PROJECT_ID us-central1

# Or with Terraform
terraform apply -var="project_id=YOUR_PROJECT_ID"
```
The Terraform config provisions: Cloud Run service, Firestore database, IAM roles, and all required APIs — fully reproducible infrastructure from scratch.
Try It Yourself
GitHub: https://github.com/Sumit231292/Gemini_AI_Tutor
```bash
git clone https://github.com/Sumit231292/Gemini_AI_Tutor.git
cd Gemini_AI_Tutor
pip install -r backend/requirements.txt

# Add your API key
echo "GOOGLE_API_KEY=your-key-here" > .env

# Run
cd backend && python -m uvicorn app.main:app --port 8000
```
Then open http://localhost:8000, create an account, pick a subject, and start talking!
What's Next
- Real-time whiteboard — draw and solve math problems collaboratively
- Progress tracking — session-to-session mastery tracking via Firestore
- Curriculum alignment — map to Common Core / CBSE / ICSE standards
- Google OAuth — one-click login
- Multi-agent collaboration — specialized sub-agents per subject
Built with love using Google Gemini Live API · ADK · Google Cloud
#GeminiLiveAgentChallenge