DEV Community

Sumit231292

EduNova — Building a Real-Time AI Tutor That Sees & Speaks with Gemini Live API

This article was created as an entry to the *Gemini Live Agent Challenge* hackathon. #GeminiLiveAgentChallenge


The Problem Worth Solving

Every student deserves a patient, always-available tutor. But private tutoring costs $50–$150/hour and is completely out of reach for most families worldwide.

I kept asking myself: what if AI could replicate the experience of sitting next to a real tutor? Not a chatbot you type at — but one that sees your notebook, talks you through the problem, and responds in your own language.

When I discovered the Gemini Live API's native audio capabilities, I knew I could finally build it. That's how EduNova was born.


What EduNova Does

EduNova is a real-time, multimodal AI tutor where students can:

  • 🗣️ Speak naturally and get spoken responses — no text-to-speech lag, native audio via Gemini Live API
  • 📸 Point their camera at homework or upload an image — the tutor sees the problem and talks through it
  • 🌐 Learn in 20+ languages — Hindi, Spanish, French, and more
  • ✋ Interrupt anytime — just like a real conversation
  • 📚 Get structured help — practice problems, concept explanations, step-by-step walkthroughs

Subjects covered: Math, Physics, Chemistry, Biology, CS, Language Arts, and History.


Architecture: The "Sees & Speaks" Pipeline

The core insight was building a bidirectional streaming bridge that fuses voice and vision:

Browser (Mic + Camera)
        │ WebSocket (wss://)
        ▼
FastAPI + WebSocket Server (Cloud Run)
        │
        ├─► Gemini 2.5 Flash Native Audio  ◄── Voice in/out (Live API)
        │
        └─► Gemini 2.5 Flash Vision        ◄── Image analysis → injected as context

Here's the key architectural decision: the native audio model doesn't accept image input directly. So I built a hybrid pipeline:

  1. Audio flows through the Live API's native audio model for low-latency real-time conversation
  2. Camera frames go to a separate Gemini 2.5 Flash vision call
  3. The vision result is injected back into the live session as context text
  4. The student just sees a tutor that can both hear and see — seamlessly
# Simplified hybrid vision injection (google-genai SDK)
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

async def analyze_image_and_inject(session, image_bytes):
    # Vision model analyzes the camera frame
    vision_result = await client.aio.models.generate_content(
        model="gemini-2.5-flash",
        contents=[
            "Describe this homework problem in detail:",
            types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        ],
    )

    # Inject the description into the live audio session as context
    await session.send_client_content(
        turns=types.Content(
            role="user",
            parts=[types.Part(
                text=f"[Student just showed their homework: {vision_result.text}]"
            )],
        )
    )

Tech Stack

| Layer | Technology |
| --- | --- |
| AI Voice | Gemini 2.5 Flash Native Audio (Live API) |
| AI Vision | Gemini 2.5 Flash |
| Agent Framework | Google ADK (Agent Development Kit) |
| SDK | Google GenAI SDK (google-genai v1.x) |
| Backend | Python 3.12, FastAPI, uvicorn, WebSockets |
| Database | Google Cloud Firestore |
| Frontend | Vanilla HTML/CSS/JS |
| Infra | Cloud Run + Terraform + Cloud Build |

The Hardest Challenges

1. Audio Format Wrangling

Browsers output PCM audio at 48kHz (Float32). Gemini expects 16kHz (Int16). Getting this wrong gives you garbled audio or complete silence.

The resampling ratio is 48000 / 16000 = 3x downsampling. In practice this meant carefully converting the Float32 PCM stream from the browser's AudioWorklet, resampling to 16kHz, converting to Int16, and forwarding in real time over the WebSocket.
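A minimal sketch of that conversion, assuming NumPy on the server and simple average-of-3 decimation in place of a proper low-pass filter (`downsample_48k_to_16k` is a name invented for this sketch):

```python
import numpy as np

def downsample_48k_to_16k(float32_pcm: bytes) -> bytes:
    """Convert browser Float32 PCM at 48 kHz to Int16 PCM at 16 kHz."""
    samples = np.frombuffer(float32_pcm, dtype=np.float32)
    # Trim so the length divides evenly by the 3x decimation factor
    samples = samples[: len(samples) - len(samples) % 3]
    # Average each group of 3 samples -- a crude anti-aliasing filter
    downsampled = samples.reshape(-1, 3).mean(axis=1)
    # Scale [-1.0, 1.0] floats to signed 16-bit integers
    int16 = np.clip(downsampled * 32767, -32768, 32767).astype(np.int16)
    return int16.tobytes()
```

Averaging each group of three samples is only a rough stand-in for a real FIR low-pass filter, but it is enough to show the Float32 → Int16 and 48 kHz → 16 kHz plumbing.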

2. WebSocket Lifecycle Management

There are two async WebSocket connections to manage simultaneously:

  • Client ↔ Server: Browser's WebSocket to the FastAPI backend
  • Server ↔ Gemini: Live API session (a persistent streaming connection)

When either side disconnects, the other must be torn down gracefully, without leaking asyncio tasks or leaving Gemini Live sessions dangling. Getting the async teardown right with Python's asyncio took significant iteration.
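The teardown pattern can be sketched roughly like this, with `client_ws` and `gemini_session` standing in for the FastAPI WebSocket and the Live API session; the method names (`iter_bytes`, `send`, `receive`) are simplified placeholders for the real APIs:

```python
import asyncio

async def bridge(client_ws, gemini_session):
    """Pump audio both ways; tear down both sides when either disconnects."""
    async def client_to_gemini():
        # Browser mic audio -> Gemini Live session
        async for chunk in client_ws.iter_bytes():
            await gemini_session.send(chunk)

    async def gemini_to_client():
        # Gemini's spoken replies -> browser speakers
        async for audio in gemini_session.receive():
            await client_ws.send_bytes(audio)

    tasks = [asyncio.create_task(client_to_gemini()),
             asyncio.create_task(gemini_to_client())]
    try:
        # The first task to exit (disconnect or error) ends the bridge
        done, pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
    finally:
        # Close both ends so nothing is left dangling
        await gemini_session.close()
        await client_ws.close()
```

The key idea is `asyncio.wait(..., return_when=FIRST_COMPLETED)`: whichever pump dies first, the other is cancelled, and the `finally` block guarantees both connections close.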

3. Interruption Handling

When a student starts speaking while the tutor is mid-sentence, the experience must feel natural. This required:

  • Detecting incoming audio while outgoing audio is still streaming
  • Flushing the audio output buffer
  • Sending an interrupt signal to the Gemini Live session
  • Resuming in a coherent conversational state

Gemini's Live API handles much of this natively, but wiring it correctly through the WebSocket bridge took careful work.
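A rough sketch of the flush path, assuming the Live API surfaces an `interrupted` flag on server messages; the session and WebSocket objects, the `playback_queue`, and the `clear_audio` message type are all illustrative:

```python
import asyncio

async def pump_with_interrupts(gemini_session, client_ws,
                               playback_queue: asyncio.Queue):
    """Stream tutor audio to the browser, flushing on barge-in."""
    async for response in gemini_session.receive():
        if getattr(response.server_content, "interrupted", False):
            # Student started talking over the tutor:
            # drop every chunk still waiting to be streamed
            while not playback_queue.empty():
                playback_queue.get_nowait()
            # and tell the frontend to clear its playback buffer too
            await client_ws.send_json({"type": "clear_audio"})
            continue
        if response.data:
            await playback_queue.put(response.data)
```

Flushing both the server-side queue and the browser's buffer matters: if only one side is cleared, the student still hears a few seconds of stale tutor speech after interrupting.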


ADK Agent Tools

Beyond free-form conversation, I used Google ADK to give the tutor structured capabilities it can invoke mid-conversation:

from google.adk.agents import Agent

def generate_practice_problem(subject: str, topic: str, difficulty: str) -> dict:
    """Generate a practice problem for the student."""
    ...

def create_study_plan(subject: str, weak_areas: list[str], days: int) -> dict:
    """Create a personalized study plan."""
    ...

def check_solution(problem: str, student_answer: str) -> dict:
    """Evaluate the student's answer with detailed feedback."""
    ...

# ADK registers plain Python functions as callable tools
tutor_agent = Agent(
    name="edunova_tutor",
    model="gemini-2.5-flash",
    tools=[generate_practice_problem, create_study_plan, check_solution],
)

This means the tutor doesn't just chat — it can proactively generate targeted practice, build study plans, and evaluate solutions in a structured way.


What Worked Remarkably Well

Gemini's native audio quality was the biggest surprise. The latency is low enough that it genuinely feels conversational — not like talking to a voice assistant, but like talking to a person. The Socratic teaching style in the system prompt ("guide first, answer second") made the tutor feel pedagogically sound rather than like a homework-answer machine.

The hybrid vision approach works seamlessly from the student's perspective. They point the camera, the tutor says "I can see you have a quadratic equation here — let's work through it step by step." They have no idea two models are collaborating behind the scenes.


Deployment: One Command to Cloud Run

The entire deployment is automated via Terraform + Cloud Build:

# One-command deploy
./deploy/deploy.sh YOUR_PROJECT_ID us-central1

# Or with Terraform
terraform apply -var="project_id=YOUR_PROJECT_ID"

The Terraform config provisions: Cloud Run service, Firestore database, IAM roles, and all required APIs — fully reproducible infrastructure from scratch.


Try It Yourself

GitHub: https://github.com/Sumit231292/Gemini_AI_Tutor

git clone https://github.com/Sumit231292/Gemini_AI_Tutor.git
cd Gemini_AI_Tutor
pip install -r backend/requirements.txt

# Add your API key
echo "GOOGLE_API_KEY=your-key-here" > .env

# Run
cd backend && python -m uvicorn app.main:app --port 8000

Then open http://localhost:8000, create an account, pick a subject, and start talking!


What's Next

  • Real-time whiteboard — draw and solve math problems collaboratively
  • Progress tracking — session-to-session mastery tracking via Firestore
  • Curriculum alignment — map to Common Core / CBSE / ICSE standards
  • Google OAuth — one-click login
  • Multi-agent collaboration — specialized sub-agents per subject

Built with love using Google Gemini Live API · ADK · Google Cloud

#GeminiLiveAgentChallenge
