<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sumit231292</title>
    <description>The latest articles on DEV Community by Sumit231292 (@sumit231292).</description>
    <link>https://dev.to/sumit231292</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3825447%2F35f27c92-7393-465a-8121-6b97b3f4c85a.png</url>
      <title>DEV Community: Sumit231292</title>
      <link>https://dev.to/sumit231292</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sumit231292"/>
    <language>en</language>
    <item>
      <title>EduNova — Building a Real-Time AI Tutor That Sees &amp; Speaks with Gemini Live API</title>
      <dc:creator>Sumit231292</dc:creator>
      <pubDate>Sun, 15 Mar 2026 14:35:10 +0000</pubDate>
      <link>https://dev.to/sumit231292/edunova-building-a-real-time-ai-tutor-that-sees-speaks-with-gemini-live-api-4im3</link>
      <guid>https://dev.to/sumit231292/edunova-building-a-real-time-ai-tutor-that-sees-speaks-with-gemini-live-api-4im3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This article was created for the purpose of entering the &lt;strong&gt;Gemini Live Agent Challenge&lt;/strong&gt; hackathon. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Problem Worth Solving
&lt;/h2&gt;

&lt;p&gt;Every student deserves a patient, always-available tutor. But private tutoring typically costs $50–$150/hour, putting it out of reach for most families worldwide.&lt;/p&gt;

&lt;p&gt;I kept asking myself: &lt;strong&gt;what if AI could replicate the experience of sitting next to a real tutor?&lt;/strong&gt; Not a chatbot you type at — but one that &lt;em&gt;sees your notebook&lt;/em&gt;, &lt;em&gt;talks you through the problem&lt;/em&gt;, and responds in your own language.&lt;/p&gt;

&lt;p&gt;When I discovered the Gemini Live API's native audio capabilities, I knew I could finally build it. That's how &lt;strong&gt;EduNova&lt;/strong&gt; was born.&lt;/p&gt;




&lt;h2&gt;
  
  
  What EduNova Does
&lt;/h2&gt;

&lt;p&gt;EduNova is a real-time, multimodal AI tutor where students can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🗣️ &lt;strong&gt;Speak naturally&lt;/strong&gt; and get spoken responses — no text-to-speech lag, native audio via Gemini Live API&lt;/li&gt;
&lt;li&gt;📸 &lt;strong&gt;Point their camera&lt;/strong&gt; at homework or upload an image — the tutor &lt;em&gt;sees&lt;/em&gt; the problem and talks through it&lt;/li&gt;
&lt;li&gt;🌐 &lt;strong&gt;Learn in 20+ languages&lt;/strong&gt; — Hindi, Spanish, French, and more&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Interrupt anytime&lt;/strong&gt; — just like a real conversation&lt;/li&gt;
&lt;li&gt;📚 &lt;strong&gt;Get structured help&lt;/strong&gt; — practice problems, concept explanations, step-by-step walkthroughs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Subjects covered: Math, Physics, Chemistry, Biology, CS, Language Arts, and History.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture: The "Sees &amp;amp; Speaks" Pipeline
&lt;/h2&gt;

&lt;p&gt;The core insight was building a &lt;strong&gt;bidirectional streaming bridge&lt;/strong&gt; that fuses voice and vision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser (Mic + Camera)
        │ WebSocket (wss://)
        ▼
FastAPI + WebSocket Server (Cloud Run)
        │
        ├─► Gemini 2.5 Flash Native Audio  ◄── Voice in/out (Live API)
        │
        └─► Gemini 2.5 Flash Vision        ◄── Image analysis → injected as context
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the key architectural decision: &lt;strong&gt;the native audio model doesn't accept image input directly&lt;/strong&gt;. So I built a hybrid pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio flows through the &lt;strong&gt;Live API's native audio model&lt;/strong&gt; for low-latency real-time conversation&lt;/li&gt;
&lt;li&gt;Camera frames go to a separate &lt;strong&gt;Gemini 2.5 Flash vision call&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The vision result is injected back into the live session as context text&lt;/li&gt;
&lt;li&gt;The student just sees a tutor that can both hear and see — seamlessly
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified hybrid vision injection
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_image_and_inject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Vision model analyzes the image
&lt;/span&gt;    &lt;span class="n"&gt;vision_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;gemini_flash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Describe this homework problem in detail:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Inject into live audio session as context
&lt;/span&gt;    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Student just showed their homework: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;vision_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Voice&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash Native Audio (Live API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Vision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemini 2.5 Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Agent Framework&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google ADK (Agent Development Kit)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google GenAI SDK (&lt;code&gt;google-genai&lt;/code&gt; v1.x)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python 3.12, FastAPI, uvicorn, WebSockets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud Firestore&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vanilla HTML/CSS/JS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Infra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Run + Terraform + Cloud Build&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Hardest Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Audio Format Wrangling
&lt;/h3&gt;

&lt;p&gt;Browsers typically capture PCM audio at 48kHz (Float32). Gemini expects 16kHz (Int16). Getting this wrong gives you garbled audio or complete silence.&lt;/p&gt;

&lt;p&gt;Since 48000 / 16000 = 3, this is a 3:1 downsampling. In practice that meant taking the Float32 PCM stream from the browser's AudioWorklet, resampling it to 16kHz, converting it to Int16, and forwarding it in real time over the WebSocket.&lt;/p&gt;
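&lt;p&gt;As a minimal sketch of that conversion (not EduNova's actual code — a naive 3:1 decimation with simple averaging, assuming frame counts are multiples of 3; a production pipeline would use a proper low-pass filter):&lt;/p&gt;

```python
import struct

def downsample_48k_f32_to_16k_i16(samples: list[float]) -> bytes:
    """Convert Float32 PCM at 48kHz to little-endian Int16 PCM at 16kHz."""
    out = []
    for i in range(0, len(samples) - 2, 3):
        # Average each group of 3 samples as a crude anti-aliasing step
        s = (samples[i] + samples[i + 1] + samples[i + 2]) / 3.0
        # Clamp to [-1.0, 1.0], then scale to the Int16 range
        s = max(-1.0, min(1.0, s))
        out.append(int(s * 32767))
    # Pack as little-endian signed 16-bit integers
    return struct.pack(f"<{len(out)}h", *out)
```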

&lt;h3&gt;
  
  
  2. WebSocket Lifecycle Management
&lt;/h3&gt;

&lt;p&gt;There are &lt;em&gt;two&lt;/em&gt; async WebSocket connections to manage simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Client ↔ Server&lt;/strong&gt;: Browser's WebSocket to the FastAPI backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server ↔ Gemini&lt;/strong&gt;: Live API session (a persistent streaming connection)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When either side disconnects, the other must be cleaned up gracefully — without leaking sessions or leaving Gemini sessions dangling. Getting the async teardown right with Python's &lt;code&gt;asyncio&lt;/code&gt; took significant iteration.&lt;/p&gt;
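&lt;p&gt;The teardown pattern can be sketched roughly like this (illustrative names, not EduNova's actual code — the two pump coroutines and the connection objects are hypothetical; the key idea is that whichever direction finishes first tears down everything):&lt;/p&gt;

```python
import asyncio

async def bridge(client_ws, gemini_session, uplink, downlink):
    """Run both pump coroutines; clean up both ends when either one stops."""
    tasks = [asyncio.create_task(uplink), asyncio.create_task(downlink)]
    try:
        # The first pump to finish (client disconnect, Gemini error, EOF)
        # ends the whole bridge
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        # Wait out the cancellations without re-raising CancelledError
        await asyncio.gather(*pending, return_exceptions=True)
    finally:
        # Close both ends so neither a browser socket nor a Gemini
        # session is left dangling
        await gemini_session.close()
        await client_ws.close()
```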

&lt;h3&gt;
  
  
  3. Interruption Handling
&lt;/h3&gt;

&lt;p&gt;When a student starts speaking &lt;em&gt;while the tutor is mid-sentence&lt;/em&gt;, the experience must feel natural. This required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detecting incoming audio while outgoing audio is still streaming&lt;/li&gt;
&lt;li&gt;Flushing the audio output buffer&lt;/li&gt;
&lt;li&gt;Sending an interrupt signal to the Gemini Live session&lt;/li&gt;
&lt;li&gt;Resuming in a coherent conversational state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemini's Live API handles much of this natively, but wiring it correctly through the WebSocket bridge took careful work.&lt;/p&gt;
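&lt;p&gt;A simplified sketch of the server-side flush path (event shapes and field names here are illustrative, not the Live API's actual wire format): when an event flags an interruption, the bridge drops buffered tutor audio and tells the browser to flush its playback queue.&lt;/p&gt;

```python
import json

def handle_gemini_event(event: dict, playback_queue: list) -> list:
    """Turn one Gemini event into the WebSocket messages sent to the browser."""
    messages = []
    if event.get("interrupted"):
        # Student started talking over the tutor: drop buffered tutor audio
        # and tell the browser to flush whatever it has queued for playback
        playback_queue.clear()
        messages.append(json.dumps({"type": "flush_audio"}))
    for chunk in event.get("audio_chunks", []):
        # Normal case: queue and forward each outgoing audio chunk
        playback_queue.append(chunk)
        messages.append(json.dumps({"type": "audio", "data": chunk}))
    return messages
```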




&lt;h2&gt;
  
  
  ADK Agent Tools
&lt;/h2&gt;

&lt;p&gt;Beyond free-form conversation, I used &lt;strong&gt;Google ADK&lt;/strong&gt; to give the tutor structured capabilities it can invoke mid-conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_practice_problem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;difficulty&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a practice problem for the student.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_study_plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weak_areas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a personalized study plan.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_solution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;problem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;student_answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Evaluate the student&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s answer with detailed feedback.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the tutor doesn't just chat — it can proactively generate targeted practice, build study plans, and evaluate solutions in a structured way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Worked Remarkably Well
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gemini's native audio quality&lt;/strong&gt; was the biggest surprise. The latency is low enough that it genuinely feels conversational — not like talking to a voice assistant, but like talking to a person. The Socratic teaching style in the system prompt ("guide first, answer second") made the tutor pedagogically sound rather than just a homework answer machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hybrid vision approach&lt;/strong&gt; works seamlessly from the student's perspective. They point the camera, the tutor says "I can see you have a quadratic equation here — let's work through it step by step." They have no idea two models are collaborating behind the scenes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment: One Command to Cloud Run
&lt;/h2&gt;

&lt;p&gt;The entire deployment is automated via Terraform + Cloud Build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One-command deploy&lt;/span&gt;
./deploy/deploy.sh YOUR_PROJECT_ID us-central1

&lt;span class="c"&gt;# Or with Terraform&lt;/span&gt;
terraform apply &lt;span class="nt"&gt;-var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"project_id=YOUR_PROJECT_ID"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Terraform config provisions: Cloud Run service, Firestore database, IAM roles, and all required APIs — fully reproducible infrastructure from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Sumit231292/Gemini_AI_Tutor" rel="noopener noreferrer"&gt;https://github.com/Sumit231292/Gemini_AI_Tutor&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Sumit231292/Gemini_AI_Tutor.git
&lt;span class="nb"&gt;cd &lt;/span&gt;Gemini_AI_Tutor
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; backend/requirements.txt

&lt;span class="c"&gt;# Add your API key&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"GOOGLE_API_KEY=your-key-here"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env

&lt;span class="c"&gt;# Run&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;backend &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python &lt;span class="nt"&gt;-m&lt;/span&gt; uvicorn app.main:app &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;code&gt;http://localhost:8000&lt;/code&gt;, create an account, pick a subject, and start talking!&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time whiteboard&lt;/strong&gt; — draw and solve math problems collaboratively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Progress tracking&lt;/strong&gt; — session-to-session mastery tracking via Firestore&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curriculum alignment&lt;/strong&gt; — map to Common Core / CBSE / ICSE standards&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google OAuth&lt;/strong&gt; — one-click login&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent collaboration&lt;/strong&gt; — specialized sub-agents per subject&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Built with love using Google Gemini Live API · ADK · Google Cloud&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#GeminiLiveAgentChallenge&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>googleai</category>
      <category>gemini</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
