<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jay</title>
    <description>The latest articles on DEV Community by Jay (@jayiscoding).</description>
    <link>https://dev.to/jayiscoding</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3828463%2Fcbb562a4-2e76-47de-a2ee-d03836741f69.jpeg</url>
      <title>DEV Community: Jay</title>
      <link>https://dev.to/jayiscoding</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jayiscoding"/>
    <language>en</language>
    <item>
      <title>Building a Real-Time AI Piano Coach with Gemini Live API</title>
      <dc:creator>Jay</dc:creator>
      <pubDate>Tue, 17 Mar 2026 03:44:28 +0000</pubDate>
      <link>https://dev.to/jayiscoding/building-a-real-time-ai-piano-coach-with-gemini-live-api-4h08</link>
      <guid>https://dev.to/jayiscoding/building-a-real-time-ai-piano-coach-with-gemini-live-api-4h08</guid>
      <description>&lt;h1&gt;
  
  
  Building a Real-Time AI Piano Coach with Gemini Live API
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Created for the Gemini Live Agent Challenge hackathon&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Piano Practice is Lonely
&lt;/h2&gt;

&lt;p&gt;Music practice has always been a solitary endeavor. You sit alone with your instrument, a metronome, and maybe a YouTube tutorial. Even modern piano apps feel like rhythm games -- they grade you on note accuracy but can't see your hands, can't hear your tone quality, and certainly can't have a conversation about your playing.&lt;/p&gt;

&lt;p&gt;When Google announced the &lt;strong&gt;Gemini Live Agent Challenge&lt;/strong&gt;, I saw the chance to build something different: an AI that doesn't just listen to notes, but truly &lt;em&gt;coaches&lt;/em&gt; -- seeing your hands, hearing your music, reading your MIDI data, and talking to you naturally in real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is PianoQuest Live?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PianoQuest Live&lt;/strong&gt; is a real-time multimodal AI piano coach powered by the Gemini 2.5 Flash Native Audio model. It processes three simultaneous input streams:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Voice and Audio (Microphone):&lt;/strong&gt; Talk to Gemini naturally -- ask questions about technique, request feedback, or just have a conversation about music theory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vision (Camera):&lt;/strong&gt; A phone or tablet camera watches your hands at the piano. Gemini sees finger position, wrist angle, and hand form in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MIDI Data (USB Piano):&lt;/strong&gt; Every note you play is captured with precise velocity (0-127) and timing data via the Web MIDI API, giving Gemini quantitative evidence to ground its coaching.&lt;/li&gt;
&lt;/ol&gt;
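&lt;p&gt;To make stream 3 concrete, here is a minimal sketch of decoding the raw three-byte messages the Web MIDI API delivers (the byte layout follows the MIDI spec; the function name is hypothetical, not the actual PianoQuest code):&lt;/p&gt;

```typescript
// Decode a raw Web MIDI message (status, note, velocity) into a note event.
// Status bytes 0x90..0x9F are note-on, 0x80..0x8F note-off; by MIDI
// convention, a note-on with velocity 0 also means note-off.
type NoteEvent = { kind: "on" | "off"; note: number; velocity: number };

function decodeMidiMessage(data: Uint8Array): NoteEvent | null {
  const status = data[0] - (data[0] % 16); // strip the channel nibble
  if (status === 0x90) {
    if (data[2] > 0) return { kind: "on", note: data[1], velocity: data[2] };
    return { kind: "off", note: data[1], velocity: 0 };
  }
  if (status === 0x80) {
    return { kind: "off", note: data[1], velocity: data[2] };
  }
  return null; // ignore control change, clock, and other messages
}
```

&lt;p&gt;In the browser this gets wired to &lt;code&gt;navigator.requestMIDIAccess()&lt;/code&gt; and each input port's &lt;code&gt;onmidimessage&lt;/code&gt; handler.&lt;/p&gt;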

&lt;p&gt;The magic is that Gemini processes all three streams simultaneously in a single Live API session, allowing it to correlate what it &lt;strong&gt;sees&lt;/strong&gt; (a tense wrist) with what the MIDI &lt;strong&gt;shows&lt;/strong&gt; (velocity dropping on those notes) and what it &lt;strong&gt;hears&lt;/strong&gt; (a loss of tonal clarity).&lt;/p&gt;




&lt;h2&gt;
  
  
  The Multi-Device Architecture
&lt;/h2&gt;

&lt;p&gt;One of the biggest technical challenges was that a piano player needs their hands free. You can't hold a phone and play piano at the same time. Our solution: a &lt;strong&gt;multi-device room system&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Desktop PC (primary)          Phone/Tablet (secondary)
|- Browser UI                 |- Camera -&amp;gt; JPEG 1fps
|- Web MIDI API               |- MediaPipe HandLandmarker
|- Mic (PCM 16kHz)            |- Mic (optional)
|- All visualizations
         |                              |
         |---- WebSocket Room ----------|
                      |
              Google Cloud Run
              (TypeScript/Express)
                      |
              Gemini Live API
              (native audio model)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Open PianoQuest Live on your &lt;strong&gt;desktop&lt;/strong&gt; -- it connects to Gemini and displays a room code + QR code.&lt;/li&gt;
&lt;li&gt;Scan the QR with your &lt;strong&gt;phone&lt;/strong&gt; -- it joins the same room and starts streaming camera + hand tracking.&lt;/li&gt;
&lt;li&gt;Play your &lt;strong&gt;USB MIDI piano&lt;/strong&gt; -- the desktop captures every note via Web MIDI API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Talk naturally&lt;/strong&gt; -- Gemini responds with voice, coaching tips, and technique observations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All devices share a single Gemini Live session. The server merges audio, video, and MIDI into one coherent stream that Gemini processes holistically.&lt;/p&gt;
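&lt;p&gt;Stripped of the WebSocket plumbing, the room logic can be sketched like this (names are hypothetical, not the actual implementation; the upstream callback stands in for the shared Gemini Live session):&lt;/p&gt;

```typescript
// Minimal sketch of a multi-device room registry. Each room owns one
// upstream "send" callback standing in for the shared Gemini Live
// session; any device in the room feeds that single session.
type Role = "desktop" | "phone";

interface Room {
  code: string;
  devices: { [id: string]: Role };
  upstream: (payload: string) => void; // stand-in for the Gemini session
}

const rooms: { [code: string]: Room } = {};

function createRoom(upstream: (payload: string) => void): string {
  // short code, easy to type or encode in a QR
  const code = Math.random().toString(36).slice(2, 6).toUpperCase();
  rooms[code] = { code, devices: {}, upstream };
  return code;
}

function joinRoom(code: string, deviceId: string, role: Role): boolean {
  const room = rooms[code];
  if (!room) return false;
  room.devices[deviceId] = role;
  return true;
}

function forward(code: string, payload: string): void {
  const room = rooms[code];
  if (room) room.upstream(payload); // all devices share one session
}
```

&lt;p&gt;In production each device's WebSocket messages would be tagged with the room code and routed through something like &lt;code&gt;forward()&lt;/code&gt;, so audio, video frames, and MIDI all converge on one upstream session.&lt;/p&gt;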




&lt;h2&gt;
  
  
  Tech Stack: Gemini + ADK + Cloud Run
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Model:&lt;/strong&gt; &lt;code&gt;gemini-2.5-flash-native-audio-preview&lt;/code&gt; -- the native audio model enables true real-time voice conversation with sub-second latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SDK:&lt;/strong&gt; &lt;code&gt;@google/genai&lt;/code&gt; for the Live API connection (&lt;code&gt;live.connect()&lt;/code&gt;) + &lt;code&gt;@google/adk&lt;/code&gt; for structured agent tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; TypeScript, Express, and &lt;code&gt;ws&lt;/code&gt; (WebSocket) -- handling multi-device rooms, audio routing, and Gemini session management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Vanilla HTML/JS with Web Audio API (PCM capture/playback), Web MIDI API (piano input), and Canvas (piano roll visualization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hand Tracking:&lt;/strong&gt; MediaPipe HandLandmarker running on-device on the phone -- 21 hand landmarks at 30fps, rendered as a skeleton overlay&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment:&lt;/strong&gt; Google Cloud Run via Cloud Build, with Docker containerization and WebSocket session affinity&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Agent Tools (Google ADK)
&lt;/h3&gt;

&lt;p&gt;We defined two structured tools that Gemini can call during conversation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;set_coaching_focus&lt;/code&gt;&lt;/strong&gt;: Updates a visible "coaching tip" card on the UI with an actionable suggestion (e.g., "Curve your 4th finger more on the G")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;report_technique&lt;/code&gt;&lt;/strong&gt;: Reports a correlated observation combining what the AI sees (EYE) with what the MIDI shows (EAR) -- bridging visual and digital feedback&lt;/li&gt;
&lt;/ul&gt;
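&lt;p&gt;As a rough sketch, the declarations look something like this in the &lt;code&gt;functionDeclarations&lt;/code&gt; shape the GenAI SDK accepts -- the parameter schemas here are illustrative, not the exact ones we ship:&lt;/p&gt;

```typescript
// Illustrative tool declarations for the Live API session config.
// The names match the article; the parameter schemas are assumptions.
const coachingTools = [
  {
    name: "set_coaching_focus",
    description: "Show an actionable coaching tip card in the UI.",
    parameters: {
      type: "object",
      properties: {
        tip: { type: "string", description: "One concrete suggestion" },
      },
      required: ["tip"],
    },
  },
  {
    name: "report_technique",
    description:
      "Report a correlated observation: what the camera shows (EYE) plus what the MIDI data shows (EAR).",
    parameters: {
      type: "object",
      properties: {
        eye: { type: "string", description: "Visual observation" },
        ear: { type: "string", description: "MIDI/audio evidence" },
      },
      required: ["eye", "ear"],
    },
  },
];
```

&lt;p&gt;When Gemini calls a tool mid-conversation, the server executes it (updating the UI card, logging the observation) and returns a tool response so the voice conversation continues uninterrupted.&lt;/p&gt;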




&lt;h2&gt;
  
  
  The Hardest Problems We Solved
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Echo Suppression Without Hardware
&lt;/h3&gt;

&lt;p&gt;When Gemini speaks through the laptop speaker, the microphone picks it up and sends it back -- creating an infinite echo loop. We implemented a multi-layer gate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;isBotSpeaking gate:&lt;/strong&gt; Completely mutes mic transmission while Gemini audio is playing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;turn_complete signal:&lt;/strong&gt; Only resets the gate when Gemini signals it's done speaking (not on individual audio chunk boundaries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech recognition grace period:&lt;/strong&gt; A 1.5-second buffer after Gemini finishes speaking before recognition results are processed again&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser echoCancellation:&lt;/strong&gt; The &lt;code&gt;getUserMedia&lt;/code&gt; &lt;code&gt;echoCancellation&lt;/code&gt; constraint enables the browser's built-in acoustic echo cancellation as a final safety net&lt;/li&gt;
&lt;/ul&gt;
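&lt;p&gt;The gate logic itself is small enough to sketch as a state machine (hypothetical names, with an injectable clock so the 1.5-second grace period is testable):&lt;/p&gt;

```typescript
// Sketch of the software echo gate. Mic audio is dropped while the
// bot is speaking and for graceMs after it signals turn_complete.
class EchoGate {
  private botSpeaking = false;
  private lastTurnCompleteAt = -Infinity;

  constructor(
    private graceMs: number,
    private now: () => number = () => Date.now(),
  ) {}

  onBotAudioChunk(): void {
    this.botSpeaking = true; // any incoming bot audio closes the gate
  }

  onTurnComplete(): void {
    // only turn_complete reopens the gate, never chunk boundaries
    this.botSpeaking = false;
    this.lastTurnCompleteAt = this.now();
  }

  micAllowed(): boolean {
    if (this.botSpeaking) return false;
    return this.now() - this.lastTurnCompleteAt >= this.graceMs;
  }
}
```

&lt;p&gt;Each Gemini audio chunk calls &lt;code&gt;onBotAudioChunk()&lt;/code&gt;, the server's turn-complete message calls &lt;code&gt;onTurnComplete()&lt;/code&gt;, and the mic capture loop drops frames whenever &lt;code&gt;micAllowed()&lt;/code&gt; is false.&lt;/p&gt;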

&lt;h3&gt;
  
  
  2. MIDI Context Without Triggering Responses
&lt;/h3&gt;

&lt;p&gt;Gemini needs continuous MIDI data to give informed feedback, but sending data shouldn't make Gemini start talking unprompted. We solved this using the Live API's &lt;code&gt;turnComplete&lt;/code&gt; flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Background context -- Gemini absorbs but doesn't respond&lt;/span&gt;
&lt;span class="nx"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sendClientContent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;turns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;midiSummary&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="na"&gt;turnComplete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;  &lt;span class="c1"&gt;// key: no response triggered&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the user actually asks "how was my playing?", Gemini already has the MIDI context and can give quantitative feedback like "Your tempo drifted +12% in bars 4-6" -- without having babbled over your playing the whole time.&lt;/p&gt;
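&lt;p&gt;How might that &lt;code&gt;midiSummary&lt;/code&gt; text get built? Here is a simplified sketch -- it assumes one note per beat, which real repertoire wouldn't, and the names are hypothetical:&lt;/p&gt;

```typescript
// Build a compact text summary of recent MIDI activity for Gemini to
// absorb as background context. Simplifying assumption: one note per
// beat, so tempo is estimated from mean inter-onset spacing.
type PlayedNote = { timeMs: number; note: number; velocity: number };

function summarizeMidi(notes: PlayedNote[], targetBpm: number): string {
  if (notes.length > 1) {
    const spanMs = notes[notes.length - 1].timeMs - notes[0].timeMs;
    const beatMs = spanMs / (notes.length - 1);
    const bpm = 60000 / beatMs;
    const driftPct = Math.round(((bpm - targetBpm) / targetBpm) * 100);
    const meanVel = Math.round(
      notes.reduce((sum, n) => sum + n.velocity, 0) / notes.length,
    );
    return (
      "MIDI: " + notes.length + " notes, mean velocity " + meanVel +
      ", tempo " + Math.round(bpm) + " bpm (" +
      (driftPct >= 0 ? "+" : "") + driftPct + "% vs target)"
    );
  }
  return "MIDI: too few notes to summarize";
}
```

&lt;p&gt;A summary like this is cheap to send every few seconds with &lt;code&gt;turnComplete: false&lt;/code&gt;, yet gives Gemini concrete numbers to quote back later.&lt;/p&gt;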

&lt;h3&gt;
  
  
  3. Real-Time Audio Streaming
&lt;/h3&gt;

&lt;p&gt;The Gemini Live API expects 16kHz PCM audio. We capture from the browser's &lt;code&gt;getUserMedia&lt;/code&gt;, downsample to 16kHz in a ScriptProcessorNode, and stream raw PCM chunks over WebSocket at roughly 100ms intervals. Gemini's responses come back as 24kHz PCM chunks that we decode and queue for gapless playback. The entire round trip feels conversational -- like talking to someone in the room.&lt;/p&gt;
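&lt;p&gt;The downsampling step can be sketched as a plain decimation-with-averaging pass -- assuming a 48kHz capture rate, so a 3:1 ratio; this is an illustration, not the exact production code:&lt;/p&gt;

```typescript
// Downsample Float32 capture buffers to 16-bit PCM by averaging each
// group of `ratio` samples (a crude low-pass) and scaling to Int16.
function downsampleTo16k(input: Float32Array, ratio: number): Int16Array {
  const outLength = Math.floor(input.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; outLength > i; i++) {
    let sum = 0;
    for (let j = 0; ratio > j; j++) {
      sum += input[i * ratio + j];
    }
    let sample = sum / ratio;
    // clamp to [-1, 1], then scale to the signed 16-bit range
    if (sample > 1) sample = 1;
    if (-1 > sample) sample = -1;
    out[i] = Math.round(sample * 32767);
  }
  return out;
}
```

&lt;p&gt;Each resulting &lt;code&gt;Int16Array&lt;/code&gt; buffer is what gets shipped over the WebSocket as a raw PCM chunk.&lt;/p&gt;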




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Live API is genuinely magical.&lt;/strong&gt; Being able to &lt;code&gt;sendRealtimeInput&lt;/code&gt; with both audio and JPEG frames, and have Gemini reason about them together in real time, feels like the future of AI interaction.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-device is hard but worth it.&lt;/strong&gt; Managing WebSocket rooms, role assignment, and synchronized audio across devices added significant complexity, but the result -- hands-free piano coaching -- is only possible with this architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native audio changes everything.&lt;/strong&gt; Previous approaches required speech-to-text then LLM then text-to-speech pipelines with cumulative latency. Gemini's native audio model handles the full loop in one step, making real-time conversation actually feel real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MIDI + Vision + Audio is the trifecta.&lt;/strong&gt; Any one input alone gives incomplete information. MIDI gives precise timing and velocity. Vision shows hand form and technique. Audio captures tonal quality. Together, they enable coaching that rivals a human teacher sitting next to you.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drill System:&lt;/strong&gt; Structured exercises with specific targets (tempo, dynamics) that Gemini monitors and scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On-Demand Recording:&lt;/strong&gt; Capture a passage, then get deep multimodal analysis correlating hand form with MIDI data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sheet Music Practice:&lt;/strong&gt; Notes flow toward you in a waterfall display -- play along while Gemini silently observes, then delivers a comprehensive end-of-piece analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical Tracking:&lt;/strong&gt; Compare sessions over time to identify persistent habits and measure improvement&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://pianoquest-live-604879855725.us-central1.run.app" rel="noopener noreferrer"&gt;https://pianoquest-live-604879855725.us-central1.run.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Jay-Network/PianoQuest-Live" rel="noopener noreferrer"&gt;https://github.com/Jay-Network/PianoQuest-Live&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PianoQuest Live is built for the Gemini Live Agent Challenge hackathon. It uses Google Cloud Run, the Gemini 2.5 Flash Native Audio model, the Google GenAI SDK, and the Google Agent Development Kit (ADK).&lt;/p&gt;


</description>
      <category>gemini</category>
      <category>ai</category>
      <category>googlecloud</category>
      <category>hackathon</category>
    </item>
  </channel>
</rss>
