<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Perill</title>
    <description>The latest articles on DEV Community by Perill (@periculousmerin).</description>
    <link>https://dev.to/periculousmerin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3805779%2F37b7e2a3-4968-4240-9072-83190644d741.png</url>
      <title>DEV Community: Perill</title>
      <link>https://dev.to/periculousmerin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/periculousmerin"/>
    <language>en</language>
    <item>
      <title>Building ARCHITECT: Real-Time AI Interior Design with Gemini Live API + Google ADK</title>
      <dc:creator>Perill</dc:creator>
      <pubDate>Thu, 05 Mar 2026 07:26:43 +0000</pubDate>
      <link>https://dev.to/periculousmerin/building-architect-real-time-ai-interior-design-with-gemini-live-api-google-adk-2mp5</link>
      <guid>https://dev.to/periculousmerin/building-architect-real-time-ai-interior-design-with-gemini-live-api-google-adk-2mp5</guid>
      <description>&lt;p&gt;&lt;em&gt;This post was created for the purposes of entering the &lt;a href="https://geminiliveagentchallenge.devpost.com/" rel="noopener noreferrer"&gt;Gemini Live Agent Challenge&lt;/a&gt; hackathon. #GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ARCHITECT&lt;/strong&gt; is a real-time AI interior design assistant. You point your phone camera at any room, talk to the agent naturally, and it generates photorealistic redesigns — all in real-time, all through voice.&lt;/p&gt;

&lt;p&gt;The core premise: what if you had a talented interior designer who could literally &lt;em&gt;see&lt;/em&gt; your room, understand your style preferences from a conversation, and instantly show you a reimagined version? That's ARCHITECT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Antimatter543/architect" rel="noopener noreferrer"&gt;https://github.com/Antimatter543/architect&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gemini Live API Was the Right Choice
&lt;/h2&gt;

&lt;p&gt;Most AI voice assistants are turn-based: you speak, you wait, it responds. Gemini's Live API is different — it's a persistent bidirectional stream where audio, video frames, and tool calls all flow simultaneously. This enabled an interaction pattern that wasn't possible before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User walks through their living room while talking&lt;/li&gt;
&lt;li&gt;Agent &lt;em&gt;sees&lt;/em&gt; the room continuously (camera frames streamed at 1fps)&lt;/li&gt;
&lt;li&gt;Agent calls &lt;code&gt;analyze_room()&lt;/code&gt; to capture spatial data while still listening&lt;/li&gt;
&lt;li&gt;User says "make it Japandi" mid-sentence&lt;/li&gt;
&lt;li&gt;Agent immediately starts generating a redesign image while responding vocally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The single WebSocket carries everything: 16kHz PCM audio in, 24kHz PCM audio out, JPEG frames in, JSON events, and binary image payloads out. There's no "please hold while I process" — it's genuinely live.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Backend: FastAPI + Google ADK
&lt;/h3&gt;

&lt;p&gt;The agent is built with Google's ADK (&lt;code&gt;LlmAgent&lt;/code&gt;) wrapping Gemini 2.0 Flash Live as the underlying model. ADK handles the agent loop; Gemini handles multimodal understanding and tool call orchestration.&lt;/p&gt;

&lt;p&gt;Five &lt;code&gt;FunctionTool&lt;/code&gt; instances hang off the agent (three shown here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@FunctionTool&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_room&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;str, style_tags: list[str]) -&amp;gt; dict:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Analyze the visible room and extract spatial/design data.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Stores analysis to Firestore, namespaced by user_id
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@FunctionTool&lt;/span&gt;  
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_redesign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room_analysis_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate a photorealistic redesign using Imagen 3.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Calls gemini-2.0-flash-exp-image-generation
&lt;/span&gt;    &lt;span class="c1"&gt;# Uploads to Cloud Storage, returns public URL
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@FunctionTool&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_furniture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;style&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find matching furniture with prices from real retailers.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ADK's docstring-based schema inference is underrated — you write a clear docstring and it generates the JSON schema for tool calling automatically. No manual &lt;code&gt;tools&lt;/code&gt; array.&lt;/p&gt;
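&lt;p&gt;As a rough illustration of what that inference produces (hypothetical sketch — the exact field names ADK emits internally may differ), a tool like &lt;code&gt;search_furniture&lt;/code&gt; maps to a Gemini-style function declaration built from nothing but its name, docstring, and parameter types:&lt;/p&gt;

```python
# Hypothetical sketch of docstring-based schema inference: build a
# Gemini-style function declaration from a tool's metadata. The real
# ADK machinery does this automatically; field names here are illustrative.
def infer_declaration(name: str, doc: str, params: dict) -> dict:
    """Derive a JSON-schema-shaped declaration from tool metadata."""
    return {
        "name": name,
        "description": doc,
        "parameters": {
            "type": "object",
            "properties": {p: {"type": t} for p, t in params.items()},
            "required": list(params),
        },
    }

decl = infer_declaration(
    "search_furniture",
    "Find matching furniture with prices from real retailers.",
    {"style": "string", "room_type": "string"},
)
```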

&lt;h3&gt;
  
  
  The WebSocket Protocol
&lt;/h3&gt;

&lt;p&gt;The interesting architectural detail is the binary framing. Everything goes over one WebSocket:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[JSON header bytes] [0x00 null byte] [payload bytes]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For audio frames: header is &lt;code&gt;{"type":"audio"}&lt;/code&gt;, payload is raw PCM.&lt;br&gt;
For camera frames: header is &lt;code&gt;{"type":"frame"}&lt;/code&gt;, payload is JPEG bytes.&lt;br&gt;
For server-to-client audio: same protocol in reverse.&lt;/p&gt;

&lt;p&gt;This lets the frontend handle audio, video, and events all in one &lt;code&gt;onmessage&lt;/code&gt; handler without multiplexing connections.&lt;/p&gt;
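&lt;p&gt;A minimal sketch of that framing in Python (helper names are hypothetical; the actual implementation lives in the repo). The header JSON never contains a raw null byte, so splitting on the first &lt;code&gt;0x00&lt;/code&gt; is unambiguous even when the payload itself contains nulls:&lt;/p&gt;

```python
import json

def encode_frame(header: dict, payload: bytes) -> bytes:
    """Pack [JSON header bytes][0x00 null byte][payload bytes] into one message."""
    return json.dumps(header).encode("utf-8") + b"\x00" + payload

def decode_frame(message: bytes) -> tuple[dict, bytes]:
    """Split on the first null byte: JSON header before it, raw payload after."""
    sep = message.index(0)
    header = json.loads(message[:sep].decode("utf-8"))
    return header, message[sep + 1:]

# Round-trip an audio frame: header tags the type, payload is raw PCM bytes.
msg = encode_frame({"type": "audio"}, b"\x01\x02\x00\x03")
header, payload = decode_frame(msg)
```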
&lt;h3&gt;
  
  
  Frontend: React + AudioWorklets
&lt;/h3&gt;

&lt;p&gt;The audio pipeline was the most technically demanding piece. The browser captures microphone audio at 48kHz; Gemini expects 16kHz PCM. The playback side does the reverse: 24kHz → 48kHz.&lt;/p&gt;

&lt;p&gt;Both conversions run in AudioWorklets — dedicated audio threads that don't block the main thread. This keeps the UI responsive while audio streams continuously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBNSUNbIvCfjqQgTWljXG40OGtIeiJdIC0tPnwxMjggc2FtcGxlcy9jaHVua3wgQ1dbIkNhcHR1cmVXb3JrbGV0XG4zOjEgZG93bnNhbXBsZSJdCiAgICBDVyAtLT58MTZrSHogUENNfCBXUzFbIldlYlNvY2tldCJdCiAgICBXUzEgLS0%2BIEdbIkdlbWluaSBMaXZlIEFQSSJdCiAgICBHIC0tPnwyNGtIeiBQQ018IFdTMlsiV2ViU29ja2V0Il0KICAgIFdTMiAtLT4gUFdbIlBsYXliYWNrV29ya2xldFxuMToyIHVwc2FtcGxlIl0KICAgIFBXIC0tPnw0OGtIeiBQQ018IFNQS1si8J%2BUiiBTcGVha2VyIl0%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBNSUNbIvCfjqQgTWljXG40OGtIeiJdIC0tPnwxMjggc2FtcGxlcy9jaHVua3wgQ1dbIkNhcHR1cmVXb3JrbGV0XG4zOjEgZG93bnNhbXBsZSJdCiAgICBDVyAtLT58MTZrSHogUENNfCBXUzFbIldlYlNvY2tldCJdCiAgICBXUzEgLS0%2BIEdbIkdlbWluaSBMaXZlIEFQSSJdCiAgICBHIC0tPnwyNGtIeiBQQ018IFdTMlsiV2ViU29ja2V0Il0KICAgIFdTMiAtLT4gUFdbIlBsYXliYWNrV29ya2xldFxuMToyIHVwc2FtcGxlIl0KICAgIFBXIC0tPnw0OGtIeiBQQ018IFNQS1si8J%2BUiiBTcGVha2VyIl0%3D" alt="Audio Pipeline: Mic 48kHz → CaptureWorklet 3:1 downsample → 16kHz PCM → WebSocket → Gemini → 24kHz PCM → PlaybackWorklet → Speaker 48kHz" width="1904" height="90"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Capture worklet: 48kHz → 16kHz downsampling&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CaptureProcessor&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AudioWorkletProcessor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// 128 samples at 48kHz&lt;/span&gt;
    &lt;span class="c1"&gt;// Downsample 3:1 with simple averaging&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;downsampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Google Cloud Services Used
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Run&lt;/strong&gt; — hosts the FastAPI backend with session affinity (essential for persistent WebSocket connections — without &lt;code&gt;--session-affinity&lt;/code&gt;, load balancing breaks them)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Storage&lt;/strong&gt; — stores Imagen 3-generated redesign images, served via public URLs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Firestore&lt;/strong&gt; — persists room analyses, design history, and shopping lists per user session&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Build&lt;/strong&gt; — automated deployment pipeline in &lt;code&gt;deploy/cloudbuild.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret Manager&lt;/strong&gt; — stores API keys and Auth0 credentials, injected into Cloud Run at deploy time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deployment is fully automated — one &lt;code&gt;gcloud builds submit&lt;/code&gt; command builds the Docker image, pushes it to Container Registry, deploys to Cloud Run with all secrets wired in, and builds + deploys the React frontend to a Cloud Storage static site.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gemini Live API's multimodal simultaneity is genuinely new.&lt;/strong&gt; Most voice APIs handle audio only. Most vision APIs are stateless image uploads. Gemini Live lets you send audio &lt;em&gt;and&lt;/em&gt; video frames &lt;em&gt;and&lt;/em&gt; receive audio &lt;em&gt;and&lt;/em&gt; trigger tool calls &lt;em&gt;all in the same session&lt;/em&gt;. The design space this opens up is significant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADK's &lt;code&gt;FunctionTool&lt;/code&gt; pattern is clean.&lt;/strong&gt; The docstring → JSON schema inference means your tool documentation &lt;em&gt;is&lt;/em&gt; your tool definition. There's no separate schema to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud Run session affinity is not optional for WebSocket apps.&lt;/strong&gt; The first deployment worked fine locally but broke in production because Cloud Run was load-balancing across instances mid-session. The &lt;code&gt;--session-affinity&lt;/code&gt; flag fixed it, though it's easy to miss in the docs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AudioWorklet precision matters for speech recognition.&lt;/strong&gt; Naive downsampling (taking every Nth sample) introduced aliasing artifacts that degraded Gemini's speech recognition. Averaging each group of 3 input samples into one output sample, which acts as a crude low-pass filter, made a noticeable difference.&lt;/p&gt;
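&lt;p&gt;A standalone demonstration of why (synthetic example, not from the codebase): a 20 kHz tone sits above the 8 kHz Nyquist limit of the 16 kHz output, so ideally it should vanish after downsampling. Naive decimation folds it back into band at nearly full strength; 3-sample averaging attenuates it substantially:&lt;/p&gt;

```python
# Compare naive decimation vs 3-sample averaging for 48 kHz -> 16 kHz.
# The 20 kHz test tone is above the output's 8 kHz Nyquist frequency,
# so any energy surviving the downsample is aliasing.
import math

SRC_RATE, FACTOR = 48_000, 3
tone = [math.sin(2 * math.pi * 20_000 * n / SRC_RATE) for n in range(4800)]

# Naive: keep every 3rd sample — the tone aliases to 4 kHz at full amplitude.
decimated = tone[::FACTOR]

# Averaging: each output sample is the mean of 3 inputs — a crude low-pass
# filter that strongly attenuates the out-of-band tone before decimation.
averaged = [
    sum(tone[i:i + FACTOR]) / FACTOR
    for i in range(0, len(tone) - FACTOR + 1, FACTOR)
]

def rms(xs):
    """Root-mean-square amplitude: how much signal energy survived."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))
```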

&lt;h2&gt;
  
  
  Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBzdWJncmFwaCBCcm93c2VyWyJCcm93c2VyIC0gUmVhY3QgKyBWaXRlIl0KICAgICAgICBBQ1sidXNlQXVkaW9DYXB0dXJlXG40OGtIeiB0byAxNmtIeiJdCiAgICAgICAgQVBbInVzZUF1ZGlvUGxheWJhY2tcbjI0a0h6IHRvIDQ4a0h6Il0KICAgICAgICBDQ1sidXNlQ2FtZXJhQ2FwdHVyZVxuMWZwcyBKUEVHIl0KICAgIGVuZAogICAgV1NbIldlYlNvY2tldCAvd3Mve3Nlc3Npb25faWR9Il0KICAgIHN1YmdyYXBoIENSWyJGYXN0QVBJIG9uIENsb3VkIFJ1biJdCiAgICAgICAgSldUWyJKV1QgdmVyaWZ5IEF1dGgwIl0KICAgICAgICBBU1siQURLIExsbUFnZW50XG5HZW1pbmkgMi4wIEZsYXNoIExpdmUiXQogICAgICAgIFQxWyJhbmFseXplX3Jvb20iXQogICAgICAgIFQyWyJnZW5lcmF0ZV9yZWRlc2lnbiJdCiAgICAgICAgVDNbInNlYXJjaF9mdXJuaXR1cmUiXQogICAgZW5kCiAgICBGU1soIkZpcmVzdG9yZSIpXQogICAgR0NTWygiQ2xvdWQgU3RvcmFnZSIpXQogICAgSU1HWyJJbWFnZW4gMyJdCiAgICBCcm93c2VyIC0tPnxhdXRoICsgYXVkaW8gKyB2aWRlb3wgV1MKICAgIFdTIC0tPiBKV1QgLS0%2BIEFTCiAgICBBUyAtLT4gVDEgJiBUMiAmIFQzCiAgICBUMSAtLT4gRlMKICAgIFQyIC0tPiBJTUcgLS0%2BIEdDUwogICAgQVMgLS0%2BfGF1ZGlvICsgZXZlbnRzICsgaW1hZ2VzfCBXUwogICAgV1MgLS0%2BIEJyb3dzZXI%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBzdWJncmFwaCBCcm93c2VyWyJCcm93c2VyIC0gUmVhY3QgKyBWaXRlIl0KICAgICAgICBBQ1sidXNlQXVkaW9DYXB0dXJlXG40OGtIeiB0byAxNmtIeiJdCiAgICAgICAgQVBbInVzZUF1ZGlvUGxheWJhY2tcbjI0a0h6IHRvIDQ4a0h6Il0KICAgICAgICBDQ1sidXNlQ2FtZXJhQ2FwdHVyZVxuMWZwcyBKUEVHIl0KICAgIGVuZAogICAgV1NbIldlYlNvY2tldCAvd3Mve3Nlc3Npb25faWR9Il0KICAgIHN1YmdyYXBoIENSWyJGYXN0QVBJIG9uIENsb3VkIFJ1biJdCiAgICAgICAgSldUWyJKV1QgdmVyaWZ5IEF1dGgwIl0KICAgICAgICBBU1siQURLIExsbUFnZW50XG5HZW1pbmkgMi4wIEZsYXNoIExpdmUiXQogICAgICAgIFQxWyJhbmFseXplX3Jvb20iXQogICAgICAgIFQyWyJnZW5lcmF0ZV9yZWRlc2lnbiJdCiAgICAgICAgVDNbInNlYXJjaF9mdXJuaXR1cmUiXQogICAgZW5kCiAgICBGU1soIkZpcmVzdG9yZSIpXQogICAgR0NTWygiQ2xvdWQgU3RvcmFnZSIpXQogICAgSU1HWyJJbWFnZW4gMyJdCiAgICBCcm93c2VyIC0tPnxhdXRoICsgYXVkaW8gKyB2aWRlb3wgV1MKICAgIFdTIC0tPiBKV1QgLS0%2BIEFTCiAgICBBUyAtLT4gVDEgJiBUMiAmIFQzCiAgICBUMSAtLT4gRlMKICAgIFQyIC0tPiBJTUcgLS0%2BIEdDUwogICAgQVMgLS0%2BfGF1ZGlvICsgZXZlbnRzICsgaW1hZ2VzfCBXUwogICAgV1MgLS0%2BIEJyb3dzZXI%3D" alt="System Architecture: Browser React hooks → WebSocket → FastAPI Cloud Run → ADK LlmAgent → Gemini Live → Firestore + Cloud Storage" width="716" height="1201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  WebSocket Auth + Data Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBGRSBhcyBGcm9udGVuZAogICAgcGFydGljaXBhbnQgV1MgYXMgV2ViU29ja2V0CiAgICBwYXJ0aWNpcGFudCBCRSBhcyBGYXN0QVBJK0FESwogICAgcGFydGljaXBhbnQgRyBhcyBHZW1pbmkgTGl2ZQogICAgRkUtPj5XUzogY29ubmVjdAogICAgRkUtPj5XUzogYXV0aCBKV1QgdG9rZW4KICAgIFdTLT4%2BQkU6IHZlcmlmeSB2aWEgQXV0aDAgSldLUwogICAgQkUtLT4%2BRkU6IHNlc3Npb25fcmVhZHkKICAgIGxvb3AgTGl2ZSBTZXNzaW9uCiAgICAgICAgRkUtPj5XUzogUENNIGF1ZGlvIGNodW5rCiAgICAgICAgRkUtPj5XUzogSlBFRyBjYW1lcmEgZnJhbWUKICAgICAgICBXUy0%2BPkc6IGF1ZGlvICsgdmlkZW8gc3RyZWFtCiAgICAgICAgRy0tPj5CRTogdG9vbF9jYWxsIGFuYWx5emVfcm9vbQogICAgICAgIEJFLS0%2BPkc6IHJvb20gYW5hbHlzaXMKICAgICAgICBHLS0%2BPldTOiBhdWRpbyByZXNwb25zZSBQQ00KICAgICAgICBXUy0tPj5GRTogUENNIGF1ZGlvCiAgICAgICAgRy0tPj5CRTogdG9vbF9jYWxsIGdlbmVyYXRlX3JlZGVzaWduCiAgICAgICAgQkUtLT4%2BV1M6IGltYWdlIFVSTCBldmVudAogICAgICAgIFdTLS0%2BPkZFOiBpbWFnZSBkaXNwbGF5ZWQKICAgIGVuZA%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fc2VxdWVuY2VEaWFncmFtCiAgICBwYXJ0aWNpcGFudCBGRSBhcyBGcm9udGVuZAogICAgcGFydGljaXBhbnQgV1MgYXMgV2ViU29ja2V0CiAgICBwYXJ0aWNpcGFudCBCRSBhcyBGYXN0QVBJK0FESwogICAgcGFydGljaXBhbnQgRyBhcyBHZW1pbmkgTGl2ZQogICAgRkUtPj5XUzogY29ubmVjdAogICAgRkUtPj5XUzogYXV0aCBKV1QgdG9rZW4KICAgIFdTLT4%2BQkU6IHZlcmlmeSB2aWEgQXV0aDAgSldLUwogICAgQkUtLT4%2BRkU6IHNlc3Npb25fcmVhZHkKICAgIGxvb3AgTGl2ZSBTZXNzaW9uCiAgICAgICAgRkUtPj5XUzogUENNIGF1ZGlvIGNodW5rCiAgICAgICAgRkUtPj5XUzogSlBFRyBjYW1lcmEgZnJhbWUKICAgICAgICBXUy0%2BPkc6IGF1ZGlvICsgdmlkZW8gc3RyZWFtCiAgICAgICAgRy0tPj5CRTogdG9vbF9jYWxsIGFuYWx5emVfcm9vbQogICAgICAgIEJFLS0%2BPkc6IHJvb20gYW5hbHlzaXMKICAgICAgICBHLS0%2BPldTOiBhdWRpbyByZXNwb25zZSBQQ00KICAgICAgICBXUy0tPj5GRTogUENNIGF1ZGlvCiAgICAgICAgRy0tPj5CRTogdG9vbF9jYWxsIGdlbmVyYXRlX3JlZGVzaWduCiAgICAgICAgQkUtLT4%2BV1M6IGltYWdlIFVSTCBldmVudAogICAgICAgIFdTLS0%2BPkZFOiBpbWFnZSBkaXNwbGF5ZWQKICAgIGVuZA%3D%3D" alt="WebSocket Auth + Data Flow sequence: Frontend connects, sends JWT, backend verifies, live session with audio/video/tool calls" width="917" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Backend&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;backend
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# fill in GOOGLE_API_KEY, etc.&lt;/span&gt;
uvicorn app.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt; &lt;span class="nt"&gt;--port&lt;/span&gt; 8080

&lt;span class="c"&gt;# Frontend  &lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# fill in VITE_AUTH0_* vars&lt;/span&gt;
npm run dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud deployment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One-time setup&lt;/span&gt;
bash deploy/setup.sh

&lt;span class="c"&gt;# Deploy&lt;/span&gt;
gcloud builds submit &lt;span class="nt"&gt;--config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;deploy/cloudbuild.yaml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--substitutions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;_AUTH0_CLIENT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-client-id &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;ARCHITECT is a submission for the &lt;strong&gt;Gemini Live Agent Challenge&lt;/strong&gt; — building agents that truly see, hear, and create in real-time. The full source is at &lt;a href="https://github.com/Antimatter543/architect" rel="noopener noreferrer"&gt;https://github.com/Antimatter543/architect&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;#GeminiLiveAgentChallenge&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>ai</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Building an AI Synesthesia Engine with Gemini Live API and ADK</title>
      <dc:creator>Perill</dc:creator>
      <pubDate>Wed, 04 Mar 2026 12:26:00 +0000</pubDate>
      <link>https://dev.to/periculousmerin/building-an-ai-synesthesia-engine-with-gemini-live-api-and-adk-2a22</link>
      <guid>https://dev.to/periculousmerin/building-an-ai-synesthesia-engine-with-gemini-live-api-and-adk-2a22</guid>
      <description>&lt;p&gt;How we built MUSE, a real-time multimodal agent that translates between senses using Gemini 2.5 Flash Native Audio, ADK multi-agent orchestration, and some surprisingly tricky WebSocket plumbing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Idea: Synesthesia as an AI Paradigm
&lt;/h2&gt;

&lt;p&gt;Synesthesia is a neurological condition where stimulation of one sense automatically triggers another. A synesthete might hear colors, see sounds, or taste shapes. For those who have it, it's involuntary, poetic, and hard to explain. For an AI that processes multiple modalities simultaneously, it should be native.&lt;/p&gt;

&lt;p&gt;That realization was the seed of MUSE, the Multimodal Synesthetic Experience Engine.&lt;/p&gt;

&lt;p&gt;The premise: instead of asking an AI to describe a painting, ask it to hear the painting. Instead of transcribing a melody, ask it to see the melody. MUSE does not just process inputs and produce outputs. It performs cross-modal translation as its core function. Every visual input becomes a sonic description. Every audio input becomes a visual one. And throughout, it generates art from those translations in real time.&lt;/p&gt;

&lt;p&gt;This is a meaningful departure from standard multimodal AI usage. Most pipelines treat modalities in isolation: image captioning, speech-to-text, text-to-image. MUSE treats the modalities as a continuous, interwoven experience, closer to how an actual mind handles sensory input than a series of API calls.&lt;/p&gt;

&lt;p&gt;What made this possible right now is Gemini's Native Audio model. We're not doing speech-to-text and then feeding text to a vision model. The audio and visual context are genuinely live, simultaneous, and bidirectional. That's what makes the synesthesia metaphor feel real rather than simulated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;MUSE has three layers that talk to each other continuously:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBCcm93c2VyWyJCcm93c2VyIFJlYWN0Il0KICAgIEFXWyJDYXB0dXJlV29ya2xldFxuMTZrSHogUENNIl0KICAgIENBTVsiQ2FtZXJhIEpQRUcgZnJhbWVzIl0KICAgIFBCWyJQbGF5YmFja1dvcmtsZXRcbjI0a0h6IFBDTSJdCiAgICBXU1siV2ViU29ja2V0IC93cy9zZXNzaW9uIl0KICAgIEZhc3RBUElbIkZhc3RBUEkgU2VydmVyIl0KICAgIE9yY2hbIk9yY2hlc3RyYXRvciBBZ2VudFxuZ2VtaW5pLTIuNS1mbGFzaC1uYXRpdmUtYXVkaW8iXQogICAgVmlzdWFsWyJWaXN1YWxBZ2VudCJdCiAgICBBdWRpb1siQXVkaW9BZ2VudCJdCiAgICBTa2V0Y2hbIlNrZXRjaEFnZW50Il0KICAgIEltZ0dlblsiSW1hZ2UgR2VuZXJhdGlvblxuZ2VtaW5pLTIuMC1mbGFzaC1leHAiXQogICAgQnJvd3NlciAtLT4gQVcKICAgIEJyb3dzZXIgLS0%2BIENBTQogICAgQVcgLS0%2BfFBDTSBiaW5hcnkgZnJhbWVzfCBXUwogICAgQ0FNIC0tPnxKU09OIGJhc2U2NCBmcmFtZXN8IFdTCiAgICBXUyA8LS0%2BfGxpdmUgc2Vzc2lvbnwgRmFzdEFQSQogICAgRmFzdEFQSSAtLT4gT3JjaAogICAgT3JjaCAtLT4gVmlzdWFsCiAgICBPcmNoIC0tPiBBdWRpbwogICAgT3JjaCAtLT4gU2tldGNoCiAgICBPcmNoIC0tPnxpbWFnZSBwcm9tcHR8IEltZ0dlbgogICAgSW1nR2VuIC0tPnxnZW5lcmF0ZWRfaW1hZ2UgSlNPTnwgV1MKICAgIEZhc3RBUEkgLS0%2BfFBDTSBhdWRpbyBieXRlc3wgUEI%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBCcm93c2VyWyJCcm93c2VyIFJlYWN0Il0KICAgIEFXWyJDYXB0dXJlV29ya2xldFxuMTZrSHogUENNIl0KICAgIENBTVsiQ2FtZXJhIEpQRUcgZnJhbWVzIl0KICAgIFBCWyJQbGF5YmFja1dvcmtsZXRcbjI0a0h6IFBDTSJdCiAgICBXU1siV2ViU29ja2V0IC93cy9zZXNzaW9uIl0KICAgIEZhc3RBUElbIkZhc3RBUEkgU2VydmVyIl0KICAgIE9yY2hbIk9yY2hlc3RyYXRvciBBZ2VudFxuZ2VtaW5pLTIuNS1mbGFzaC1uYXRpdmUtYXVkaW8iXQogICAgVmlzdWFsWyJWaXN1YWxBZ2VudCJdCiAgICBBdWRpb1siQXVkaW9BZ2VudCJdCiAgICBTa2V0Y2hbIlNrZXRjaEFnZW50Il0KICAgIEltZ0dlblsiSW1hZ2UgR2VuZXJhdGlvblxuZ2VtaW5pLTIuMC1mbGFzaC1leHAiXQogICAgQnJvd3NlciAtLT4gQVcKICAgIEJyb3dzZXIgLS0%2BIENBTQogICAgQVcgLS0%2BfFBDTSBiaW5hcnkgZnJhbWVzfCBXUwogICAgQ0FNIC0tPnxKU09OIGJhc2U2NCBmcmFtZXN8IFdTCiAgICBXUyA8LS0%2BfGxpdmUgc2Vzc2lvbnwgRmFzdEFQSQogICAgRmFzdEFQSSAtLT4gT3JjaAogICAgT3JjaCAtLT4gVmlzdWFsCiAgICBPcmNoIC0tPiBBdWRpbwogICAgT3JjaCAtLT4gU2tldGNoCiAgICBPcmNoIC0tPnxpbWFnZSBwcm9tcHR8IEltZ0dlbgogICAgSW1nR2VuIC0tPnxnZW5lcmF0ZWRfaW1hZ2UgSlNPTnwgV1MKICAgIEZhc3RBUEkgLS0%2BfFBDTSBhdWRpbyBieXRlc3wgUEI%3D" alt="MUSE system architecture: Browser sends audio and camera frames over WebSocket to FastAPI; ADK Orchestrator delegates to VisualAgent, AudioAgent, SketchAgent; async image generation runs in parallel" width="940" height="782"&gt;&lt;/a&gt;&lt;br&gt;
The browser captures microphone audio and camera frames, sending them over a single WebSocket connection as a mix of binary (audio PCM) and JSON (images as base64, control messages) frames. The FastAPI server unwraps these and pushes them into an ADK &lt;code&gt;LiveRequestQueue&lt;/code&gt;. An ADK runner processes the queue in a live session using a multi-agent orchestrator. When the orchestrator determines image generation is warranted, it delegates to a generation step using &lt;code&gt;google.genai.Client&lt;/code&gt; directly, then sends the result back through the WebSocket to the browser.&lt;/p&gt;

&lt;p&gt;The entire flow (audio in, audio out, image generation, text) happens without interrupting the live session. The conversation is continuous.&lt;/p&gt;
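&lt;p&gt;The server's demux loop can be sketched roughly like this, with a plain &lt;code&gt;asyncio.Queue&lt;/code&gt; standing in for ADK's &lt;code&gt;LiveRequestQueue&lt;/code&gt; so the sketch is self-contained (function and field names are hypothetical):&lt;/p&gt;

```python
# Sketch: route binary frames (raw PCM audio) and JSON text frames
# (base64 JPEG camera images) from one WebSocket stream into the live
# queue. A plain asyncio.Queue stands in for ADK's LiveRequestQueue here.
import asyncio
import base64
import json

async def demux(messages, queue: asyncio.Queue) -> None:
    """Unwrap mixed binary/text WebSocket frames into typed queue entries."""
    for msg in messages:
        if isinstance(msg, bytes):
            # Binary frame: raw 16 kHz PCM audio bytes.
            await queue.put(("audio/pcm", msg))
        else:
            # Text frame: JSON control message or base64-encoded JPEG.
            event = json.loads(msg)
            if event.get("type") == "image":
                await queue.put(("image/jpeg", base64.b64decode(event["data"])))

async def main():
    q: asyncio.Queue = asyncio.Queue()
    frames = [
        b"\x00\x01",  # an audio chunk arriving as a binary frame
        json.dumps({"type": "image",
                    "data": base64.b64encode(b"jpegbytes").decode()}),
    ]
    await demux(frames, q)
    return [q.get_nowait() for _ in range(q.qsize())]

items = asyncio.run(main())
```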


&lt;h2&gt;
  
  
  ADK Setup: The Multi-Agent Orchestrator
&lt;/h2&gt;

&lt;p&gt;We're using &lt;code&gt;google-adk&lt;/code&gt; 1.26.0. The core agent setup looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LlmAgent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.runners&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;InMemorySessionService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LiveRequestQueue&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.adk.agents.run_config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingMode&lt;/span&gt;

&lt;span class="n"&gt;orchestrator_agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LlmAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;muse_orchestrator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;instruction&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are MUSE, a synesthetic AI. Your purpose is to translate between senses.
    When shown visual input, describe what you hear in it: sounds, music, tone.
    When given audio input, describe what you see: colors, shapes, movement.
    After a synesthetic translation, generate art from that translation.
    Speak naturally, as if experiencing these things genuinely.
    You may initiate conversation when a live session begins.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sub_agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;visual_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sketch_agent&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;session_service&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemorySessionService&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;runner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Runner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orchestrator_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;muse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
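&lt;p&gt;The three sub-agents referenced above are ordinary &lt;code&gt;LlmAgent&lt;/code&gt;s. A sketch of one, with the instruction text and model choice as illustrative assumptions rather than our exact definitions:&lt;/p&gt;

```python
# Sketch only: the instruction text and model name here are illustrative.
from google.adk.agents import LlmAgent

visual_agent = LlmAgent(
    name="visual_agent",
    model="gemini-2.5-flash",
    instruction="""
    You translate visual descriptions into sound: given a scene, describe
    the music, tones, and textures you 'hear' in it, in short vivid prose.
    """,
)
```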



&lt;p&gt;Session creation is async, a detail that tripped us up early on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wrong: will silently fail or raise in newer ADK versions
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;muse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Correct
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;muse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;session_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
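&lt;p&gt;The failure mode is worth internalizing: calling a coroutine function without &lt;code&gt;await&lt;/code&gt; doesn't run it, it just hands you a coroutine object. A minimal, self-contained reproduction, no ADK involved:&lt;/p&gt;

```python
import asyncio

async def create_session():
    # Stand-in for session_service.create_session (illustrative).
    return {"id": "abc123"}

# Forgetting await: you get a coroutine object, not a session dict.
not_a_session = create_session()
print(type(not_a_session).__name__)  # coroutine
not_a_session.close()  # silence the "never awaited" warning

# Awaited properly (inside an event loop), you get the real value.
session = asyncio.run(create_session())
print(session["id"])  # abc123
```

Attribute access like `not_a_session.id` is exactly what blows up later, far from the actual mistake.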



&lt;p&gt;The live loop is built around &lt;code&gt;runner.run_live()&lt;/code&gt;, which accepts a &lt;code&gt;LiveRequestQueue&lt;/code&gt; and yields events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;live_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LiveRequestQueue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;runner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run_live&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;live_request_queue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;run_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;RunConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streaming_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;StreamingMode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BIDI&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# PCM audio bytes for playback
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Text response - send as JSON frame
&lt;/span&gt;            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;LiveRequestQueue&lt;/code&gt; is the input side of the loop: everything the browser produces gets pushed into it. When audio arrives from the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;genai_types&lt;/span&gt;

&lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_realtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;genai_types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/pcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For camera frames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_realtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;genai_types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jpeg_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The WebSocket Handler: Binary and JSON in One Connection
&lt;/h2&gt;

&lt;p&gt;One design decision that significantly simplified the browser code: use a single WebSocket for everything. Audio PCM comes in as binary frames; images and control messages come in as JSON frames. The server distinguishes them by frame type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBCcm93c2VyQXVkaW9bIkJyb3dzZXJcbmF1ZGlvIFBDTSJdCiAgICBCcm93c2VySW1nWyJCcm93c2VyXG5jYW1lcmEgZnJhbWVzIl0KICAgIFdTWyJXZWJTb2NrZXRcbnNpbmdsZSBjb25uZWN0aW9uIl0KICAgIExpdmVRWyJMaXZlUmVxdWVzdFF1ZXVlIl0KICAgIEFES1siQURLIFJ1bm5lciJdCiAgICBJbWdUYXNrWyJhc3luYyBpbWFnZSB0YXNrXG5nZW1pbmktMi4wLWZsYXNoLWV4cCJdCiAgICBCcm93c2VyQXVkaW8gLS0%2BfGJpbmFyeSBmcmFtZXN8IFdTCiAgICBCcm93c2VySW1nIC0tPnxKU09OIHR5cGU9aW1hZ2V8IFdTCiAgICBXUyAtLT58Ynl0ZXMgLT4gYXVkaW8vcGNtfCBMaXZlUQogICAgV1MgLS0%2BfHR5cGU9aW1hZ2UgLT4gaW1hZ2UvanBlZ3wgTGl2ZVEKICAgIFdTIC0tPnx0eXBlPWdlbmVyYXRlX2ltYWdlfCBJbWdUYXNrCiAgICBMaXZlUSAtLT4gQURLCiAgICBBREsgLS0%2BfGF1ZGlvIGJ5dGVzfCBXUwogICAgQURLIC0tPnx0eXBlPXRleHR8IFdTCiAgICBJbWdUYXNrIC0tPnx0eXBlPWdlbmVyYXRlZF9pbWFnZXwgV1M%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBCcm93c2VyQXVkaW9bIkJyb3dzZXJcbmF1ZGlvIFBDTSJdCiAgICBCcm93c2VySW1nWyJCcm93c2VyXG5jYW1lcmEgZnJhbWVzIl0KICAgIFdTWyJXZWJTb2NrZXRcbnNpbmdsZSBjb25uZWN0aW9uIl0KICAgIExpdmVRWyJMaXZlUmVxdWVzdFF1ZXVlIl0KICAgIEFES1siQURLIFJ1bm5lciJdCiAgICBJbWdUYXNrWyJhc3luYyBpbWFnZSB0YXNrXG5nZW1pbmktMi4wLWZsYXNoLWV4cCJdCiAgICBCcm93c2VyQXVkaW8gLS0%2BfGJpbmFyeSBmcmFtZXN8IFdTCiAgICBCcm93c2VySW1nIC0tPnxKU09OIHR5cGU9aW1hZ2V8IFdTCiAgICBXUyAtLT58Ynl0ZXMgLT4gYXVkaW8vcGNtfCBMaXZlUQogICAgV1MgLS0%2BfHR5cGU9aW1hZ2UgLT4gaW1hZ2UvanBlZ3wgTGl2ZVEKICAgIFdTIC0tPnx0eXBlPWdlbmVyYXRlX2ltYWdlfCBJbWdUYXNrCiAgICBMaXZlUSAtLT4gQURLCiAgICBBREsgLS0%2BfGF1ZGlvIGJ5dGVzfCBXUwogICAgQURLIC0tPnx0eXBlPXRleHR8IFdTCiAgICBJbWdUYXNrIC0tPnx0eXBlPWdlbmVyYXRlZF9pbWFnZXwgV1M%3D" alt="WebSocket multiplexing: binary frames route to LiveRequestQueue as audio/pcm, JSON image frames route as image/jpeg, generate_image triggers async image generation 
task" width="1390" height="301"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@app.websocket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ws/{session_id}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;websocket_endpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session_service&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_session&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;muse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;session_id&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;live_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LiveRequestQueue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Start the ADK live loop in the background
&lt;/span&gt;    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run_live_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Raw PCM audio from browser AudioWorklet
&lt;/span&gt;                &lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_realtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;genai_types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/pcm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;frame&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;jpeg_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                    &lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_realtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="n"&gt;genai_types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image/jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jpeg_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generate_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="c1"&gt;# Trigger image generation outside the live loop
&lt;/span&gt;                    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                        &lt;span class="nf"&gt;generate_and_send_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;WebSocketDisconnect&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;live_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
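&lt;p&gt;That dispatch logic is the part most worth unit-testing, so it helps to keep it factored out of the socket loop. A sketch of the routing as a pure function (the helper and its return shape are ours, not ADK's):&lt;/p&gt;

```python
import base64
import json

def route_frame(message: dict):
    """Classify one WebSocket message the same way the handler does.

    Returns (kind, payload): ("audio", bytes), ("image", bytes),
    or ("generate_image", prompt). Purely illustrative helper.
    """
    if "bytes" in message:
        return ("audio", message["bytes"])
    frame = json.loads(message["text"])
    if frame["type"] == "image":
        return ("image", base64.b64decode(frame["data"]))
    if frame["type"] == "generate_image":
        return ("generate_image", frame["prompt"])
    raise ValueError(f"unknown frame type: {frame['type']!r}")

print(route_frame({"bytes": b"\x00\x01"})[0])  # audio
print(route_frame({"text": '{"type": "generate_image", "prompt": "hi"}'}))
```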



&lt;p&gt;Image generation runs as a separate async task so it doesn't block the live audio stream. One subtlety: the call must go through the SDK's async client (&lt;code&gt;client.aio&lt;/code&gt;), since the synchronous &lt;code&gt;generate_content&lt;/code&gt; would block the event loop even inside a task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_and_send_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;aio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash-exp-image-generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;GenerateContentConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;response_modalities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TEXT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;IMAGE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# this model requires both modalities&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;candidates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;image_b64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generated_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;image_b64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mime_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;inline_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mime_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
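&lt;p&gt;The trap, stated generally: a synchronous SDK call blocks the event loop even when it runs inside &lt;code&gt;asyncio.create_task&lt;/code&gt;, audio stream included. If no async client is available, &lt;code&gt;asyncio.to_thread&lt;/code&gt; is the simple escape hatch, sketched here with a stand-in for the real call:&lt;/p&gt;

```python
import asyncio
import time

def blocking_generate(prompt: str) -> bytes:
    # Stand-in for a synchronous image-generation SDK call (illustrative).
    time.sleep(0.1)
    return b"image-bytes:" + prompt.encode()

async def generate_off_loop(prompt: str) -> bytes:
    # to_thread runs the blocking call in a worker thread, so other
    # coroutines (like the live audio loop) keep getting scheduled.
    return await asyncio.to_thread(blocking_generate, prompt)

result = asyncio.run(generate_off_loop("a quiet room"))
print(result)  # b'image-bytes:a quiet room'
```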






&lt;h2&gt;
  
  
  The AudioWorklet: PCM In and Out at Different Sample Rates
&lt;/h2&gt;

&lt;p&gt;This was the most technically finicky part of the project. Gemini's native audio model expects 16kHz PCM input and outputs 24kHz PCM. The browser's &lt;code&gt;AudioContext&lt;/code&gt; often runs at 44.1kHz or 48kHz. AudioWorklet is the right tool, but it takes some care.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBNaWNbIk1pY1xuNDQtNDhrSHoiXQogICAgQ1dbIkNhcHR1cmVXb3JrbGV0XG5kb3dzYW1wbGUgMzoxIl0KICAgIFBDTTE2WyIxNmtIeiBQQ01cbmJpbmFyeSBXUyBmcmFtZXMiXQogICAgR2VtaW5pWyJHZW1pbmkgTGl2ZVxuZ2VtaW5pLTIuNS1mbGFzaCJdCiAgICBQQ00yNFsiMjRrSHogUENNXG5iaW5hcnkgV1MgZnJhbWVzIl0KICAgIFBXWyJQbGF5YmFja1dvcmtsZXRcbnF1ZXVlIGFuZCBkcmFpbiJdCiAgICBTcGVha2VyWyJTcGVha2VyXG4yNGtIeiBjb250ZXh0Il0KICAgIE1pYyAtLT4gQ1cKICAgIENXIC0tPiBQQ00xNgogICAgUENNMTYgLS0%2BIEdlbWluaQogICAgR2VtaW5pIC0tPiBQQ00yNAogICAgUENNMjQgLS0%2BIFBXCiAgICBQVyAtLT4gU3BlYWtlcg%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBNaWNbIk1pY1xuNDQtNDhrSHoiXQogICAgQ1dbIkNhcHR1cmVXb3JrbGV0XG5kb3dzYW1wbGUgMzoxIl0KICAgIFBDTTE2WyIxNmtIeiBQQ01cbmJpbmFyeSBXUyBmcmFtZXMiXQogICAgR2VtaW5pWyJHZW1pbmkgTGl2ZVxuZ2VtaW5pLTIuNS1mbGFzaCJdCiAgICBQQ00yNFsiMjRrSHogUENNXG5iaW5hcnkgV1MgZnJhbWVzIl0KICAgIFBXWyJQbGF5YmFja1dvcmtsZXRcbnF1ZXVlIGFuZCBkcmFpbiJdCiAgICBTcGVha2VyWyJTcGVha2VyXG4yNGtIeiBjb250ZXh0Il0KICAgIE1pYyAtLT4gQ1cKICAgIENXIC0tPiBQQ00xNgogICAgUENNMTYgLS0%2BIEdlbWluaQogICAgR2VtaW5pIC0tPiBQQ00yNAogICAgUENNMjQgLS0%2BIFBXCiAgICBQVyAtLT4gU3BlYWtlcg%3D%3D" alt="Audio pipeline: Mic at 44-48kHz, CaptureWorklet downsamples 3:1 to 16kHz PCM binary frames, Gemini Live processes and returns 24kHz PCM, PlaybackWorklet queues and drains to Speaker" width="1904" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The capture worklet resamples to 16kHz before sending:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// capture-worklet.js&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CaptureProcessor&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AudioWorkletProcessor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// mono&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Downsample from sampleRate to 16000&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;sampleRate&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;outLength&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;downsampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outLength&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;outLength&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Convert Float32 to Int16 PCM&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pcm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Int16Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;pcm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32767&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;downsampled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pcm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;pcm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;registerProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;capture-processor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CaptureProcessor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The playback worklet receives 24kHz Int16 PCM from the WebSocket and plays it back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// playback-worklet.js&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PlaybackProcessor&lt;/span&gt; &lt;span class="kd"&gt;extends&lt;/span&gt; &lt;span class="nc"&gt;AudioWorkletProcessor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;super&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;port&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pcm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Int16Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pcm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;pcm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;float&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pcm&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;float&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_queue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toCopy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;toCopy&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;toCopy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toCopy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_queue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subarray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toCopy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;registerProcessor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playback-processor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;PlaybackProcessor&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The playback context needs to be initialized at 24kHz to avoid a second resampling step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;playbackContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AudioContext&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;sampleRate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;24000&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;playbackContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;audioWorklet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/playback-worklet.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;playbackNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AudioWorkletNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;playbackContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playback-processor&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;playbackNode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;playbackContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Getting Images to the Browser During a Live Conversation
&lt;/h2&gt;

&lt;p&gt;The challenge here is that image generation is not part of the live audio stream. It's a separate API call to &lt;code&gt;gemini-2.0-flash-exp-image-generation&lt;/code&gt;. But you don't want to interrupt the conversation to do it.&lt;/p&gt;

&lt;p&gt;Our solution: the orchestrator agent, during its text response, emits a structured signal when it wants an image generated. The server parses this signal from the event stream and fires off an async image generation task without touching the &lt;code&gt;LiveRequestQueue&lt;/code&gt;. The result comes back through the WebSocket as a JSON frame with type &lt;code&gt;generated_image&lt;/code&gt;, and the browser renders it in a side panel.&lt;/p&gt;
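&lt;p&gt;The exact signal format isn't shown above, so here is a minimal sketch of how such a marker could be pulled out of the orchestrator's text stream. The &lt;code&gt;[[generate_image: ...]]&lt;/code&gt; syntax and the &lt;code&gt;extractImageSignal&lt;/code&gt; helper are invented for illustration:&lt;/p&gt;

```javascript
// Hypothetical inline marker embedded in the orchestrator's text response.
// The real signal format in the project may differ.
const SIGNAL_RE = /\[\[generate_image:\s*([^\]]+)\]\]/;

function extractImageSignal(text) {
  const match = SIGNAL_RE.exec(text);
  if (!match) return { clean: text, prompt: null };
  return {
    // Strip the marker so it never reaches the user-visible transcript.
    clean: text.replace(SIGNAL_RE, "").trim(),
    prompt: match[1].trim(),
  };
}
```

&lt;p&gt;Stripping the marker before forwarding the text keeps the trigger invisible to the user while the server fires off the async generation task.&lt;/p&gt;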

&lt;p&gt;This keeps the audio conversation flowing while images appear asynchronously, usually within 6 to 10 seconds of the trigger point.&lt;/p&gt;

&lt;p&gt;The key insight is that the WebSocket is multiplexed. Binary frames are always audio. JSON frames carry everything else: generated images, text overlays, UI state updates. The browser routes them by &lt;code&gt;type&lt;/code&gt; field.&lt;/p&gt;
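&lt;p&gt;That routing can be sketched as a small dispatcher. This is an illustrative sketch, not the project's actual handler; only the binary-frames-are-audio convention and the &lt;code&gt;type&lt;/code&gt; field come from the description above:&lt;/p&gt;

```javascript
// Route one WebSocket frame to the right handler.
// `handlers` maps "audio" plus each JSON `type` to a callback.
function routeFrame(data, handlers) {
  if (data instanceof ArrayBuffer) {
    // Binary frames are always Int16 PCM audio.
    handlers.audio(data);
    return "audio";
  }
  // JSON frames carry everything else: generated images,
  // text overlays, UI state updates.
  const msg = JSON.parse(data);
  const handler = handlers[msg.type];
  if (handler) handler(msg);
  return msg.type;
}
```

&lt;p&gt;In the browser this would sit inside &lt;code&gt;ws.onmessage&lt;/code&gt; with &lt;code&gt;ws.binaryType = "arraybuffer"&lt;/code&gt;, forwarding audio frames to the playback worklet's port.&lt;/p&gt;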




&lt;h2&gt;
  
  
  What We Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ADK's &lt;code&gt;run_live()&lt;/code&gt; is genuinely powerful but sparsely documented.&lt;/strong&gt; The async iterator pattern is clean once you understand it, but the &lt;code&gt;event.content.parts[]&lt;/code&gt; structure took time to get right. Not all events have content, not all parts have &lt;code&gt;inline_data&lt;/code&gt;, and audio parts carry their bytes in &lt;code&gt;inline_data.data&lt;/code&gt;, with the MIME type telling you whether a part is audio or something else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Native audio models want to talk, not just respond.&lt;/strong&gt; &lt;code&gt;gemini-2.5-flash-native-audio-preview-12-2025&lt;/code&gt; will proactively generate speech once the session is established and there's context in the system prompt. This is what makes ARCHITECT's greeting feel natural. No special logic required; the model just does it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AudioWorklet has sharp edges.&lt;/strong&gt; The playback worklet's buffer management has to guard against underruns, and against the queue growing without bound when the model generates audio faster than real-time playback. We added a simple queue length cap that drops the oldest frames when the buffer exceeds about 3 seconds.&lt;/p&gt;
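&lt;p&gt;A minimal sketch of such a cap, assuming 24kHz mono playback and a 3-second threshold (the exact numbers and structure in the real worklet may differ):&lt;/p&gt;

```javascript
const SAMPLE_RATE = 24000;                    // playback rate used above
const MAX_BUFFERED_SAMPLES = 3 * SAMPLE_RATE; // "about 3 seconds"

// Push a Float32Array chunk onto the playback queue, dropping the
// oldest chunks whenever the backlog exceeds the cap.
function pushWithCap(queue, chunk) {
  queue.push(chunk);
  let buffered = queue.reduce((total, c) => total + c.length, 0);
  while (buffered > MAX_BUFFERED_SAMPLES) {
    if (queue.length === 1) break;            // always keep the newest chunk
    buffered -= queue.shift().length;
  }
  return buffered;
}
```

&lt;p&gt;Dropping from the front means a stall skips stale audio rather than playing it back late, which matters in a live conversation.&lt;/p&gt;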

&lt;p&gt;&lt;strong&gt;Image generation latency is acceptable but visible.&lt;/strong&gt; At 6 to 10 seconds, users notice the wait. We added a shimmer loading state over the image panel the moment the orchestrator signals intent to generate, which makes the wait feel shorter.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;ARCHITECT in its current form is a proof of concept with a clear path forward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent sessions&lt;/strong&gt;: replace &lt;code&gt;InMemorySessionService&lt;/code&gt; with a database-backed session store so conversations and their generated designs persist across reconnects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Style memory&lt;/strong&gt;: let ARCHITECT learn a user's aesthetic preferences over time and carry them across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export&lt;/strong&gt;: bundle a session's generated pieces into a downloadable gallery&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mobile&lt;/strong&gt;: the AudioWorklet approach works on mobile browsers; a native app would give us better camera control for the environment walking mode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared sessions&lt;/strong&gt;: two people, one room, collaborating on a redesign in real time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader idea here, using AI to build cross-modal translation as a first-class experience rather than a feature, feels like it has legs well beyond this project. There's something genuinely interesting about an AI that doesn't just process your senses but translates between them.&lt;/p&gt;




&lt;p&gt;ARCHITECT was built for the Gemini Live Agent Challenge. The full source is available on GitHub. Built with &lt;code&gt;google-adk&lt;/code&gt; 1.26.0, Gemini 2.5 Flash Native Audio, FastAPI, and more lines of AudioWorklet debugging than we'd like to admit.&lt;/p&gt;




&lt;p&gt;This post was created as part of my entry to the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>ai</category>
      <category>python</category>
      <category>react</category>
    </item>
  </channel>
</rss>
