<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aaron Melton</title>
    <description>The latest articles on DEV Community by Aaron Melton (@aaron_melton_0601b97a4b57).</description>
    <link>https://dev.to/aaron_melton_0601b97a4b57</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1808610%2Fb2aa4f3d-f557-4353-9f15-94f4d786ff75.jpg</url>
      <title>DEV Community: Aaron Melton</title>
      <link>https://dev.to/aaron_melton_0601b97a4b57</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aaron_melton_0601b97a4b57"/>
    <language>en</language>
    <item>
      <title>What Building Voxitale for the Gemini Live Contest Taught Me About Working With Multiple AI Tools</title>
      <dc:creator>Aaron Melton</dc:creator>
      <pubDate>Fri, 13 Mar 2026 04:41:48 +0000</pubDate>
      <link>https://dev.to/aaron_melton_0601b97a4b57/what-building-voxitale-for-the-gemini-live-contest-taught-me-about-working-with-multiple-ai-tools-3bjb</link>
      <guid>https://dev.to/aaron_melton_0601b97a4b57/what-building-voxitale-for-the-gemini-live-contest-taught-me-about-working-with-multiple-ai-tools-3bjb</guid>
      <description>&lt;p&gt;For the Gemini Live contest I built &lt;strong&gt;Voxitale&lt;/strong&gt;, a voice-first storytelling app for young children.&lt;/p&gt;

&lt;p&gt;A child talks to a character named Amelia directly in the browser. They guide the adventure out loud. Illustrated scenes appear as the story unfolds. At the end the system produces a short storybook-style movie based on what happened in the session.&lt;/p&gt;

&lt;p&gt;The strange part?&lt;/p&gt;

&lt;p&gt;My favorite moment during the entire project was fixing the WiFi on my Raspberry Pi.&lt;/p&gt;

&lt;p&gt;Let me explain.&lt;/p&gt;

&lt;p&gt;First, I hate consultant talk. I cannot stand polished language that sounds impressive but says nothing. So I am not going to pretend this project was some elegant engineering journey. It was messy. It was fast. I used a pile of AI tools. Some parts were genuinely exciting. Some parts felt like moving logs between terminals for hours.&lt;/p&gt;

&lt;p&gt;Somewhere in the middle of all that I actually learned something useful.&lt;/p&gt;

&lt;h2&gt;What Voxitale Looks Like&lt;/h2&gt;

&lt;p&gt;Before getting into the engineering, here is what the experience actually looks like.&lt;/p&gt;

&lt;p&gt;A child speaks to Amelia and guides the story with their voice. The system generates illustrated scenes and narration in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpo55u45qcblz9zphjkx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpo55u45qcblz9zphjkx.jpeg" alt=" " width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As the story progresses, pages are generated and eventually assembled into a storybook-style experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3lhgx21sj9ruhabt6oo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3lhgx21sj9ruhabt6oo.jpeg" alt=" " width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Parents can control things like story mood, pacing, narrator voice, and optional smart lighting effects.&lt;/p&gt;

&lt;p&gt;The goal is to make storytelling feel &lt;strong&gt;interactive instead of passive&lt;/strong&gt;.&lt;/p&gt;
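&lt;p&gt;As a rough illustration of those parent controls, here is how they might be modeled as a simple settings object. This is a sketch only; the field names and defaults are my assumptions, not Voxitale's actual API.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical model of the parent-facing controls mentioned above.
# Field names and defaults are illustrative assumptions.
@dataclass
class SessionSettings:
    mood: str = "cozy"              # overall story mood
    pacing: str = "relaxed"         # e.g. "relaxed" or "lively"
    narrator_voice: str = "amelia"  # which narrator voice to use
    smart_lighting: bool = False    # optional Home Assistant effects

# A parent dials up an adventurous session with lighting enabled.
settings = SessionSettings(mood="adventurous", smart_lighting=True)
```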

&lt;h2&gt;Why I Entered the Gemini Live Contest&lt;/h2&gt;

&lt;p&gt;I entered the Gemini Live contest because I wanted an excuse to build something around &lt;strong&gt;live interaction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before Voxitale I had already experimented with Gemini Live on a customer service prototype. It could perform RAG lookups, help users navigate a website, and even control a video player.&lt;/p&gt;

&lt;p&gt;It worked.&lt;/p&gt;

&lt;p&gt;But it did not excite me.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;interactive storytelling category&lt;/strong&gt; did.&lt;/p&gt;

&lt;p&gt;A live storyteller has to feel present. It has to respond quickly. It has to handle interruptions. It has to keep the illusion alive.&lt;/p&gt;

&lt;p&gt;Around the same time a contract fell through, which meant I suddenly had the one thing most side projects never get from me: uninterrupted time.&lt;/p&gt;

&lt;p&gt;I had touched Gemini Live before, so I thought this would be manageable.&lt;/p&gt;

&lt;p&gt;I was wrong.&lt;/p&gt;

&lt;p&gt;Real-time storytelling is much harder than it looks.&lt;/p&gt;

&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;Voxitale ended up becoming a system with &lt;strong&gt;two very different tempos running at once&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The first tempo was the &lt;strong&gt;live conversation loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A child speaks in the browser. The app captures microphone audio and streams it over WebSocket to a FastAPI backend. That backend runs a Google ADK live agent using Gemini native audio so Amelia can respond in real time.&lt;/p&gt;

&lt;p&gt;The goal was to make it feel like talking to a character rather than interacting with a chatbot waiting for turns.&lt;/p&gt;
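&lt;p&gt;The shape of that live loop can be sketched with plain asyncio: audio chunks arrive from the client (here a queue stands in for the WebSocket) and are forwarded to the agent session, whose replies stream back toward the browser. The function names and the echo-style agent are hypothetical stand-ins, not the real Gemini Live / ADK calls.&lt;/p&gt;

```python
import asyncio

async def fake_agent(chunk: bytes) -> bytes:
    # Stand-in for the Gemini Live / ADK session: returns an "audio" reply.
    await asyncio.sleep(0)
    return b"reply:" + chunk

async def relay(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    # Forward each inbound audio chunk to the agent and queue its reply.
    while True:
        chunk = await inbound.get()
        if chunk is None:  # sentinel: the client disconnected
            break
        outbound.put_nowait(await fake_agent(chunk))

async def demo() -> list:
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    for c in (b"pcm-1", b"pcm-2", None):  # two chunks, then disconnect
        inbound.put_nowait(c)
    await relay(inbound, outbound)
    return [outbound.get_nowait() for _ in range(outbound.qsize())]

replies = asyncio.run(demo())
```

In the real system the inbound side is the browser's WebSocket audio stream and the outbound side is Amelia's streamed speech; the point of the sketch is that the loop never blocks on anything slower than the agent itself.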

&lt;p&gt;The second tempo was a &lt;strong&gt;creative generation pipeline&lt;/strong&gt; running alongside that live voice interaction.&lt;/p&gt;

&lt;p&gt;As the story evolves the system generates illustrated scenes and captions describing what just happened. At the end of the session those pieces are assembled into a short storybook movie.&lt;/p&gt;

&lt;p&gt;Optional integrations like ElevenLabs narration and Home Assistant lighting effects can add immersion to the experience.&lt;/p&gt;

&lt;p&gt;This meant the system had to support two very different workloads at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low latency voice interaction&lt;/li&gt;
&lt;li&gt;slower media generation&lt;/li&gt;
&lt;/ul&gt;
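&lt;p&gt;Keeping those two tempos decoupled comes down to never awaiting the slow path from the fast path. A minimal sketch of that pattern, with hypothetical names and a sleep standing in for image generation:&lt;/p&gt;

```python
import asyncio

async def generate_scene(page: int, done: list) -> None:
    # Stands in for slow media generation (illustration, captioning).
    await asyncio.sleep(0.05)
    done.append(page)

async def voice_loop() -> tuple:
    spoken, pages, tasks = [], [], []
    for turn in range(3):
        spoken.append(f"turn-{turn}")  # fast path: respond to the child now
        # Slow path: fire off scene generation without blocking the turn.
        tasks.append(asyncio.create_task(generate_scene(turn, pages)))
    await asyncio.gather(*tasks)       # flush pending media at session end
    return spoken, pages

spoken, pages = asyncio.run(voice_loop())
```

The voice loop finishes all three turns immediately while the scene tasks complete in the background; only the end-of-session assembly waits for them.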

&lt;p&gt;The easiest way to understand the system is to look at the architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p8uxzr7imv32xqa2317.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6p8uxzr7imv32xqa2317.png" alt=" " width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Client Layer&lt;/h2&gt;

&lt;p&gt;The browser runs a React / Next.js interface that captures microphone audio using audio worklets and streams it to the backend over WebSockets. This allows the child to speak naturally and interrupt the story when they want.&lt;/p&gt;

&lt;h2&gt;Application Layer&lt;/h2&gt;

&lt;p&gt;The backend runs on Google Cloud Run using FastAPI. This service manages WebSocket connections, API routing, and orchestration of the storytelling session.&lt;/p&gt;

&lt;h2&gt;Agent and Model Layer&lt;/h2&gt;

&lt;p&gt;The agent runs through Google ADK using Gemini Live and Vertex models.&lt;/p&gt;

&lt;p&gt;This layer handles storytelling logic, prompt rules, and tool execution. It generates prompts for scenes, triggers image generation, and coordinates integrations like ElevenLabs audio and Home Assistant lighting.&lt;/p&gt;
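&lt;p&gt;The tool-execution side of this layer can be sketched as a small registry: the agent emits named tool calls and a dispatcher routes them to integrations. The tool names and handlers below are illustrative assumptions, not Voxitale's actual tool schema.&lt;/p&gt;

```python
# Registry mapping tool names to handler functions.
TOOLS = {}

def tool(name):
    # Decorator that registers a handler under a tool name.
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("illustrate_scene")
def illustrate_scene(prompt: str) -> str:
    return f"image for: {prompt}"   # would trigger image generation

@tool("set_lights")
def set_lights(mood: str) -> str:
    return f"lights set to {mood}"  # would call Home Assistant

def dispatch(call: dict) -> str:
    # Route an agent-emitted tool call to its registered handler.
    return TOOLS[call["name"]](**call["args"])

result = dispatch({"name": "set_lights", "args": {"mood": "storm"}})
```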

&lt;h2&gt;Data and Media Layer&lt;/h2&gt;

&lt;p&gt;Generated scenes and assets are stored in Google Cloud Storage while session metadata and feedback are stored in Firestore.&lt;/p&gt;

&lt;p&gt;At the end of the session a Cloud Run job assembles the scenes and narration into a final MP4 storybook video.&lt;/p&gt;
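&lt;p&gt;One common way such an assembly job works is to hand ffmpeg the page images and the narration track and concatenate the stills into a video. This is a hedged sketch of that approach, not Voxitale's actual job: the paths and the five-second page duration are made up, though the ffmpeg flags themselves are standard.&lt;/p&gt;

```python
def build_ffmpeg_cmd(pages: list, narration: str, out: str) -> list:
    # Build (but do not run) an ffmpeg command that turns still pages
    # plus a narration track into one MP4.
    cmd = ["ffmpeg", "-y"]
    for page in pages:  # one looping input per still image
        cmd += ["-loop", "1", "-t", "5", "-i", page]
    cmd += ["-i", narration]  # narration audio as the final input
    n = len(pages)
    # Concatenate the video streams of all page inputs.
    filt = "".join(f"[{i}:v]" for i in range(n)) + f"concat=n={n}:v=1:a=0[v]"
    cmd += ["-filter_complex", filt, "-map", "[v]", "-map", f"{n}:a",
            "-shortest", out]
    return cmd

cmd = build_ffmpeg_cmd(["p1.png", "p2.png"], "story.mp3", "storybook.mp4")
```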

&lt;h2&gt;My Development Workflow&lt;/h2&gt;

&lt;p&gt;Interestingly, I did not use Gemini Live to code Voxitale.&lt;/p&gt;

&lt;p&gt;Gemini powered the product experience, but my development workflow used multiple AI tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Antigravity with Gemini Pro / Flash&lt;/li&gt;
&lt;li&gt;OpenAI Codex with GPT-5.4&lt;/li&gt;
&lt;li&gt;Anthropic Opus and Sonnet early in development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I basically &lt;strong&gt;vibe-coded&lt;/strong&gt; large parts of the system.&lt;/p&gt;

&lt;p&gt;Gemini helped with frontend UI ideas and brainstorming features.&lt;/p&gt;

&lt;p&gt;OpenAI Codex handled most of the backend work and debugging.&lt;/p&gt;

&lt;p&gt;Once GPT-5.4 was released, I found it extremely strong for backend reasoning, and by the end roughly &lt;strong&gt;90% of the backend work involved GPT-5.4 in some way&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;Feeding the AI the Right Context&lt;/h2&gt;

&lt;p&gt;AI coding tools are only as good as the context you give them.&lt;/p&gt;

&lt;p&gt;WebSockets, Gemini Live, Google ADK, reconnect logic, and streaming pipelines are not areas where models can improvise reliably.&lt;/p&gt;

&lt;p&gt;So I pulled documentation directly into the repository and placed it in a &lt;strong&gt;docs folder&lt;/strong&gt; so the models could reference it.&lt;/p&gt;

&lt;p&gt;Logging also became critical.&lt;/p&gt;

&lt;p&gt;Most debugging followed a simple loop:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;explain the issue&lt;/li&gt;
&lt;li&gt;provide backend logs&lt;/li&gt;
&lt;li&gt;provide frontend logs&lt;/li&gt;
&lt;li&gt;let the model analyze the failure&lt;/li&gt;
&lt;li&gt;test the fix&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;AI made debugging faster.&lt;/p&gt;

&lt;p&gt;But it was still debugging.&lt;/p&gt;

&lt;h2&gt;The Hardest Technical Problem&lt;/h2&gt;

&lt;p&gt;The hardest part was making the &lt;strong&gt;live system feel stable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When people hear “interactive storyteller” they imagine the fun parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;character voices&lt;/li&gt;
&lt;li&gt;illustrations&lt;/li&gt;
&lt;li&gt;kids guiding the plot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real work was everything underneath.&lt;/p&gt;

&lt;p&gt;From an architecture perspective there were really two systems running together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a real-time conversational system&lt;/li&gt;
&lt;li&gt;a creative media generation pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project only worked when those two systems stayed synchronized.&lt;/p&gt;

&lt;h2&gt;The Raspberry Pi Moment&lt;/h2&gt;

&lt;p&gt;I had an old Raspberry Pi that I needed to revive for the Home Assistant part of the project.&lt;/p&gt;

&lt;p&gt;After upgrading it the WiFi stopped working.&lt;/p&gt;

&lt;p&gt;I spent about four hours debugging it.&lt;/p&gt;

&lt;p&gt;Eventually I realized the issue came from running a &lt;strong&gt;32-bit OS instead of the 64-bit version&lt;/strong&gt; needed for Home Assistant and Weave.&lt;/p&gt;

&lt;p&gt;Ironically that debugging session was the most enjoyable engineering moment of the entire project.&lt;/p&gt;

&lt;p&gt;Not because it was glamorous.&lt;/p&gt;

&lt;p&gt;Because it felt like I actually owned the solution.&lt;/p&gt;

&lt;h2&gt;What the Contest Taught Me&lt;/h2&gt;

&lt;p&gt;AI-assisted development is incredibly powerful.&lt;/p&gt;

&lt;p&gt;It compresses time and expands what one developer can build.&lt;/p&gt;

&lt;p&gt;But output and ownership are not the same thing.&lt;/p&gt;

&lt;p&gt;AI can help produce a working system quickly. Voxitale exists because of that.&lt;/p&gt;

&lt;p&gt;But the parts that felt most rewarding were still the parts where I understood the system deeply enough to reason through it myself.&lt;/p&gt;

&lt;h2&gt;Try Voxitale&lt;/h2&gt;

&lt;p&gt;Voxitale is currently running as a limited prototype for the Gemini Live contest.&lt;/p&gt;

&lt;p&gt;Because the system relies on live voice AI and media generation services that incur real compute costs, I cannot open the demo to unlimited public traffic right now.&lt;/p&gt;

&lt;p&gt;If you would like to try Voxitale, you can request access here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://forms.gle/f9BMGs38EDy3FxaK7" rel="noopener noreferrer"&gt;https://forms.gle/f9BMGs38EDy3FxaK7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are interested in the technical architecture or code behind the project, the contest prototype is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Smone5/back_to_someping" rel="noopener noreferrer"&gt;https://github.com/Smone5/back_to_someping&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Closing&lt;/h2&gt;

&lt;p&gt;Building Voxitale reminded me that modern development is less about writing every line of code and more about coordinating systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;models&lt;/li&gt;
&lt;li&gt;frameworks&lt;/li&gt;
&lt;li&gt;infrastructure&lt;/li&gt;
&lt;li&gt;pipelines&lt;/li&gt;
&lt;li&gt;timing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it also reminded me of something simple.&lt;/p&gt;

&lt;p&gt;The part I still enjoy most is the part where I understand what is happening.&lt;/p&gt;

&lt;p&gt;And sometimes that moment comes from fixing WiFi on a Raspberry Pi.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>vertexai</category>
      <category>geminiliveagentchallenge</category>
    </item>
  </channel>
</rss>
