Aaron Melton
What Building Voxitale for the Gemini Live Contest Taught Me About Working With Multiple AI Tools

For the Gemini Live contest I built Voxitale, a voice-first storytelling app for young children.

A child talks to a character named Amelia directly in the browser. They guide the adventure out loud. Illustrated scenes appear as the story unfolds. At the end the system produces a short storybook-style movie based on what happened in the session.

The strange part?

My favorite moment during the entire project was fixing the WiFi on my Raspberry Pi.

Let me explain.

First, I hate consultant talk. I cannot stand polished language that sounds impressive but says nothing. So I am not going to pretend this project was some elegant engineering journey. It was messy. It was fast. I used a pile of AI tools. Some parts were genuinely exciting. Some parts felt like moving logs between terminals for hours.

Somewhere in the middle of all that I actually learned something useful.

What Voxitale Looks Like

Before getting into the engineering, here is what the experience actually looks like.

A child speaks to Amelia and guides the story with their voice. The system generates illustrated scenes and narration in real time.

As the story progresses, pages are generated and eventually assembled into a storybook-style experience.

Parents can control things like story mood, pacing, narrator voice, and optional smart lighting effects.

The goal is to make storytelling feel interactive instead of passive.

Why I Entered the Gemini Live Contest

I entered the Gemini Live contest because I wanted an excuse to build something around live interaction.

Before Voxitale I had already experimented with Gemini Live on a customer service prototype. It could perform RAG lookups, help users navigate a website, and even control a video player.

It worked.

But it did not excite me.

The interactive storytelling category did.

A live storyteller has to feel present. It has to respond quickly. It has to handle interruptions. It has to keep the illusion alive.

Around the same time a contract fell through, which meant I suddenly had the one thing most side projects never get from me: uninterrupted time.

I had touched Gemini Live before, so I thought this would be manageable.

I was wrong.

Real-time storytelling is much harder than it looks.

What I Built

Voxitale ended up becoming a system with two very different tempos running at once.

The first tempo was the live conversation loop.

A child speaks in the browser. The app captures microphone audio and streams it over WebSocket to a FastAPI backend. That backend runs a Google ADK live agent using Gemini native audio so Amelia can respond in real time.

The goal was to make it feel like talking to a character rather than interacting with a chatbot waiting for turns.

The second tempo was a creative generation pipeline running alongside that live voice interaction.

As the story evolves the system generates illustrated scenes and captions describing what just happened. At the end of the session those pieces are assembled into a short storybook movie.

Optional integrations like ElevenLabs narration and Home Assistant lighting effects can add immersion to the experience.

This meant the system had to support two very different workloads at once:

  • low latency voice interaction
  • slower media generation
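
The split between those two tempos can be sketched with plain asyncio: the live loop hands scene work to a queue and keeps talking, while a slower consumer generates media at its own pace. The function and variable names here are illustrative stand-ins, not the actual Voxitale code:

```python
import asyncio

async def live_loop(scene_queue: asyncio.Queue) -> None:
    """Low-latency tempo: react to each utterance immediately and
    hand scene work off to the slow pipeline without waiting on it."""
    for utterance in ["a dragon appears", "we run away"]:  # stand-in for streamed audio
        reply = f"Amelia reacts to: {utterance}"           # stand-in for the live model call
        print(reply)
        await scene_queue.put(utterance)                   # fire-and-forget scene request
    await scene_queue.put(None)                            # sentinel: session is over

async def media_pipeline(scene_queue: asyncio.Queue, scenes: list) -> None:
    """Slow tempo: generate illustrations at whatever pace they take."""
    while (prompt := await scene_queue.get()) is not None:
        await asyncio.sleep(0.01)                          # stand-in for image generation latency
        scenes.append(f"illustration for '{prompt}'")

async def session() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    scenes: list = []
    await asyncio.gather(live_loop(queue), media_pipeline(queue, scenes))
    return scenes

if __name__ == "__main__":
    print(asyncio.run(session()))
```

The key property is that the live loop never blocks on image generation; the queue is the only coupling between the two tempos.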

The easiest way to understand the system is to look at the architecture.

Client Layer

The browser runs a React / Next.js interface that captures microphone audio using audio worklets and streams it to the backend over WebSockets. This allows the child to speak naturally and interrupt the story when they want.

Application Layer

The backend runs on Google Cloud Run using FastAPI. This service manages WebSocket connections, API routing, and orchestration of the storytelling session.

Agent and Model Layer

The agent runs through Google ADK using Gemini Live and Vertex models.

This layer handles storytelling logic, prompt rules, and tool execution. It generates prompts for scenes, triggers image generation, and coordinates integrations like ElevenLabs audio and Home Assistant lighting.
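The shape of that tool layer can be pictured as a simple name-to-function dispatch: the agent decides what to do, and each tool knows how. The tool names and payloads below are hypothetical illustrations, not the actual ADK wiring:

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping tool names to implementations.
TOOLS: Dict[str, Callable[..., Any]] = {}

def tool(name: str):
    """Register a function under a tool name the agent can call."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return register

@tool("generate_scene")
def generate_scene(prompt: str) -> dict:
    # Stand-in for triggering image generation from a scene prompt.
    return {"type": "scene", "prompt": prompt}

@tool("set_lights")
def set_lights(color: str) -> dict:
    # Stand-in for a Home Assistant lighting effect.
    return {"type": "lights", "color": color}

def dispatch(name: str, **kwargs: Any) -> dict:
    """Route a model tool call to its implementation."""
    return TOOLS[name](**kwargs)

print(dispatch("generate_scene", prompt="a dragon in a misty forest"))
```

In the real system the dispatch target would call out to Vertex image generation, ElevenLabs, or Home Assistant; the registry pattern just keeps the storytelling logic ignorant of those details.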

Data and Media Layer

Generated scenes and assets are stored in Google Cloud Storage while session metadata and feedback are stored in Firestore.

At the end of the session a Cloud Run job assembles the scenes and narration into a final MP4 storybook video.
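One common way to do that kind of assembly is ffmpeg's concat demuxer: write a manifest listing the scene clips in order, then stitch them and mux in the narration track. This sketch only builds the manifest and the command line (the file names are hypothetical) and does not actually invoke ffmpeg:

```python
from typing import List

def concat_manifest(scene_files: List[str]) -> str:
    """Build an ffmpeg concat-demuxer manifest listing scene clips in order."""
    return "".join(f"file '{name}'\n" for name in scene_files)

def ffmpeg_command(manifest_path: str, narration: str, output: str) -> List[str]:
    """Command line that stitches the clips and muxes in a narration track."""
    return [
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", manifest_path,  # ordered scene clips
        "-i", narration,                                    # narration audio track
        "-c:v", "libx264", "-c:a", "aac",                   # re-encode for broad playback
        "-shortest", output,                                # stop at the shorter stream
    ]

print(concat_manifest(["scene_001.mp4", "scene_002.mp4"]), end="")
print(ffmpeg_command("scenes.txt", "narration.mp3", "storybook.mp4"))
```

Running this as a Cloud Run job works well because the assembly is batch work with no latency requirement, unlike the live loop.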

My Development Workflow

Interestingly, I did not use Gemini Live to code Voxitale.

Gemini powered the product experience, but my development workflow used multiple AI tools:

  • Google Antigravity with Gemini Pro / Flash
  • OpenAI Codex with GPT-5.4
  • Anthropic Opus and Sonnet early in development

I basically vibe-coded large parts of the system.

Gemini helped with frontend UI ideas and brainstorming features.

OpenAI Codex handled most of the backend work and debugging.

Once GPT-5.4 was released I found it extremely strong for backend reasoning, and by the end roughly 90% of the backend work involved it in some way.

Feeding the AI the Right Context

AI coding tools are only as good as the context you give them.

WebSockets, Gemini Live, Google ADK, reconnect logic, and streaming pipelines are not areas where models can improvise reliably.

So I pulled documentation directly into the repository and placed it in a docs folder so the models could reference it.

Logging also became critical.

Most debugging followed a simple loop:

  1. explain the issue
  2. provide backend logs
  3. provide frontend logs
  4. let the model analyze the failure
  5. test the fix

AI made debugging faster.

But it was still debugging.

The Hardest Technical Problem

The hardest part was making the live system feel stable.

When people hear “interactive storyteller” they imagine the fun parts:

  • character voices
  • illustrations
  • kids guiding the plot

But the real work was everything underneath.

From an architecture perspective there were really two systems running together:

  • a real-time conversational system
  • a creative media generation pipeline

The project only worked when those two systems stayed synchronized.

The Raspberry Pi Moment

I had an old Raspberry Pi that I needed to revive for the Home Assistant part of the project.

After upgrading it the WiFi stopped working.

I spent about four hours debugging it.

Eventually I realized the issue came from running a 32-bit OS instead of the 64-bit version needed for Home Assistant and Weave.

Ironically that debugging session was the most enjoyable engineering moment of the entire project.

Not because it was glamorous.

Because it felt like I actually owned the solution.

What the Contest Taught Me

AI-assisted development is incredibly powerful.

It compresses time and expands what one developer can build.

But output and ownership are not the same thing.

AI can help produce a working system quickly. Voxitale exists because of that.

But the parts that felt most rewarding were still the parts where I understood the system deeply enough to reason through it myself.

Try Voxitale

Voxitale is currently running as a limited prototype for the Gemini Live contest.

Because the system relies on live voice AI and media generation services that incur real compute costs, I cannot open the demo to unlimited public traffic right now.

If you would like to try Voxitale, you can request access here:

https://forms.gle/f9BMGs38EDy3FxaK7

If you are interested in the technical architecture or code behind the project, the contest prototype is available here:

https://github.com/Smone5/back_to_someping

Closing

Building Voxitale reminded me that modern development is less about writing every line of code and more about coordinating systems.

  • Models
  • Frameworks
  • Infrastructure
  • Pipelines
  • Timing

But it also reminded me of something simple.

The part I still enjoy most is the part where I understand what is happening.

And sometimes that moment comes from fixing WiFi on a Raspberry Pi.
