I Stopped Typing Text Prompts and Started Talking and Sketching to My Code Editor
Right now, the hottest way to build an app with AI is... typing.
You open a chat window. Write a five-paragraph essay describing what you want. Hit enter. Pray.
We went from writing code to writing about code.
So I built Monet, a real-time canvas where you talk and draw, and it builds actual, working React apps live in front of you. No typing. No prompts. Just your voice and a sketch.
This post was created as part of my participation in the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
The "Wait, What?" Demo
Before I explain anything, let me just show you.
My nephew's birthday was coming up. He likes dinosaurs. Instead of buying him a book, I opened Monet and said:
"Hey Monet, let's build an interactive storybook about a little dinosaur named Boba who goes on an adventure to find a golden egg."
Then I drew a rough dinosaur on the canvas.
Monet generated a polished cartoon character from my terrible sketch, built a multi-page storybook UI in React, and let me click through the pages, all while we kept talking about what to change next.
One minute. Done.
Then my friend said he was bored in class, so I made him a space shooter game.
Drew a spaceship. Said "make the aliens zigzag." Interrupted Monet mid-sentence to add explosions.
Because explosions are essential.
Playable. In the browser. Built from a conversation and some doodles.
Why Voice + Sketch?
Text prompts are lossy.
Try describing a layout in words:
"Put the image on the left, the text on the right, with a card below that spans the full width, but not on mobile where it should stack."
By the time you've typed that, you could've just... drawn it. In two seconds. With a squiggly line.
And voice? Voice is how humans naturally explain things. You don't type instructions to a colleague at a whiteboard. You talk and point.
Monet combines both. You speak your ideas while sketching on a canvas, and the AI sees everything at once: your voice, your drawings, your reference images.
Multimodal context turns out to be way more powerful than any single input alone.
The Architecture
Agent 1: The Orchestrator
The brain. Runs on Gemini Live 2.5 Flash with native audio, using BIDI streaming for real-time voice I/O.
It listens to you speak, sees your canvas, processes uploaded images, and decides what to do. It follows a plan-then-approve workflow: it always tells you what it's about to build and waits for your "go ahead."
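The plan-then-approve loop is just a small gate between "I heard you" and "I'm building." A minimal sketch of that gate in plain Python (the `Phase` and `PlanGate` names are mine, not Monet's actual code):

```python
from enum import Enum, auto

class Phase(Enum):
    LISTENING = auto()   # gathering the user's request
    PROPOSED = auto()    # plan announced, waiting for approval
    BUILDING = auto()    # sub-agents are executing

class PlanGate:
    """Tiny state machine: never start building until the user approves."""
    def __init__(self):
        self.phase = Phase.LISTENING
        self.plan = None

    def propose(self, plan: str) -> str:
        self.plan = plan
        self.phase = Phase.PROPOSED
        return f"Here's the plan: {plan}. Should I go ahead?"

    def on_user_reply(self, approved: bool) -> Phase:
        if self.phase is Phase.PROPOSED and approved:
            self.phase = Phase.BUILDING
        else:
            self.phase = Phase.LISTENING  # rejected or off-topic: keep listening
        return self.phase
```

In the real agent this gate lives in the orchestrator's system instructions plus session state; the point is that "build" is unreachable without an explicit yes.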
Agent 2: The Code Builder
Powered by Gemini 3 Flash. Generates and edits React + TypeScript + Tailwind files using ADK tool calls: write_file, edit_file, read_file, list_files, delete_file.
Has a fast mode using Gemini 3.1 Flash Lite for simpler changes. Speed matters when someone is literally watching.
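Conceptually, those five tools are a dispatch table over a project workspace. Here's a hedged, in-memory stand-in (the `Workspace` class is illustrative; the real builder edits actual files through ADK tool calls):

```python
class Workspace:
    """In-memory file store standing in for the builder's project directory."""
    def __init__(self):
        self.files: dict[str, str] = {}

    def write_file(self, path: str, content: str) -> str:
        self.files[path] = content
        return f"wrote {path}"

    def read_file(self, path: str) -> str:
        return self.files[path]

    def edit_file(self, path: str, old: str, new: str) -> str:
        # Targeted replacement keeps edits cheap vs. rewriting whole files
        self.files[path] = self.files[path].replace(old, new, 1)
        return f"edited {path}"

    def list_files(self) -> list[str]:
        return sorted(self.files)

    def delete_file(self, path: str) -> str:
        del self.files[path]
        return f"deleted {path}"

def dispatch(ws: Workspace, call: dict):
    """Route a model tool call like {'name': 'write_file', 'args': {...}}."""
    return getattr(ws, call["name"])(**call["args"])
```

The `edit_file` style (find-and-replace rather than full rewrite) is what makes a "fast mode" with a lighter model viable for small changes.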
Agent 3: The Image Artist
Draw rough. Get polished. The Image Agent (running Gemini 3.1 Flash Image) treats your sketch as a compositional guide, not a literal blueprint. Your terrible stick figure becomes a beautiful illustration.
Streaming Tools: The Secret Sauce
Nobody warns you about this when building voice agents:
Tool calls block the conversation.
Agent calls a tool. Voice goes silent. User stares at a spinner. 10-20 seconds. In a voice-first experience, that silence is death.
Our fix: streaming tools. The orchestrator delegates to sub-agents, but the conversation keeps going. Monet narrates what it's doing, acknowledges your input, takes new instructions, all while code generates in the background.
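The shape of the fix is: launch the sub-agent as a background task, then keep the voice loop draining a progress channel instead of blocking on the result. A stdlib-only asyncio sketch of the pattern (names like `build_app` and `orchestrator` are mine, not the actual ADK wiring):

```python
import asyncio

async def build_app(spec: str, progress: asyncio.Queue) -> str:
    """Stand-in for the code sub-agent; posts progress while it works."""
    for step in ("scaffolding components", "wiring state", "styling"):
        await progress.put(step)
        await asyncio.sleep(0.01)  # simulate generation time
    return f"built: {spec}"

async def orchestrator(spec: str) -> list[str]:
    progress: asyncio.Queue = asyncio.Queue()
    build = asyncio.create_task(build_app(spec, progress))  # non-blocking
    spoken = []
    while not build.done() or not progress.empty():
        try:
            step = await asyncio.wait_for(progress.get(), timeout=0.05)
            spoken.append(f"I'm {step}...")  # narrate instead of going silent
        except asyncio.TimeoutError:
            pass  # the loop is free here to handle new voice input
    spoken.append(await build)
    return spoken
```

The `timeout=0.05` gaps are where interruptions land: the orchestrator can take "add explosions" mid-build because it never stopped listening.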
This is the difference between a collaborator and a chatbot.
![IMAGE: Diagram showing the streaming tool flow. Orchestrator speaks to user while Code Agent generates in parallel. Caption: "The conversation never stops."]
The Canvas is the Interface
We use tldraw for the freehand canvas. Blue pen annotations become spatial context for the AI.
This unlocks interactions that are impossible with text:
- Circle an element and say "make this bigger"
- Draw a rough 3-column layout and say "put cards here"
- Sketch a rough scene and say "generate this as an image"
- Drop in a screenshot and say "make it look like this"
Vague gestures become precise spatial context.
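Turning "make this bigger" into something actionable is mostly a bounding-box intersection: which rendered elements does the blue circle cover? A minimal sketch, assuming hypothetical `Box` geometry and element names:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float

    def overlaps(self, other: "Box") -> bool:
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)

def resolve_annotation(circle: Box, elements: dict[str, Box]) -> list[str]:
    """Map a blue-pen circle to the UI elements it covers, so 'make this
    bigger' reaches the model as 'make #header bigger'."""
    return [name for name, box in elements.items() if box.overlaps(circle)]
```

In practice you'd pull element boxes from the rendered preview and shape coordinates from tldraw's document, then inject the resolved names into the model's context alongside the transcript.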
Things That Broke
Voice UX is unforgiving. Text prompts let you edit and rephrase. Voice is real-time. "Make it, uh, like... bigger? The header part" needs to just work. The plan-then-approve workflow saved us.
Silence during generation. Streaming tools fixed it, but required rethinking the entire tool execution model. The default "call tool, wait, resume" pattern doesn't work for voice.
Multimodal state coordination. Voice, canvas, and images arrive simultaneously. Getting fresh canvas state to the code agent at execution time required a custom live state registry outside ADK's default state management.
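The core idea of that registry is read-at-execution-time rather than capture-at-delegation-time. A hedged stdlib sketch (the `LiveStateRegistry` name and shape are mine):

```python
import threading

class LiveStateRegistry:
    """Holds the latest canvas snapshot; tools read it at *execution* time,
    so a build delegated ten seconds ago still sees your newest strokes."""
    def __init__(self):
        self._lock = threading.Lock()
        self._state: dict = {}

    def publish(self, key: str, value) -> None:
        with self._lock:
            self._state[key] = value  # frontend pushes on every canvas change

    def latest(self, key: str):
        with self._lock:
            return self._state.get(key)
```

The frontend publishes on every stroke; the code agent calls `latest("canvas")` only when its tool actually runs, which is what keeps stale snapshots out of generation.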
What I Learned
Sketching is underrated. The moment you circle something and say "change this," you realize text was never the right interface.
Multimodal > sum of parts. Voice alone is vague. Canvas alone is silent. Combined, they give the AI far richer understanding than either could alone.
ADK handles the plumbing. Runner and LiveRequestQueue managed concurrent tools, session state, and streaming. I focused on the product.
Streaming is non-negotiable for voice. BIDI streaming + binary WebSocket frames keep latency low enough for natural conversation. Any buffering breaks the experience.
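"Binary frames" just means shipping raw PCM with a tiny header instead of base64-in-JSON, which bloats payloads by about a third. A hypothetical framing sketch (the header layout here is illustrative, not the Live API's actual wire format):

```python
import struct

AUDIO_FRAME = 0xA1  # hypothetical frame-type byte for "audio chunk"

def pack_frame(seq: int, pcm: bytes) -> bytes:
    """Prefix raw 16-bit PCM with a 5-byte header: type byte + sequence number."""
    return struct.pack("!BI", AUDIO_FRAME, seq) + pcm

def unpack_frame(frame: bytes) -> tuple[int, bytes]:
    kind, seq = struct.unpack("!BI", frame[:5])
    assert kind == AUDIO_FRAME
    return seq, frame[5:]
```

Sequence numbers let the receiver detect drops without waiting; the audio bytes themselves are never re-encoded on the way through.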
If you're building voice-first AI, I hope this gives you useful patterns. The biggest lesson?
Stop thinking in text boxes.
Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge