TL;DR
I built Brew — a real-time, voice-first AI ordering system for coffee shop drive-thrus. Customers talk to an AI barista through their microphone, and it takes their order through natural conversation. No buttons, no typing, just speech. The AI listens, understands complex orders with modifiers, handles interruptions, and updates a live on-screen menu and receipt as the conversation flows.
GitHub: github.com/thilak15/Brew
The Problem I Wanted to Solve
Traditional drive-thru ordering is broken. Long wait times, order inaccuracies, and staffing challenges plague the industry. Human operators handle one car at a time, miscommunication leads to wrong orders, and during peak hours, lines stretch around the block.
I wanted to see if a live voice AI agent could do this better — not a chatbot with text-to-speech bolted on, but a genuinely conversational agent that handles the full complexity of real ordering: sizes, modifiers, corrections, interruptions, and multi-item requests.
What Brew Does
Brew replaces the human operator at a drive-thru speaker box with an AI barista. Here's what it handles:
- Natural speech understanding — "Can I get a grande iced latte with oat milk and an extra shot?" works exactly as you'd expect
- Interruptions (barge-in) — Change your mind mid-sentence. The AI stops speaking and listens
- Real-time UI updates — The menu highlights relevant categories and the receipt builds live as items are confirmed
- Complex order management — Modifiers (syrups, milk swaps, toppings, ice levels, warming), undo, batch operations, and running totals
- Multilingual support — Speak in Spanish, Hindi, or any language Gemini understands, and the agent mirrors your language automatically
- Session persistence — Cart state survives Cloud Run instance restarts via Firestore
The menu has 22 items across 3 categories (Drinks, Breakfast, Desserts) with a full modifier system.
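To make the modifier system concrete, here is a minimal sketch of what a menu entry might look like. The field names and prices are illustrative, not taken from the Brew repository:

```python
# Illustrative menu entry shape (field names and prices are hypothetical,
# not the actual Brew data files).
MENU = {
    "drinks": [
        {
            "id": "iced_latte",
            "name": "Iced Latte",
            "sizes": {"tall": 3.95, "grande": 4.45, "venti": 4.95},
            "modifiers": {
                "milk": ["whole", "oat", "almond"],
                "syrup": ["vanilla", "caramel"],
                "extras": ["extra shot"],
            },
        },
    ],
    "breakfast": [],
    "desserts": [
        {"id": "cake_pop", "name": "Cake Pop", "sizes": {"one": 2.25}, "modifiers": {}},
    ],
}

def price(item_id: str, size: str) -> float:
    """Look up the base price for an item/size pair across all categories."""
    for category in MENU.values():
        for item in category:
            if item["id"] == item_id:
                return item["sizes"][size]
    raise KeyError(item_id)
```

A structure like this is enough to ground both the system prompt and the tool-call validation in the same source of truth.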
The Tech Stack — All Google AI and Cloud
This project is built end-to-end on Google's AI and Cloud platform. Here's every piece:
| Layer | Technology | Purpose |
|---|---|---|
| AI Model | Gemini 2.5 Flash Native Audio | Real-time voice conversation with function calling |
| Agent Framework | Google Agent Development Kit (ADK) | Agent orchestration, tool management, live streaming |
| Backend | Python 3.11, FastAPI | WebSocket server, session management |
| Frontend | Next.js 14, React 18, TypeScript | Dynamic UI with real-time state updates |
| Audio | Web Audio API (AudioWorklet) | Low-latency audio capture and playback |
| Transport | WebSockets | Bidirectional PCM audio + JSON state streaming |
| Session Persistence | Google Cloud Firestore | Cart state across instances |
| Deployment | Google Cloud Run | Serverless containers for backend and frontend |
| Container Registry | Google Artifact Registry | Docker image storage |
| CI/CD | GitHub Actions + Workload Identity Federation | Keyless automated deployment to GCP |
How I Built It — Architecture Deep Dive
The system has four layers:
1. Browser (Customer Device)
The Next.js frontend captures microphone audio via the Web Audio API using an AudioWorklet processor. Raw PCM audio at 16kHz streams to the backend over a WebSocket. The frontend receives two things back: audio response bytes (played through another AudioWorklet) and JSON state updates that drive the UI.
Three main components power the interface:
- SmartMenu — A dynamic tabbed menu that auto-switches categories (ordering a "Cake Pop" flips the view to Desserts)
- LiveReceipt — A real-time order panel showing items, modifiers, and a running price total
- AudioVisualizer — Visual feedback during the conversation
2. Backend Server (Cloud Run)
A Python/FastAPI WebSocket server running on Cloud Run. It manages the bidirectional audio stream between the browser and Gemini. The key responsibilities:
- Hosts the ADK `Runner` that orchestrates the agent lifecycle
- Implements a tool gate mechanism — blocks user audio while the AI executes tool calls, preventing race conditions where the model hears its own confirmations
- Handles upstream (browser → Gemini) and downstream (Gemini → browser) as concurrent async tasks
- Proactive session reconnection before the 10-minute Live API hard limit
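The upstream/downstream split can be sketched as two concurrent asyncio tasks shuttling frames between queues. The queue names and sentinel-based shutdown here are illustrative, not the actual Brew internals:

```python
import asyncio

async def upstream(browser_audio: asyncio.Queue, model_in: asyncio.Queue):
    """Forward binary audio frames from the browser toward Gemini."""
    while (frame := await browser_audio.get()) is not None:  # None = session closed
        await model_in.put(frame)

async def downstream(model_out: asyncio.Queue, browser_out: asyncio.Queue):
    """Forward model audio bytes / JSON state events back to the browser."""
    while (event := await model_out.get()) is not None:
        await browser_out.put(event)

async def session():
    browser_audio, model_in = asyncio.Queue(), asyncio.Queue()
    model_out, browser_out = asyncio.Queue(), asyncio.Queue()
    tasks = [asyncio.create_task(upstream(browser_audio, model_in)),
             asyncio.create_task(downstream(model_out, browser_out))]
    # Simulate one frame in each direction, then shut both directions down.
    await browser_audio.put(b"\x00\x01")
    await model_out.put({"type": "order_state"})
    await browser_audio.put(None)
    await model_out.put(None)
    await asyncio.gather(*tasks)
    return await model_in.get(), await browser_out.get()
```

Running both directions as independent tasks is what lets the agent keep listening while it is still speaking, which is the basis for barge-in.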
3. Agent Layer (Google ADK)
The agent is defined using Google's Agent Development Kit with 14 tools for order management:
```python
root_agent = Agent(
    name="brew_agent",
    model="gemini-2.5-flash-native-audio-preview-12-2025",
    description="Drive-thru barista that takes beverage orders with modifiers.",
    instruction=get_system_prompt(),
    tools=[
        add_item, add_items,
        remove_item, remove_items,
        add_modifier, add_modifiers,
        remove_modifier, set_modifier,
        set_ice_level, undo_last_change,
        clear_order, set_menu_view,
        get_order_summary,
    ],
)
```
ADK's run_live() method establishes a persistent bidirectional stream with the Gemini Live API. Tools are plain Python functions with detailed docstrings that the model uses for function calling.
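Since the docstring is the model's only documentation for a tool, it pays to spell out argument formats and defaults explicitly. Here is a sketch of what one such tool might look like — the real `add_item` in the repo will differ; the point is the docstring-as-interface pattern:

```python
# Hypothetical re-creation of one tool; field names are illustrative.
ORDER: list[dict] = []

def add_item(item_name: str, size: str = "grande", quantity: int = 1) -> dict:
    """Add a drink or food item to the current order.

    Args:
        item_name: Exact menu item name, e.g. "Iced Latte" or "Cake Pop".
        size: One of "tall", "grande", "venti". Ignored for food items.
        quantity: How many of this item to add (default 1).

    Returns:
        dict with the "item_id" of the new line item and "status": "added".
    """
    item_id = f"item_{len(ORDER) + 1}"   # sequential IDs, not UUIDs
    ORDER.append({"id": item_id, "name": item_name,
                  "size": size, "quantity": quantity, "modifiers": []})
    return {"item_id": item_id, "status": "added"}
```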
4. AI Model (Gemini Live API)
The gemini-2.5-flash-native-audio-preview-12-2025 model handles everything in a single streaming session: receives raw audio, processes speech, decides when to call tools, and generates spoken responses. The system prompt injects the full menu (items, prices, sizes, modifiers) so the model is grounded in real data.
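Menu grounding can be as simple as rendering the menu data into the instruction string. A minimal sketch of a parameterized variant of `get_system_prompt` (the repo's actual prompt will be more detailed):

```python
import json

def get_system_prompt(menu: dict) -> str:
    """Render the full menu into the system prompt so the model only
    offers items, sizes, and prices that actually exist."""
    menu_json = json.dumps(menu, indent=2)
    return (
        "You are Brew, a drive-thru barista. Take orders using the tools.\n"
        "Only sell items from this menu; quote these exact prices:\n"
        f"{menu_json}\n"
        "If a customer asks for something not on the menu, politely decline."
    )
```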
Data Flow
Customer speaks into mic
→ Browser captures PCM audio via AudioWorklet
→ WebSocket sends binary audio frames to backend
→ Backend forwards audio to Gemini via ADK run_live()
→ Gemini processes speech, decides to call tools or respond
→ If tool call: ADK executes tool → updates OrderState → syncs to Firestore
→ Gemini generates audio response
→ Backend streams audio bytes back over WebSocket
→ Browser plays audio via AudioWorklet
→ Backend sends JSON order state updates
→ Frontend re-renders SmartMenu + LiveReceipt in real time
Google Cloud Services in Detail
Gemini Live API (via Google GenAI SDK)
This is the core of Brew. The Live API provides native audio streaming — the model receives raw audio and produces audio responses directly, without separate speech-to-text or text-to-speech steps. Combined with function calling, this means the model can hear "add oat milk to both drinks," call the add_modifiers batch tool, and speak a confirmation — all in one streaming session with sub-second latency.
Google Agent Development Kit (ADK)
ADK handles the agent lifecycle. The run_live() method manages the persistent WebSocket connection to Gemini, routes tool calls to my Python functions, and handles the back-and-forth of a multi-turn conversation. I defined 14 tools with detailed docstrings, and ADK + Gemini handle the rest.
Cloud Run
Both the backend (FastAPI) and frontend (Next.js) are deployed as separate Cloud Run services. Session affinity is critical for WebSocket connections — without it, requests hit different instances that don't have the session state.
Firestore
Cart state is persisted to Firestore after every order change. This means if Cloud Run scales horizontally or an instance restarts, the customer's order survives. I built a custom FirestoreSessionService that wraps ADK's InMemorySessionService with Firestore persistence.
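The write-through pattern behind that wrapper can be sketched like this. The class and method names are hypothetical, and the client is duck-typed so a dict-backed fake can stand in for `google.cloud.firestore` in tests:

```python
class CartStore:
    """Simplified persistence pattern: keep a hot in-memory copy, write
    through to a document store so the cart survives instance restarts.
    `client` is anything exposing get/set on (collection, doc_id)."""

    def __init__(self, client, collection: str = "sessions"):
        self.client = client
        self.collection = collection
        self.cache: dict[str, dict] = {}

    def save(self, session_id: str, cart: dict) -> None:
        self.cache[session_id] = cart
        self.client.set(self.collection, session_id, cart)  # write-through

    def load(self, session_id: str) -> dict:
        if session_id in self.cache:                 # warm instance: memory hit
            return self.cache[session_id]
        cart = self.client.get(self.collection, session_id) or {}
        self.cache[session_id] = cart                # rehydrate after a restart
        return cart

class FakeClient:
    """Dict-backed stand-in for the real Firestore client."""
    def __init__(self):
        self.docs: dict[tuple, dict] = {}
    def set(self, coll, doc_id, data):
        self.docs[(coll, doc_id)] = data
    def get(self, coll, doc_id):
        return self.docs.get((coll, doc_id))
```

Reads stay in memory on the happy path, so Firestore latency only shows up on the rare cold-start rehydration.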
Artifact Registry + Workload Identity Federation
Docker images are stored in Artifact Registry. CI/CD uses Workload Identity Federation for keyless authentication from GitHub Actions to GCP — no service account keys stored anywhere.
Hard Problems I Solved
1. The Tool Gate Problem
Without intervention, the model would hear its own tool-call confirmations as user input, creating infinite loops. I implemented a tool gate that blocks user audio forwarding while the AI is executing tools. This was the single most impactful fix for reliability.
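The mechanism can be sketched with an `asyncio.Event` that closes around tool execution. This is a simplified illustration of the idea, not the repo's implementation:

```python
import asyncio

class ToolGate:
    """While a tool call runs, user audio is held instead of forwarded,
    so the model never hears its own confirmations as new input."""

    def __init__(self):
        self._open = asyncio.Event()
        self._open.set()                       # gate starts open

    async def run_tool(self, tool, *args):
        self._open.clear()                     # close: stop forwarding audio
        try:
            return tool(*args)
        finally:
            self._open.set()                   # reopen once the tool finishes

    async def forward(self, frame: bytes, model_in: asyncio.Queue):
        await self._open.wait()                # parked while a tool executes
        await model_in.put(frame)
```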
2. The 10-Minute Session Limit
The Gemini Live API has a hard 10-minute session limit. Brew proactively reconnects at 8 minutes, injecting the current order context into the new session so the AI seamlessly continues the conversation without re-greeting the customer.
3. Model Hallucinating Tool Arguments
Native audio models sometimes hallucinate tool arguments — inventing item IDs that don't exist. I switched from UUIDs to sequential integer IDs (item_1, item_2, ...) which dramatically reduced hallucination. I also added _resolve_item_id() that handles numeric shorthand and positional references.
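A resolver along those lines might look like this — a hypothetical re-creation of the `_resolve_item_id()` idea, accepting an exact ID, a bare number, or a positional word:

```python
def resolve_item_id(ref: str, order: list[dict]) -> str:
    """Map loose model references onto real line-item IDs.

    Accepts an exact ID ("item_2"), numeric shorthand ("2"),
    or a positional reference ("last").
    """
    ids = [item["id"] for item in order]
    if ref in ids:
        return ref                       # exact ID, e.g. "item_2"
    if ref.isdigit():                    # numeric shorthand: "2" -> "item_2"
        candidate = f"item_{ref}"
        if candidate in ids:
            return candidate
    if ref.lower() in {"last", "latest"} and ids:
        return ids[-1]                   # positional reference
    raise ValueError(f"Unknown item reference: {ref!r}")
```

Raising on an unknown reference matters: the error surfaces back to the model as a tool result, prompting it to ask the customer instead of guessing.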
4. Batch Operations for Latency
Without batch tools, the model makes sequential tool calls with separate confirmations for each item in a multi-item order. I added add_items, remove_items, and add_modifiers batch tools that handle everything in a single call, cutting latency significantly.
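A batch modifier tool is a single loop over the targeted items; one tool round-trip replaces N. The signature here is illustrative:

```python
def add_modifiers(item_ids: list[str], modifier: str, order: list[dict]) -> dict:
    """Batch variant: apply one modifier to several items in a single
    tool call instead of N sequential calls."""
    updated = []
    for item in order:
        if item["id"] in item_ids:
            item["modifiers"].append(modifier)
            updated.append(item["id"])
    return {"updated": updated, "modifier": modifier}
```

The payoff is conversational, not just computational: the model speaks one confirmation ("added oat milk to both drinks") instead of one per item.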
5. Idempotency Guards
The model sometimes retries tool calls during transient errors. Without idempotency guards, this would add duplicate modifiers. Every add_modifier call checks for existing identical modifiers before applying.
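The guard itself is a one-line membership check before the mutation — sketched here on a single line item, with the return shape being illustrative:

```python
def add_modifier(item: dict, modifier: str) -> dict:
    """Idempotent sketch: a retried call with the same modifier is a
    no-op instead of stacking duplicates."""
    if modifier in item["modifiers"]:
        return {"status": "already_applied", "modifier": modifier}
    item["modifiers"].append(modifier)
    return {"status": "added", "modifier": modifier}
```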
Key Learnings
Tool docstrings are the primary interface. Clear, specific docstrings with examples produce dramatically better tool-calling accuracy than vague descriptions. I iterated on these more than any other part of the codebase.
AudioWorklet is non-negotiable. The deprecated `ScriptProcessorNode` introduces unpredictable latency; AudioWorklet provides consistent low-latency audio processing.
Session affinity on Cloud Run is essential for WebSocket connections. Without it, subsequent requests hit different instances.
Native audio models behave differently than text models. They're more prone to hallucinating tool arguments, more sensitive to background noise, and need explicit instructions about when NOT to respond (e.g., to background noise or their own echoes).
Firestore for session persistence is a perfect fit for serverless deployments. The read/write latency is low enough that it doesn't impact the real-time experience.
Running It Yourself
Brew is fully open source. You can run it locally with Docker:
```shell
git clone https://github.com/thilak15/Brew.git
cd Brew
cp backend/.env.example backend/.env
# Add your GOOGLE_API_KEY to backend/.env
docker compose up -d --build
```
Then open http://localhost:3000 in Chrome, click "Drive Up," allow mic access, and start ordering.
GitHub: github.com/thilak15/Brew
What's Next
Menu-agnostic deployment. The menu loads from a JSON file. Swap it out, and Brew becomes a taco shop, a pizza place, or a pharmacy pickup counter. The next step is a pipeline that takes any restaurant's menu and auto-generates a ready-to-deploy voice ordering agent.
Multilingual real-time language switching. Gemini's native audio model already understands multiple languages. The goal is automatic language detection mid-conversation — if a customer starts in English and switches to Spanish, the agent follows without any button press.
The hard part was proving that a live voice agent can handle complex, modifier-heavy ordering with interruptions, corrections, and batch operations — correctly and reliably. That's done. Now it's about making it work for anyone, in any language.
This project was created for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
Built by Thilak Daggula