The Problem
The Mem0 internship assignment was simple on the surface: build a voice-controlled local AI agent. But "voice-controlled" is where things get interesting. Mem0's core thesis is persistent memory for AI agents — agents that remember context across sessions. For that to work, the interaction layer has to be fast, natural, and frictionless. Voice is the most natural interface humans have. The challenge isn't just transcription — it's building a pipeline that classifies intent, routes to the right tool, and executes actions fast enough that the user doesn't feel the latency. If the pipeline takes 10 seconds, voice becomes painful. If it takes under 2 seconds, it feels like magic. That's what I set out to build.
Architecture Decision: Why Not Local Whisper
The first decision was where to run speech-to-text. Whisper Large V3 is the gold standard for accuracy, but running it locally on standard hardware is a trap: the model's VRAM requirements alone rule out most consumer machines, whereas serving it through Groq requires zero local VRAM.
Groq LPU delivers the same model at 50–100x faster inference with zero local compute. For a voice-first product, latency is the product.
For intent classification, I chose Gemini 2.5 Flash — ~400ms for structured JSON output, excellent cost-to-performance ratio, and reliable JSON when configured with response_mime_type="application/json". For code generation specifically, I escalated to Gemini 2.5 Pro. Code needs deeper reasoning, edge case handling, and idiomatic quality that Flash's speed-optimized architecture trades away. Crucially, Pro is only invoked for write_code intent — the human-in-the-loop confirmation step makes its higher latency invisible to the user.
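To make the classifier's contract concrete, here is a minimal sketch of how its JSON reply could be validated before routing. The four intent names come from this post; the schema fields and the parse_intent helper are my own illustrative assumptions, not the project's actual code. Even with response_mime_type="application/json", it pays to validate defensively before executing anything:

```python
import json

# The four intent categories the agent supports (from the post);
# the "intent"/"args" schema fields are illustrative assumptions.
VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def parse_intent(raw: str) -> dict:
    """Validate the model's JSON reply into an intent_data dict."""
    data = json.loads(raw)
    if data.get("intent") not in VALID_INTENTS:
        # Fall back to harmless chat rather than executing an unknown action
        return {"intent": "chat", "args": {}}
    data.setdefault("args", {})
    return data

# Example: a reply the classifier might produce
reply = '{"intent": "create_file", "args": {"path": "notes.txt"}}'
intent_data = parse_intent(reply)
```

The fallback-to-chat branch is the important design choice: an unrecognized intent should never reach a tool that touches disk.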
The Hardest Part: Gradio State for Human-in-the-Loop
This is the part that actually took time to solve correctly.
Destructive operations — writing files, generating code — need a confirmation step. The user should see what the AI classified and approve before anything hits disk. In a traditional web server, this is trivial. In Gradio, it's genuinely awkward.
Gradio's event model is fire-and-return. process_audio() runs, returns output, and Gradio is done. There's no native "pause and wait for user input" primitive. But I needed the pipeline to:
Transcribe audio
Classify intent
Show the user what was classified
Wait for a confirm button click
Then execute
The solution: gr.State.
I used a state component to persist the classified intent_data object between two separate event handlers. process_audio() stores the intent in state and shows a confirmation panel with gr.update(visible=True). When the user clicks "Confirm", confirm_execution() reads from state and executes.
```python
# process_audio stores the intent in state and shows the confirm panel
def process_audio(audio, confirm_enabled, state):
    transcription = transcribe(audio)             # Groq Whisper
    intent_data = classify_intent(transcription)  # Gemini Flash
    if confirm_enabled:
        # Defer execution: display the intent, reveal the confirm panel,
        # and stash intent_data in gr.State for the second handler
        return display_intent(intent_data), gr.update(visible=True), intent_data
    return execute(intent_data), gr.update(visible=False), None

# confirm_execution reads the stashed intent back from state
def confirm_execution(state):
    return execute(state)
```
The confirmation panel visibility is toggled via gr.update(visible=...). It's an inelegant hack around Gradio's event model — but it works cleanly, and the UX feels like a real human-in-the-loop checkpoint, not a bolted-on popup.
What I'd Do Differently
The obvious next integration is Mem0's memory SDK. Right now, the chat intent maintains 5 turns of in-session history — but the moment the user closes the tab, context is gone. With Mem0, I'd store conversation history and user preferences across sessions: the agent would remember that this user prefers TypeScript over Python, that their project sandbox is at ~/projects/agent, that they always want docstrings in Google format. That's the difference between a tool and an assistant. I'd also add a benchmarking layer — systematic evaluation across Gemini Flash, Pro, GPT-4o, and Claude Sonnet to compare intent classification accuracy and code quality scores, not just latency. Right now model selection is an educated guess. It should be data.
Results
The final pipeline achieves end-to-end latency of ~2.1s for a typical voice command: Groq Whisper transcription at <1s, Gemini Flash intent classification at ~400ms, tool execution at ~700ms. All four bonus features from the assignment spec are implemented: Human-in-the-Loop confirmation for destructive ops, compound command detection (multi-action parsing), session chat memory (5-turn sliding window), and path traversal protection (aggressive sandbox enforcement via safe_path()). The agent handles all four intent categories cleanly — create_file, write_code, summarize, and chat — and degrades gracefully when any pipeline stage fails. Files are sandboxed to output/, audio uploads are blocked above 25MB client-side, and all API calls have a 30s timeout to prevent hanging UI threads.
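The sandbox enforcement deserves a sketch. This is my reconstruction of what a safe_path() guard of this kind typically looks like — resolve the candidate path, then verify it never escapes the output/ root — and not necessarily the repo's exact implementation:

```python
import os

SANDBOX = os.path.abspath("output")

def safe_path(user_path: str) -> str:
    """Resolve user_path inside the sandbox or raise.

    Illustrative reconstruction of the safe_path() guard mentioned
    above. os.path.join discards SANDBOX if user_path is absolute,
    and abspath collapses any ../ segments, so the commonpath check
    catches both traversal tricks and absolute-path escapes.
    """
    candidate = os.path.abspath(os.path.join(SANDBOX, user_path))
    if os.path.commonpath([SANDBOX, candidate]) != SANDBOX:
        raise ValueError(f"path escapes sandbox: {user_path}")
    return candidate
```

Checking the resolved path rather than the raw string is the key point: naive prefix checks on unresolved input miss sequences like output/../../etc.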
GitHub: github.com/RohanSinghJaglan/mem_voice_ai_agent
Live Demo: https://www.loom.com/share/c807fb18d2b046268294acb9fa79f2cb