For my Mem0 AI/ML Developer Intern assignment, I had to build a local AI agent that accepts voice input, detects intent, and executes real actions on the machine — generating code, creating files, and summarizing text. Here's how I built it, what I chose and why, and the bugs that almost broke me.
What the Agent Does
You speak or type a command. The agent:
- Transcribes your audio using Whisper
- Classifies your intent using LLaMA 3
- Routes to the right tool (code gen, file ops, summarizer, or chat)
- Asks for your confirmation before writing anything to disk
- Saves the result to a sandboxed output/ folder
The whole thing runs with one Groq API key — no GPU required.
Architecture
Architecture diagram: the React frontend connected to the FastAPI backend
Backend: FastAPI (Python) handles all endpoints — /chat, /transcribe, /execute, /memory.
Frontend: React with plain CSS. No UI library. The design mirrors the Blue ChatGPT mockup from the assignment.
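To make the shape concrete, here's a minimal sketch of the backend layout, assuming FastAPI and Pydantic; the request model and helper are illustrative, not the exact code from my repo:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

def classify_intent(message: str) -> str:
    # Placeholder: the real version calls llama-3.1-8b-instant (see below).
    return "general_chat"

@app.post("/chat")
def chat(req: ChatRequest):
    intent = classify_intent(req.message)
    # Route to the matching tool: code gen, file ops, summarizer, or chat.
    return {"intent": intent, "reply": f"(handled as {intent})"}
```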
Why I Used Groq Instead of Running Models Locally
The assignment preferred Ollama for local LLM inference and local Whisper for STT. I tried both. Here's what happened:
My machine has 8 GB RAM and no dedicated GPU. Running Whisper base plus llama3.2 via Ollama simultaneously pushed memory usage over 11 GB — the system killed the process before a single transcription completed. CPU-only Whisper inference took 45–90 seconds per clip, which made iterating on the pipeline impossible.
Groq solved both problems. They host whisper-large-v3 (exactly the model the assignment recommends) and Meta's LLaMA 3 family (same weights Ollama would serve) on custom LPU hardware. The same transcription that took 60 seconds on my CPU takes under 2 seconds on Groq. It's free, fast, and requires one API key.
The models are identical — the only difference is where inference runs.
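For reference, both calls go through the official groq Python SDK; a minimal sketch (the file name is illustrative, the model names are the ones above):

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Speech-to-text with Whisper Large v3
with open("clip.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        file=("clip.wav", f.read()),
        model="whisper-large-v3",
    )
print(transcript.text)

# Chat completion with LLaMA 3
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```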
The Models I Chose and Why
Whisper Large v3 for STT — most accurate Whisper variant, handles accents and noisy audio well. Groq hosts it natively.
llama-3.1-8b-instant for intent classification — fast and cheap. Temperature 0 for deterministic output. The intent prompt is carefully structured with explicit rules and examples to avoid misclassification.
llama-3.3-70b-versatile for code generation — stronger model for better code quality. Temperature 0.3 for slightly creative but consistent output.
llama-3.1-8b-instant for general chat — fast enough for conversational response, temperature 0.7 for natural replies.
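In code this boils down to a small task-to-model map (the dict itself is my sketch; the values are the choices above):

```python
MODELS = {
    "intent":     {"model": "llama-3.1-8b-instant",    "temperature": 0.0},
    "write_code": {"model": "llama-3.3-70b-versatile", "temperature": 0.3},
    "chat":       {"model": "llama-3.1-8b-instant",    "temperature": 0.7},
}
STT_MODEL = "whisper-large-v3"
```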
Intent Detection — The Tricky Part
Getting intent classification right was harder than I expected. The first version used a simple keyword classifier. It failed on anything slightly rephrased.
I switched to LLM-based classification with a structured prompt:
```python
system = """
You are an intent classifier. Reply with ONLY one label:
write_code | create_file | summarize_and_save | summarize | general_chat

CRITICAL RULES:
1. If message has BOTH 'summarize' AND 'save' → summarize_and_save
2. If message has 'summarize' with NO save instruction → summarize
3. Empty file/folder request → create_file
4. Code request → write_code
5. Everything else → general_chat
"""
```
The ordering matters when parsing the model's reply: the longest, most specific labels are checked first, so summarize_and_save can't be shadowed by a partial match on summarize.
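A sketch of that parsing step (the function name is mine; it assumes the raw LLM reply may contain stray text around the label):

```python
# Ordered longest/most-specific first: "summarize_and_save" must win
# before "summarize" gets a chance to partial-match.
LABELS = ["summarize_and_save", "write_code", "create_file", "summarize", "general_chat"]

def parse_label(reply: str) -> str:
    reply = reply.lower()
    for label in LABELS:
        if label in reply:
            return label
    return "general_chat"  # safe fallback
```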
The Bug That Cost Me an Hour
The worst bug: "Summarize this and save it as pythonnotes.txt" was producing an empty file.
The compound command splitter was splitting on "and", turning one command into two:
- "Summarize this" → summarize intent → displayed in chat, no file
- "save it as pythonnotes.txt" → classified as create_file → empty file created
Fix: detect summarize + save pattern before splitting:
```python
import re

def parse_compound(message: str) -> list[str]:
    has_summarize = bool(re.search(r"\bsummarize\b", message, re.I))
    has_save = bool(re.search(r"\b(save|store)\b", message, re.I))
    if has_summarize and has_save:
        return [message]  # keep as ONE command
    # otherwise split on "and" / "then" and drop empty fragments
    parts = re.split(r"\band\b|\bthen\b", message, flags=re.I)
    return [p.strip() for p in parts if p.strip()]
```
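With the fix in place:

```python
>>> parse_compound("Summarize this and save it as pythonnotes.txt")
['Summarize this and save it as pythonnotes.txt']
>>> parse_compound("Write a retry function and create a file called config.json")
['Write a retry function', 'create a file called config.json']
```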
The second bug on the same day: "save it as techsummary" wasn't extracting the filename because my regex didn't account for the word "it" between "save" and "as". One word broke the whole filename extractor. Fixed by making "it" optional: save\s+(?:it\s+)?(?:to|as|in).
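Here's a sketch of the fixed extractor (the function name and the capture group are mine; save\s+(?:it\s+)?(?:to|as|in) is the pattern described above):

```python
import re

# (?:it\s+)? is the one-word fix: "save it as techsummary",
# "save as report.md", and "save to notes.txt" all match now.
FILENAME_RE = re.compile(r"save\s+(?:it\s+)?(?:to|as|in)\s+([\w.\-/]+)", re.I)

def extract_filename(message: str) -> str | None:
    m = FILENAME_RE.search(message)
    return m.group(1) if m else None
```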
The output/ folder open in File Explorer, showing the real files the agent created
Bonus Features I Added
Human-in-the-Loop — A confirmation bar slides up before any file write. The user sees the exact path, intent type, and two buttons. Input is disabled until they confirm or cancel. Nothing touches the filesystem without explicit approval.
Persistent Memory — Each browser tab gets a unique session ID from sessionStorage. Memory is stored as JSON, survives server restarts, and is capped at 20 exchanges per session. It's passed to the LLM as structured HumanMessage/AIMessage objects for proper multi-turn context (sketched just after this list).
Compound Commands — "Write a retry function and create a file called config.json" executes both intents in one message.
Action Logging — Every file operation is appended to output/actions.log with a timestamp and file path.
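A minimal sketch of that memory store, assuming LangChain's message types; the file name and helper names are mine:

```python
import json
from pathlib import Path
from langchain_core.messages import AIMessage, HumanMessage

MEMORY_FILE = Path("memory.json")
MAX_EXCHANGES = 20

def save_exchange(session_id: str, user_msg: str, ai_msg: str) -> None:
    data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    history = data.setdefault(session_id, [])
    history.append({"user": user_msg, "ai": ai_msg})
    data[session_id] = history[-MAX_EXCHANGES:]  # cap at 20 exchanges per session
    MEMORY_FILE.write_text(json.dumps(data, indent=2))

def load_history(session_id: str) -> list:
    # Rebuild structured HumanMessage/AIMessage pairs for multi-turn context.
    data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    messages = []
    for turn in data.get(session_id, []):
        messages.append(HumanMessage(content=turn["user"]))
        messages.append(AIMessage(content=turn["ai"]))
    return messages
```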
The confirmation bar visible in the UI
What I'd Do Differently
- Streaming responses — right now the UI waits for the full LLM response. Using LangGraph's stream() would make code generation feel instant.
- Code execution sandbox — the natural next step is running the generated code in a Docker container and showing the output.
- Better compound command handling — currently only splits on "and"/"then". A smarter parser using the LLM itself to identify sub-commands would handle more natural phrasing.
Examples (Screenshots)
- Live audio input
- Confirming the file save
- Result stored in the output/ folder
- Recorded audio upload
- Selecting an audio file
- Parsed audio file, with a confirmation prompt before creating the file
- General chat