DEV Community

Kishan S
How I Built a Voice-Controlled AI Agent in 2 Days Using Groq, LangChain, and React

For my Mem0 AI/ML Developer Intern assignment, I had to build a local AI agent that accepts voice input, detects intent, and executes real actions on the machine — generating code, creating files, and summarizing text. Here's how I built it, what I chose and why, and the bugs that almost broke me.

What the Agent Does

You speak or type a command. The agent:

  1. Transcribes your audio using Whisper
  2. Classifies your intent using LLaMA 3
  3. Routes to the right tool (code gen, file ops, summarizer, or chat)
  4. Asks for your confirmation before writing anything to disk
  5. Saves the result to a sandboxed output/ folder

The whole thing runs with one Groq API key — no GPU required.
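The routing in step 3 boils down to a dispatch table from intent label to tool. Here's a minimal sketch of that idea; the handler names and return strings are placeholders of my own, not the project's actual tool functions:

```python
# Minimal sketch of the intent-to-tool router (handler names are hypothetical).

def write_code(cmd: str) -> str:
    return f"[code generated for: {cmd}]"

def create_file(cmd: str) -> str:
    return f"[file created for: {cmd}]"

def summarize(cmd: str) -> str:
    return f"[summary of: {cmd}]"

def general_chat(cmd: str) -> str:
    return f"[chat reply to: {cmd}]"

TOOLS = {
    "write_code": write_code,
    "create_file": create_file,
    "summarize": summarize,
    "general_chat": general_chat,
}

def route(intent: str, command: str) -> str:
    # Unknown or malformed intents fall back to general chat.
    handler = TOOLS.get(intent, general_chat)
    return handler(command)
```

The dict lookup with a `general_chat` fallback means a misclassified intent degrades to a conversation instead of crashing the pipeline.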


Architecture

[Screenshot: local AI agent architecture, connecting the frontend to the backend]

Backend: FastAPI (Python) handles all endpoints — /chat, /transcribe, /execute, /memory.
Frontend: React with plain CSS. No UI library. The design mirrors the Blue ChatGPT mockup from the assignment.

[Screenshot: frontend and backend servers running in side-by-side terminals]


Why I Used Groq Instead of Running Models Locally

The assignment recommended Ollama for local LLM inference and local Whisper for speech-to-text. I tried both. Here's what happened:
My machine has 8 GB RAM and no dedicated GPU. Running Whisper base plus llama3.2 via Ollama simultaneously pushed memory usage over 11 GB — the system killed the process before a single transcription completed. CPU-only Whisper inference took 45–90 seconds per clip, which made iterating on the pipeline impossible.
Groq solved both problems. They host whisper-large-v3 (exactly the model the assignment recommends) and Meta's LLaMA 3 family (same weights Ollama would serve) on custom LPU hardware. The same transcription that took 60 seconds on my CPU takes under 2 seconds on Groq. It's free, fast, and requires one API key.
The models are identical — the only difference is where inference runs.


The Models I Chose and Why

  • Whisper Large v3 for STT — the most accurate Whisper variant; it handles accents and noisy audio well, and Groq hosts it natively.
  • llama-3.1-8b-instant for intent classification — fast and cheap. Temperature 0 for deterministic output. The intent prompt is carefully structured with explicit rules and examples to avoid misclassification.
  • llama-3.3-70b-versatile for code generation — a stronger model for better code quality. Temperature 0.3 for slightly creative but consistent output.
  • llama-3.1-8b-instant for general chat — fast enough for conversational responses, temperature 0.7 for natural replies.
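Those choices fit in one small config. A sketch, with key names of my own; only the model IDs and temperatures come from the list above:

```python
# One place to keep the per-task model and temperature choices.
# The dict keys are my own naming; model IDs match Groq's hosted models.
MODEL_CONFIG = {
    "stt":     {"model": "whisper-large-v3"},
    "intent":  {"model": "llama-3.1-8b-instant",    "temperature": 0.0},
    "codegen": {"model": "llama-3.3-70b-versatile", "temperature": 0.3},
    "chat":    {"model": "llama-3.1-8b-instant",    "temperature": 0.7},
}
```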


Intent Detection — The Tricky Part

Getting intent classification right was harder than I expected. The first version used a simple keyword classifier. It failed on anything slightly rephrased.
I switched to LLM-based classification with a structured prompt:

system = """
You are an intent classifier. Reply with ONLY one label:
write_code | create_file | summarize_and_save | summarize | general_chat

CRITICAL RULES:
1. If message has BOTH 'summarize' AND 'save' → summarize_and_save
2. If message has 'summarize' with NO save instruction → summarize
3. Empty file/folder request → create_file
4. Code request → write_code
5. Everything else → general_chat
"""

The ordering matters — longest/most-specific labels are checked first to avoid partial matches.

[Screenshot: chat UI showing the yellow write_code badge]


The Bug That Cost Me an Hour

The worst bug: "Summarize this and save it as pythonnotes.txt" was producing an empty file.
The compound command splitter was splitting on "and", turning one command into two:

  • "Summarize this" → summarize intent → displayed in chat, no file
  • "save it as pythonnotes.txt" → classified as create_file → empty file created

Fix: detect summarize + save pattern before splitting:

import re

def parse_compound(message: str) -> list[str]:
    has_summarize = bool(re.search(r'\bsummarize\b', message, re.I))
    has_save = bool(re.search(r'\b(save|store)\b', message, re.I))
    if has_summarize and has_save:
        return [message]  # keep as ONE command
    # otherwise split on "and" / "then", dropping empty fragments
    parts = re.split(r"\band\b|\bthen\b", message, flags=re.I)
    return [p.strip() for p in parts if p.strip()]

The second bug on the same day: "save it as techsummary" wasn't extracting the filename because my regex didn't account for the word "it" between "save" and "as". One word broke the whole filename extractor. Fixed by making "it" optional: save\s+(?:it\s+)?(?:to|as|in).

[Screenshot: the output/ folder open in File Explorer showing the generated files]


Bonus Features I Added

Human-in-the-Loop — A confirmation bar slides up before any file write. The user sees the exact path, intent type, and two buttons. Input is disabled until they confirm or cancel. Nothing touches the filesystem without explicit approval.
Persistent Memory — Each browser tab gets a unique session ID from sessionStorage. Memory is stored as JSON, survives server restarts, capped at 20 exchanges per session. Passed to the LLM as structured HumanMessage/AIMessage objects for proper multi-turn context.
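The cap-and-persist logic amounts to a few lines. A sketch under my own assumptions (one JSON file per session, exchanges stored as user/ai dicts; the project's actual storage layout may differ):

```python
import json
from pathlib import Path

MAX_EXCHANGES = 20  # per the post: memory is capped at 20 exchanges per session

def append_exchange(store: Path, session_id: str, user: str, ai: str) -> list[dict]:
    # Load this session's history from a JSON file (so it survives restarts),
    # append the new exchange, and keep only the most recent MAX_EXCHANGES.
    path = store / f"{session_id}.json"
    history = json.loads(path.read_text()) if path.exists() else []
    history.append({"user": user, "ai": ai})
    history = history[-MAX_EXCHANGES:]
    path.write_text(json.dumps(history))
    return history
```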
Compound Commands — "Write a retry function and create a file called config.json" executes both intents in one message.
Action Logging — Every file operation is appended to output/actions.log with a timestamp and file path.
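The logger is a simple append. A sketch; the exact line format is my assumption, but the destination (output/actions.log, one timestamped entry per file operation) matches the post:

```python
from datetime import datetime, timezone
from pathlib import Path

def log_action(output_dir: Path, action: str, file_path: str) -> None:
    # Append one timestamped line per file operation to output/actions.log.
    # (The "timestamp action path" line format is my own sketch.)
    output_dir.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).isoformat()
    with (output_dir / "actions.log").open("a") as f:
        f.write(f"{stamp} {action} {file_path}\n")
```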

[Screenshot: the confirmation bar visible in the UI]


What I'd Do Differently

  • Streaming responses — right now the UI waits for the full LLM response. Using LangGraph's stream() would make code generation feel instant.
  • Code execution sandbox — the natural next step is running the generated code in a Docker container and showing the output.
  • Better compound command handling — the parser currently only splits on "and"/"then". A smarter parser that uses the LLM itself to identify sub-commands would handle more natural phrasing.

Example Screenshots

Live audio: recording a command, confirming the file save, and the result stored in the output/ folder.

Recorded audio upload: selecting an audio file, the parsed transcript with a confirmation prompt to create the file, and the saved file in output/.

General chat: conversation with memory saved for each session.

Summarization: summarize-only, then summarize-and-store with a confirmation step, and the stored aisummary.txt in the output/ folder.

