For my Mem0 AI/ML Developer Intern assignment, I had to build a local AI agent that accepts voice input, detects intent, and executes real actions on the machine — generating code, creating files, and summarizing text. Here's how I built it, what I chose and why, and the bugs that almost broke me.
What the Agent Does
You speak or type a command. The agent:
- Transcribes your audio using Whisper
- Classifies your intent using LLaMA 3
- Routes to the right tool (code gen, file ops, summarizer, or chat)
- Asks for your confirmation before writing anything to disk
- Saves the result to a sandboxed output/ folder
The whole thing runs with one Groq API key — no GPU required.
Architecture
Architecture diagram: the React frontend connected to the FastAPI backend
Backend: FastAPI (Python) handles all endpoints — /chat, /transcribe, /execute, /memory.
Frontend: React with plain CSS. No UI library. The design mirrors the Blue ChatGPT mockup from the assignment.
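To make the shape concrete, here's a minimal sketch of the backend layout, assuming FastAPI and Pydantic; the request model and helper are illustrative, not the exact code from my repo:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    session_id: str
    message: str

def classify_intent(message: str) -> str:
    # Placeholder: the real version calls llama-3.1-8b-instant (see below).
    return "general_chat"

@app.post("/chat")
def chat(req: ChatRequest):
    intent = classify_intent(req.message)
    # Route to the matching tool: code gen, file ops, summarizer, or chat.
    return {"intent": intent, "reply": f"(handled as {intent})"}
```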
Why I Used Groq Instead of Running Models Locally
The assignment preferred Ollama for local LLM inference and local Whisper for STT. I tried both. Here's what happened:
My machine has 8 GB RAM and no dedicated GPU. Running Whisper base plus llama3.2 via Ollama simultaneously pushed memory usage over 11 GB — the system killed the process before a single transcription completed. CPU-only Whisper inference took 45–90 seconds per clip, which made iterating on the pipeline impossible.
Groq solved both problems. They host whisper-large-v3 (exactly the model the assignment recommends) and Meta's LLaMA 3 family (same weights Ollama would serve) on custom LPU hardware. The same transcription that took 60 seconds on my CPU takes under 2 seconds on Groq. It's free, fast, and requires one API key.
The models are identical — the only difference is where inference runs.
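For reference, both calls go through the official groq Python SDK; a minimal sketch (the file name is illustrative, the model names are the ones above):

```python
from groq import Groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Speech-to-text with Whisper Large v3
with open("clip.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        file=("clip.wav", f.read()),
        model="whisper-large-v3",
    )
print(transcript.text)

# Chat completion with LLaMA 3
resp = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```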
The Models I Chose and Why
Whisper Large v3 for STT — most accurate Whisper variant, handles accents and noisy audio well. Groq hosts it natively.
llama-3.1-8b-instant for intent classification — fast and cheap. Temperature 0 for deterministic output. The intent prompt is carefully structured with explicit rules and examples to avoid misclassification.
llama-3.3-70b-versatile for code generation — stronger model for better code quality. Temperature 0.3 for slightly creative but consistent output.
llama-3.1-8b-instant for general chat — fast enough for conversational response, temperature 0.7 for natural replies.
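In code this boils down to a small task-to-model map (the dict itself is my sketch; the values are the choices above):

```python
MODELS = {
    "intent":     {"model": "llama-3.1-8b-instant",    "temperature": 0.0},
    "write_code": {"model": "llama-3.3-70b-versatile", "temperature": 0.3},
    "chat":       {"model": "llama-3.1-8b-instant",    "temperature": 0.7},
}
STT_MODEL = "whisper-large-v3"
```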
Intent Detection — The Tricky Part
Getting intent classification right was harder than I expected. The first version used a simple keyword classifier. It failed on anything slightly rephrased.
I switched to LLM-based classification with a structured prompt:
```python
system = """
You are an intent classifier. Reply with ONLY one label:
write_code | create_file | summarize_and_save | summarize | general_chat

CRITICAL RULES:
1. If message has BOTH 'summarize' AND 'save' → summarize_and_save
2. If message has 'summarize' with NO save instruction → summarize
3. Empty file/folder request → create_file
4. Code request → write_code
5. Everything else → general_chat
"""
```
The ordering matters when parsing the model's reply: the longest, most specific labels are checked first, so summarize_and_save can't be shadowed by a partial match on summarize.
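A sketch of that parsing step (the function name is mine; it assumes the raw LLM reply may contain stray text around the label):

```python
# Ordered longest/most-specific first: "summarize_and_save" must win
# before "summarize" gets a chance to partial-match.
LABELS = ["summarize_and_save", "write_code", "create_file", "summarize", "general_chat"]

def parse_label(reply: str) -> str:
    reply = reply.lower()
    for label in LABELS:
        if label in reply:
            return label
    return "general_chat"  # safe fallback
```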
The Bug That Cost Me an Hour
The worst bug: "Summarize this and save it as pythonnotes.txt" was producing an empty file.
The compound command splitter was splitting on "and", turning one command into two:
- "Summarize this" → summarize intent → displayed in chat, no file
- "save it as pythonnotes.txt" → classified as create_file → empty file created
Fix: detect summarize + save pattern before splitting:
```python
import re

def parse_compound(message: str) -> list[str]:
    has_summarize = bool(re.search(r"\bsummarize\b", message, re.I))
    has_save = bool(re.search(r"\b(save|store)\b", message, re.I))
    if has_summarize and has_save:
        return [message]  # keep as ONE command
    # otherwise split on "and" / "then" and drop empty fragments
    parts = re.split(r"\band\b|\bthen\b", message, flags=re.I)
    return [p.strip() for p in parts if p.strip()]
```
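With the fix in place:

```python
>>> parse_compound("Summarize this and save it as pythonnotes.txt")
['Summarize this and save it as pythonnotes.txt']
>>> parse_compound("Write a retry function and create a file called config.json")
['Write a retry function', 'create a file called config.json']
```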
The second bug on the same day: "save it as techsummary" wasn't extracting the filename because my regex didn't account for the word "it" between "save" and "as". One word broke the whole filename extractor. Fixed by making "it" optional: save\s+(?:it\s+)?(?:to|as|in).
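Here's a sketch of the fixed extractor (the function name and the capture group are mine; save\s+(?:it\s+)?(?:to|as|in) is the pattern described above):

```python
import re

# (?:it\s+)? is the one-word fix: "save it as techsummary",
# "save as report.md", and "save to notes.txt" all match now.
FILENAME_RE = re.compile(r"save\s+(?:it\s+)?(?:to|as|in)\s+([\w.\-/]+)", re.I)

def extract_filename(message: str) -> str | None:
    m = FILENAME_RE.search(message)
    return m.group(1) if m else None
```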
The output/ folder open in File Explorer, showing the real files the agent created
Bonus Features I Added
Human-in-the-Loop — A confirmation bar slides up before any file write. The user sees the exact path, intent type, and two buttons. Input is disabled until they confirm or cancel. Nothing touches the filesystem without explicit approval.
Persistent Memory — Each browser tab gets a unique session ID from sessionStorage. Memory is stored as JSON, survives server restarts, and is capped at 20 exchanges per session. It's passed to the LLM as structured HumanMessage/AIMessage objects for proper multi-turn context (sketched just after this list).
Compound Commands — "Write a retry function and create a file called config.json" executes both intents in one message.
Action Logging — Every file operation is appended to output/actions.log with a timestamp and file path.
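A minimal sketch of that memory store, assuming LangChain's message types; the file name and helper names are mine:

```python
import json
from pathlib import Path
from langchain_core.messages import AIMessage, HumanMessage

MEMORY_FILE = Path("memory.json")
MAX_EXCHANGES = 20

def save_exchange(session_id: str, user_msg: str, ai_msg: str) -> None:
    data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    history = data.setdefault(session_id, [])
    history.append({"user": user_msg, "ai": ai_msg})
    data[session_id] = history[-MAX_EXCHANGES:]  # cap at 20 exchanges per session
    MEMORY_FILE.write_text(json.dumps(data, indent=2))

def load_history(session_id: str) -> list:
    # Rebuild structured HumanMessage/AIMessage pairs for multi-turn context.
    data = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    messages = []
    for turn in data.get(session_id, []):
        messages.append(HumanMessage(content=turn["user"]))
        messages.append(AIMessage(content=turn["ai"]))
    return messages
```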
The confirmation bar visible in the UI
What I'd Do Differently
- Streaming responses — right now the UI waits for the full LLM response. Using LangGraph's stream() would make code generation feel instant.
- Code execution sandbox — the natural next step is running the generated code in a Docker container and showing the output.
- Better compound command handling — currently only splits on "and"/"then". A smarter parser using the LLM itself to identify sub-commands would handle more natural phrasing.
Examples (Screenshots)
- Live audio input
- Confirming the file save
- Result stored in the output/ folder
- Recorded audio upload
- Selecting an audio file
- Parsed audio file, with a confirmation prompt before creating the file
- General chat