Introduction
What if you could speak a command and have an AI agent create files, write code, or summarize text on your local machine in real time? That is exactly what I built for the Mem0 AI/ML Generative AI Developer Intern Assignment.
In this article I will walk through the architecture, the models I chose, the challenges I faced, and how I solved them.
What the Agent Does
The Voice AI Agent is a full-stack application that accepts voice input via microphone or audio file upload, converts speech to text using Groq Whisper Large v3, classifies the intent using LLaMA 3.3 70B, executes local tools, and displays the full pipeline in a dark-mode chat UI.
Supported intents:
- create_file - Create a text or markdown file
- write_code - Generate code and save to a file
- summarize - Summarize provided text
- general_chat - Conversational response
- compound - Multiple intents in one command (bonus feature)
Architecture
The system has five main components:
Frontend (index.html) - Vanilla HTML, CSS, and JavaScript. Uses the MediaRecorder API for mic input and supports audio file upload.
Backend (main.py) - FastAPI server exposing POST /process-audio, POST /process-text, GET /history, and GET /benchmark endpoints.
stt.py - Speech-to-Text. Tries local Whisper first, then falls back to Groq Whisper Large v3 (and finally OpenAI).
intent.py - Intent classification using LLaMA 3.3 70B, returns structured JSON with intent, filename, language, content_hint, and confidence.
tools.py - Tool execution handling file creation, code generation, summarization, and general chat. All files are sandboxed inside an output/ folder.
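The output/ sandbox in tools.py can be sketched as a path-resolution check. This is a minimal illustration, not the repo's actual code; `safe_path` is a hypothetical helper name:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside the output/ sandbox.

    Raises ValueError if the path would escape the sandbox,
    e.g. via '..' segments or an absolute path.
    """
    candidate = (OUTPUT_DIR / filename).resolve()
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"Refusing to write outside sandbox: {filename}")
    return candidate
```

Resolving before checking is the key step: `output/../secret.txt` resolves to a path outside the sandbox and is rejected, while nested paths like `notes/todo.md` still pass.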
Pipeline Flow
- Audio blob is sent to the backend
- stt.py transcribes it via Groq Whisper
- intent.py classifies the transcription via LLaMA 3.3 70B
- tools.py executes the correct tool
- The result is sent back and displayed in the chat UI
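The flow above can be sketched as one orchestration function. The real main.py wires these stages through FastAPI endpoints; here each stage is a stub so only the control flow is shown:

```python
# Hypothetical sketch of the STT -> intent -> tool pipeline.
# Each stage is stubbed; the real stages live in stt.py,
# intent.py, and tools.py respectively.

def transcribe(audio_bytes: bytes) -> str:
    return "create a file called notes.txt"  # stub transcription

def classify(text: str) -> dict:
    return {"intent": "create_file", "filename": "notes.txt"}  # stub

def execute(intent: dict) -> dict:
    return {"status": "ok", "file": intent.get("filename")}  # stub

def process_audio(audio_bytes: bytes) -> dict:
    """Run the full pipeline for one request and return every
    intermediate result, so the UI can display each stage."""
    text = transcribe(audio_bytes)
    intent = classify(text)
    result = execute(intent)
    return {"transcription": text, "intent": intent, "result": result}
```

Returning the intermediates (not just the final result) is what lets the chat UI show the whole pipeline per request.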
Models Chosen
STT: Groq Whisper Large v3
My local machine runs Windows with no dedicated GPU. Running whisper-base locally via HuggingFace took 45 to 90 seconds per transcription on CPU, which is completely unusable for real-time interaction.
Groq runs Whisper Large v3 at approximately 100x real-time speed. A 5-second audio clip transcribes in under 200ms.
The code maintains a fallback chain: Local Whisper first, then Groq, then OpenAI. So anyone with a GPU gets the local model automatically.
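The fallback chain boils down to trying backends in order and keeping the first success. A minimal sketch (the backend callables here are stand-ins, not the repo's real functions):

```python
def transcribe_with_fallback(audio_path: str, backends) -> tuple[str, str]:
    """Try each (name, transcribe_fn) pair in order and return the
    provider name plus transcript from the first backend that succeeds.

    backends: list of (name, callable) pairs, e.g. local Whisper,
    then Groq, then OpenAI.
    """
    errors = []
    for name, fn in backends:
        try:
            return name, fn(audio_path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All STT backends failed: " + "; ".join(errors))
```

Collecting per-backend errors before raising makes the "everything failed" case debuggable instead of silent.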
LLM: LLaMA 3.3 70B via Groq
On CPU-only machines, a 70B model via Ollama takes minutes per response. Groq's LLaMA 3.3 70B runs in around 500ms, which feels responsive in the chat UI.
Fallback chain: Ollama if running locally, then Groq LLaMA 3.3 70B, then OpenAI GPT-4o-mini.
Bonus Features Implemented
1. Compound Commands
When a user says "Summarize this text and save it to summary.txt", the LLM detects compound intent and returns both summarize and create_file as sub-intents. The _handle_compound() function chains them together in one request.
2. Human-in-the-Loop
Before any file operation, the backend returns needs_confirmation as true. The frontend shows a confirmation card with Yes and Cancel buttons. The confirmed flag is sent back with the next request.
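The two-phase exchange can be sketched as follows; `handle_request` and the `PENDING` store are hypothetical names for illustration:

```python
PENDING = {}  # request_id -> planned action, awaiting user confirmation

def handle_request(request_id: str, action: dict, confirmed: bool = False) -> dict:
    """First call stores the planned file operation and asks the UI to
    confirm; the second call (confirmed=True) executes it."""
    if not confirmed:
        PENDING[request_id] = action
        return {"needs_confirmation": True, "preview": action}
    planned = PENDING.pop(request_id, action)
    return {"needs_confirmation": False, "result": f"executed {planned['type']}"}
```

Storing the planned action server-side means the confirming request only needs to echo back the id and flag, not the whole command.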
3. Graceful Degradation
Every LLM call is wrapped in try/except. If the audio file is under 5KB (likely silence), the backend rejects it immediately with a clear error message rather than wasting an API call.
4. Session Memory
The SessionMemory class stores the last 10 interactions. The last 3 turns are passed as context to the intent classifier, enabling follow-up commands like "Now save that to a file."
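A `collections.deque` with a max length makes this nearly free to implement. A minimal sketch of what such a class could look like (method names are illustrative):

```python
from collections import deque

class SessionMemory:
    """Keep the last `maxlen` interactions and expose the most recent
    few turns as context for the intent classifier."""

    def __init__(self, maxlen: int = 10):
        self.turns = deque(maxlen=maxlen)  # old turns drop off automatically

    def add(self, user_text: str, agent_result: str) -> None:
        self.turns.append({"user": user_text, "agent": agent_result})

    def recent_context(self, n: int = 3) -> list:
        """Last n turns, oldest first, for the classifier prompt."""
        return list(self.turns)[-n:]
```

The bounded deque is the design choice here: memory can never grow past 10 turns, so a long session cannot bloat the classifier prompt.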
5. Model Benchmarking
Every LLM call logs the provider, duration in ms, prompt length, and response length. The /benchmark endpoint returns aggregated stats and the frontend shows a live benchmark panel.
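One way to capture those four fields on every call is a small decorator feeding a shared log, which /benchmark then aggregates. A sketch under that assumption (the decorator and stats function are hypothetical names):

```python
import time

BENCH_LOG = []  # one entry per LLM call

def benchmarked(provider: str):
    """Decorator that records provider, latency, and payload sizes
    for each wrapped LLM call."""
    def wrap(fn):
        def inner(prompt: str, *args, **kwargs):
            start = time.perf_counter()
            response = fn(prompt, *args, **kwargs)
            BENCH_LOG.append({
                "provider": provider,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "prompt_len": len(prompt),
                "response_len": len(response),
            })
            return response
        return inner
    return wrap

def benchmark_stats() -> dict:
    """Aggregate average latency per provider, as /benchmark might."""
    by_provider = {}
    for entry in BENCH_LOG:
        by_provider.setdefault(entry["provider"], []).append(entry["duration_ms"])
    return {p: sum(d) / len(d) for p, d in by_provider.items()}
```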
Challenges Faced
Challenge 1: Microphone Silent on Windows
The mic was recording only 1.5 to 3.7KB of audio, essentially silence, because Windows had the device's microphone volume set to 0 by default.
Fix: Added a minimum size check in stt.py. Any recording under 5000 bytes is rejected with a clear message telling the user to check their mic volume in Windows Sound Settings.
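The guard amounts to a few lines run before any API call; `validate_audio` is an illustrative name for the check described above:

```python
MIN_AUDIO_BYTES = 5000  # below this, the clip is almost certainly silence

def validate_audio(audio_bytes: bytes) -> None:
    """Reject near-empty recordings before spending an API call."""
    if len(audio_bytes) < MIN_AUDIO_BYTES:
        raise ValueError(
            f"Recording too small ({len(audio_bytes)} bytes). "
            "Check your microphone volume in Windows Sound Settings."
        )
```

A size check is a crude heuristic (a short real utterance could be rejected), but it catches the muted-mic case instantly with a message the user can act on.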
Challenge 2: MediaRecorder MIME Type Differences
Chrome produces audio/webm;codecs=opus, Firefox produces audio/ogg;codecs=opus, and Safari produces audio/mp4. Hardcoding audio/webm caused silent failures on Firefox and Safari.
Fix: Dynamic MIME type selection at runtime using MediaRecorder.isTypeSupported() to pick the best available format.
Challenge 3: LLM JSON Parsing
LLMs sometimes wrap their JSON response in markdown code fences, causing json.loads() to crash.
Fix: Strip markdown fences with regex before parsing, with a secondary regex to extract any JSON object found anywhere in the response string.
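The two-pass parse described above could look something like this sketch:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse JSON from an LLM reply that may be wrapped in ```json
    fences or embedded in surrounding prose."""
    # First pass: strip leading/trailing markdown code fences.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Second pass: grab the first {...} span anywhere in the reply.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```

The second regex is deliberately greedy so nested objects stay intact; it still fails loudly (re-raising) when no JSON is present at all, which keeps the error visible upstream.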
Challenge 4: Ollama Blocking Startup
On machines without Ollama installed, the connection attempt hangs for 30 seconds before timing out.
Fix: Documented clearly in README to set LLM_PROVIDER=groq in the .env file if not using Ollama. The fallback triggers immediately on ConnectionRefusedError.
Model Benchmarking Results
Tested on Windows 11, Intel i5, no GPU:
| Model | Task | Avg Response Time |
|---|---|---|
| Groq Whisper Large v3 | STT 5s audio | 180ms |
| Groq LLaMA 3.3 70B | Intent classification | 420ms |
| Groq LLaMA 3.3 70B | Code generation | 1200ms |
| Local Whisper Base CPU | STT 5s audio | 65000ms |
| Ollama LLaMA 3.2 CPU | Intent classification | 45000ms |
Per the table, the Groq API served much larger models roughly 100 to 360 times faster than the smaller local models ran on CPU. The tradeoff is needing an internet connection and an API key.
Tech Stack Summary
| Layer | Technology |
|---|---|
| Backend | FastAPI and Uvicorn |
| STT | Groq Whisper Large v3 |
| LLM | Groq LLaMA 3.3 70B |
| Frontend | Vanilla HTML CSS JS |
| Memory | In-session Python store |
| Safety | output/ sandbox folder |
How to Run
Clone the repo and install dependencies:
pip install fastapi uvicorn python-dotenv groq python-multipart
Create a .env file with:
GROQ_API_KEY=your_key_here
LLM_PROVIDER=groq
STT_PROVIDER=groq
Run the server:
python main.py
Then open http://localhost:8000 in your browser. Get a free Groq API key at console.groq.com.
Conclusion
Building this agent taught me a lot about practical tradeoffs between local and cloud-based AI inference, browser audio APIs, and designing resilient fallback systems. The most valuable lesson was that graceful degradation matters more than perfection. A system that fails clearly and recovers cleanly is far better than one that silently produces wrong results.
Built as part of the Mem0 AI/ML Generative AI Developer Intern Assignment.