**
Article:
**
I recently applied for an internship at Mem0 AI, a memory layer startup in San Francisco. As part of their selection process they gave me a technical assignment — build a voice-controlled AI agent that listens to commands, understands intent, and takes real actions on the computer.
I am fairly new to AI so this was a good challenge for me. Here is what I built and how.
What it does
You speak a command, and the agent:
Converts your speech to text
Understands what you want
Takes the actual action — creates files, writes code, summarizes text, or chats
Shows everything in a browser UI
Architecture
Four files, each doing one job:
Voice → stt.py → intent.py → tools.py → app.py
stt.py — sends audio to Groq Whisper, gets back text
intent.py — sends text to LLaMA 3.3, gets back JSON telling us what the user wants
tools.py — executes the action based on intent
app.py — shows everything in a Gradio UI
**
The JSON trick
**
The most interesting part was forcing LLaMA to reply only in JSON format. This way I can reliably extract things like filename and programming language without parsing messy natural language.
json{
"intent": "write_code",
"filename": "bubble_sort.py",
"language": "Python"
}
**
Why Groq instead of local models
**
My laptop runs on CPU only. Whisper locally takes 30-60 seconds per request which makes the agent unusable. Groq gives free API access with responses under 1 second. Swapping to Ollama later needs just one line change.
**
Challenges
**
Gradio version conflicts took the most debugging time
The LLaMA model I started with got decommissioned mid-development
Getting consistent JSON output from the LLM needed a very explicit system prompt
I tested both whisper-large-v3 and whisper-large-v3-turbo for speech to text. The turbo version was slightly faster but slightly less accurate on Indian accents. I went with the standard version for better accuracy.
**
What I learned
**
How to build an end to end AI pipeline in Python
Using LLMs for structured JSON output
How Whisper speech to text works
Building UIs with Gradio
GitHub: https://github.com/Abhayy-Raj/voice-agent
Youtube Link: https://youtu.be/Ii8TeJdH27w
Top comments (0)