How I Built a Voice-Controlled AI Agent in Python

#ai #python #gradio #beginners

Article:

I recently applied for an internship at Mem0 AI, a memory layer startup in San Francisco. As part of their selection process they gave me a technical assignment — build a voice-controlled AI agent that listens to commands, understands intent, and takes real actions on the computer.
I am fairly new to AI so this was a good challenge for me. Here is what I built and how.

What it does
You speak a command, and the agent:

Converts your speech to text
Understands what you want
Takes the actual action — creates files, writes code, summarizes text, or chats
Shows everything in a browser UI

Architecture

Four files, each doing one job:
Voice → stt.py → intent.py → tools.py → app.py
stt.py — sends audio to Groq Whisper, gets back text
intent.py — sends text to LLaMA 3.3, gets back JSON telling us what the user wants
tools.py — executes the action based on intent
app.py — shows everything in a Gradio UI

The JSON trick

**
The most interesting part was forcing LLaMA to reply only in JSON format. This way I can reliably extract things like filename and programming language without parsing messy natural language.
json{
"intent": "write_code",
"filename": "bubble_sort.py",
"language": "Python"
}

Why Groq instead of local models

**
My laptop runs on CPU only. Whisper locally takes 30-60 seconds per request which makes the agent unusable. Groq gives free API access with responses under 1 second. Swapping to Ollama later needs just one line change.

Challenges

Gradio version conflicts took the most debugging time
The LLaMA model I started with got decommissioned mid-development
Getting consistent JSON output from the LLM needed a very explicit system prompt

I tested both whisper-large-v3 and whisper-large-v3-turbo for speech to text. The turbo version was slightly faster but slightly less accurate on Indian accents. I went with the standard version for better accuracy.