Have you ever wanted to just talk to your computer and have it actually do something — create files, write code, or answer questions? That's exactly what I built for my Mem0 AI internship assignment: a fully working Voice-Controlled AI Agent.
In this article, I'll walk you through what I built, the architecture behind it, why I chose the tools I did, the challenges I faced, and what I learned along the way.
🎯 What I Built
A voice-controlled AI agent that:
Accepts audio from your microphone or an uploaded audio file
Transcribes your speech to text using Whisper
Detects your intent — do you want to create a file? write code? summarize something? or just chat?
Executes the action on your local machine
Shows all results in a clean web UI
Here's a quick example of the full flow:
You say: "Write Python code for a retry function"
The agent transcribes it → detects intent as Write Code → generates the Python code using an LLM → saves it to output/retry.py → shows you the result in the UI.
All of this happens in under 5 seconds.
🏗️ Architecture
The project is split into 4 clean modules:
```
User speaks
        ↓
Audio file (.wav / .mp3)
        ↓
[stt.py] → Groq Whisper Large V3 → Transcribed Text
        ↓
[intent.py] → LLaMA 3.3-70B → Intent Classification (JSON)
        ↓
[tools.py] → Tool Router
   ├── create_file   → Creates empty file in output/
   ├── write_code    → Generates & saves code to output/
   ├── summarize     → Summarizes topic, saves to output/summary.txt
   └── general_chat  → LLM conversation response
        ↓
[app.py] → Gradio UI → Displays transcription, intent, action, output
        ↓
Session Memory → Tracks all actions in the session
```
Each module has one job and does it well. This separation makes the code easy to debug, easy to extend, and easy to understand.
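That separation boils down to a thin orchestration layer that app.py calls into. Here's a minimal sketch of how the modules could chain together — the function names (`transcribe`, `classify_intent`, `run_tool`) are illustrative stand-ins, not the exact signatures from the repo:

```python
# Minimal sketch of the pipeline glue, with stubbed-out modules.
# In the real project these would live in stt.py, intent.py, and tools.py.

def transcribe(audio_path: str) -> str:
    """Stub for stt.py: audio file -> text (Whisper in practice)."""
    return "write python code for a retry function"

def classify_intent(text: str) -> dict:
    """Stub for intent.py: text -> {"intent": ..., "details": ...}."""
    return {"intent": "write_code", "details": text}

def run_tool(intent: dict) -> str:
    """Stub for tools.py: dispatch on the classified intent."""
    return f"handled {intent['intent']}"

def handle_audio(audio_path: str) -> dict:
    """One request through the whole pipeline, as app.py would call it."""
    text = transcribe(audio_path)
    intent = classify_intent(text)
    output = run_tool(intent)
    return {"transcription": text, "intent": intent["intent"], "output": output}
```

Because each stage only passes plain data (a string, then a dict) to the next, any stage can be swapped out or tested in isolation.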
🛠️ Tech Stack and Why I Chose It
| Component | Tool | Reason |
|---|---|---|
| UI | Gradio | Fastest way to build a professional AI UI in Python |
| Speech-to-Text | Groq Whisper Large V3 | Free, fast, no GPU needed |
| LLM | Groq LLaMA 3.3-70B | Free API, blazing fast, excellent quality |
| File Operations | Python pathlib | Built-in, no extra dependencies |
| Environment | python-dotenv | Secure API key management |
Why Groq Instead of a Local Model?
The assignment mentioned using a local model via Ollama or LM Studio. I evaluated this option but chose Groq's API instead for three strong reasons:
- No GPU required. Running Whisper Large V3 locally needs a capable GPU, and a 70B-parameter LLaMA needs tens of gigabytes of VRAM even when quantized. Most student laptops have neither. Groq runs everything on their hardware.
- Speed. Groq uses custom LPU (Language Processing Unit) chips that are genuinely faster than most local GPU setups. Responses come back in 1-2 seconds.
- Free tier. Groq's free tier is generous enough to build and demo this entire project without paying anything. This is documented in the README as a hardware workaround, as the assignment allows.
🎯 The 4 Supported Intents
The heart of the agent is the intent classifier. I used a structured LLM prompt that forces the model to return clean JSON every time:
```python
prompt = f"""
Classify this command into ONE of:
create_file, write_code, summarize, general_chat

Command: "{text}"

Respond ONLY with JSON:
{{
  "intent": "write_code",
  "details": "Python retry function",
  "filename": "retry.py"
}}
"""
```
This gives the agent four distinct capabilities:
📁 Create File — "Create a text file called meeting notes"
Creates an empty file instantly in the output/ folder.
💻 Write Code — "Write JavaScript code for a todo list"
Generates complete, commented, production-quality code and saves it to a file.
📋 Summarize — "Summarize the benefits of neural networks"
Produces a structured summary with bullet points and saves it to output/summary.txt.
💬 General Chat — "What is the difference between AI and ML?"
Responds conversationally to any question.
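Once the intent is classified, routing it to the right tool is a simple dispatch table. Here's a hedged sketch of that pattern — the handlers are stub lambdas standing in for the real functions in tools.py:

```python
def route(intent: dict) -> str:
    """Dispatch a classified intent to its handler. Unknown or missing
    intents fall back to general_chat so the agent never crashes."""
    handlers = {
        # Stubs: the real handlers create files, generate code, etc.
        "create_file": lambda d: f"created empty file for: {d}",
        "write_code": lambda d: f"generated code for: {d}",
        "summarize": lambda d: f"summarized: {d}",
        "general_chat": lambda d: f"chat reply to: {d}",
    }
    handler = handlers.get(intent.get("intent"), handlers["general_chat"])
    return handler(intent.get("details", ""))
```

Adding a fifth capability would mean writing one new handler and adding one dictionary entry — the classifier prompt and the router are the only two places that need to know the intent exists.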
🧠 Bonus: Session Memory
Since I was building this for Mem0 AI — a company that builds memory layers for AI agents — I made sure to implement the memory bonus feature.
Every action the agent takes gets stored in a session history list:
```python
session_history.append({
    "transcription": transcription,
    "intent": intent_display,
    "action": action_taken,
    "output": final_output,
})
```
This is displayed in the UI as a running log of everything that happened in the session. It's a simple implementation, but it demonstrates the core concept that Mem0 is built around: agents need memory to be truly useful.
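Turning that list into the UI log is just string formatting. A minimal sketch (the exact rendering in the repo may differ — this `format_history` helper is illustrative):

```python
def format_history(session_history: list[dict]) -> str:
    """Render the session log as markdown for display in the Gradio UI."""
    if not session_history:
        return "No actions yet this session."
    lines = []
    for i, entry in enumerate(session_history, start=1):
        lines.append(f"**{i}.** \"{entry['transcription']}\" → "
                     f"{entry['intent']} → {entry['action']}")
    return "\n".join(lines)
```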
😤 Challenges I Faced
Challenge 1: The LLM Model Was Decommissioned
Mid-development, I got this error:
Error code: 400 - The model 'llama3-70b-8192' has been decommissioned
Groq had retired the model I was using. The fix was simple — update to llama-3.3-70b-versatile — but it taught me an important lesson: always check API deprecation notices and avoid hardcoding model names without a fallback plan.
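One way to build in that fallback plan is to keep model names in an ordered list and try them in turn. This is a sketch of the pattern, not code from the repo — `call_model` stands in for the actual Groq API call, and the fallback model name is illustrative:

```python
MODEL_CANDIDATES = [
    "llama-3.3-70b-versatile",  # preferred model
    "llama-3.1-8b-instant",     # smaller fallback (illustrative choice)
]

def complete_with_fallback(prompt, call_model, candidates=MODEL_CANDIDATES):
    """Try each candidate model until one succeeds.

    If a model has been decommissioned (the API raises an error),
    move on to the next candidate instead of crashing.
    """
    last_error = None
    for model in candidates:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # e.g. a 400 "model decommissioned" error
            last_error = exc
    raise RuntimeError("All candidate models failed") from last_error
```

With this in place, a retired model degrades the app to a backup model instead of taking it down mid-demo.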
Challenge 2: Getting Clean JSON from the LLM
The intent classifier needs to return valid JSON every time. But LLMs sometimes wrap their response in a markdown code fence like:

````
```json
{"intent": "write_code"}
```
````
This breaks json.loads(). My fix was to strip markdown before parsing:
```python
raw_response = raw_response.replace("```json", "").replace("```", "").strip()
result = json.loads(raw_response)
```
And I wrapped everything in a try/except that falls back to general_chat if parsing fails — so the app never crashes.
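Those two fixes combine naturally into one defensive parser. Here's a sketch of what that could look like (the helper name `parse_intent` is mine, not necessarily the repo's):

```python
import json

def parse_intent(raw_response: str) -> dict:
    """Strip markdown fences and parse the LLM's JSON reply.

    Any malformed reply falls back to general_chat, so a bad
    model response degrades to a chat answer instead of a crash.
    """
    cleaned = raw_response.replace("```json", "").replace("```", "").strip()
    try:
        result = json.loads(cleaned)
        if "intent" not in result:
            raise ValueError("missing intent key")
        return result
    except (json.JSONDecodeError, ValueError):
        return {"intent": "general_chat", "details": raw_response}
```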
Challenge 3: Keeping File Operations Safe
The assignment required that all file creation be restricted to an output/ folder so the agent can't accidentally overwrite system files. I handled this with Python's pathlib:
```python
from pathlib import Path

OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)
filepath = OUTPUT_DIR / filename  # Always inside output/
```
Combined with sanitizing the filename itself (stripping any directory components), this keeps every write inside output/ no matter what the user says — a raw name like ../notes.txt would otherwise escape the folder when joined.
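Note that joining onto output/ alone doesn't stop path traversal: a name like ../../etc/passwd would still resolve outside the folder. A small sanitizer closes that gap — this is a suggested hardening sketch, not necessarily the repo's exact code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename: str) -> Path:
    """Reduce any requested filename to a bare name inside output/,
    so names like '../../etc/passwd' cannot escape the folder."""
    bare = Path(filename).name  # drops all directory components
    if not bare:
        raise ValueError(f"Invalid filename: {filename!r}")
    return OUTPUT_DIR / bare
```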
✅ What I Learned
- Prompt engineering matters more than model choice. A well-structured prompt with clear instructions and a defined output format consistently outperforms a bigger model with a vague prompt.
- API-based models are practical for hackathons and assignments. The assignment encouraged local models, but for rapid prototyping and demo-quality results, a well-chosen API (especially a free one like Groq) is often the better tool.
- Modular code saves enormous time. Because each module (stt.py, intent.py, tools.py) had a single responsibility, debugging was fast. When the model was decommissioned, I only had to update 2 files.
- Memory is a genuinely hard problem. Implementing even the simple session history feature made me appreciate why Mem0 exists as a product. True persistent memory — across sessions, across users, with context retrieval — is a deep engineering challenge.
🔗 Links
GitHub Repository: https://github.com/Amratanshu-d/voice-ai-agent
Video Demo: https://youtu.be/fbXTjaXM-oI
Assignment by: Mem0 AI — MLOps and AI Infra Internship
Built with ❤️ for the Mem0 AI internship assignment. If you're building AI agents and want to add persistent memory to them, check out mem0.ai.