Have you ever wanted to just talk to your computer and have it actually do something — create files, write code, or answer questions? That's exactly what I built for my Mem0 AI internship assignment: a fully working Voice-Controlled AI Agent.
In this article, I'll walk you through what I built, the architecture behind it, why I chose the tools I did, the challenges I faced, and what I learned along the way.
🎯 What I Built
A voice-controlled AI agent that:
Accepts audio from your microphone or an uploaded audio file
Transcribes your speech to text using Whisper
Detects your intent — do you want to create a file? write code? summarize something? or just chat?
Executes the action on your local machine
Shows all results in a clean web UI
Here's a quick example of the full flow:
You say: "Write Python code for a retry function"
The agent transcribes it → detects intent as Write Code → generates the Python code using an LLM → saves it to output/retry.py → shows you the result in the UI.
All of this happens in under 5 seconds.
🏗️ Architecture
The project is split into 4 clean modules:
```
User speaks
        ↓
Audio file (.wav / .mp3)
        ↓
[stt.py] → Groq Whisper Large V3 → Transcribed Text
        ↓
[intent.py] → LLaMA 3.3-70B → Intent Classification (JSON)
        ↓
[tools.py] → Tool Router
   ├── create_file   → Creates empty file in output/
   ├── write_code    → Generates & saves code to output/
   ├── summarize     → Summarizes topic, saves to output/summary.txt
   └── general_chat  → LLM conversation response
        ↓
[app.py] → Gradio UI → Displays transcription, intent, action, output
        ↓
Session Memory → Tracks all actions in the session
```
Each module has one job and does it well. This separation makes the code easy to debug, easy to extend, and easy to understand.
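That separation boils down to a thin orchestration layer that app.py calls into. Here's a minimal sketch of how the modules could chain together — the function names (`transcribe`, `classify_intent`, `run_tool`) are illustrative stand-ins, not the exact signatures from the repo:

```python
# Minimal sketch of the pipeline glue, with stubbed-out modules.
# In the real project these would live in stt.py, intent.py, and tools.py.

def transcribe(audio_path: str) -> str:
    """Stub for stt.py: audio file -> text (Whisper in practice)."""
    return "write python code for a retry function"

def classify_intent(text: str) -> dict:
    """Stub for intent.py: text -> {"intent": ..., "details": ...}."""
    return {"intent": "write_code", "details": text}

def run_tool(intent: dict) -> str:
    """Stub for tools.py: dispatch on the classified intent."""
    return f"handled {intent['intent']}"

def handle_audio(audio_path: str) -> dict:
    """One request through the whole pipeline, as app.py would call it."""
    text = transcribe(audio_path)
    intent = classify_intent(text)
    output = run_tool(intent)
    return {"transcription": text, "intent": intent["intent"], "output": output}
```

Because each stage only passes plain data (a string, then a dict) to the next, any stage can be swapped out or tested in isolation.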
🛠️ Tech Stack and Why I Chose It
| Component | Tool | Reason |
|---|---|---|
| UI | Gradio | Fastest way to build a professional AI UI in Python |
| Speech-to-Text | Groq Whisper Large V3 | Free, fast, no GPU needed |
| LLM | Groq LLaMA 3.3-70B | Free API, blazing fast, excellent quality |
| File Operations | Python pathlib | Built-in, no extra dependencies |
| Environment | python-dotenv | Secure API key management |
Why Groq Instead of a Local Model?
The assignment mentioned using a local model via Ollama or LM Studio. I evaluated this option but chose Groq's API instead for three strong reasons:
- No GPU required. Running Whisper Large V3 locally needs a capable GPU, and a 70B-parameter LLaMA needs tens of gigabytes of VRAM even when quantized. Most student laptops have neither. Groq runs everything on their hardware.
- Speed. Groq uses custom LPU (Language Processing Unit) chips that are genuinely faster than most local GPU setups. Responses come back in 1-2 seconds.
- Free tier. Groq's free tier is generous enough to build and demo this entire project without paying anything. This is documented in the README as a hardware workaround, as the assignment allows.
🎯 The 4 Supported Intents
The heart of the agent is the intent classifier. I used a structured LLM prompt that forces the model to return clean JSON every time:
```python
prompt = f"""
Classify this command into ONE of:
create_file, write_code, summarize, general_chat

Command: "{text}"

Respond ONLY with JSON:
{{
  "intent": "write_code",
  "details": "Python retry function",
  "filename": "retry.py"
}}
"""
```
This gives the agent four distinct capabilities:
📁 Create File — "Create a text file called meeting notes"
Creates an empty file instantly in the output/ folder.
💻 Write Code — "Write JavaScript code for a todo list"
Generates complete, commented, production-quality code and saves it to a file.
📋 Summarize — "Summarize the benefits of neural networks"
Produces a structured summary with bullet points and saves it to output/summary.txt.
💬 General Chat — "What is the difference between AI and ML?"
Responds conversationally to any question.
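Once the intent is classified, routing it to the right tool is a simple dispatch table. Here's a hedged sketch of that pattern — the handlers are stub lambdas standing in for the real functions in tools.py:

```python
def route(intent: dict) -> str:
    """Dispatch a classified intent to its handler. Unknown or missing
    intents fall back to general_chat so the agent never crashes."""
    handlers = {
        # Stubs: the real handlers create files, generate code, etc.
        "create_file": lambda d: f"created empty file for: {d}",
        "write_code": lambda d: f"generated code for: {d}",
        "summarize": lambda d: f"summarized: {d}",
        "general_chat": lambda d: f"chat reply to: {d}",
    }
    handler = handlers.get(intent.get("intent"), handlers["general_chat"])
    return handler(intent.get("details", ""))
```

Adding a fifth capability would mean writing one new handler and adding one dictionary entry — the classifier prompt and the router are the only two places that need to know the intent exists.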
🧠 Bonus: Session Memory
Since I was building this for Mem0 AI — a company that builds memory layers for AI agents — I made sure to implement the memory bonus feature.
Every action the agent takes gets stored in a session history list:
```python
session_history.append({
    "transcription": transcription,
    "intent": intent_display,
    "action": action_taken,
    "output": final_output,
})
```
This is displayed in the UI as a running log of everything that happened in the session. It's a simple implementation, but it demonstrates the core concept that Mem0 is built around: agents need memory to be truly useful.
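Turning that list into the UI log is just string formatting. A minimal sketch (the exact rendering in the repo may differ — this `format_history` helper is illustrative):

```python
def format_history(session_history: list[dict]) -> str:
    """Render the session log as markdown for display in the Gradio UI."""
    if not session_history:
        return "No actions yet this session."
    lines = []
    for i, entry in enumerate(session_history, start=1):
        lines.append(f"**{i}.** \"{entry['transcription']}\" → "
                     f"{entry['intent']} → {entry['action']}")
    return "\n".join(lines)
```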
😤 Challenges I Faced
Challenge 1: The LLM Model Was Decommissioned
Mid-development, I got this error:
Error code: 400 - The model 'llama3-70b-8192' has been decommissioned
Groq had retired the model I was using. The fix was simple — update to llama-3.3-70b-versatile — but it taught me an important lesson: always check API deprecation notices and avoid hardcoding model names without a fallback plan.
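One way to build in that fallback plan is to keep model names in an ordered list and try them in turn. This is a sketch of the pattern, not code from the repo — `call_model` stands in for the actual Groq API call, and the fallback model name is illustrative:

```python
MODEL_CANDIDATES = [
    "llama-3.3-70b-versatile",  # preferred model
    "llama-3.1-8b-instant",     # smaller fallback (illustrative choice)
]

def complete_with_fallback(prompt, call_model, candidates=MODEL_CANDIDATES):
    """Try each candidate model until one succeeds.

    If a model has been decommissioned (the API raises an error),
    move on to the next candidate instead of crashing.
    """
    last_error = None
    for model in candidates:
        try:
            return call_model(model, prompt)
        except Exception as exc:  # e.g. a 400 "model decommissioned" error
            last_error = exc
    raise RuntimeError("All candidate models failed") from last_error
```

With this in place, a retired model degrades the app to a backup model instead of taking it down mid-demo.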
Challenge 2: Getting Clean JSON from the LLM
The intent classifier needs to return valid JSON every time. But LLMs sometimes wrap their response in a markdown code fence like:

````
```json
{"intent": "write_code"}
```
````
This breaks json.loads(). My fix was to strip markdown before parsing:
```python
raw_response = raw_response.replace("```json", "").replace("```", "").strip()
result = json.loads(raw_response)
```
And I wrapped everything in a try/except that falls back to general_chat if parsing fails — so the app never crashes.
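Those two fixes combine naturally into one defensive parser. Here's a sketch of what that could look like (the helper name `parse_intent` is mine, not necessarily the repo's):

```python
import json

def parse_intent(raw_response: str) -> dict:
    """Strip markdown fences and parse the LLM's JSON reply.

    Any malformed reply falls back to general_chat, so a bad
    model response degrades to a chat answer instead of a crash.
    """
    cleaned = raw_response.replace("```json", "").replace("```", "").strip()
    try:
        result = json.loads(cleaned)
        if "intent" not in result:
            raise ValueError("missing intent key")
        return result
    except (json.JSONDecodeError, ValueError):
        return {"intent": "general_chat", "details": raw_response}
```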
Challenge 3: Keeping File Operations Safe
The assignment required that all file creation be restricted to an output/ folder so the agent can't accidentally overwrite system files. I handled this with Python's pathlib:
```python
from pathlib import Path

OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)
filepath = OUTPUT_DIR / filename  # Always inside output/
```
Combined with sanitizing the filename itself (stripping any directory components), this keeps every write inside output/ no matter what the user says — a raw name like ../notes.txt would otherwise escape the folder when joined.
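Note that joining onto output/ alone doesn't stop path traversal: a name like ../../etc/passwd would still resolve outside the folder. A small sanitizer closes that gap — this is a suggested hardening sketch, not necessarily the repo's exact code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename: str) -> Path:
    """Reduce any requested filename to a bare name inside output/,
    so names like '../../etc/passwd' cannot escape the folder."""
    bare = Path(filename).name  # drops all directory components
    if not bare:
        raise ValueError(f"Invalid filename: {filename!r}")
    return OUTPUT_DIR / bare
```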
✅ What I Learned
- Prompt engineering matters more than model choice. A well-structured prompt with clear instructions and a defined output format consistently outperforms a bigger model with a vague prompt.
- API-based models are practical for hackathons and assignments. The assignment encouraged local models, but for rapid prototyping and demo-quality results, a well-chosen API (especially a free one like Groq) is often the better tool.
- Modular code saves enormous time. Because each module (stt.py, intent.py, tools.py) had a single responsibility, debugging was fast. When the model was decommissioned, I only had to update 2 files.
- Memory is a genuinely hard problem. Implementing even the simple session history feature made me appreciate why Mem0 exists as a product. True persistent memory — across sessions, across users, with context retrieval — is a deep engineering challenge.
🔗 Links
GitHub Repository: https://github.com/Amratanshu-d/voice-ai-agent
Video Demo: https://youtu.be/fbXTjaXM-oI
Assignment by: Mem0 AI — MLOps and AI Infra Internship
Built with ❤️ for the Mem0 AI internship assignment. If you're building AI agents and want to add persistent memory to them, check out mem0.ai.