What I Built
A voice-controlled AI agent that: records your voice → transcribes it →
understands your intent → executes the right action → shows everything in a UI.
Built for the Mem0 AI Internship assignment.
Architecture
Voice Input → Groq Whisper (STT) → LLaMA 3.3 70B (Intent) → Tool Executor → Gradio UI
Model Choices
Speech-to-Text: Whisper-large-v3 via Groq
I initially planned to run Whisper locally using HuggingFace, but on a CPU-only
Windows machine this was extremely slow (30+ seconds per clip). Groq's API gives
sub-second transcription for free, making the user experience much better.
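A minimal sketch of the transcription step, assuming the `groq` Python SDK and a `GROQ_API_KEY` environment variable (the helper name is mine, not from the project):

```python
import os

def transcribe(audio_path: str) -> str:
    """Send an audio clip to Groq's hosted Whisper and return the text."""
    from groq import Groq  # pip install groq

    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    # Open in binary mode and let Groq detect the audio format.
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(os.path.basename(audio_path), f.read()),
            model="whisper-large-v3",
        )
    return result.text
```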
LLM: LLaMA 3.3 70B via Groq
Running a 70B model locally requires 40GB+ VRAM. Groq's free tier handles this
instantly. The model is prompted to return structured JSON for reliable intent parsing.
The Intent Classification Trick
The key insight was prompting the LLM to return ONLY JSON — no explanation,
no markdown. Combined with low temperature (0.1), this gives very consistent results.
It also supports compound commands: "Write a retry function AND save it"
correctly returns intents: ["write_code", "create_file"].
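The trick above can be sketched roughly as follows. The prompt wording and JSON schema are illustrative guesses, and `parse_intents`/`classify` are hypothetical helper names; `llama-3.3-70b-versatile` is Groq's model ID for LLaMA 3.3 70B:

```python
import json
from typing import List

# Illustrative system prompt: force a bare JSON object, nothing else.
SYSTEM_PROMPT = (
    "You are an intent classifier. Respond with ONLY a JSON object, "
    'no markdown, no explanation. Schema: {"intents": [...], "args": {...}}'
)

def parse_intents(raw: str) -> List[str]:
    """Extract the intent list from the model's JSON-only reply."""
    return json.loads(raw)["intents"]

def classify(text: str) -> List[str]:
    """Ask LLaMA 3.3 70B via Groq; temperature 0.1 keeps output consistent."""
    from groq import Groq  # pip install groq; reads GROQ_API_KEY from the env

    client = Groq()
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        temperature=0.1,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
    )
    return parse_intents(resp.choices[0].message.content)

# A compound command maps to multiple intents, e.g.:
# classify("Write a retry function AND save it")
#   should yield ["write_code", "create_file"]
```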
Challenges
- JSON parsing — LLMs sometimes wrap JSON in markdown fences. Fixed with a regex
  that strips the fences before parsing: re.sub(r"```json\s*|```\s*", "", raw)
- Audio formats — Gradio passes temp file paths. Opening with "rb" and letting
  Groq handle format detection solved this.
- Safety — Used os.path.basename() to prevent path traversal attacks
  when creating files.
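The fence-stripping and path-safety fixes can be sketched together (helper names are mine; the sample strings are made up for illustration):

```python
import json
import os
import re

def strip_fences(raw: str) -> str:
    """Remove the ```json ... ``` markdown fences LLMs sometimes add."""
    return re.sub(r"```json\s*|```\s*", "", raw).strip()

def safe_filename(requested: str) -> str:
    """Drop any directory components so '../../etc/passwd' can't escape."""
    return os.path.basename(requested)

fenced = '```json\n{"intents": ["create_file"]}\n```'
print(json.loads(strip_fences(fenced)))   # {'intents': ['create_file']}
print(safe_filename("../../etc/passwd"))  # passwd
```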
What I'd Add With More Time
- Local model support via Ollama (offline mode)
- Wake word detection
- Web search tool integration