## What I Built
I built a voice-controlled AI agent that can take spoken input and convert it into meaningful actions.
The system:

- Accepts input via a microphone or an audio file
- Converts speech to text using Whisper (via the Groq API)
- Uses an LLM to understand what the user wants
- Executes the appropriate action locally — like creating files, generating code, summarizing content, or responding conversationally
This project was developed as part of the Mem0 AI/ML Generative AI Developer Intern assignment.
Live demo: [your streamlit URL here]
GitHub: [your github URL here]
## Architecture Overview
The application follows a simple but effective pipeline:
Audio Input → Speech-to-Text → Intent Detection → Action Execution → UI Output
Tech stack:

- Streamlit — for building the UI quickly
- Groq API — Whisper (speech-to-text) + LLM (intent understanding)
- faster-whisper — local fallback for transcription
- Python — core logic and tool execution
## Intent Classification Approach
Instead of training a separate model, I used prompt engineering to guide the LLM to return structured outputs.
The model is instructed to respond strictly in JSON, for example:

`{"intents": ["write_code"], "params": {"filename": "sort.py", "language": "python", "description": "bubble sort function"}}`
This makes the system:

- predictable
- easy to parse
- extensible for multiple actions
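Parsing that structured reply can be sketched as below. The fallback-to-chat behaviour on malformed output is an illustrative assumption, not necessarily what the project does.

```python
import json

# Minimal sketch of parsing the LLM's JSON reply (fallback rules assumed).
def parse_intents(raw: str) -> dict:
    """Parse the model's JSON output, degrading to general_chat on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"intents": ["general_chat"], "params": {}}
    data.setdefault("intents", ["general_chat"])  # tolerate missing keys
    data.setdefault("params", {})
    return data
```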
Supported intents include:

- create_file — creates a file in a safe directory
- write_code — generates and saves code
- summarize — produces concise summaries
- general_chat — handles normal conversations
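A natural way to wire intent names to actions is a dispatch table; the handler bodies below are placeholders, not the project's real implementations.

```python
# Illustrative dispatch table mapping intent names to handlers.
# Handler bodies are stubs for demonstration only.
def create_file(params):  return f"created {params.get('filename')}"
def write_code(params):   return f"wrote code to {params.get('filename')}"
def summarize(params):    return "summary"
def general_chat(params): return "chat reply"

HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def dispatch(intent: str, params: dict) -> str:
    # Unknown intents degrade to conversation rather than raising.
    return HANDLERS.get(intent, general_chat)(params)
```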
## Handling Multiple Actions

The system supports compound commands. For example:

"Write a retry function and save it as retry.py"

This resolves into multiple intents (write_code + create_file), which the system executes sequentially.
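The sequential execution of a compound plan can be sketched like this; `dispatch` stands in for whatever maps an intent name plus params to a result in the real code.

```python
# Sketch of running a compound command in order (names are hypothetical).
def run_plan(plan: dict, dispatch) -> list:
    """Execute each detected intent sequentially, collecting the results."""
    results = []
    for intent in plan["intents"]:   # e.g. ["write_code", "create_file"]
        results.append(dispatch(intent, plan["params"]))
    return results
```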
## Safety Measures

All file operations are restricted to a controlled output/ directory. To prevent misuse:

- Filenames are sanitized
- Path traversal attempts (like ../../) are blocked

This keeps the agent from touching anything outside its sandbox.
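A minimal sanitizer in the spirit described above might look like this; the exact rules in the project may differ, and the `output/` root is taken from the text.

```python
import os
import re

OUTPUT_DIR = "output"  # sandbox root, per the post

def safe_path(filename: str) -> str:
    """Map an untrusted filename to a path inside OUTPUT_DIR, or raise."""
    # Drop any directory components, then whitelist safe characters.
    name = os.path.basename(filename)
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name) or "untitled.txt"
    path = os.path.abspath(os.path.join(OUTPUT_DIR, name))
    # Belt and braces: refuse anything that still escapes the sandbox.
    if not path.startswith(os.path.abspath(OUTPUT_DIR) + os.sep):
        raise ValueError("path traversal blocked")
    return path
```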
## Challenges I Faced

- **Missing file contents:** some project files were empty after setup, which caused import errors. I had to manually verify and restore each file.
- **Model changes during development:** the Groq model llama3-8b-8192 was deprecated, so I switched to llama-3.3-70b-versatile.
- **Incorrect language detection:** the local Whisper model sometimes transcribed in the wrong language. Using Groq's hosted Whisper resolved this.
- **Git issues with the virtual environment:** I accidentally committed .venv, which caused large commits. Fixed by adding .venv to .gitignore and removing it from tracking.
## Groq vs Local Ollama — Speed Comparison
| Backend | Transcription | Intent Classification |
|---|---|---|
| faster-whisper (local, base model) | ~15-30 seconds | N/A |
| Ollama llama3 (local) | N/A | ~8-12 seconds |
| Groq API | ~1 second | ~1-2 seconds |
Groq wins by a huge margin for development and demos. For a fully offline/private deployment, Ollama + faster-whisper is the way to go.
## Graceful Degradation
If the LLM is unavailable, the app falls back to a rule-based intent classifier using keyword matching. So even without an API key or internet connection, basic intents like create_file and write_code still work.
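A keyword-matching fallback of this kind can be sketched as follows; the keyword lists here are illustrative guesses, not the project's actual rules.

```python
# Hypothetical rule-based fallback classifier (keyword lists are assumed).
KEYWORDS = {
    "write_code": ["write code", "function", "script", "program"],
    "create_file": ["create file", "save it as", "make a file", "new file"],
    "summarize": ["summarize", "summary", "tl;dr"],
}

def fallback_classify(text: str) -> dict:
    """Match keywords against the transcript; default to general_chat."""
    text = text.lower()
    intents = [intent for intent, words in KEYWORDS.items()
               if any(w in text for w in words)]
    return {"intents": intents or ["general_chat"], "params": {}}
```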
## How to Run It Yourself
1. Clone the repo: `git clone https://github.com/Sanvi09Kulkarni/voice-AI-agent`
2. Install dependencies: `pip install -r requirements.txt`
3. Get a free Groq API key at console.groq.com
4. Run: `streamlit run app.py`
5. Select Groq API in both dropdowns, paste your key, and start talking
## What I Learned
- Structured JSON prompting is more reliable than free-form LLM output for classification tasks
- Groq's hosted Whisper is dramatically faster than running Whisper locally on CPU
- Streamlit makes it surprisingly easy to build production-looking AI apps quickly
- Always add .venv to .gitignore before your first commit
Thanks for reading! Check out the live demo and drop a star on GitHub if you found this useful.