## What I Built
I built a voice-controlled AI agent that can take spoken input and convert it into meaningful actions.
The system:

- Accepts input via a microphone or an audio file
- Converts speech to text using Whisper (via the Groq API)
- Uses an LLM to understand what the user wants
- Executes the appropriate action locally — like creating files, generating code, summarizing content, or responding conversationally
This project was developed as part of the Mem0 AI/ML Generative AI Developer Intern assignment.
Live demo: [your streamlit URL here]
GitHub: [your github URL here]
## Architecture Overview
The application follows a simple but effective pipeline:
Audio Input → Speech-to-Text → Intent Detection → Action Execution → UI Output
Tech stack:

- Streamlit — for building the UI quickly
- Groq API — Whisper (speech-to-text) + LLM (intent understanding)
- faster-whisper — local fallback for transcription
- Python — core logic and tool execution
## Intent Classification Approach
Instead of training a separate model, I used prompt engineering to guide the LLM to return structured outputs.
The model is instructed to respond strictly in JSON, for example:

`{"intents": ["write_code"], "params": {"filename": "sort.py", "language": "python", "description": "bubble sort function"}}`
This makes the system:

- predictable
- easy to parse
- extensible for multiple actions
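Parsing that structured reply can be sketched as below. The fallback-to-chat behaviour on malformed output is an illustrative assumption, not necessarily what the project does.

```python
import json

# Minimal sketch of parsing the LLM's JSON reply (fallback rules assumed).
def parse_intents(raw: str) -> dict:
    """Parse the model's JSON output, degrading to general_chat on bad input."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"intents": ["general_chat"], "params": {}}
    data.setdefault("intents", ["general_chat"])  # tolerate missing keys
    data.setdefault("params", {})
    return data
```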
Supported intents include:

- create_file — creates a file in a safe directory
- write_code — generates and saves code
- summarize — produces concise summaries
- general_chat — handles normal conversations
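A natural way to wire intent names to actions is a dispatch table; the handler bodies below are placeholders, not the project's real implementations.

```python
# Illustrative dispatch table mapping intent names to handlers.
# Handler bodies are stubs for demonstration only.
def create_file(params):  return f"created {params.get('filename')}"
def write_code(params):   return f"wrote code to {params.get('filename')}"
def summarize(params):    return "summary"
def general_chat(params): return "chat reply"

HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def dispatch(intent: str, params: dict) -> str:
    # Unknown intents degrade to conversation rather than raising.
    return HANDLERS.get(intent, general_chat)(params)
```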
## Handling Multiple Actions

The system supports compound commands. For example:

"Write a retry function and save it as retry.py"

This resolves into multiple intents (write_code + create_file), which the system executes sequentially.
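The sequential execution of a compound plan can be sketched like this; `dispatch` stands in for whatever maps an intent name plus params to a result in the real code.

```python
# Sketch of running a compound command in order (names are hypothetical).
def run_plan(plan: dict, dispatch) -> list:
    """Execute each detected intent sequentially, collecting the results."""
    results = []
    for intent in plan["intents"]:   # e.g. ["write_code", "create_file"]
        results.append(dispatch(intent, plan["params"]))
    return results
```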
## Safety Measures

All file operations are restricted to a controlled output/ directory. To prevent misuse:

- Filenames are sanitized
- Path traversal attempts (like ../../) are blocked

This keeps the agent from touching anything outside its sandbox.
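A minimal sanitizer in the spirit described above might look like this; the exact rules in the project may differ, and the `output/` root is taken from the text.

```python
import os
import re

OUTPUT_DIR = "output"  # sandbox root, per the post

def safe_path(filename: str) -> str:
    """Map an untrusted filename to a path inside OUTPUT_DIR, or raise."""
    # Drop any directory components, then whitelist safe characters.
    name = os.path.basename(filename)
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name) or "untitled.txt"
    path = os.path.abspath(os.path.join(OUTPUT_DIR, name))
    # Belt and braces: refuse anything that still escapes the sandbox.
    if not path.startswith(os.path.abspath(OUTPUT_DIR) + os.sep):
        raise ValueError("path traversal blocked")
    return path
```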
## Challenges I Faced

- **Missing file contents:** some project files were empty after setup, which caused import errors. I had to manually verify and restore each file.
- **Model changes during development:** the Groq model llama3-8b-8192 was deprecated, so I switched to llama-3.3-70b-versatile.
- **Incorrect language detection:** the local Whisper model sometimes transcribed in the wrong language. Using Groq's hosted Whisper resolved this.
- **Git issues with the virtual environment:** I accidentally committed .venv, which caused large commits. Fixed by adding .venv to .gitignore and removing it from tracking.
## Groq vs Local Ollama — Speed Comparison
| Backend | Transcription | Intent Classification |
|---|---|---|
| faster-whisper (local, base model) | ~15-30 seconds | N/A |
| Ollama llama3 (local) | N/A | ~8-12 seconds |
| Groq API | ~1 second | ~1-2 seconds |
Groq wins by a huge margin for development and demos. For a fully offline/private deployment, Ollama + faster-whisper is the way to go.
## Graceful Degradation
If the LLM is unavailable, the app falls back to a rule-based intent classifier using keyword matching. So even without an API key or internet connection, basic intents like create_file and write_code still work.
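A keyword-matching fallback of this kind can be sketched as follows; the keyword lists here are illustrative guesses, not the project's actual rules.

```python
# Hypothetical rule-based fallback classifier (keyword lists are assumed).
KEYWORDS = {
    "write_code": ["write code", "function", "script", "program"],
    "create_file": ["create file", "save it as", "make a file", "new file"],
    "summarize": ["summarize", "summary", "tl;dr"],
}

def fallback_classify(text: str) -> dict:
    """Match keywords against the transcript; default to general_chat."""
    text = text.lower()
    intents = [intent for intent, words in KEYWORDS.items()
               if any(w in text for w in words)]
    return {"intents": intents or ["general_chat"], "params": {}}
```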
## How to Run It Yourself
1. Clone the repo: `git clone https://github.com/Sanvi09Kulkarni/voice-AI-agent`
2. Install dependencies: `pip install -r requirements.txt`
3. Get a free Groq API key at console.groq.com
4. Run: `streamlit run app.py`
5. Select Groq API in both dropdowns, paste your key, and start talking
## What I Learned
- Structured JSON prompting is more reliable than free-form LLM output for classification tasks
- Groq's hosted Whisper is dramatically faster than running Whisper locally on CPU
- Streamlit makes it surprisingly easy to build production-looking AI apps quickly
- Always add .venv to .gitignore before your first commit
Thanks for reading! Check out the live demo and drop a star on GitHub if you found this useful.