Sanvi_Kulkarni
How I Built a Voice-Controlled Local AI Agent with Python and Groq

What I Built

I built a voice-controlled AI agent that can take spoken input and convert it into meaningful actions.

The system:

- Accepts input via microphone or audio file
- Converts speech to text using Whisper (via the Groq API)
- Uses an LLM to understand what the user wants
- Executes the appropriate action locally — like creating files, generating code, summarizing content, or responding conversationally

This project was developed as part of the Mem0 AI/ML Generative AI Developer Intern assignment.

Live demo: [your streamlit URL here]
GitHub: https://github.com/Sanvi09Kulkarni/voice-AI-agent

Architecture Overview

The application follows a simple but effective pipeline:

Audio Input → Speech-to-Text → Intent Detection → Action Execution → UI Output
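The pipeline above can be sketched as a chain of small functions, so each backend (Groq, faster-whisper, Ollama) is swappable per stage. This is an illustrative sketch with stubbed stages — `transcribe_audio`, `detect_intent`, and `execute_action` are hypothetical names, not the project's actual API:

```python
# Minimal pipeline sketch: each stage is a plain function with a narrow
# contract, so the backend behind any stage can be swapped independently.
# All names here are illustrative, not the project's real API.

def transcribe_audio(audio_path: str) -> str:
    """Speech-to-text stage (stubbed; would call Whisper via Groq)."""
    return "write a bubble sort and save it as sort.py"

def detect_intent(text: str) -> dict:
    """Intent-detection stage (stubbed; would call the LLM)."""
    return {"intents": ["write_code", "create_file"],
            "params": {"filename": "sort.py"}}

def execute_action(intent: dict) -> str:
    """Action-execution stage (stubbed)."""
    return "executed " + ", ".join(intent["intents"])

def run_pipeline(audio_path: str) -> str:
    text = transcribe_audio(audio_path)
    intent = detect_intent(text)
    return execute_action(intent)

print(run_pipeline("command.wav"))  # executed write_code, create_file
```

Keeping the stages this decoupled is what makes the local fallbacks described later easy to slot in.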

Tech stack used:

- Streamlit — for building the UI quickly
- Groq API — Whisper (speech-to-text) + LLM (intent understanding)
- faster-whisper — local fallback for transcription
- Python — core logic and tool execution

Intent Classification Approach

Instead of training a separate model, I used prompt engineering to guide the LLM to return structured outputs.

The model is instructed to respond strictly in JSON format like:

```json
{"intents": ["write_code"], "params": {"filename": "sort.py", "language": "python", "description": "bubble sort function"}}
```

This makes the system:

- predictable
- easy to parse
- extensible for multiple actions
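Parsing that structured output then stays trivial, with a guard for the occasional malformed reply. A minimal sketch (function and variable names are illustrative):

```python
import json

def parse_intent_response(raw: str) -> dict:
    """Parse the LLM's strict-JSON reply; fall back to general_chat on bad JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # The model occasionally ignores instructions; degrade to conversation
        return {"intents": ["general_chat"], "params": {}}
    # Guard against replies that are valid JSON but miss the expected keys
    return {"intents": data.get("intents", []), "params": data.get("params", {})}

reply = '{"intents": ["write_code"], "params": {"filename": "sort.py"}}'
print(parse_intent_response(reply)["intents"])  # ['write_code']
```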

Supported intents include:

- create_file — creates a file in a safe directory
- write_code — generates and saves code
- summarize — produces concise summaries
- general_chat — handles normal conversations

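One simple way to wire intent strings like these to their actions is a dispatch dictionary, which also gives unknown intents a graceful default. An illustrative sketch — the handler functions here are hypothetical stand-ins:

```python
# Stub handlers standing in for the real file/code/summary actions
def create_file(params):  return f"created {params.get('filename', 'untitled.txt')}"
def write_code(params):   return f"wrote code: {params.get('description', 'task')}"
def summarize(params):    return "summary"
def general_chat(params): return "chat reply"

# Map each intent string the LLM may return to its handler
HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def dispatch(intent: str, params: dict) -> str:
    # Unknown or misspelled intents degrade gracefully to conversation
    return HANDLERS.get(intent, general_chat)(params)

print(dispatch("create_file", {"filename": "sort.py"}))  # created sort.py
```

Adding a new intent then means adding one handler and one dictionary entry, which is what makes the approach extensible.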
Handling Multiple Actions

The system supports compound commands.

For example:

“Write a retry function and save it as retry.py”

This results in multiple intents (write_code + create_file) which are executed sequentially by the system.
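Sequential execution of a multi-intent response can be sketched like this (illustrative; `run_intent` is a hypothetical stand-in for the project's action executor):

```python
def run_intent(intent: str, params: dict) -> str:
    """Stand-in for the real action executor."""
    return f"{intent} done"

def execute_all(response: dict) -> list:
    """Run each intent in order; params are shared across all intents."""
    return [run_intent(intent, response["params"])
            for intent in response["intents"]]

# A compound command like "Write a retry function and save it as retry.py"
response = {"intents": ["write_code", "create_file"],
            "params": {"filename": "retry.py"}}
print(execute_all(response))  # ['write_code done', 'create_file done']
```

Running intents in list order matters here: the code must exist before the file-save step can persist it.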

Safety Measures

All file operations are restricted to a controlled output/ directory.

To prevent misuse:

- Filenames are sanitized
- Path traversal (like ../../) is blocked

This ensures no unintended access to the system.
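One common way to enforce this is to strip directory components from the requested filename, resolve the result, and verify it still sits under the output directory. A minimal sketch, not necessarily the project's exact implementation:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Return a path confined to OUTPUT_DIR, neutralizing traversal attempts."""
    # Path(...).name drops any directory components the user smuggled in,
    # so "../../etc/passwd" collapses to just "passwd"
    candidate = (OUTPUT_DIR / Path(filename).name).resolve()
    # Belt and braces: confirm the resolved path is still under OUTPUT_DIR
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"unsafe path: {filename}")
    return candidate

print(safe_path("../../etc/passwd").name)  # passwd — confined to output/
```

`Path.is_relative_to` requires Python 3.9+; on older versions the same check can be done with `os.path.commonpath`.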

Challenges I Faced

1. Missing file contents
   Some project files were empty after setup, which caused import errors. I had to manually verify and restore each file.

2. Model changes during development
   The Groq model llama3-8b-8192 was deprecated mid-project, so I switched to llama-3.3-70b-versatile.

3. Incorrect language detection
   The local Whisper model sometimes transcribed in the wrong language. Switching to Groq's hosted Whisper resolved this.

4. Git issues with the virtual environment
   I accidentally committed .venv, which bloated the repository. Fixed by adding it to .gitignore and removing it from tracking.

Groq vs Local Ollama — Speed Comparison

| Backend | Transcription | Intent Classification |
| --- | --- | --- |
| faster-whisper (local, base model) | ~15-30 seconds | N/A |
| Ollama llama3 (local) | N/A | ~8-12 seconds |
| Groq API | ~1 second | ~1-2 seconds |

Groq wins by a huge margin for development and demos. For a fully offline/private deployment, Ollama + faster-whisper is the way to go.


Graceful Degradation

If the LLM is unavailable, the app falls back to a rule-based intent classifier using keyword matching. So even without an API key or internet connection, basic intents like create_file and write_code still work.
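A keyword-based fallback classifier can be as simple as a keyword-to-intent map scanned against the transcript. An illustrative sketch — these particular keywords are assumptions, not the project's exact rules:

```python
# Keyword-to-intent map consulted when no LLM is reachable (illustrative)
KEYWORD_INTENTS = {
    "create_file": ["create", "file", "save"],
    "write_code": ["code", "function", "script", "write"],
    "summarize": ["summarize", "summary", "tl;dr"],
}

def classify_fallback(text: str) -> list:
    """Return every intent whose keywords appear in the text; default to chat."""
    text = text.lower()
    intents = [intent for intent, words in KEYWORD_INTENTS.items()
               if any(word in text for word in words)]
    return intents or ["general_chat"]

print(classify_fallback("Write a retry function and save it as retry.py"))
# ['create_file', 'write_code']
```

It is far less flexible than the LLM, but it keeps the core file and code actions usable fully offline.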


How to Run It Yourself

  1. Clone the repo: git clone https://github.com/Sanvi09Kulkarni/voice-AI-agent
  2. Install deps: pip install -r requirements.txt
  3. Get a free Groq API key at console.groq.com
  4. Run: streamlit run app.py
  5. Select Groq API in both dropdowns, paste your key, and start talking

What I Learned

  • Structured JSON prompting is more reliable than free-form LLM output for classification tasks
  • Groq's hosted Whisper is dramatically faster than running Whisper locally on CPU
  • Streamlit makes it surprisingly easy to build production-looking AI apps quickly
  • Always add .venv to .gitignore before your first commit

Thanks for reading! Check out the live demo and drop a star on GitHub if you found this useful.
