# VoiceAgent AI: Local AI Agent with Voice Control

A fully functional, voice-controlled local AI agent, built for the Mem0 AI/ML & Generative AI Developer Intern Assignment. The system accepts audio input, transcribes it, classifies intent with an LLM, and runs local tools, all presented in a sleek, dark-themed Streamlit UI.
## Architecture

The pipeline consists of four stages: audio input, speech-to-text, intent classification, and a tool dispatcher.
```
┌──────────────────────────────────────────────────────────┐
│                      VoiceAgent AI                       │
│                                                          │
│  ┌──────────┐      ┌──────────┐      ┌───────────┐       │
│  │  Audio   │─────▶│   STT    │─────▶│  Intent   │       │
│  │  Input   │      │ (Whisper │      │ Classify  │       │
│  │ .wav/.mp3│      │ via Groq)│      │ (LLaMA    │       │
│  └──────────┘      └──────────┘      │  3.3 70B  │       │
│                                      │ via Groq) │       │
│                                      └─────┬─────┘       │
│                                            │             │
│  ┌─────────────────────────────────────────▼──────────┐  │
│  │                  Tool Dispatcher                   │  │
│  │   create_file │ write_code │ summarize │           │  │
│  │   general_chat                                     │  │
│  └────────────────────────┬───────────────────────────┘  │
│                           │                              │
│                           ▼                              │
│                    output/ folder                        │
│                  (sandboxed, safe)                       │
└──────────────────────────────────────────────────────────┘
```
### Module Breakdown

| Module | Description |
|---|---|
| `app.py` | Streamlit UI: pipeline display, session history, human-in-the-loop confirmations |
| `intent_classifier.py` | LLaMA 3.3 70B prompt + JSON parsing + graceful fallback |
| `tools.py` | Tool handlers: `create_file`, `write_code`, `summarize`, `general_chat` |
| `requirements.txt` | Minimal dependencies (Streamlit + Groq SDK) |
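The JSON parsing with graceful fallback in `intent_classifier.py` could look roughly like the sketch below. This is illustrative, not the project's actual code: `VALID_INTENTS`, `parse_intent`, and the fallback payload shape are assumptions.

```python
import json

# The four tools listed above; anything else degrades to general_chat.
VALID_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intent(llm_output: str) -> dict:
    """Parse the LLM's JSON reply; fall back to general_chat on any failure."""
    try:
        data = json.loads(llm_output)
        # Accept only a dict carrying one of the known intents.
        if isinstance(data, dict) and data.get("intent") in VALID_INTENTS:
            return data
    except json.JSONDecodeError:
        pass
    # Graceful degradation: malformed JSON or unknown intent.
    return {"intent": "general_chat", "args": {"message": llm_output}}
```

The key design choice is that the parser never raises: every possible LLM output maps to some runnable intent.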
## Hardware Note & Workaround

**Why Groq API instead of local models?**
My local machine has no dedicated GPU. Running Whisper Large v3 or LLaMA 3.3 70B locally would require at minimum 8 GB of VRAM (Whisper) and 40 GB+ of RAM/VRAM (quantized LLaMA 70B). Substituting smaller models that fit on CPU would degrade accuracy to a prohibitive degree.
**Solution:** Groq's free API tier offers:

- **Whisper Large v3** for STT: state-of-the-art accuracy, ~2–3 seconds per audio file
- **LLaMA 3.3 70B Versatile** for intent classification + code generation: extremely fast (~200 tokens/sec on Groq hardware)
This fully complies with the assignment's hardware workaround policy, and the whole pipeline runs at API speed (3–6 seconds end-to-end).
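The STT-to-LLM hand-off can be sketched as below. The model IDs come from the section above; `SYSTEM_PROMPT`, `build_intent_messages`, and `run_pipeline` are illustrative names, not the project's actual code, and the Groq client calls are shown under the assumption of the standard OpenAI-compatible Groq SDK.

```python
# Assumed prompt; the real classification prompt lives in intent_classifier.py.
SYSTEM_PROMPT = (
    'Classify the request into one of: create_file, write_code, summarize, '
    'general_chat. Reply with JSON: {"intent": "...", "args": {...}}'
)

def build_intent_messages(transcript: str) -> list[dict]:
    """Assemble the chat messages sent to LLaMA 3.3 70B for classification."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript},
    ]

def run_pipeline(audio_path: str, client) -> str:
    """Transcribe with Whisper Large v3, then classify intent with LLaMA.

    `client` is a groq.Groq instance (pip install groq).
    """
    with open(audio_path, "rb") as f:
        transcription = client.audio.transcriptions.create(
            file=f, model="whisper-large-v3"
        )
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=build_intent_messages(transcription.text),
    )
    return response.choices[0].message.content
```

Both calls are plain HTTPS round trips, which is why the end-to-end latency stays in the 3–6 second range.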
## Bonus Features Implemented

- **Compound Commands**: "Summarize and save to file" handles more than one action in a single request
- **Human-in-the-Loop**: checkbox confirmations for any file-writing command
- **Graceful Degradation**: if JSON parsing fails, no intent matches the request, or the audio makes no sense, the agent falls back to `general_chat` with a helpful message
- **Session Memory**: the entire history of actions for the session is displayed in the UI
- **Safe Sandbox**: all file operations are limited to the `output/` folder, with a path-traversal safeguard
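Compound commands can be served by a dispatcher that walks a list of classified actions in order, degrading to `general_chat` for anything unrecognized. A minimal sketch, with stub handlers standing in for the real tools in `tools.py` (the `actions` structure and handler bodies are assumptions):

```python
# Stub handlers; the real implementations live in tools.py.
def create_file(args): return f"created {args.get('filename', 'file')}"
def write_code(args): return "code written"
def summarize(args): return "summary ready"
def general_chat(args): return f"chat: {args.get('message', '')}"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def dispatch(actions: list[dict]) -> list[str]:
    """Run each classified action in order; unknown intents fall back to chat."""
    results = []
    for action in actions:
        handler = TOOLS.get(action.get("intent"), general_chat)
        results.append(handler(action.get("args", {})))
    return results
```

A request like "summarize and save to file" then simply yields a two-element action list, and each element is executed in turn.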
## Project Structure

```
voice-agent-ai/
├── app.py                 # Main Streamlit UI
├── intent_classifier.py   # LLM-based intent classification
├── tools.py               # Tool execution handlers
├── requirements.txt       # Dependencies
├── output/                # All generated files (gitignored)
└── README.md
```
## Demo Video

YouTube Unlisted Link. Demonstrates:
- Voice input: "Create a python file with a retry decorator" → `write_code` intent → file saved
- Voice input: "What is the difference between RAM and ROM?" → `general_chat` intent → conversational response
## Technical Article

Medium / Dev.to Link. Covers architecture, model selection, Groq's speed advantage, and challenges.
## Safety

- All file writes are limited to the `output/` directory
- `os.path.basename()` strips path-traversal components (`../`) from filenames
- Human-in-the-loop confirmation before any destructive file operation
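The `os.path.basename()` safeguard can be sketched as below; `OUTPUT_DIR`, `safe_path`, and the empty-name guard are illustrative, not necessarily the project's exact code.

```python
import os

OUTPUT_DIR = "output"  # the only directory the agent may write to

def safe_path(filename: str) -> str:
    """Map a user-supplied filename to a path inside OUTPUT_DIR.

    os.path.basename() drops every directory component, so a traversal
    attempt like '../../etc/passwd' collapses to just 'passwd'.
    """
    name = os.path.basename(filename)
    if not name or name in (".", ".."):
        raise ValueError(f"invalid filename: {filename!r}")
    return os.path.join(OUTPUT_DIR, name)
```

Because the sanitized name is always joined back onto `OUTPUT_DIR`, no voice command can write outside the sandbox.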
## Dependencies

```
streamlit>=1.35.0   # UI framework
groq>=0.9.0         # Groq SDK (STT + LLM)
```
No heavy ML libraries required; the project runs on any machine with Python 3.9+.