I built a voice-controlled AI agent that takes spoken input, converts it to text, classifies intent, executes local tools, and shows the full pipeline in a Gradio UI. The goal was to make the system practical, safe, and easy to debug end-to-end.
Architecture
The system follows a simple 4-stage pipeline:
1) Input Layer (UI)
Users can provide commands through microphone recording, audio upload, or text input (for quick testing).
2) Speech-to-Text (STT)
Audio is transcribed using AssemblyAI.
3) Intent Understanding (LLM Router)
Transcribed text is sent to a Groq-hosted Llama 3.3 70B model, which returns structured JSON intents such as create_file, write_code, summarize, and general_chat.
4) Tool Execution Layer
The app routes each detected intent to a tool:
- file creation
- code generation and save
- summarization
- chat response
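The routing layer can be sketched as a dispatch table keyed by intent name. All function names below are illustrative placeholders, not the project's actual code:

```python
# Minimal intent-to-tool dispatch sketch. The tool bodies are stubs;
# the real app calls its own file/code/summarize/chat implementations.
def create_file(args: dict) -> str:
    return f"created {args.get('filename', 'untitled.txt')}"

def write_code(args: dict) -> str:
    return "code written"

def summarize(args: dict) -> str:
    return "summary produced"

def general_chat(args: dict) -> str:
    return "chat reply"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def execute_intents(intents: list[dict]) -> list[str]:
    """Run each detected intent through its tool; unknown intents fall back to chat."""
    results = []
    for intent in intents:
        tool = TOOLS.get(intent.get("intent"), general_chat)
        results.append(tool(intent.get("args", {})))
    return results
```

Because the dispatch is a plain dict, supporting compound commands is just a loop over the intent list the model returns.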
The UI then displays:
- transcribed text
- detected intents
- action(s) taken
- final result
All file operations are sandboxed to an output/ directory for safety.
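The sandboxing can be implemented by resolving every requested path against the output/ directory and rejecting escapes. A minimal sketch, with a helper name of my own choosing:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside output/, rejecting traversal attempts."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # Accept only paths that are output/ itself or live under it.
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {filename}")
    return candidate
```

Resolving before the containment check is the important part: it defeats `../` tricks that a plain string-prefix check would miss.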
Models and Why I Chose Them
1) AssemblyAI for STT
I initially considered local models (Whisper/wav2vec), but for this machine and timeline, API-based STT was more reliable and faster to integrate.
Why AssemblyAI:
- generous free tier
- strong transcription quality
- simple Python SDK
- avoids local GPU dependency
2) Groq + Llama 3.3 70B for Intent + Generation
For intent classification and text/code generation, I used Groq’s hosted Llama model.
Why Groq:
- fast inference latency
- good structured-output behavior (JSON intent schema)
- strong instruction following for routing + generation
- straightforward integration in Python
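Because the model replies in JSON, the router only needs to parse and validate that reply. A sketch of the parsing step — the schema and fallback behavior shown are assumptions, not the project's exact contract:

```python
import json

ALLOWED_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intents(llm_output: str) -> list[dict]:
    """Parse the model's JSON reply, tolerating markdown code fences,
    and coerce unknown intent names to a safe fallback."""
    text = llm_output.strip()
    if text.startswith("```"):
        # drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    intents = data if isinstance(data, list) else [data]
    for item in intents:
        if item.get("intent") not in ALLOWED_INTENTS:
            item["intent"] = "general_chat"  # fall back rather than fail
    return intents
```

Validating against a fixed intent set keeps a hallucinated intent name from reaching the tool layer.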
Key Challenges I Faced
1) STT model configuration mismatches
A major challenge was AssemblyAI configuration compatibility:
- speech_model was deprecated by the API's current expectations
- speech_models required specific values (universal-3-pro, universal-2)
- enum values in the SDK and the values accepted by the server were not always intuitive
I resolved this by explicitly setting supported speech_models values.
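One way to guard against such mismatches is to validate the configured value against a whitelist before building the config. The accepted values below are the ones noted above; the helper itself is illustrative:

```python
# Values the server accepted in my testing (see notes above); adjust as the API evolves.
SUPPORTED_SPEECH_MODELS = {"universal-3-pro", "universal-2"}

def resolve_speech_model(requested: str, default: str = "universal-2") -> str:
    """Return a server-accepted speech model name, falling back to a known-good default."""
    value = requested.strip().lower()
    if value in SUPPORTED_SPEECH_MODELS:
        return value
    return default
```

Failing over to a known-good default keeps a stale config value from breaking transcription outright.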
2) Language drift (Hindi vs English output)
During testing, English speech was sometimes transcribed in Hindi script due to language auto-detection. This cascaded into Hindi LLM responses.
I fixed this by forcing English in STT config and aligning LLM prompts to respond in English.
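The fix amounts to disabling auto-detection and pinning the language on the STT side. A sketch of the config kwargs, assuming parameter names similar to AssemblyAI's TranscriptionConfig — treat the exact names as assumptions to verify against the SDK version in use:

```python
def build_stt_config(language_code: str = "en") -> dict:
    """Config kwargs that pin the transcription language instead of auto-detecting it.
    Keyword names mirror AssemblyAI's TranscriptionConfig but should be verified
    against the installed SDK version."""
    return {
        "language_code": language_code,   # force English transcription
        "language_detection": False,      # disable the auto-detection that caused Hindi drift
    }
```

The same pinning has to happen on the LLM side (an explicit "respond in English" instruction in the system prompt), or the drift just reappears downstream.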
3) Intent ambiguity in compound commands
User prompts like “create a file and write your capabilities” can be interpreted as:
- create file + generate code
- create file + write plain text
- chat + file write
Because the current intent set does not include a dedicated write_text intent, the model sometimes chose write_code, producing code when plain text was expected. This highlighted an important product gap: intent taxonomies must match real user phrasing.
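A stopgap until a dedicated write_text intent exists is a small post-routing heuristic: if the model picked write_code but the request contains no code cues, downgrade to a plain-text write. This logic is entirely hypothetical, not what the app currently does:

```python
# Keyword cues suggesting the user actually wants code; a crude but cheap heuristic.
CODE_CUES = ("code", "function", "script", "class", "def ", "python", "program")

def refine_intent(intent: str, user_text: str) -> str:
    """Downgrade write_code to a hypothetical write_text intent when no code cues appear."""
    lowered = user_text.lower()
    if intent == "write_code" and not any(cue in lowered for cue in CODE_CUES):
        return "write_text"
    return intent
```

A keyword heuristic is fragile, of course; the durable fix is extending the intent taxonomy itself so the model can choose correctly in the first place.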
4) Safety vs usability
I needed to enable local file actions while minimizing risk. Restricting writes to output/ and adding a confirmation toggle balanced safety with usability.
What Worked Well
- Clear modular design (stt.py, intent.py, tools/*, app.py)
- Human-in-the-loop confirmation before file operations
- Compound-intent execution support
- Persistent memory support and reset controls
- Benchmark script to evaluate intent accuracy and generation latency
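The benchmark idea can be sketched as: run labeled utterances through the router, then report accuracy and per-call latency. The classify callable below is a stand-in for the real Groq call:

```python
import time

def benchmark(classify, labeled_utterances: list[tuple[str, str]]) -> dict:
    """Measure intent accuracy and mean latency of a classify(text) -> intent callable."""
    correct, latencies = 0, []
    for text, expected in labeled_utterances:
        start = time.perf_counter()
        predicted = classify(text)
        latencies.append(time.perf_counter() - start)
        correct += predicted == expected
    n = len(labeled_utterances)
    return {
        "accuracy": correct / n,
        "mean_latency_s": sum(latencies) / n,
    }
```

Keeping the classifier behind a callable means the same harness can compare prompt variants or different hosted models without changes.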
Conclusion
This project showed that a practical voice agent is less about one “smart model” and more about pipeline reliability: robust STT config, strict intent schema, safe tool boundaries, and transparent UI feedback. The next meaningful improvement would be adding a write_text intent and richer compound-intent planning so user requests map more naturally to expected outcomes.