Introduction
In this project, I built a voice-controlled AI agent that converts spoken commands into executable actions such as generating code, creating files, and summarizing text.
The goal was to combine speech processing, language models, and tool execution into a single pipeline that feels like a real-world AI assistant.
Architecture Overview
The system follows a simple but powerful pipeline:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output
Each stage is modular, making the system easy to extend and debug.
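The stages above can be sketched as plain functions chained together. All names here are illustrative, not the project's actual API; the point is that each stage is a swappable callable:

```python
# Minimal sketch of the pipeline: each stage is a plain function,
# so any stage can be replaced or tested in isolation.
# Function names are illustrative, not the project's real API.

def run_pipeline(audio_path, transcribe, detect_intents, execute):
    """Chain the stages: audio -> text -> intents -> results."""
    text = transcribe(audio_path)                   # Speech-to-Text
    intents = detect_intents(text)                  # Intent Detection
    results = [execute(i, text) for i in intents]   # Tool Execution
    return text, intents, results                   # Output for the UI
```

Because every stage is injected, the whole pipeline can be exercised with stubs before any API key is configured.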
Tech Stack
- Speech-to-Text: AssemblyAI
- Language Model: Groq (llama-3.1-8b-instant)
- Frontend: Streamlit
- Backend: Python
How It Works
1. Speech-to-Text
The user uploads an audio file, which is transcribed into text using AssemblyAI.
This step converts unstructured voice input into usable text.
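A hedged sketch of this stage, based on the `assemblyai` Python SDK (exact attribute names may vary between SDK versions; the import is kept inside the function so the rest of the pipeline stays runnable without the package installed):

```python
def transcribe_audio(audio_path: str, api_key: str) -> str:
    """Transcribe an uploaded audio file with AssemblyAI.

    Sketch based on the `assemblyai` SDK; check the current SDK docs
    for exact class and attribute names.
    """
    import assemblyai as aai  # third-party: pip install assemblyai

    aai.settings.api_key = api_key
    transcript = aai.Transcriber().transcribe(audio_path)
    if transcript.error:
        raise RuntimeError(f"Transcription failed: {transcript.error}")
    return transcript.text
```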
2. Intent Detection
The transcribed text is sent to a language model hosted on Groq.
The model analyzes the command and returns structured output like:
```json
{
  "intents": ["write_code", "create_file"],
  "params": {
    "filename": "retry.py",
    "language": "python"
  }
}
```
This allows the system to support multiple actions in a single command.
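Assuming the model returns JSON of that shape, the parsing step might look like this sketch, with a defensive default since models occasionally wrap JSON in extra prose:

```python
import json

def parse_intent_response(raw: str) -> dict:
    """Parse the model's JSON reply into intents + params.

    Falls back to a generic "chat" intent when the reply is not
    valid JSON, rather than crashing the pipeline.
    """
    try:
        data = json.loads(raw)
        return {
            "intents": data.get("intents", ["chat"]),
            "params": data.get("params", {}),
        }
    except json.JSONDecodeError:
        return {"intents": ["chat"], "params": {}}
```

Returning a list of intents (rather than a single one) is what makes compound commands possible later.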
3. Tool Execution
Based on detected intents, the system executes actions such as:
- Generating code
- Creating files
- Summarizing text
- General chat responses
All file operations are restricted to a sandboxed `output/` directory, so a command can never write outside it.
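One way to enforce that restriction is to resolve every requested filename and reject anything that escapes the sandbox; this is a sketch, not the project's exact implementation:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumed location of the sandbox directory

def safe_output_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside output/,
    rejecting traversal attempts like '../secrets.txt'."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    candidate = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR.resolve() not in candidate.parents:
        raise ValueError(f"Refusing to write outside {OUTPUT_DIR}/: {filename}")
    return candidate
```

Resolving the path first means symlink tricks and `..` segments are normalized away before the check runs.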
4. User Interface
The UI is built using Streamlit and shows:
- Transcribed text
- Detected intent
- Execution results
- Session history
Example
Input:
"Create a Python file with a Fibonacci function"
Output:
- Code is generated
- File is created in the output folder
- Results are displayed in the UI
Bonus Features
Compound Commands
The system supports multiple actions in one input:
"Summarize this text and save it to summary.txt"
Human-in-the-Loop
Before file operations, the user is asked to confirm execution.
This adds a layer of safety and control.
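The confirmation gate can be kept independent of the UI by injecting the confirm step as a callable; in the real app this would be a Streamlit button, but the names below are illustrative:

```python
def confirmed_write(path: str, content: str, confirm) -> str:
    """Run a file write only after explicit user confirmation.

    `confirm` is any callable returning True/False -- in the actual UI
    it would be backed by a Streamlit button; injecting it keeps the
    gate itself testable.
    """
    if not confirm(f"Write {len(content)} chars to {path}?"):
        return "cancelled"
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return "written"
```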
Graceful Degradation
If intent detection fails, the system falls back to keyword-based classification instead of crashing.
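A keyword fallback can be as simple as substring matching against a small table; the keyword lists here are illustrative, not the project's exact ones:

```python
# Crude keyword classifier used only when LLM intent detection fails.
# Keyword lists are illustrative, not the project's exact ones.
FALLBACK_KEYWORDS = {
    "write_code": ("code", "function", "script"),
    "create_file": ("file", "save"),
    "summarize": ("summarize", "summary"),
}

def keyword_fallback(text: str) -> list:
    """Return every intent whose keywords appear in the text,
    defaulting to plain chat when nothing matches."""
    lowered = text.lower()
    intents = [intent for intent, words in FALLBACK_KEYWORDS.items()
               if any(w in lowered for w in words)]
    return intents or ["chat"]
```

It is far less accurate than the LLM, but it keeps the agent responsive instead of crashing on an API failure.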
Session Memory
The agent maintains a history of interactions within the session.
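Session memory is modeled on Streamlit's `st.session_state`, a dict-like store that persists across reruns within one session; the sketch below uses a plain dict so the logic is testable without Streamlit:

```python
def record_interaction(state: dict, transcript: str, intents, result) -> None:
    """Append one interaction to the session history.

    `state` stands in for st.session_state; in the real app the same
    call works against the Streamlit store directly.
    """
    state.setdefault("history", []).append(
        {"transcript": transcript, "intents": intents, "result": result}
    )
```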
Challenges Faced
1. Local Model Limitations
Initially, I used local models:
- Whisper (HuggingFace) for speech-to-text
- Ollama for language models
However, this approach led to:
- FFmpeg setup issues on Windows
- High memory usage
- Slow performance on CPU
- Frequent crashes
2. Switching to API-based Models
After exploring developer discussions (including Reddit), I switched to:
- AssemblyAI for STT
- Groq for LLM inference
This significantly improved:
- Speed
- Stability
- Ease of setup
3. Model Deprecation Issues
While I was building on Groq, some of its hosted models were deprecated mid-development.
This required updating model names and adapting quickly to API changes.
4. Output Cleaning
Language models sometimes returned explanations along with code.
This was fixed by enforcing strict prompts and cleaning responses before saving.
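The cleaning step can be sketched as extracting the first fenced code block from the reply, falling back to the raw text; this is an assumed approach, not the project's exact regex:

```python
import re

def clean_code_response(raw: str) -> str:
    """Extract just the code from an LLM reply.

    Models often wrap code in markdown fences and add prose around
    them; prefer the first fenced block, else return the stripped reply.
    """
    match = re.search(r"```(?:\w+)?\n(.*?)```", raw, re.DOTALL)
    if match:
        return match.group(1).strip()
    return raw.strip()
```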
Model Benchmarking
| Component | Model | Speed | Stability |
|---|---|---|---|
| STT | AssemblyAI | Fast | High |
| LLM (Local) | Ollama | Slow | Unstable |
| LLM (API) | Groq | Very Fast | High |
API-based models clearly outperformed local setups in this project.
Key Learnings
- Shipping a reliable system matters more than insisting on local models
- APIs can significantly improve performance and developer experience
- Fallback mechanisms are essential in AI systems
- Debugging agent pipelines requires step-by-step visibility
Future Improvements
- Real-time microphone input
- Persistent memory across sessions
- Streaming responses
- More advanced tool integrations
Conclusion
This project demonstrates how speech, language models, and execution logic can be combined to build a practical AI agent.
It also highlights the tradeoffs between local and API-based approaches, and the importance of choosing the right tools based on system constraints.