I built a voice-controlled AI agent that can take audio input, convert speech to text, understand the user’s intent, perform local actions, and display the complete pipeline in a simple web UI.
The goal of the project was to create an agent that supports:
- creating files
- writing code into files
- summarizing text
- general chat
To keep the system safe, all generated files are stored only inside an output/ folder.
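That constraint is easy to enforce by resolving every requested path against the output/ folder before writing anything. A minimal sketch of the idea (the function name is mine, not the project's):

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename: str) -> Path:
    """Resolve a requested filename inside output/ and reject path traversal."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    candidate = (OUTPUT_DIR / filename).resolve()
    # Anything that escapes output/ (e.g. "../secrets.txt") is refused.
    if OUTPUT_DIR.resolve() not in candidate.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}/: {filename}")
    return candidate
```

Resolving the path first means tricks like `../` or absolute paths are caught before a single byte is written.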
Tech Stack:
I used:
- Python
- Streamlit for the UI
- Groq Speech-to-Text API for transcription
- OpenRouter API for intent classification and text generation
- python-dotenv for API key management
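For reference, that stack boils down to a short requirements.txt (exact package names are my guess from the list above; OpenRouter calls can go through plain requests, since the API is HTTP-based):

```text
# requirements.txt (package names inferred from the stack above)
streamlit
groq
requests
python-dotenv
```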
How It Works:
The workflow is simple:
- The user records audio or uploads an audio file.
- The audio is saved temporarily.
- Groq converts the speech into text.
- OpenRouter classifies the user’s intent.
- Based on the intent, the system performs the required action.
- The UI shows the transcription, intent, action taken, and final result.
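The routing step in that list can be sketched as a small dispatch table keyed by the classified intent. The action bodies here are stubs for illustration; in the real app they write files and call the APIs:

```python
def handle_intent(intent: str, text: str) -> str:
    """Route a classified intent to the matching local action (stubbed here)."""
    actions = {
        "create_file": lambda t: f"create_file: {t}",
        "write_code": lambda t: f"write_code: {t}",
        "summarize": lambda t: f"summarize: {t}",
        "chat": lambda t: f"chat: {t}",
    }
    # Fall back to chat for any label the classifier did not recognize.
    action = actions.get(intent, actions["chat"])
    return action(text)
```

Keeping the routing in one table makes it easy to add a new intent later: one new key, one new action function.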
Why I Chose This Approach:
At first, I considered using local Whisper and Ollama. However, local speech-to-text often needs extra dependencies like FFmpeg, and local LLM setup can be harder to manage across devices.
To make the project easier to run and more deployment-friendly, I used:
- Groq for fast speech transcription
- OpenRouter for reasoning, summarization, chat, and code generation
This made the system more stable and portable.
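Part of that portability comes from OpenRouter exposing an OpenAI-compatible HTTP API, so the reasoning calls need nothing beyond a POST request. A sketch of the request builder (the default model slug is just a placeholder; any model available on OpenRouter works):

```python
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_chat_request(api_key: str, user_text: str,
                       model: str = "meta-llama/llama-3.1-8b-instruct"):
    """Build the headers and JSON body for an OpenRouter chat completion call."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return headers, body
```

The actual call is then a single `requests.post(OPENROUTER_URL, headers=headers, json=body)`.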
Main Challenges:
One challenge was intent classification.
For example, a command like:
“Create a Python file with a retry function”
was initially classified as create_file instead of write_code.
I fixed this by improving the classifier prompt and making the intent rules more explicit.
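The fix amounts to spelling out the decision boundary in the prompt and then validating whatever label comes back. A condensed version of the idea (the wording is illustrative, not the project's exact prompt):

```python
INTENT_PROMPT = """Classify the user's request into exactly one label:
- write_code: the user wants code written, even if that also involves creating a file
- create_file: the user wants a file created with no code in it
- summarize: the user wants a piece of text summarized
- chat: anything else
Reply with the label only."""

VALID_INTENTS = {"write_code", "create_file", "summarize", "chat"}

def parse_intent(model_reply: str) -> str:
    """Normalize the model's reply; fall back to chat on anything unexpected."""
    label = model_reply.strip().lower()
    return label if label in VALID_INTENTS else "chat"
```

The "even if that also involves creating a file" rule is what resolves the ambiguity in the example above: writing code wins over merely creating a file.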
Another issue was handling API keys securely. I solved that by using a .env file and excluding it from GitHub with .gitignore.
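Concretely, the keys live in a .env file that git never sees (the key names here are my assumption):

```text
# .env (listed in .gitignore, so it is never committed)
GROQ_API_KEY=your_groq_key_here
OPENROUTER_API_KEY=your_openrouter_key_here
```

At startup, load_dotenv() from python-dotenv reads this file, after which the keys are available through os.getenv().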
What I Learned:
This project taught me that building an AI agent is not just about calling a model. The real work is in:
- handling inputs properly
- structuring model outputs
- routing actions safely
- building a clear interface for users
I also learned how much prompt design matters. A small prompt change can significantly improve the quality of intent detection.
This project was a practical way to combine speech recognition, intent understanding, local tool execution, and UI design into one application.
Using Groq, OpenRouter, and Streamlit, I built a voice-controlled AI agent that can listen, understand, and act on user commands in a safe and structured way.
