Introduction
In this project, I built a voice-controlled AI agent that accepts audio input, understands user intent, and executes actions locally on my machine. The agent supports creating files, writing code, summarizing text, and general chat — all triggered by voice commands.
Architecture
The system is a four-step pipeline fed by audio input:
Audio Input → Speech-to-Text → Intent Classification → Tool Execution → UI Display
- Audio Input
The app supports two input methods: microphone recording via sounddevice, and file upload (.wav, .mp3, .m4a).
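Whichever path the audio arrives through, it ends up as a mono WAV file before transcription. A minimal sketch of that save step using only the stdlib wave module (the app itself records with sounddevice and writes with scipy); the 16 kHz mono format matches what Whisper models expect, and the helper name is my own:

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Whisper models expect 16 kHz mono audio

def save_wav(samples, path, rate=SAMPLE_RATE):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit PCM
        wf.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)

# Example: save one second of a 440 Hz test tone
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
save_wav(tone, "test_tone.wav")
```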
- Speech-to-Text
I used Groq's Whisper Large V3 API for transcription. I initially tried running Whisper locally, but it produced inaccurate results on my CPU-only 8GB RAM machine, especially with Indian English accents. Groq's API solved this: it is free, fast (2-3 seconds per clip), and highly accurate.
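The transcription call itself is only a few lines with the groq Python client. A hedged sketch: the model name and client usage follow Groq's documented API, but the GROQ_API_KEY environment variable and both helper names are my own, and the extension filter mirrors the upload formats mentioned above:

```python
import os

ALLOWED_EXTENSIONS = {".wav", ".mp3", ".m4a"}

def allowed_upload(filename):
    """Accept only the audio formats the app supports."""
    return os.path.splitext(filename)[1].lower() in ALLOWED_EXTENSIONS

def transcribe(audio_path):
    """Send an audio file to Groq's Whisper Large V3 endpoint."""
    if not allowed_upload(audio_path):
        raise ValueError(f"unsupported audio format: {audio_path}")
    from groq import Groq  # lazy import so the filter works without the SDK installed
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
        )
    return result.text
```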
- Intent Classification
I used Ollama running llama3.2:1b locally for intent detection. The LLM classifies the transcribed text into one of four intents — create file, write code, summarize, or general chat. It also supports compound commands where multiple intents are detected in a single voice input.
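In code, classification is one chat call to the local model plus parsing of its JSON reply. A sketch under my own assumptions: the four intent labels come from the post, but the exact schema (a JSON list of {intent, argument} objects) and the chat-fallback behavior are illustrative, not necessarily what the app does:

```python
import json

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

SYSTEM_PROMPT = (
    "Classify the user's request into a JSON list of intents. "
    "Valid intents: create_file, write_code, summarize, chat. "
    'Reply with JSON only, e.g. [{"intent": "write_code", "argument": "fizzbuzz"}].'
)

def parse_intents(reply_text):
    """Parse the model's JSON reply; fall back to general chat on bad output."""
    try:
        intents = json.loads(reply_text)
        if all(item.get("intent") in VALID_INTENTS for item in intents):
            return intents  # may hold several intents: a compound command
    except (json.JSONDecodeError, AttributeError, TypeError):
        pass
    return [{"intent": "chat", "argument": reply_text}]

def classify(transcript):
    """Ask the local llama3.2:1b model for the intent(s) behind a transcript."""
    import ollama  # lazy import so the parsing logic is testable without Ollama
    resp = ollama.chat(
        model="llama3.2:1b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return parse_intents(resp["message"]["content"])
```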
- Tool Execution
Based on detected intents, the system executes the appropriate action — generating code, creating files, summarizing text, or responding to chat. All files are safely saved inside an output folder.
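Execution boils down to a dispatch table keyed on the detected intent, with file writes confined to the output folder. A minimal sketch; the handler names are mine, and the path check shown is one way to keep writes inside output/, not necessarily the app's exact guard:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename):
    """Resolve a filename inside output/, rejecting escapes like '../x'."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    path = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR.resolve() not in path.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}/: {filename}")
    return path

def create_file(argument):
    path = safe_output_path(argument)
    path.write_text("")
    return f"created {path.name}"

TOOLS = {
    "create_file": create_file,
    # "write_code": ..., "summarize": ..., "chat": ...  (same pattern)
}

def execute(intents):
    """Run each detected intent through its handler, collecting results."""
    return [TOOLS[i["intent"]](i.get("argument", "")) for i in intents]
```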
- UI
Built with Streamlit, the interface clearly displays all four pipeline steps — transcription, detected intent, action taken, and final result.
Models Chosen
For Speech-to-Text I used Groq Whisper Large V3 because it is accurate, free, and offloads transcription entirely from my CPU-only machine. For Intent Classification and Code Generation I used Ollama llama3.2:1b because it runs comfortably in 8GB of RAM locally.
Bonus Features
I implemented all four bonus features from the requirements. Compound Commands allow multiple intents in one voice input. Human-in-the-Loop adds a confirmation prompt before any file operation. Graceful Error Handling shows all errors clearly in the UI. Session Memory tracks the full history of actions in the sidebar.
Challenges Faced
Challenge 1 — Whisper accuracy on CPU
Local Whisper-base produced wrong transcriptions for Indian English accents. Switching to the Groq Whisper API fixed this, with far better accuracy.
Challenge 2 — RAM limitations
With 8GB of RAM, running Whisper and Ollama together caused memory issues. I solved this by using the Groq API for STT, which uses zero local RAM, and by switching from the 3b model to llama3.2:1b, which needs only about 1.3GB.
Challenge 3 — Streamlit rerun issue
The Human-in-the-Loop confirmation was clearing all variables on rerun. I fixed this by storing the entire pipeline state in st.session_state before showing the confirmation dialog.
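The fix, in essence: write every intermediate result into st.session_state before rendering the confirmation button, then read it back on the rerun. In this sketch a plain dict stands in for st.session_state so the pattern runs outside Streamlit; in the real app the same keys live on st.session_state itself, and the function names are illustrative:

```python
# Stand-in for st.session_state (a dict-like store that survives reruns)
session_state = {}

def run_pipeline(transcript, intents):
    """Stash everything the confirmation step will need BEFORE the rerun."""
    session_state["transcript"] = transcript
    session_state["intents"] = intents
    session_state["awaiting_confirmation"] = True

def on_confirm_click():
    """Runs on the rerun: local variables are gone, session state is not."""
    if not session_state.get("awaiting_confirmation"):
        return None
    intents = session_state["intents"]  # still available after the rerun
    session_state["awaiting_confirmation"] = False
    return intents
```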
Challenge 4 — Compound command execution
Getting the LLM to return structured JSON for multiple intents required careful prompt engineering with clear examples in the system prompt.
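Even with examples in the system prompt, small models sometimes wrap the JSON in prose or a code fence. A hedged sketch of the extraction step I pair with such a prompt — the regex approach shown is one simple way to do it, not necessarily the app's exact code:

```python
import json
import re

def extract_json(reply):
    """Pull the first JSON array out of a possibly chatty model reply."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```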
Tech Stack
UI — Streamlit
Speech-to-Text — Groq Whisper Large V3 API
LLM — Ollama llama3.2:1b running locally
Audio Recording — sounddevice and scipy
Language — Python 3.11
GitHub and Demo
GitHub Repository: https://github.com/rishithabompelli/voice-agent
Demo Video: https://youtu.be/6ulvTsCmlEk
Conclusion
Building this agent taught me a lot about combining STT, LLM, and tool execution into a clean pipeline. The biggest learning was choosing the right model for the hardware. Running everything locally on 8GB RAM required careful optimization of model sizes and offloading STT to an API. The project gave me hands-on experience with local LLMs, voice processing, and building agentic AI systems.