Building a Voice-Controlled Local AI Agent Using Whisper and Ollama
Introduction
Voice interfaces are becoming an important way to interact with intelligent systems. This project explores how to build a local AI agent that can understand spoken commands, interpret user intent, and execute real actions on a system.
The goal was to create a complete pipeline that takes audio input, converts it into text, analyzes the intent using a language model, and performs tasks such as file creation, code generation, and text summarization through a web interface.
System Overview
The system follows a modular pipeline:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output
Each component is designed to be independent, making the system easier to debug, optimize, and extend.
Architecture
1. Audio Input
The system supports two input modes:
- Microphone input using Streamlit’s `audio_input`
- Audio file upload (`.wav`, `.mp3`, `.m4a`)
This ensures flexibility for both real-time interaction and testing.
2. Speech-to-Text (STT)
Speech is converted to text using OpenAI’s Whisper model running locally.
To improve performance:
- A smaller model (`tiny`) was used
- The model was cached to avoid reloading on every request
This reduced latency significantly while maintaining acceptable accuracy for command-based inputs.
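The caching step can be sketched with `functools.lru_cache`; in the actual Streamlit app, `st.cache_resource` plays the same role. The stub loader below stands in for `whisper.load_model` (an assumption about the real pipeline) so the caching behavior is visible on its own:

```python
from functools import lru_cache

load_calls = 0  # counts how many times the expensive load actually runs

@lru_cache(maxsize=1)
def load_stt_model(name: str = "tiny"):
    """Load the Whisper model once and reuse it on later requests."""
    global load_calls
    load_calls += 1
    # In the real app this line would be: whisper.load_model(name)
    return f"whisper-{name} (stub)"

model_a = load_stt_model("tiny")
model_b = load_stt_model("tiny")  # served from cache, no reload
```

The second call returns the cached object without re-running the loader, which is exactly what avoids reloading Whisper on every request.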
3. Intent Detection
Intent detection was implemented using a hybrid approach:
Rule-Based Classification
Common patterns such as:
- “write code”
- “create file”
- “summarize”
are handled instantly using keyword matching. This avoids unnecessary LLM calls and improves speed.
LLM Fallback (Ollama)
For ambiguous inputs, a local LLM (via Ollama) is used to classify intent and extract structured data.
This combination provides both speed and flexibility.
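The hybrid approach above can be sketched as a rules-first classifier with an optional LLM hook. The keyword table and the `llm_fallback` parameter are illustrative names, not the project's exact code; in the real app the fallback would call Ollama:

```python
# Keyword -> intent rules checked before any LLM call (names are illustrative).
RULES = {
    "write code": "write_code",
    "create file": "create_file",
    "summarize": "summarize",
}

def detect_intent(text, llm_fallback=None):
    """Return an intent label: rules first, LLM only for ambiguous input."""
    lowered = text.lower()
    for phrase, intent in RULES.items():
        if phrase in lowered:
            return intent
    # Only ambiguous inputs reach the (slower) local LLM.
    if llm_fallback is not None:
        return llm_fallback(text)
    return "unknown"

print(detect_intent("please write code in hello.py"))  # -> write_code
```

Because the common commands never touch the LLM, the happy path stays fast while unusual phrasings still get a flexible answer.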
4. Filename Extraction
Instead of relying on the LLM, filenames are extracted using regex directly from the transcribed text.
Example:
- Input: “write code in hello.py”
- Extracted: `hello.py`
This approach avoids inconsistencies and ensures reliable file handling.
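A minimal version of this extraction might look like the following; the extension list and the `output.py` default are assumptions for illustration:

```python
import re

# Matches names like hello.py or notes.txt (extension list is an assumption).
FILENAME_RE = re.compile(r"\b[\w\-]+\.(?:py|txt|md|json)\b", re.IGNORECASE)

def extract_filename(text, default="output.py"):
    """Pull the first filename-looking token out of the transcript."""
    match = FILENAME_RE.search(text)
    return match.group(0) if match else default

print(extract_filename("write code in hello.py"))  # -> hello.py
```

A deterministic regex gives the same answer every time, whereas asking the LLM to echo back the filename occasionally paraphrased or dropped it.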
5. Tool Execution
Based on the detected intent, specific actions are triggered:
- Create File: Creates a new file inside a restricted directory
- Write Code: Generates code using the LLM and writes it to a file
- Summarize: Returns a shortened version of the input text
All file operations are restricted to an `output/` folder to prevent unintended system modifications.
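One way to enforce that restriction is to resolve every target path and reject anything that escapes the sandbox directory. This is a sketch of the idea, not the project's exact code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename):
    """Resolve a filename inside output/ and reject path traversal."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # resolve() collapses any "../" segments, so an escape attempt ends up
    # outside OUTPUT_DIR and fails this ancestry check.
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}")
    return candidate

print(safe_path("hello.py"))  # resolves safely inside output/
# safe_path("../escape.py") would raise ValueError
```

Checking the resolved path rather than the raw string is what blocks inputs like `../etc/passwd`, which look harmless before normalization.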
6. Code Generation
Code generation is handled using a local LLM (LLaMA via Ollama).
To ensure clean output:
- Prompts explicitly restrict responses to Python code only
- Post-processing removes markdown, non-ASCII characters, and unwanted prefixes
This ensures that generated code can be written directly to files without errors.
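The post-processing step can be sketched with a few regex passes; the "Here is ..." prefix pattern is a heuristic assumption about typical chatty LLM output:

```python
import re

def clean_generated_code(raw):
    """Strip markdown fences, chatty prefixes, and non-ASCII characters."""
    text = raw.strip()
    # Remove ``` fences (with or without a language tag).
    text = re.sub(r"^```\w*\n?", "", text)
    text = re.sub(r"\n?```$", "", text)
    # Drop a leading "Here is ..." sentence if present (heuristic).
    text = re.sub(r"^Here is[^\n]*\n", "", text, flags=re.IGNORECASE)
    # Keep only ASCII so the file writes cleanly.
    return text.encode("ascii", "ignore").decode()

raw = "```python\nprint('hi')\n```"
print(clean_generated_code(raw))  # -> print('hi')
```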
7. User Interface
The UI is built using Streamlit and displays:
- Transcribed text
- Detected intent
- Generated code
- File content
- Action result
This provides full transparency into each stage of the pipeline.
Challenges Faced
1. Model Latency
Running Whisper and the LLM locally introduced noticeable delays.
Solution:
- Switched to smaller models
- Cached model loading
- Reduced unnecessary LLM calls using rule-based detection
2. Incorrect Intent Classification
The LLM sometimes misclassified inputs (e.g., treating code generation as file creation).
Solution:
- Added strict prompting rules
- Introduced rule-based overrides for critical keywords
3. Filename Extraction Issues
Initially, filenames were not reliably extracted, leading to incorrect file operations.
Solution:
- Implemented regex-based extraction
- Added fallback defaults
- Handled common speech-to-text variations
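The speech-to-text variations can be handled with a normalization pass before filename extraction. The specific substitutions below are assumptions about typical Whisper output (e.g. `hello.py` transcribed as "hello dot py"); in practice they would be tuned to what the model actually produces:

```python
import re

def normalize_transcript(text):
    """Map common spoken forms back to written filenames."""
    # e.g. "hello dot py" -> "hello.py" (assumed common STT variation)
    text = re.sub(r"\s+dot\s+", ".", text, flags=re.IGNORECASE)
    # e.g. "my file underscore two" -> "my file_two"
    text = re.sub(r"\s+underscore\s+", "_", text, flags=re.IGNORECASE)
    return text

print(normalize_transcript("write code in hello dot py"))
# -> write code in hello.py
```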
4. File Overwrite Logic
The system initially failed to write code into existing files due to premature returns in logic.
Solution:
- Ensured write operations always execute
- Separated file existence checks from write logic
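The fix amounts to making the existence check purely informational, so the write itself always runs. A minimal sketch of that separation (demoed in a temp directory, with illustrative names):

```python
from pathlib import Path
import tempfile

def write_code(path, code):
    """Always write; the existence check only informs the status message."""
    path = Path(path)
    existed = path.exists()   # checked first, purely for reporting
    path.write_text(code)     # write happens unconditionally
    return "overwritten" if existed else "created"

# Demo in a temp directory so nothing touches the real filesystem.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hello.py"
    first = write_code(target, "print('v1')")
    second = write_code(target, "print('v2')")

print(first, second)  # -> created overwritten
```

The earlier bug was an early `return` on the "file exists" branch; keeping the check and the write in separate statements makes that failure mode impossible.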
5. Noisy LLM Output
Generated code sometimes contained:
- Markdown formatting
- Extra text
- Non-ASCII characters
Solution:
- Cleaned output using regex
- Enforced strict prompt constraints
Performance Optimizations
- Used the Whisper `tiny` model for faster transcription
- Cached models to avoid repeated loading
- Implemented rule-based intent detection
- Reduced LLM calls to only necessary cases
- Used a lighter model (`mistral`) for intent classification
Limitations
- Speech recognition may introduce minor transcription errors
- Local models require sufficient system resources
- Summarization is currently a simple placeholder
- No support for multi-step or compound commands
Future Improvements
- Support compound commands (e.g., summarize and save)
- Add confirmation before file operations
- Replace summarization with LLM-based summarization
- Maintain session memory and conversation history
- Improve UI responsiveness and feedback
Conclusion
This project demonstrates how a complete voice-controlled AI agent can be built using local models and simple tools. By combining speech recognition, intent classification, and automated execution, it is possible to create systems that bridge natural language interaction with real-world actions.
The key takeaway is that combining rule-based logic with LLM capabilities leads to systems that are both efficient and reliable.