Building a Voice-Controlled Local AI Agent Using Whisper and Ollama
Introduction
Voice interfaces are becoming an important way to interact with intelligent systems. This project explores how to build a local AI agent that can understand spoken commands, interpret user intent, and execute real actions on a system.
The goal was to create a complete pipeline that takes audio input, converts it into text, analyzes the intent using a language model, and performs tasks such as file creation, code generation, and text summarization through a web interface.
System Overview
The system follows a modular pipeline:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output
Each component is designed to be independent, making the system easier to debug, optimize, and extend.
Architecture
1. Audio Input
The system supports two input modes:
- Microphone input using Streamlit’s `audio_input`
- Audio file upload (`.wav`, `.mp3`, `.m4a`)
This ensures flexibility for both real-time interaction and testing.
2. Speech-to-Text (STT)
Speech is converted to text using OpenAI’s Whisper model running locally.
To improve performance:
- A smaller model (`tiny`) was used
- The model was cached to avoid reloading on every request
This reduced latency significantly while maintaining acceptable accuracy for command-based inputs.
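The caching step can be sketched with `functools.lru_cache`; in the actual Streamlit app, `st.cache_resource` plays the same role. The stub loader below stands in for `whisper.load_model` (an assumption about the real pipeline) so the caching behavior is visible on its own:

```python
from functools import lru_cache

load_calls = 0  # counts how many times the expensive load actually runs

@lru_cache(maxsize=1)
def load_stt_model(name: str = "tiny"):
    """Load the Whisper model once and reuse it on later requests."""
    global load_calls
    load_calls += 1
    # In the real app this line would be: whisper.load_model(name)
    return f"whisper-{name} (stub)"

model_a = load_stt_model("tiny")
model_b = load_stt_model("tiny")  # served from cache, no reload
```

The second call returns the cached object without re-running the loader, which is exactly what avoids reloading Whisper on every request.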
3. Intent Detection
Intent detection was implemented using a hybrid approach:
Rule-Based Classification
Common patterns such as:
- “write code”
- “create file”
- “summarize”
are handled instantly using keyword matching. This avoids unnecessary LLM calls and improves speed.
LLM Fallback (Ollama)
For ambiguous inputs, a local LLM (via Ollama) is used to classify intent and extract structured data.
This combination provides both speed and flexibility.
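The hybrid approach above can be sketched as a rules-first classifier with an optional LLM hook. The keyword table and the `llm_fallback` parameter are illustrative names, not the project's exact code; in the real app the fallback would call Ollama:

```python
# Keyword -> intent rules checked before any LLM call (names are illustrative).
RULES = {
    "write code": "write_code",
    "create file": "create_file",
    "summarize": "summarize",
}

def detect_intent(text, llm_fallback=None):
    """Return an intent label: rules first, LLM only for ambiguous input."""
    lowered = text.lower()
    for phrase, intent in RULES.items():
        if phrase in lowered:
            return intent
    # Only ambiguous inputs reach the (slower) local LLM.
    if llm_fallback is not None:
        return llm_fallback(text)
    return "unknown"

print(detect_intent("please write code in hello.py"))  # -> write_code
```

Because the common commands never touch the LLM, the happy path stays fast while unusual phrasings still get a flexible answer.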
4. Filename Extraction
Instead of relying on the LLM, filenames are extracted using regex directly from the transcribed text.
Example:
- Input: “write code in hello.py”
- Extracted: `hello.py`
This approach avoids inconsistencies and ensures reliable file handling.
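A minimal version of this extraction might look like the following; the extension list and the `output.py` default are assumptions for illustration:

```python
import re

# Matches names like hello.py or notes.txt (extension list is an assumption).
FILENAME_RE = re.compile(r"\b[\w\-]+\.(?:py|txt|md|json)\b", re.IGNORECASE)

def extract_filename(text, default="output.py"):
    """Pull the first filename-looking token out of the transcript."""
    match = FILENAME_RE.search(text)
    return match.group(0) if match else default

print(extract_filename("write code in hello.py"))  # -> hello.py
```

A deterministic regex gives the same answer every time, whereas asking the LLM to echo back the filename occasionally paraphrased or dropped it.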
5. Tool Execution
Based on the detected intent, specific actions are triggered:
- Create File: Creates a new file inside a restricted directory
- Write Code: Generates code using the LLM and writes it to a file
- Summarize: Returns a shortened version of the input text
All file operations are restricted to an `output/` folder to prevent unintended system modifications.
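One way to enforce that restriction is to resolve every target path and reject anything that escapes the sandbox directory. This is a sketch of the idea, not the project's exact code:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename):
    """Resolve a filename inside output/ and reject path traversal."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # resolve() collapses any "../" segments, so an escape attempt ends up
    # outside OUTPUT_DIR and fails this ancestry check.
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}")
    return candidate

print(safe_path("hello.py"))  # resolves safely inside output/
# safe_path("../escape.py") would raise ValueError
```

Checking the resolved path rather than the raw string is what blocks inputs like `../etc/passwd`, which look harmless before normalization.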
6. Code Generation
Code generation is handled using a local LLM (LLaMA via Ollama).
To ensure clean output:
- Prompts explicitly restrict responses to Python code only
- Post-processing removes markdown, non-ASCII characters, and unwanted prefixes
This ensures that generated code can be written directly to files without errors.
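The post-processing step can be sketched with a few regex passes; the "Here is ..." prefix pattern is a heuristic assumption about typical chatty LLM output:

```python
import re

def clean_generated_code(raw):
    """Strip markdown fences, chatty prefixes, and non-ASCII characters."""
    text = raw.strip()
    # Remove ``` fences (with or without a language tag).
    text = re.sub(r"^```\w*\n?", "", text)
    text = re.sub(r"\n?```$", "", text)
    # Drop a leading "Here is ..." sentence if present (heuristic).
    text = re.sub(r"^Here is[^\n]*\n", "", text, flags=re.IGNORECASE)
    # Keep only ASCII so the file writes cleanly.
    return text.encode("ascii", "ignore").decode()

raw = "```python\nprint('hi')\n```"
print(clean_generated_code(raw))  # -> print('hi')
```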
7. User Interface
The UI is built using Streamlit and displays:
- Transcribed text
- Detected intent
- Generated code
- File content
- Action result
This provides full transparency into each stage of the pipeline.
Challenges Faced
1. Model Latency
Running Whisper and the LLM locally introduced noticeable delays.
Solution:
- Switched to smaller models
- Cached model loading
- Reduced unnecessary LLM calls using rule-based detection
2. Incorrect Intent Classification
The LLM sometimes misclassified inputs (e.g., treating code generation as file creation).
Solution:
- Added strict prompting rules
- Introduced rule-based overrides for critical keywords
3. Filename Extraction Issues
Initially, filenames were not reliably extracted, leading to incorrect file operations.
Solution:
- Implemented regex-based extraction
- Added fallback defaults
- Handled common speech-to-text variations
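The speech-to-text variations can be handled with a normalization pass before filename extraction. The specific substitutions below are assumptions about typical Whisper output (e.g. `hello.py` transcribed as "hello dot py"); in practice they would be tuned to what the model actually produces:

```python
import re

def normalize_transcript(text):
    """Map common spoken forms back to written filenames."""
    # e.g. "hello dot py" -> "hello.py" (assumed common STT variation)
    text = re.sub(r"\s+dot\s+", ".", text, flags=re.IGNORECASE)
    # e.g. "my file underscore two" -> "my file_two"
    text = re.sub(r"\s+underscore\s+", "_", text, flags=re.IGNORECASE)
    return text

print(normalize_transcript("write code in hello dot py"))
# -> write code in hello.py
```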
4. File Overwrite Logic
The system initially failed to write code into existing files due to premature returns in logic.
Solution:
- Ensured write operations always execute
- Separated file existence checks from write logic
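The fix amounts to making the existence check purely informational, so the write itself always runs. A minimal sketch of that separation (demoed in a temp directory, with illustrative names):

```python
from pathlib import Path
import tempfile

def write_code(path, code):
    """Always write; the existence check only informs the status message."""
    path = Path(path)
    existed = path.exists()   # checked first, purely for reporting
    path.write_text(code)     # write happens unconditionally
    return "overwritten" if existed else "created"

# Demo in a temp directory so nothing touches the real filesystem.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hello.py"
    first = write_code(target, "print('v1')")
    second = write_code(target, "print('v2')")

print(first, second)  # -> created overwritten
```

The earlier bug was an early `return` on the "file exists" branch; keeping the check and the write in separate statements makes that failure mode impossible.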
5. Noisy LLM Output
Generated code sometimes contained:
- Markdown formatting
- Extra text
- Non-ASCII characters
Solution:
- Cleaned output using regex
- Enforced strict prompt constraints
Performance Optimizations
- Used the Whisper `tiny` model for faster transcription
- Cached models to avoid repeated loading
- Implemented rule-based intent detection
- Reduced LLM calls to only necessary cases
- Used a lighter model (`mistral`) for intent classification
Limitations
- Speech recognition may introduce minor transcription errors
- Local models require sufficient system resources
- Summarization is currently a simple placeholder
- No support for multi-step or compound commands
Future Improvements
- Support compound commands (e.g., summarize and save)
- Add confirmation before file operations
- Replace summarization with LLM-based summarization
- Maintain session memory and conversation history
- Improve UI responsiveness and feedback
Conclusion
This project demonstrates how a complete voice-controlled AI agent can be built using local models and simple tools. By combining speech recognition, intent classification, and automated execution, it is possible to create systems that bridge natural language interaction with real-world actions.
The key takeaway is that combining rule-based logic with LLM capabilities leads to systems that are both efficient and reliable.