Building a Voice-Controlled Local AI Agent (with Streamlit + Ollama)

Nidhesh Gomai — Sun, 12 Apr 2026 21:34:01 +0000

🧠 Introduction

Voice interfaces are becoming a natural way to interact with software. In this project, I built a Voice-Controlled Local AI Agent that can:

Take audio or text input
Convert speech to text
Detect user intent using a local LLM
Execute actions like file creation, code generation, summarization, and chat
Display everything in a clean web UI

The entire system runs locally and is tuned for a low-end laptop (8 GB RAM, CPU only).

🎯 What the Agent Can Do

The agent supports four core intents:

create_file → Generate a new file
write_code → Write/update code in a file
summarize → Summarize input text
chat → General conversation

It also includes:

✅ Session memory (history tracking)
✅ Error handling (graceful degradation)
✅ Human-in-the-loop confirmation for file operations

🏗️ System Architecture

The system follows a linear pipeline in which each stage feeds the next:

Audio/Text Input → Speech-to-Text → Intent Detection → Tool Execution → UI Display → Memory

Components:

Frontend: Streamlit
STT: Whisper (CPU-based)
LLM: Ollama (phi3)
Execution Layer: Python functions
Memory: Streamlit session state

🛠️ Tech Stack

Python
Streamlit
Ollama (phi3 model)
Whisper (speech-to-text)

⚙️ How It Works

Audio Input

Users can either:

Upload an audio file
Record from microphone

The audio is transcribed using Whisper.

Intent Detection

The transcribed text is passed to the local LLM via Ollama.

A classification prompt asks the model to pick exactly one of four intents:

create_file, write_code, summarize, chat
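A minimal sketch of how this classification step could look. It assumes Ollama is serving on its default local endpoint and that the model is asked to reply with the intent name only. The prompt wording and the helper names (`build_prompt`, `parse_intent`, `detect_intent`) are illustrative, not taken from the project's repository. Anything the model returns that does not match a known intent falls back to chat.

```python
import json
import urllib.request

INTENTS = ("create_file", "write_code", "summarize", "chat")

# Assumption: Ollama's default local endpoint; adjust if your server differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_prompt(text: str) -> str:
    """Ask the model to answer with exactly one intent label."""
    return (
        "Classify the user's request into exactly one of these intents: "
        + ", ".join(INTENTS)
        + ".\nReply with the intent name only.\n\n"
        + f"Request: {text}\nIntent:"
    )


def parse_intent(reply: str) -> str:
    """Map a free-form model reply to a known intent, defaulting to chat."""
    cleaned = reply.strip().lower()
    for intent in INTENTS:
        if intent in cleaned:
            return intent
    return "chat"


def detect_intent(text: str, model: str = "phi3") -> str:
    """Send the prompt to a local Ollama server and parse the reply."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(text), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            reply = json.loads(response.read())["response"]
    except (OSError, KeyError, ValueError):
        # Server unreachable or malformed reply: degrade to plain chat.
        return "chat"
    return parse_intent(reply)
```

Small models often wrap the label in extra words, which is why `parse_intent` searches for the label inside the reply instead of requiring an exact match.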

Tool Execution

Based on the detected intent:

File operations: Generate and save files inside an output/ folder
Summarization: Condense long text
Chat: Generate conversational responses
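One common way to wire intents to actions is a dispatch table. The sketch below is an assumption about the shape of the execution layer, not the project's actual code: the handler bodies are placeholders, and `execute` is a hypothetical name. It also shows where graceful error handling could sit.

```python
from typing import Callable, Dict


def summarize(text: str) -> str:
    # Placeholder: a real handler would ask the LLM for a summary.
    return "summary: " + text[:60]


def chat(text: str) -> str:
    # Placeholder: a real handler would forward the text to the LLM.
    return "reply: " + text


def create_file(text: str) -> str:
    # Placeholder: a real handler would stage a file write for confirmation.
    return "staged: create file"


def write_code(text: str) -> str:
    # Placeholder: a real handler would generate code and stage the write.
    return "staged: write code"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "chat": chat,
}


def execute(intent: str, text: str) -> dict:
    """Run the handler for an intent; unknown intents fall back to chat."""
    handler = HANDLERS.get(intent, chat)
    try:
        return {"intent": intent, "ok": True, "output": handler(text)}
    except Exception as exc:
        # Report the failure to the UI instead of crashing the app.
        return {"intent": intent, "ok": False, "output": f"Error: {exc}"}
```

Adding a fifth intent then only requires a new function and one extra entry in `HANDLERS`.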

UI Display

The Streamlit UI shows:

Transcribed text
Detected intent
Action taken
Output result
Session history
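The session history can be kept as a list of turn records. This sketch assumes the history lives under a `"history"` key and is capped at a fixed number of turns. Both choices, and the `record_turn` helper itself, are illustrative. The function only needs dictionary-style access, so in a Streamlit app `st.session_state` could be passed as `state`.

```python
from typing import Any, Dict, List, MutableMapping


def record_turn(
    state: MutableMapping[str, Any],
    transcript: str,
    intent: str,
    action: str,
    output: str,
    max_turns: int = 50,
) -> List[Dict[str, str]]:
    """Append one interaction to the session history and return the history."""
    if "history" not in state:
        state["history"] = []
    state["history"].append(
        {
            "transcript": transcript,
            "intent": intent,
            "action": action,
            "output": output,
        }
    )
    # Keep only the most recent turns so memory use stays bounded.
    state["history"] = state["history"][-max_turns:]
    return state["history"]
```

Each record carries the same fields the UI displays, so rendering the history is a loop over this list.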

⚠️ Challenges & Solutions

🔴 1. Speech-to-Text Accuracy

Whisper on CPU produced inconsistent results.

Solution:
Added a manual text input fallback and allowed users to edit transcription.
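A rough sketch of that fallback logic, assuming the transcriber is passed in as a function so the Whisper call stays swappable. `resolve_input` and its return labels are hypothetical names chosen for illustration. Transcription is tried first, and typed text is used when the audio is missing, fails, or comes back empty.

```python
from typing import Callable, Optional, Tuple


def resolve_input(
    audio_path: Optional[str],
    typed_text: str,
    transcribe: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (text, source), where source is 'audio', 'typed', or 'none'."""
    if audio_path:
        try:
            text = transcribe(audio_path).strip()
        except Exception:
            # Treat any transcription failure as "no transcript available".
            text = ""
        if text:
            return text, "audio"
    typed = typed_text.strip()
    if typed:
        return typed, "typed"
    return "", "none"
```

With the `openai-whisper` package, `transcribe` could be something like `lambda path: model.transcribe(path)["text"]` after `model = whisper.load_model("base")`. Whether the project uses that package and model size is an assumption. The returned text can then be shown in an editable box such as `st.text_area`, so the user can correct it before intent detection runs.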

🔴 2. API Rate Limits

Initial attempts using cloud APIs failed due to rate limits.

Solution:
Switched to Ollama, enabling fully local inference.

🔴 3. Hardware Constraints

Running large models on a low-end laptop was slow.

Solution:
Used phi3, a lightweight model small enough to run on a CPU-only machine.

🔴 4. File Safety

An unrestricted file tool could write files anywhere on the system.

Solution:
Restricted all file operations to a dedicated output/ folder.
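One way to enforce this restriction is to resolve every requested path and reject anything that lands outside the output folder. The sketch below is illustrative: `safe_output_path` and `write_output_file` are hypothetical helpers, and the check relies on `pathlib` resolution to neutralise `..` segments and absolute paths.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")


def safe_output_path(filename: str, base_dir: Path = OUTPUT_DIR) -> Path:
    """Resolve filename inside base_dir, rejecting anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / filename).resolve()
    # The resolved path must sit strictly below the output folder.
    if base not in candidate.parents:
        raise ValueError(f"Refusing to write outside {base}: {filename!r}")
    return candidate


def write_output_file(
    filename: str, content: str, base_dir: Path = OUTPUT_DIR
) -> Path:
    """Write content to a validated path, creating subfolders as needed."""
    path = safe_output_path(filename, base_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```

Because the filename may come from LLM output, validating the resolved path is safer than checking the raw string for `..`.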

🧠 Why Ollama?

Ollama made it possible to:

Run LLMs locally
Avoid API costs and limits
Maintain privacy
Keep the system responsive

🔐 Safety Considerations

To prevent accidental system changes:

All files are created only inside the output/ directory
File operations require user confirmation
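The confirmation step can be modelled as a two-phase action: stage it first, then run it only after approval. This is a sketch under that assumption. `PendingAction` and `ConfirmationGate` are hypothetical names, not the project's classes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PendingAction:
    description: str
    run: Callable[[], str]


class ConfirmationGate:
    """Holds at most one staged action until the user approves or rejects it."""

    def __init__(self) -> None:
        self.pending: Optional[PendingAction] = None

    def propose(self, description: str, run: Callable[[], str]) -> str:
        # Stage the action; nothing is executed yet.
        self.pending = PendingAction(description, run)
        return f"Confirm: {description}?"

    def resolve(self, approved: bool) -> str:
        if self.pending is None:
            return "Nothing to confirm."
        # Clear the staged action first so it cannot run twice.
        action, self.pending = self.pending, None
        if not approved:
            return f"Cancelled: {action.description}"
        return action.run()
```

In a Streamlit app, the gate could be stored in session state and `resolve` called from a pair of `st.button` widgets labelled confirm and cancel.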

🔮 Future Improvements

Multi-step commands (e.g., “summarize and save to file”)
Better speech recognition
Persistent memory (database)
Voice feedback (text-to-speech)

🏁 Conclusion

This project demonstrates how a complete AI agent pipeline can be built using local tools. Despite hardware limitations, it delivers:

Real-time interaction
Multi-intent execution
Clean UI experience

It highlights the power of combining:

Speech processing
Language models
System automation

🔗 Links

💻 GitHub Repository: https://github.com/NidheshGomai/Voice-Controlled-Local-AI-Agent
🎥 Demo Video: https://youtu.be/CI2mNQl-Bh4

DEV Community: Nidhesh Gomai
