
Nidhesh Gomai

Building a Voice-Controlled Local AI Agent (with Streamlit + Ollama)

🧠 Introduction

Voice interfaces are becoming a natural way to interact with software. In this project, I built a Voice-Controlled Local AI Agent that can:

Take audio or text input
Convert speech to text
Detect user intent using a local LLM
Execute actions like file creation, code generation, summarization, and chat
Display everything in a clean web UI

The entire system runs locally, optimized for a low-end laptop (8GB RAM, CPU only).

🎯 What the Agent Can Do

The agent supports four core intents:

create_file → Generate a new file
write_code → Write/update code in a file
summarize → Summarize input text
chat → General conversation

It also includes:

✅ Session memory (history tracking)
✅ Error handling (graceful degradation)
✅ Human-in-the-loop confirmation for file operations

🏗️ System Architecture

The system follows a simple but powerful pipeline:

Audio/Text Input → Speech-to-Text → Intent Detection → Tool Execution → UI Display → Memory

Components:
Frontend: Streamlit
STT: Whisper (CPU-based)
LLM: Ollama (phi3)
Execution Layer: Python functions
Memory: Streamlit session state
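
The pipeline above can be sketched as a chain of small functions with a dispatch table for the four intents. This is an illustrative sketch, not the actual repository code; the handler names are placeholders:

```python
# Minimal pipeline sketch. Handler names are illustrative placeholders,
# not the actual functions from the repository.

def handle_create_file(text: str) -> str:
    return f"[create_file] would create a file for: {text}"

def handle_write_code(text: str) -> str:
    return f"[write_code] would generate code for: {text}"

def handle_summarize(text: str) -> str:
    return f"[summarize] would condense: {text}"

def handle_chat(text: str) -> str:
    return f"[chat] reply to: {text}"

# Dispatch table mapping the four supported intents to their handlers.
HANDLERS = {
    "create_file": handle_create_file,
    "write_code": handle_write_code,
    "summarize": handle_summarize,
    "chat": handle_chat,
}

def run_pipeline(text: str, intent: str) -> str:
    # Unknown intents fall back to chat, matching the graceful-degradation goal.
    handler = HANDLERS.get(intent, handle_chat)
    return handler(text)
```

Keeping the routing in a plain dict makes it trivial to add a fifth intent later: register one more handler, no branching logic to touch.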
๐Ÿ› ๏ธ Tech Stack
Python
Streamlit
Ollama (phi3 model)
Whisper (speech-to-text)
โš™๏ธ How It Works

1. Audio Input

Users can either:

Upload an audio file
Record from microphone

The audio is transcribed using Whisper.

2. Intent Detection

The transcribed text is passed to the local LLM via Ollama.

A prompt is used to classify the intent into:

create_file, write_code, summarize, chat
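
A constrained prompt plus a strict parser keeps a chatty model response mapped onto one of the four labels. This sketch assumes the ollama Python client; `PROMPT` and `parse_intent` are illustrative, not the repository's actual wording:

```python
# Ordered so the first matching label wins if a verbose reply mentions several.
INTENT_LABELS = ("create_file", "write_code", "summarize", "chat")

PROMPT = (
    "Classify the user's request into exactly one of: "
    "create_file, write_code, summarize, chat. "
    "Answer with the label only.\n\nRequest: {text}"
)

def parse_intent(raw: str) -> str:
    # LLMs often add punctuation or extra words; extract the first known label.
    cleaned = raw.strip().lower()
    for label in INTENT_LABELS:
        if label in cleaned:
            return label
    return "chat"  # safe default when nothing recognizable comes back

def detect_intent(text: str) -> str:
    # Imported lazily so this module loads even without ollama installed.
    import ollama
    response = ollama.chat(
        model="phi3",
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    return parse_intent(response["message"]["content"])
```

Defaulting unknown replies to chat means a misclassification degrades into a harmless conversation rather than an unwanted file write.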

3. Tool Execution

Based on the detected intent:

File operations: Generate and save files inside an output/ folder
Summarization: Condense long text
Chat: Generate conversational responses

4. UI Display

The Streamlit UI shows:

Transcribed text
Detected intent
Action taken
Output result
Session history
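
Session memory can be a plain list of records kept in st.session_state, which behaves like a dict. A minimal sketch, with an illustrative cap so history doesn't grow without bound (the `store` argument stands in for st.session_state):

```python
MAX_HISTORY = 50  # illustrative cap on stored interactions

def record_interaction(store: dict, transcript: str, intent: str, output: str) -> None:
    # st.session_state supports dict-style access, so it can be passed as `store`.
    history = store.setdefault("history", [])
    history.append(
        {"transcript": transcript, "intent": intent, "output": output}
    )
    # Drop the oldest entries once the cap is exceeded.
    del history[:-MAX_HISTORY]
```

Each UI rerun then just iterates over `store["history"]` to render the transcript, intent, and output columns shown above.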
โš ๏ธ Challenges & Solutions
๐Ÿ”ด 1. Speech-to-Text Accuracy

Whisper on CPU produced inconsistent results.

Solution:
Added a manual text-input fallback and allowed users to edit the transcription before it is processed.

🔴 2. API Rate Limits

Initial attempts using cloud APIs failed due to rate limits.

Solution:
Switched to Ollama, enabling fully local inference.

🔴 3. Hardware Constraints

Running large models on a low-end laptop was slow.

Solution:
Used phi3, a lightweight model small enough to run responsively on modest CPU-only hardware.

🔴 4. File Safety

Risk of writing files anywhere on the system.

Solution:
Restricted all file operations to a dedicated output/ folder.
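
One way to enforce this restriction is to resolve every requested path and verify it still sits under output/, which rejects traversal attempts like `../`. A sketch under that assumption (function names are illustrative):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename: str) -> Path:
    # Resolve the candidate path, then ensure it stays inside output/;
    # this rejects traversal attempts such as "../../etc/passwd".
    candidate = (OUTPUT_DIR / filename).resolve()
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to write outside output/: {filename}")
    return candidate

def write_file(filename: str, content: str) -> Path:
    path = safe_output_path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return path
```

Resolving before comparing is the important part: a naive string prefix check can be fooled by `..` segments or symlinks that a resolved path exposes.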

🧠 Why Ollama?

Ollama made it possible to:

Run LLMs locally
Avoid API costs and limits
Maintain privacy
Keep the system responsive
๐Ÿ” Safety Considerations

To prevent accidental system changes:

All files are created only inside the output/ directory
File operations require user confirmation

🔮 Future Improvements
Multi-step commands (e.g., “summarize and save to file”)
Better speech recognition
Persistent memory (database)
Voice feedback (text-to-speech)
๐Ÿ Conclusion

This project demonstrates how a complete AI agent pipeline can be built using local tools. Despite hardware limitations, it delivers:

Real-time interaction
Multi-intent execution
Clean UI experience

It highlights the power of combining:

Speech processing
Language models
System automation

🔗 Links
💻 GitHub Repository: https://github.com/NidheshGomai/Voice-Controlled-Local-AI-Agent
🎥 Demo Video: https://youtu.be/CI2mNQl-Bh4
