Jessica Ekka
Local Voice-Controlled AI Agent (Whisper + Ollama + Streamlit)

Most AI assistants today rely heavily on cloud APIs. While powerful, they introduce latency, cost, and privacy concerns.

So I built a fully local voice-controlled AI agent that can:

  • Understand voice commands
  • Detect user intent
  • Generate code
  • Create files
  • Summarize text
  • Chat interactively

All running completely offline using open-source tools.

System Architecture

End-to-End Flow
User Input (Voice/Text)

Speech-to-Text (Whisper)

Intent Detection (Rules + LLM)

Execution Engine
├── File Operations
├── Code Generation
├── Summarization
└── Chat

Streamlit UI (Results + Memory)

Component Breakdown

app.py → Streamlit UI + orchestration
agent.py → intent detection + LLM calls
tools.py → sandboxed file operations
stt.py → speech-to-text (Whisper)

How It Works

  1. Input Layer

User provides either:

  • Voice input (recorded via browser)
  • A typed text command

  2. Speech-to-Text

Voice input is converted to text using Whisper:

"create a file called hello.txt"

  3. Intent Detection

A hybrid approach is used:

  • Rule-based classification (fast + reliable)
  • LLM fallback for flexibility

Example:

"Write a Python function for factorial"
→ Intent: write_code
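A minimal sketch of this hybrid approach (the keyword table and `detect_intent` helper are illustrative, not the exact code from agent.py, and the LLM fallback is left as an injectable stub):

```python
# Hypothetical hybrid intent detection: fast keyword rules first,
# with an LLM fallback for anything the rules don't catch.
RULES = {
    "create_file": ("create a file", "make a file", "new file"),
    "write_code": ("write a", "code for", "function for"),
    "summarize": ("summarize", "summary of"),
}

def detect_intent(command: str, llm_fallback=None) -> str:
    text = command.lower()
    for intent, keywords in RULES.items():
        if any(kw in text for kw in keywords):
            return intent             # rule matched: fast + reliable
    if llm_fallback is not None:
        return llm_fallback(command)  # flexible but slower
    return "chat"                     # default: conversational

print(detect_intent("Write a Python function for factorial"))  # write_code
```

Rules handle the common phrasings cheaply; only ambiguous commands ever pay the cost of an LLM round trip.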

  4. Execution Engine

Depending on the detected intent:

  • create_file → writes to the sandbox
  • write_code → calls the LLM
  • summarize → LLM summarization
  • chat → conversational response

  5. UI Layer

Built with Streamlit, the UI:

  • Shows the transcription
  • Displays the detected intent
  • Requires confirmation for file actions
  • Displays results + saved files
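The execution engine's intent-to-handler dispatch can be sketched as a plain lookup table (the handler stubs below are illustrative, not the real tools.py functions):

```python
# Hypothetical dispatch table mapping each detected intent to a handler.
def create_file(cmd): return f"created file for: {cmd}"
def write_code(cmd): return f"generated code for: {cmd}"
def summarize(cmd): return f"summary of: {cmd}"
def chat(cmd): return f"chat reply to: {cmd}"

HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "chat": chat,
}

def execute(intent: str, command: str) -> str:
    # Unknown intents degrade gracefully to a chat response.
    handler = HANDLERS.get(intent, chat)
    return handler(command)

print(execute("write_code", "factorial function"))
```

Adding a new capability then means adding one handler and one table entry, with no changes to the dispatch logic.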

Security & Sandboxing

All file operations are sandboxed to:

/output/

This prevents:

  • Directory traversal (../../)
  • Overwriting system files
  • Unsafe file access
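The sandbox check can be sketched roughly like this (a simplified version, not the exact tools.py code):

```python
import os

SANDBOX = os.path.abspath("output")

def safe_path(filename: str) -> str:
    """Resolve filename inside the sandbox, rejecting escapes."""
    candidate = os.path.abspath(os.path.join(SANDBOX, filename))
    # The resolved path must stay inside the sandbox directory.
    if os.path.commonpath([candidate, SANDBOX]) != SANDBOX:
        raise ValueError(f"Blocked unsafe path: {filename}")
    return candidate

print(safe_path("hello.txt"))      # allowed
# safe_path("../../etc/passwd")    # raises ValueError
```

Resolving to an absolute path first is the key step: `../../` tricks disappear once the path is normalized, so the containment check is a simple prefix comparison.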

Model Strategy

Running large models locally can be tricky, so I used:

| Model       | Purpose            |
| ----------- | ------------------ |
| llama3.2:3b | Primary model      |
| llama3.2:1b | Fallback (low RAM) |

Fallback Mechanism

If the main model fails to load or respond, the agent automatically switches to the smaller one.

This ensures stability even on low-memory systems.
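The fallback logic can be sketched generically; `ask` below stands in for the actual Ollama call, injected so the retry loop is visible on its own:

```python
# Hypothetical fallback wrapper: try the primary model, and on any
# failure (OOM, timeout, missing model) retry with the smaller one.
MODELS = ["llama3.2:3b", "llama3.2:1b"]

def generate(prompt: str, ask) -> str:
    """ask(model, prompt) is the real LLM call, passed in for clarity."""
    last_error = None
    for model in MODELS:
        try:
            return ask(model, prompt)
        except Exception as e:  # model failed: fall through to the next
            last_error = e
    raise RuntimeError(f"All models failed: {last_error}")

# Demo with a fake backend where the 3B model "runs out of memory":
def fake_ask(model, prompt):
    if model == "llama3.2:3b":
        raise MemoryError("out of RAM")
    return f"[{model}] answer to: {prompt}"

print(generate("hello", fake_ask))  # served by llama3.2:1b
```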

Dynamic Model Switching

The UI includes a dropdown to switch models in real time:

  • No restart required
  • Useful for testing performance
  • Helps in benchmarking

Session Memory (Bonus Feature)

The system maintains a short-term memory:

  • Stores recent commands
  • Tracks detected intents
  • Displays recent activity
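A minimal sketch of such a short-term memory, using a bounded deque (the real app would keep this in Streamlit's st.session_state; the class here is illustrative):

```python
from collections import deque

# Hypothetical short-term memory: keeps only the N most recent entries.
class SessionMemory:
    def __init__(self, max_items: int = 5):
        self.entries = deque(maxlen=max_items)

    def add(self, command: str, intent: str) -> None:
        self.entries.append({"command": command, "intent": intent})

    def recent(self):
        return list(self.entries)

memory = SessionMemory(max_items=2)
memory.add("create hello.txt", "create_file")
memory.add("summarize notes", "summarize")
memory.add("hi", "chat")  # oldest entry is evicted automatically
print(memory.recent())
```

The `maxlen` bound keeps memory usage constant no matter how long the session runs.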

Example:

Command: create hello.txt → Intent: create_file

⚠️ Challenges Faced

  1. LLM Returning Bad JSON

Sometimes the model output was malformed.

Fix:

  • Avoid strict JSON parsing
  • Use a rule-based fallback
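One way to sketch the tolerant parsing: pull the first JSON-looking object out of the raw LLM text, and fall back to a rule-based default if that fails (illustrative, not the exact fix):

```python
import json
import re

def parse_llm_json(raw: str, fallback_intent: str = "chat") -> dict:
    """Try to extract a JSON object from noisy LLM output."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the safe default
    return {"intent": fallback_intent}

# The model wrapped its answer in chatter, but the JSON is recoverable:
print(parse_llm_json('Sure! Here you go: {"intent": "write_code"}'))
# Completely malformed output degrades safely instead of crashing:
print(parse_llm_json("I think the intent is write_code"))
```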

  2. High Memory Usage

Large models like 70B were unusable locally.

Fix:

  • Switched to smaller models (3B, 1B)
  • Added fallback logic

  3. Voice Misinterpretation

Example:

"write" → "right"

Fix:

Added text cleaning layer
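The cleaning layer can be sketched as a small homophone map applied to the transcript (the word list below is illustrative, not the project's actual table):

```python
import re

# Hypothetical homophone fixes for common Whisper mishearings.
HOMOPHONES = {"right": "write", "reed": "read", "knew": "new"}

def clean_transcript(text: str) -> str:
    def fix(match):
        word = match.group(0)
        return HOMOPHONES.get(word.lower(), word)
    # Replace whole words only, leaving everything else untouched.
    return re.sub(r"[a-zA-Z]+", fix, text)

print(clean_transcript("right a function for factorial"))
# → "write a function for factorial"
```

A blanket map like this would also mangle legitimate uses ("turn right"), so in practice such fixes are best applied only where an action verb is expected, such as the start of a command.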

  4. Parameter Extraction Issues

Example:

"write hello world in it"

Fix:

  • Regex-based extraction
  • Post-cleaning of phrases
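A sketch of the regex-based extraction for commands like "create a file called hello.txt and write hello world in it" (the patterns and defaults are illustrative):

```python
import re

def extract_file_params(command: str):
    """Pull a filename and optional content out of a create-file command."""
    name = re.search(r"(?:called|named)\s+(\S+)", command)
    # Lazy match so trailing "in it" is stripped from the content.
    content = re.search(r"write\s+(.+?)\s+in\s+it\s*$", command)
    filename = name.group(1) if name else "untitled.txt"
    text = content.group(1) if content else ""
    return filename, text

cmd = "create a file called hello.txt and write hello world in it"
print(extract_file_params(cmd))  # ('hello.txt', 'hello world')
```

The lazy `(.+?)` plus the anchored `in it$` is what discards the filler phrase instead of writing it into the file.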

Bonus Features Implemented

  • Human-in-the-loop confirmation
  • Graceful error handling
  • Session memory
  • Model switching
  • Sandboxed file system

Future Improvements

  • Multi-command execution (e.g. “Summarize this and save it to a file”)
  • Persistent memory (database)
  • Model benchmarking dashboard
  • Smarter NLP-based intent detection
