Jessica Ekka
Local Voice-Controlled AI Agent (Whisper + Ollama + Streamlit)

Most AI assistants today rely heavily on cloud APIs. While powerful, they introduce latency, cost, and privacy concerns.

So I built a fully local voice-controlled AI agent that can:

  • Understand voice commands
  • Detect user intent
  • Generate code
  • Create files
  • Summarize text
  • Chat interactively

All running completely offline using open-source tools.

System Architecture

End-to-End Flow
User Input (Voice/Text)

Speech-to-Text (Whisper)

Intent Detection (Rules + LLM)

Execution Engine
├── File Operations
├── Code Generation
├── Summarization
└── Chat

Streamlit UI (Results + Memory)

Component Breakdown

app.py → Streamlit UI + orchestration
agent.py → intent detection + LLM calls
tools.py → sandboxed file operations
stt.py → speech-to-text (Whisper)

How It Works

  1. Input Layer

User provides either:

  • Voice input (recorded via browser)
  • A typed text command

  2. Speech-to-Text

Voice input is converted to text using Whisper:

"create a file called hello.txt"

  3. Intent Detection

A hybrid approach is used:

  • Rule-based classification (fast + reliable)
  • LLM fallback for flexibility

Example:

"Write a Python function for factorial"
→ Intent: write_code
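A minimal sketch of this hybrid approach (the keyword table and `detect_intent` helper are illustrative, not the exact code from agent.py, and the LLM fallback is left as an injectable stub):

```python
# Hypothetical hybrid intent detection: fast keyword rules first,
# with an LLM fallback for anything the rules don't catch.
RULES = {
    "create_file": ("create a file", "make a file", "new file"),
    "write_code": ("write a", "code for", "function for"),
    "summarize": ("summarize", "summary of"),
}

def detect_intent(command: str, llm_fallback=None) -> str:
    text = command.lower()
    for intent, keywords in RULES.items():
        if any(kw in text for kw in keywords):
            return intent             # rule matched: fast + reliable
    if llm_fallback is not None:
        return llm_fallback(command)  # flexible but slower
    return "chat"                     # default: conversational

print(detect_intent("Write a Python function for factorial"))  # write_code
```

Rules handle the common phrasings cheaply; only ambiguous commands ever pay the cost of an LLM round trip.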

  4. Execution Engine

Depending on the detected intent:

  • create_file → writes to the sandbox
  • write_code → calls the LLM
  • summarize → LLM summarization
  • chat → conversational response

  5. UI Layer

Built with Streamlit, the UI:

  • Shows the transcription
  • Displays the detected intent
  • Requires confirmation for file actions
  • Displays results + saved files
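The execution engine's intent-to-handler dispatch can be sketched as a plain lookup table (the handler stubs below are illustrative, not the real tools.py functions):

```python
# Hypothetical dispatch table mapping each detected intent to a handler.
def create_file(cmd): return f"created file for: {cmd}"
def write_code(cmd): return f"generated code for: {cmd}"
def summarize(cmd): return f"summary of: {cmd}"
def chat(cmd): return f"chat reply to: {cmd}"

HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "chat": chat,
}

def execute(intent: str, command: str) -> str:
    # Unknown intents degrade gracefully to a chat response.
    handler = HANDLERS.get(intent, chat)
    return handler(command)

print(execute("write_code", "factorial function"))
```

Adding a new capability then means adding one handler and one table entry, with no changes to the dispatch logic.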

Security & Sandboxing

All file operations are sandboxed to:

/output/

This prevents:

  • Directory traversal (../../)
  • Overwriting system files
  • Unsafe file access
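The sandbox check can be sketched roughly like this (a simplified version, not the exact tools.py code):

```python
import os

SANDBOX = os.path.abspath("output")

def safe_path(filename: str) -> str:
    """Resolve filename inside the sandbox, rejecting escapes."""
    candidate = os.path.abspath(os.path.join(SANDBOX, filename))
    # The resolved path must stay inside the sandbox directory.
    if os.path.commonpath([candidate, SANDBOX]) != SANDBOX:
        raise ValueError(f"Blocked unsafe path: {filename}")
    return candidate

print(safe_path("hello.txt"))      # allowed
# safe_path("../../etc/passwd")    # raises ValueError
```

Resolving to an absolute path first is the key step: `../../` tricks disappear once the path is normalized, so the containment check is a simple prefix comparison.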

Model Strategy

Running large models locally can be tricky, so I used:

| Model       | Purpose            |
| ----------- | ------------------ |
| llama3.2:3b | Primary model      |
| llama3.2:1b | Fallback (low RAM) |

Fallback Mechanism

If the main model fails to load or respond, the agent automatically switches to the smaller one.

This ensures stability even on low-memory systems.
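The fallback logic can be sketched generically; `ask` below stands in for the actual Ollama call, injected so the retry loop is visible on its own:

```python
# Hypothetical fallback wrapper: try the primary model, and on any
# failure (OOM, timeout, missing model) retry with the smaller one.
MODELS = ["llama3.2:3b", "llama3.2:1b"]

def generate(prompt: str, ask) -> str:
    """ask(model, prompt) is the real LLM call, passed in for clarity."""
    last_error = None
    for model in MODELS:
        try:
            return ask(model, prompt)
        except Exception as e:  # model failed: fall through to the next
            last_error = e
    raise RuntimeError(f"All models failed: {last_error}")

# Demo with a fake backend where the 3B model "runs out of memory":
def fake_ask(model, prompt):
    if model == "llama3.2:3b":
        raise MemoryError("out of RAM")
    return f"[{model}] answer to: {prompt}"

print(generate("hello", fake_ask))  # served by llama3.2:1b
```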

Dynamic Model Switching

The UI includes a dropdown to switch models in real time:

  • No restart required
  • Useful for testing performance
  • Helps in benchmarking

Session Memory (Bonus Feature)

The system maintains a short-term memory:

  • Stores recent commands
  • Tracks detected intents
  • Displays recent activity
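A minimal sketch of such a short-term memory, using a bounded deque (the real app would keep this in Streamlit's st.session_state; the class here is illustrative):

```python
from collections import deque

# Hypothetical short-term memory: keeps only the N most recent entries.
class SessionMemory:
    def __init__(self, max_items: int = 5):
        self.entries = deque(maxlen=max_items)

    def add(self, command: str, intent: str) -> None:
        self.entries.append({"command": command, "intent": intent})

    def recent(self):
        return list(self.entries)

memory = SessionMemory(max_items=2)
memory.add("create hello.txt", "create_file")
memory.add("summarize notes", "summarize")
memory.add("hi", "chat")  # oldest entry is evicted automatically
print(memory.recent())
```

The `maxlen` bound keeps memory usage constant no matter how long the session runs.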

Example:

Command: create hello.txt → Intent: create_file

⚠️ Challenges Faced

  1. LLM Returning Bad JSON

Sometimes the model output was malformed.

Fix:

  • Avoid strict JSON parsing
  • Use a rule-based fallback
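One way to sketch the tolerant parsing: pull the first JSON-looking object out of the raw LLM text, and fall back to a rule-based default if that fails (illustrative, not the exact fix):

```python
import json
import re

def parse_llm_json(raw: str, fallback_intent: str = "chat") -> dict:
    """Try to extract a JSON object from noisy LLM output."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass  # malformed JSON: fall through to the safe default
    return {"intent": fallback_intent}

# The model wrapped its answer in chatter, but the JSON is recoverable:
print(parse_llm_json('Sure! Here you go: {"intent": "write_code"}'))
# Completely malformed output degrades safely instead of crashing:
print(parse_llm_json("I think the intent is write_code"))
```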

  2. High Memory Usage

Large models like 70B were unusable locally.

Fix:

  • Switched to smaller models (3B, 1B)
  • Added fallback logic

  3. Voice Misinterpretation

Example:

"write" → "right"

Fix:

Added text cleaning layer
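The cleaning layer can be sketched as a small homophone map applied to the transcript (the word list below is illustrative, not the project's actual table):

```python
import re

# Hypothetical homophone fixes for common Whisper mishearings.
HOMOPHONES = {"right": "write", "reed": "read", "knew": "new"}

def clean_transcript(text: str) -> str:
    def fix(match):
        word = match.group(0)
        return HOMOPHONES.get(word.lower(), word)
    # Replace whole words only, leaving everything else untouched.
    return re.sub(r"[a-zA-Z]+", fix, text)

print(clean_transcript("right a function for factorial"))
# → "write a function for factorial"
```

A blanket map like this would also mangle legitimate uses ("turn right"), so in practice such fixes are best applied only where an action verb is expected, such as the start of a command.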

  4. Parameter Extraction Issues

Example:

"write hello world in it"

Fix:

  • Regex-based extraction
  • Post-cleaning of phrases
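A sketch of the regex-based extraction for commands like "create a file called hello.txt and write hello world in it" (the patterns and defaults are illustrative):

```python
import re

def extract_file_params(command: str):
    """Pull a filename and optional content out of a create-file command."""
    name = re.search(r"(?:called|named)\s+(\S+)", command)
    # Lazy match so trailing "in it" is stripped from the content.
    content = re.search(r"write\s+(.+?)\s+in\s+it\s*$", command)
    filename = name.group(1) if name else "untitled.txt"
    text = content.group(1) if content else ""
    return filename, text

cmd = "create a file called hello.txt and write hello world in it"
print(extract_file_params(cmd))  # ('hello.txt', 'hello world')
```

The lazy `(.+?)` plus the anchored `in it$` is what discards the filler phrase instead of writing it into the file.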

Bonus Features Implemented

  • Human-in-the-loop confirmation
  • Graceful error handling
  • Session memory
  • Model switching
  • Sandboxed file system

Future Improvements

  • Multi-command execution (e.g. “Summarize this and save it to a file”)
  • Persistent memory (database)
  • Model benchmarking dashboard
  • Smarter NLP-based intent detection
