Dhruv Sood

Building a Voice-Controlled Local AI Agent (End-to-End)

I recently built a fully local, voice-controlled AI agent that can listen to audio, understand user intent, and execute real actions like creating files or generating code. Here’s a quick breakdown of how it works and what I learned along the way.

🧠 Architecture Overview

The system follows a clean pipeline:

Audio → Speech-to-Text → Intent Detection → Tool Execution → UI Output

Speech-to-Text (STT): I used Faster-Whisper running locally for accurate and fast transcription.
Intent Detection: A lightweight LLM (phi3:latest via Ollama) classifies what the user wants.
Tool Execution: Based on intent, the system triggers actions like:
  Creating files
  Writing code
  Summarizing text
  General chat
Frontend: Built with Flask + simple UI to visualize each stage of the pipeline.
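The stages above can be sketched as a small dispatch step: upstream, Faster-Whisper produces the transcript and phi3 produces the intent label, and the final stage just routes intent to tool. A minimal sketch (the tool names and stub bodies here are assumptions, not the project's exact handlers):

```python
# Stub tools standing in for the real handlers.
def create_file(cmd: str) -> str:
    return f"created file for: {cmd}"

def write_code(cmd: str) -> str:
    return f"generated code for: {cmd}"

def chat(cmd: str) -> str:
    return f"chat reply to: {cmd}"

TOOLS = {"create_file": create_file, "write_code": write_code, "chat": chat}

def run_pipeline(transcript: str, intent: str) -> str:
    # Upstream, STT yields `transcript` and the LLM yields `intent`;
    # this stage routes intent -> tool.
    handler = TOOLS.get(intent, chat)  # unknown intents fall back to chat
    return handler(transcript)
```

Keeping the routing in a plain dict makes it trivial to add a new tool without touching the STT or LLM stages.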
🤖 Why These Models?
🎤 Faster-Whisper
Works locally (no API dependency)
Handles multiple audio formats
Good balance of speed and accuracy on CPU
🧠 Phi-3 (via Ollama)
Lightweight (~2–4 GB runtime)
Fast inference → avoids timeouts
Reliable for structured outputs (JSON)

Initially, I tried larger models like Qwen, but they caused latency issues and frequent timeouts on my hardware. Switching to Phi-3 made the system much more responsive.

βš™οΈ Key Features
Supports both mic input and file upload
Multi-intent handling (e.g., β€œcreate a file and write code”)
Safety sandbox (output/ folder)
Chat history memory
Streaming responses for better UX
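The safety sandbox mentioned above boils down to one check: resolve every requested path inside output/ and refuse anything that escapes it. A minimal sketch (function name is mine, not the project's):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(name: str) -> Path:
    # Resolve the requested name inside output/ and reject anything
    # that escapes the sandbox (e.g. "../../etc/passwd").
    candidate = (OUTPUT_DIR / name).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"path escapes sandbox: {name}")
    return candidate
```

`Path.is_relative_to` needs Python 3.9+; on older versions the same check can be done by comparing against `candidate.parents`.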
😡 Challenges I Faced

  1. ❌ Model Timeouts

Large models were too slow for real-time interaction. Requests would simply hang or return empty responses.

👉 Fix: Switched to a smaller model and reduced token generation.
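Concretely, that fix comes down to the request settings. This sketch builds an Ollama /api/generate payload with a capped `num_predict` (the specific values here are assumptions, not the project's exact configuration):

```python
def build_generate_payload(prompt: str) -> dict:
    # Payload for Ollama's /api/generate endpoint.
    return {
        "model": "phi3:latest",
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_predict": 256,   # cap generated tokens to keep latency low
            "temperature": 0.2,   # steadier, more structured replies
        },
    }

# Sent with an explicit client-side timeout, e.g.:
# requests.post("http://localhost:11434/api/generate",
#               json=build_generate_payload(prompt), timeout=60)
```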

  2. 🎤 Speech-to-Text Errors

Whisper sometimes misheard:

hello.py → hello.5
dot py → .5

👉 Fix: Added preprocessing rules to normalize filenames.
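A couple of regex substitutions cover the mishearings above. These two rules are illustrative, not the full production set:

```python
import re

def normalize_filename(text: str) -> str:
    # Heuristic repairs for common Whisper mishearings.
    text = re.sub(r"\s*dot\s+py\b", ".py", text, flags=re.IGNORECASE)  # "dot py" -> ".py"
    text = re.sub(r"\.5\b", ".py", text)  # ".py" often transcribed as ".5"
    return text
```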

  3. 🧠 Incorrect Intent Detection

The model often classified everything as “chat,” even when the user clearly wanted to create a file.

👉 Fix: Added rule-based overrides (hybrid system = rules + LLM).
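The hybrid works by letting keyword rules win when the phrasing is unambiguous and only falling back to the LLM label otherwise. A sketch (the keyword rules are illustrative, not the exact production set):

```python
def detect_intent(text: str, llm_label: str) -> str:
    # Rules win when the phrasing is unambiguous; otherwise trust the LLM.
    t = text.lower()
    if "create" in t and "file" in t:
        return "create_file"
    if "write" in t and "code" in t:
        return "write_code"
    if "summarize" in t or "summary" in t:
        return "summarize"
    return llm_label  # fall back to the model's classification
```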

  4. 🔄 Streaming Bugs

Enabling streaming broke responses because I was still parsing them like normal JSON.

👉 Fix: Switched to chunk-based parsing for streaming responses.
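With streaming enabled, Ollama returns newline-delimited JSON rather than one big object, so each line has to be parsed on its own. A minimal sketch of that chunk-based parsing:

```python
import json

def collect_stream(lines) -> str:
    # Ollama streams one JSON object per line; each carries a "response"
    # fragment, and the final object sets "done": true.
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

In the Flask app the same loop can yield each fragment to the UI as it arrives instead of joining them at the end.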

What I Learned
Smaller, faster models are often better for real-time systems.
LLMs alone are not reliable for control logic; rules are essential.
Preprocessing (especially for speech input) is critical.
Good UX (like streaming) makes a huge difference in perception.

Final Thoughts
This project taught me how to build a practical AI system: not just a model, but a full pipeline that works reliably in real-world conditions.

If I were to extend this further, I’d add:
Real-time voice streaming
Persistent memory (vector DB)
Better UI with live token streaming

Thanks for reading! 🚀
