I recently built a fully local, voice-controlled AI agent that can listen to audio, understand user intent, and execute real actions like creating files or generating code. Here's a quick breakdown of how it works and what I learned along the way.
Architecture Overview
The system follows a clean pipeline:
Audio → Speech-to-Text → Intent Detection → Tool Execution → UI Output
- Speech-to-Text (STT): I used Faster-Whisper running locally for accurate and fast transcription.
- Intent Detection: A lightweight LLM (phi3:latest via Ollama) classifies what the user wants.
- Tool Execution: Based on intent, the system triggers actions like:
  - Creating files
  - Writing code
  - Summarizing text
  - General chat
- Frontend: Built with Flask + a simple UI to visualize each stage of the pipeline.
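In code, the whole flow boils down to a small composition of stages. This is an illustrative sketch, not my exact implementation: the function names and stage signatures here are my own, and each stage is passed in as a plain callable.

```python
def run_pipeline(audio_path, transcribe, detect_intent, execute_tool):
    """Run Audio -> STT -> Intent -> Tool and return every stage's
    output, so the UI can visualize the pipeline step by step."""
    transcript = transcribe(audio_path)        # e.g. Faster-Whisper
    intent = detect_intent(transcript)         # e.g. phi3 via Ollama
    result = execute_tool(intent, transcript)  # file creation, codegen, ...
    return {"transcript": transcript, "intent": intent, "result": result}
```

Injecting the stages as callables made it painless to swap models later without touching the pipeline itself.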
Why These Models?
Faster-Whisper
- Works locally (no API dependency)
- Handles multiple audio formats
- Good balance of speed and accuracy on CPU
Phi-3 (via Ollama)
- Lightweight (~2–4 GB runtime)
- Fast inference, which avoids timeouts
- Reliable for structured outputs (JSON)
Initially, I tried larger models like Qwen, but they caused latency issues and frequent timeouts on my hardware. Switching to Phi-3 made the system much more responsive.
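For the intent step, here is a rough sketch of calling Ollama's local HTTP API. The `/api/generate` endpoint, default port, and `num_predict` option are Ollama's; the prompt wording, label set, and the "fall back to chat" behavior are my assumptions, not necessarily what you'd want verbatim.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def parse_intent(raw: str) -> str:
    """Fall back to 'chat' when the model output isn't the expected JSON."""
    try:
        return json.loads(raw).get("intent", "chat")
    except (json.JSONDecodeError, AttributeError):
        return "chat"

def classify_intent(text: str) -> str:
    prompt = (
        "Classify the user's request into one of: create_file, write_code, "
        "summarize, chat. Reply with JSON like {\"intent\": \"chat\"}.\n"
        f"Request: {text}"
    )
    payload = json.dumps({
        "model": "phi3:latest",
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 64},  # cap output length to avoid timeouts
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return parse_intent(json.load(resp)["response"])
```

Capping `num_predict` was part of the same timeout fix as the model switch: the classifier only needs a few tokens of JSON, never a paragraph.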
Key Features
- Supports both mic input and file upload
- Multi-intent handling (e.g., "create a file and write code")
- Safety sandbox (output/ folder)
- Chat history memory
- Streaming responses for better UX
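The safety sandbox is essentially a path check: every tool writes inside output/, and anything that resolves outside it (e.g. via ../) is rejected. A minimal sketch, with a helper name of my own choosing:

```python
from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve filename inside the sandbox; reject traversal like '../'."""
    candidate = (SANDBOX / filename).resolve()
    if candidate != SANDBOX and SANDBOX not in candidate.parents:
        raise ValueError(f"{filename!r} escapes the sandbox")
    return candidate
```

Resolving before checking is the important part: a naive string prefix check can be fooled by `..` segments.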
Challenges I Faced
- Model Timeouts
Large models were too slow for real-time interaction. Requests would simply hang or return empty responses.
Fix: Switched to a smaller model and reduced token generation.
- Speech-to-Text Errors
Whisper sometimes misheard:
hello.py → hello.5
dot py → .5
Fix: Added preprocessing rules to normalize filenames.
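The normalization pass is just an ordered list of regex substitutions. These two rules mirror the mis-hearings above but are illustrative: in particular, the `.5` rule is deliberately aggressive and would also rewrite a genuine "0.5", so tune the list to your own transcripts.

```python
import re

# Example normalization rules for Whisper mis-hearings; order matters.
TRANSCRIPT_RULES = [
    (re.compile(r"\s*\bdot\s+py\b", re.IGNORECASE), ".py"),  # "hello dot py" -> "hello.py"
    (re.compile(r"\.5\b"), ".py"),                           # "hello.5" -> "hello.py" (aggressive!)
]

def normalize_transcript(text: str) -> str:
    for pattern, replacement in TRANSCRIPT_RULES:
        text = pattern.sub(replacement, text)
    return text
```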
- Incorrect Intent Detection
The model often classified everything as "chat," even when the user clearly wanted to create a file.
Fix: Added rule-based overrides (hybrid system = rules + LLM).
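The hybrid approach can be sketched as: deterministic keyword rules run first, and the LLM is only consulted when no rule matches. The specific patterns and intent labels here are illustrative, not my exact rule set.

```python
import re

# Rule-based overrides: obvious phrasings never reach the LLM.
INTENT_RULES = [
    (re.compile(r"\b(create|make|new)\b.*\bfile\b", re.IGNORECASE), "create_file"),
    (re.compile(r"\b(write|generate)\b.*\bcode\b", re.IGNORECASE), "write_code"),
    (re.compile(r"\bsummar", re.IGNORECASE), "summarize"),
]

def detect_intent(text, llm_classify=None):
    """Hybrid detection: rules first, LLM fallback second, 'chat' as default."""
    for pattern, intent in INTENT_RULES:
        if pattern.search(text):
            return intent
    return llm_classify(text) if llm_classify else "chat"
```

This also made the system cheaper: the common cases short-circuit before any model call happens.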
- Streaming Bugs
Enabling streaming broke responses because I was still parsing them like normal JSON.
Fix: Switched to chunk-based parsing for streaming responses.
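Ollama's streaming mode emits newline-delimited JSON: one object per chunk with a `response` fragment, plus a final object where `done` is true. So instead of parsing the whole body as a single JSON document, each line gets parsed on its own. A sketch (the function name is mine):

```python
import json

def collect_stream(lines):
    """Assemble a full reply from Ollama-style streaming output:
    newline-delimited JSON, one object per chunk."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        chunk = json.loads(line)            # each line is a complete JSON object
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

In the Flask app the same per-line logic can forward each fragment to the browser as it arrives instead of buffering the whole reply.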
What I Learned
- Smaller, faster models are often better for real-time systems.
- LLMs alone are not reliable for control logic; rules are essential.
- Preprocessing (especially for speech input) is critical.
- Good UX (like streaming) makes a huge difference in perception.
Final Thoughts
This project taught me how to build a practical AI system: not just a model, but a full pipeline that works reliably in real-world conditions.
If I were to extend this further, I'd add:
- Real-time voice streaming
- Persistent memory (vector DB)
- Better UI with live token streaming
Thanks for reading!