I recently built a fully local, voice-controlled AI agent that can listen to audio, understand user intent, and execute real actions like creating files or generating code. Here's a quick breakdown of how it works and what I learned along the way.
## Architecture Overview
The system follows a clean pipeline:
Audio → Speech-to-Text → Intent Detection → Tool Execution → UI Output
- **Speech-to-Text (STT):** Faster-Whisper running locally for fast, accurate transcription.
- **Intent Detection:** A lightweight LLM (phi3:latest via Ollama) classifies what the user wants.
- **Tool Execution:** Based on the detected intent, the system triggers actions such as:
  - Creating files
  - Writing code
  - Summarizing text
  - General chat
- **Frontend:** Flask plus a simple UI that visualizes each stage of the pipeline.
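The stages above can be sketched as a chain of small functions. This is a minimal illustration with stubbed-out stages, not the project's actual code:

```python
# Minimal sketch of the pipeline; each real stage is stubbed out here.
def transcribe(audio_path: str) -> str:
    # The real system runs Faster-Whisper on the audio file.
    return "create a file called notes.txt"

def detect_intent(text: str) -> str:
    # The real system asks an LLM; a keyword check stands in for it.
    return "create_file" if "create a file" in text else "chat"

def execute(intent: str, text: str) -> str:
    # Dispatch the detected intent to a tool handler.
    handlers = {"create_file": lambda t: f"[tool] created file from: {t}"}
    return handlers.get(intent, lambda t: f"[chat] {t}")(text)

def run_pipeline(audio_path: str) -> str:
    text = transcribe(audio_path)
    return execute(detect_intent(text), text)
```

The point of keeping each stage a plain function is that any one of them can be swapped out (a different STT model, a different classifier) without touching the rest.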
## Why These Models?

### Faster-Whisper

- Runs locally (no API dependency)
- Handles multiple audio formats
- Good balance of speed and accuracy on CPU

### Phi-3 (via Ollama)

- Lightweight (~2-4 GB at runtime)
- Fast inference, which avoids timeouts
- Reliable for structured outputs (JSON)
Initially, I tried larger models like Qwen, but they caused latency issues and frequent timeouts on my hardware. Switching to Phi-3 made the system much more responsive.
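For structured output, it helps to ask the model for JSON directly. Here is a sketch of a request payload for Ollama's `/api/generate` endpoint; the prompt wording and intent labels are illustrative, not the exact ones from the project:

```python
INTENTS = ["create_file", "write_code", "summarize", "chat"]

def build_intent_request(user_text: str) -> dict:
    # "format": "json" tells Ollama to constrain the model to valid JSON.
    prompt = (
        f"Classify the request into one of {INTENTS}. "
        'Reply only as JSON: {"intent": "..."}\n'
        f"Request: {user_text}"
    )
    return {
        "model": "phi3:latest",
        "prompt": prompt,
        "format": "json",
        "stream": False,
    }

# Usage (assumes a local Ollama server on its default port):
# import json, requests
# resp = requests.post("http://localhost:11434/api/generate",
#                      json=build_intent_request("make a file hello.py"))
# intent = json.loads(resp.json()["response"])["intent"]
```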
## Key Features

- Supports both mic input and file upload
- Multi-intent handling (e.g., "create a file and write code")
- Safety sandbox (all writes go to an `output/` folder)
- Chat history memory
- Streaming responses for better UX
## Challenges I Faced

### Model Timeouts

Large models were too slow for real-time interaction. Requests would simply hang or return empty responses.

**Fix:** Switched to a smaller model and reduced the number of generated tokens.
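Concretely, the fix amounted to pointing at the smaller model and capping generation. A sketch of the request options (Ollama's `num_predict` option caps generated tokens; the numbers here are illustrative):

```python
def capped_request(prompt: str, max_tokens: int = 256) -> dict:
    # Smaller model + a hard cap on generated tokens keeps latency bounded.
    return {
        "model": "phi3:latest",                  # instead of a larger model
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},  # Ollama's token cap
    }

# A client-side timeout catches anything that still hangs:
# requests.post("http://localhost:11434/api/generate",
#               json=capped_request(prompt), timeout=30)
```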
### Speech-to-Text Errors

Whisper sometimes misheard filenames:

- `hello.py` → `hello.5`
- "dot py" → `.5`

**Fix:** Added preprocessing rules to normalize filenames.
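The rules were simple regex substitutions. A sketch along those lines (the project's exact patterns differ; these cover the two mishearings above):

```python
import re

# Substitutions for common Whisper mishearings of spoken filenames.
FILENAME_RULES = [
    (r"\bdot\s+py\b", ".py"),   # "dot py" spoken aloud
    (r"\.5\b", ".py"),          # ".py" transcribed as ".5"
    (r"\s+(\.py)\b", r"\1"),    # "hello .py" -> "hello.py"
]

def normalize_transcript(text: str) -> str:
    for pattern, repl in FILENAME_RULES:
        text = re.sub(pattern, repl, text)
    return text
```

Running the normalizer before intent detection means the downstream stages never see the garbled filenames.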
### Incorrect Intent Detection

The model often classified everything as "chat," even when the user clearly wanted to create a file.

**Fix:** Added rule-based overrides (a hybrid system of rules + LLM).
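The hybrid works by letting cheap keyword rules short-circuit the LLM. A sketch (the phrases and intent names here are illustrative):

```python
# Keyword rules take precedence; the LLM is only consulted when none fire.
INTENT_RULES = {
    "create_file": ("create a file", "make a file", "new file"),
    "write_code": ("write code", "write a script", "generate code"),
    "summarize": ("summarize", "give me a summary"),
}

def classify(text: str, llm_fallback=lambda t: "chat") -> str:
    lowered = text.lower()
    for intent, phrases in INTENT_RULES.items():
        if any(p in lowered for p in phrases):
            return intent           # rule override, no LLM call
    return llm_fallback(text)       # ambiguous input goes to the model
```

Besides fixing the misclassification, the rules make the common cases fast and deterministic, which matters for a real-time system.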
### Streaming Bugs

Enabling streaming broke responses because I was still parsing them as one complete JSON object.

**Fix:** Switched to chunk-based parsing for streaming responses.
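Ollama streams one JSON object per line, each carrying a small `response` fragment until a final object with `"done": true`. A sketch of the chunk-based parser (the function names are mine, not the project's):

```python
import json

def stream_text(lines):
    # Yield each text fragment from a newline-delimited JSON stream.
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        if chunk.get("done"):
            break
        yield chunk.get("response", "")

# Usage with requests (stream=True keeps the connection open):
# resp = requests.post(url, json=payload, stream=True)
# for fragment in stream_text(resp.iter_lines(decode_unicode=True)):
#     print(fragment, end="", flush=True)
```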
## What I Learned

- Smaller, faster models are often better for real-time systems.
- LLMs alone are not reliable for control logic; rules are essential.
- Preprocessing (especially for speech input) is critical.
- Good UX (like streaming) makes a huge difference in perceived responsiveness.
## Final Thoughts

This project taught me how to build a practical AI system: not just a model, but a full pipeline that works reliably in real-world conditions.

If I were to extend this further, I'd add:

- Real-time voice streaming
- Persistent memory (vector DB)
- Better UI with live token streaming

Thanks for reading!