Introduction
Voice interfaces are becoming a natural way to interact with software. In this project, I built a Voice-Controlled Local AI Agent that can:
- Take audio or text input
- Convert speech to text
- Detect user intent using a local LLM
- Execute actions like file creation, code generation, summarization, and chat
- Display everything in a clean web UI
The entire system runs locally, optimized for a low-end laptop (8GB RAM, CPU only).
What the Agent Can Do
The agent supports four core intents:
- create_file → Generate a new file
- write_code → Write or update code in a file
- summarize → Summarize input text
- chat → General conversation
It also includes:
- ✅ Session memory (history tracking)
- ✅ Error handling (graceful degradation)
- ✅ Human-in-the-loop confirmation for file operations
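The session memory boils down to appending each turn to a mutable mapping. A minimal sketch (field and helper names are illustrative; in the app the mapping would be Streamlit's st.session_state, but a plain dict behaves the same way):

```python
def record_turn(state, transcript, intent, result):
    """Append one interaction to the session history.

    `state` may be a plain dict or Streamlit's st.session_state;
    both support setdefault() the same way.
    """
    state.setdefault("history", []).append(
        {"transcript": transcript, "intent": intent, "result": result}
    )
    return state["history"]

# A plain dict stands in for st.session_state here:
session = {}
record_turn(session, "summarize this article", "summarize", "(summary)")
record_turn(session, "hello", "chat", "Hi there!")
print(len(session["history"]))  # 2
```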
System Architecture
The system follows a simple but powerful pipeline:
Audio/Text Input → Speech-to-Text → Intent Detection → Tool Execution → UI Display → Memory
Components:
- Frontend: Streamlit
- STT: Whisper (CPU-based)
- LLM: Ollama (phi3)
- Execution Layer: Python functions
- Memory: Streamlit session state
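The pipeline above can be sketched as a chain of small functions. The stage bodies here are stubs standing in for Whisper, the phi3 classifier, and the tool layer; only the wiring reflects the actual flow:

```python
def transcribe(audio_bytes):
    """Stub standing in for Whisper speech-to-text."""
    return "summarize this text"

def detect_intent(text):
    """Stub standing in for the phi3 intent classifier."""
    return "summarize" if "summarize" in text else "chat"

def execute(intent, text):
    """Stub standing in for the tool-execution layer."""
    return f"[{intent}] handled: {text}"

def run_pipeline(audio_bytes=None, typed_text=None):
    """Wire the stages together: input -> STT -> intent -> tool -> result."""
    text = typed_text if typed_text else transcribe(audio_bytes)
    intent = detect_intent(text)
    return {"text": text, "intent": intent, "result": execute(intent, text)}

print(run_pipeline(typed_text="chat with me")["intent"])  # chat
```

The results dict is what the UI layer and the session memory both consume, which keeps each stage independently replaceable.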
Tech Stack
- Python
- Streamlit
- Ollama (phi3 model)
- Whisper (speech-to-text)
How It Works
- Audio Input
Users can either:
- Upload an audio file
- Record from the microphone
The audio is transcribed using Whisper.
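A minimal sketch of this step, assuming the openai-whisper package (its `whisper.load_model` / `model.transcribe` calls); the helper name and the typed-text fallback parameter are illustrative:

```python
def get_user_text(audio_path=None, typed_text=None, model_name="base"):
    """Return a Whisper transcription if audio was given, else the typed fallback."""
    if typed_text:
        return typed_text.strip()
    if audio_path:
        # Lazy import: the typed-text path needs no Whisper install.
        import whisper
        model = whisper.load_model(model_name)  # "base" is small enough for CPU
        return model.transcribe(audio_path)["text"].strip()
    raise ValueError("Provide audio_path or typed_text")
```

The typed-text branch doubles as the manual fallback for when CPU transcription quality is poor.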
- Intent Detection
The transcribed text is passed to the local LLM via Ollama.
A classification prompt maps the text to one of four intents: create_file, write_code, summarize, or chat.
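A sketch of this step, assuming Ollama's standard REST endpoint (`POST /api/generate`); the prompt wording and helper names are illustrative, and the parser degrades to chat when the model's reply is messy:

```python
import json
import urllib.request

INTENTS = ("create_file", "write_code", "summarize", "chat")

# Illustrative prompt wording; the real prompt may differ.
PROMPT = (
    "Classify the user's request as exactly one of: "
    "create_file, write_code, summarize, chat. "
    "Reply with only the label.\n\nRequest: {text}"
)

def parse_intent(raw):
    """Map a raw model reply to a known intent, defaulting to chat."""
    label = raw.strip().lower()
    for intent in INTENTS:
        if intent in label:
            return intent
    return "chat"

def detect_intent(text, model="phi3",
                  url="http://localhost:11434/api/generate"):
    """Classify `text` via a running Ollama server."""
    body = json.dumps({"model": model,
                       "prompt": PROMPT.format(text=text),
                       "stream": False}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_intent(json.load(resp)["response"])
```

Keeping the parsing separate from the HTTP call makes the fallback behavior testable without a model running.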
- Tool Execution
Based on the detected intent:
- File operations: generate and save files inside an output/ folder
- Summarization: condense long text
- Chat: generate conversational responses
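The routing can be sketched as a table mapping intents to handler functions; the handler bodies here are illustrative stubs rather than the project's actual implementations:

```python
def handle_create_file(text):
    """Stub: the real handler generates content and saves it under output/."""
    return f"(would create a file under output/ from: {text!r})"

def handle_summarize(text):
    """Stub: the real handler asks the LLM for a summary."""
    return text[:60]

def handle_chat(text):
    """Stub: the real handler asks the LLM for a conversational reply."""
    return f"You said: {text}"

HANDLERS = {
    "create_file": handle_create_file,
    "write_code": handle_create_file,  # same save-to-output/ flow in this sketch
    "summarize": handle_summarize,
    "chat": handle_chat,
}

def execute(intent, text):
    """Route to the matching handler; unknown intents degrade gracefully to chat."""
    return HANDLERS.get(intent, handle_chat)(text)
```

The dict dispatch keeps adding a new intent to a two-line change: one handler, one table entry.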
- UI Display
The Streamlit UI shows:
- Transcribed text
- Detected intent
- Action taken
- Output result
- Session history
Challenges & Solutions
1. Speech-to-Text Accuracy
Whisper on CPU produced inconsistent results.
Solution: added a manual text-input fallback and let users edit the transcription before it is processed.
2. API Rate Limits
Initial attempts using cloud APIs failed due to rate limits.
Solution: switched to Ollama, enabling fully local inference with no rate limits.
3. Hardware Constraints
Running large models on a low-end laptop was slow.
Solution: used phi3, a lightweight model small enough to run on the 8GB-RAM, CPU-only setup.
4. File Safety
Risk of writing files anywhere on the system.
Solution: restricted all file operations to a dedicated output/ folder.
Why Ollama?
Ollama made it possible to:
- Run LLMs locally
- Avoid API costs and limits
- Maintain privacy
- Keep the system responsive
Safety Considerations
To prevent accidental system changes:
- All files are created only inside the output/ directory
- File operations require user confirmation
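The output/-only restriction can be sketched with pathlib: resolve the requested name inside the folder and reject anything that escapes it (the helper name is illustrative):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(name):
    """Resolve `name` inside output/ and reject anything that escapes it."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    candidate = (OUTPUT_DIR / name).resolve()
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"Refusing to touch paths outside {OUTPUT_DIR}: {name}")
    return candidate

safe_output_path("notes.txt")           # ok: resolves to output/notes.txt
# safe_output_path("../../etc/passwd")  # raises ValueError
```

Resolving before checking is what defeats `../` traversal; comparing unresolved strings would not.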
Future Improvements
- Multi-step commands (e.g., “summarize and save to file”)
- Better speech recognition
- Persistent memory (database)
- Voice feedback (text-to-speech)
Conclusion
This project demonstrates how a complete AI agent pipeline can be built using local tools. Despite hardware limitations, it delivers:
- Real-time interaction
- Multi-intent execution
- Clean UI experience
It highlights the power of combining:
- Speech processing
- Language models
- System automation
Links
GitHub Repository: https://github.com/NidheshGomai/Voice-Controlled-Local-AI-Agent
Demo Video: https://youtu.be/CI2mNQl-Bh4