# Voice-Controlled Local AI Agent with Whisper, Ollama, and Safe Local Tools
I built a voice-controlled local AI agent that accepts direct microphone input, transcribes speech, detects intent, and executes safe local actions inside a sandboxed output folder.
This project was designed as a local-first demo, but I also focused on making it practical in real-world conditions. That meant adding fallback behavior, transparent pipeline visibility, and guardrails around file operations so the assistant stays useful without becoming risky.
## What the project does
The app follows a simple but effective pipeline:
Microphone input -> Speech-to-text -> Intent detection -> Tool execution -> Final output
It supports a few core actions:
- Create a file
- Write code to a file
- Summarize text
- General chat
The entire flow is displayed in the UI so you can see exactly what the system heard, what it understood, and what action it took.
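The stages above can be sketched as a small orchestration function that records what the system heard, what it understood, and what it did, so the UI can display each step. This is an illustrative sketch with stub callables, not the actual `pipeline.py`:

```python
from dataclasses import dataclass, field

@dataclass
class PipelineRun:
    """One end-to-end run, with a stage log the UI can render."""
    transcript: str = ""
    intent: str = ""
    result: str = ""
    stages: list = field(default_factory=list)

def run_pipeline(transcript: str, detect_intent, execute_tool) -> PipelineRun:
    """Run transcript -> intent -> tool, recording each stage along the way."""
    run = PipelineRun(transcript=transcript)
    run.stages.append(("heard", transcript))
    run.intent = detect_intent(transcript)
    run.stages.append(("understood", run.intent))
    run.result = execute_tool(run.intent, transcript)
    run.stages.append(("did", run.result))
    return run
```

Keeping the stage log as plain data makes it trivial to render in Streamlit and easy to inspect when something goes wrong.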
## Tech stack
The project uses:
- Streamlit for the user interface
- Whisper for speech-to-text
- Ollama for local LLM-based intent reasoning
- Python for orchestration and local tool execution
- `requests` for API fallback transcription
- Sandboxed file handling inside an `output/` directory
## Architecture overview
The code is organized into small modules so each part of the pipeline stays focused:
- `app.py` handles the Streamlit UI
- `stt.py` handles transcription
- `intents.py` detects what the user wants
- `tools.py` performs safe local actions
- `pipeline.py` connects everything together
- `config.py` stores runtime settings
That structure makes the application easier to debug and easier to extend later.
### 1. Audio input
The UI accepts direct microphone input using Streamlit’s audio component.
I also kept file upload support and a manual text rerun option so the app remains usable if speech recognition is noisy.
### 2. Speech-to-text
The default transcription path uses a local Whisper model through HuggingFace Transformers.
If local STT fails due to environment issues, the app can fall back to an API-based transcription path. That fallback is helpful on weaker machines or when local dependencies are not fully available.
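The local-then-API fallback is easy to express as a small wrapper that tries one transcriber and falls back to the other on failure. The function names here are illustrative; the real `stt.py` calls Whisper via Transformers on the primary path and an HTTP API via `requests` on the fallback:

```python
def transcribe_with_fallback(audio_bytes: bytes, local_stt, api_stt):
    """Try local Whisper first; fall back to the API path on any failure.

    local_stt and api_stt are callables that take raw audio bytes and
    return a transcript string. Returns (transcript, source) so the UI
    can show which path actually produced the text.
    """
    try:
        return local_stt(audio_bytes), "local"
    except Exception:
        # Local dependencies (model weights, ffmpeg, etc.) may be missing.
        return api_stt(audio_bytes), "api"
```

Returning the source alongside the transcript keeps the pipeline transparent: the UI can tell the user whether transcription ran locally or remotely.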
### 3. Intent detection
Once the transcript is available, the app sends it to a local Ollama model to classify intent.
Supported intents include:
- `create_file`
- `write_code`
- `summarize_text`
- `general_chat`
If the model is unavailable, the app uses a keyword-based fallback parser so the pipeline still works.
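A keyword fallback like this can be a handful of ordered rules checked against the lowercased transcript. The specific keywords below are illustrative; the real `intents.py` may use different rules:

```python
# Ordered rules: more specific intents are checked before broader ones,
# so "create a code file" routes to write_code rather than create_file.
KEYWORD_RULES = [
    ("write_code", ("write code", "code file", "script")),
    ("create_file", ("create", "new file", "make a file")),
    ("summarize_text", ("summarize", "summary", "tl;dr")),
]

def fallback_intent(transcript: str) -> str:
    """Classify a transcript by keyword matching when Ollama is unavailable."""
    text = transcript.lower()
    for intent, keywords in KEYWORD_RULES:
        if any(keyword in text for keyword in keywords):
            return intent
    return "general_chat"
```

Defaulting to `general_chat` means an unrecognized command still produces a sensible response instead of an error.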
### 4. Tool execution
After intent detection, the pipeline routes to the correct tool:
- `create_file()` creates a safe empty file
- `write_code_file()` generates and writes code
- `summarize_text()` returns a concise summary
- `general_chat()` handles general conversational output
All file-related actions are restricted to the `output/` folder, which acts as a sandbox.
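The routing itself can be a simple dispatch table from intent name to tool function. The stub tools below stand in for the real implementations in `tools.py`:

```python
# Stub tools for illustration; the real versions live in tools.py.
def create_file(text): return f"created file for: {text}"
def write_code_file(text): return f"wrote code for: {text}"
def summarize_text(text): return f"summary of: {text}"
def general_chat(text): return f"chat reply to: {text}"

TOOL_ROUTES = {
    "create_file": create_file,
    "write_code": write_code_file,
    "summarize_text": summarize_text,
    "general_chat": general_chat,
}

def route(intent: str, transcript: str) -> str:
    """Dispatch an intent to its tool, defaulting to general chat."""
    return TOOL_ROUTES.get(intent, general_chat)(transcript)
```

A dictionary dispatch keeps adding a new tool down to two steps: write the function, add one entry to the table.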
## Safety guardrails
One of the most important design decisions was limiting file operations to a safe local directory.
That means:
- No arbitrary path writes
- Filenames are sanitized
- File extensions are restricted
- Generated files stay inside `output/`
This makes the assistant much safer for demo and assignment use.
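All four of those guarantees can be enforced in one path-building helper. This is a sketch of the approach rather than the exact code in `tools.py`, and the extension allow-list here is an assumption:

```python
import re
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()
ALLOWED_EXTENSIONS = {".txt", ".md", ".py"}  # assumed allow-list

def safe_output_path(filename: str) -> Path:
    """Sanitize a filename and confine it to the output/ sandbox."""
    name = Path(filename).name                    # drop any directory parts
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)  # strip risky characters
    path = (OUTPUT_DIR / name).resolve()
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {path.suffix}")
    if OUTPUT_DIR not in path.parents:
        raise ValueError("path escapes the sandbox")
    return path
```

Taking only `Path(filename).name` defeats `../` traversal before sanitization even runs, and the final `parents` check is a belt-and-suspenders guard against anything that slips through.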
## Challenges I ran into
### Local STT dependencies
Speech-to-text on local machines can be fragile, especially when audio decoding libraries like ffmpeg are missing.
To reduce that friction, I added error handling and a fallback path for WAV files.
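The WAV fallback works because WAV audio can be decoded with Python's standard library alone, with no ffmpeg dependency. A minimal sketch (the real `stt.py` would feed the decoded samples to Whisper rather than just returning them):

```python
import io
import wave

def decode_wav(wav_bytes: bytes):
    """Decode WAV bytes to raw PCM frames and a sample rate, stdlib only."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.readframes(wf.getnframes()), wf.getframerate()
```

For other formats (MP3, OGG, browser-recorded WebM) there is no such stdlib decoder, which is exactly when the missing-ffmpeg failures show up.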
### Local model availability
Local LLMs are useful, but they can fail if Ollama is not running or if the configured model is unavailable.
To handle that, the app shows runtime diagnostics and falls back to simpler behavior when needed.
### Noisy transcription
Speech recognition is not always perfect, especially with background noise or accents.
To make the workflow more forgiving, I added a manual transcript edit box so the user can correct the text and rerun the intent and tool pipeline.
## What I learned
This project reinforced a few important lessons:
- A good AI assistant is not just about the model
- Fallbacks matter as much as the primary path
- Transparency improves trust
- Safety constraints should be built in from the beginning
- A simple modular architecture makes debugging much easier
## Demo flow
For the video demo, I plan to show:
- Direct microphone input
- Transcript generation
- Intent detection
- File creation or code generation
- Final output inside the app
That gives a clean end-to-end view of how the system behaves.
## Future improvements
A few enhancements I would add next:
- Better support for compound commands
- Confirmation prompts before tool execution
- More tools, like search or note-taking
- Memory for multi-turn workflows
- Improved structured intent schemas with confidence scores
## Conclusion
This project was a practical exercise in building a local-first voice assistant that is usable, safe, and transparent.
Instead of aiming for a flashy demo with a single model call, I focused on the full pipeline: audio input, transcription, intent detection, tool routing, safe execution, and clear UI feedback.
That combination made the system feel much more realistic and much easier to trust.
If you want to try a similar build, start small, keep the architecture modular, and make failure cases visible from day one.