DEV Community

Kaustubh Gole

Voice-Controlled Local AI Agent with Whisper, Ollama, and Safe Local Tools

I built a voice-controlled local AI agent that accepts direct microphone input, transcribes speech, detects intent, and executes safe local actions inside a sandboxed output folder.

This project was designed as a local-first demo, but I also focused on making it practical in real-world conditions. That meant adding fallback behavior, transparent pipeline visibility, and guardrails around file operations so the assistant stays useful without becoming risky.

What the project does

The app follows a simple but effective pipeline:

Microphone input -> Speech-to-text -> Intent detection -> Tool execution -> Final output

It supports a few core actions:

  • Create a file
  • Write code to a file
  • Summarize text
  • General chat

The entire flow is displayed in the UI so you can see exactly what the system heard, what it understood, and what action it took.
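The pipeline above can be sketched as a chain of small functions. This is a minimal illustration, not the project's actual code; the function names and stub return values are placeholders for the real Whisper, Ollama, and tool logic:

```python
# Minimal sketch of the pipeline: mic audio -> transcript -> intent -> tool -> output.
# The step functions are illustrative stubs, not the project's real implementations.

def transcribe(audio_bytes: bytes) -> str:
    # Placeholder: the real app runs Whisper here.
    return "create a file called notes.txt"

def detect_intent(transcript: str) -> dict:
    # Placeholder: the real app asks a local Ollama model.
    return {"intent": "create_file", "filename": "notes.txt"}

def execute_tool(intent: dict) -> str:
    # Placeholder: the real app routes to a sandboxed tool.
    return f"Created {intent['filename']}"

def run_pipeline(audio_bytes: bytes) -> dict:
    transcript = transcribe(audio_bytes)
    intent = detect_intent(transcript)
    result = execute_tool(intent)
    # Returning every stage is what lets the UI show exactly what happened.
    return {"transcript": transcript, "intent": intent, "result": result}
```

Returning all intermediate stages, rather than just the final result, is what makes the "show what the system heard and understood" UI possible.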

Tech stack

The project uses:

  • Streamlit for the user interface
  • Whisper for speech-to-text
  • Ollama for local LLM-based intent reasoning
  • Python for orchestration and local tool execution
  • requests for API fallback transcription
  • Sandboxed file handling inside an output/ directory

Architecture overview

The code is organized into small modules so each part of the pipeline stays focused:

  • app.py handles the Streamlit UI
  • stt.py handles transcription
  • intents.py detects what the user wants
  • tools.py performs safe local actions
  • pipeline.py connects everything together
  • config.py stores runtime settings

That structure makes the application easier to debug and easier to extend later.

1. Audio input

The UI accepts direct microphone input using Streamlit’s audio component.

I also kept file upload support and a manual transcript rerun option, so the app stays usable when speech recognition is noisy.

2. Speech-to-text

The default transcription path uses a local Whisper model through HuggingFace Transformers.

If local STT fails due to environment issues, the app can fall back to an API-based transcription path. That fallback is helpful on weaker machines or when local dependencies are not fully available.
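That local-first-with-fallback behavior boils down to a try/except wrapper. The two backend functions here are hypothetical stand-ins for the real ones in stt.py:

```python
# Sketch of the STT fallback: try local Whisper first, fall back to an API path.
# Both backend functions are hypothetical stand-ins for the real stt.py code.

def local_whisper_transcribe(audio_bytes: bytes) -> str:
    # In the real app this runs a HuggingFace Whisper pipeline;
    # here we simulate an environment failure.
    raise RuntimeError("ffmpeg not found")

def api_transcribe(audio_bytes: bytes) -> str:
    # In the real app this POSTs the audio to a hosted STT endpoint via requests.
    return "transcript from API fallback"

def transcribe_with_fallback(audio_bytes: bytes) -> tuple[str, str]:
    try:
        return local_whisper_transcribe(audio_bytes), "local"
    except Exception as exc:
        # Surface the failure instead of hiding it, then fall back.
        print(f"Local STT failed ({exc}); falling back to API")
        return api_transcribe(audio_bytes), "api"
```

Returning which backend produced the transcript lets the UI report it, which keeps the pipeline transparent even in the degraded path.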

3. Intent detection

Once the transcript is available, the app sends it to a local Ollama model to classify intent.

Supported intents include:

  • create_file
  • write_code
  • summarize_text
  • general_chat
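The classification request to Ollama can be built as a plain JSON payload against Ollama's default local endpoint. The model name and prompt wording below are assumptions, not the project's exact values:

```python
# Sketch of the intent-classification request to a local Ollama server.
# The model name and prompt format are assumptions, not the project's exact values.

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
MODEL = "llama3"  # assumed model name

def build_intent_request(transcript: str) -> dict:
    prompt = (
        "Classify the user's request into one of: create_file, write_code, "
        "summarize_text, general_chat. Reply with JSON like "
        '{"intent": "...", "filename": "..."}.\n'
        f"Request: {transcript}"
    )
    # format="json" asks Ollama to constrain the output to valid JSON,
    # and stream=False returns the whole response in one object.
    return {"model": MODEL, "prompt": prompt, "format": "json", "stream": False}

# Sending it is one call with requests (not executed here, since it needs a
# running Ollama server):
# resp = requests.post(OLLAMA_URL, json=build_intent_request(text), timeout=30)
# intent = json.loads(resp.json()["response"])
```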

If the model is unavailable, the app uses a keyword-based fallback parser so the pipeline still works.
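A keyword fallback parser like that can be a few lines of string matching. The keyword lists here are illustrative, not the project's exact rules:

```python
# Keyword-based fallback intent parser, used when the Ollama model is unreachable.
# The keyword lists are illustrative; the real parser may differ.

def fallback_intent(transcript: str) -> str:
    text = transcript.lower()
    # Check write_code before create_file, since code requests often mention files too.
    if any(kw in text for kw in ("write code", "write a script", "generate code")):
        return "write_code"
    if any(kw in text for kw in ("create a file", "make a file", "new file")):
        return "create_file"
    if any(kw in text for kw in ("summarize", "summary", "tl;dr")):
        return "summarize_text"
    # Anything unrecognized degrades to plain conversation rather than failing.
    return "general_chat"
```

It is far less flexible than the LLM path, but it guarantees the pipeline never dead-ends just because the model is down.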

4. Tool execution

After intent detection, the pipeline routes to the correct tool:

  • create_file() creates a safe empty file
  • write_code_file() generates and writes code
  • summarize_text() returns a concise summary
  • general_chat() handles general conversational output

All file-related actions are restricted to the output/ folder, which acts as a sandbox.
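That routing step is naturally expressed as a dispatch table. The tool bodies below are stubs; the real implementations live in tools.py:

```python
# Dispatch table routing a detected intent to its tool function.
# Tool bodies are stubs; the real implementations live in tools.py.

def create_file(params: dict) -> str:
    return f"created {params.get('filename', 'untitled.txt')}"

def write_code_file(params: dict) -> str:
    return "wrote code file"

def summarize_text(params: dict) -> str:
    return "summary"

def general_chat(params: dict) -> str:
    return "chat reply"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code_file,
    "summarize_text": summarize_text,
    "general_chat": general_chat,
}

def route(intent: str, params: dict) -> str:
    # Unknown intents degrade to general chat rather than raising an error.
    handler = TOOLS.get(intent, general_chat)
    return handler(params)
```

Adding a new tool then means writing one function and registering it in the table, which is what keeps the architecture easy to extend.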

Safety guardrails

One of the most important design decisions was limiting file operations to a safe local directory.

That means:

  • No arbitrary path writes
  • Filenames are sanitized
  • File extensions are restricted
  • Generated files stay inside output/
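The guardrails above can be enforced in one path-validation helper. This is a sketch of the idea, with an assumed extension whitelist rather than the project's actual list:

```python
# Sketch of the sandbox check: sanitize the name, resolve the path, and verify
# it stays inside output/. The extension whitelist here is an assumption.
from pathlib import Path
import re

OUTPUT_DIR = Path("output").resolve()
ALLOWED_EXTENSIONS = {".txt", ".md", ".py"}  # illustrative whitelist

def safe_output_path(filename: str) -> Path:
    # Drop any directory components, then strip characters outside a
    # conservative whitelist.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", Path(filename).name)
    candidate = (OUTPUT_DIR / name).resolve()
    # Reject anything that escapes the sandbox or uses a disallowed extension.
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {filename}")
    if candidate.suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {candidate.suffix!r}")
    return candidate
```

Taking only `Path(filename).name` means even an input like `../../secrets.txt` collapses to a bare filename inside the sandbox before any write happens.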

This makes the assistant much safer for demo and assignment use.

Challenges I ran into

Local STT dependencies

Speech-to-text on local machines can be fragile, especially when audio decoding libraries like ffmpeg are missing.

To reduce that friction, I added error handling and a fallback path for WAV files.
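One way to catch the missing-ffmpeg case early is to probe for the binary up front and, for WAV input, decode with the standard-library wave module, which needs no external tools. This is a sketch of that idea, not the project's exact code:

```python
# Sketch: probe for ffmpeg before attempting Whisper's default audio loading,
# and decode WAV input with the stdlib as a dependency-free fallback.
import io
import shutil
import wave

def can_use_ffmpeg() -> bool:
    # Whisper's default audio loading shells out to ffmpeg; check for it first
    # so the failure is reported up front instead of mid-transcription.
    return shutil.which("ffmpeg") is not None

def load_wav_frames(wav_bytes: bytes) -> tuple[bytes, int]:
    # Fallback decoder for WAV input: raw PCM frames plus the sample rate.
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        return wf.readframes(wf.getnframes()), wf.getframerate()
```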

Local model availability

Local LLMs are useful, but they can fail if Ollama is not running or if the configured model is unavailable.

To handle that, the app shows runtime diagnostics and falls back to simpler behavior when needed.

Noisy transcription

Speech recognition is not always perfect, especially with background noise or accents.

To make the workflow more forgiving, I added a manual transcript edit box so the user can correct the text and rerun the intent and tool pipeline.

What I learned

This project reinforced a few important lessons:

  • A good AI assistant is not just about the model
  • Fallbacks matter as much as the primary path
  • Transparency improves trust
  • Safety constraints should be built in from the beginning
  • A simple modular architecture makes debugging much easier

Demo flow

For the video demo, I plan to show:

  1. Direct microphone input
  2. Transcript generation
  3. Intent detection
  4. File creation or code generation
  5. Final output inside the app

That gives a clean end-to-end view of how the system behaves.

Future improvements

A few enhancements I would add next:

  • Better support for compound commands
  • Confirmation prompts before tool execution
  • More tools, like search or note-taking
  • Memory for multi-turn workflows
  • Improved structured intent schemas with confidence scores

Conclusion

This project was a practical exercise in building a local-first voice assistant that is usable, safe, and transparent.

Instead of aiming for a flashy demo with a single model call, I focused on the full pipeline: audio input, transcription, intent detection, tool routing, safe execution, and clear UI feedback.

That combination made the system feel much more realistic and much easier to trust.

If you want to try a similar build, start small, keep the architecture modular, and make failure cases visible from day one.