Shaik Idris

Building a Private, Voice-Controlled AI Agent with Ollama and Faster-Whisper

🎯 Project Overview

As part of the Mem0 AI & Generative AI Developer Intern assignment, I built a local AI agent that allows users to manage files, write code, and summarize text using only their voice. The core mission: 100% privacy and zero cloud dependencies.

πŸ› οΈ The Tech Stack

To ensure the agent runs entirely on a local machine, I selected the following components:

  • Frontend: Streamlit for a fast, responsive Web UI.
  • Speech-to-Text: Faster-Whisper (Int8 quantized) for high-speed local transcription on a CPU.
  • Brain (LLM): Ollama running phi3:mini (or llama3.2:1b) to classify intents.
  • Tool Execution: Python's os and pathlib for safe file operations.
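On the transcription side, a minimal sketch of loading Faster-Whisper with int8 quantization on CPU. The model size `"base"` and the helper names here are my assumptions, not code from the project:

```python
def join_segments(segments):
    # faster-whisper yields Segment objects lazily; each carries a .text field
    return " ".join(seg.text.strip() for seg in segments)

def transcribe_wav(path: str) -> str:
    # Imported lazily so join_segments stays usable without the package installed.
    from faster_whisper import WhisperModel

    # compute_type="int8" keeps the memory footprint small enough for
    # CPU-only machines; "base" is an assumed model size.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    return join_segments(segments)
```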

πŸ—οΈ The Architecture

The pipeline follows a clear flow:

  1. Audio Input: The user provides audio via the browser microphone or a file upload.
  2. Transcription: Faster-Whisper processes the audio into text.
  3. Intent Detection: The LLM analyzes the text and returns a structured JSON object.
  4. Action: The system executes the specific intent (e.g., creating a file in the output/ folder).
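Step 3 hinges on getting reliable JSON out of a small model, which often wraps its answer in prose or code fences. A defensive parsing sketch (the intent names are hypothetical examples, not the project's actual schema):

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Pull the first JSON object out of an LLM reply, tolerating
    surrounding prose or code fences. Falls back to an 'unknown' intent."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"intent": "unknown", "args": {}}
```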

🚧 Challenges Faced

1. Hardware Constraints (RAM)

My local machine had limited available memory (~3.1 GiB), which initially caused crashes when running larger models.
Solution: I optimized the system by switching to a smaller parameter model (Phi-3) and forcing CPU-only mode in Ollama.
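For reference, CPU-only execution can be requested per call through the `options` field of Ollama's `/api/generate` endpoint. The `num_ctx` value below is my own guess for a low-RAM machine, not a setting from the post:

```python
import json
import urllib.request

def build_generate_payload(prompt: str, model: str = "phi3:mini") -> dict:
    # num_gpu: 0 asks Ollama to keep all layers on the CPU, avoiding
    # GPU/VRAM allocation; a small num_ctx further trims memory use.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": 0, "num_ctx": 1024},
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    # Requires a running Ollama server with the model already pulled.
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```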

2. Browser Mic Permissions

Running Streamlit on localhost often triggers strict browser security blocks for the microphone.
Solution: I implemented a dual-input system that allows users to upload .wav files as a reliable fallback.
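The upload fallback benefits from a cheap sanity check before the file reaches the transcriber. This `validate_wav` helper is a standard-library sketch of mine, not code from the project:

```python
import io
import wave

def validate_wav(data: bytes) -> bool:
    """Reject uploads that are not readable .wav files before they
    reach Faster-Whisper (hypothetical helper)."""
    try:
        with wave.open(io.BytesIO(data)) as wf:
            # A usable recording has at least one frame and a sane sample rate.
            return wf.getnframes() > 0 and wf.getframerate() > 0
    except (wave.Error, EOFError):
        return False
```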

πŸ” Safety & Security

To prevent accidental system damage, I implemented a strict safety constraint: all file operations are sandboxed within a dedicated ./output/ directory.
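A sandbox like this is typically enforced by resolving every user-supplied path and rejecting anything that escapes the directory, so that inputs like `../../etc/passwd` fail closed. A sketch with `pathlib` (the helper name is hypothetical):

```python
from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(name: str) -> Path:
    """Resolve a user-supplied filename and refuse anything that
    lands outside the ./output/ sandbox."""
    candidate = (SANDBOX / name).resolve()
    if candidate != SANDBOX and SANDBOX not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {name}")
    return candidate
```

Resolving before checking matters: a naive string-prefix test can be defeated by `..` segments or absolute paths, while `resolve()` normalizes both before the containment check.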

📺 Conclusion

Building this agent taught me how to bridge the gap between speech-to-text and LLM tool-calling in a local environment.
