Shaik Idris

Building a Private, Voice-Controlled AI Agent with Ollama and Faster-Whisper

🎯 Project Overview

As part of the Mem0 AI & Generative AI Developer Intern assignment, I built a local AI agent that allows users to manage files, write code, and summarize text using only their voice. The core mission: 100% privacy and zero cloud dependencies.

πŸ› οΈ The Tech Stack

To ensure the agent runs entirely on a local machine, I selected the following components:

  • Frontend: Streamlit for a fast, responsive Web UI.
  • Speech-to-Text: Faster-Whisper (Int8 quantized) for high-speed local transcription on a CPU.
  • Brain (LLM): Ollama running phi3:mini (or llama3.2:1b) to classify intents.
  • Tool Execution: Python's os and pathlib for safe file operations.
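On the transcription side, a minimal sketch of loading Faster-Whisper with int8 quantization on CPU. The model size `"base"` and the helper names here are my assumptions, not code from the project:

```python
def join_segments(segments):
    # faster-whisper yields Segment objects lazily; each carries a .text field
    return " ".join(seg.text.strip() for seg in segments)

def transcribe_wav(path: str) -> str:
    # Imported lazily so join_segments stays usable without the package installed.
    from faster_whisper import WhisperModel

    # compute_type="int8" keeps the memory footprint small enough for
    # CPU-only machines; "base" is an assumed model size.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(path)
    return join_segments(segments)
```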

πŸ—οΈ The Architecture

The pipeline follows a clear flow:

  1. Audio Input: The user provides audio via the browser microphone or a file upload.
  2. Transcription: Faster-Whisper processes the audio into text.
  3. Intent Detection: The LLM analyzes the text and returns a structured JSON object.
  4. Action: The system executes the specific intent (e.g., creating a file in the output/ folder).
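Step 3 hinges on getting reliable JSON out of a small model, which often wraps its answer in prose or code fences. A defensive parsing sketch (the intent names are hypothetical examples, not the project's actual schema):

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Pull the first JSON object out of an LLM reply, tolerating
    surrounding prose or code fences. Falls back to an 'unknown' intent."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"intent": "unknown", "args": {}}
```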

🚧 Challenges Faced

1. Hardware Constraints (RAM)

My local machine had limited available memory (~3.1 GiB), which initially caused crashes when running larger models.
Solution: I optimized the system by switching to a smaller parameter model (Phi-3) and forcing CPU-only mode in Ollama.
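For reference, CPU-only execution can be requested per call through the `options` field of Ollama's `/api/generate` endpoint. The `num_ctx` value below is my own guess for a low-RAM machine, not a setting from the post:

```python
import json
import urllib.request

def build_generate_payload(prompt: str, model: str = "phi3:mini") -> dict:
    # num_gpu: 0 asks Ollama to keep all layers on the CPU, avoiding
    # GPU/VRAM allocation; a small num_ctx further trims memory use.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": 0, "num_ctx": 1024},
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    # Requires a running Ollama server with the model already pulled.
    req = urllib.request.Request(
        host + "/api/generate",
        data=json.dumps(build_generate_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```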

2. Browser Mic Permissions

Running Streamlit on localhost often triggers strict browser security blocks for the microphone.
Solution: I implemented a dual-input system that allows users to upload .wav files as a reliable fallback.
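The upload fallback benefits from a cheap sanity check before the file reaches the transcriber. This `validate_wav` helper is a standard-library sketch of mine, not code from the project:

```python
import io
import wave

def validate_wav(data: bytes) -> bool:
    """Reject uploads that are not readable .wav files before they
    reach Faster-Whisper (hypothetical helper)."""
    try:
        with wave.open(io.BytesIO(data)) as wf:
            # A usable recording has at least one frame and a sane sample rate.
            return wf.getnframes() > 0 and wf.getframerate() > 0
    except (wave.Error, EOFError):
        return False
```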

πŸ” Safety & Security

To prevent accidental system damage, I implemented a strict safety constraint: all file operations are sandboxed within a dedicated ./output/ directory.
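A sandbox like this is typically enforced by resolving every user-supplied path and rejecting anything that escapes the directory, so that inputs like `../../etc/passwd` fail closed. A sketch with `pathlib` (the helper name is hypothetical):

```python
from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(name: str) -> Path:
    """Resolve a user-supplied filename and refuse anything that
    lands outside the ./output/ sandbox."""
    candidate = (SANDBOX / name).resolve()
    if candidate != SANDBOX and SANDBOX not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {name}")
    return candidate
```

Resolving before checking matters: a naive string-prefix test can be defeated by `..` segments or absolute paths, while `resolve()` normalizes both before the containment check.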

📺 Conclusion

Building this agent taught me how to bridge the gap between speech-to-text and LLM tool-calling in a local environment.
