Donthi Nishitha
Building a Voice-Controlled Local AI Agent with Whisper, LLaMA 3 and Streamlit

Introduction

EchoMemo is a voice-controlled local AI agent that runs entirely on your machine. The user gives a command, either through the microphone or by uploading an audio file, and the system automatically:

  • Converts speech to text
  • Understands what the user wants to do
  • Executes the action that matches that intent
  • Shows every step of the pipeline in a clean web UI

No cloud, no API keys, and no internet connection required after setup.

What difference does this make?

Unlike most AI tools today, local AI addresses privacy, cost, dependency, and latency all at once. With Whisper for speech recognition and a local LLM served by Ollama, you can run capable AI models directly on your own hardware: your data never leaves your computer, there are no usage fees, and the system works completely offline.

This project is a practical demonstration of that idea — a fully functional voice agent built entirely from local, open-source models.


Architecture Overview

Audio Input (mic / file upload)
        │
        ▼
  Speech-to-Text          ← OpenAI Whisper "base" model (fully local)
  [models/stt.py]
        │
        ▼
  Intent Classifier       ← LLaMA 3 via Ollama (fully local)
  [llm/intent_classifier.py]
        │
        ├── create_file    → tools/file_tools.py   → output/
        ├── write_code     → tools/code_tools.py   → output/
        ├── summarize_text → tools/text_tools.py
        └── general_chat   → direct Ollama LLaMA 3 chat
        │
        ▼
  Streamlit UI            ← displays transcription, intent, action, result
  [app.py]
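The glue that drives this diagram can be sketched in a few lines. Note that run_pipeline and its parameter names here are hypothetical stand-ins for the real modules in models/, llm/, and tools/:

```python
def run_pipeline(audio_path, transcribe, classify_intent, tools):
    """Run audio -> text -> intent -> action and collect each step's result."""
    text = transcribe(audio_path)       # models/stt.py (Whisper, fully local)
    intent = classify_intent(text)      # llm/intent_classifier.py (LLaMA 3 via Ollama)
    # Route to the matching tool, falling back to general chat.
    handler = tools.get(intent, tools["general_chat"])
    result = handler(text)              # tools/*.py or direct LLaMA 3 chat
    # The Streamlit UI (app.py) renders each of these fields in turn.
    return {"transcription": text, "intent": intent, "result": result}
```

Passing the stages in as arguments keeps the orchestration testable without loading Whisper or Ollama.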

Models I Chose

  • Whisper base — free, local, accurate, no API key needed
  • LLaMA 3 via Ollama — fully local LLM, easy to set up and run

How Each Component Works

  • models/stt.py — Converts audio into a format Whisper understands using pydub + ffmpeg, runs it through the Whisper base model locally, and returns plain text. Whisper handles accents and multiple audio formats without any API call. Supported inputs: microphone recording, plus WAV, MP3, M4A, OGG, and WEBM files.

  • llm/intent_classifier.py — Takes the transcribed text and sends it to LLaMA 3 running locally via Ollama with a strict prompt. The LLM returns one of four intents (create_file, write_code, summarize_text, general_chat). A normalization function cleans up any freeform LLM reply into the exact intent label.

  • File Tool — Extracts a filename from the transcription using regex and creates a .txt file in output/.

  • Code Tool — Sends the request to LLaMA 3, strips markdown fences from the response, and saves a clean .py file to output/.

  • Text Tool — Sends the transcription to LLaMA 3 and returns a 5-bullet summary.

  • Streamlit UI — Handles mic + file upload, runs the full pipeline, and displays each step's result cleanly.
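The File and Code tools above boil down to two small text-processing helpers. This is an illustrative sketch — the exact regexes in tools/file_tools.py and tools/code_tools.py may differ:

```python
import re

def extract_filename(text, default="note.txt"):
    """Pull a filename out of a transcription such as
    'create a file called shopping list' (File Tool)."""
    m = re.search(r"(?:called|named)\s+([\w \-]+)", text, re.IGNORECASE)
    if not m:
        return default
    name = m.group(1).strip().replace(" ", "_")
    return name if name.endswith(".txt") else name + ".txt"

def strip_fences(reply):
    """Strip markdown code fences from an LLM reply before saving
    a clean .py file (Code Tool)."""
    m = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return (m.group(1) if m else reply).strip()
```

Stripping fences matters because LLaMA 3 usually wraps generated code in ```python blocks, which would break the saved file if left in.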


Challenges I Faced

1. ffmpeg not found on Windows
winget installs ffmpeg to a very long user-specific path instead of a standard location. pydub couldn't find it automatically. Fixed by hardcoding the exact path using AudioSegment.converter and AudioSegment.ffprobe directly in stt.py.

2. pydub needing explicit path in code
Even after ffmpeg was installed and working in CMD, Python/Streamlit's process couldn't see it on PATH. Fixed by explicitly setting os.environ["PATH"] inside the code to force pydub to find ffmpeg.

3. LLM returning freeform text instead of clean intent labels
LLaMA 3 would sometimes reply with things like "The intent is write_code" or "2. create_file" instead of just the label. This caused all intent matching to silently fail. Fixed by writing a _normalize_intent() function that scans the raw reply and maps it to the correct label using keyword matching as fallback.

4. Windows file locking (WinError 32)
Windows holds a lock on temp files longer than Linux/Mac, so os.remove() was throwing a PermissionError. Fixed by wrapping the cleanup in a try/except PermissionError block and skipping the delete when the file is still locked; Windows reclaims the temp file later on its own.
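The fix is a small best-effort wrapper around the delete (the helper name safe_remove is mine, not from the project):

```python
import os

def safe_remove(path):
    """Best-effort temp-file cleanup. On Windows the file may still be
    locked (WinError 32), so swallow PermissionError and let the OS
    reclaim the temp file later."""
    try:
        os.remove(path)
    except PermissionError:
        pass  # file still locked on Windows; leave it for later cleanup
```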


Further Enhancements

  • Improve intent detection using few-shot prompting for higher accuracy
  • Add support for more intents like delete file, rename file, and web search
  • Build a chat history panel so previous commands are visible in the UI
  • Add a visual waveform display for recorded microphone audio
  • Support multilingual voice commands using Whisper's built-in language detection
  • Package the app as a desktop executable so no terminal setup is needed

Links

Note: I developed this project as part of an assignment. It helped me gain solid, hands-on knowledge of tool use, pipelining, agents, and debugging. I used ChatGPT and Claude as AI assistants for code baselining, debugging Windows-specific issues, and architectural guidance. All understanding, testing, and final decisions were my own.


Thank you for reading!
— D. Nishitha
