Donthi Nishitha
Building a Voice-Controlled Local AI Agent with Whisper, LLaMA 3 and Streamlit

Introduction

EchoMemo is a voice-controlled local AI agent that runs entirely on your machine. The user gives a command, either through the microphone or by uploading an audio file, and the system automatically:

  • Converts speech to text
  • Understands what the user wants to do
  • Executes the action that matches that intent
  • Shows every step of the pipeline in a clean web UI

No cloud, no API keys, and no internet connection required after setup.

What difference does this make?

Unlike most AI tools today, local AI addresses privacy, cost, dependency, and latency all at once. With Whisper for speech recognition and a local LLM served by Ollama, you can run capable AI models directly on your own hardware: your data never leaves your computer, there are no usage fees, and the system works completely offline.

This project is a practical demonstration of that idea — a fully functional voice agent built entirely from local, open-source models.


Architecture Overview

Audio Input (mic / file upload)
        │
        ▼
  Speech-to-Text          ← OpenAI Whisper "base" model (fully local)
  [models/stt.py]
        │
        ▼
  Intent Classifier       ← LLaMA 3 via Ollama (fully local)
  [llm/intent_classifier.py]
        │
        ├── create_file    → tools/file_tools.py   → output/
        ├── write_code     → tools/code_tools.py   → output/
        ├── summarize_text → tools/text_tools.py
        └── general_chat   → direct Ollama LLaMA 3 chat
        │
        ▼
  Streamlit UI            ← displays transcription, intent, action, result
  [app.py]
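The glue that drives this diagram can be sketched in a few lines. Note that run_pipeline and its parameter names here are hypothetical stand-ins for the real modules in models/, llm/, and tools/:

```python
def run_pipeline(audio_path, transcribe, classify_intent, tools):
    """Run audio -> text -> intent -> action and collect each step's result."""
    text = transcribe(audio_path)       # models/stt.py (Whisper, fully local)
    intent = classify_intent(text)      # llm/intent_classifier.py (LLaMA 3 via Ollama)
    # Route to the matching tool, falling back to general chat.
    handler = tools.get(intent, tools["general_chat"])
    result = handler(text)              # tools/*.py or direct LLaMA 3 chat
    # The Streamlit UI (app.py) renders each of these fields in turn.
    return {"transcription": text, "intent": intent, "result": result}
```

Passing the stages in as arguments keeps the orchestration testable without loading Whisper or Ollama.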

Models I Chose

  • Whisper base — free, local, accurate, no API key needed
  • LLaMA 3 via Ollama — fully local LLM, easy to set up and run

How Each Component Works

  • models/stt.py — Converts audio into a format Whisper understands using pydub + ffmpeg, runs it through the Whisper base model locally, and returns plain text. Whisper handles accents and multiple audio formats without any API call. Supported inputs: microphone recording, plus WAV, MP3, M4A, OGG, and WEBM files.

  • llm/intent_classifier.py — Takes the transcribed text and sends it to LLaMA 3 running locally via Ollama with a strict prompt. The LLM returns one of four intents (create_file, write_code, summarize_text, general_chat). A normalization function cleans up any freeform LLM reply into the exact intent label.

  • File Tool — Extracts a filename from the transcription using regex and creates a .txt file in output/.

  • Code Tool — Sends the request to LLaMA 3, strips markdown fences from the response, and saves a clean .py file to output/.

  • Text Tool — Sends the transcription to LLaMA 3 and returns a 5-bullet summary.

  • Streamlit UI — Handles mic + file upload, runs the full pipeline, and displays each step's result cleanly.
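The File and Code tools above boil down to two small text-processing helpers. This is an illustrative sketch — the exact regexes in tools/file_tools.py and tools/code_tools.py may differ:

```python
import re

def extract_filename(text, default="note.txt"):
    """Pull a filename out of a transcription such as
    'create a file called shopping list' (File Tool)."""
    m = re.search(r"(?:called|named)\s+([\w \-]+)", text, re.IGNORECASE)
    if not m:
        return default
    name = m.group(1).strip().replace(" ", "_")
    return name if name.endswith(".txt") else name + ".txt"

def strip_fences(reply):
    """Strip markdown code fences from an LLM reply before saving
    a clean .py file (Code Tool)."""
    m = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return (m.group(1) if m else reply).strip()
```

Stripping fences matters because LLaMA 3 usually wraps generated code in ```python blocks, which would break the saved file if left in.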


Challenges I Faced

1. ffmpeg not found on Windows
winget installs ffmpeg to a very long user-specific path instead of a standard location. pydub couldn't find it automatically. Fixed by hardcoding the exact path using AudioSegment.converter and AudioSegment.ffprobe directly in stt.py.

2. pydub needing explicit path in code
Even after ffmpeg was installed and working in CMD, Python/Streamlit's process couldn't see it on PATH. Fixed by explicitly setting os.environ["PATH"] inside the code to force pydub to find ffmpeg.

3. LLM returning freeform text instead of clean intent labels
LLaMA 3 would sometimes reply with things like "The intent is write_code" or "2. create_file" instead of just the label. This caused all intent matching to silently fail. Fixed by writing a _normalize_intent() function that scans the raw reply and maps it to the correct label using keyword matching as fallback.

4. Windows file locking (WinError 32)
Windows holds a lock on temp files longer than Linux/Mac, so os.remove() was throwing a PermissionError. Fixed by wrapping the cleanup in a try/except PermissionError block and skipping the delete when the file is still locked; Windows reclaims the temp file later on its own.
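The fix is a small best-effort wrapper around the delete (the helper name safe_remove is mine, not from the project):

```python
import os

def safe_remove(path):
    """Best-effort temp-file cleanup. On Windows the file may still be
    locked (WinError 32), so swallow PermissionError and let the OS
    reclaim the temp file later."""
    try:
        os.remove(path)
    except PermissionError:
        pass  # file still locked on Windows; leave it for later cleanup
```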


Further Enhancements

  • Improve intent detection using few-shot prompting for higher accuracy
  • Add support for more intents like delete file, rename file, and web search
  • Build a chat history panel so previous commands are visible in the UI
  • Add a visual waveform display for recorded microphone audio
  • Support multilingual voice commands using Whisper's built-in language detection
  • Package the app as a desktop executable so no terminal setup is needed

Links

Note: I developed this project as part of an assignment. It helped me gain solid, hands-on knowledge of tool use, pipelining, agents, and debugging. I used ChatGPT and Claude as AI assistants for code baselining, debugging Windows-specific issues, and architectural guidance. All understanding, testing, and final decisions were my own.


Thank you for reading!
— D. Nishitha
