Building a Voice-Controlled Local AI Agent with Whisper, Ollama & Gradio

Introduction

What if you could control your computer just by speaking to it — and have the
AI run entirely on your own machine, with no cloud, no API costs, and no data
leaving your device?

That's exactly what I built for my Mem0 internship assignment: a fully local
voice-controlled AI agent that transcribes speech, understands intent, and
executes real actions like writing code, creating files, and summarizing text
— all through a clean web UI.


What It Does

You speak (or upload audio). The agent:

  1. Transcribes your speech using Whisper
  2. Classifies your intent using a local LLM
  3. Executes the action — writes code, creates files, summarizes text, or chats
  4. Shows the full pipeline result in a Gradio UI
  5. Asks for confirmation before writing any file (human-in-the-loop)

Example: Say "Write a Python retry decorator and save it to retry.py" and the
agent generates the code and saves it to your local output/ folder.


Architecture

```
Audio Input (mic / file upload)
         │
         ▼
┌─────────────────┐
│ Whisper STT     │  HuggingFace transformers (local) or Groq API fallback
└────────┬────────┘
         │ transcript text
         ▼
┌──────────────────────┐
│ Intent Classifier    │  Ollama (llama3.2) → returns structured JSON
└────────┬─────────────┘
         │ [{"intent": "write_code", "params": {...}}]
         ▼
┌─────────────────────────────────┐
│       Agent Orchestrator        │
│ • write_code   → generate + save│
│ • create_file  → write to disk  │
│ • summarize    → bullet points  │
│ • general_chat → conversation   │
└────────────────┬────────────────┘
                 │
                 ▼
          Gradio Web UI
                 │
                 ▼
     output/ folder (sandboxed)
```

Models I Chose and Why

Speech-to-Text: Whisper (openai/whisper-base)

I used the HuggingFace transformers pipeline with openai/whisper-base. The
model has ~74M parameters, runs on CPU without a GPU, auto-detects CUDA if
available, and has excellent accuracy for English commands. For machines where
local inference is too slow, I built in a fallback to Groq's Whisper API
(set STT_BACKEND=groq in .env), which gives near-instant transcription for free.
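The backend switch and the local pipeline can be sketched roughly like this (the function names are mine, not from the repo; the `transformers` call is the standard ASR pipeline, and the Groq path is omitted):

```python
import os

def choose_stt_backend(env=None):
    """Select the STT backend: Groq API when STT_BACKEND=groq, else local Whisper."""
    env = os.environ if env is None else env
    return "groq" if env.get("STT_BACKEND", "").lower() == "groq" else "local"

def transcribe_local(audio_path: str) -> str:
    """Run openai/whisper-base via the HuggingFace pipeline, using CUDA when present."""
    import torch
    from transformers import pipeline

    device = 0 if torch.cuda.is_available() else -1  # GPU index, or -1 for CPU
    asr = pipeline("automatic-speech-recognition",
                   model="openai/whisper-base", device=device)
    return asr(audio_path)["text"]
```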

LLM: llama3.2 via Ollama

For intent classification and code generation I used Ollama running
llama3.2 locally. The key design decision was making the LLM return
structured JSON for intent routing:

```json
[
  {
    "intent": "write_code",
    "params": {
      "filename": "retry.py",
      "language": "python",
      "description": "a retry decorator function"
    }
  }
]
```

This made tool routing deterministic and reliable. Temperature is set to
0.1 for classification (consistent) and 0.3 for code generation (slightly
creative but still focused).
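A minimal sketch of the classification call, assuming the official `ollama` Python client (`parse_intents` is a hypothetical helper, not the repo's actual code):

```python
import json

SYSTEM_PROMPT = (
    "You are an intent classifier. Respond ONLY with a valid JSON array - "
    "no markdown, no explanation."
)

def parse_intents(raw: str) -> list:
    """Parse the model's reply into a list of intent dicts; [] on invalid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return data if isinstance(data, list) else []

def classify(command: str) -> list:
    """Ask llama3.2 (via a local Ollama server) to classify a transcribed command."""
    import ollama  # requires the Ollama server to be running locally
    reply = ollama.chat(
        model="llama3.2",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": command}],
        options={"temperature": 0.1},  # low temperature → consistent routing
    )
    return parse_intents(reply["message"]["content"])
```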


Key Features

Compound Commands

The LLM returns a list of intents, not just one. So saying "Summarize this and
save it to summary.txt" triggers two intents in sequence: summarize first, then
create the file with the summary as content.
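The sequential execution can be sketched as a simple dispatch loop that threads each result into the next intent (handler names and signature are my assumptions, not the repo's exact API):

```python
def run_pipeline(intents: list, handlers: dict) -> list:
    """Execute each classified intent in order, feeding each result forward.

    `handlers` maps intent names to callables taking (params, previous_result);
    unknown intents fall back to the general_chat handler.
    """
    results = []
    context = None  # output of the previous step, e.g. a summary to save
    for item in intents:
        handler = handlers.get(item["intent"], handlers["general_chat"])
        context = handler(item.get("params", {}), context)
        results.append(context)
    return results
```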

Human-in-the-Loop

Before any file is written to disk, the UI shows a confirmation prompt.
The user must click "Yes, proceed" or "Cancel". This can be turned off
with the auto-confirm checkbox for power users.

Safety Sandbox

All file writes are strictly limited to the output/ folder using path
sanitization. It's impossible to write outside this folder via voice
commands — path traversal attempts are silently stripped.
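The sanitization boils down to keeping only the final path component and double-checking the resolved target, roughly like this (a sketch, not the repo's exact implementation):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Map a requested filename into output/, silently stripping traversal parts."""
    name = Path(filename).name            # drops directories, including ../
    target = (OUTPUT_DIR / name).resolve()
    if OUTPUT_DIR not in target.parents:  # defence in depth
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}")
    return target
```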

Graceful Degradation

  • If Ollama is unreachable → falls back to keyword-based intent matching
  • If local Whisper is too slow → falls back to Groq API
  • If audio is unintelligible → shows a friendly error, no crash
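The keyword-based fallback for the first case can be as simple as a lookup table (keywords and ordering here are illustrative, not the repo's actual list):

```python
# Checked in order; first rule whose keywords all appear wins.
KEYWORD_MAP = [
    (("write", "code"), "write_code"),
    (("create", "file"), "create_file"),
    (("summarize",), "summarize"),
]

def keyword_fallback(command: str) -> dict:
    """Crude intent match used when Ollama is unreachable."""
    text = command.lower()
    for keywords, intent in KEYWORD_MAP:
        if all(k in text for k in keywords):
            return {"intent": intent, "params": {}}
    return {"intent": "general_chat", "params": {}}
```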

Challenges I Faced

1. Gradio 6.0 Breaking Changes

Gradio 6 moved theme and css from gr.Blocks() to launch(), removed
show_download_button from gr.Audio(), and changed the Chatbot
component's message format. Each of these threw a TypeError at startup
and had to be fixed one by one by reading the changelog.

2. Python 3.13 venv Bug

python -m venv fails on Python 3.13 on Windows due to a pip bootstrap
issue. The fix was to skip the venv entirely and install packages directly
with pip install — perfectly fine for a development setup.

3. Ollama Port Conflict

On Windows, Ollama auto-starts as a background service after installation.
Running ollama serve manually throws a port conflict error. The solution:
don't run it manually — it's already running.

4. Structured JSON from LLM

Getting the LLM to reliably return valid JSON (no markdown fences, no
preamble) required careful prompt engineering. The system prompt explicitly
says "Respond ONLY with a valid JSON array — no markdown, no explanation"
and uses temperature: 0.1 to reduce hallucination.
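Even with that prompt, small models occasionally wrap output in fences anyway, so a tolerant extractor is a useful belt-and-braces step (my sketch, not necessarily what the repo does):

```python
import json
import re

def extract_json_array(raw: str) -> list:
    """Pull the first JSON array out of a reply, tolerating stray markdown fences."""
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()  # drop ``` / ```json fences
    start, end = cleaned.find("["), cleaned.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array in model output")
    return json.loads(cleaned[start:end + 1])
```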


What I'd Improve Next

  • Streaming responses — show LLM output token by token instead of waiting for the full response
  • Wake word detection — always-on listening with a trigger word
  • More tools — web search, calendar integration, running shell commands
  • Voice feedback — text-to-speech so the agent speaks its response back
  • Model benchmarking — compare Whisper tiny vs base vs large latency on the same hardware

Try It Yourself

The full source code is on GitHub:
👉 https://github.com/LostAlien96/voice-ai-agent

Requirements: Python 3.10+, Ollama, 8GB RAM. Setup takes about 10 minutes
following the README.


Conclusion

Building this taught me how quickly local AI has matured. Running
production-quality speech recognition and a capable LLM entirely on a
laptop — with no internet required after setup — would have seemed
impressive just two years ago. Today it takes an afternoon.

The most interesting design challenge wasn't the ML part — it was making
the system reliable: structured outputs, graceful fallbacks, safe file
sandboxing, and a UI that guides the user through confirmation steps.
That's where the real engineering work lives.
