<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: LostAlien96</title>
    <description>The latest articles on DEV Community by LostAlien96 (@lostalien96).</description>
    <link>https://dev.to/lostalien96</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871299%2F7df34576-0b5e-4b4b-99ca-9541b36bc116.png</url>
      <title>DEV Community: LostAlien96</title>
      <link>https://dev.to/lostalien96</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lostalien96"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Whisper, Ollama &amp; Gradio</title>
      <dc:creator>LostAlien96</dc:creator>
      <pubDate>Fri, 10 Apr 2026 08:37:39 +0000</pubDate>
      <link>https://dev.to/lostalien96/building-a-voice-controlled-local-ai-agent-with-whisper-ollama-gradio-15dn</link>
      <guid>https://dev.to/lostalien96/building-a-voice-controlled-local-ai-agent-with-whisper-ollama-gradio-15dn</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent with Whisper, Ollama &amp;amp; Gradio
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;What if you could control your computer just by speaking to it — and have the AI run entirely on your own machine, with no cloud, no API costs, and no data leaving your device?&lt;/p&gt;

&lt;p&gt;That's exactly what I built for my Mem0 internship assignment: a fully local voice-controlled AI agent that transcribes speech, understands intent, and executes real actions like writing code, creating files, and summarizing text — all through a clean web UI.&lt;/p&gt;


&lt;h2&gt;What It Does&lt;/h2&gt;

&lt;p&gt;You speak (or upload audio). The agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transcribes your speech using Whisper&lt;/li&gt;
&lt;li&gt;Classifies your intent using a local LLM&lt;/li&gt;
&lt;li&gt;Executes the action — writes code, creates files, summarizes text, or chats&lt;/li&gt;
&lt;li&gt;Shows the full pipeline result in a Gradio UI&lt;/li&gt;
&lt;li&gt;Asks for confirmation before writing any file (human-in-the-loop)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: Say &lt;em&gt;"Write a Python retry decorator and save it to retry.py"&lt;/em&gt; — the agent generates the code and saves it to your local &lt;code&gt;output/&lt;/code&gt; folder.&lt;/p&gt;
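&lt;p&gt;For a concrete picture, here is one plausible shape of the generated &lt;code&gt;retry.py&lt;/code&gt;; the signature and retry policy below are illustrative assumptions, not the agent's guaranteed output:&lt;/p&gt;

```python
# A plausible sketch of the retry.py output for the spoken example above;
# the signature (times, delay) and the retry policy are assumptions.
import functools
import time

def retry(times=3, delay=0.0):
    """Re-run the wrapped function, re-raising after the final failure."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```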


&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;pre&gt;&lt;code&gt;Audio Input (mic / file upload)
         │
         ▼
┌──────────────────┐
│   Whisper STT    │  HuggingFace transformers (local) or Groq API fallback
└────────┬─────────┘
         │ transcript text
         ▼
┌──────────────────────┐
│  Intent Classifier   │  Ollama (llama3.2) → returns structured JSON
└────────┬─────────────┘
         │ [{"intent": "write_code", "params": {...}}]
         ▼
┌──────────────────────────────────┐
│        Agent Orchestrator        │
│  • write_code   → generate + save│
│  • create_file  → write to disk  │
│  • summarize    → bullet points  │
│  • general_chat → conversation   │
└────────────────┬─────────────────┘
                 │
                 ▼
          Gradio Web UI
                 │
                 ▼
   output/ folder (sandboxed)
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Models I Chose and Why&lt;/h2&gt;

&lt;h3&gt;Speech-to-Text: Whisper (openai/whisper-base)&lt;/h3&gt;

&lt;p&gt;I used the HuggingFace &lt;code&gt;transformers&lt;/code&gt; pipeline with &lt;code&gt;openai/whisper-base&lt;/code&gt;. It's a compact model (~74M parameters), runs on CPU without a GPU, auto-detects CUDA if available, and has excellent accuracy for English commands. For machines where local inference is too slow, I built in a fallback to Groq's Whisper API (set &lt;code&gt;STT_BACKEND=groq&lt;/code&gt; in &lt;code&gt;.env&lt;/code&gt;) — near-instant transcription for free.&lt;/p&gt;
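&lt;p&gt;The backend switch itself can stay tiny. A minimal sketch, assuming only the &lt;code&gt;STT_BACKEND&lt;/code&gt; variable from above (the helper name is mine):&lt;/p&gt;

```python
# Minimal sketch of the STT backend switch; STT_BACKEND=groq comes from
# the article's .env convention, the function name is a made-up helper.
import os

def pick_stt_backend(env=None):
    """Return 'groq' when STT_BACKEND=groq is set, else 'local' Whisper."""
    env = os.environ if env is None else env
    return "groq" if env.get("STT_BACKEND", "").lower() == "groq" else "local"
```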
&lt;h3&gt;LLM: llama3.2 via Ollama&lt;/h3&gt;

&lt;p&gt;For intent classification and code generation I used Ollama running &lt;code&gt;llama3.2&lt;/code&gt; locally. The key design decision was making the LLM return &lt;strong&gt;structured JSON&lt;/strong&gt; for intent routing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a retry decorator function"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made tool routing deterministic and reliable. Temperature is set to 0.1 for classification (near-deterministic output) and 0.3 for code generation (slightly more varied but still focused).&lt;/p&gt;
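&lt;p&gt;As a sketch of how those temperatures map onto requests, here is a hypothetical payload builder for Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint; only the model name and temperature values come from the project:&lt;/p&gt;

```python
# Hypothetical request-payload builder for Ollama's /api/generate endpoint;
# the two temperatures mirror the values described above.
def build_ollama_payload(prompt, task):
    """Lower temperature for intent classification, higher for codegen."""
    temperature = 0.1 if task == "classify" else 0.3
    return {
        "model": "llama3.2",
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }
```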




&lt;h2&gt;Key Features&lt;/h2&gt;

&lt;h3&gt;Compound Commands&lt;/h3&gt;

&lt;p&gt;The LLM returns a &lt;strong&gt;list&lt;/strong&gt; of intents, not just one. So saying &lt;em&gt;"Summarize this and save it to summary.txt"&lt;/em&gt; triggers two intents in sequence — summarize first, then create the file with the summary as content.&lt;/p&gt;
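&lt;p&gt;That sequencing can be sketched as follows; the way the summary flows into the file content is my assumption of how the intents compose, and the handler bodies are stand-ins:&lt;/p&gt;

```python
# Sketch of sequential intent execution for compound commands; chaining
# the summary into create_file's content is an assumption on my part.
def run_intents(intents, text):
    """Execute intents in order; later intents may reuse earlier output."""
    last_output = None
    results = []
    for item in intents:
        if item["intent"] == "summarize":
            last_output = "- " + text.strip()  # stand-in for the LLM summary
        elif item["intent"] == "create_file":
            content = item["params"].get("content") or last_output
            last_output = {"file": item["params"]["filename"], "content": content}
        results.append(last_output)
    return results
```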

&lt;h3&gt;Human-in-the-Loop&lt;/h3&gt;

&lt;p&gt;Before any file is written to disk, the UI shows a confirmation prompt. The user must click "Yes, proceed" or "Cancel". This can be turned off with the auto-confirm checkbox for power users.&lt;/p&gt;
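&lt;p&gt;The gate itself is a few lines. A minimal sketch, where &lt;code&gt;auto_confirm&lt;/code&gt; mirrors the checkbox and the injectable &lt;code&gt;ask&lt;/code&gt; callback stands in for the Gradio buttons:&lt;/p&gt;

```python
# Minimal human-in-the-loop gate; in the real UI the answer comes from
# "Yes, proceed" / "Cancel" buttons rather than a text prompt.
def confirm_write(path, auto_confirm=False, ask=input):
    """Return True only when the user (or auto-confirm) approves a write."""
    if auto_confirm:
        return True
    answer = ask(f"Write {path}? [y/N] ")
    return answer.strip().lower() == "y"
```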

&lt;h3&gt;Safety Sandbox&lt;/h3&gt;

&lt;p&gt;All file writes are strictly limited to the &lt;code&gt;output/&lt;/code&gt; folder using path sanitization. Voice commands can't write outside this folder — path traversal attempts are stripped before the path is ever used.&lt;/p&gt;
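&lt;p&gt;The exact rules in this sketch are my assumptions, since the sanitizer isn't shown above: keep only the bare filename, then double-check the resolved path stays under &lt;code&gt;output/&lt;/code&gt;:&lt;/p&gt;

```python
# Sketch of a path sandbox; the basename-only rule and the final
# containment check are my assumptions about the sanitization step.
import os

OUTPUT_DIR = os.path.abspath("output")

def safe_path(filename):
    """Confine any requested filename to the output/ folder."""
    # Strip directories (and any ../ traversal) down to the bare name.
    name = os.path.basename(filename.replace("\\", "/"))
    target = os.path.abspath(os.path.join(OUTPUT_DIR, name))
    # Belt and braces: refuse anything that still resolves outside.
    if not target.startswith(OUTPUT_DIR + os.sep):
        raise ValueError("path escapes the sandbox")
    return target
```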

&lt;h3&gt;Graceful Degradation&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If Ollama is unreachable → falls back to keyword-based intent matching&lt;/li&gt;
&lt;li&gt;If local Whisper is too slow → falls back to Groq API&lt;/li&gt;
&lt;li&gt;If audio is unintelligible → shows a friendly error, no crash&lt;/li&gt;
&lt;/ul&gt;
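&lt;p&gt;The keyword fallback can be as simple as a lookup table. A hedged sketch; the keyword list below is illustrative, not the project's actual table:&lt;/p&gt;

```python
# Sketch of the keyword-based fallback used when Ollama is unreachable;
# the keyword table here is illustrative, not the project's real list.
KEYWORDS = {
    "write_code": ("write", "code", "function", "script"),
    "create_file": ("create", "save", "file"),
    "summarize": ("summarize", "summary", "shorten"),
}

def fallback_intent(transcript):
    """Return a one-intent list in the same shape the LLM would produce."""
    lowered = transcript.lower()
    for intent, needles in KEYWORDS.items():
        if any(needle in lowered for needle in needles):
            return [{"intent": intent, "params": {}}]
    return [{"intent": "general_chat", "params": {}}]
```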




&lt;h2&gt;Challenges I Faced&lt;/h2&gt;

&lt;h3&gt;1. Gradio 6.0 Breaking Changes&lt;/h3&gt;

&lt;p&gt;Gradio 6 moved &lt;code&gt;theme&lt;/code&gt; and &lt;code&gt;css&lt;/code&gt; from &lt;code&gt;gr.Blocks()&lt;/code&gt; to &lt;code&gt;launch()&lt;/code&gt;, removed &lt;code&gt;show_download_button&lt;/code&gt; from &lt;code&gt;gr.Audio()&lt;/code&gt;, and changed the &lt;code&gt;Chatbot&lt;/code&gt; component's message format. Each of these threw a &lt;code&gt;TypeError&lt;/code&gt; at startup and had to be fixed one by one by reading the changelog.&lt;/p&gt;

&lt;h3&gt;2. Python 3.13 venv Bug&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;python -m venv&lt;/code&gt; fails on Python 3.13 on Windows due to a pip bootstrap issue. The fix was to skip the venv entirely and install packages directly with &lt;code&gt;pip install&lt;/code&gt; — perfectly fine for a development setup.&lt;/p&gt;

&lt;h3&gt;3. Ollama Port Conflict&lt;/h3&gt;

&lt;p&gt;On Windows, Ollama auto-starts as a background service after installation. Running &lt;code&gt;ollama serve&lt;/code&gt; manually throws a port conflict error. The solution: don't run it manually — it's already running.&lt;/p&gt;

&lt;h3&gt;4. Structured JSON from LLM&lt;/h3&gt;

&lt;p&gt;Getting the LLM to reliably return valid JSON (no markdown fences, no preamble) required careful prompt engineering. The system prompt explicitly says "Respond ONLY with a valid JSON array — no markdown, no explanation" and uses &lt;code&gt;temperature: 0.1&lt;/code&gt; to keep the output format stable.&lt;/p&gt;
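&lt;p&gt;Even with that prompt, it pays to parse defensively. A small sketch that tolerates stray prose around the array (the helper is mine, not the project's code):&lt;/p&gt;

```python
# Defensive parsing of the LLM's JSON reply; slicing from the first '['
# to the last ']' is my workaround for occasional wrapper text.
import json

def parse_intents(raw):
    """Extract the JSON intent array even if the model added prose."""
    start = raw.find("[")
    end = raw.rfind("]") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON array in model output")
    return json.loads(raw[start:end])
```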




&lt;h2&gt;What I'd Improve Next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses&lt;/strong&gt; — show LLM output token by token instead of 
waiting for the full response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wake word detection&lt;/strong&gt; — always-on listening with a trigger word&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More tools&lt;/strong&gt; — web search, calendar integration, running shell commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice feedback&lt;/strong&gt; — text-to-speech so the agent speaks its response back&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model benchmarking&lt;/strong&gt; — compare Whisper tiny vs base vs large latency 
on the same hardware&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Try It Yourself&lt;/h2&gt;

&lt;p&gt;The full source code is on GitHub:&lt;br&gt;
👉 &lt;a href="https://github.com/LostAlien96/voice-ai-agent" rel="noopener noreferrer"&gt;https://github.com/LostAlien96/voice-ai-agent&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Requirements: Python 3.10+, Ollama, 8GB RAM. Setup takes about 10 minutes following the README.&lt;/p&gt;




&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Building this taught me how quickly local AI has matured. Running production-quality speech recognition and a capable LLM entirely on a laptop — with no internet required after setup — would have seemed impressive just two years ago. Today it takes an afternoon.&lt;/p&gt;

&lt;p&gt;The most interesting design challenge wasn't the ML part — it was making the system reliable: structured outputs, graceful fallbacks, safe file sandboxing, and a UI that guides the user through confirmation steps. That's where the real engineering work lives.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
