<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akash Khare</title>
    <description>The latest articles on DEV Community by Akash Khare (@akash_khare_ad5055a427660).</description>
    <link>https://dev.to/akash_khare_ad5055a427660</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871407%2F649c2f36-dbde-4a2b-beea-49bd47129466.png</url>
      <title>DEV Community: Akash Khare</title>
      <link>https://dev.to/akash_khare_ad5055a427660</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akash_khare_ad5055a427660"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent: Architecture, Models &amp; Lessons Learned</title>
      <dc:creator>Akash Khare</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:40:01 +0000</pubDate>
      <link>https://dev.to/akash_khare_ad5055a427660/building-a-voice-controlled-local-ai-agent-architecture-models-lessons-learned-2le0</link>
      <guid>https://dev.to/akash_khare_ad5055a427660/building-a-voice-controlled-local-ai-agent-architecture-models-lessons-learned-2le0</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent: Architecture, Models &amp;amp; Lessons Learned
&lt;/h1&gt;

&lt;p&gt;I built a voice agent that listens to your commands, understands your intent, and executes real actions on your machine — creating files, writing code, summarizing text, and more. Here's exactly how I built it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What It Does
&lt;/h2&gt;

&lt;p&gt;You speak (or upload audio). The system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transcribes your audio using Whisper&lt;/li&gt;
&lt;li&gt;Classifies your intent using an LLM (Claude / GPT-4 / Ollama)&lt;/li&gt;
&lt;li&gt;Executes the action locally (creates files, generates code, summarizes text)&lt;/li&gt;
&lt;li&gt;Shows you the full pipeline result in a clean Streamlit UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example: Say &lt;em&gt;"Create a Python file with a retry decorator"&lt;/em&gt; → the agent generates the code and saves it to &lt;code&gt;output/retry_decorator.py&lt;/code&gt; automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input (mic or file)
       ↓
STT Module (Whisper)  →  transcript text
       ↓
Intent Module (LLM)   →  { intent, parameters }
       ↓
Executor              →  file / code / summarize / chat
       ↓
Session Memory        →  context for follow-up commands
       ↓
Streamlit UI          →  shows all 4 pipeline steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each layer is modular — swap the STT provider, LLM, or executor without touching anything else.&lt;/p&gt;
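&lt;p&gt;The wiring between the stages can be sketched in a few lines of Python. Note the stage functions below are hypothetical stubs standing in for the real modules, not the project's actual code:&lt;/p&gt;

```python
# Minimal sketch of the four-stage pipeline. Each stage is a plain
# function, so swapping a provider means swapping one callable.

def transcribe(audio_path):
    # Stage 1: STT (Whisper in the real project); stubbed here.
    return "create a python file with a retry decorator"

def classify(transcript, history):
    # Stage 2: intent classification via an LLM; stubbed here.
    return {"intent": "write_code",
            "parameters": {"filename": "retry_decorator.py"}}

def execute(intent_result):
    # Stage 3: route the intent to its handler.
    return f"executed {intent_result['intent']}"

def run_pipeline(audio_path, history):
    transcript = transcribe(audio_path)
    intent_result = classify(transcript, history)
    result = execute(intent_result)
    # Stage 4: session memory records context for follow-up commands.
    history.append({"transcript": transcript, "result": result})
    return transcript, intent_result, result
```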




&lt;h2&gt;
  
  
  Stage 1: Speech-to-Text
&lt;/h2&gt;

&lt;p&gt;I implemented three STT options in &lt;code&gt;src/stt.py&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI Whisper API&lt;/strong&gt; (default) — 1-2s latency, no GPU needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groq Whisper API&lt;/strong&gt; — even faster, generous free tier&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace Whisper (local)&lt;/strong&gt; — fully offline, uses &lt;code&gt;openai/whisper-base&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why I defaulted to cloud Whisper instead of local
&lt;/h3&gt;

&lt;p&gt;Running &lt;code&gt;whisper-base&lt;/code&gt; on CPU takes 20–40 seconds per 10-second clip. &lt;code&gt;whisper-large-v3&lt;/code&gt; takes several minutes without a GPU. For a responsive demo, that latency is a dealbreaker.&lt;/p&gt;

&lt;p&gt;The cloud API returns in ~1-2 seconds regardless of hardware. The local HuggingFace option is still available via a sidebar toggle for users with a GPU.&lt;/p&gt;
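&lt;p&gt;One way to implement that toggle is ordered fallback: try each configured provider in preference order and move on when one fails. A sketch, with stand-in provider callables rather than the real &lt;code&gt;src/stt.py&lt;/code&gt; API:&lt;/p&gt;

```python
# Sketch of STT provider selection with graceful fallback.
# Both providers are simulated stubs for illustration.

def openai_whisper(audio):
    raise RuntimeError("no API key")          # simulate a cloud failure

def local_whisper(audio):
    return "hello world"                      # simulate a local result

PROVIDERS = [openai_whisper, local_whisper]   # preferred order first

def transcribe(audio):
    errors = []
    for provider in PROVIDERS:
        try:
            return provider(audio)
        except Exception as exc:
            errors.append(f"{provider.__name__}: {exc}")
    raise RuntimeError("all STT providers failed: " + "; ".join(errors))
```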

&lt;h3&gt;
  
  
  STT Benchmark
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Latency (10s clip)&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper API&lt;/td&gt;
&lt;td&gt;~1-2s&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;$0.006/min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Groq Whisper&lt;/td&gt;
&lt;td&gt;~0.5-1s&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HuggingFace local (CPU)&lt;/td&gt;
&lt;td&gt;~30-40s&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HuggingFace local (GPU)&lt;/td&gt;
&lt;td&gt;~2-3s&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Stage 2: Intent Classification
&lt;/h2&gt;

&lt;p&gt;The intent module (&lt;code&gt;src/intent.py&lt;/code&gt;) sends the transcript to an LLM with a structured system prompt and gets back a JSON object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry_decorator.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry decorator function"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.97&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Supported Intents
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Intent&lt;/th&gt;
&lt;th&gt;Example Command&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Create a Python retry function"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Make a new file called notes.txt"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;summarize_text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Summarize this article: ..."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"What is recursion?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;create_folder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"Create a folder called experiments"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;list_files&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;"What files have been created?"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
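&lt;p&gt;A table like this can feed directly into the classifier's system prompt, so adding an intent is a one-line change. The wording below is illustrative, not the project's actual prompt:&lt;/p&gt;

```python
# Sketch: bake the supported-intent list into the system prompt.

INTENTS = {
    "write_code": "generate and save source code",
    "create_file": "write content to a new file",
    "summarize_text": "summarize the provided text",
    "general_chat": "answer conversationally",
    "create_folder": "create a directory",
    "list_files": "list files created so far",
}

def build_system_prompt():
    lines = [f"- {name}: {desc}" for name, desc in INTENTS.items()]
    return (
        "Classify the user's command into one or more intents.\n"
        'Respond with JSON only: {"intents": [...], '
        '"parameters": {...}, "confidence": 0.0}\n'
        "Supported intents:\n" + "\n".join(lines)
    )
```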

&lt;h3&gt;
  
  
  LLM Benchmark
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Intent Accuracy&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude (Anthropic)&lt;/td&gt;
&lt;td&gt;~1-2s&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;~1-2s&lt;/td&gt;
&lt;td&gt;~95%&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama llama3 (local)&lt;/td&gt;
&lt;td&gt;~3-8s&lt;/td&gt;
&lt;td&gt;~88%&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Claude gave the most reliable structured JSON output with the fewest parsing errors, especially on compound commands.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Tool Execution
&lt;/h2&gt;

&lt;p&gt;The executor (&lt;code&gt;src/executor.py&lt;/code&gt;) routes each intent to its handler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;write_code&lt;/strong&gt; — prompts the LLM with a code-generation system prompt, strips markdown fences, saves to &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;create_file&lt;/strong&gt; — writes content directly to &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;summarize_text&lt;/strong&gt; — calls LLM with a summarizer prompt, optionally saves output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;general_chat&lt;/strong&gt; — returns a conversational response&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are sandboxed to the &lt;code&gt;output/&lt;/code&gt; folder. Filenames are sanitized to prevent path traversal.&lt;/p&gt;
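&lt;p&gt;A sketch of that sandboxing (the helper name &lt;code&gt;safe_path&lt;/code&gt; is hypothetical): keep only the basename, then double-check the resolved path still lives under &lt;code&gt;output/&lt;/code&gt;:&lt;/p&gt;

```python
import os

OUTPUT_DIR = "output"

def safe_path(filename):
    # Keep only the basename, dropping any directory components the
    # model (or a malicious transcript) may have injected.
    name = os.path.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError(f"invalid filename: {filename!r}")
    path = os.path.join(OUTPUT_DIR, name)
    # Belt and braces: resolve and confirm we stayed inside output/.
    root = os.path.realpath(OUTPUT_DIR) + os.sep
    if not os.path.realpath(path).startswith(root):
        raise ValueError(f"path escapes sandbox: {filename!r}")
    return path
```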




&lt;h2&gt;
  
  
  Bonus Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compound Commands
&lt;/h3&gt;

&lt;p&gt;The intent classifier returns an array of intents. If multiple are detected, the executor chains them:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Summarize this and save it to summary.txt"
→ [summarize_text, create_file]
→ Step 1: summarize → Step 2: save output to output/summary.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
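&lt;p&gt;The chaining itself is a small loop that pipes each handler's output into the next step. A sketch with stubbed handler bodies (names illustrative):&lt;/p&gt;

```python
# Sketch of compound-command execution: run intents in order,
# passing each step's output forward as prior_output.

def summarize_text(params, prior_output=None):
    return "a short summary"                  # stubbed LLM call

def create_file(params, prior_output=None):
    content = params.get("content") or prior_output or ""
    return f"saved {len(content)} chars to output/{params['filename']}"

HANDLERS = {"summarize_text": summarize_text, "create_file": create_file}

def run_chain(intents, params):
    output = None
    results = []
    for intent in intents:
        output = HANDLERS[intent](params, prior_output=output)
        results.append((intent, output))
    return results
```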



&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;A sidebar toggle (on by default) shows a confirmation prompt before any file write operation. Users must click "Confirm &amp;amp; Execute" before the agent touches the filesystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session Memory
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;src/memory.py&lt;/code&gt; stores the last 10 commands and their outcomes. This context is passed to the intent classifier on every request, enabling follow-up commands like &lt;em&gt;"now save that to a file"&lt;/em&gt;.&lt;/p&gt;
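&lt;p&gt;A rolling buffer like &lt;code&gt;collections.deque&lt;/code&gt; with &lt;code&gt;maxlen&lt;/code&gt; is a natural fit for this. A sketch (field names illustrative, not the actual &lt;code&gt;src/memory.py&lt;/code&gt;):&lt;/p&gt;

```python
from collections import deque

class SessionMemory:
    # Keeps only the most recent max_items commands; older ones are
    # evicted automatically by the deque.
    def __init__(self, max_items=10):
        self.items = deque(maxlen=max_items)

    def add(self, command, outcome):
        self.items.append({"command": command, "outcome": outcome})

    def as_context(self):
        # Rendered into the intent classifier's prompt on each request.
        return "\n".join(
            f"{i['command']} -> {i['outcome']}" for i in self.items
        )
```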

&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;Every stage is wrapped in try/except. STT failures, LLM API errors, and unrecognized intents all surface in the UI as actionable error messages instead of crashing the app.&lt;/p&gt;




&lt;h2&gt;
  
  
  The UI
&lt;/h2&gt;

&lt;p&gt;Built with Streamlit, styled with a custom dark industrial theme. The UI displays all four pipeline steps as distinct cards:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Transcription&lt;/strong&gt; — what was heard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent&lt;/strong&gt; — colored badge showing detected intent(s) + parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action&lt;/strong&gt; — what the agent did&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt; — the output (code preview, summary, file confirmation)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A session history panel on the right tracks all commands in the current session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Structured JSON from LLMs is fragile&lt;/strong&gt;&lt;br&gt;
LLMs sometimes wrap JSON in markdown fences or add preamble text. I wrote a robust &lt;code&gt;_parse_intent_json()&lt;/code&gt; function that strips fences, handles missing fields, and falls back to &lt;code&gt;general_chat&lt;/code&gt; on parse failure.&lt;/p&gt;
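&lt;p&gt;A tolerant parser along those lines might look like this. It's a sketch of the idea, not the project's exact &lt;code&gt;_parse_intent_json()&lt;/code&gt;:&lt;/p&gt;

```python
import json
import re

def parse_intent_json(raw):
    # Strip markdown fences, ignore any preamble prose, and fall
    # back to general_chat if nothing parses.
    text = re.sub(r"`{3}(?:json)?", "", raw)
    match = re.search(r"\{.*\}", text, re.DOTALL)
    fallback = {"intents": ["general_chat"],
                "parameters": {}, "confidence": 0.0}
    if not match:
        return fallback
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    # Tolerate missing fields rather than crashing downstream.
    data.setdefault("intents", ["general_chat"])
    data.setdefault("parameters", {})
    return data
```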

&lt;p&gt;&lt;strong&gt;2. Compound command detection&lt;/strong&gt;&lt;br&gt;
Getting the LLM to reliably return multiple intents required careful prompt engineering. The system prompt had to explicitly define the &lt;code&gt;intents&lt;/code&gt; array format and give examples of compound commands.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Filename inference&lt;/strong&gt;&lt;br&gt;
When users don't specify a filename ("create a Python retry function"), the agent needs to infer a sensible one. I built &lt;code&gt;_infer_filename()&lt;/code&gt; that strips stop words and constructs something like &lt;code&gt;retry_function.py&lt;/code&gt; from the command description.&lt;/p&gt;
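&lt;p&gt;One simple way to do that inference (a sketch, not the project's exact &lt;code&gt;_infer_filename()&lt;/code&gt;; the stop-word list here is illustrative):&lt;/p&gt;

```python
import re

# Words that describe the request rather than the artifact.
STOP_WORDS = {"a", "an", "the", "create", "make", "write", "please",
              "python", "file", "with", "new", "called", "for", "me"}

def infer_filename(description, extension=".py"):
    # Keep the meaningful words, join with underscores, cap the length.
    words = re.findall(r"[a-z0-9]+", description.lower())
    keep = [w for w in words if w not in STOP_WORDS]
    stem = "_".join(keep[:4]) or "untitled"
    return stem + extension
```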

&lt;p&gt;&lt;strong&gt;4. Local Whisper performance&lt;/strong&gt;&lt;br&gt;
As noted above, local Whisper is impractically slow on CPU. The workaround (cloud API with local fallback) balances usability with the project's preference for local models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/akashkhare315/Agent-ai-voice
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env  &lt;span class="c"&gt;# add your API keys&lt;/span&gt;
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; (or &lt;code&gt;OPENAI_API_KEY&lt;/code&gt;) for intent classification and code generation, plus &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; for Whisper STT.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this agent taught me that the hardest part isn't any single model — it's the glue between them. Reliable JSON parsing, graceful error handling, and thoughtful UX (like human-in-the-loop confirmation) matter just as much as model accuracy.&lt;/p&gt;

&lt;p&gt;The modular architecture means you can upgrade any layer independently: swap Whisper for a better local model when your hardware supports it, switch from Claude to a fully local Ollama model for offline use, or add new intents without touching the rest of the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/akashkhare315/Agent-ai-voice" rel="noopener noreferrer"&gt;https://github.com/akashkhare315/Agent-ai-voice&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>javascript</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
