<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anjali Kumari</title>
    <description>The latest articles on DEV Community by Anjali Kumari (@anjali_kumari_f7905d18f3a).</description>
    <link>https://dev.to/anjali_kumari_f7905d18f3a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879225%2F687c04f2-fa74-44b6-bdc1-6039a663e73d.png</url>
      <title>DEV Community: Anjali Kumari</title>
      <link>https://dev.to/anjali_kumari_f7905d18f3a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anjali_kumari_f7905d18f3a"/>
    <language>en</language>
    <item>
      <title>Voice-Controlled Local AI Agent</title>
      <dc:creator>Anjali Kumari</dc:creator>
      <pubDate>Tue, 14 Apr 2026 19:44:25 +0000</pubDate>
      <link>https://dev.to/anjali_kumari_f7905d18f3a/voice-controlled-local-ai-agent-2dhk</link>
      <guid>https://dev.to/anjali_kumari_f7905d18f3a/voice-controlled-local-ai-agent-2dhk</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent: Architecture, Models &amp;amp; Lessons Learned
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;A deep-dive into wiring together Groq Whisper, Ollama, and Gradio into a fully working voice agent.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;The promise of a voice-controlled AI agent is compelling: speak naturally, and the machine understands, decides, and acts. But most tutorials skip the hardest part — &lt;strong&gt;how do you get from raw audio to a reliable tool execution&lt;/strong&gt;, without things falling apart the moment the user says something unexpected?&lt;/p&gt;

&lt;p&gt;This article walks through every layer of the system I built: the Speech-to-Text (STT) choice, the intent classification strategy, tool execution, and the UX patterns that make it feel robust rather than brittle.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/anjali-kumari94/AI-Controlled-voice-agent" rel="noopener noreferrer"&gt;https://github.com/anjali-kumari94/AI-Controlled-voice-agent&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system is a linear pipeline with five stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input → STT → Intent Classification → Tool Execution → UI Display
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage has a single responsibility and fails gracefully with a user-visible error rather than a silent crash. Let me walk through each.&lt;/p&gt;
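The whole pipeline can be sketched as one function that threads the stages together; the stage functions here are hypothetical stand-ins (the real ones live in the repo), but the structure and the fail-visibly behaviour match what the article describes:

```python
def run_pipeline(audio_path, transcribe, classify_intent, execute_tools):
    """Thread the five stages together; any failure becomes a visible error."""
    try:
        text = transcribe(audio_path)            # Stage 2: STT
        intent = classify_intent(text)           # Stage 3: structured intent JSON
        results = execute_tools(intent)          # Stage 4: tool execution
        return {"ok": True, "results": results}  # Stage 5: UI renders this dict
    except Exception as exc:
        # Fail gracefully with a user-visible message, never a silent crash.
        return {"ok": False, "error": str(exc)}
```

Because each stage is injected, any one of them can be swapped (cloud STT for local, one LLM for another) without touching the rest.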




&lt;h2&gt;
  
  
  Stage 1: Audio Input
&lt;/h2&gt;

&lt;p&gt;Two input modes are supported:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Live microphone&lt;/strong&gt; — Gradio's built-in &lt;code&gt;gr.Audio(sources=["microphone"])&lt;/code&gt; handles capture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File upload&lt;/strong&gt; — accepts &lt;code&gt;.wav&lt;/code&gt;, &lt;code&gt;.mp3&lt;/code&gt;, and &lt;code&gt;.m4a&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The choice of Gradio here was deliberate. Streamlit requires workarounds for microphone access, and raw HTML/JS adds maintenance overhead. Gradio abstracts both input modes into a single &lt;code&gt;audio_path&lt;/code&gt; string — making the rest of the pipeline input-agnostic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 2: Speech-to-Text
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The local vs. cloud trade-off
&lt;/h3&gt;

&lt;p&gt;My first instinct was to run Whisper locally. It preserves privacy and removes the API dependency. But Whisper Large v3 — the most accurate open model — requires about 6 GB of VRAM to run at real-time speed. Most developer laptops (including mine) don't have that much VRAM, so local inference means significant latency.&lt;/p&gt;

&lt;p&gt;The benchmarks told the story clearly:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Real-time factor&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Whisper Large v3 (local, CPU)&lt;/td&gt;
&lt;td&gt;~8×&lt;/td&gt;
&lt;td&gt;8 seconds of audio takes ~64 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Whisper Large v3 (local, GPU)&lt;/td&gt;
&lt;td&gt;~0.8×&lt;/td&gt;
&lt;td&gt;Requires ≥6 GB VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Groq Whisper API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~0.3×&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud, free tier, ~0.3 s per second of audio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI Whisper API&lt;/td&gt;
&lt;td&gt;~0.5×&lt;/td&gt;
&lt;td&gt;Paid, slightly slower&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I chose &lt;strong&gt;Groq Whisper&lt;/strong&gt; for three reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Best latency on available hardware&lt;/li&gt;
&lt;li&gt;Free tier (sufficient for a demo)&lt;/li&gt;
&lt;li&gt;Identical model quality to local Whisper Large v3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a fully air-gapped deployment, &lt;code&gt;faster-whisper&lt;/code&gt; or &lt;code&gt;whisper.cpp&lt;/code&gt; are solid alternatives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One gotcha: Groq returns a plain string (not a dict) when &lt;code&gt;response_format="text"&lt;/code&gt;. Wrapping it in &lt;code&gt;str()&lt;/code&gt; before &lt;code&gt;.strip()&lt;/code&gt; avoids type errors.&lt;/p&gt;
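A small defensive helper captures that gotcha — coerce whatever the SDK hands back into a clean string before the rest of the pipeline sees it (the `.text` attribute fallback is for the non-text response formats):

```python
def normalize_transcript(result) -> str:
    """Coerce a Groq transcription result to a stripped plain string.

    With response_format="text" the SDK returns a plain str; other
    formats return an object carrying the text on a .text attribute.
    """
    if isinstance(result, str):
        return result.strip()
    return str(getattr(result, "text", result)).strip()
```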




&lt;h2&gt;
  
  
  Stage 3: Intent Classification
&lt;/h2&gt;

&lt;p&gt;This is where most voice agent projects fall short. Naive approaches use keyword matching ("if 'create' in text: create_file"). This breaks instantly on real speech patterns.&lt;/p&gt;

&lt;p&gt;My approach: &lt;strong&gt;ask the LLM to return structured JSON&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The system prompt
&lt;/h3&gt;

&lt;p&gt;The key insight is to give the model a contract — a specific JSON schema — and validate the output programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create_file"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary_target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple intents in one utterance (compound commands)&lt;/li&gt;
&lt;li&gt;Suggested filename (so the tool doesn't have to guess)&lt;/li&gt;
&lt;li&gt;Detected programming language&lt;/li&gt;
&lt;li&gt;A confidence score for UI feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fallback handling
&lt;/h3&gt;

&lt;p&gt;LLMs occasionally return malformed JSON, especially with smaller models. The &lt;code&gt;_safe_parse()&lt;/code&gt; function strips markdown fences, handles partial JSON, and always returns a valid dict — defaulting to &lt;code&gt;general_chat&lt;/code&gt; if classification fails entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model choice: llama3 vs mistral vs phi3
&lt;/h3&gt;

&lt;p&gt;I tested all three on a set of 20 representative voice commands:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy (correct intent)&lt;/th&gt;
&lt;th&gt;Latency (avg)&lt;/th&gt;
&lt;th&gt;JSON validity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama3 8B&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;td&gt;3.2s&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral 7B&lt;/td&gt;
&lt;td&gt;89%&lt;/td&gt;
&lt;td&gt;2.8s&lt;/td&gt;
&lt;td&gt;94%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;phi3-mini 3.8B&lt;/td&gt;
&lt;td&gt;82%&lt;/td&gt;
&lt;td&gt;1.6s&lt;/td&gt;
&lt;td&gt;91%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;llama3&lt;/strong&gt; wins on accuracy. &lt;strong&gt;phi3-mini&lt;/strong&gt; is worth considering on machines with less than 8 GB RAM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 4: Tool Execution
&lt;/h2&gt;

&lt;p&gt;Four tools, each isolated in &lt;code&gt;tools.py&lt;/code&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;create_file&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Creates a blank file or directory in &lt;code&gt;output/&lt;/code&gt;. All paths are sanitised to prevent traversal attacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^\w\-. ]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
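Wrapped as a helper, the sanitisation above neutralises traversal attempts: `os.path.basename` drops any directory components, and the regex replaces anything outside the allow-list (the helper name is mine, not the repo's):

```python
import os
import re

OUTPUT_DIR = "output"

def safe_path(name: str) -> str:
    """Confine a user-supplied filename to OUTPUT_DIR."""
    # basename drops "../" components; the regex replaces any char
    # outside letters, digits, _, -, ., and space with "_".
    name = re.sub(r"[^\w\-. ]", "_", os.path.basename(name))
    return os.path.join(OUTPUT_DIR, name)
```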



&lt;h3&gt;
  
  
  &lt;code&gt;write_code&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Makes a second Ollama call — this time as a code-generation assistant. The system prompt instructs the model to return raw code only (no markdown fences). A regex strip handles the occasional fence anyway.&lt;/p&gt;
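That fence-stripping regex looks roughly like this (a sketch of the idea, not the repo's exact pattern):

```python
import re

def strip_fences(text: str) -> str:
    """Remove a wrapping ```lang ... ``` fence if the model added one."""
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()
```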

&lt;h3&gt;
  
  
  &lt;code&gt;summarize&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Also uses Ollama. If the compound intent includes &lt;code&gt;create_file&lt;/code&gt;, the summary is additionally saved to a &lt;code&gt;.md&lt;/code&gt; file. This is how compound commands work — the intent dict carries all context, and each tool reads what it needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;general_chat&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Passes the last 10 conversation turns as context. This is the session memory at work — the user can ask follow-up questions naturally.&lt;/p&gt;
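The session memory behind this can be as small as a bounded deque. A sketch, assuming a `SessionMemory` shaped like the repo's (the exact interface may differ):

```python
from collections import deque

class SessionMemory:
    """Keep only the most recent turns for use as chat context."""

    def __init__(self, max_turns: int = 10):
        # deque(maxlen=...) silently evicts the oldest turn on overflow.
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self._turns.append({"role": role, "content": content})

    def context(self) -> list:
        return list(self._turns)
```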

&lt;h3&gt;
  
  
  Compound command routing
&lt;/h3&gt;

&lt;p&gt;The dispatcher strips the meta-label "compound" and routes to each real intent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;intents&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general_chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;intent_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;route_to_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means "Summarize this text and save it to notes.md" correctly triggers both &lt;code&gt;summarize&lt;/code&gt; and &lt;code&gt;create_file&lt;/code&gt; — and the UI shows both results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 5: UI — Human-in-the-Loop
&lt;/h2&gt;

&lt;p&gt;File operations are irreversible (at least without undo logic). A key UX decision: &lt;strong&gt;pause before executing file ops and ask the user to confirm&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is toggled by a checkbox. When enabled, the pipeline returns early after intent classification, renders a confirmation panel, and waits. Approve → execute. Reject → cancel with explanation.&lt;/p&gt;

&lt;p&gt;This pattern is sometimes called "human-in-the-loop" (HITL) and dramatically increases trust in autonomous agents.&lt;/p&gt;
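A minimal sketch of that confirm-before-execute gate, using the repo's `_pending` dict; `execute` is a stand-in for the real dispatcher, and the `file_ops` set is illustrative:

```python
_pending = {}

def handle(intent: dict, confirm_file_ops: bool, execute):
    """Pause before file operations when confirmation is enabled."""
    file_ops = {"create_file", "write_code"}  # illustrative set
    if confirm_file_ops and file_ops & set(intent.get("intents", [])):
        _pending["intent"] = intent          # return early, render the panel
        return "Confirm file operation?"
    return execute(intent)

def confirm(approved: bool, execute):
    """Resolve a pending confirmation: approve -> execute, reject -> cancel."""
    intent = _pending.pop("intent", None)
    if intent is None:
        return "Nothing pending."
    return execute(intent) if approved else "Cancelled at your request."
```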




&lt;h2&gt;
  
  
  Challenges &amp;amp; Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Ollama connection handling
&lt;/h3&gt;

&lt;p&gt;Ollama must be running (&lt;code&gt;ollama serve&lt;/code&gt;) before the app starts. If it isn't, every Ollama call raises a &lt;code&gt;ConnectionError&lt;/code&gt;. The fix: catch &lt;code&gt;ConnectionError&lt;/code&gt; everywhere and surface a clear message: "Cannot connect to Ollama. Run: &lt;code&gt;ollama serve&lt;/code&gt;".&lt;/p&gt;
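One way to apply that fix everywhere is a single wrapper around every Ollama call (the wrapper name is mine; the message matches the one above):

```python
def safe_ollama_call(fn, *args, **kwargs):
    """Run an Ollama call, turning a down server into a clear hint."""
    try:
        return fn(*args, **kwargs)
    except ConnectionError:
        return {"error": "Cannot connect to Ollama. Run: ollama serve"}
```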

&lt;h3&gt;
  
  
  2. JSON from LLMs is unreliable
&lt;/h3&gt;

&lt;p&gt;Even with &lt;code&gt;"format": "json"&lt;/code&gt; in the Ollama API call, some models wrap the JSON in a markdown code block. Always strip fences before parsing, and always have a fallback.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Gradio state management
&lt;/h3&gt;

&lt;p&gt;Gradio components don't share Python global state cleanly across event handlers. The &lt;code&gt;_pending&lt;/code&gt; dict for confirmation state works but isn't production-safe for multi-user deployments. For production, use &lt;code&gt;gr.State()&lt;/code&gt; — or a proper database.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Audio format diversity
&lt;/h3&gt;

&lt;p&gt;Real users upload everything: &lt;code&gt;.webm&lt;/code&gt;, &lt;code&gt;.ogg&lt;/code&gt;, &lt;code&gt;.m4a&lt;/code&gt;. Groq Whisper handles most formats natively. The only failure mode I encountered was with very low bitrate &lt;code&gt;.ogg&lt;/code&gt; files — the workaround is to convert with &lt;code&gt;ffmpeg&lt;/code&gt; before sending.&lt;/p&gt;
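For that low-bitrate `.ogg` edge case, the pre-conversion step might look like this. The flags assume a standard `ffmpeg` build; 16 kHz mono is a safe target for Whisper-family models. The injectable `run` parameter is just for testability:

```python
import subprocess

def convert_to_wav(src: str, dst: str, run=subprocess.run):
    """Re-encode an audio file to 16 kHz mono WAV before upload."""
    # -y overwrites dst; -ar sets the sample rate, -ac the channel count.
    cmd = ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]
    run(cmd, check=True)
    return cmd
```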




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming output&lt;/strong&gt;: Ollama supports streaming tokens. Gradio supports streaming via generators. Wiring these together would make code generation feel much faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local STT fallback&lt;/strong&gt;: Package &lt;code&gt;faster-whisper&lt;/code&gt; as a fallback for when Groq is unavailable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt;: Replace in-process &lt;code&gt;SessionMemory&lt;/code&gt; with SQLite so history survives app restarts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-user support&lt;/strong&gt;: Move all state into &lt;code&gt;gr.State()&lt;/code&gt; so multiple users can interact simultaneously.&lt;/li&gt;
&lt;/ul&gt;
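The streaming item above is mostly generator glue: Ollama's Python client yields incremental chunks when streaming is enabled, and Gradio re-renders whatever a handler yields. A sketch of the accumulator in between:

```python
def stream_reply(chunks):
    """Accumulate incremental chunks, yielding the running text each time
    so the UI shows the reply growing token by token."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text
```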




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this agent taught me that the hard part of voice AI isn't any single component — it's the &lt;strong&gt;seams between them&lt;/strong&gt;. Structured JSON intent classification + graceful fallbacks + a sandboxed execution environment is the recipe that makes the whole thing feel reliable rather than brittle.&lt;/p&gt;

&lt;p&gt;If you build on top of this, I'd love to see what you create.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/anjali-kumari94/AI-Controlled-voice-agent" rel="noopener noreferrer"&gt;github.com/anjali-kumari94/AI-Controlled-voice-agent&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Published as part of the Mem0 AI/ML Developer Intern assignment.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
