<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AKSHI SHARMA</title>
    <description>The latest articles on DEV Community by AKSHI SHARMA (@akshisharmaa).</description>
    <link>https://dev.to/akshisharmaa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876532%2F349b0950-5445-4bb8-83a0-8106f4cc0646.png</url>
      <title>DEV Community: AKSHI SHARMA</title>
      <link>https://dev.to/akshisharmaa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akshisharmaa"/>
    <language>en</language>
    <item>
      <title>Voice-Controlled Local AI Agent</title>
      <dc:creator>AKSHI SHARMA</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:27:09 +0000</pubDate>
      <link>https://dev.to/akshisharmaa/voice-controlled-local-ai-agent-10n4</link>
      <guid>https://dev.to/akshisharmaa/voice-controlled-local-ai-agent-10n4</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent: Architecture, Models &amp;amp; Lessons Learned
&lt;/h1&gt;

&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;I recently built a voice-controlled AI agent for the Mem0 Generative AI internship assignment. The system takes an audio command, converts it to text, classifies the user's intent, and executes the appropriate local action — all displayed in a web UI.&lt;/p&gt;

&lt;p&gt;Here's what I built, how it works, and what I learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The pipeline has five stages:&lt;/p&gt;

&lt;p&gt;Audio → STT → Intent Classification → Tool Execution → UI Display&lt;/p&gt;

&lt;p&gt;Each stage is independently swappable, so any component can be replaced (for example, swapping the hosted APIs for local models) without touching the rest of the pipeline.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Primary&lt;/th&gt;
&lt;th&gt;Fallback&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Audio Input&lt;/td&gt;
&lt;td&gt;Gradio mic / file upload&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Groq Whisper API&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent&lt;/td&gt;
&lt;td&gt;Groq llama-3.3-70b-versatile&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools&lt;/td&gt;
&lt;td&gt;Python stdlib&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI&lt;/td&gt;
&lt;td&gt;Gradio 6.x&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
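&lt;p&gt;The wiring between stages can be sketched as a single function that threads each stage's output into the next. The function names and signatures below are illustrative assumptions, not the project's actual code:&lt;/p&gt;

```python
# A minimal sketch of the five-stage wiring. The stage functions are passed
# in as callables, which is what makes each stage independently swappable.
# Names and signatures here are illustrative assumptions, not project code.

def run_pipeline(audio_path, transcribe, classify, execute):
    """Audio file -> transcript -> list of intents -> executed results."""
    text = transcribe(audio_path)            # STT stage
    actions = classify(text)                 # intent classification stage
    results = [execute(a) for a in actions]  # tool execution stage
    return {"transcript": text, "results": results}  # handed to the UI
```

&lt;p&gt;Passing the stages in as callables also makes the pipeline trivially testable: each stage can be replaced with a stub in unit tests.&lt;/p&gt;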




&lt;h2&gt;
  
  
  Model Choices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speech-to-Text: Groq Whisper API
&lt;/h3&gt;

&lt;p&gt;I used Groq's hosted Whisper API for transcription. It is near-instant for short clips, requires no local GPU, and has a generous free tier — making it ideal for laptops without powerful hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM: Groq llama-3.3-70b-versatile
&lt;/h3&gt;

&lt;p&gt;For intent classification and all text generation (code, summarization, chat), I used Groq's llama-3.3-70b-versatile. Groq's LPU hardware makes inference extremely fast (~200 tokens/s), which feels nearly instant in practice.&lt;/p&gt;

&lt;p&gt;For intent classification, I structured the LLM output as JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generate a Python retry function"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a retry decorator with exponential backoff"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made it trivial to support compound commands — the actions array simply holds multiple items.&lt;/p&gt;
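&lt;p&gt;Once the JSON is parsed, executing a compound command is just a loop over the actions array, dispatching each item to a handler by its type. A sketch (the handler names and the dispatch-table shape are my assumptions, not the project's exact code):&lt;/p&gt;

```python
# Hypothetical dispatch: map intent type strings to handler callables and
# run each action in order. Handler names here are illustrative.

def dispatch(parsed, handlers):
    """Run every action in the parsed intent payload, in order."""
    results = []
    for action in parsed.get("actions", []):
        handler = handlers.get(action.get("type"))
        if handler is None:
            results.append(("error", "unknown intent: " + str(action.get("type"))))
        else:
            results.append(("ok", handler(action.get("params", {}))))
    return results
```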




&lt;h2&gt;
  
  
  Key Implementation Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Safety Sandbox
&lt;/h3&gt;

&lt;p&gt;All file operations are restricted to an output/ directory. I implemented path traversal protection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
    &lt;span class="n"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;safe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path traversal attempt blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resolved&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
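&lt;p&gt;One caveat with the prefix check above: a plain startswith comparison can misfire (output vs. output2, or OUTPUT_DIR being a relative path that was never resolved). On Python 3.9+, Path.is_relative_to expresses the same containment test more robustly. A sketch under those assumptions:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumption: mirrors the project's sandbox dir

def safe_path(filename: str) -> Path:
    """Keep writes inside OUTPUT_DIR; Path(...).name drops any directories."""
    resolved = (OUTPUT_DIR / Path(filename).name).resolve()
    if not resolved.is_relative_to(OUTPUT_DIR.resolve()):  # Python 3.9+
        raise ValueError("Path traversal attempt blocked")
    return resolved
```

&lt;p&gt;Resolving both sides before comparing also avoids the case where every write is rejected simply because the sandbox directory was defined as a relative path.&lt;/p&gt;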



&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;Before executing any file write, the UI asks for confirmation. The pending action is stored in gr.State and executed only on the confirmation click.&lt;/p&gt;
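&lt;p&gt;Stripped of the Gradio specifics, this is a two-phase pattern: the first call stages the action, the second executes it. A framework-agnostic sketch (in the app, the pending dict simply lives in gr.State between the two clicks):&lt;/p&gt;

```python
# Framework-agnostic sketch of the confirm-before-write flow; the function
# names are illustrative, not the project's actual handlers.

def request_action(action):
    """Phase 1: stage the action and return a prompt for the user."""
    pending = {"action": action, "confirmed": False}
    return pending, "Write " + action["filename"] + "? Confirm to proceed."

def confirm(pending, execute):
    """Phase 2: run the staged action only on an explicit confirmation."""
    if pending is None or pending["confirmed"]:
        return "nothing pending"
    pending["confirmed"] = True
    return execute(pending["action"])
```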

&lt;h3&gt;
  
  
  Session Memory
&lt;/h3&gt;

&lt;p&gt;I maintain two separate state structures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chat context — OpenAI-style message list passed to the LLM on each call&lt;/li&gt;
&lt;li&gt;Action log — displayed in the UI history panel so users can review what the agent has done&lt;/li&gt;
&lt;/ul&gt;
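&lt;p&gt;The two structures can live in one small class; the shape below is an assumption about how they might be separated, not the project's exact code:&lt;/p&gt;

```python
class SessionMemory:
    """Chat context for the LLM plus a human-readable action log for the UI."""

    def __init__(self, system_prompt):
        # OpenAI-style message list, seeded with the system prompt
        self.messages = [{"role": "system", "content": system_prompt}]
        # plain strings shown in the UI history panel
        self.action_log = []

    def add_turn(self, user_text, assistant_text):
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append({"role": "assistant", "content": assistant_text})

    def log_action(self, description):
        self.action_log.append(description)
```

&lt;p&gt;Keeping the two lists separate matters: the message list must stay in the exact shape the LLM API expects, while the action log is free-form text for humans.&lt;/p&gt;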




&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. JSON parsing from LLMs
&lt;/h3&gt;

&lt;p&gt;LLMs sometimes wrap JSON output in markdown fences or add preamble text. I wrote a robust parser that strips fences, attempts json.loads, falls back to regex extraction, and as a last resort defaults to a chat intent.&lt;/p&gt;
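&lt;p&gt;The fallback chain described above looks roughly like this (the default chat-intent shape is an assumption about the project's schema):&lt;/p&gt;

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Strip markdown fences, try json.loads, fall back to regex
    extraction, and default to a chat intent if nothing parses."""
    text = raw.strip()
    # drop a leading/trailing ``` fence, with or without a "json" tag
    text = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # fall back: grab the outermost brace-delimited span
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # last resort: treat the whole reply as plain chat
    return {"actions": [{"type": "chat", "details": raw}]}
```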

&lt;h3&gt;
  
  
  2. Compound command detection
&lt;/h3&gt;

&lt;p&gt;Getting the LLM to reliably emit two action objects for commands like "summarize this and save to a file" required careful prompt engineering. Showing an explicit schema example in the system prompt dramatically improved reliability.&lt;/p&gt;
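&lt;p&gt;Concretely, "showing an explicit schema example" means embedding a worked multi-action sample in the system prompt. The wording below is an illustrative reconstruction, not the project's actual prompt; building the sample with json.dumps guarantees the model is shown well-formed JSON:&lt;/p&gt;

```python
import json

# Hypothetical schema example for a compound command; action type names
# are illustrative assumptions.
SCHEMA_EXAMPLE = {"actions": [
    {"type": "summarize", "params": {}},
    {"type": "write_file", "params": {"filename": "notes.txt"}},
]}

SYSTEM_PROMPT = (
    "You are an intent classifier. Reply with JSON only.\n"
    "For compound commands, emit one action object per step, in order.\n"
    "Example, for 'summarize this and save it to notes.txt':\n"
    + json.dumps(SCHEMA_EXAMPLE, indent=2)
)
```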

&lt;h3&gt;
  
  
  3. Gradio version compatibility
&lt;/h3&gt;

&lt;p&gt;Gradio 6 removed several parameters that existed in older versions (show_download_button, GoogleFont). Updating the code to match the new API was a key debugging step.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Add Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Streaming LLM responses — show code as it's generated token by token&lt;/li&gt;
&lt;li&gt;Wake word detection — "Hey Agent" to trigger recording&lt;/li&gt;
&lt;li&gt;Plugin system — let users define custom tool handlers in YAML&lt;/li&gt;
&lt;li&gt;Persistent memory — save session history to disk across restarts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/akshisharmaaa/Voice-Controlled-Local-AI-Agent" rel="noopener noreferrer"&gt;https://github.com/akshisharmaaa/Voice-Controlled-Local-AI-Agent&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Thanks for reading! If you're building something similar or have questions about the architecture, drop a comment below.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
