<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: S@ndeep Kum@r</title>
    <description>The latest articles on DEV Community by S@ndeep Kum@r (@sand33p312).</description>
    <link>https://dev.to/sand33p312</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874958%2F41d33630-7e92-451e-98a8-7796e8441cb8.jpeg</url>
      <title>DEV Community: S@ndeep Kum@r</title>
      <link>https://dev.to/sand33p312</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sand33p312"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Compound Commands, Memory &amp; Human-in-the-Loop</title>
      <dc:creator>S@ndeep Kum@r</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:54:26 +0000</pubDate>
      <link>https://dev.to/sand33p312/building-a-voice-controlled-local-ai-agent-with-compound-commands-memory-human-in-the-loop-1kfj</link>
      <guid>https://dev.to/sand33p312/building-a-voice-controlled-local-ai-agent-with-compound-commands-memory-human-in-the-loop-1kfj</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent with Compound Commands, Memory &amp;amp; Human-in-the-Loop
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Published as part of the Mem0 AI MLOps &amp;amp; AI Infra Internship Assignment&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;What if you could speak to your computer and have it write code, create files, or summarize text — with full transparency into every step of the pipeline?&lt;/p&gt;

&lt;p&gt;That's exactly what I built: a voice-controlled AI agent that transcribes audio, classifies intent (including compound multi-intent commands), executes the right tools, maintains persistent session memory, and requires human confirmation before touching your filesystem.&lt;/p&gt;

&lt;p&gt;In this article I'll walk through the architecture, model choices, the three bonus features I implemented, and the real bugs I hit along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system is a linear pipeline with four stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input
    │
    ▼
Speech-to-Text  (Groq Whisper API — whisper-large-v3)
    │
    ▼
Intent Classification  (Groq llama-3.3-70b — structured JSON)
    │
    ▼
Tool Execution  (write_code | create_file | summarize | general_chat)
    │
    ├── output/                   ← all files created here only
    └── session_history.json      ← persistent memory
    │
    ▼
Gradio UI  (5-step pipeline display + session memory panel)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each stage is an isolated Python module (&lt;code&gt;stt.py&lt;/code&gt;, &lt;code&gt;intent.py&lt;/code&gt;, &lt;code&gt;tools.py&lt;/code&gt;, &lt;code&gt;memory.py&lt;/code&gt;) wired together by a thin orchestrator in &lt;code&gt;app.py&lt;/code&gt;. This makes it easy to swap any component independently.&lt;/p&gt;
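&lt;p&gt;The wiring itself can be sketched as a dependency-injected function, which is what keeps each stage swappable (the names here are illustrative, not the repo's exact signatures):&lt;/p&gt;

```python
def run_pipeline(audio_path, transcribe, classify, execute_all_tools, log):
    """Thread the four stages together; each stage is an injected callable."""
    command = transcribe(audio_path)               # Stage 1: speech-to-text
    details = classify(command)                    # Stage 2: list of intent dicts
    results = execute_all_tools(details, command)  # Stage 3: ordered tool execution
    log(command, details, results)                 # Stage 4: persist to session memory
    return command, details, results
```

&lt;p&gt;Swapping Groq Whisper for a local model, or the LLM classifier for a rule-based one, then only means passing a different callable.&lt;/p&gt;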




&lt;h2&gt;
  
  
  Stage 1: Speech-to-Text — Why Groq Whisper
&lt;/h2&gt;

&lt;p&gt;The assignment recommended running HuggingFace Whisper locally. I tried this first: Whisper large-v3 needs roughly 4 GB of VRAM on a GPU, and on my CPU-only machine it ran at about 0.1× realtime. That's a 10-second wait for every second of audio.&lt;/p&gt;

&lt;p&gt;My solution: &lt;strong&gt;Groq Whisper API&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Groq runs the same &lt;code&gt;whisper-large-v3&lt;/code&gt; model on their LPU hardware, delivering roughly 200× realtime speed. The API is free, requires one pip install, and the same API key is reused for the LLM too — no extra signup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;  &lt;span class="c1"&gt;# returns plain string when format="text"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For anyone wanting a fully local setup, I kept a commented-out &lt;code&gt;_transcribe_local()&lt;/code&gt; function using &lt;code&gt;faster-whisper&lt;/code&gt; with the &lt;code&gt;base&lt;/code&gt; model — much lighter at ~150 MB and fast enough on CPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson:&lt;/strong&gt; Don't let hardware constraints kill your UX. Document the workaround clearly and provide a fallback.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 2: Intent Classification with Compound Command Support
&lt;/h2&gt;

&lt;p&gt;This is the most interesting part. Given a sentence like &lt;em&gt;"Summarize this text and save it to summary.txt"&lt;/em&gt;, the system needs to identify &lt;strong&gt;two&lt;/strong&gt; intents — not just one.&lt;/p&gt;

&lt;p&gt;I used &lt;strong&gt;Groq llama-3.3-70b&lt;/strong&gt; and designed the system prompt to return a &lt;strong&gt;JSON array&lt;/strong&gt; of intent objects instead of a single object. This unlocks compound command support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"summarize"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"text_to_summarize"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"this text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"save_result_to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"summary.txt"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create_file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"summary.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content_hint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"save the summary result"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool executor then runs each step in order, passing the output of one step as input to the next — so the summary text is automatically injected into the file creation step without any extra user input.&lt;/p&gt;
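&lt;p&gt;Getting a dependable list out of the LLM takes a little defensive parsing: models sometimes return a single object instead of an array, or wrap the JSON in a markdown fence. A sketch of the normalization (not the repo's exact code):&lt;/p&gt;

```python
import json

FENCE = "`" * 3  # the literal markdown fence some models wrap JSON in

def parse_intents(raw: str) -> list:
    """Normalize the LLM reply into a list of intent dicts."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text.strip("`").strip()
        if text.startswith("json"):   # drop the fence's language tag
            text = text[4:]
    data = json.loads(text)
    return data if isinstance(data, list) else [data]
```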

&lt;p&gt;Four supported intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;write_code&lt;/code&gt; — generate code and save to file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;create_file&lt;/code&gt; — create a file with generated content&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;summarize&lt;/code&gt; — summarize provided text&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;general_chat&lt;/code&gt; — conversational fallback&lt;/li&gt;
&lt;/ul&gt;
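&lt;p&gt;Routing a classified intent to its tool is naturally a dict lookup, with &lt;code&gt;general_chat&lt;/code&gt; as the fallback for anything unrecognized (a sketch with illustrative names, not the repo's exact signature):&lt;/p&gt;

```python
def dispatch(intent, detail, command, tools):
    """Route one classified intent to its tool; unknown intents fall back to chat."""
    handler = tools.get(intent, tools["general_chat"])
    return handler(detail, command)
```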

&lt;p&gt;&lt;strong&gt;Why Groq LLM instead of a local model?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I initially tried Ollama with llama3 locally. It worked but was slow (~8 seconds per classification on CPU). Groq's hosted llama-3.3-70b responds in under 1 second and is free — a clear win for this use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Tool Execution with Safety Constraint
&lt;/h2&gt;

&lt;p&gt;All tools that write files use a hard safety rule — every file is written inside the &lt;code&gt;output/&lt;/code&gt; directory only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;safe_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;  &lt;span class="c1"&gt;# strips "../" and any directory components
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;safe_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For compound commands, the executor passes results between steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_all_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;last_result_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_result_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Inject previous step's output (e.g. summary) into file
&lt;/span&gt;            &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;tool_create_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                                    &lt;span class="n"&gt;injected_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;last_result_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_command&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;last_result_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_plain_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Bonus 1: Human-in-the-Loop Confirmation
&lt;/h2&gt;

&lt;p&gt;Before any file operation, the pipeline pauses and shows confirmation buttons in the UI. The file is never created until the user explicitly clicks "Yes, proceed".&lt;/p&gt;

&lt;p&gt;This required careful state management in Gradio — I stored the pending intent in a global &lt;code&gt;_pending&lt;/code&gt; dict and only executed it on confirmation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Stage 1: classify, detect file op, pause
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_pending&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcription&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;details&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;visible&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# show confirm buttons
&lt;/span&gt;
&lt;span class="c1"&gt;# Stage 2: user clicks confirm
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;confirm_execution&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;details&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_pending&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;details&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_all_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;details&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Bonus 2: Persistent Session Memory
&lt;/h2&gt;

&lt;p&gt;Every action is logged to &lt;code&gt;output/session_history.json&lt;/code&gt; on disk — not just in memory. This means the history panel survives page refreshes and server restarts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_entry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# read from disk
&lt;/span&gt;    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%H:%M:%S&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;_save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# write back to disk
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The UI loads this file on startup so the memory panel is always populated, even after a restart. Cancelled operations are also logged with a 🚫 status.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bonus 3: Compound Commands
&lt;/h2&gt;

&lt;p&gt;As described in Stage 2, a single voice input can now trigger multiple sequential actions. The intent classifier returns an ordered list of steps, and the tool executor runs them in sequence, piping output between steps automatically.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Summarize: AI is transforming every industry — and save it to summary.txt"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Summarizes the text&lt;/li&gt;
&lt;li&gt;Saves the summary to &lt;code&gt;output/summary.txt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Logs both actions as one entry in session memory&lt;/li&gt;
&lt;li&gt;Displays both step results in the UI&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Groq returns a string, not an object for STT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When using &lt;code&gt;response_format="text"&lt;/code&gt;, the API returns a plain string — not an object with a &lt;code&gt;.text&lt;/code&gt; attribute. This caused an &lt;code&gt;AttributeError&lt;/code&gt; on first run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Gradio 6 breaking changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Upgrading to Gradio 6 broke two things: &lt;code&gt;show_copy_button&lt;/code&gt; was removed from &lt;code&gt;gr.Textbox&lt;/code&gt;, and &lt;code&gt;css&lt;/code&gt; moved from &lt;code&gt;gr.Blocks()&lt;/code&gt; to &lt;code&gt;demo.launch()&lt;/code&gt;. Always check the changelog when upgrading UI frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Gemini free tier quota exhausted immediately&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I initially used Google Gemini for the LLM. The free tier quota was exhausted after just a few test calls (&lt;code&gt;limit: 0&lt;/code&gt; errors). Switching to Groq solved this — their free tier is genuinely generous with thousands of requests per day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Confirmation buttons disappearing too fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In Gradio 6, toggling &lt;code&gt;visible&lt;/code&gt; on individual buttons inside a Row behaved inconsistently. The fix was to toggle the entire &lt;code&gt;gr.Row&lt;/code&gt; container's visibility instead of individual buttons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Session history lost on refresh&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Storing history in a Python global variable means it resets on every page refresh. Moving to a JSON file on disk solved this completely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;voice-ai-agent/
├── app.py              # Gradio UI + pipeline orchestration
├── src/
│   ├── stt.py          # Speech-to-Text (Groq Whisper)
│   ├── intent.py       # Intent classification + compound command support
│   ├── tools.py        # Tool execution with result chaining
│   └── memory.py       # Persistent session memory (JSON on disk)
├── output/             # All generated files + session_history.json
├── requirements.txt
├── .env.example
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tech Stack Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Groq Whisper (whisper-large-v3)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;td&gt;Groq llama-3.3-70b-versatile&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI&lt;/td&gt;
&lt;td&gt;Gradio 6&lt;/td&gt;
&lt;td&gt;Free / Open Source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;JSON file on disk&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Python 3.12&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Total API cost: $0&lt;/strong&gt; — one free Groq key handles everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Would Add Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Graceful degradation&lt;/strong&gt; — better error messages for unintelligible audio or unmapped intents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model benchmarking&lt;/strong&gt; — compare Groq vs local Ollama on speed and accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More compound patterns&lt;/strong&gt; — e.g. &lt;em&gt;"Write a retry function and also write tests for it"&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File editing&lt;/strong&gt; — &lt;em&gt;"Open retry.py and add logging to it"&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this agent taught me that the hardest part of an AI pipeline isn't any single model — it's the &lt;strong&gt;glue between components&lt;/strong&gt;. Making structured JSON output reliable, handling Gradio version changes, managing state for multi-step confirmation flows, and persisting memory across sessions are all engineering challenges that tutorials skip over.&lt;/p&gt;

&lt;p&gt;The compound command feature was the most satisfying to build — it turns a simple classifier into something that genuinely understands user intent at a higher level.&lt;/p&gt;

&lt;p&gt;Full source code: &lt;a href="https://github.com/sand33p312/voice-ai-agent" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
Demo video: &lt;a href="https://www.loom.com/share/dbd2a53faccd4bee8e23614dfe72ee7b" rel="noopener noreferrer"&gt;Loom Demo&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with: Python · Gradio · Groq API (Whisper + llama-3.3-70b)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>gradio</category>
    </item>
  </channel>
</rss>
