<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Barath Jakkuva</title>
    <description>The latest articles on DEV Community by Barath Jakkuva (@barath_jakkuva_875abb7b6f).</description>
    <link>https://dev.to/barath_jakkuva_875abb7b6f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873471%2F1c1765db-96c8-4b55-8ece-677a70e0a49f.png</url>
      <title>DEV Community: Barath Jakkuva</title>
      <link>https://dev.to/barath_jakkuva_875abb7b6f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/barath_jakkuva_875abb7b6f"/>
    <language>en</language>
    <item>
      <title>How I Built a Voice-Controlled AI Agent That Listens, Thinks, and Acts</title>
      <dc:creator>Barath Jakkuva</dc:creator>
      <pubDate>Sat, 11 Apr 2026 12:11:09 +0000</pubDate>
      <link>https://dev.to/barath_jakkuva_875abb7b6f/how-i-built-a-voice-controlled-ai-agent-that-listens-thinks-and-acts-2jed</link>
      <guid>https://dev.to/barath_jakkuva_875abb7b6f/how-i-built-a-voice-controlled-ai-agent-that-listens-thinks-and-acts-2jed</guid>
      <description>&lt;p&gt;From audio input to file creation — a complete walkthrough of the architecture, models, and hard lessons learned&lt;/p&gt;

&lt;p&gt;When I set out to build a voice-controlled AI agent, I thought the hard part would be the AI. It wasn't. The hard part was getting every layer of the pipeline — audio, transcription, intent classification, tool execution, and UI — to talk to each other reliably. This article walks through exactly how I did it, what broke, and what I'd do differently.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Agent Does
&lt;/h2&gt;

&lt;p&gt;The agent accepts voice input (either a recorded audio file or live microphone), transcribes it to text, figures out what the user wants, and executes the right action on the local machine. The supported actions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create a file&lt;/strong&gt; — creates a new file in a sandboxed output folder&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write code&lt;/strong&gt; — generates code with an LLM and saves it to a file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize text&lt;/strong&gt; — produces a bullet-point summary of provided content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General chat&lt;/strong&gt; — conversational responses to anything else&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compound commands&lt;/strong&gt; — multiple of the above in a single utterance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The entire pipeline is displayed in a clean Streamlit UI showing each step: transcription → intent → action → result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system is split into four clean modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input
    │
    ▼
STT Module (utils/stt.py)
    │  Groq Whisper large-v3
    ▼
Intent Module (utils/intent.py)
    │  Groq LLaMA 3.3 70B → structured JSON
    ▼
Executor Module (utils/executor.py)
    │  file ops / code gen / summarization / chat
    ▼
Streamlit UI (app.py)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each module is independently swappable. The STT module, for example, supports local HuggingFace Whisper, Groq's hosted Whisper, or OpenAI's Whisper — selected automatically based on what's available.&lt;/p&gt;
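&lt;p&gt;The glue between those modules can be sketched as a single function. The names &lt;code&gt;transcribe&lt;/code&gt;, &lt;code&gt;classify_intent&lt;/code&gt;, and &lt;code&gt;execute&lt;/code&gt; below are illustrative stand-ins for the real module APIs, not the actual signatures:&lt;/p&gt;

```python
# Minimal sketch of the pipeline glue. The three callables are
# hypothetical stand-ins for utils/stt.py, utils/intent.py, and
# utils/executor.py; injecting them keeps each layer swappable.

def run_pipeline(audio_path, transcribe, classify_intent, execute):
    """Run one utterance through STT, intent classification, and execution."""
    text = transcribe(audio_path)          # STT module
    plan = classify_intent(text)           # structured JSON from the LLM
    results = [execute(intent, plan) for intent in plan["intents"]]
    return {"transcription": text, "plan": plan, "results": results}
```

&lt;p&gt;Because each stage is passed in rather than imported, a stub can replace any layer during testing, which is exactly the isolation the article relies on later.&lt;/p&gt;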




&lt;h2&gt;
  
  
  The Models I Chose
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speech-to-Text: Groq Whisper large-v3
&lt;/h3&gt;

&lt;p&gt;The assignment recommended running a HuggingFace Whisper model locally. I started there. The problem: running &lt;code&gt;openai/whisper-large-v3&lt;/code&gt; locally requires a CUDA-capable GPU with ~6 GB VRAM. On a standard laptop, CPU inference takes 30–60 seconds per utterance — which completely kills the user experience for a demo.&lt;/p&gt;

&lt;p&gt;Groq's hosted Whisper API solves this. It uses the same model weights but runs on Groq's custom LPU hardware, returning transcriptions in under 3 seconds. It has a generous free tier, requires no local GPU, and the API is drop-in compatible with the OpenAI Whisper format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; You need an internet connection and a Groq API key. For a local-first setup, the code supports falling back to local HuggingFace inference by setting &lt;code&gt;STT_BACKEND=local&lt;/code&gt; in the environment config.&lt;/p&gt;
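&lt;p&gt;One way that backend switch could look (the &lt;code&gt;STT_BACKEND&lt;/code&gt; variable comes from the article; the two backend functions are placeholders, not the real implementations):&lt;/p&gt;

```python
import os

def _transcribe_groq(path):
    # placeholder: a real version would call Groq's hosted Whisper here
    return f"groq:{path}"

def _transcribe_local(path):
    # placeholder: a real version would run HuggingFace Whisper locally here
    return f"local:{path}"

def transcribe(audio_path):
    """Pick a Whisper backend at call time based on STT_BACKEND."""
    backend = os.getenv("STT_BACKEND", "groq")
    if backend == "local":
        return _transcribe_local(audio_path)
    return _transcribe_groq(audio_path)
```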

&lt;h3&gt;
  
  
  Intent Classification &amp;amp; Generation: Groq LLaMA 3.3 70B
&lt;/h3&gt;

&lt;p&gt;For the LLM, the assignment recommended Ollama for local inference. Again, I tried it. LLaMA 3 70B requires ~40 GB of disk space and significant RAM. On a development machine, this is impractical for a demo submission.&lt;/p&gt;

&lt;p&gt;Groq's hosted LLaMA 3.3 70B solves the same problem — same model quality, ~200ms response time, free tier available.&lt;/p&gt;

&lt;p&gt;The intent classifier uses a structured JSON prompt that forces the model to return exactly the schema I need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"summary_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"none"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"chat_reply"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using &lt;code&gt;"response_format": {"type": "json_object"}&lt;/code&gt; in the API call ensures the model never returns markdown fences or prose — just clean JSON every time.&lt;/p&gt;
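&lt;p&gt;Concretely, the request passed to Groq's OpenAI-compatible chat completions endpoint would look roughly like this; the prompt wording and temperature are my own illustration, only the &lt;code&gt;response_format&lt;/code&gt; parameter and model name come from the article:&lt;/p&gt;

```python
def build_intent_request(transcript: str) -> dict:
    """Assemble chat-completion arguments for intent classification.

    The system prompt here is schematic; response_format is the piece
    that makes the model return bare JSON instead of prose.
    """
    system = (
        "Classify the user's request. Respond with JSON only, using keys: "
        "intents, filename, language, summary_source, chat_reply."
    )
    return {
        "model": "llama-3.3-70b-versatile",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": transcript},
        ],
        "response_format": {"type": "json_object"},
        "temperature": 0,  # deterministic classification
    }
```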




&lt;h2&gt;
  
  
  The Intent Classification Design
&lt;/h2&gt;

&lt;p&gt;The most interesting design challenge was intent classification. A naive approach would be to check for keywords ("create", "write", "summarize"). This breaks immediately on real speech — people say "can you make me a Python script" not "write code".&lt;/p&gt;

&lt;p&gt;Instead, I wrote a system prompt that gives the LLM the full schema, the allowed intent values, and explicit rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If the user mentions multiple actions, return multiple intents (compound command support)&lt;/li&gt;
&lt;li&gt;Infer the programming language from context&lt;/li&gt;
&lt;li&gt;Suggest a meaningful filename based on what was requested&lt;/li&gt;
&lt;li&gt;If summarization is requested but no extra text is provided, summarize the transcription itself&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach handles natural, conversational language robustly. "Hey, can you write me something that retries failed HTTP requests and save it as a Python file?" correctly returns &lt;code&gt;["write_code", "create_file"]&lt;/code&gt; with &lt;code&gt;filename: "retry.py"&lt;/code&gt; and &lt;code&gt;language: "python"&lt;/code&gt;.&lt;/p&gt;
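&lt;p&gt;Even with JSON mode, it is worth validating the parsed object before acting on it. A defensive normalizer might look like the sketch below; the field names match the schema shown earlier, but the fallback defaults are my own guesses, not the article's code:&lt;/p&gt;

```python
# Allowed intent values, taken from the article's examples.
ALLOWED_INTENTS = {"create_file", "write_code", "summarize_text", "general_chat"}

def normalize_plan(plan: dict) -> dict:
    """Coerce the LLM's JSON into the expected shape, dropping junk intents."""
    intents = [i for i in plan.get("intents", []) if i in ALLOWED_INTENTS]
    return {
        "intents": intents or ["general_chat"],  # safe fallback intent
        "filename": str(plan.get("filename", "") or ""),
        "language": str(plan.get("language", "") or ""),
        "summary_source": plan.get("summary_source", "none"),
        "chat_reply": plan.get("chat_reply", ""),
    }
```

&lt;p&gt;Anything the model invents outside the allowed set is silently discarded, and an empty result degrades to ordinary chat rather than crashing the executor.&lt;/p&gt;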




&lt;h2&gt;
  
  
  The Executor and Safety Constraints
&lt;/h2&gt;

&lt;p&gt;The executor maps intents to actions. The critical safety constraint: &lt;strong&gt;all file I/O is restricted to an &lt;code&gt;output/&lt;/code&gt; folder&lt;/strong&gt;. This is enforced by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolving all file paths relative to the &lt;code&gt;output/&lt;/code&gt; directory&lt;/li&gt;
&lt;li&gt;Sanitizing filenames to strip path separators and traversal sequences (&lt;code&gt;../&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Never accepting absolute paths from the LLM output
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_sanitize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^\w.\-]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;safe&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;  &lt;span class="c1"&gt;# strips any directory component
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means even if the LLM hallucinated a filename like &lt;code&gt;../../system32/important.dll&lt;/code&gt;, the executor would write to &lt;code&gt;output/______system32_important.dll&lt;/code&gt; instead.&lt;/p&gt;
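&lt;p&gt;As a belt-and-braces check on top of sanitization, the resolved path can also be verified to still sit inside &lt;code&gt;output/&lt;/code&gt;. A sketch (requires Python 3.9+ for &lt;code&gt;Path.is_relative_to&lt;/code&gt;; this is my addition, not the article's code):&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename: str) -> Path:
    """Resolve filename under output/ and refuse anything that escapes it."""
    candidate = (OUTPUT_DIR / Path(filename).name).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"path escapes sandbox: {filename}")
    return candidate
```

&lt;p&gt;Taking &lt;code&gt;Path(filename).name&lt;/code&gt; already drops every directory component, so the containment check should never fire; it exists so a future refactor that loosens the sanitization still fails closed.&lt;/p&gt;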




&lt;h2&gt;
  
  
  The UI
&lt;/h2&gt;

&lt;p&gt;I built the frontend in Streamlit. It displays the four pipeline stages in cards — transcription, detected intents (shown as colour-coded badges), action taken, and final output. A session history panel at the bottom shows all previous interactions in the current session.&lt;/p&gt;

&lt;p&gt;Two bonus features worth highlighting:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; Before executing any file operation, the UI shows a confirmation prompt with the suggested filename. The user can review what the agent is about to do and cancel if needed. This is a small addition but makes the agent feel trustworthy rather than dangerous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound Commands:&lt;/strong&gt; If the LLM detects multiple intents in one utterance, the executor runs them sequentially and combines the results. "Write a retry function and summarize how it works" triggers both &lt;code&gt;write_code&lt;/code&gt; and &lt;code&gt;summarize_text&lt;/code&gt; in one go.&lt;/p&gt;
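&lt;p&gt;Sequential execution of compound commands can be as simple as a dispatch table keyed by intent name. The handler names below are placeholders for the executor's real functions:&lt;/p&gt;

```python
def run_intents(plan, handlers):
    """Execute each detected intent in order and collect the results.

    handlers maps an intent name to a callable that receives the full
    plan, so every handler can see the shared filename/language fields.
    """
    results = []
    for intent in plan["intents"]:
        handler = handlers.get(intent)
        if handler is None:
            results.append(f"unknown intent: {intent}")
            continue
        results.append(handler(plan))
    return results
```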




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Environment variables not loading
&lt;/h3&gt;

&lt;p&gt;The biggest time sink was a subtle Python import order bug. The &lt;code&gt;STT_BACKEND&lt;/code&gt; and &lt;code&gt;LLM_BACKEND&lt;/code&gt; variables were being read at module import time (top-level &lt;code&gt;os.getenv()&lt;/code&gt; calls), before &lt;code&gt;load_dotenv()&lt;/code&gt; had been called. The fix was moving all environment variable reads inside the functions that use them, so they execute after dotenv has loaded.&lt;/p&gt;
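&lt;p&gt;A minimal illustration of the failure mode and the fix (the variable name matches the article; the rest is schematic):&lt;/p&gt;

```python
import os

# Anti-pattern: evaluated once, at import time. If load_dotenv() runs
# after this module is imported, STT_BACKEND here is already None.
STT_BACKEND = os.getenv("STT_BACKEND")

# Fix: defer the read until the value is needed, after dotenv has run.
def get_stt_backend() -> str:
    return os.getenv("STT_BACKEND", "groq")
```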

&lt;h3&gt;
  
  
  2. Model deprecation mid-build
&lt;/h3&gt;

&lt;p&gt;Midway through development, Groq deprecated &lt;code&gt;llama3-70b-8192&lt;/code&gt;. Any in-flight API calls started returning &lt;code&gt;model_decommissioned&lt;/code&gt; errors. The fix was switching to &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;, which Groq now recommends as the replacement. Lesson: always read the deprecation docs before submitting.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Audio format handling
&lt;/h3&gt;

&lt;p&gt;Streamlit's &lt;code&gt;st.audio_input&lt;/code&gt; (microphone recording) and &lt;code&gt;st.file_uploader&lt;/code&gt; return different data types. The microphone returns a BytesIO object; the uploader returns an UploadedFile. Both need to be written to a temporary file on disk before passing to the Whisper API, because Whisper expects a file path or binary file handle, not a stream.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. LLM JSON reliability
&lt;/h3&gt;

&lt;p&gt;Early versions of the intent prompt would occasionally return JSON wrapped in markdown fences (&lt;code&gt;```json ... ```&lt;/code&gt;), or include explanatory prose before the JSON. I fixed this by using Groq's &lt;code&gt;response_format: json_object&lt;/code&gt; parameter and adding a regex strip as a fallback:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;cleaned = re.sub(r"```(?:json)?", "", raw).strip("` \n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add streaming&lt;/strong&gt; — the LLM responses currently wait for the full completion. Streaming the output to the UI would make the agent feel much more responsive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory across sessions&lt;/strong&gt; — right now history resets when the app restarts. Adding a SQLite or JSON file backend would make the agent genuinely useful over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Voice output&lt;/strong&gt; — completing the loop with text-to-speech so the agent speaks its response back would make this feel like a true voice assistant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Better error recovery&lt;/strong&gt; — if transcription produces garbled text, the current system passes it to the LLM, which then classifies it as &lt;code&gt;general_chat&lt;/code&gt;. A confidence score from Whisper could be used to prompt the user to re-record instead.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The most valuable thing I learned building this wasn't about AI — it was about pipeline design. Every layer needs clean, well-defined inputs and outputs. When something breaks (and it will), you need to be able to isolate which layer is the problem in seconds, not hours.&lt;/p&gt;

&lt;p&gt;The modularity of separating STT, intent, and execution into independent files paid off every single time something went wrong. I could test each component in isolation, swap backends without touching the UI, and add new intents without rewriting the executor.&lt;/p&gt;

&lt;p&gt;The full source code is available on GitHub: &lt;strong&gt;[link]&lt;/strong&gt;&lt;br&gt;
Video demo: &lt;strong&gt;[link]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built with Streamlit, Groq Whisper, and LLaMA 3.3 70B.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>nlp</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
