<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manas Ranjan Jena</title>
    <description>The latest articles on DEV Community by Manas Ranjan Jena (@manas_ranjanjena_6946ef7).</description>
    <link>https://dev.to/manas_ranjanjena_6946ef7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874419%2Ff3b5f32b-bc91-4041-b641-05d2c8fb1dc8.jpg</url>
      <title>DEV Community: Manas Ranjan Jena</title>
      <link>https://dev.to/manas_ranjanjena_6946ef7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manas_ranjanjena_6946ef7"/>
    <language>en</language>
    <item>
      <title>Building EchoKernel: A Voice-Controlled AI Agent That Actually Does Things</title>
      <dc:creator>Manas Ranjan Jena</dc:creator>
      <pubDate>Sun, 12 Apr 2026 05:24:38 +0000</pubDate>
      <link>https://dev.to/manas_ranjanjena_6946ef7/building-echokernel-a-voice-controlled-ai-agent-that-actually-does-things-1l5a</link>
      <guid>https://dev.to/manas_ranjanjena_6946ef7/building-echokernel-a-voice-controlled-ai-agent-that-actually-does-things-1l5a</guid>
      <description>&lt;p&gt;I want to be upfront about something before we start: the phrase "local AI agent" is one of the most overloaded terms in the current AI landscape. Half the demos you'll find online are chatbots with a file-picker attached. The other half require a $3,000 workstation with 24GB of VRAM just to boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EchoKernel&lt;/strong&gt; is my attempt to build something in the middle — a voice-controlled agent that genuinely executes actions on your local machine (creating files, writing code, summarizing text), runs on any laptop without a GPU, and has a pipeline transparent enough that you can understand and modify every stage.&lt;/p&gt;

&lt;p&gt;This article walks through the full architecture, the reasoning behind every major decision, and the specific bugs that bit me hardest. The &lt;a href="https://github.com/ManasRanjanJena253/EchoKernel" rel="noopener noreferrer"&gt;source code is on GitHub&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;What It Does&lt;/h2&gt;

&lt;p&gt;You speak a command (or type one). EchoKernel:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Transcribes your audio to text using Groq's Whisper API&lt;/li&gt;
&lt;li&gt;Sends that transcript to LLaMA 3.3 70B to classify your intent as structured JSON&lt;/li&gt;
&lt;li&gt;Routes the intent to the right local tool — file creation, code generation, summarization, or chat&lt;/li&gt;
&lt;li&gt;Displays the transcription, detected intent, action taken, and output in a clean three-panel UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A full interaction looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;User&lt;/span&gt; &lt;span class="n"&gt;says&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function that retries failed HTTP requests and save it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="n"&gt;blob&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;Whisper&lt;/span&gt; &lt;span class="n"&gt;Large&lt;/span&gt; &lt;span class="n"&gt;v3&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function that...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="n"&gt;transcript&lt;/span&gt;  &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;LLaMA&lt;/span&gt; &lt;span class="mf"&gt;3.3&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;    &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                         &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;tool&lt;/span&gt; &lt;span class="n"&gt;executor&lt;/span&gt;    &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;generates&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writes&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
  &lt;span class="n"&gt;result&lt;/span&gt;      &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;UI&lt;/span&gt;               &lt;span class="err"&gt;→&lt;/span&gt;  &lt;span class="n"&gt;shows&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything generated lands in an &lt;code&gt;output/&lt;/code&gt; directory. Nothing touches the rest of your filesystem.&lt;/p&gt;




&lt;h2&gt;The Architecture&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Browser  (frontend/index.html)
     │
     │  multipart/form-data  (audio blob + session metadata)
     │  application/json     (text commands, confirmations)
     │
     ▼
FastAPI  (backend/main.py)
     │
     ├── [1] STT Service     →  Groq Whisper API       →  transcript text
     ├── [2] Intent Service  →  Groq LLaMA 3.3 70B     →  structured JSON intent
     ├── [3] Tool Executor   →  local Python functions  →  file / code / summary / chat
     └── [4] Memory Store    →  in-process dict         →  per-session chat history
     │
     ▼
output/   ← every file write is sandboxed here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline is deliberately sequential and single-responsibility. Each stage produces a typed Pydantic object and hands it to the next:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TranscriptionResult  →  IntentResult  →  ToolResult
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means any stage can be replaced without touching the others. Want to swap Groq Whisper for a local &lt;code&gt;faster-whisper&lt;/code&gt; binary? You change exactly one async function in &lt;code&gt;stt.py&lt;/code&gt;. The rest of the pipeline doesn't know or care.&lt;/p&gt;




&lt;h2&gt;Stage 1: Speech-to-Text&lt;/h2&gt;

&lt;h3&gt;How it works&lt;/h3&gt;

&lt;p&gt;The browser records audio using the &lt;code&gt;MediaRecorder&lt;/code&gt; API and sends the raw blob to the &lt;code&gt;/agent/audio&lt;/code&gt; endpoint as a multipart upload. FastAPI reads the bytes and forwards them to Groq's &lt;code&gt;/v1/audio/transcriptions&lt;/code&gt; endpoint, which is OpenAI-API-compatible and runs Whisper Large v3 on Groq's LPU hardware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe_audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TranscriptionResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper-large-v3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verbose_json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.groq.com/openai/v1/audio/transcriptions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TranscriptionResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;duration&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;The bug that wasted two hours&lt;/h3&gt;

&lt;p&gt;Browsers typically emit recorded audio as &lt;code&gt;audio/webm;codecs=opus&lt;/code&gt;. My original content-type validation used a Python set membership check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_AUDIO_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/mpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/webm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...}&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_AUDIO_TYPES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;415&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every microphone recording returned a 415 Unsupported Media Type. The issue: &lt;code&gt;"audio/webm;codecs=opus"&lt;/code&gt; is not equal to &lt;code&gt;"audio/webm"&lt;/code&gt;. The codec suffix makes them different strings.&lt;/p&gt;

&lt;p&gt;The fix was switching from exact-match to prefix-match, and separately stripping the codec suffix before forwarding to Groq (which also rejects the full string):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_AUDIO_PREFIXES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/mpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/webm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audio/ogg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ALLOWED_AUDIO_PREFIXES&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;415&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

&lt;span class="c1"&gt;# groq doesn't accept codec params — strip before forwarding
&lt;/span&gt;&lt;span class="n"&gt;clean_content_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lesson: always treat browser-emitted MIME types as prefixes, not exact strings.&lt;/p&gt;
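&lt;p&gt;If you'd rather not hand-roll the splitting, the standard library already knows how to separate a media type from its parameters. This is a stdlib alternative, not what EchoKernel itself uses:&lt;/p&gt;

```python
from email.message import EmailMessage

def base_mime_type(content_type: str) -> str:
    """Return the media type with any ;codecs=... parameters stripped."""
    msg = EmailMessage()
    msg["Content-Type"] = content_type
    return msg.get_content_type()  # lowercased, parameters dropped
```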

&lt;h3&gt;Why Whisper Large v3?&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;large-v3&lt;/code&gt; checkpoint is meaningfully better than &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;small&lt;/code&gt; on short, command-like utterances — exactly what a voice agent receives. Smaller checkpoints are more prone to hallucinating filler words or mis-hearing technical terms ("create a YAML file" becoming "create a yaml pile"). The latency difference on Groq's LPU is small enough (~100–200ms) that it's not worth compromising accuracy.&lt;/p&gt;




&lt;h2&gt;Stage 2: Intent Classification&lt;/h2&gt;

&lt;p&gt;This is the most architecturally interesting part of the system. The challenge is taking freeform transcribed text — which could be anything from "make a file called config dot yaml" to "write me a function that debounces events in JavaScript" — and turning it into a structured, typed object that the tool executor can act on reliably.&lt;/p&gt;

&lt;h3&gt;The prompt design&lt;/h3&gt;

&lt;p&gt;The system prompt instructs LLaMA 3.3 70B to return &lt;em&gt;only&lt;/em&gt; a JSON object — no preamble, no explanation, no markdown fences:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;INTENT_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;You are an intent classifier for a voice-controlled AI agent.
Analyze the user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s transcribed speech and return ONLY a valid JSON object — no markdown, no explanation.

Intent categories:
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: user wants to create an empty file or folder
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: user wants code generated and saved to a file
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: user wants text content summarized
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: general conversation, questions, or anything else
- &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: multiple distinct actions in one command

JSON schema to return:
{
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;primary_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;one of the five categories&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secondary_intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;additional intents if compound, else empty&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;],
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;high|medium|low&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;suggested filename with extension if applicable, else null&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;text or topic the user wants to act on, if any&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,
  &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;one sentence explaining your classification&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;
}&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API call enforces this at the model level with &lt;code&gt;response_format: {"type": "json_object"}&lt;/code&gt;, which constrains generation to syntactically valid JSON. That means &lt;code&gt;json.loads()&lt;/code&gt; never throws — even if the model returns an unexpected schema, the output is still parseable, and &lt;code&gt;dict.get()&lt;/code&gt; with defaults handles missing fields cleanly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GROQ_BASE_URL&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.3-70b-versatile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# low temp for deterministic classification
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json_object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Temperature is set to &lt;code&gt;0.1&lt;/code&gt; rather than &lt;code&gt;0&lt;/code&gt; — this avoids the model getting stuck in degenerate outputs while still being close to deterministic for classification tasks.&lt;/p&gt;
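&lt;p&gt;Even with &lt;code&gt;json_object&lt;/code&gt; mode, it pays to parse defensively — the mode guarantees syntax, not schema. Here is a sketch of folding the reply into safe defaults; field names follow the prompt's schema, but the exact code in the repo may differ:&lt;/p&gt;

```python
import json

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat", "compound"}

def parse_intent(raw: str) -> dict:
    """Parse the model's JSON reply, falling back to safe defaults."""
    payload = json.loads(raw)  # json_object mode guarantees this succeeds
    primary = payload.get("primary_intent", "chat")
    if primary not in VALID_INTENTS:
        primary = "chat"  # unknown labels degrade to plain conversation
    return {
        "primary_intent": primary,
        "secondary_intents": payload.get("secondary_intents") or [],
        "confidence": payload.get("confidence", "low"),
        "target_filename": payload.get("target_filename"),
        "extracted_content": payload.get("extracted_content"),
        "reasoning": payload.get("reasoning", ""),
    }
```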

&lt;h3&gt;Why the JSON schema has &lt;code&gt;target_filename&lt;/code&gt; and &lt;code&gt;extracted_content&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Early versions of the system made the tool executor re-parse the original transcript to figure out things like "what file did they want to name this?" That's fragile — the tool executor would have to implement its own mini NLP layer.&lt;/p&gt;

&lt;p&gt;Instead, the intent classifier does that work once and packages the results into structured fields. When &lt;code&gt;primary_intent&lt;/code&gt; is &lt;code&gt;write_code&lt;/code&gt;, &lt;code&gt;target_filename&lt;/code&gt; already contains something like &lt;code&gt;"retry_handler.py"&lt;/code&gt; and &lt;code&gt;extracted_content&lt;/code&gt; has the description of what to write. The tool executor receives everything it needs without touching the raw text again.&lt;/p&gt;

&lt;h3&gt;Session history as context&lt;/h3&gt;

&lt;p&gt;The last four turns of conversation history are injected into the intent call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;conversation_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solves a real usability problem: after generating a summary, a user might say "now save that to a file." Without context, this classifies as &lt;code&gt;chat&lt;/code&gt; (there's no explicit action in the phrase). With the prior turn in context, the model correctly classifies it as a compound &lt;code&gt;summarize&lt;/code&gt; + &lt;code&gt;create_file&lt;/code&gt; intent with the summary content extracted from the assistant's previous response.&lt;/p&gt;
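&lt;p&gt;The per-session memory backing this is just the in-process dict from the architecture diagram. A minimal sketch — the store shape and function names here are illustrative, not from the repo:&lt;/p&gt;

```python
from collections import defaultdict

# session_id -> list of {"role": ..., "content": ...} turns
_sessions: dict[str, list[dict]] = defaultdict(list)

def remember(session_id: str, role: str, content: str) -> None:
    _sessions[session_id].append({"role": role, "content": content})

def recent_turns(session_id: str, n: int = 4) -> list[dict]:
    """Last n turns, ready to splice into the intent-classification messages."""
    return _sessions[session_id][-n:]
```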

&lt;h3&gt;Why LLaMA 3.3 70B for classification?&lt;/h3&gt;

&lt;p&gt;I tested this with smaller models during development. The pattern that caused failures was edge cases near intent boundaries — commands like "create a Python file with a retry function" which could reasonably be &lt;code&gt;create_file&lt;/code&gt; OR &lt;code&gt;write_code&lt;/code&gt; (it's actually both — a compound intent). The 70B model consistently identifies these as compound. Models in the 8B–13B range tend to collapse compound utterances to a single intent and miss the secondary action.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3: Tool Execution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The routing pattern
&lt;/h3&gt;

&lt;p&gt;The tool executor is a dispatcher that maps &lt;code&gt;primary_intent&lt;/code&gt; strings to async handler functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IntentResult&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ToolResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_intent&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_create_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_write_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_compound&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;_handle_chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcribed_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# default fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each handler returns a &lt;code&gt;ToolResult&lt;/code&gt; — a typed Pydantic model with &lt;code&gt;success&lt;/code&gt;, &lt;code&gt;action_taken&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;file_path&lt;/code&gt;, and &lt;code&gt;code_content&lt;/code&gt; fields. The frontend renders different UI components depending on which fields are populated (a code block if &lt;code&gt;code_content&lt;/code&gt; is set, a download link if &lt;code&gt;file_path&lt;/code&gt; is set).&lt;/p&gt;
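&lt;p&gt;As a rough sketch of that model (field names come from the description above; the types and defaults are my assumptions, not the project's actual code):&lt;/p&gt;

```python
from typing import Optional
from pydantic import BaseModel

class ToolResult(BaseModel):
    # sketch of the result model described above; defaults are assumptions
    success: bool = True
    action_taken: str = ""
    output: str = ""
    file_path: Optional[str] = None     # set: frontend renders a download link
    code_content: Optional[str] = None  # set: frontend renders a code block

r = ToolResult(action_taken="wrote file", file_path="output/notes.txt")
print(r.success, r.file_path)
```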

&lt;h3&gt;
  
  
  The filesystem sandbox
&lt;/h3&gt;

&lt;p&gt;Every file operation goes through two validation layers before anything touches disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1 — Filename sanitization:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# strip anything that could escape the output directory
&lt;/span&gt;    &lt;span class="n"&gt;clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^\w.\-]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;clean&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Path(name).name&lt;/code&gt; drops any directory components (so &lt;code&gt;"../../etc/passwd"&lt;/code&gt; becomes &lt;code&gt;"passwd"&lt;/code&gt;). The regex then strips anything that isn't a word character, dot, or hyphen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2 — Resolved path validation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolve_output_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_safe_filename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;safe&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;relative_to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# raises ValueError if outside
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After sanitization, the path is still resolved against the real filesystem and checked against &lt;code&gt;OUTPUT_DIR&lt;/code&gt;. If somehow a sanitized filename still resolves outside &lt;code&gt;output/&lt;/code&gt; (symlink attacks, OS-specific edge cases), this raises a &lt;code&gt;ValueError&lt;/code&gt; before any write happens. Two independent layers means a bypass of the first doesn't automatically mean a bypass of the second.&lt;/p&gt;
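&lt;p&gt;To see Layer 2 doing its job in isolation, here is a minimal sketch that deliberately skips sanitization and feeds a traversal path straight to the resolved-path check:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def check_inside_output(filename: str) -> Path:
    # Layer 2 only: resolve against the real filesystem and verify containment
    path = OUTPUT_DIR / filename
    path.resolve().relative_to(OUTPUT_DIR.resolve())  # raises ValueError if outside
    return path

try:
    check_inside_output("../escape.txt")
except ValueError:
    print("blocked: path escapes output/")

print(check_inside_output("notes.txt"))
```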

&lt;h3&gt;
  
  
  Code generation and the markdown fence problem
&lt;/h3&gt;

&lt;p&gt;LLMs are trained to format code inside markdown fences. Even when you explicitly instruct "return only raw code, no markdown fences," frontier models comply about 95% of the time — but that 5% writes invalid Python to disk because the file starts with &lt;code&gt;```python&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix is a post-processing strip that runs regardless of whether the model complied:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;code = await _call_llm(messages)

# strip markdown fences that some models sneak in despite instructions
code = re.sub(r"^```[\w]*\n?", "", code.strip())
code = re.sub(r"\n?```$", "", code.strip())

path.write_text(code, encoding="utf-8")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This costs a negligible regex pass and makes the code writing 100% reliable instead of 95%.&lt;/p&gt;
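&lt;p&gt;The two regexes are easy to verify on their own (same strip logic as above, wrapped in a helper for testing):&lt;/p&gt;

```python
import re

def strip_fences(code: str) -> str:
    # remove a leading ```lang line and a trailing ``` line, if present
    code = re.sub(r"^```[\w]*\n?", "", code.strip())
    code = re.sub(r"\n?```$", "", code.strip())
    return code

print(strip_fences("```python\nprint('hi')\n```"))  # print('hi')
print(strip_fences("print('hi')"))                  # unchanged
```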

&lt;h3&gt;
  
  
  Compound commands
&lt;/h3&gt;

&lt;p&gt;When the intent is &lt;code&gt;compound&lt;/code&gt;, the executor iterates over &lt;code&gt;secondary_intents&lt;/code&gt; and recursively calls &lt;code&gt;execute_tool&lt;/code&gt; for each sub-intent, stitching the results together:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;async def _handle_compound(intent, transcribed_text, history):
    outputs = []
    for sub in intent.secondary_intents or ["chat"]:
        sub_intent = IntentResult(
            primary_intent=sub if sub in VALID_INTENTS else "chat",
            secondary_intents=[],
            target_filename=intent.target_filename,
            extracted_content=intent.extracted_content,
            ...
        )
        result = await execute_tool(sub_intent, transcribed_text, history)
        outputs.append(f"[{sub}] {result.output}")

    return ToolResult(
        action_taken="Executed compound command",
        output="\n\n".join(outputs),
        ...
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This means a command like "summarize this text and save it to notes.txt" executes as two sequential tool calls — first a summarization, then a file write — and the user sees both results merged in a single response card.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 4: Session Memory
&lt;/h2&gt;

&lt;p&gt;The memory store is intentionally simple:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;_sessions: dict[str, SessionHistory] = {}

def append_message(session_id: str, role: str, content: str, intent: str | None = None):
    session = _sessions.setdefault(session_id, SessionHistory(session_id=session_id))
    session.messages.append(ChatMessage(role=role, content=content, intent=intent, ...))

def get_history(session_id: str) -&amp;gt; list[dict]:
    session = _sessions.get(session_id)
    return [{"role": m.role, "content": m.content} for m in session.messages] if session else []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A plain Python dict, no database, no Redis. For a single-user local agent this is exactly the right choice — zero setup friction, zero operational overhead, data scoped to the server process lifetime. The frontend generates a UUID on first response and attaches it to every subsequent request, so sessions are naturally isolated.&lt;/p&gt;

&lt;p&gt;The tradeoff is that sessions disappear when you restart the server. For a local development tool this is acceptable. For a production deployment you'd replace the dict with Redis or a lightweight SQLite write — and because of the single-responsibility design, that's a change to &lt;code&gt;memory.py&lt;/code&gt; only.&lt;/p&gt;
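&lt;p&gt;For illustration, a minimal synchronous sketch of that swap, using stdlib &lt;code&gt;sqlite3&lt;/code&gt; rather than the async &lt;code&gt;aiosqlite&lt;/code&gt; the real app would want; the schema and names here are hypothetical:&lt;/p&gt;

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; a real swap would use a file
# path so sessions survive restarts
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS messages ("
    "session_id TEXT, role TEXT, content TEXT, intent TEXT)"
)

def append_message(session_id, role, content, intent=None):
    conn.execute("INSERT INTO messages VALUES (?, ?, ?, ?)",
                 (session_id, role, content, intent))
    conn.commit()

def get_history(session_id):
    rows = conn.execute("SELECT role, content FROM messages "
                        "WHERE session_id = ?", (session_id,))
    return [{"role": r, "content": c} for r, c in rows]
```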




&lt;h2&gt;
  
  
  The Frontend
&lt;/h2&gt;

&lt;p&gt;The UI is a single HTML file with no build step, no framework, no node_modules. It opens directly in the browser with a double-click.&lt;/p&gt;

&lt;p&gt;The three-panel layout was chosen deliberately:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Left&lt;/strong&gt; — input controls (mic, file upload, text, toggles)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Center&lt;/strong&gt; — scrollable conversation feed showing the full pipeline result for each interaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right&lt;/strong&gt; — output file browser and session history log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The most interesting frontend engineering challenge was scroll containment. With a CSS grid layout, if you set &lt;code&gt;overflow: hidden&lt;/code&gt; anywhere in the ancestor chain, the flex children can't scroll independently. The fix requires a specific combination of properties:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;#feed-col {
  display: flex;
  flex-direction: column;
  min-height: 0;   /* critical: without this, the column refuses to shrink */
}

#feed {
  flex: 1;
  overflow-y: auto;
  min-height: 0;   /* allows the feed to scroll rather than grow infinitely */
}

.msg-card {
  flex-shrink: 0;  /* prevents cards from being squished by the layout */
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without &lt;code&gt;min-height: 0&lt;/code&gt; on both the column and the scrollable child, the feed grows to fit its content instead of scrolling — so after 3–4 messages the layout breaks and scroll stops working. This is a subtle CSS flexbox behaviour that isn't obvious from reading the spec.&lt;/p&gt;

&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;When the HITL toggle is on, the &lt;code&gt;/agent/audio&lt;/code&gt; endpoint returns HTTP 202 (Accepted but not yet acted upon) instead of executing immediately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if require_confirmation and detected_intent.primary_intent in ("create_file", "write_code"):
    return JSONResponse(
        status_code=202,
        content={
            "status": "awaiting_confirmation",
            "session_id": sid,
            "transcription": transcription.model_dump(),
            "intent": detected_intent.model_dump(),
        },
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The frontend detects the 202, renders an Execute / Cancel prompt, and only calls the &lt;code&gt;/agent/confirm&lt;/code&gt; endpoint if the user approves. This follows HTTP semantics correctly — 202 Accepted means the request has been accepted for processing, but the processing has not been completed — rather than inventing a custom status code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model and Provider Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why Groq over OpenAI, Anthropic, or local models?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;vs. OpenAI:&lt;/strong&gt; The API surface is identical — Groq implements the OpenAI spec, so switching would be a one-line URL change. The difference is latency. Groq's LPU (Language Processing Unit) is purpose-built silicon for transformer inference and delivers roughly 3–5× faster token throughput than GPU-based providers. For a voice pipeline, where you're waiting for STT + LLM sequentially, that difference is the gap between an interaction that feels responsive and one that feels like a web search from 2008.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;vs. local inference (Ollama, llama.cpp, LM Studio):&lt;/strong&gt; Running LLaMA 3.3 70B locally requires either 16–24GB of VRAM for acceptable GPU inference, or a 15–30 second response time on CPU. Neither is usable for a voice agent. The honest tradeoff is: Groq makes EchoKernel work on any laptop with an internet connection, at the cost of one API key and a few cents per day of usage. The architecture is designed so swapping back to local inference is 10 lines of code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# stt.py — replace Groq call with faster-whisper
import faster_whisper
model = faster_whisper.WhisperModel("large-v3", device="cpu")
segments, _ = model.transcribe(audio_bytes_io)
transcript = " ".join(s.text for s in segments)

# intent.py — replace Groq call with Ollama
import ollama
response = ollama.chat(model="llama3.3", messages=messages)
raw = response["message"]["content"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;vs. Anthropic / Google:&lt;/strong&gt; Both offer excellent LLMs but neither provides a Whisper-equivalent STT API. You'd need two providers for one pipeline. Keeping everything on Groq means one API key, one base URL, one billing account.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The MIME type suffix problem&lt;/strong&gt; was the most frustrating bug — a 415 error with no obvious cause until I checked what &lt;code&gt;audio.content_type&lt;/code&gt; actually contained in the server logs. The browser string &lt;code&gt;audio/webm;codecs=opus&lt;/code&gt; is technically a valid MIME type with parameters (RFC 2045), but treating it as an exact match against a set of strings silently rejects every microphone recording.&lt;/p&gt;
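&lt;p&gt;The fix is to compare only the base media type and ignore MIME parameters. A sketch of that check (the allowed set here is illustrative, not the project's exact list):&lt;/p&gt;

```python
ALLOWED_AUDIO = {"audio/webm", "audio/wav", "audio/mpeg", "audio/ogg"}  # illustrative

def is_supported(content_type: str) -> bool:
    # per RFC 2045, parameters follow ';' and don't change the media type itself
    base = content_type.split(";", 1)[0].strip().lower()
    return base in ALLOWED_AUDIO

print(is_supported("audio/webm;codecs=opus"))  # True
print(is_supported("video/mp4"))               # False
```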

&lt;p&gt;&lt;strong&gt;Getting the intent classifier to reliably produce compound intents&lt;/strong&gt; required several prompt iterations. The model would correctly identify compound commands about 70% of the time with a basic prompt. Adding explicit examples of compound vs. single intents in the system prompt, and reducing temperature to 0.1, pushed this above 95%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSS scroll containment in a grid layout&lt;/strong&gt; took longer than it should have. The symptom was that the feed stopped scrolling after a few messages. The root cause was missing &lt;code&gt;min-height: 0&lt;/code&gt; on flex children inside a grid column — a property that's meaningless in most contexts but critical here. The browser's default &lt;code&gt;min-height: auto&lt;/code&gt; for flex items means they will never shrink below their content size, so the scrollable container grows instead of scrolling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;session_id&lt;/code&gt; 422 error&lt;/strong&gt; came from a Pydantic schema where &lt;code&gt;session_id: str&lt;/code&gt; was a required field with no default. The frontend correctly sends &lt;code&gt;null&lt;/code&gt; on first request (there's no session ID yet), but Pydantic rejected &lt;code&gt;null&lt;/code&gt; for a non-optional &lt;code&gt;str&lt;/code&gt;. The fix was &lt;code&gt;session_id: Optional[str] = None&lt;/code&gt;.&lt;/p&gt;
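&lt;p&gt;The failure mode is easy to reproduce in isolation; a sketch, where only the &lt;code&gt;session_id&lt;/code&gt; field is taken from the article:&lt;/p&gt;

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class Before(BaseModel):
    session_id: str                    # required: null from the frontend -> 422

class After(BaseModel):
    session_id: Optional[str] = None   # the fix: null is now accepted

try:
    Before(session_id=None)
except ValidationError:
    print("rejected, as in the 422 bug")

print(After(session_id=None).session_id)  # None
```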




&lt;h2&gt;
  
  
  What I'd Build Next
&lt;/h2&gt;

&lt;p&gt;The architecture has four clean extension points:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent memory&lt;/strong&gt; — replace the in-process dict with SQLite via &lt;code&gt;aiosqlite&lt;/code&gt;. Sessions would survive server restarts and you could browse history across days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More tools&lt;/strong&gt; — the tool executor is just a dispatcher. Adding a &lt;code&gt;search_web&lt;/code&gt; tool, a &lt;code&gt;run_shell_command&lt;/code&gt; tool (with appropriate confirmation gates), or a &lt;code&gt;read_file&lt;/code&gt; tool for context injection are all isolated additions to &lt;code&gt;tools.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming responses&lt;/strong&gt; — the current architecture returns a complete response after the LLM finishes. Adding Server-Sent Events would let the UI render code token-by-token as the model generates it, which dramatically improves perceived latency for long outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local model benchmarking&lt;/strong&gt; — running the same test suite against &lt;code&gt;faster-whisper medium&lt;/code&gt; vs &lt;code&gt;large-v3&lt;/code&gt; on CPU, and &lt;code&gt;llama3.1 8B&lt;/code&gt; vs &lt;code&gt;llama3.3 70B&lt;/code&gt; on Ollama, would produce concrete latency/accuracy numbers that justify the current model choices with data rather than intuition.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ManasRanjanJena253/EchoKernel
cd EchoKernel

cp .env.example .env
# add your GROQ_API_KEY to .env

pip install -r backend/requirements.txt
python run.py

# then open frontend/index.html in your browser
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Get a free Groq API key at &lt;a href="https://console.groq.com" rel="noopener noreferrer"&gt;console.groq.com&lt;/a&gt;. The free tier is more than enough for development and demos.&lt;/p&gt;




&lt;p&gt;If you have questions about any part of the architecture or hit a bug I didn't cover, the GitHub issues are open. And if you end up extending it with new tools or a local model swap, I'd genuinely like to see it.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
