<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manish Reddy</title>
    <description>The latest articles on DEV Community by Manish Reddy (@vem_manish).</description>
    <link>https://dev.to/vem_manish</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3885750%2F476538ed-ffac-4f3c-ac6a-ee3abe1909e3.png</url>
      <title>DEV Community: Manish Reddy</title>
      <link>https://dev.to/vem_manish</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vem_manish"/>
    <language>en</language>
    <item>
      <title>I Built a Voice-Controlled AI Agent That Runs Entirely on Your Machine</title>
      <dc:creator>Manish Reddy</dc:creator>
      <pubDate>Sat, 18 Apr 2026 08:33:26 +0000</pubDate>
      <link>https://dev.to/vem_manish/i-built-a-voice-controlled-ai-agent-that-runs-entirely-on-your-machine-37g1</link>
      <guid>https://dev.to/vem_manish/i-built-a-voice-controlled-ai-agent-that-runs-entirely-on-your-machine-37g1</guid>
      <description>&lt;h2&gt;
  
  
  No cloud. No API keys. Just your voice, a local LLM, and a clean pipeline that actually works.
&lt;/h2&gt;




&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;Most AI assistants are cloud-dependent. You say something, it goes to a server somewhere, gets processed, and comes back. That works fine — until you care about privacy, latency, or just want to understand what's actually happening under the hood.&lt;/p&gt;

&lt;p&gt;I wanted to build something different: a voice-controlled AI agent that runs completely locally. You speak (or type), it figures out what you want, and it does it — whether that's writing code, creating files, summarizing text, or having a conversation. Everything happens on your machine.&lt;/p&gt;

&lt;p&gt;This article walks through how I built it, the models I chose, and the real challenges I ran into along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Agent Can Do
&lt;/h2&gt;

&lt;p&gt;Before diving into architecture, here's what the finished system looks like from a user's perspective:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Say &lt;em&gt;"Write a Python retry function and save it as retry.py"&lt;/em&gt; → it generates the code and saves it to an &lt;code&gt;output/&lt;/code&gt; folder&lt;/li&gt;
&lt;li&gt;Say &lt;em&gt;"Summarize this text and save it to notes.txt"&lt;/em&gt; → it summarizes the content, then writes the file&lt;/li&gt;
&lt;li&gt;Say &lt;em&gt;"Create a folder called projects"&lt;/em&gt; → done&lt;/li&gt;
&lt;li&gt;Say &lt;em&gt;"What is recursion?"&lt;/em&gt; → it responds conversationally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also supports &lt;strong&gt;chaining&lt;/strong&gt; — one voice command can trigger multiple steps in sequence. And before any file is written to disk, a confirmation panel appears so you stay in control.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system is a linear pipeline. Each stage has one job and passes its output to the next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🎤 Audio Input (mic or file upload)
        ↓
🗣️  Speech-to-Text        [faster-whisper, Whisper base]
        ↓
🧠  Intent Classifier      [llama3.1:8b via Ollama]
        ↓
⚙️  Tool Executor
   ├── WRITE_CODE          → qwen2.5-coder:7b
   ├── SAVE_FILE           → writes to output/
   ├── CREATE_FILE         → creates empty file
   ├── CREATE_FOLDER       → creates directory
   ├── SUMMARIZE_TEXT      → llama3.1:8b
   └── GENERAL_CHAT        → llama3.1:8b
        ↓
🖥️  Gradio UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project is split into six focused modules:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;voice.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Converts audio to text using faster-whisper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent_classifier.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sends text to the LLM, parses a JSON plan of steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;executor.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runs each step in order, chains outputs between steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The actual tool functions — file ops, code gen, chat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;memory.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maintains a rolling conversation history within the session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gradio UI and event wiring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each module is independent. You can swap out the STT model, replace Ollama with an API call, or add new tools without touching anything else.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Models I Chose (and Why)
&lt;/h2&gt;

&lt;p&gt;Choosing the right model for each job was more important than it might seem. Using a single general-purpose model for everything would have been simpler, but the quality difference when using specialized models is significant.&lt;/p&gt;

&lt;h3&gt;
  
  
  Speech-to-Text: &lt;code&gt;faster-whisper&lt;/code&gt; (Whisper base)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/guillaumekynast/faster-whisper" rel="noopener noreferrer"&gt;faster-whisper&lt;/a&gt; is a reimplementation of OpenAI's Whisper using CTranslate2. For this project, I used the &lt;code&gt;base&lt;/code&gt; model with int8 quantization on CPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this over full Whisper?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;int8 quantization shrinks the weights to roughly a quarter of their float32 size, cutting memory use substantially&lt;/li&gt;
&lt;li&gt;On CPU, it's noticeably faster — typically 1–3 seconds for a 5-second voice clip&lt;/li&gt;
&lt;li&gt;VAD (Voice Activity Detection) filtering is built in, which means it skips silent segments automatically and reduces hallucinations on quiet recordings&lt;/li&gt;
&lt;li&gt;The base model is small enough to load instantly and accurate enough for clear English commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For voice commands (short, purposeful sentences), the base model performs very well. You'd only need a larger model if you were transcribing long, nuanced speech.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intent Classification: &lt;code&gt;llama3.1:8b&lt;/code&gt; via Ollama
&lt;/h3&gt;

&lt;p&gt;The heart of the system. After transcription, this model reads the user's text and returns a structured JSON plan describing exactly what steps to take.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why llama3.1:8b?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It follows instructions reliably, which is critical here — I need it to return valid JSON every single time, not prose&lt;/li&gt;
&lt;li&gt;At 8 billion parameters, it's large enough to understand nuanced commands but small enough to run on a machine with 16 GB RAM&lt;/li&gt;
&lt;li&gt;It handles multi-step command decomposition well — given "write code and save it", it correctly outputs two separate steps in the right order&lt;/li&gt;
&lt;li&gt;Temperature is set to 0 for the classifier, which makes responses deterministic and consistent&lt;/li&gt;
&lt;/ul&gt;
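&lt;p&gt;For concreteness, here's a minimal sketch of what a classifier request against Ollama's &lt;code&gt;/api/chat&lt;/code&gt; endpoint might look like. The prompt text and schema below are illustrative, not my exact production prompt; Ollama's &lt;code&gt;format: "json"&lt;/code&gt; option additionally constrains the model's output to valid JSON:&lt;/p&gt;

```python
import json

# Illustrative system prompt; the real one is longer and lists every intent.
SYSTEM_PROMPT = (
    "You are an intent classifier. Return ONLY a JSON object of the form "
    '{"steps": [{"intent": ..., "query": ..., "meta": {...}}]}. No prose.'
)

def build_classifier_request(user_text: str, history: list) -> dict:
    """Build the payload for a POST to Ollama's /api/chat endpoint."""
    return {
        "model": "llama3.1:8b",
        "messages": [{"role": "system", "content": SYSTEM_PROMPT}]
                    + list(history)
                    + [{"role": "user", "content": user_text}],
        "stream": False,
        "format": "json",               # ask Ollama to emit valid JSON only
        "options": {"temperature": 0},  # deterministic classification
    }

payload = build_classifier_request("write a retry function and save it", [])
```

The actual HTTP call is a single &lt;code&gt;requests.post&lt;/code&gt; with this payload; everything interesting lives in the prompt and the options.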

&lt;h3&gt;
  
  
  Code Generation: &lt;code&gt;qwen2.5-coder:7b&lt;/code&gt; via Ollama
&lt;/h3&gt;

&lt;p&gt;When the intent is &lt;code&gt;WRITE_CODE&lt;/code&gt;, the request goes to a separate, code-specialized model instead of the general-purpose one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why a separate model for code?&lt;/strong&gt;&lt;br&gt;
Because it genuinely writes better code. &lt;code&gt;qwen2.5-coder&lt;/code&gt; is fine-tuned specifically on programming tasks. In practice, the difference is noticeable — cleaner structure, better variable names, more idiomatic patterns. Using a general model for code generation works, but a code-specialized model works better.&lt;/p&gt;


&lt;h2&gt;
  
  
  How the Intent Classifier Works
&lt;/h2&gt;

&lt;p&gt;This is the most interesting part of the system, so it's worth explaining in detail.&lt;/p&gt;

&lt;p&gt;When a user's text arrives, it's sent to &lt;code&gt;llama3.1:8b&lt;/code&gt; with a carefully crafted system prompt. The prompt instructs the model to return &lt;strong&gt;only&lt;/strong&gt; a JSON object — no preamble, no explanation, no markdown fences. The JSON describes an ordered list of steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WRITE_CODE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Python retry function"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SAVE_FILE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"save code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content_source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"previous_step"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;content_source: "previous_step"&lt;/code&gt; field is how step chaining works. When the executor reaches &lt;code&gt;SAVE_FILE&lt;/code&gt;, it checks this flag and uses the output from the previous step (the generated code) as the file content. No manual wiring required.&lt;/p&gt;

&lt;p&gt;After the model responds, the output goes through two layers of validation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;_extract_json()&lt;/code&gt;&lt;/strong&gt; — Strips any surrounding text and pulls out the first valid &lt;code&gt;{...}&lt;/code&gt; block, in case the model added any prose despite being told not to&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;_normalize()&lt;/code&gt;&lt;/strong&gt; — Ensures every step has all required keys, and silently replaces any unknown intent with &lt;code&gt;GENERAL_CHAT&lt;/code&gt; instead of crashing&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This two-layer approach means the system &lt;strong&gt;always&lt;/strong&gt; produces something useful, even when the model misbehaves.&lt;/p&gt;
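&lt;p&gt;A simplified version of those two layers looks like this (helper names match the article; the brace-matching details are my own approximation, and it deliberately ignores braces inside strings, which is fine for this use case):&lt;/p&gt;

```python
import json

VALID_INTENTS = {"WRITE_CODE", "SAVE_FILE", "CREATE_FILE",
                 "CREATE_FOLDER", "SUMMARIZE_TEXT", "GENERAL_CHAT"}

def extract_json(raw: str) -> dict:
    """Pull the first balanced {...} block out of the model's reply."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(raw[start:], start=start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start:i + 1])
    raise ValueError("unbalanced JSON object")

def normalize(plan: dict) -> dict:
    """Fill in missing keys; downgrade unknown intents to GENERAL_CHAT."""
    steps = []
    for step in plan.get("steps", []):
        intent = step.get("intent", "GENERAL_CHAT")
        if intent not in VALID_INTENTS:
            intent = "GENERAL_CHAT"
        steps.append({"intent": intent,
                      "query": step.get("query", ""),
                      "meta": step.get("meta", {})})
    if not steps:
        steps = [{"intent": "GENERAL_CHAT", "query": "", "meta": {}}]
    return {"steps": steps}
```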




&lt;h2&gt;
  
  
  Step Execution and Output Chaining
&lt;/h2&gt;

&lt;p&gt;Once the intent classifier returns its plan, the executor runs each step in sequence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WRITE_CODE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;write_code_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;previous_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# stored for the next step
&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SAVE_FILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;previous_output&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content_source&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;previous_step&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;original_text&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;save_file_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;previous_output&lt;/code&gt; variable acts as a simple pipe between steps. This is what makes compound commands work without any complex orchestration logic.&lt;/p&gt;
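&lt;p&gt;To make the chaining concrete, here's a self-contained toy version of the loop with stub tools standing in for the real LLM-backed ones (the stubs record writes in a dict instead of touching disk):&lt;/p&gt;

```python
saved: dict = {}  # stub "filesystem" so the demo has no side effects

def write_code_tool(query: str) -> str:
    return f"def retry():  # generated for: {query}\n    pass\n"

def save_file_tool(filename: str, content: str) -> str:
    saved[filename] = content
    return f"saved {filename}"

def run(steps: list) -> str:
    previous_output = ""
    result = ""
    for step in steps:
        intent = step["intent"]
        meta = step.get("meta", {})
        if intent == "WRITE_CODE":
            result = write_code_tool(step["query"])
        elif intent == "SAVE_FILE":
            # The content_source flag decides whether to pipe the last output in.
            content = (previous_output
                       if meta.get("content_source") == "previous_step"
                       else step["query"])
            result = save_file_tool(meta["filename"], content)
        previous_output = result  # the "pipe" between steps
    return result

plan = [
    {"intent": "WRITE_CODE", "query": "retry function"},
    {"intent": "SAVE_FILE", "query": "save code",
     "meta": {"filename": "retry.py", "content_source": "previous_step"}},
]
final = run(plan)  # → "saved retry.py"
```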




&lt;h2&gt;
  
  
  Human-in-the-Loop Confirmation
&lt;/h2&gt;

&lt;p&gt;Any step that writes to disk — &lt;code&gt;SAVE_FILE&lt;/code&gt;, &lt;code&gt;CREATE_FILE&lt;/code&gt;, &lt;code&gt;CREATE_FOLDER&lt;/code&gt; — is flagged before execution. Instead of immediately writing, the UI shows a confirmation panel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⚠ Confirm File Operation
💾 Save file: retry.py
[Filename input — editable]
[Confirm]  [Cancel]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user can rename the file before confirming, or cancel entirely. Only after confirmation does the executor run. This was an intentional design choice: an AI agent that silently writes files to your machine without asking is a liability. A two-second confirmation step prevents a lot of potential headaches.&lt;/p&gt;
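&lt;p&gt;The underlying pattern is simple: park the destructive step as a pending action and only execute it on confirmation. A minimal sketch (class and field names here are illustrative, not my exact implementation):&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PendingAction:
    description: str
    filename: str
    execute: Callable[[str], str]  # called with the (possibly renamed) filename

@dataclass
class ConfirmationGate:
    pending: Optional[PendingAction] = None

    def request(self, action: PendingAction) -> str:
        """Park the action and return the text for the confirmation panel."""
        self.pending = action
        return f"⚠ Confirm: {action.description}"

    def confirm(self, filename: Optional[str] = None) -> str:
        """Run the parked action, honoring a user-edited filename."""
        action, self.pending = self.pending, None
        return action.execute(filename or action.filename)

    def cancel(self) -> str:
        self.pending = None
        return "Cancelled."

gate = ConfirmationGate()
prompt = gate.request(PendingAction(
    description="Save file: retry.py",
    filename="retry.py",
    execute=lambda name: f"wrote {name}",
))
result = gate.confirm("retry_helper.py")  # user renamed before confirming
```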




&lt;h2&gt;
  
  
  Session Memory
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;ConversationMemory&lt;/code&gt; class maintains a rolling window of the last 10 messages using Python's &lt;code&gt;deque&lt;/code&gt; with &lt;code&gt;maxlen&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ConversationMemory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the window is full, the oldest message is automatically dropped. This keeps memory usage bounded while still giving the LLM enough context for follow-up questions like &lt;em&gt;"save that to a file"&lt;/em&gt; to make sense.&lt;/p&gt;

&lt;p&gt;The conversation history is passed to every LLM call — classification, code generation, summarization, and chat. This means the agent can understand references to previous turns without any extra logic.&lt;/p&gt;
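&lt;p&gt;The eviction behavior is entirely handled by &lt;code&gt;deque&lt;/code&gt; itself; a quick demo shows the oldest turns falling off once the window fills:&lt;/p&gt;

```python
from collections import deque

# With maxlen set, appending to a full deque silently evicts the oldest entry.
messages = deque(maxlen=10)
for turn in range(12):
    role = "user" if turn % 2 == 0 else "assistant"
    messages.append({"role": role, "content": f"turn {turn}"})

# Turns 0 and 1 have been dropped; only the last 10 remain.
```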




&lt;h2&gt;
  
  
  Safety: Sandboxed File Operations
&lt;/h2&gt;

&lt;p&gt;All file writes go through a &lt;code&gt;_safe_path()&lt;/code&gt; function before touching disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lstrip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unsafe path. All writes must stay inside output/.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;resolved&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents directory traversal attacks. A command like &lt;em&gt;"save to ../../etc/hosts"&lt;/em&gt; resolves to a path outside &lt;code&gt;output/&lt;/code&gt; and is rejected before any disk write happens. All generated files are contained within the &lt;code&gt;output/&lt;/code&gt; folder in the project directory.&lt;/p&gt;
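&lt;p&gt;One refinement worth noting: a plain &lt;code&gt;startswith&lt;/code&gt; prefix check can be fooled by a sibling directory whose name shares the prefix (e.g. &lt;code&gt;output_evil/&lt;/code&gt; starts with the string &lt;code&gt;output&lt;/code&gt;). On Python 3.9+, &lt;code&gt;Path.is_relative_to&lt;/code&gt; avoids that pitfall. A runnable sketch of the stricter variant, using a temp directory as the sandbox:&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Stand-in sandbox root for the demo; the real project uses output/ in the repo.
OUTPUT_DIR = Path(tempfile.mkdtemp()) / "output"
OUTPUT_DIR.mkdir()

def safe_path(name: str) -> Path:
    name = name.strip().lstrip("/\\")
    resolved = (OUTPUT_DIR / name).resolve()
    # is_relative_to compares path components, not raw string prefixes.
    if not resolved.is_relative_to(OUTPUT_DIR.resolve()):
        raise ValueError("Unsafe path. All writes must stay inside output/.")
    return resolved

inside = safe_path("notes/retry.py")   # resolves under output/, accepted
try:
    safe_path("../../etc/hosts")       # traversal attempt, rejected
    escaped = True
except ValueError:
    escaped = False
```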




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Getting the LLM to Return Consistent JSON
&lt;/h3&gt;

&lt;p&gt;The single hardest part. LLMs are trained to be helpful and conversational, which means they want to explain what they're doing. Even with explicit instructions like "return ONLY valid JSON", the model would occasionally wrap the output in markdown fences, add a sentence before it, or return slightly malformed JSON.&lt;/p&gt;

&lt;p&gt;The solution was three-pronged:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;temperature: 0&lt;/code&gt; to make outputs as deterministic as possible&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;_extract_json()&lt;/code&gt; to pull out the JSON block regardless of surrounding text&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;_normalize()&lt;/code&gt; to handle missing keys and unknown intents gracefully&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After these three layers, the classifier became reliable enough to use in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context Window Overflow on Summarization
&lt;/h3&gt;

&lt;p&gt;When a user pastes a very long piece of text and asks for a summary, the combined system prompt + conversation history + user text can exceed the model's context window. In practice, this failed in hard-to-diagnose ways: the model would return empty output or error out without a clear message.&lt;/p&gt;

&lt;p&gt;The fix was a &lt;code&gt;_truncate()&lt;/code&gt; function in the executor that caps input at 12,000 characters before sending it to the summarization tool. It tries to cut at a sentence boundary rather than mid-word, and appends a note so the model knows the text was trimmed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;[... text truncated for summarization ...]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but it completely eliminated the overflow crashes.&lt;/p&gt;
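&lt;p&gt;For reference, the whole guard fits in a few lines. The 12,000-character cap matches what I used; the sentence-boundary search below is my own approximation of the real helper:&lt;/p&gt;

```python
MAX_CHARS = 12_000

def truncate(text: str, limit: int = MAX_CHARS) -> str:
    if len(text) > limit:
        cutoff = text[:limit]
        # Prefer to cut at the last sentence end inside the window.
        end = max(cutoff.rfind(". "), cutoff.rfind("! "), cutoff.rfind("? "))
        if end > 0:
            cutoff = cutoff[:end + 1]
        return cutoff + "\n\n[... text truncated for summarization ...]"
    return text

short = truncate("A tiny input.")          # returned unchanged
long = truncate("One sentence. " * 2_000)  # capped with a trailing note
```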

&lt;h3&gt;
  
  
  3. Wiring Multi-Output Gradio Events
&lt;/h3&gt;

&lt;p&gt;Gradio's event system requires you to declare all outputs upfront, and the number of outputs must match exactly across every &lt;code&gt;yield&lt;/code&gt; in a generator function. When I added the confirmation panel (which introduced new state variables), every existing &lt;code&gt;yield&lt;/code&gt; in &lt;code&gt;run_classify()&lt;/code&gt; had to be updated to include the new outputs.&lt;/p&gt;

&lt;p&gt;This led to a subtle bug where some code paths returned the wrong number of values, causing silent failures in the UI. The fix was centralizing the blank/default state into a &lt;code&gt;_blank()&lt;/code&gt; helper so every yield point returned the same shape.&lt;/p&gt;
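&lt;p&gt;Stripped of the Gradio specifics, the idea is to derive every yield's tuple from one canonical set of defaults, so no code path can drift out of shape. The field names below are illustrative, not my exact component list:&lt;/p&gt;

```python
# One source of truth for the output slots and their defaults; every yield
# point builds its tuple through blank(), so all paths have the same arity.
OUTPUT_FIELDS = ("transcript", "plan_json", "result",
                 "confirm_visible", "confirm_filename")

DEFAULTS = {
    "transcript": "",
    "plan_json": "{}",
    "result": "",
    "confirm_visible": False,
    "confirm_filename": "",
}

def blank(**overrides) -> tuple:
    values = dict(DEFAULTS)
    values.update(overrides)
    return tuple(values[name] for name in OUTPUT_FIELDS)

idle = blank()
confirming = blank(confirm_visible=True, confirm_filename="retry.py")
```

In the real handler each slot is a Gradio component update rather than a bare value, but the shape guarantee is the same.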




&lt;h2&gt;
  
  
  What I'd Build Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streaming output&lt;/strong&gt; — Show the LLM's response token by token instead of waiting for the full response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More intents&lt;/strong&gt; — Open applications, search the web, run terminal commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt; — Save conversation history to disk so context survives across sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model benchmarking&lt;/strong&gt; — Systematically measure latency and accuracy across different Ollama models to find the best tradeoff for each task&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building this taught me that the hard part of an AI agent isn't the AI — it's the plumbing. Getting models to return structured output reliably, handling edge cases gracefully, and making the UI feel responsive despite slow local inference are all engineering problems, not AI problems.&lt;/p&gt;

&lt;p&gt;The result is a system that genuinely works: speak a command, watch it get classified, confirm if needed, and see the result. Entirely local, entirely transparent, and easy to extend.&lt;/p&gt;

&lt;p&gt;The full code is available on GitHub: &lt;code&gt;[your repo link here]&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with faster-whisper, llama3.1:8b, qwen2.5-coder:7b, Ollama, and Gradio.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
