<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kuruv Patel</title>
    <description>The latest articles on DEV Community by Kuruv Patel (@kuruvpatel).</description>
    <link>https://dev.to/kuruvpatel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2050665%2Fe3fb5d92-b782-405b-b6e5-4fd9594ded9d.gif</url>
      <title>DEV Community: Kuruv Patel</title>
      <link>https://dev.to/kuruvpatel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kuruvpatel"/>
    <language>en</language>
    <item>
      <title>Building Voca: I Built a 100% Local Voice AI Agent — No Cloud, No Compromise</title>
      <dc:creator>Kuruv Patel</dc:creator>
      <pubDate>Sun, 12 Apr 2026 06:45:22 +0000</pubDate>
      <link>https://dev.to/kuruvpatel/building-voca-i-built-a-100-local-voice-ai-agent-no-cloud-no-compromise-3dc2</link>
      <guid>https://dev.to/kuruvpatel/building-voca-i-built-a-100-local-voice-ai-agent-no-cloud-no-compromise-3dc2</guid>
      <description>&lt;p&gt;Voice assistants are everywhere. But here's the uncomfortable truth hiding behind every "Hey Siri" and "OK Google": your raw audio, your personal context, your sensitive queries — all of it is getting shipped to a cloud server you don't control, processed by a model you can't inspect, and logged in ways you can't audit.&lt;/p&gt;

&lt;p&gt;For developers working on proprietary codebases, or anyone who simply refuses to accept that trade-off, this is a non-starter.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;Voca&lt;/strong&gt; — a fully open-source, 100% local voice AI agent. It can create files, generate code, summarize text, and hold conversational memory across a session. Not a single byte ever leaves your machine.&lt;/p&gt;

&lt;p&gt;Here's exactly how I built it, every architectural decision behind it, and every wall I ran into along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Constraint: Stateless by Design
&lt;/h2&gt;

&lt;p&gt;The foundational architectural goal was ruthless: keep the backend &lt;strong&gt;stateless&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Early prototypes cached conversation history server-side. It worked, but it introduced session drift the moment you opened a second tab or restarted the dev server. The backend became a liability — something that could desync, leak context between sessions, or bloat in memory.&lt;/p&gt;

&lt;p&gt;The fix was counterintuitive: &lt;strong&gt;push all state to the client&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The frontend maintains two strictly bounded arrays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;chatContextState&lt;/code&gt; — rolling conversational dialogue, capped at 20 frames&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;actionLogState&lt;/code&gt; — a ledger of every file creation or code write authorized this session&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time you send a voice command or text input, the frontend serializes this entire footprint and ships it &lt;em&gt;alongside&lt;/em&gt; the audio blob into the inference pipeline. The backend receives everything it needs in a single request and forgets everything the moment it responds. Clean, fast, reproducible.&lt;/p&gt;
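&lt;p&gt;The 20-frame cap is just a rolling window. The frontend does this in JavaScript, but the logic is small enough to sketch in Python — names like &lt;code&gt;push_frame&lt;/code&gt; and &lt;code&gt;build_request_state&lt;/code&gt; are illustrative, not Voca's actual identifiers:&lt;/p&gt;

```python
import json

CONTEXT_CAP = 20  # mirrors the frontend's 20-frame limit on chatContextState

def push_frame(context, frame, cap=CONTEXT_CAP):
    """Append a frame and keep only the most recent `cap` entries."""
    return (context + [frame])[-cap:]

def build_request_state(chat_context, action_log):
    """Serialize the full client footprint for one stateless request."""
    return json.dumps({"chatContext": chat_context, "actionLog": action_log})
```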




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The backend is a single FastAPI async route: &lt;code&gt;POST /api/process&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's intentional: the entire API surface is one endpoint. FastAPI + Uvicorn handles raw &lt;code&gt;multipart/form-data&lt;/code&gt; natively — audio blob plus stringified JSON state in one shot — with zero WebSocket overhead. The pipeline that executes on every request has exactly three stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Blob → [STT] → [Intent Classification] → [Tool Dispatcher] → Response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three run locally. None phone home.&lt;/p&gt;
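&lt;p&gt;Because the backend is stateless, the whole route reduces to composing three pure-ish stages. A minimal sketch with stubbed stages (the real implementations are &lt;code&gt;faster-whisper&lt;/code&gt;, Ollama, and the tool dispatcher; the function names here are assumptions):&lt;/p&gt;

```python
def run_pipeline(audio_bytes, state, stt, classify, dispatch):
    """One stateless request: transcribe, classify, execute, forget."""
    transcript = stt(audio_bytes)
    intents = classify(transcript, state)   # list of structured intent objects
    results = [dispatch(intent) for intent in intents]
    return {"transcript": transcript, "results": results}
```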

&lt;p&gt;The frontend is deliberately vanilla — plain HTML, CSS, and JavaScript. No framework tax, no build step, no abstraction overhead. The UI is a thin shell over the state arrays; keeping it that way meant the rendering logic never had to fight the inference logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 1 — Speech-to-Text: &lt;code&gt;faster-whisper&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Standard Whisper deployments carry significant PyTorch cold-start latency, and if the model is reinitialized on every request that penalty compounds fast.&lt;/p&gt;

&lt;p&gt;The fix: initialize the model &lt;strong&gt;once&lt;/strong&gt; at module load time, locking it into &lt;code&gt;float16&lt;/code&gt; precision directly on the CUDA buffers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;faster_whisper&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WhisperModel&lt;/span&gt;

&lt;span class="c1"&gt;# Loaded once. Stays warm on GPU for the lifetime of the process.
&lt;/span&gt;&lt;span class="n"&gt;stt_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WhisperModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large-v3-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;compute_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float16&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With weights resting warm on the GPU, transcriptions resolve in milliseconds rather than seconds.&lt;/p&gt;

&lt;p&gt;But raw transcription isn't enough. Voca calculates the &lt;code&gt;avg_logprob&lt;/code&gt; across all returned Whisper segments. If confidence drops below &lt;code&gt;-0.8&lt;/code&gt;, the pipeline trips into &lt;strong&gt;Graceful Degradation&lt;/strong&gt; — execution halts, the user gets a clear warning, and no code touches the filesystem based on a mishear. An offhand cough will never trigger a file write.&lt;/p&gt;
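&lt;p&gt;The gate itself is a few lines. A sketch of the check (dicts stand in for faster-whisper's segment objects, which expose &lt;code&gt;avg_logprob&lt;/code&gt; as an attribute; the helper name is hypothetical):&lt;/p&gt;

```python
CONFIDENCE_FLOOR = -0.8  # below this mean avg_logprob, degrade gracefully

def transcription_confident(segments, floor=CONFIDENCE_FLOOR):
    """Average avg_logprob across all segments; refuse to act on a mishear."""
    probs = [seg["avg_logprob"] for seg in segments]
    if not probs:
        return False  # no speech detected: never a reason to touch the filesystem
    mean = sum(probs) / len(probs)
    return mean >= floor
```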




&lt;h2&gt;
  
  
  Stage 2 — Intent Classification: Ollama + Structured Output
&lt;/h2&gt;

&lt;p&gt;This is where most local voice agent projects fall apart: getting a 4-billion-parameter model to reliably produce machine-parseable output from ambiguous natural language.&lt;/p&gt;

&lt;p&gt;The naive approach — prompt the model and hope for valid JSON — breaks constantly. Smaller models hallucinate schema, merge distinct actions into nonsensical compound intents, or drop required fields entirely.&lt;/p&gt;

&lt;p&gt;The solution was to &lt;strong&gt;remove the model's ability to produce malformed output&lt;/strong&gt; by binding Ollama's &lt;code&gt;format=&lt;/code&gt; argument to a strict JSON schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;selected_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;array&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;properties&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;enum&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;general_chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;required&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model can no longer return &lt;code&gt;{intent: "create_folder_and_write_file"}&lt;/code&gt;. It &lt;em&gt;must&lt;/em&gt; return &lt;code&gt;[{intent: "create_file"}, {intent: "write_code"}]&lt;/code&gt;. Compound commands become sequential, deterministic actions. The hallucination problem becomes a schema constraint problem — and schema constraints are solved.&lt;/p&gt;
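&lt;p&gt;The schema guarantees shape, but defense in depth is cheap: the backend can still re-validate enum membership before anything reaches the dispatcher. A hypothetical parsing helper:&lt;/p&gt;

```python
import json

ALLOWED_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intents(raw):
    """Parse the model's JSON array; reject anything outside the intent enum."""
    actions = json.loads(raw)
    for action in actions:
        if action.get("intent") not in ALLOWED_INTENTS:
            raise ValueError(f"Unknown intent: {action.get('intent')!r}")
    return actions
```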

&lt;h3&gt;
  
  
  Contextual Ambiguity
&lt;/h3&gt;

&lt;p&gt;A user says: &lt;em&gt;"make that function async"&lt;/em&gt; — 30 seconds after creating a file named &lt;code&gt;server.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without context, the model has no referent for "that function" or "that file." With the &lt;code&gt;actionLogState&lt;/code&gt; passed from the frontend, &lt;code&gt;intent.py&lt;/code&gt; dynamically builds a secondary system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Files modified this session:
- output/server.py (write_code, 14:32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
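&lt;p&gt;Rendering that block from the client-supplied ledger is straightforward. A sketch, assuming each &lt;code&gt;actionLogState&lt;/code&gt; entry carries a path, intent, and timestamp (the field names are illustrative):&lt;/p&gt;

```python
def build_session_prompt(action_log):
    """Render actionLogState entries into the secondary system prompt."""
    if not action_log:
        return ""
    lines = ["Files modified this session:"]
    for entry in action_log:
        lines.append(f"- {entry['path']} ({entry['intent']}, {entry['time']})")
    return "\n".join(lines)
```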



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt2kly5b4eedfgf54k7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftt2kly5b4eedfgf54k7q.png" alt="Session History" width="800" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model now has exactly what it needs. Contextual ambiguity solved without a persistent backend session.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Switching
&lt;/h3&gt;

&lt;p&gt;Different tasks have different VRAM budgets. Summarizing a README doesn't need the same model as generating a multi-file TypeScript module. Voca polls &lt;code&gt;/api/models&lt;/code&gt; from the local Ollama daemon on startup, surfaces every available model in a dropdown, and hot-swaps mid-session. Switch from &lt;code&gt;gemma3:4b&lt;/code&gt; to &lt;code&gt;deepseek-r1:7b&lt;/code&gt; between tasks without restarting anything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage 3 — The Tool Dispatcher
&lt;/h2&gt;

&lt;p&gt;The LLM never touches the filesystem directly. It produces structured JSON. The dispatcher reads that JSON and routes to one of four isolated Python functions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Creates a blank file or directory inside &lt;code&gt;output/&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prompts the LLM as a code generator, strips markdown fencing, writes raw script to disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;summarize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Feeds text into an LLM tuned for short bulleted output, returns to chat stream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Conversational fallback; passes rolling context, returns response&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
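&lt;p&gt;In code, the table above reduces to a name-to-function map. A stubbed sketch of the routing (tool bodies are placeholders, not Voca's implementations):&lt;/p&gt;

```python
def dispatch(action, tools):
    """Route one structured action to its tool; errors surface, never crash."""
    tool = tools.get(action["intent"])
    if tool is None:
        return {"ok": False, "error": f"No tool for {action['intent']!r}"}
    try:
        return {"ok": True, "result": tool(action)}
    except Exception as exc:
        return {"ok": False, "error": str(exc)}
```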

&lt;p&gt;Every tool is wrapped in explicit &lt;code&gt;try/except&lt;/code&gt; bounds. Every file operation passes through &lt;code&gt;safe_path()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Path escape attempt blocked.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Path traversal, escape sequences, absolute overrides — all rejected before a byte hits the drive.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Safety Layer: Human-in-the-Loop
&lt;/h2&gt;

&lt;p&gt;Autonomous code generation from spoken word is genuinely dangerous. A transcription error on a destructive command, a hallucinated filename, a mishear at the wrong moment — any of these could cause real damage.&lt;/p&gt;

&lt;p&gt;Voca's answer: &lt;strong&gt;any intent that writes to disk halts the pipeline entirely&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of executing, the system bounces the full proposed action back to the client. The UI renders an explicit confirmation panel showing the exact filename and content that would be written. The user must approve before a single byte is written to disk.&lt;/p&gt;
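&lt;p&gt;The gate is a hard fork in the pipeline, not a flag check buried in a tool. A sketch of the shape (response field names are assumptions):&lt;/p&gt;

```python
WRITE_INTENTS = {"create_file", "write_code"}  # anything that touches disk

def gate(action):
    """Halt any disk-writing intent; bounce the full proposal to the client."""
    if action["intent"] in WRITE_INTENTS:
        return {"status": "confirmation_required", "proposed": action}
    return {"status": "execute", "proposed": action}
```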

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtw279of08epxfdw7urv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtw279of08epxfdw7urv.png" alt="Intent Confirmation" width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Voice commands are fast. Humans need to stay in the loop on irreversible actions. This boundary is not optional and not bypassable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Five Concepts That Hold It Together
&lt;/h2&gt;

&lt;p&gt;Building this system clarified five ideas that I'd apply to any autonomous local AI agent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intents over shell access.&lt;/strong&gt; The LLM is a classifier, not an executor. It converts speech into structured JSON objectives. It never gets a shell. This single constraint eliminates an entire class of security issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools as sandboxed functions.&lt;/strong&gt; Side effects live in isolated Python functions with explicit error handling. The LLM triggers them by name. It cannot modify them, escape them, or chain them in ways the schema doesn't permit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory on the client.&lt;/strong&gt; Backend session state is a liability. Pushing state to the frontend makes every request self-contained, eliminates session drift, and makes the backend trivially horizontal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop as a hard gate.&lt;/strong&gt; Disk writes require human approval. This is not a setting. It's the architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graceful degradation over silent failure.&lt;/strong&gt; Low transcription confidence, malformed JSON output, Ollama connectivity issues — all of these have explicit failure paths that surface clearly to the user rather than producing silent bad behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;A few things I'd revisit if starting over:&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;chatContextState&lt;/code&gt; 20-frame cap is a blunt instrument. A smarter approach would score messages by relevance and prune semantically rather than chronologically — older context sometimes matters more than recent filler.&lt;/p&gt;
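&lt;p&gt;One speculative version of that smarter pruning: score frames by word overlap with the current query and keep the top-k, breaking ties toward recency. This is a sketch of the idea, not anything Voca ships:&lt;/p&gt;

```python
def prune_by_relevance(frames, query, keep=20):
    """Keep the frames most lexically relevant to the query, not just the newest."""
    q = set(query.lower().split())
    def score(item):
        idx, frame = item
        overlap = len(q.intersection(frame["text"].lower().split()))
        return (overlap, idx)  # ties broken by recency (higher index wins)
    ranked = sorted(enumerate(frames), key=score, reverse=True)
    kept = sorted(ranked[:keep], key=lambda item: item[0])  # restore chronology
    return [frame for _, frame in kept]
```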

&lt;p&gt;The single &lt;code&gt;/api/process&lt;/code&gt; route handles too much. Splitting STT, intent classification, and tool dispatch into separate endpoints would make each stage independently testable and easier to swap out.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;large-v3-turbo&lt;/code&gt; is genuinely overkill for most voice commands. A tiered approach — fast small model for simple intents, larger model only when complexity warrants — would cut latency significantly on typical usage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get the Code
&lt;/h2&gt;

&lt;p&gt;Voca is fully open-source. If you're building something where privacy isn't negotiable, or you just want a local voice agent you actually understand end-to-end, the full source is on GitHub.&lt;/p&gt;

&lt;p&gt;Drop questions in the comments — especially if you've hit the compound-intent hallucination problem in your own local LLM work. It's a nastier problem than it looks and I'd be curious how others are handling it.&lt;/p&gt;




</description>
      <category>webdev</category>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
