<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tishya Jha</title>
    <description>The latest articles on DEV Community by Tishya Jha (@tj04).</description>
    <link>https://dev.to/tj04</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877080%2F7d5a132c-c2ce-4b3d-8d44-1a00403dab7d.jpg</url>
      <title>DEV Community: Tishya Jha</title>
      <link>https://dev.to/tj04</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tj04"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent</title>
      <dc:creator>Tishya Jha</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:01:14 +0000</pubDate>
      <link>https://dev.to/tj04/building-a-voice-controlled-local-ai-agent-37mj</link>
      <guid>https://dev.to/tj04/building-a-voice-controlled-local-ai-agent-37mj</guid>
      <description>&lt;p&gt;When I was tasked with building a Voice-Controlled Local AI Agent, I imagined a smooth, Jarvis-like experience. The reality? I was running a CPU-only Windows machine, which meant every massive AI model I threw at it crawled at a snail's pace.&lt;/p&gt;

&lt;p&gt;But constraints breed creativity. Here is how I tackled the assignment: the build as a flow of tasks, the architecture decisions I had to make, and the challenges I overcame, including a complete UI redesign. Let's walk through the journey in phases and dive into each one.&lt;/p&gt;

&lt;p&gt;Phase 1: The "Ears" — Capturing and Transcribing Speech&lt;br&gt;
The Task: Listen to the user via a microphone or file upload and turn that speech into text. The File: audio_processor.py&lt;/p&gt;

&lt;p&gt;The Architecture &amp;amp; Model: My initial thought was to handle everything locally using HuggingFace's Whisper model. However, I immediately hit a massive roadblock. Running whisper-large-v3 on my CPU-only machine required ~4GB of RAM and took anywhere from 30 to 120 seconds just to transcribe a short 5-second sentence. If a voice agent isn't fast, it's unusable. My laptop's limited storage and processing power only made things worse.&lt;/p&gt;

&lt;p&gt;The Solution: I pivoted to using Groq's Whisper API. By offloading just the Speech-to-Text (STT) component to Groq, I achieved sub-second transcriptions. This was a crucial architectural trade-off: use a fast API for the ears, keeping my local CPU entirely free for the "brain" (the LLM). I used the sounddevice package in Python to handle raw microphone recording cross-platform, streaming the bytes directly to Groq.&lt;/p&gt;
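
&lt;p&gt;As a minimal sketch of that pipeline (the helper names are mine, and the upload assumes the official groq Python SDK with a GROQ_API_KEY set in the environment), the recording gets packed into an in-memory WAV and sent off for transcription:&lt;/p&gt;

```python
import io
import wave

def pcm_to_wav_bytes(pcm_bytes, sample_rate=16000):
    """Pack raw 16-bit mono PCM into an in-memory WAV file for upload."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()

def record_microphone(seconds=5, sample_rate=16000):
    """Record from the default mic with sounddevice (cross-platform)."""
    import sounddevice as sd
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="int16")
    sd.wait()
    return audio.tobytes()

def transcribe(wav_bytes):
    """Send the WAV bytes to Groq's hosted Whisper for fast STT."""
    from groq import Groq
    client = Groq()  # reads GROQ_API_KEY from the environment
    result = client.audio.transcriptions.create(
        file=("speech.wav", wav_bytes),
        model="whisper-large-v3",
    )
    return result.text
```

&lt;p&gt;Keeping the third-party imports inside the functions means the WAV helper still works on machines without a microphone stack or API key.&lt;/p&gt;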

&lt;p&gt;Phase 2: The "Brain" — Intent Understanding&lt;br&gt;
The Task: Analyze the transcript and figure out exactly what the user wants to do, returning it as a structured JSON action plan. The File: app.py (LangChain Orchestration)&lt;/p&gt;

&lt;p&gt;The Architecture &amp;amp; Model: I chose Ollama for local inference and LangChain to handle the prompt formatting. I needed a model that was smart enough to output strict JSON arrays (to support compound commands like "Summarize this and save it as a file").&lt;/p&gt;
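
&lt;p&gt;A condensed sketch of what "strict JSON arrays" means in practice (the intent names here are illustrative, not my exact set): the model's reply is parsed as a JSON array and every step is checked against a whitelist before anything executes.&lt;/p&gt;

```python
import json

# Illustrative whitelist of intents the agent knows how to execute.
ALLOWED_INTENTS = {"chat", "summarize", "create_file", "write_code"}

def parse_action_plan(raw):
    """Parse and validate the LLM reply as a JSON array of actions."""
    plan = json.loads(raw)
    if not isinstance(plan, list):
        raise ValueError("expected a JSON array of actions")
    for step in plan:
        if not isinstance(step, dict) or step.get("intent") not in ALLOWED_INTENTS:
            raise ValueError(f"unrecognized step: {step!r}")
    return plan
```

&lt;p&gt;A compound command like "Summarize this and save it as a file" then simply parses into a two-element array, executed in order.&lt;/p&gt;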

&lt;p&gt;The Challenge: I initially downloaded qwen3-vl:4b (a 3.3 GB vision-language model). While powerful, my CPU struggled: classifying even a simple intent took agonizingly long, about 7 minutes to analyze a 4-second utterance.&lt;/p&gt;

&lt;p&gt;I eventually deleted the 4B model and pulled qwen2.5:1.5b. It's under 1GB and lightning-fast. To optimize it further in code, I restricted the context window (num_ctx=2048) and output tokens (num_predict=256). Suddenly, intent classification went down to 2-5 seconds.&lt;/p&gt;
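
&lt;p&gt;For concreteness, the request shape below targets Ollama's /api/generate endpoint; num_ctx and num_predict are standard Ollama options, though the wrapper function itself is just my sketch:&lt;/p&gt;

```python
def build_ollama_request(prompt, model="qwen2.5:1.5b"):
    """Request body for Ollama's /api/generate endpoint, with the
    context window and output length capped for CPU-only inference."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": 2048,     # smaller KV cache, less RAM pressure
            "num_predict": 256,  # hard cap on generated tokens
        },
    }
```

&lt;p&gt;LangChain's ChatOllama accepts the same knobs directly, e.g. ChatOllama(model="qwen2.5:1.5b", num_ctx=2048, num_predict=256), which is how I wired it into the orchestration.&lt;/p&gt;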

&lt;p&gt;Phase 3: The "Hands" — Tool Execution &amp;amp; Safety&lt;br&gt;
The Task: Actually do the things the brain decided on—summarize text, chat, create files, and write code. The Files: tools.py and the output/ directory.&lt;/p&gt;

&lt;p&gt;Function Breakdown: I built a clean dictionary mapping in Python called TOOL_MAP. If the LLM output {"intent": "write_code", ...}, it triggered the write_code function.&lt;/p&gt;
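
&lt;p&gt;In sketch form (handler bodies stubbed out, and only two of the tools shown), the dispatch looks like this:&lt;/p&gt;

```python
def write_code(action):
    """Stub: generate code and save it under output/."""
    return f"wrote code to {action.get('filename', 'main.py')}"

def summarize(action):
    """Stub: summarize the provided text."""
    return "summary of the provided text"

# The LLM's "intent" field is the key; the value is the handler to run.
TOOL_MAP = {
    "write_code": write_code,
    "summarize": summarize,
}

def dispatch(action):
    handler = TOOL_MAP.get(action["intent"])
    if handler is None:
        return f"Unknown intent: {action['intent']}"
    return handler(action)
```

&lt;p&gt;The nice part of a plain dictionary over an if/elif chain: adding a tool is one function plus one entry, and unknown intents fall through to a safe default instead of crashing.&lt;/p&gt;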

&lt;p&gt;The Challenge (Safety First): If a local AI hallucinates, it could accidentally overwrite C:\Windows\System32. To prevent this, I built a strict sandbox. Every file operation first goes through a _safe_path function using Python's pathlib. It actively strips out traversal attempts (like ../) and ensures the final resolved path lives only inside the output/ folder.&lt;/p&gt;

&lt;p&gt;Phase 4: The "Face" — Obsessing Over the UI&lt;br&gt;
The Task: Build a UI to track the entire pipeline (Transcript -&amp;gt; Intent -&amp;gt; Action -&amp;gt; Result). The File: app.py (Streamlit)&lt;/p&gt;

&lt;p&gt;The Challenge &amp;amp; The Redesign: Initially, I built a 60/40 split-screen dashboard. The left side had controls and massive colored badges tracking every pipeline step. The right side had chat. It functioned perfectly for debugging, but it felt like a dashboard, not an agent.&lt;/p&gt;

&lt;p&gt;I wanted a modern, intuitive experience. I stopped coding, mapped out a "Chat-First" layout, and completely rewrote the Streamlit UI logic.&lt;/p&gt;

&lt;p&gt;The Final Architecture:&lt;/p&gt;

&lt;p&gt;The Sidebar Toolbox: I moved all utilities (Mic, Upload, Clear History) and the output/ folder directory tree into a collapsible sidebar. Out of sight, out of mind.&lt;br&gt;
The Main Canvas: I pinned st.chat_input to the bottom. Native chat bubbles dominate the screen, holding the session's chat history along with a list of the files created so far; both are kept in Streamlit session state and wired together with LangChain.&lt;br&gt;
The "Thought Process": To keep the UI clean but still prove the technical pipeline worked, I nested the colored pipeline badges (Transcript -&amp;gt; Intent -&amp;gt; Action) inside a collapsible accordion labeled "View Execution Steps" under the AI's response.&lt;br&gt;
Human-in-the-Loop: For dangerous actions (file creation), the system pauses and injects an elegant Approve / Deny card directly into the chat flow.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Building this agent was less about knowing the right libraries and more about iteration: finding out local Whisper was too slow, figuring out how to optimize a local LLM for a CPU by swapping model sizes, battling IDE vs. virtual environment path errors (a classic Windows headache!), and realizing that a functional UI isn't always a good UI.&lt;/p&gt;

&lt;p&gt;By strategically mixing cloud STT with a highly-optimized local LLM and a chat-first interface, I managed to build a responsive, safe, and beautiful agent on a machine that initially seemed underpowered for the task.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
