<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aquib</title>
    <description>The latest articles on DEV Community by Aquib (@aquib09103).</description>
    <link>https://dev.to/aquib09103</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878820%2F56ebf714-bf00-4396-a447-351165efd3c8.jpg</url>
      <title>DEV Community: Aquib</title>
      <link>https://dev.to/aquib09103</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aquib09103"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Whisper, LLaMA, and LangGraph</title>
      <dc:creator>Aquib</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:10:11 +0000</pubDate>
      <link>https://dev.to/aquib09103/building-a-voice-controlled-local-ai-agent-with-whisper-llama-and-langgraph-5cog</link>
      <guid>https://dev.to/aquib09103/building-a-voice-controlled-local-ai-agent-with-whisper-llama-and-langgraph-5cog</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
What if you could just speak to your computer and have it write code, create files, or summarize text — without sending a single byte to the cloud? That's exactly what I built for my internship assignment: a fully local, voice-controlled AI agent using only open-source tools.&lt;/p&gt;

&lt;p&gt;In this article I'll walk through the architecture, the models I chose, and the challenges I faced building it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What It Does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent accepts a voice or text command, transcribes it, figures out what the user wants, and then acts — creating files, generating code, summarizing text, or just answering a question. Everything runs on your local machine.&lt;/p&gt;

&lt;p&gt;Supported intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Code&lt;/strong&gt; — generates code and saves it to a file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create File / Folder&lt;/strong&gt; — plain file or directory creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize&lt;/strong&gt; — summarizes any provided text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General Chat&lt;/strong&gt; — LLM-powered Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyaexe4y9zdnssh424de.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyaexe4y9zdnssh424de.png" alt="Architecture diagram of the Voice-Controlled Local AI Agent showing a six-stage left-to-right pipeline: (1) Input layer with Microphone, File Upload, and Text Input; (2) STT and pre-processing with silence detection, audio normalization, and the OpenAI Whisper small model; (3) Classification with a regex rule-based pre-classifier feeding into LLaMA 3.2 3B via Ollama as a fallback, producing one of four intents; (4) HITL checkpoint requiring user confirmation before any file operation; (5) Tool Execution via LangGraph StateGraph routing to write_code, create_file, summarize, or general_chat nodes; (6) Output layer showing generated files, folders, text results, and the Streamlit UI. All components run locally with no cloud API calls." width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline has five stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Input&lt;/strong&gt; — Microphone recording (sounddevice) or file upload. Auto-stops on 1.5 seconds of silence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text&lt;/strong&gt; — OpenAI Whisper (&lt;code&gt;small&lt;/code&gt; model) runs locally. On an RTX 3050, transcription takes under 2 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent Classification&lt;/strong&gt; — A two-layer classifier: first a regex rule-based pre-classifier for high-confidence patterns (e.g. "create a python file"), then an Ollama-hosted LLaMA 3.2 model for everything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Execution&lt;/strong&gt; — LangGraph &lt;code&gt;StateGraph&lt;/code&gt; routes to the right tool node and executes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI&lt;/strong&gt; — Streamlit displays the pipeline progress, result, and session history in real time.&lt;/li&gt;
&lt;/ol&gt;
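
&lt;p&gt;Glued together, the stages above reduce to a few function calls. This is a minimal sketch with hypothetical names (&lt;code&gt;run_pipeline&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt;), not the project's actual wiring, which goes through LangGraph:&lt;/p&gt;

```python
# Minimal sketch of the pipeline flow; names are illustrative, not the
# project's actual API. Typed text skips the transcription stage.
def run_pipeline(user_input, transcribe, classify, tools):
    # Stages 1-2: accept raw audio bytes or already-typed text.
    text = user_input if isinstance(user_input, str) else transcribe(user_input)
    # Stage 3: map the utterance to one of the four intents.
    intent, args = classify(text)
    # Stages 4-5: route to the matching tool node and execute.
    return tools[intent](args)
```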




&lt;p&gt;&lt;strong&gt;Models I Chose&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whisper (small):&lt;/strong&gt; I chose the &lt;code&gt;small&lt;/code&gt; model because it strikes a good balance between accuracy and speed on a 6 GB GPU. The &lt;code&gt;tiny&lt;/code&gt; model is faster but struggled with technical vocabulary like "fibonacci" or "math_utils". I added an &lt;code&gt;initial_prompt&lt;/code&gt; to prime Whisper with programming terms, which significantly reduced transcription errors.&lt;/p&gt;
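
&lt;p&gt;A sketch of that priming, assuming the open-source &lt;code&gt;whisper&lt;/code&gt; package; the term list here is illustrative, not my exact prompt:&lt;/p&gt;

```python
# Bias Whisper's decoder toward domain vocabulary via initial_prompt.
# The term list is an example; extend it with words your users say often.
TECH_TERMS = "fibonacci, math_utils, Python, function, folder, summarize"

def transcribe_options(language="en"):
    return {
        "language": language,
        "initial_prompt": "Technical terms you may hear: " + TECH_TERMS,
    }

# Usage (downloads the model on first run):
#   import whisper
#   model = whisper.load_model("small")
#   result = model.transcribe("command.wav", **transcribe_options())
#   print(result["text"])
```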

&lt;p&gt;&lt;strong&gt;LLaMA 3.2 3B via Ollama:&lt;/strong&gt; For intent classification, a 3B model is surprisingly capable when guided with a well-structured system prompt and few-shot examples. I found that small models need explicit, concrete examples — abstract descriptions alone are not enough. I ended up adding a rule-based pre-classifier layer on top because the LLM occasionally misclassified clear-cut commands like "create a Python file inside the utils folder." The rules catch these deterministically.&lt;/p&gt;
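
&lt;p&gt;The layering can be sketched like this; the rules fire only on high-confidence patterns and return &lt;code&gt;None&lt;/code&gt; otherwise, so the LLM stays the fallback. The patterns shown are illustrative examples, not the full rule set:&lt;/p&gt;

```python
import re

# First layer: deterministic rules for clear-cut commands.
RULES = [
    (re.compile(r"\b(write|generate)\b.*\bcode\b", re.I), "write_code"),
    (re.compile(r"\bcreate\b.*\b(file|folder|directory)\b", re.I), "create_file"),
    (re.compile(r"\bsummariz\w*\b", re.I), "summarize"),
]

def pre_classify(text):
    for pattern, intent in RULES:
        if pattern.search(text):
            return intent
    return None  # no confident match: fall through to the LLM

def classify(text, llm_classify):
    # Second layer: the LLaMA classifier only runs when no rule fired.
    return pre_classify(text) or llm_classify(text)
```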




&lt;p&gt;&lt;strong&gt;Challenges I Faced&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Whisper hallucinations on silence&lt;/strong&gt;&lt;br&gt;
When the microphone records silence, Whisper doesn't return an empty string — it fabricates random sentences. I fixed this by checking the raw audio RMS (energy level) before calling Whisper. If the signal energy is below a threshold, the pipeline rejects the input immediately.&lt;/p&gt;
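
&lt;p&gt;The check itself is a few lines. This sketch assumes float PCM samples between -1.0 and 1.0; the threshold is an illustrative value that needs tuning per microphone:&lt;/p&gt;

```python
import math

SILENCE_RMS = 0.01  # illustrative threshold; tune for your microphone

def has_speech(samples):
    # Root-mean-square energy of the recorded frame.
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > SILENCE_RMS  # below threshold: reject before calling Whisper
```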

&lt;p&gt;&lt;strong&gt;2. Small LLM misclassification&lt;/strong&gt;&lt;br&gt;
LLaMA 3.2 3B is capable but not perfectly reliable for complex natural language patterns. The phrase "create a python file inside an already existing folder" was consistently classified as &lt;code&gt;general_chat&lt;/code&gt;. I solved this by building a deterministic rule-based pre-classifier using regex that catches common code-generation patterns before the LLM is even called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LangGraph + Streamlit state management&lt;/strong&gt;&lt;br&gt;
Streamlit re-renders the entire page on every interaction. Managing pipeline state across reruns — especially for Human-in-the-Loop confirmation and compound commands — required careful use of &lt;code&gt;st.session_state&lt;/code&gt; to persist intermediate results and prevent re-triggering.&lt;/p&gt;
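
&lt;p&gt;The core pattern, sketched with a plain dict standing in for &lt;code&gt;st.session_state&lt;/code&gt; (it exposes the same mapping interface), looks roughly like this; the key names are illustrative:&lt;/p&gt;

```python
# Illustrative HITL pattern; `state` stands in for st.session_state,
# which survives Streamlit reruns while local variables do not.
def confirm_and_run(state, intent, execute):
    if "pending" not in state:
        # First rerun: stash the action instead of executing it.
        state["pending"] = intent
        return "awaiting confirmation"
    if state.get("confirmed"):
        # A later rerun, after the user clicked Confirm.
        result = execute(state.pop("pending"))
        state.pop("confirmed", None)  # clear the flag so nothing re-triggers
        return result
    return "awaiting confirmation"
```

&lt;p&gt;In the real app, a button callback (e.g. via &lt;code&gt;st.button&lt;/code&gt;'s &lt;code&gt;on_click&lt;/code&gt;) would set the confirmed flag.&lt;/p&gt;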

&lt;p&gt;&lt;strong&gt;4. Subfolder file creation&lt;/strong&gt;&lt;br&gt;
The original tool stripped directory paths from filenames (using &lt;code&gt;Path.name&lt;/code&gt;), so "write code inside math_utils folder" always saved the file at the top level. I updated both the intent classifier (to extract subfolder context) and the file tool (to safely resolve one level of subdirectory within &lt;code&gt;output/&lt;/code&gt;).&lt;/p&gt;
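
&lt;p&gt;A simplified sketch of the safe resolution step (function and folder names are illustrative):&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def resolve_target(filename):
    # Drop traversal components, keep at most one subdirectory level,
    # and anchor everything under OUTPUT_DIR. The caller creates the
    # parent directory before writing.
    parts = [p for p in Path(filename).parts if p not in ("..", ".", "/")]
    if len(parts) > 2:
        parts = parts[-2:]  # deepest folder plus the file name
    return OUTPUT_DIR.joinpath(*parts)
```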

&lt;p&gt;&lt;strong&gt;5. Compound "summarize and save" commands misclassified&lt;/strong&gt;&lt;br&gt;
The command "Summarize the given text and store it in summary.txt file …" was consistently classified as &lt;code&gt;general_chat&lt;/code&gt; by LLaMA 3.2 3B, even though the system prompt included a matching few-shot example. The small model failed to generalize from "save it to" (in the example) to "store it in" (in the actual input). Adding more examples to the prompt did not help reliably. I solved this by adding a dedicated rule to the pre-classifier: a regex that detects the co-occurrence of &lt;code&gt;\bsummariz\w*\b&lt;/code&gt; and &lt;code&gt;\b(store|save)\b&lt;/code&gt; followed by a filename with an extension, then extracts the filename and the text to summarize deterministically — bypassing the LLM entirely for this pattern.&lt;/p&gt;
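
&lt;p&gt;A simplified sketch of that rule, not the exact production pattern:&lt;/p&gt;

```python
import re

# Fires only when "summariz*" co-occurs with store/save plus a filename
# carrying an extension; simplified relative to the rule described above.
SUMMARIZE_SAVE = re.compile(
    r"\bsummariz\w*\b.*\b(?:store|save)\b.*?\b([\w-]+\.\w+)\b",
    re.I | re.S,
)

def match_summarize_save(command):
    m = SUMMARIZE_SAVE.search(command)
    if m:
        return {"intent": "summarize", "filename": m.group(1)}
    return None  # fall through to the LLM classifier
```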




&lt;p&gt;&lt;strong&gt;Bonus Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compound commands:&lt;/strong&gt; A single command like "write a fibonacci function and save it inside the math_utils folder" triggers multiple chained pipeline steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; Any file operation shows a confirmation prompt before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful degradation:&lt;/strong&gt; Silent audio, keyboard mashing, and Ollama crashes are all handled with clear error messages and retry logic.&lt;/li&gt;
&lt;/ul&gt;
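
&lt;p&gt;As an illustration of the degradation path, a retry wrapper for Ollama calls might look like this; the attempt count, delay, and message text are placeholders, not my exact values:&lt;/p&gt;

```python
import time

def with_retries(call, attempts=3, delay=1.0):
    # Retry transient failures; surface a clear message instead of a traceback.
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts:
                return "Ollama is not reachable. Is `ollama serve` running?"
            time.sleep(delay)
```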




&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a fully local AI agent taught me a lot about the gap between a model's theoretical capability and its real-world reliability. The architecture itself is straightforward; the hard work is in the edge cases — hallucinations, misclassifications, and state management.&lt;/p&gt;

&lt;p&gt;The full source code is available on GitHub: &lt;strong&gt;&lt;a href="https://github.com/aquibkhanjb-pixel/Voice-Controlled-Local-AI-Agent.git" rel="noopener noreferrer"&gt;https://github.com/aquibkhanjb-pixel/Voice-Controlled-Local-AI-Agent.git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have questions or want to discuss the architecture, feel free to reach out in the comments.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
