<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sagnik Chattopadhyay</title>
    <description>The latest articles on DEV Community by Sagnik Chattopadhyay (@sagnik_chattopadhyay).</description>
    <link>https://dev.to/sagnik_chattopadhyay</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873068%2F775f76a8-423d-45bd-8170-9e77c77c229e.jpg</url>
      <title>DEV Community: Sagnik Chattopadhyay</title>
      <link>https://dev.to/sagnik_chattopadhyay</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagnik_chattopadhyay"/>
    <language>en</language>
    <item>
      <title>Building A.I.V.A: A Fully Local, Multi-Intent Voice Assistant</title>
      <dc:creator>Sagnik Chattopadhyay</dc:creator>
      <pubDate>Sat, 11 Apr 2026 07:11:08 +0000</pubDate>
      <link>https://dev.to/sagnik_chattopadhyay/building-aiva-a-fully-local-multi-intent-voice-assistant-3oc3</link>
      <guid>https://dev.to/sagnik_chattopadhyay/building-aiva-a-fully-local-multi-intent-voice-assistant-3oc3</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;In an era dominated by cloud-based AI, the challenge of building a private, offline voice assistant is more relevant than ever. This article explores the development of &lt;strong&gt;A.I.V.A (Advanced Intelligent Voice Assistant)&lt;/strong&gt;, a local AI agent capable of complex, multi-tasking operations using only local hardware. A.I.V.A isn't just a voice-to-text tool; it's a modular agent designed to bridge the gap between human speech and OS-level execution.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Architecture: A Decoupled 5-Layer Pipeline
&lt;/h3&gt;

&lt;p&gt;A.I.V.A is built on a modular architecture designed for low-latency local execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Layer (&lt;code&gt;audio_handler.py&lt;/code&gt;)&lt;/strong&gt;: Real-time microphone monitoring using &lt;code&gt;sounddevice&lt;/code&gt;. We implemented a custom RMS-based state machine for automatic silence detection (1.5s threshold).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STT Layer (&lt;code&gt;stt.py&lt;/code&gt;)&lt;/strong&gt;: Powered by &lt;strong&gt;Faster-Whisper&lt;/strong&gt; (CTranslate2). We optimized this for 8GB-16GB RAM by utilizing a Singleton pattern to keep the &lt;code&gt;tiny&lt;/code&gt; model resident in memory, avoiding re-load latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent Layer (&lt;code&gt;intent_classifier.py&lt;/code&gt;)&lt;/strong&gt;: The "brain" of the project: &lt;strong&gt;Qwen2.5-Coder:7b&lt;/strong&gt; running locally via Ollama, which turns each transcript into one or more structured intents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Layer (&lt;code&gt;tools/&lt;/code&gt;)&lt;/strong&gt;: A sandboxed execution environment. Tools are decoupled from the core, allowing for easy expansion of capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI Layer (&lt;code&gt;app.py&lt;/code&gt;)&lt;/strong&gt;: A premium Streamlit dashboard providing a "Chain of Thought" visualizer and real-time result panels.&lt;/li&gt;
&lt;/ol&gt;
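
&lt;p&gt;The silence-detection state machine in the Audio Layer can be sketched as a pure function, which keeps it testable independently of the microphone. Apart from the 1.5s silence window, the constants below are assumptions; the real &lt;code&gt;audio_handler.py&lt;/code&gt; may use different values:&lt;/p&gt;

```python
import numpy as np

SILENCE_THRESHOLD = 0.01  # RMS below this counts as silence (assumed value)
SILENCE_DURATION = 1.5    # seconds of continuous silence that end a recording
CHUNK_SECONDS = 0.1       # audio processed in 100 ms chunks (assumed)

def update_silence_timer(chunk, silent_time):
    """One step of the two-state machine: accumulate silence, reset on speech."""
    rms = float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2)))
    return silent_time + CHUNK_SECONDS if rms < SILENCE_THRESHOLD else 0.0

def is_utterance_over(silent_time):
    """True once the silence timer crosses the 1.5 s threshold."""
    return silent_time >= SILENCE_DURATION
```

&lt;p&gt;In the real pipeline each chunk would come from a &lt;code&gt;sounddevice.InputStream&lt;/code&gt;; once &lt;code&gt;is_utterance_over&lt;/code&gt; fires, the buffered chunks are concatenated and handed to the STT layer.&lt;/p&gt;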
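
&lt;p&gt;The Singleton pattern in the STT Layer might look roughly like this. The &lt;code&gt;loader&lt;/code&gt; parameter is a hypothetical injection point added here for testability; the &lt;code&gt;WhisperModel&lt;/code&gt; call uses the real Faster-Whisper constructor with the &lt;code&gt;int8&lt;/code&gt; quantization described later in the article:&lt;/p&gt;

```python
class STTSingleton:
    """Keep one Faster-Whisper model resident in memory for the process lifetime."""
    _model = None

    @classmethod
    def get(cls, loader=None):
        # First call pays the load cost; every later call reuses the instance,
        # avoiding the re-load latency the article mentions.
        if cls._model is None:
            cls._model = (loader or cls._default_loader)()
        return cls._model

    @staticmethod
    def _default_loader():
        from faster_whisper import WhisperModel  # lazy import keeps startup light
        # int8 keeps the 'tiny' model comfortable on 8-16 GB RAM machines
        return WhisperModel("tiny", device="cpu", compute_type="int8")
```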




&lt;h3&gt;
  
  
  The Technical Breakthroughs
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Compound Command Extraction ("Raw-Reasoning")
&lt;/h4&gt;

&lt;p&gt;Standard LLM-based intent classifiers often fail when given multiple tasks at once (e.g., &lt;em&gt;"Create a directory and search Google"&lt;/em&gt;). When forced into a strict JSON output format, small local models frequently return only the first task. &lt;br&gt;
&lt;strong&gt;Our Solution&lt;/strong&gt;: We moved to a "Raw-Reasoning" prompt. The LLM is allowed to reason in natural language before emitting a JSON array of intents; a regex-based extraction layer in Python then parses out every intent reliably. This increased our compound-command accuracy by over 60%.&lt;/p&gt;
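
&lt;p&gt;A minimal sketch of that extraction layer, assuming the model emits exactly one JSON array somewhere in its free-form output (the actual regex in &lt;code&gt;intent_classifier.py&lt;/code&gt; may be more defensive):&lt;/p&gt;

```python
import json
import re

def extract_intents(raw):
    """Pull the first JSON array out of free-form LLM output.

    The model is allowed to 'think out loud'; we only trust the JSON array
    it eventually emits. A greedy match from the first '[' to the last ']'
    spans nested objects, and json.loads validates the result.
    """
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

raw_output = """Sure! The user wants two things.
[{"intent": "create_directory", "args": {"name": "reports"}},
 {"intent": "web_search", "args": {"query": "weather today"}}]"""
```

&lt;p&gt;Running &lt;code&gt;extract_intents(raw_output)&lt;/code&gt; on the sample above yields both intents, which is exactly the failure mode strict JSON decoding loses.&lt;/p&gt;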

&lt;h4&gt;
  
  
  2. Contextual File Awareness (Option C)
&lt;/h4&gt;

&lt;p&gt;A.I.V.A supports multi-modal context. Users can upload &lt;code&gt;.txt&lt;/code&gt; files via the dashboard. We implemented a &lt;strong&gt;Context Injection&lt;/strong&gt; system where metadata about the attached file is prepended to every voice transcript. This allows for seamless interactions like &lt;em&gt;"Summarize this file"&lt;/em&gt; without the user ever repeating the filename.&lt;/p&gt;
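
&lt;p&gt;Context Injection itself can be a one-line prepend. The shape of the &lt;code&gt;attached_file&lt;/code&gt; dict and the header wording are assumptions for illustration:&lt;/p&gt;

```python
def inject_context(transcript, attached_file=None):
    """Prepend metadata about the attached file to every voice transcript."""
    if not attached_file:
        return transcript
    header = (
        f"[CONTEXT] The user has attached a file named "
        f"'{attached_file['name']}' ({attached_file['size']} bytes). "
        f"Phrases like 'this file' refer to it.\n"
    )
    return header + transcript
```

&lt;p&gt;Because the header rides along with every transcript, the LLM can resolve &lt;em&gt;"Summarize this file"&lt;/em&gt; without the user ever naming the file.&lt;/p&gt;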

&lt;h4&gt;
  
  
  3. Human-in-the-Loop Safety System
&lt;/h4&gt;

&lt;p&gt;To comply with strict security standards, we built a &lt;strong&gt;Queue &amp;amp; Confirm&lt;/strong&gt; architecture. Sensitive actions (like file creation or code writing) are held in a &lt;code&gt;pending_actions&lt;/code&gt; session state. The UI dynamically renders "Safety Cards," and no OS-level write operations occur until the user manually triggers the &lt;strong&gt;"Approve"&lt;/strong&gt; button.&lt;/p&gt;
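
&lt;p&gt;Stripped of the Streamlit UI, the Queue &amp;amp; Confirm flow reduces to a pending queue plus an explicit approve step. In the real app the queue would live in &lt;code&gt;st.session_state&lt;/code&gt;, and the handlers perform actual OS-level writes; both are stubbed here for illustration:&lt;/p&gt;

```python
# Stands in for st.session_state["pending_actions"] (assumed key name).
pending_actions = []

# Stub handlers for illustration; the real tool layer does OS-level work.
ACTION_HANDLERS = {
    "create_file": lambda path, content="": f"wrote {path}",
}

def queue_action(action):
    """Sensitive intents are queued as 'Safety Cards', never executed directly."""
    pending_actions.append({**action, "status": "pending"})

def approve(index):
    """Runs only when the user clicks 'Approve' on the corresponding card."""
    action = pending_actions.pop(index)
    return ACTION_HANDLERS[action["intent"]](**action["args"])
```

&lt;p&gt;The key property is that nothing touches the filesystem between &lt;code&gt;queue_action&lt;/code&gt; and the user-triggered &lt;code&gt;approve&lt;/code&gt;.&lt;/p&gt;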

&lt;h4&gt;
  
  
  4. The "Action Layer" (Bonus Integration)
&lt;/h4&gt;

&lt;p&gt;Beyond file operations, we implemented a dedicated &lt;strong&gt;Action Layer&lt;/strong&gt; using Python’s &lt;code&gt;webbrowser&lt;/code&gt; module. This allows A.I.V.A to route requests to the system's default browser for Google searches or direct URL navigation, making it a true workflow assistant.&lt;/p&gt;
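
&lt;p&gt;A sketch of that routing, with the URL construction split into pure helpers (the intent names are assumptions carried over from the examples above):&lt;/p&gt;

```python
import webbrowser
from urllib.parse import quote_plus

def build_search_url(query):
    """Turn a spoken query into a Google search URL."""
    return f"https://www.google.com/search?q={quote_plus(query)}"

def normalize_url(url):
    """Direct navigation: prepend a scheme if the user omitted one."""
    return url if url.startswith(("http://", "https://")) else "https://" + url

def handle_browse(intent, arg):
    """Route to the OS default browser via the stdlib webbrowser module."""
    target = build_search_url(arg) if intent == "web_search" else normalize_url(arg)
    webbrowser.open(target)
```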




&lt;h3&gt;
  
  
  Challenges &amp;amp; Optimizations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory Tightrope&lt;/strong&gt;: Running a transformer-based STT model and a 7B LLM simultaneously required aggressive quantization. We leveraged &lt;code&gt;int8&lt;/code&gt; for Whisper and 4- to 6-bit GGUF weights served by Ollama for Qwen.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Transcription Fault-Tolerance&lt;/strong&gt;: Voice transcripts are rarely perfect. We implemented "Aggressive Linking" logic that maps messy transcripts (e.g., "from this state" instead of "from this file") to the correct contextual file pointers.&lt;/li&gt;
&lt;/ul&gt;
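
&lt;p&gt;The article doesn't show the "Aggressive Linking" implementation, so the following is one plausible sketch using stdlib &lt;code&gt;difflib&lt;/code&gt;: fuzzily match n-grams of the transcript against known context phrases, then resolve to the attached file. The phrase list and the 0.6 cutoff are assumptions:&lt;/p&gt;

```python
import difflib

CONTEXT_PHRASES = ["this file", "the file", "attached file"]  # assumed list

def _ngrams(text, n):
    """All n-word windows of the transcript, for phrase-level matching."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def link_transcript_to_file(transcript, known_files):
    """Map a messy transcript back to the attached file it refers to."""
    lowered = transcript.lower()
    for phrase in CONTEXT_PHRASES:
        candidates = _ngrams(lowered, len(phrase.split()))
        # cutoff=0.6 tolerates STT slips like 'this state' for 'this file'
        if difflib.get_close_matches(phrase, candidates, cutoff=0.6):
            return known_files[0] if known_files else None
    return None
```

&lt;p&gt;With this, the mis-transcribed &lt;em&gt;"summarize from this state"&lt;/em&gt; still resolves to the uploaded file pointer.&lt;/p&gt;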




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Building A.I.V.A demonstrates that privacy-first AI agents don't need massive GPU clusters. By chaining specialized local models and a robust orchestration layer, we've created an assistant that is fast, secure, and capable of managing complex local workflows.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Python 3.11, Streamlit, Ollama (Qwen2.5-Coder), Faster-Whisper, SoundDevice.&lt;br&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/Sagnik-Chattopadhyay/A.I.V.A-A-Multi-Tasking-Local-Voice-Assistant-with-Contextual-Memory" rel="noopener noreferrer"&gt;https://github.com/Sagnik-Chattopadhyay/A.I.V.A-A-Multi-Tasking-Local-Voice-Assistant-with-Contextual-Memory&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Video Demo&lt;/strong&gt;: &lt;a href="https://youtu.be/EuzzVmRGKdA" rel="noopener noreferrer"&gt;https://youtu.be/EuzzVmRGKdA&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>nlp</category>
      <category>python</category>
    </item>
  </channel>
</rss>
