Introduction
In an era dominated by cloud-based AI, the challenge of building a private, offline voice assistant is more relevant than ever. This article explores the development of A.I.V.A (Advanced Intelligent Voice Assistant), a local AI agent capable of complex, multi-tasking operations using only local hardware. A.I.V.A isn't just a voice-to-text tool; it's a modular agent designed to bridge the gap between human speech and OS-level execution.
The Architecture: A Decoupled 5-Layer Pipeline
A.I.V.A is built on a modular architecture designed for low-latency local execution:
- Audio Layer (`audio_handler.py`): Real-time microphone monitoring using `sounddevice`. We implemented a custom RMS-based state machine for automatic silence detection (1.5 s threshold).
- STT Layer (`stt.py`): Powered by Faster-Whisper (CTranslate2). We optimized this for 8–16 GB RAM by using a Singleton pattern to keep the `tiny` model resident in memory, avoiding reload latency.
- Intent Layer (`intent_classifier.py`): The "brain" of the project. We used Qwen2.5-Coder:7b via Ollama.
- Tool Layer (`tools/`): A sandboxed execution environment. Tools are decoupled from the core, allowing easy expansion of capabilities.
- UI Layer (`app.py`): A premium Streamlit dashboard providing a "Chain of Thought" visualizer and real-time result panels.
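The RMS-based silence detection in the Audio Layer can be sketched as a tiny state machine. This is an illustrative reconstruction, not the actual `audio_handler.py`; the threshold values and class name are assumptions.

```python
import numpy as np

SILENCE_RMS = 0.01      # hypothetical amplitude threshold
SILENCE_SECONDS = 1.5   # matches the 1.5 s cutoff described above
BLOCK_SECONDS = 0.1     # assumed duration of each audio block

class SilenceDetector:
    """States: 'idle' -> 'speaking' -> 'done' once the RMS level
    stays below SILENCE_RMS for SILENCE_SECONDS in a row."""

    def __init__(self):
        self.state = "idle"
        self.quiet = 0.0  # seconds of consecutive silence

    def feed(self, block: np.ndarray) -> str:
        rms = float(np.sqrt(np.mean(block ** 2)))
        if rms >= SILENCE_RMS:
            self.state = "speaking"
            self.quiet = 0.0
        elif self.state == "speaking":
            self.quiet += BLOCK_SECONDS
            if self.quiet >= SILENCE_SECONDS:
                self.state = "done"
        return self.state
```

In the real pipeline, `sounddevice`'s callback would hand each captured block to `feed()`, and recording stops when the state reaches `"done"`.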
The Technical Breakthroughs
1. Compound Command Extraction ("Raw-Reasoning")
Standard LLM-based intent classifiers often fail when asked for multiple tasks (e.g., "Create a directory and search Google"). If forced into a strict JSON format by the API, small local models often return only one task.
Our Solution: We moved to a "Raw-Reasoning" prompt. We allow the LLM to output natural reasoning before providing a JSON array. We then use a robust regex-based extraction layer in Python to reliably parse multiple intents. This increased our compound command accuracy by over 60%.
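The extraction step can be sketched as follows: let the model ramble, then pull the first JSON array out of the raw text. A minimal version of the idea; the function name is illustrative and A.I.V.A's actual parser may be more defensive.

```python
import json
import re

def extract_intents(raw: str) -> list:
    """Find a JSON array embedded in free-form LLM reasoning
    and parse it; return an empty list on failure."""
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return []

# Example of the kind of "raw reasoning" output this handles:
raw_output = """Let me think. The user asked for two things:
[{"tool": "create_directory", "args": {"name": "reports"}},
 {"tool": "web_search", "args": {"query": "local LLMs"}}]"""
```

Because the model is never forced into strict JSON mode, it stays free to reason first, and small 7B models reliably emit all tasks instead of collapsing to one.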
2. Contextual File Awareness (Option C)
A.I.V.A supports multi-modal context. Users can upload .txt files via the dashboard. We implemented a Context Injection system where metadata about the attached file is prepended to every voice transcript. This allows for seamless interactions like "Summarize this file" without the user ever repeating the filename.
3. Human-in-the-Loop Safety System
To comply with strict security standards, we built a Queue & Confirm architecture. Sensitive actions (like file creation or code writing) are held in a pending_actions session state. The UI dynamically renders "Safety Cards," and no OS-level write operations occur until the user manually triggers the "Approve" button.
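The Queue & Confirm flow can be sketched with a plain dict standing in for Streamlit's `st.session_state`; the tool names and `execute` dispatcher here are illustrative placeholders.

```python
SENSITIVE = {"create_file", "write_code"}  # assumed sensitive tools

def execute(action: dict) -> str:
    # Placeholder for the real tool dispatcher.
    return f"executed {action['tool']}"

def route_action(session: dict, action: dict) -> str:
    """Sensitive actions wait in pending_actions; safe ones run."""
    if action["tool"] in SENSITIVE:
        session.setdefault("pending_actions", []).append(action)
        return "queued"
    return execute(action)

def approve_all(session: dict) -> list[str]:
    """Called when the user clicks 'Approve' on a Safety Card."""
    return [execute(a) for a in session.pop("pending_actions", [])]
```

Nothing touches the filesystem until `approve_all` runs, which is exactly the guarantee the Safety Cards surface in the UI.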
4. The "Action Layer" (Bonus Integration)
Beyond file operations, we implemented a dedicated Action Layer using Python’s webbrowser modules. This allows A.I.V.A to intelligently route requests to the system's default browser for Google searches or direct URL navigation, making it a true workflow assistant.
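The routing logic amounts to a small branch over the request text: direct URLs pass through, everything else becomes a search query. A simplified sketch (the `do_open` flag is added here so the function can be exercised headlessly):

```python
import urllib.parse
import webbrowser

def open_request(text: str, do_open: bool = True) -> str:
    """Route a request to the default browser: URLs go straight
    through, anything else becomes a Google search."""
    if text.startswith(("http://", "https://")):
        url = text
    else:
        url = ("https://www.google.com/search?q="
               + urllib.parse.quote_plus(text))
    if do_open:  # skipped in tests / headless environments
        webbrowser.open(url)
    return url
```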
Challenges & Optimizations
- Memory Tightrope: Running a transformer-based STT model and a 7B LLM simultaneously required aggressive quantization. We leveraged `int8` for Whisper and 4–6 bit GGUF/Ollama weights for Qwen.
- Transcription Fault-Tolerance: Voice transcripts are rarely perfect. We implemented "Aggressive Linking" logic that maps messy transcripts (e.g., "from this state" instead of "from this file") to the correct contextual file pointers.
Conclusion
Building A.I.V.A demonstrates that privacy-first AI agents don't need massive GPU clusters. By chaining specialized local models and a robust orchestration layer, we've created an assistant that is fast, secure, and capable of managing complex local workflows.
Tech Stack: Python 3.11, Streamlit, Ollama (Qwen2.5-Coder), Faster-Whisper, SoundDevice.
GitHub: https://github.com/Sagnik-Chattopadhyay/A.I.V.A-A-Multi-Tasking-Local-Voice-Assistant-with-Contextual-Memory
Video Demo: https://youtu.be/EuzzVmRGKdA