<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tishya Jha</title>
    <description>The latest articles on DEV Community by Tishya Jha (@tj04).</description>
    <link>https://dev.to/tj04</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877080%2F7d5a132c-c2ce-4b3d-8d44-1a00403dab7d.jpg</url>
      <title>DEV Community: Tishya Jha</title>
      <link>https://dev.to/tj04</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tj04"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent</title>
      <dc:creator>Tishya Jha</dc:creator>
      <pubDate>Mon, 13 Apr 2026 18:01:14 +0000</pubDate>
      <link>https://dev.to/tj04/building-a-voice-controlled-local-ai-agent-37mj</link>
      <guid>https://dev.to/tj04/building-a-voice-controlled-local-ai-agent-37mj</guid>
      <description>&lt;p&gt;When I was tasked with building a Voice-Controlled Local AI Agent, I imagined a smooth, Jarvis-like experience. The reality? I was running a CPU-only Windows machine, which meant every massive AI model I threw at it crawled at a snail's pace.&lt;/p&gt;

&lt;p&gt;But constraints breed creativity. Here is how I tackled the assignment: the build as a flow of tasks, the architecture decisions I had to make, and the challenges I overcame, including a complete UI redesign. Let's walk through the journey in phases and dive into each one.&lt;/p&gt;

&lt;p&gt;Phase 1: The "Ears" — Capturing and Transcribing Speech&lt;br&gt;
The Task: Listen to the user via a microphone or file upload and turn that speech into text. The File: audio_processor.py&lt;/p&gt;

&lt;p&gt;The Architecture &amp;amp; Model: My initial thought was to handle everything locally using HuggingFace's Whisper model. However, I immediately hit a massive roadblock. Running whisper-large-v3 on my CPU-only machine required ~4GB of RAM and took anywhere from 30 to 120 seconds just to transcribe a short 5-second sentence. If a voice agent isn't fast, it's unusable. My laptop's limited storage and processing power only made things worse.&lt;/p&gt;

&lt;p&gt;The Solution: I pivoted to using Groq's Whisper API. By offloading just the Speech-to-Text (STT) component to Groq, I achieved sub-second transcriptions. This was a crucial architectural trade-off: use a fast API for the ears, keeping my local CPU entirely free for the "brain" (the LLM). I used the sounddevice package in Python to handle raw microphone recording cross-platform, streaming the bytes directly to Groq.&lt;/p&gt;
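
&lt;p&gt;As a minimal sketch of that pipeline (the helper names are mine, and the upload assumes the official groq Python SDK with a GROQ_API_KEY set in the environment), the recording gets packed into an in-memory WAV and sent off for transcription:&lt;/p&gt;

```python
import io
import wave

def pcm_to_wav_bytes(pcm_bytes, sample_rate=16000):
    """Pack raw 16-bit mono PCM into an in-memory WAV file for upload."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()

def record_microphone(seconds=5, sample_rate=16000):
    """Record from the default mic with sounddevice (cross-platform)."""
    import sounddevice as sd
    audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="int16")
    sd.wait()
    return audio.tobytes()

def transcribe(wav_bytes):
    """Send the WAV bytes to Groq's hosted Whisper for fast STT."""
    from groq import Groq
    client = Groq()  # reads GROQ_API_KEY from the environment
    result = client.audio.transcriptions.create(
        file=("speech.wav", wav_bytes),
        model="whisper-large-v3",
    )
    return result.text
```

&lt;p&gt;Keeping the third-party imports inside the functions means the WAV helper still works on machines without a microphone stack or API key.&lt;/p&gt;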

&lt;p&gt;Phase 2: The "Brain" — Intent Understanding&lt;br&gt;
The Task: Analyze the transcript and figure out exactly what the user wants to do, returning it as a structured JSON action plan. The File: app.py (LangChain Orchestration)&lt;/p&gt;

&lt;p&gt;The Architecture &amp;amp; Model: I chose Ollama for local inference and LangChain to handle the prompt formatting. I needed a model that was smart enough to output strict JSON arrays (to support compound commands like "Summarize this and save it as a file").&lt;/p&gt;
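
&lt;p&gt;A condensed sketch of what "strict JSON arrays" means in practice (the intent names here are illustrative, not my exact set): the model's reply is parsed as a JSON array and every step is checked against a whitelist before anything executes.&lt;/p&gt;

```python
import json

# Illustrative whitelist of intents the agent knows how to execute.
ALLOWED_INTENTS = {"chat", "summarize", "create_file", "write_code"}

def parse_action_plan(raw):
    """Parse and validate the LLM reply as a JSON array of actions."""
    plan = json.loads(raw)
    if not isinstance(plan, list):
        raise ValueError("expected a JSON array of actions")
    for step in plan:
        if not isinstance(step, dict) or step.get("intent") not in ALLOWED_INTENTS:
            raise ValueError(f"unrecognized step: {step!r}")
    return plan
```

&lt;p&gt;A compound command like "Summarize this and save it as a file" then simply parses into a two-element array, executed in order.&lt;/p&gt;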

&lt;p&gt;The Challenge: I initially downloaded qwen3-vl:4b (a 3.3 GB vision-language model). While powerful, my CPU struggled: classifying even a simple intent took agonizingly long, about 7 minutes to analyze a 4-second utterance.&lt;/p&gt;

&lt;p&gt;I eventually deleted the 4B model and pulled qwen2.5:1.5b. It's under 1GB and lightning-fast. To optimize it further in code, I restricted the context window (num_ctx=2048) and output tokens (num_predict=256). Suddenly, intent classification went down to 2-5 seconds.&lt;/p&gt;
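
&lt;p&gt;For concreteness, the request shape below targets Ollama's /api/generate endpoint; num_ctx and num_predict are standard Ollama options, though the wrapper function itself is just my sketch:&lt;/p&gt;

```python
def build_ollama_request(prompt, model="qwen2.5:1.5b"):
    """Request body for Ollama's /api/generate endpoint, with the
    context window and output length capped for CPU-only inference."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": 2048,     # smaller KV cache, less RAM pressure
            "num_predict": 256,  # hard cap on generated tokens
        },
    }
```

&lt;p&gt;LangChain's ChatOllama accepts the same knobs directly, e.g. ChatOllama(model="qwen2.5:1.5b", num_ctx=2048, num_predict=256), which is how I wired it into the orchestration.&lt;/p&gt;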

&lt;p&gt;Phase 3: The "Hands" — Tool Execution &amp;amp; Safety&lt;br&gt;
The Task: Actually do the things the brain decided on—summarize text, chat, create files, and write code. The Files: tools.py and the output/ directory.&lt;/p&gt;

&lt;p&gt;Function Breakdown: I built a clean dictionary mapping in Python called TOOL_MAP. If the LLM output {"intent": "write_code", ...}, it triggered the write_code function.&lt;/p&gt;
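
&lt;p&gt;In sketch form (handler bodies stubbed out, and only two of the tools shown), the dispatch looks like this:&lt;/p&gt;

```python
def write_code(action):
    """Stub: generate code and save it under output/."""
    return f"wrote code to {action.get('filename', 'main.py')}"

def summarize(action):
    """Stub: summarize the provided text."""
    return "summary of the provided text"

# The LLM's "intent" field is the key; the value is the handler to run.
TOOL_MAP = {
    "write_code": write_code,
    "summarize": summarize,
}

def dispatch(action):
    handler = TOOL_MAP.get(action["intent"])
    if handler is None:
        return f"Unknown intent: {action['intent']}"
    return handler(action)
```

&lt;p&gt;The nice part of a plain dictionary over an if/elif chain: adding a tool is one function plus one entry, and unknown intents fall through to a safe default instead of crashing.&lt;/p&gt;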

&lt;p&gt;The Challenge (Safety First): If a local AI hallucinates, it could accidentally overwrite C:\Windows\System32. To prevent this, I built a strict sandbox. Every file operation first goes through a _safe_path function using Python's pathlib. It actively strips out traversal attempts (like ../) and ensures the final resolved path lives only inside the output/ folder.&lt;/p&gt;

&lt;p&gt;Phase 4: The "Face" — Obsessing Over the UI&lt;br&gt;
The Task: Build a UI to track the entire pipeline (Transcript -&amp;gt; Intent -&amp;gt; Action -&amp;gt; Result). The File: app.py (Streamlit)&lt;/p&gt;

&lt;p&gt;The Challenge &amp;amp; The Redesign: Initially, I built a 60/40 split-screen dashboard. The left side had controls and massive colored badges tracking every pipeline step. The right side had chat. It functioned perfectly for debugging, but it felt like a dashboard, not an agent.&lt;/p&gt;

&lt;p&gt;I wanted a modern, intuitive experience. I stopped coding, mapped out a "Chat-First" layout, and completely rewrote the Streamlit UI logic.&lt;/p&gt;

&lt;p&gt;The Final Architecture:&lt;/p&gt;

&lt;p&gt;The Sidebar Toolbox: I moved all utilities (Mic, Upload, Clear History) and the output/ folder directory tree into a collapsible sidebar. Out of sight, out of mind.&lt;br&gt;
The Main Canvas: I pinned st.chat_input to the bottom. Native chat bubbles dominate the screen, holding the session's chat history along with a list of the files created so far; both are kept in Streamlit session state and wired together with LangChain.&lt;br&gt;
The "Thought Process": To keep the UI clean but still prove the technical pipeline worked, I nested the colored pipeline badges (Transcript -&amp;gt; Intent -&amp;gt; Action) inside a collapsible accordion labeled "View Execution Steps" under the AI's response.&lt;br&gt;
Human-in-the-Loop: For dangerous actions (file creation), the system pauses and injects an elegant Approve / Deny card directly into the chat flow.&lt;/p&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Building this agent was less about knowing the right libraries and more about iteration: finding out local Whisper was too slow, figuring out how to optimize a local LLM for a CPU by swapping model sizes, battling IDE vs. virtual environment path errors (a classic Windows headache!), and realizing that a functional UI isn't always a good UI.&lt;/p&gt;

&lt;p&gt;By strategically mixing cloud STT with a highly-optimized local LLM and a chat-first interface, I managed to build a responsive, safe, and beautiful agent on a machine that initially seemed underpowered for the task.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>learning</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
