<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: 23B01A05J5 CSE</title>
    <description>The latest articles on DEV Community by 23B01A05J5 CSE (@23b01a05j5_cse_de9776ee07).</description>
    <link>https://dev.to/23b01a05j5_cse_de9776ee07</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871321%2Fcce6ab64-6d98-48ad-9ff9-1b32afdfb62b.png</url>
      <title>DEV Community: 23B01A05J5 CSE</title>
      <link>https://dev.to/23b01a05j5_cse_de9776ee07</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/23b01a05j5_cse_de9776ee07"/>
    <language>en</language>
    <item>
      <title>How I Built a Voice-Controlled Local AI Agent from Scratch</title>
      <dc:creator>23B01A05J5 CSE</dc:creator>
      <pubDate>Fri, 10 Apr 2026 08:48:32 +0000</pubDate>
      <link>https://dev.to/23b01a05j5_cse_de9776ee07/how-i-built-a-voice-controlled-local-ai-agent-from-scratch-3f0e</link>
      <guid>https://dev.to/23b01a05j5_cse_de9776ee07/how-i-built-a-voice-controlled-local-ai-agent-from-scratch-3f0e</guid>
      <description>&lt;p&gt;Introduction&lt;br&gt;
When I first read the assignment brief — "build a voice-controlled AI agent that runs locally" — it sounded simple. Record audio, transcribe it, do something with it. But as I started building, I realized there were a dozen small problems hiding inside that one big one. This article walks through the architecture I chose, the models I used, and the real challenges I faced along the way.&lt;/p&gt;

&lt;p&gt;What the System Does&lt;br&gt;
The agent accepts voice input (microphone or uploaded audio file), converts it to text, classifies the user's intent using an LLM, and then executes the right action on your local machine — creating files, generating code, summarizing text, or having a general conversation. The entire pipeline is displayed in a clean Streamlit UI.&lt;/p&gt;

&lt;p&gt;Architecture Overview&lt;br&gt;
The system has four layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio Input — Streamlit's built-in st.audio_input() handles browser microphone recording. File upload supports .wav, .mp3, and .m4a.&lt;/li&gt;
&lt;li&gt;Speech-to-Text (STT) — I used Groq's hosted Whisper API (whisper-large-v3). More on why below.&lt;/li&gt;
&lt;li&gt;Intent Classification — An LLM reads the transcribed text and returns a structured JSON object with the detected intent and parameters. I used Ollama (llama3.2) as the primary option, with Groq LLaMA-3 as a cloud fallback.&lt;/li&gt;
&lt;li&gt;Tool Execution — Based on the intent, the system runs one of five tools: write_code, create_file, summarize, general_chat, or compound (multiple actions chained together). All file writes are restricted to an output/ folder for safety.&lt;/li&gt;
&lt;/ol&gt;
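&lt;p&gt;The layer-4 dispatch can be sketched as a small intent-to-function table. The handler bodies below are illustrative stand-ins, not the project's actual implementations:&lt;/p&gt;

```python
# Hypothetical sketch of the layer-4 dispatch: map intent names to tool functions.
# Handler bodies are stand-ins for the real write_code / create_file / general_chat tools.

def write_code(params):
    return "# generated code for: " + params.get("task", "")

def create_file(params):
    return "created " + params.get("name", "untitled.txt") + " in output/"

def general_chat(params):
    return "chat reply to: " + params.get("text", "")

TOOLS = {"write_code": write_code, "create_file": create_file, "general_chat": general_chat}

def execute_tool(intent):
    # Unknown or missing intents fall back to general_chat so the app never crashes.
    handler = TOOLS.get(intent.get("intent"), general_chat)
    return handler(intent.get("params", {}))
```

&lt;p&gt;Keeping the dispatch in a plain dict makes adding a sixth tool a one-line change.&lt;/p&gt;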

&lt;p&gt;Models I Chose and Why&lt;br&gt;
STT: Groq Whisper API instead of local HuggingFace&lt;br&gt;
The assignment suggested using a HuggingFace Whisper model locally. I tried it. Running whisper-large-v3 locally on CPU took 45–60 seconds per 10-second clip, which made the demo feel broken. On a machine without a dedicated GPU, this is simply not practical for a real-time demo.&lt;br&gt;
Groq's hosted endpoint runs the exact same model (whisper-large-v3) in under 2 seconds. It's free to start and requires no local GPU. I documented this tradeoff in the README and kept the local HuggingFace path as a fallback in the code for anyone with a capable GPU.&lt;br&gt;
LLM: Ollama (llama3.2) with Groq fallback&lt;br&gt;
For intent classification and text generation, I used Ollama running locally with llama3.2. This keeps everything on-device — no data leaves your machine. For users who don't have Ollama installed, the system automatically falls back to Groq's LLaMA-3.3-70b API.&lt;br&gt;
I also built a pure rule-based classifier as a last resort, so the app never crashes even with no API key and no Ollama.&lt;/p&gt;
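&lt;p&gt;The three-tier fallback (Ollama, then Groq, then rules) can be sketched like this. The first two backends are stubbed here, since the real calls depend on a local Ollama install or a Groq API key:&lt;/p&gt;

```python
# Sketch of a three-tier classifier fallback: local LLM, cloud LLM, then pure rules.
# The first two backends are stand-ins that simulate being unavailable.

def classify_with_ollama(text):
    raise ConnectionError("Ollama not running")  # stand-in for the local llama3.2 call

def classify_with_groq(text):
    raise RuntimeError("no GROQ_API_KEY set")    # stand-in for the cloud fallback

def classify_with_rules(text):
    # Last-resort keyword matching, so classification always returns something.
    lowered = text.lower()
    if "summar" in lowered:
        return {"intent": "summarize"}
    if "file" in lowered:
        return {"intent": "create_file"}
    return {"intent": "general_chat"}

def classify_intent(text):
    for backend in (classify_with_ollama, classify_with_groq):
        try:
            return backend(text)
        except Exception:
            continue
    return classify_with_rules(text)
```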

&lt;p&gt;Key Challenges&lt;br&gt;
Challenge 1: Getting structured output from the LLM&lt;br&gt;
The intent classifier needs to return a clean JSON object every time. LLMs sometimes wrap their response in markdown fences (a leading &lt;code&gt;```json&lt;/code&gt; line) or add extra explanation text. I solved this by writing a &lt;code&gt;_parse_json()&lt;/code&gt; helper that strips fences before parsing, and falls back to &lt;code&gt;general_chat&lt;/code&gt; if parsing fails completely.&lt;/p&gt;

&lt;p&gt;Challenge 2: Compound commands&lt;br&gt;
Supporting "write a function AND summarize it" in one audio clip required a &lt;code&gt;compound&lt;/code&gt; intent type. The LLM detects this and returns a list of &lt;code&gt;sub_intents&lt;/code&gt;. The tool executor then chains them, running each sub-tool in sequence and combining the outputs.&lt;/p&gt;

&lt;p&gt;Challenge 3: File safety&lt;br&gt;
Letting an AI agent write files to your machine is genuinely risky. I restricted all file operations to a single &lt;code&gt;output/&lt;/code&gt; folder and added a &lt;code&gt;_safe_filename()&lt;/code&gt; function that strips path separators, preventing directory traversal attacks like &lt;code&gt;../../etc/passwd&lt;/code&gt;. I also added a Human-in-the-Loop toggle in the UI: when enabled, the agent asks for your confirmation before executing any file operation.&lt;/p&gt;

&lt;p&gt;Challenge 4: Session memory&lt;br&gt;
Each LLM call is stateless by default. To give the agent memory within a session, I maintain a &lt;code&gt;chat_context&lt;/code&gt; list in Streamlit's session state and pass the last 6 messages to every LLM call. This lets the user say "make that function async" as a follow-up and have the agent understand what "that function" refers to.&lt;/p&gt;

&lt;p&gt;What I Would Do Differently&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add a vector store (like ChromaDB) for persistent memory across sessions, not just within one session.&lt;/li&gt;
&lt;li&gt;Use function calling instead of prompt-based JSON parsing for more reliable intent extraction.&lt;/li&gt;
&lt;li&gt;Add streaming output so the user sees the LLM's response token by token instead of waiting for the full result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conclusion&lt;br&gt;
Building this agent taught me that the hard part of AI systems is not the AI; it's the plumbing around it. Parsing outputs reliably, handling failures gracefully, keeping file operations safe, and making the UI feel responsive are all non-trivial engineering problems. The models are the easy part once the infrastructure is solid.&lt;/p&gt;

&lt;p&gt;The full source code is available on GitHub. Feel free to fork it, extend the intents, or swap in a different LLM.&lt;/p&gt;
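&lt;p&gt;As a concrete illustration of Challenge 1, here is a minimal, hypothetical version of the fence-stripping parser; the project's actual &lt;code&gt;_parse_json()&lt;/code&gt; may differ in detail:&lt;/p&gt;

```python
import json

# Hypothetical reconstruction of the Challenge-1 helper: strip markdown fences,
# parse JSON, and fall back to general_chat instead of crashing.
def parse_llm_json(raw):
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence line (e.g. a ```json marker) and any closing fence.
        lines = [ln for ln in text.splitlines() if not ln.strip().startswith("```")]
        text = "\n".join(lines)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Never crash: treat unparseable output as plain conversation.
        return {"intent": "general_chat", "params": {"text": raw}}
```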

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
