<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aquib</title>
    <description>The latest articles on DEV Community by Aquib (@aquib09103).</description>
    <link>https://dev.to/aquib09103</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878820%2F56ebf714-bf00-4396-a447-351165efd3c8.jpg</url>
      <title>DEV Community: Aquib</title>
      <link>https://dev.to/aquib09103</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aquib09103"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Whisper, LLaMA, and LangGraph</title>
      <dc:creator>Aquib</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:10:11 +0000</pubDate>
      <link>https://dev.to/aquib09103/building-a-voice-controlled-local-ai-agent-with-whisper-llama-and-langgraph-5cog</link>
      <guid>https://dev.to/aquib09103/building-a-voice-controlled-local-ai-agent-with-whisper-llama-and-langgraph-5cog</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
What if you could just speak to your computer and have it write code, create files, or summarize text — without sending a single byte to the cloud? That's exactly what I built for my internship assignment: a fully local, voice-controlled AI agent using only open-source tools.&lt;/p&gt;

&lt;p&gt;In this article I'll walk through the architecture, the models I chose, and the challenges I faced building it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;What It Does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent accepts a voice or text command, transcribes it, figures out what the user wants, and then acts — creating files, generating code, summarizing text, or just answering a question. Everything runs on your local machine.&lt;/p&gt;

&lt;p&gt;Supported intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write Code&lt;/strong&gt; — generates code and saves it to a file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create File / Folder&lt;/strong&gt; — plain file or directory creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize&lt;/strong&gt; — summarizes any provided text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;General Chat&lt;/strong&gt; — LLM-powered Q&amp;amp;A&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyaexe4y9zdnssh424de.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjyaexe4y9zdnssh424de.png" alt="Architecture diagram of the Voice-Controlled Local AI Agent showing a six-stage left-to-right pipeline: (1) Input layer with Microphone, File Upload, and Text Input; (2) STT and pre-processing with silence detection, audio normalization, and the OpenAI Whisper small model; (3) Classification with a regex rule-based pre-classifier feeding into LLaMA 3.2 3B via Ollama as a fallback, producing one of four intents; (4) HITL checkpoint requiring user confirmation before any file operation; (5) Tool Execution via LangGraph StateGraph routing to write_code, create_file, summarize, or general_chat nodes; (6) Output layer showing generated files, folders, text results, and the Streamlit UI. All components run locally with no cloud API calls." width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pipeline has five stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Input&lt;/strong&gt; — Microphone recording (sounddevice) or file upload. Auto-stops on 1.5 seconds of silence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text&lt;/strong&gt; — OpenAI Whisper (&lt;code&gt;small&lt;/code&gt; model) runs locally. On an RTX 3050, transcription takes under 2 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent Classification&lt;/strong&gt; — A two-layer classifier: first a regex rule-based pre-classifier for high-confidence patterns (e.g. "create a python file"), then an Ollama-hosted LLaMA 3.2 model for everything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool Execution&lt;/strong&gt; — LangGraph &lt;code&gt;StateGraph&lt;/code&gt; routes to the right tool node and executes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UI&lt;/strong&gt; — Streamlit displays the pipeline progress, result, and session history in real time.&lt;/li&gt;
&lt;/ol&gt;
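
&lt;p&gt;Glued together, the stages above reduce to a few function calls. This is a minimal sketch with hypothetical names (&lt;code&gt;run_pipeline&lt;/code&gt;, &lt;code&gt;tools&lt;/code&gt;), not the project's actual wiring, which goes through LangGraph:&lt;/p&gt;

```python
# Minimal sketch of the pipeline flow; names are illustrative, not the
# project's actual API. Typed text skips the transcription stage.
def run_pipeline(user_input, transcribe, classify, tools):
    # Stages 1-2: accept raw audio bytes or already-typed text.
    text = user_input if isinstance(user_input, str) else transcribe(user_input)
    # Stage 3: map the utterance to one of the four intents.
    intent, args = classify(text)
    # Stages 4-5: route to the matching tool node and execute.
    return tools[intent](args)
```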




&lt;p&gt;&lt;strong&gt;Models I Chose&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whisper (small):&lt;/strong&gt; I chose the &lt;code&gt;small&lt;/code&gt; model because it strikes a good balance between accuracy and speed on a 6 GB GPU. The &lt;code&gt;tiny&lt;/code&gt; model is faster but struggled with technical vocabulary like "fibonacci" or "math_utils". I added an &lt;code&gt;initial_prompt&lt;/code&gt; to prime Whisper with programming terms, which significantly reduced transcription errors.&lt;/p&gt;
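
&lt;p&gt;A sketch of that priming, assuming the open-source &lt;code&gt;whisper&lt;/code&gt; package; the term list here is illustrative, not my exact prompt:&lt;/p&gt;

```python
# Bias Whisper's decoder toward domain vocabulary via initial_prompt.
# The term list is an example; extend it with words your users say often.
TECH_TERMS = "fibonacci, math_utils, Python, function, folder, summarize"

def transcribe_options(language="en"):
    return {
        "language": language,
        "initial_prompt": "Technical terms you may hear: " + TECH_TERMS,
    }

# Usage (downloads the model on first run):
#   import whisper
#   model = whisper.load_model("small")
#   result = model.transcribe("command.wav", **transcribe_options())
#   print(result["text"])
```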

&lt;p&gt;&lt;strong&gt;LLaMA 3.2 3B via Ollama:&lt;/strong&gt; For intent classification, a 3B model is surprisingly capable when guided with a well-structured system prompt and few-shot examples. I found that small models need explicit, concrete examples — abstract descriptions alone are not enough. I ended up adding a rule-based pre-classifier layer on top because the LLM occasionally misclassified clear-cut commands like "create a Python file inside the utils folder." The rules catch these deterministically.&lt;/p&gt;
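
&lt;p&gt;The layering can be sketched like this; the rules fire only on high-confidence patterns and return &lt;code&gt;None&lt;/code&gt; otherwise, so the LLM stays the fallback. The patterns shown are illustrative examples, not the full rule set:&lt;/p&gt;

```python
import re

# First layer: deterministic rules for clear-cut commands.
RULES = [
    (re.compile(r"\b(write|generate)\b.*\bcode\b", re.I), "write_code"),
    (re.compile(r"\bcreate\b.*\b(file|folder|directory)\b", re.I), "create_file"),
    (re.compile(r"\bsummariz\w*\b", re.I), "summarize"),
]

def pre_classify(text):
    for pattern, intent in RULES:
        if pattern.search(text):
            return intent
    return None  # no confident match: fall through to the LLM

def classify(text, llm_classify):
    # Second layer: the LLaMA classifier only runs when no rule fired.
    return pre_classify(text) or llm_classify(text)
```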




&lt;p&gt;&lt;strong&gt;Challenges I Faced&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Whisper hallucinations on silence&lt;/strong&gt;&lt;br&gt;
When the microphone records silence, Whisper doesn't return an empty string — it fabricates random sentences. I fixed this by checking the raw audio RMS (energy level) before calling Whisper. If the signal energy is below a threshold, the pipeline rejects the input immediately.&lt;/p&gt;
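
&lt;p&gt;The check itself is a few lines. This sketch assumes float PCM samples between -1.0 and 1.0; the threshold is an illustrative value that needs tuning per microphone:&lt;/p&gt;

```python
import math

SILENCE_RMS = 0.01  # illustrative threshold; tune for your microphone

def has_speech(samples):
    # Root-mean-square energy of the recorded frame.
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > SILENCE_RMS  # below threshold: reject before calling Whisper
```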

&lt;p&gt;&lt;strong&gt;2. Small LLM misclassification&lt;/strong&gt;&lt;br&gt;
LLaMA 3.2 3B is capable but not perfectly reliable for complex natural language patterns. The phrase "create a python file inside an already existing folder" was consistently classified as &lt;code&gt;general_chat&lt;/code&gt;. I solved this by building a deterministic rule-based pre-classifier using regex that catches common code-generation patterns before the LLM is even called.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. LangGraph + Streamlit state management&lt;/strong&gt;&lt;br&gt;
Streamlit re-renders the entire page on every interaction. Managing pipeline state across reruns — especially for Human-in-the-Loop confirmation and compound commands — required careful use of &lt;code&gt;st.session_state&lt;/code&gt; to persist intermediate results and prevent re-triggering.&lt;/p&gt;
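
&lt;p&gt;The core pattern, sketched with a plain dict standing in for &lt;code&gt;st.session_state&lt;/code&gt; (it exposes the same mapping interface), looks roughly like this; the key names are illustrative:&lt;/p&gt;

```python
# Illustrative HITL pattern; `state` stands in for st.session_state,
# which survives Streamlit reruns while local variables do not.
def confirm_and_run(state, intent, execute):
    if "pending" not in state:
        # First rerun: stash the action instead of executing it.
        state["pending"] = intent
        return "awaiting confirmation"
    if state.get("confirmed"):
        # A later rerun, after the user clicked Confirm.
        result = execute(state.pop("pending"))
        state.pop("confirmed", None)  # clear the flag so nothing re-triggers
        return result
    return "awaiting confirmation"
```

&lt;p&gt;In the real app, a button callback (e.g. via &lt;code&gt;st.button&lt;/code&gt;'s &lt;code&gt;on_click&lt;/code&gt;) would set the confirmed flag.&lt;/p&gt;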

&lt;p&gt;&lt;strong&gt;4. Subfolder file creation&lt;/strong&gt;&lt;br&gt;
The original tool stripped directory paths from filenames (using &lt;code&gt;Path.name&lt;/code&gt;), so "write code inside math_utils folder" always saved the file at the top level. I updated both the intent classifier (to extract subfolder context) and the file tool (to safely resolve one level of subdirectory within &lt;code&gt;output/&lt;/code&gt;).&lt;/p&gt;
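
&lt;p&gt;A simplified sketch of the safe resolution step (function and folder names are illustrative):&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def resolve_target(filename):
    # Drop traversal components, keep at most one subdirectory level,
    # and anchor everything under OUTPUT_DIR. The caller creates the
    # parent directory before writing.
    parts = [p for p in Path(filename).parts if p not in ("..", ".", "/")]
    if len(parts) > 2:
        parts = parts[-2:]  # deepest folder plus the file name
    return OUTPUT_DIR.joinpath(*parts)
```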

&lt;p&gt;&lt;strong&gt;5. Compound "summarize and save" commands misclassified&lt;/strong&gt;&lt;br&gt;
The command "Summarize the given text and store it in summary.txt file …" was consistently classified as &lt;code&gt;general_chat&lt;/code&gt; by LLaMA 3.2 3B, even though the system prompt included a matching few-shot example. The small model failed to generalize from "save it to" (in the example) to "store it in" (in the actual input). Adding more examples to the prompt did not help reliably. I solved this by adding a dedicated rule to the pre-classifier: a regex that detects the co-occurrence of &lt;code&gt;\bsummariz\w*\b&lt;/code&gt; and &lt;code&gt;\b(store|save)\b&lt;/code&gt; followed by a filename with an extension, then extracts the filename and the text to summarize deterministically — bypassing the LLM entirely for this pattern.&lt;/p&gt;
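
&lt;p&gt;A simplified sketch of that rule, not the exact production pattern:&lt;/p&gt;

```python
import re

# Fires only when "summariz*" co-occurs with store/save plus a filename
# carrying an extension; simplified relative to the rule described above.
SUMMARIZE_SAVE = re.compile(
    r"\bsummariz\w*\b.*\b(?:store|save)\b.*?\b([\w-]+\.\w+)\b",
    re.I | re.S,
)

def match_summarize_save(command):
    m = SUMMARIZE_SAVE.search(command)
    if m:
        return {"intent": "summarize", "filename": m.group(1)}
    return None  # fall through to the LLM classifier
```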




&lt;p&gt;&lt;strong&gt;Bonus Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compound commands:&lt;/strong&gt; A single command like "write a fibonacci function and save it inside the math_utils folder" triggers multiple chained pipeline steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop:&lt;/strong&gt; Any file operation shows a confirmation prompt before execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful degradation:&lt;/strong&gt; Silent audio, keyboard mashing, and Ollama crashes are all handled with clear error messages and retry logic.&lt;/li&gt;
&lt;/ul&gt;
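
&lt;p&gt;As an illustration of the degradation path, a retry wrapper for Ollama calls might look like this; the attempt count, delay, and message text are placeholders, not my exact values:&lt;/p&gt;

```python
import time

def with_retries(call, attempts=3, delay=1.0):
    # Retry transient failures; surface a clear message instead of a traceback.
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts:
                return "Ollama is not reachable. Is `ollama serve` running?"
            time.sleep(delay)
```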




&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Building a fully local AI agent taught me a lot about the gap between a model's theoretical capability and its real-world reliability. The architecture itself is straightforward; the hard work is in the edge cases — hallucinations, misclassifications, and state management.&lt;/p&gt;

&lt;p&gt;The full source code is available on GitHub: &lt;strong&gt;&lt;a href="https://github.com/aquibkhanjb-pixel/Voice-Controlled-Local-AI-Agent.git" rel="noopener noreferrer"&gt;https://github.com/aquibkhanjb-pixel/Voice-Controlled-Local-AI-Agent.git&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have questions or want to discuss the architecture, feel free to reach out in the comments.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
