No cloud. No API keys. Just your voice, a local LLM, and a clean pipeline that actually works.
The Idea
Most AI assistants are cloud-dependent. You say something, it goes to a server somewhere, gets processed, and comes back. That works fine — until you care about privacy, latency, or just want to understand what's actually happening under the hood.
I wanted to build something different: a voice-controlled AI agent that runs completely locally. You speak (or type), it figures out what you want, and it does it — whether that's writing code, creating files, summarizing text, or having a conversation. Everything happens on your machine.
This article walks through how I built it, the models I chose, and the real challenges I ran into along the way.
What the Agent Can Do
Before diving into architecture, here's what the finished system looks like from a user's perspective:
- Say "Write a Python retry function and save it as retry.py" → it generates the code and saves it to an output/ folder
- Say "Summarize this text and save it to notes.txt" → it summarizes the content, then writes the file
- Say "Create a folder called projects" → done
- Say "What is recursion?" → it responds conversationally
It also supports chaining — one voice command can trigger multiple steps in sequence. And before any file is written to disk, a confirmation panel appears so you stay in control.
Architecture Overview
The system is a linear pipeline. Each stage has one job and passes its output to the next.
```
🎤 Audio Input (mic or file upload)
        ↓
🗣️ Speech-to-Text [faster-whisper, Whisper base]
        ↓
🧠 Intent Classifier [llama3.1:8b via Ollama]
        ↓
⚙️ Tool Executor
   ├── WRITE_CODE     → qwen2.5-coder:7b
   ├── SAVE_FILE      → writes to output/
   ├── CREATE_FILE    → creates empty file
   ├── CREATE_FOLDER  → creates directory
   ├── SUMMARIZE_TEXT → llama3.1:8b
   └── GENERAL_CHAT   → llama3.1:8b
        ↓
🖥️ Gradio UI
```
The project is split into six focused modules:
| File | Responsibility |
|---|---|
| voice.py | Converts audio to text using faster-whisper |
| intent_classifier.py | Sends text to the LLM, parses a JSON plan of steps |
| executor.py | Runs each step in order, chains outputs between steps |
| tools.py | The actual tool functions — file ops, code gen, chat |
| memory.py | Maintains a rolling conversation history within the session |
| main.py | Gradio UI and event wiring |
Each module is independent. You can swap out the STT model, replace Ollama with an API call, or add new tools without touching anything else.
The Models I Chose (and Why)
Choosing the right model for each job was more important than it might seem. Using a single general-purpose model for everything would have been simpler, but the quality difference when using specialized models is significant.
Speech-to-Text: faster-whisper (Whisper base)
faster-whisper is a reimplementation of OpenAI's Whisper using CTranslate2. For this project, I used the base model with int8 quantization on CPU.
Why this over full Whisper?
- int8 quantization cuts memory usage roughly in half compared to the standard float32 model
- On CPU, it's noticeably faster — typically 1–3 seconds for a 5-second voice clip
- VAD (Voice Activity Detection) filtering is built in, which means it skips silent segments automatically and reduces hallucinations on quiet recordings
- The base model is small enough to load instantly and accurate enough for clear English commands
For voice commands (short, purposeful sentences), the base model performs very well. You'd only need a larger model if you were transcribing long, nuanced speech.
Intent Classification: llama3.1:8b via Ollama
The heart of the system. After transcription, this model reads the user's text and returns a structured JSON plan describing exactly what steps to take.
Why llama3.1:8b?
- It follows instructions reliably, which is critical here — I need it to return valid JSON every single time, not prose
- At 8 billion parameters, it's large enough to understand nuanced commands but small enough to run on a machine with 16 GB RAM
- It handles multi-step command decomposition well — given "write code and save it", it correctly outputs two separate steps in the right order
- Temperature is set to 0 for the classifier, which makes responses deterministic and consistent
Code Generation: qwen2.5-coder:7b via Ollama
When the intent is WRITE_CODE, the request goes to a separate, code-specialized model instead of the general-purpose one.
Why a separate model for code?
Because it genuinely writes better code. qwen2.5-coder is fine-tuned specifically on programming tasks. In practice, the difference is noticeable — cleaner structure, better variable names, more idiomatic patterns. Using a general model for code generation works, but a code-specialized model works better.
How the Intent Classifier Works
This is the most interesting part of the system, so it's worth explaining in detail.
When a user's text arrives, it's sent to llama3.1:8b with a carefully crafted system prompt. The prompt instructs the model to return only a JSON object — no preamble, no explanation, no markdown fences. The JSON describes an ordered list of steps:
```json
{
  "steps": [
    {
      "intent": "WRITE_CODE",
      "query": "Python retry function",
      "meta": { "language": "python" }
    },
    {
      "intent": "SAVE_FILE",
      "query": "save code",
      "meta": { "filename": "retry.py", "content_source": "previous_step" }
    }
  ]
}
```
The content_source: "previous_step" field is how step chaining works. When the executor reaches SAVE_FILE, it checks this flag and uses the output from the previous step (the generated code) as the file content. No manual wiring required.
After the model responds, the output goes through two layers of validation:
- _extract_json() — Strips any surrounding text and pulls out the first valid {...} block, in case the model added any prose despite being told not to
- _normalize() — Ensures every step has all required keys, and silently replaces any unknown intent with GENERAL_CHAT instead of crashing
This two-layer approach means the system always produces something useful, even when the model misbehaves.
Step Execution and Output Chaining
Once the intent classifier returns its plan, the executor runs each step in sequence.
```python
previous_output = None  # the pipe between steps

for i, step in enumerate(steps, start=1):
    intent = step.get("intent")
    query = step.get("query", "")
    meta = step.get("meta", {})

    if intent == "WRITE_CODE":
        result = write_code_tool(query, language=meta.get("language", "python"))
        previous_output = result  # stored for the next step
    elif intent == "SAVE_FILE":
        # original_text is the raw user input captured before classification
        content = previous_output if meta.get("content_source") == "previous_step" else original_text
        result = save_file_tool(meta.get("filename"), content)
```
The previous_output variable acts as a simple pipe between steps. This is what makes compound commands work without any complex orchestration logic.
Human-in-the-Loop Confirmation
Any step that writes to disk — SAVE_FILE, CREATE_FILE, CREATE_FOLDER — is flagged before execution. Instead of immediately writing, the UI shows a confirmation panel:
```
⚠ Confirm File Operation
💾 Save file: retry.py
[Filename input — editable]
[Confirm] [Cancel]
```
The user can rename the file before confirming, or cancel entirely. Only after confirmation does the executor run. This was an intentional design choice: an AI agent that silently writes files to your machine without asking is a liability. A two-second confirmation step prevents a lot of potential headaches.
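The hold-and-confirm flow boils down to a small piece of state. A sketch of the pattern (the names here are illustrative, not the project's actual code): a flagged step is parked in a `PendingAction` until the UI reports a confirm, at which point any edited filename wins.

```python
from dataclasses import dataclass

# Intents that write to disk and therefore require confirmation.
DESTRUCTIVE_INTENTS = {"SAVE_FILE", "CREATE_FILE", "CREATE_FOLDER"}

@dataclass
class PendingAction:
    intent: str
    filename: str
    content: str = ""

def needs_confirmation(step: dict) -> bool:
    """Flag any step that would touch the disk."""
    return step.get("intent") in DESTRUCTIVE_INTENTS

def confirm(pending: PendingAction, edited_filename: str = "") -> dict:
    """Release the held step, honoring a filename edited in the UI panel."""
    return {
        "intent": pending.intent,
        "filename": edited_filename or pending.filename,
        "content": pending.content,
    }
```

Cancel is then just discarding the `PendingAction` without calling `confirm`.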
Session Memory
The ConversationMemory class maintains a rolling window of the last 10 messages using Python's deque with maxlen:
```python
from collections import deque

class ConversationMemory:
    def __init__(self, max_messages: int = 10):
        self.messages = deque(maxlen=max_messages)
```
When the window is full, the oldest message is automatically dropped. This keeps memory usage bounded while still giving the LLM enough context for follow-up questions like "save that to a file" to make sense.
The conversation history is passed to every LLM call — classification, code generation, summarization, and chat. This means the agent can understand references to previous turns without any extra logic.
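The eviction behavior is easy to see in isolation. A self-contained sketch of the same deque approach (the `add` and `get_context` method names are my assumption):

```python
from collections import deque

class ConversationMemory:
    def __init__(self, max_messages: int = 10):
        self.messages = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list:
        return list(self.messages)

# With a window of 3, adding 5 messages silently drops the oldest two.
memory = ConversationMemory(max_messages=3)
for i in range(5):
    memory.add("user", f"message {i}")
```

No explicit eviction code is needed; `deque(maxlen=...)` does it on append.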
Safety: Sandboxed File Operations
All file writes go through a _safe_path() function before touching disk:
```python
def _safe_path(name: str) -> Path:
    name = name.strip().lstrip("/\\")
    resolved = (OUTPUT_DIR / name).resolve()
    # Compare resolved paths, not string prefixes: a plain startswith check
    # would also accept a sibling directory like "output_evil".
    if not resolved.is_relative_to(OUTPUT_DIR.resolve()):  # Python 3.9+
        raise ValueError("Unsafe path. All writes must stay inside output/.")
    return resolved
```
This prevents directory traversal attacks. A command like "save to ../../etc/hosts" resolves to a path outside output/ and is rejected before any disk write happens. All generated files are contained within the output/ folder in the project directory.
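Here is the rejection in action, as a self-contained sketch (it assumes an `OUTPUT_DIR` constant as in the article, and uses `Path.is_relative_to`, available since Python 3.9, for the containment check):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def _safe_path(name: str) -> Path:
    name = name.strip().lstrip("/\\")
    resolved = (OUTPUT_DIR / name).resolve()
    if not resolved.is_relative_to(OUTPUT_DIR):
        raise ValueError("Unsafe path. All writes must stay inside output/.")
    return resolved

# A normal filename resolves inside output/ ...
safe = _safe_path("retry.py")

# ... while a traversal attempt is rejected before any write happens.
try:
    _safe_path("../../etc/hosts")
    blocked = ""
except ValueError as e:
    blocked = str(e)
```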
Challenges I Faced
1. Getting the LLM to Return Consistent JSON
The single hardest part. LLMs are trained to be helpful and conversational, which means they want to explain what they're doing. Even with explicit instructions like "return ONLY valid JSON", the model would occasionally wrap the output in markdown fences, add a sentence before it, or return slightly malformed JSON.
The solution was three-pronged:
- Set temperature: 0 to make outputs as deterministic as possible
- Use _extract_json() to pull out the JSON block regardless of surrounding text
- Use _normalize() to handle missing keys and unknown intents gracefully
After these three layers, the classifier became reliable enough to use in production.
2. Context Window Overflow on Summarization
When a user pastes a very long piece of text and asks for a summary, the combined system prompt + conversation history + user text can exceed the model's context window. In practice, this caused silent failures — the model would return empty output or crash.
The fix was a _truncate() function in the executor that caps input at 12,000 characters before sending it to the summarization tool. It tries to cut at a sentence boundary rather than mid-word, and appends a note so the model knows the text was trimmed:
```python
return cutoff + "\n\n[... text truncated for summarization ...]"
```
Simple, but it completely eliminated the overflow crashes.
3. Wiring Multi-Output Gradio Events
Gradio's event system requires you to declare all outputs upfront, and the number of outputs must match exactly across every yield in a generator function. When I added the confirmation panel (which introduced new state variables), every existing yield in run_classify() had to be updated to include the new outputs.
This led to a subtle bug where some code paths returned the wrong number of values, causing silent failures in the UI. The fix was centralizing the blank/default state into a _blank() helper so every yield point returned the same shape.
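The pattern is simple once the output shape lives in one place. A sketch of the idea (the output slots here are illustrative stand-ins, not the project's actual Gradio components):

```python
def _blank(status: str = ""):
    """Default value for every Gradio output, always in the same fixed order."""
    return (
        status,  # status message
        "",      # transcript box
        "",      # result box
        False,   # confirmation panel visible?
        "",      # pending filename
    )

def run_classify(text: str):
    # Every exit path yields the same 5-tuple, so the number of values
    # always matches the declared outputs, whichever branch runs.
    if not text.strip():
        yield _blank("No input received.")
        return
    yield _blank("Classifying...")
```

Adding a sixth output later means changing `_blank` once instead of auditing every `yield`.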
What I'd Build Next
- Streaming output — Show the LLM's response token by token instead of waiting for the full response
- More intents — Open applications, search the web, run terminal commands
- Persistent memory — Save conversation history to disk so context survives across sessions
- Model benchmarking — Systematically measure latency and accuracy across different Ollama models to find the best tradeoff for each task
Final Thoughts
Building this taught me that the hard part of an AI agent isn't the AI — it's the plumbing. Getting models to return structured output reliably, handling edge cases gracefully, and making the UI feel responsive despite slow local inference are all engineering problems, not AI problems.
The result is a system that genuinely works: speak a command, watch it get classified, confirm if needed, and see the result. Entirely local, entirely transparent, and easy to extend.
The full code is available on GitHub: [your repo link here]
Built with faster-whisper, llama3.1:8b, qwen2.5-coder:7b, Ollama, and Gradio.