Voice assistants are everywhere. But here's the uncomfortable truth hiding behind every "Hey Siri" and "OK Google": your raw audio, your personal context, your sensitive queries — all of it is getting shipped to a cloud server you don't control, processed by a model you can't inspect, and logged in ways you can't audit.
For developers working on proprietary codebases, or anyone who simply refuses to accept that trade-off, this is a non-starter.
So I built Voca — a fully open-source, 100% local voice AI agent. It can create files, generate code, summarize text, and hold conversational memory across a session. Not a single byte ever leaves your machine.
Here's exactly how I built it, every architectural decision behind it, and every wall I ran into along the way.
## The Core Constraint: Stateless by Design
The foundational architectural goal was ruthless: keep the backend stateless.
Early prototypes cached conversation history server-side. It worked, but it introduced session drift the moment you opened a second tab or restarted the dev server. The backend became a liability — something that could desync, leak context between sessions, or bloat in memory.
The fix was counterintuitive: push all state to the client.
The frontend maintains two strictly bounded arrays:
- `chatContextState` — rolling conversational dialogue, capped at 20 frames
- `actionLogState` — a ledger of every file creation or code write authorized this session
Every time you send a voice command or text input, the frontend serializes this entire footprint and ships it alongside the audio blob into the inference pipeline. The backend receives everything it needs in a single request and forgets everything the moment it responds. Clean, fast, reproducible.
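Merging the client-shipped context into the prompt is a small, pure operation. Here's a minimal sketch of how the backend side might look — `build_messages` and `MAX_CONTEXT_FRAMES` are illustrative names, not from the Voca source:

```python
MAX_CONTEXT_FRAMES = 20  # mirrors the frontend's chatContextState cap

def build_messages(chat_context: list[dict], user_text: str) -> list[dict]:
    """Merge the deserialized chatContextState the client sends with the
    new utterance. The backend keeps no copy of its own."""
    # Enforce the cap defensively server-side too: a tampered client
    # must not be able to inflate the prompt.
    bounded = chat_context[-MAX_CONTEXT_FRAMES:]
    return bounded + [{"role": "user", "content": user_text}]
```

Because the full message list is rebuilt from the request payload every time, any single request is reproducible in isolation — which makes debugging the pipeline dramatically easier.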
## The Architecture
The backend is a single FastAPI async route: `POST /api/process`.
That's intentionally it. FastAPI + Uvicorn handles raw multipart/form-data natively — audio blob plus stringified JSON state in one shot — with zero WebSocket overhead. The pipeline that executes on every request has exactly three stages:
```
Audio Blob → [STT] → [Intent Classification] → [Tool Dispatcher] → Response
```
All three run locally. None phone home.
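Stripped of the FastAPI plumbing, the whole request lifecycle is just function composition. This is a sketch of the stage boundaries with the three stages injected as callables — the names are illustrative, not the actual Voca function signatures:

```python
from typing import Callable

def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], str],
    classify: Callable[[str], list[dict]],
    dispatch: Callable[[list[dict]], str],
) -> str:
    """Three stages, no shared state between requests."""
    transcript = stt(audio)      # Stage 1: audio blob -> text
    intents = classify(transcript)  # Stage 2: text -> structured actions
    return dispatch(intents)     # Stage 3: actions -> side effects + reply
```

Keeping the stages as plain functions means each one can be unit-tested with stubs, with no server running.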
The frontend is deliberately vanilla — plain HTML, CSS, and JavaScript. No framework tax, no build step, no abstraction overhead. The UI is a thin shell over the state arrays; keeping it that way meant the rendering logic never had to fight the inference logic.
## Stage 1 — Speech-to-Text: faster-whisper
Standard Whisper deployments carry significant PyTorch cold-start latency. On repeated inference that penalty compounds fast.
The fix: initialize the model once at module load time, locking it into float16 precision directly on the CUDA buffers.
```python
from faster_whisper import WhisperModel

# Loaded once. Stays warm on GPU for the lifetime of the process.
stt_model = WhisperModel("large-v3-turbo", device="cuda", compute_type="float16")
```
With weights resting warm on the GPU, transcriptions resolve in milliseconds rather than seconds.
But raw transcription isn't enough. Voca calculates the avg_logprob across all returned Whisper segments. If confidence drops below -0.8, the pipeline trips into Graceful Degradation — execution halts, the user gets a clear warning, and no code touches the filesystem based on a mishear. An offhand cough will never trigger a file write.
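The confidence gate itself is a few lines. A minimal sketch, assuming segment objects that expose `avg_logprob` the way faster-whisper's do (the function name and threshold constant are mine):

```python
CONFIDENCE_FLOOR = -0.8  # below this mean avg_logprob, refuse to act

def transcription_is_confident(segments) -> bool:
    """Average Whisper's per-segment avg_logprob and gate the pipeline on it."""
    segs = list(segments)
    if not segs:
        return False  # no speech detected at all
    mean = sum(s.avg_logprob for s in segs) / len(segs)
    return mean >= CONFIDENCE_FLOOR
```

If this returns `False`, the request short-circuits into the warning path before intent classification ever runs.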
## Stage 2 — Intent Classification: Ollama + Structured Output
This is where most local voice agent projects fall apart: getting a 4-billion-parameter model to reliably produce machine-parseable output from ambiguous natural language.
The naive approach — prompt the model and hope for valid JSON — breaks constantly. Smaller models hallucinate schema, merge distinct actions into nonsensical compound intents, or drop required fields entirely.
The solution was to remove the model's ability to produce malformed output by binding Ollama's format= argument to a strict JSON schema:
```python
response = ollama.chat(
    model=selected_model,
    messages=messages,
    format={
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "intent": {"type": "string", "enum": ["create_file", "write_code", "summarize", "general_chat"]},
                "filename": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["intent"]
        }
    }
)
```
The model can no longer return {intent: "create_folder_and_write_file"}. It must return [{intent: "create_file"}, {intent: "write_code"}]. Compound commands become sequential, deterministic actions. The hallucination problem becomes a schema constraint problem — and schema constraints are solved.
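Even with `format=` enforcing the schema, the dispatcher should re-validate before any tool runs — defense in depth. A minimal sketch of that validation step (function name and error messages are mine, not from the Voca source):

```python
import json

ALLOWED_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intents(raw: str) -> list[dict]:
    """Parse the model's schema-constrained reply and re-check it
    before any action is dispatched."""
    actions = json.loads(raw)
    if not isinstance(actions, list):
        raise ValueError("expected a JSON array of actions")
    for action in actions:
        if action.get("intent") not in ALLOWED_INTENTS:
            raise ValueError(f"unknown intent: {action.get('intent')!r}")
    return actions
```

Anything that fails here surfaces as an explicit error to the user rather than reaching the filesystem.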
### Contextual Ambiguity
A user says: "make that function async" — 30 seconds after creating a file named server.py.
Without context, the model has no referent for "that function" or "that file." With the `actionLogState` passed from the frontend, `intent.py` dynamically builds a secondary system prompt:
```
Files modified this session:
- output/server.py (write_code, 14:32)
```
The model now has exactly what it needs. Contextual ambiguity solved without a persistent backend session.
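Rendering that prompt fragment from the action log is a one-liner per entry. A sketch, assuming each log entry carries `path`, `intent`, and `time` keys (the exact field names in Voca's `actionLogState` may differ):

```python
def build_session_prompt(action_log: list[dict]) -> str:
    """Render the client's actionLogState into a system prompt fragment."""
    if not action_log:
        return ""  # nothing done yet this session: add no prompt noise
    lines = ["Files modified this session:"]
    for entry in action_log:
        lines.append(f"- {entry['path']} ({entry['intent']}, {entry['time']})")
    return "\n".join(lines)
```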
### Model Switching
Different tasks have different VRAM budgets. Summarizing a README doesn't need the same model as generating a multi-file TypeScript module. Voca polls `/api/models` from the local Ollama daemon on startup, surfaces every available model in a dropdown, and hot-swaps mid-session. Switch from `gemma3:4b` to `deepseek-r1:7b` between tasks without restarting anything.
## Stage 3 — The Tool Dispatcher
The LLM never touches the filesystem directly. It produces structured JSON. The dispatcher reads that JSON and routes to one of four isolated Python functions:
| Tool | What it does |
|---|---|
| `create_file` | Creates a blank file or directory inside `output/` |
| `write_code` | Prompts the LLM as a code generator, strips markdown fencing, writes raw script to disk |
| `summarize` | Feeds text into an LLM tuned for short bulleted output, returns to chat stream |
| `general_chat` | Conversational fallback; passes rolling context, returns response |
Every tool is wrapped in explicit try/except bounds. Every file operation passes through safe_path():
```python
from pathlib import Path

def safe_path(filename: str) -> Path:
    base = Path("output").resolve()
    target = (base / filename).resolve()
    # is_relative_to (Python 3.9+) avoids the classic startswith() prefix
    # bug, where a sibling like "output_evil" would pass a string check
    # against "output".
    if not target.is_relative_to(base):
        raise ValueError("Path escape attempt blocked.")
    return target
```
Path traversal, escape sequences, absolute overrides — all rejected before a byte hits the drive.
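One more piece of sanitization the `write_code` tool needs: models love wrapping generated code in markdown fences, and those must never reach the file on disk. A sketch of that step — the function name and regex are mine, not Voca's exact implementation:

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove a surrounding markdown fence so only raw code hits the file."""
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return match.group(1) if match else text.strip()
```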
## The Safety Layer: Human-in-the-Loop
Autonomous code generation from spoken word is genuinely dangerous. A transcription error on a destructive command, a hallucinated filename, a mishear at the wrong moment — any of these could cause real damage.
Voca's answer: any intent that writes to disk halts the pipeline entirely.
Instead of executing, the system bounces the full proposed action back to the client. The UI renders an explicit confirmation panel showing the exact filename and content that would be written. The user must approve before a single byte is allocated.
Voice commands are fast. Humans need to stay in the loop on irreversible actions. This boundary is not optional and not bypassable.
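In code, the gate is a guard that intercepts write intents before dispatch. A minimal sketch under my own naming — the real Voca response shape will differ:

```python
WRITE_INTENTS = {"create_file", "write_code"}

def guard(action: dict) -> dict:
    """Halt any disk-writing intent and bounce it back for approval.

    An action only executes once the client resubmits it with an
    explicit confirmation flag set by the user, not by the model.
    """
    if action["intent"] in WRITE_INTENTS and not action.get("confirmed"):
        return {
            "status": "pending_confirmation",
            "proposed": action,  # UI renders filename + content verbatim
        }
    return {"status": "execute", "proposed": action}
```

Because the LLM produces `action` but the client sets `confirmed`, the model has no path to approve its own writes.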
## The Five Concepts That Hold It Together
Building this system clarified five ideas that I'd apply to any autonomous local AI agent:
1. **Intents over shell access.** The LLM is a classifier, not an executor. It converts speech into structured JSON objectives. It never gets a shell. This single constraint eliminates an entire class of security issues.
2. **Tools as sandboxed functions.** Side effects live in isolated Python functions with explicit error handling. The LLM triggers them by name. It cannot modify them, escape them, or chain them in ways the schema doesn't permit.
3. **Memory on the client.** Backend session state is a liability. Pushing state to the frontend makes every request self-contained, eliminates session drift, and makes the backend trivially horizontal.
4. **Human-in-the-loop as a hard gate.** Disk writes require human approval. This is not a setting. It's the architecture.
5. **Graceful degradation over silent failure.** Low transcription confidence, malformed JSON output, Ollama connectivity issues — all of these have explicit failure paths that surface clearly to the user rather than producing silent bad behavior.
## What I'd Do Differently
A few things I'd revisit if starting over:
The `chatContextState` 20-frame cap is a blunt instrument. A smarter approach would score messages by relevance and prune semantically rather than chronologically — older context sometimes matters more than recent filler.
The single `/api/process` route handles too much. Splitting STT, intent classification, and tool dispatch into separate endpoints would make each stage independently testable and easier to swap out.
And `large-v3-turbo` is genuinely overkill for most voice commands. A tiered approach — fast small model for simple intents, larger model only when complexity warrants — would cut latency significantly on typical usage.
## Get the Code
Voca is fully open-source. If you're building something where privacy isn't negotiable, or you just want a local voice agent you actually understand end-to-end, the full source is on GitHub.
Drop questions in the comments — especially if you've hit the compound-intent hallucination problem in your own local LLM work. It's a nastier problem than it looks and I'd be curious how others are handling it.

