How I built a system that listens to your voice, understands your intent, and executes actions on your machine — all with open-source models.
THE IDEA
Most AI assistants answer questions. VoiceForge does things.
You speak a command. It transcribes your audio, figures out what you want, and executes it — creating files, generating code, summarizing text — right on your local machine. The entire pipeline is visible in a clean UI so you can see exactly what happened at every step.
This post breaks down how it's built, the model choices I made, and the problems I ran into along the way.
ARCHITECTURE OVERVIEW
Audio Input (mic / file upload)
↓
Speech-to-Text ──── Groq Whisper API (whisper-large-v3-turbo)
↓
Intent Classification ── Ollama (local) → Groq LLM (fallback)
↓
Tool Execution ─────── create_file / write_code / summarize / general_chat
↓
Streamlit UI ────────── Transcription + Intent + Action + Output
The system has four distinct layers, each swappable independently.
LAYER 1: AUDIO INPUT
The UI supports two input methods:
- Microphone recording via audio-recorder-streamlit — records directly in the browser, no native app needed
- File upload — supports WAV, MP3, M4A, OGG, FLAC, and WebM
Audio format is detected from magic bytes rather than file extensions, which handles cases where users rename files or browsers record in unexpected formats (Chrome records WebM, not WAV).
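The sniffing logic is simple enough to sketch. Assuming raw bytes from the recorder or upload, something like this covers the formats above (the function name is illustrative, not the project's actual code):

```python
def detect_audio_format(data: bytes) -> str:
    """Guess the audio container from its leading magic bytes (illustrative sketch)."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    if data[:4] == b"\x1aE\xdf\xa3":   # EBML header: WebM/Matroska
        return "webm"
    if data[4:8] == b"ftyp":           # ISO base media box: MP4/M4A
        return "m4a"
    if data[:3] == b"ID3" or data[:2] in (b"\xff\xfb", b"\xff\xf3"):  # MP3
        return "mp3"
    return "wav"                        # safe default for unknown input
```

Because the check reads the bytes themselves, a Chrome recording named `recording.wav` is still correctly identified as WebM.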
LAYER 2: SPEECH-TO-TEXT
Model used: whisper-large-v3-turbo via the Groq API
Why not local Whisper?
The honest answer: my development machine doesn't have a CUDA-capable GPU. Running faster-whisper or Hugging Face's Whisper implementation on CPU is technically possible, but transcription takes 3-5x the clip's duration: a 10-second clip needs 30-50 seconds to process. That's unusable for a voice interface.
Groq runs the same Whisper Large V3 weights on their custom LPU hardware and returns results in under a second. The free tier is generous enough for development and demos. If you have an NVIDIA GPU, swapping in faster-whisper locally is a one-file change in agent/stt.py.
LAYER 3: INTENT CLASSIFICATION
This is where the intelligence lives. The transcribed text is sent to an LLM with a structured prompt that asks it to return JSON naming one or more intents.
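The post doesn't show the exact schema, so as an illustrative example (field names are my assumption, not VoiceForge's actual schema), a well-behaved response parses directly with `json.loads`:

```python
import json

# Hypothetical model response for "Write a sorting algorithm and
# explain how it works" -- field names are illustrative only.
raw = '{"intents": ["write_code", "general_chat"], "filename": "sort.py"}'
parsed = json.loads(raw)
print(parsed["intents"])   # a list, so compound commands fall out naturally
```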
Primary: Ollama (local)
Ollama runs models like llama3.2 entirely on your machine. No API calls, no cost, works offline. The intent classification prompt uses temperature: 0.1 to keep outputs consistent and near-deterministic.
Fallback: Groq (llama-3.3-70b-versatile)
If Ollama isn't running or returns an error, the system silently falls back to Groq's hosted LLM. The user never sees the failure — they just get a result.
Compound intent detection is a natural side effect of this design. Because the LLM returns a list of intents rather than a single one, a command like "Write a sorting algorithm and explain how it works" correctly produces ["write_code", "general_chat"] and both actions execute in sequence.
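The fallback chain can be sketched in a few lines, assuming Ollama's default HTTP API on port 11434; `groq_classify` here is a hypothetical stand-in for the hosted Groq call, not the project's real helper:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def groq_classify(text: str) -> str:
    # Hypothetical stand-in for the hosted Groq call.
    return '{"intents": ["general_chat"]}'

def classify(text: str) -> list:
    payload = {
        "model": "llama3.2",
        "messages": [{"role": "user", "content": text}],
        "stream": False,
        "options": {"temperature": 0.1},   # keep classification consistent
    }
    try:
        req = urllib.request.Request(
            OLLAMA_URL,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            reply = json.load(resp)["message"]["content"]
    except Exception:
        # Ollama down or erroring: fall back silently to the hosted LLM.
        reply = groq_classify(text)
    try:
        return json.loads(reply).get("intents", ["general_chat"])
    except json.JSONDecodeError:
        return ["general_chat"]
```

The broad `except` is what makes the failover invisible: connection refused, timeout, and malformed responses all take the same path.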
LAYER 4: TOOL EXECUTION
Four tools are implemented:
| Intent | What it does |
|---|---|
| create_file | Creates an empty file or folder in output/ |
| write_code | Calls the LLM to generate code, saves to output/ |
| summarize | Sends text to the LLM for summarization |
| general_chat | Conversational response with session history as context |
Safety constraint: All file writes are restricted to the output/ directory. The _safe_path() function strips path traversal characters (../) from any filename the LLM returns, so a model hallucinating ../../system32/evil.exe as a filename won't do anything dangerous.
Human-in-the-loop: Before any file operation executes, the UI shows a confirmation dialog. Non-destructive actions (chat, summarize) run immediately.
SESSION MEMORY
The SessionMemory class maintains a history of all interactions within a session. For general chat, the last 5 exchanges are included as context in the LLM messages array, giving the agent conversational continuity. For other intents, history is displayed in the sidebar for reference.
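The class itself isn't shown in the post; a minimal sketch of the idea (method names are mine) keeps the full log for the sidebar while only feeding the last five exchanges back to the LLM:

```python
class SessionMemory:
    """Illustrative sketch: full history plus a sliding chat-context window."""

    def __init__(self, context_window: int = 5):
        self.history = []                  # full log, shown in the sidebar
        self.context_window = context_window

    def add(self, user_text: str, assistant_text: str) -> None:
        self.history.append({"user": user_text, "assistant": assistant_text})

    def chat_context(self) -> list:
        """Flatten the last N exchanges into an LLM messages array."""
        messages = []
        for turn in self.history[-self.context_window:]:
            messages.append({"role": "user", "content": turn["user"]})
            messages.append({"role": "assistant", "content": turn["assistant"]})
        return messages
```

Capping the context keeps the prompt small and the local model's latency predictable as a session grows.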
CHALLENGES
1. LLM JSON Reliability
Getting an LLM to return clean JSON every time is harder than it sounds. Early tests showed the model occasionally wrapping the JSON in markdown code fences, adding explanatory text before or after, or producing slightly malformed JSON with trailing commas.
The fix was a layered parser: try direct json.loads() first, then fall back to a regex that extracts the first {...} block, then fall back to a safe default of general_chat. This means the system degrades gracefully even when the model misbehaves.
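That layered parser is easy to sketch (a sketch of the approach, not the exact code):

```python
import json
import re

def parse_intents(raw: str) -> dict:
    # 1. Happy path: the model returned clean JSON.
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        pass
    # 2. Salvage: pull the first {...} block out of fences or chatter.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 3. Give up gracefully.
    return {"intents": ["general_chat"]}
```

Markdown fences and surrounding chatter are handled by step 2; trailing commas still fail `json.loads` and land on the safe default.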
2. Function Definition Order in Streamlit
Streamlit re-runs the entire script top to bottom on every interaction. I had _execute_action() defined at the bottom of the file but called it from the confirmation dialog UI code in the middle. In a typical Python module, where calls live inside functions that only run after the whole file has been imported, this would be fine. But Streamlit's UI code executes at module level, so when the confirm button was clicked the call ran before the def statement had been reached, raising a NameError.
The fix was moving all function definitions above the UI rendering code. A simple fix, but one that's easy to miss if you're used to JavaScript-style hoisting.
3. Audio Format Detection
Browsers don't always record in the format you expect. Chrome records WebM, Safari records MP4/M4A, and some systems produce OGG. Relying on file extensions broke silently — the upload appeared to work but Groq's API returned a transcription error.
Switching to magic byte detection (reading the first 4-8 bytes of the audio data to identify the format) fixed this completely. The correct MIME type is now passed to the API regardless of what the file is named.
4. Ollama Cold Start
The first request to Ollama takes 3-5 seconds while the model weights are loaded into memory. Subsequent requests are fast. This isn't something you can fix, but it's worth knowing when you first run the app and wonder why the first classification is slow.
MODEL PERFORMANCE COMPARISON
| Task | Model | Avg. Latency |
|---|---|---|
| STT | Groq Whisper (API) | ~0.8s (10s clip) |
| STT | Whisper CPU (local) | ~30s (10s clip) |
| Intent classification | Ollama llama3.2 (local) | ~2-4s |
| Intent classification | Groq llama-3.3-70b (API) | ~1s |
| Code generation | Ollama llama3.2 (local) | ~8-15s |
| Code generation | Groq llama-3.3-70b (API) | ~2-3s |
For a voice interface, latency matters a lot. The Groq API path is noticeably snappier end-to-end. Local Ollama is slower but completely private and free.
WHAT I'D DO DIFFERENTLY
- Streaming output — code generation currently waits for the full response before displaying anything. Streaming tokens to the UI would make it feel much faster.
- Local STT — with a proper GPU, replacing Groq Whisper with faster-whisper locally would make the entire pipeline offline-capable.
- Wake word detection — instead of clicking a button to record, listening for a trigger word would make it feel like a real voice assistant.
STACK SUMMARY
- STT: Groq Whisper API (whisper-large-v3-turbo)
- LLM: Ollama llama3.2 (local) / Groq llama-3.3-70b-versatile (fallback)
- UI: Streamlit
- Language: Python 3.9+
Code is open source on GitHub: https://github.com/ShashwatSharma19/VoiceForge
