hamsiniananya

Building a Voice-Controlled Local AI Agent: Architecture, Models, and Hard-Won Lessons

I recently built a voice-controlled AI agent that runs almost entirely on my local machine. You speak a command, it transcribes you, figures out what you want, and actually does it — creates files, writes code, summarises text, or just chats back. Here's how I built it, the architectural decisions I made, and the surprises along the way.


What We're Building

The agent has four stages in its pipeline:

  1. Speech-to-Text (STT) — converts your voice to text
  2. Intent Classification — an LLM determines what you want
  3. Tool Execution — the correct action is performed on your machine
  4. Streamlit UI — displays every stage transparently

The guiding principle was local-first: I wanted this running on my laptop without monthly API bills. Cloud providers are available as fallbacks.
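Wired together, the four stages reduce to a simple linear pipeline. Here's a minimal sketch with each stage passed in as a plain function — the names and stubs are illustrative, not the exact code from the repo:

```python
def run_pipeline(audio_path, stt, classify, execute):
    """Chain the agent's stages: audio -> text -> intents -> tool results."""
    text = stt(audio_path)                  # Stage 1: speech-to-text
    classification = classify(text)         # Stage 2: intent classification (a dict)
    results = [execute(intent, classification["entities"])
               for intent in classification["intents"]]  # Stage 3: tool execution
    return text, classification, results    # Stage 4: the UI renders all three
```

Keeping the stages as injectable callables also makes it trivial to swap the local Whisper/Ollama implementations for their cloud fallbacks.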


Architecture Deep Dive

Stage 1 — Speech-to-Text

The obvious choice is OpenAI's Whisper. I used the openai-whisper pip package, which lets you run the model entirely offline. I went with the base model (~74M parameters) as a balance between accuracy and speed on CPU. On my machine (Intel i7, 16GB RAM, no GPU), it transcribes a 10-second clip in about 12 seconds. Acceptable for a demo; I'd switch to a GPU or Groq's API for production.

import whisper

model = whisper.load_model("base")      # ~74M params; balances accuracy and CPU speed
result = model.transcribe("audio.wav")  # takes a file path; requires ffmpeg installed
print(result["text"])

Why not wav2vec? wav2vec2 is excellent for short, clean speech but less robust to diverse accents and background noise. Whisper is trained on 680,000 hours of multilingual audio — it just handles the real world better.

Hardware workaround: If your machine can't run Whisper in real time, Groq's Whisper API is free-tier friendly and returns results in under a second. I built this as a selectable option in the sidebar and document the choice explicitly in the README.


Stage 2 — Intent Classification

This is where LLM prompt engineering gets interesting. Rather than fine-tuning a model, I use a structured zero-shot classification prompt that forces the model to return a JSON object with intents, reasoning, and entities:

Given a user command, identify ALL applicable intents from this list:
create_file, write_code, summarize_text, general_chat, unknown

Return ONLY:
{
  "intents": ["intent1"],
  "reasoning": "...",
  "entities": { "filename": "...", "language": "...", "content": "..." }
}

The entities field is crucial — it lets the tool executor pick up the filename, programming language, or text content mentioned in the command without needing another LLM call.

I used Ollama with llama3.2 for local inference. Ollama runs as a local HTTP server, which means calling it from Python is just a POST request — dead simple and no GPU required (though it helps).

Compound command support: Because I extract a list of intents, a command like "Summarize this text and save it to summary.txt" correctly returns ["summarize_text"] with filename: "summary.txt" in entities — the tool executor then both generates the summary and saves it.
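Before the classifier's output reaches the tool executor, it's worth validating it defensively — LLMs occasionally invent intent names. A sketch of that guard (function and names are illustrative, not the exact code from the repo):

```python
KNOWN_INTENTS = {"create_file", "write_code", "summarize_text",
                 "general_chat", "unknown"}

def validate_classification(raw: dict) -> dict:
    """Keep only recognised intents; fall back to 'unknown' if none survive."""
    intents = [i for i in raw.get("intents", []) if i in KNOWN_INTENTS]
    return {
        "intents": intents or ["unknown"],
        "reasoning": raw.get("reasoning", ""),
        "entities": raw.get("entities", {}) or {},
    }
```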


Stage 3 — Tool Execution

Each intent maps to a tool function. All file operations are restricted to an output/ directory — a critical safety constraint I implemented by calling Path(filename).name to strip any parent directory components before constructing the output path.

def _safe_output_path(filename: str) -> Path:
    safe_name = Path(filename).name   # strips "../../../etc/passwd" attacks
    return OUTPUT_DIR / safe_name

For code generation, I send the user's request back to the LLM with a code-only prompt. For summarization, a summarization prompt. For general chat, a straightforward conversational prompt. Three prompts, one LLM call each.
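Those three prompts can live in a dict keyed by intent, so the executor just looks up the right template. The templates below are stand-ins for the real prompts:

```python
TASK_PROMPTS = {
    "write_code": "Return ONLY runnable code, no prose, for this request:\n{request}",
    "summarize_text": "Summarize the following text concisely:\n{request}",
    "general_chat": "You are a helpful local assistant. Reply conversationally:\n{request}",
}

def build_task_prompt(intent: str, request: str) -> str:
    """Pick the prompt template for an intent, defaulting to general chat."""
    template = TASK_PROMPTS.get(intent, TASK_PROMPTS["general_chat"])
    return template.format(request=request)
```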


Stage 4 — Streamlit UI

Streamlit was the natural fit for a rapid Python UI. It required no JavaScript, and the entire UI state (session history, settings) lives in st.session_state. I used custom CSS injected via st.markdown(..., unsafe_allow_html=True) to give it a dark, terminal-like feel that matches the "local agent" aesthetic.

The Human-in-the-Loop feature — a toggle in the sidebar — intercepts any file-writing intent and shows a confirmation dialog before executing. This is implemented with a simple boolean in session state.
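The gate itself is plain logic, independent of Streamlit. A simplified sketch — in the real app the toggle lives in st.session_state, and the intent set here is my assumption about which intents count as "file-writing":

```python
FILE_WRITING_INTENTS = {"create_file", "write_code"}  # intents that touch disk

def needs_confirmation(intents, hitl_enabled: bool) -> bool:
    """True when the HITL toggle is on and any intent would write a file."""
    return hitl_enabled and bool(FILE_WRITING_INTENTS.intersection(intents))
```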


The Challenges

1. Parsing LLM JSON Reliably

The biggest headache was getting consistent JSON back from the LLM. Even with explicit instructions, models occasionally wrap their response in markdown fences or add a preamble like "Sure, here is the JSON:". My solution: strip markdown fences with regex, then use re.search(r"\{.*\}", text, re.DOTALL) to extract the JSON object, then json.loads(). Never trust raw LLM output.
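That extraction fits in one helper — roughly, fence stripping first, then the greedy brace match, then json.loads:

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of messy LLM output."""
    cleaned = re.sub(r"```(?:json)?", "", raw)        # drop markdown fences
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)  # greedy: first '{' to last '}'
    if not match:
        raise ValueError("No JSON object found in LLM output")
    return json.loads(match.group(0))
```

The greedy match deliberately spans from the first opening brace to the last closing brace, which tolerates nested objects like the entities field.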

2. Whisper Audio Format

Whisper is finicky about input formats. Streamlit's st.audio_input returns bytes in a format that soundfile doesn't always parse cleanly. The fix: write to a temp .wav file and pass the path to Whisper, then clean up.
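The workaround is a few lines with tempfile. To keep this snippet runnable without Whisper installed, the transcriber is passed in as a callable — in the app it would be model.transcribe:

```python
import os
import tempfile

def transcribe_audio_bytes(audio_bytes: bytes, transcribe_fn):
    """Write raw audio bytes to a temp .wav, transcribe by path, then clean up."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    try:
        tmp.write(audio_bytes)
        tmp.close()                      # close so the transcriber can reopen it
        return transcribe_fn(tmp.name)   # e.g. model.transcribe(tmp.name)["text"]
    finally:
        os.unlink(tmp.name)              # always remove the temp file
```

delete=False plus an explicit unlink is deliberate: Whisper reopens the file by path, which fails on some platforms if the handle is still held open.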

3. Ollama Cold Start

The first inference call after starting Ollama takes 3–8 seconds to load the model into memory. Subsequent calls are fast (~1s for classification). I added a spinner in the UI so users don't think the app has frozen.

4. Compound Intents

Supporting "Summarize this and save it to file.txt" required rethinking the tool dispatcher. My first version mapped one intent to one tool. The fix was to always prioritise write_code → create_file → summarize_text → general_chat, in that order, while passing the full entities dict to every tool so the filename is always available regardless of which tool runs.
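The priority rule is short once the order lives in a list. A sketch of the dispatcher, with tools stubbed as a dict of callables (names are illustrative):

```python
INTENT_PRIORITY = ["write_code", "create_file", "summarize_text", "general_chat"]

def dispatch(intents, entities, tools):
    """Run every classified intent in priority order, passing entities to each tool."""
    rank = lambda i: (INTENT_PRIORITY.index(i)
                      if i in INTENT_PRIORITY else len(INTENT_PRIORITY))
    return [tools[i](entities) for i in sorted(intents, key=rank) if i in tools]
```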


Model Choices Summary

| Stage | Local Model | Cloud Fallback | Why |
|-------|-------------|----------------|-----|
| STT | Whisper base | Groq Whisper-large-v3 | Robustness, multilingual |
| LLM | Ollama llama3.2 | Groq llama-3.1-8b-instant | JSON compliance, speed |

Speed comparison (informal benchmarking on my machine):

  • Whisper base (CPU): ~12s for 10s clip
  • Groq Whisper API: ~0.8s for same clip
  • Ollama llama3.2 (CPU): ~4s for intent classification
  • Groq llama-3.1-8b: ~0.5s for same prompt

The cloud APIs are 5–15× faster, but the local stack costs nothing after setup and keeps all your data on your machine.


What I'd Build Next

  • Voice Activity Detection (VAD): Instead of pressing a button to record, use Silero VAD to auto-start/stop recording when speech is detected.
  • Streaming code output: Stream the LLM's code generation token-by-token into the UI for a ChatGPT-style typing effect.
  • Persistent memory across sessions: Store chat history and created files in SQLite for true agent memory.
  • Tool plugins: A simple plugin system where new tools can be registered by dropping a Python file into a tools/ directory.

Conclusion

The most surprising thing about this project was how accessible the local AI stack has become. A year ago, running a capable LLM on a laptop felt impossible. Today, Ollama + llama3.2 gives you a genuinely useful language model in one terminal command. Combine that with Whisper for STT and Streamlit for UI, and you have a full voice AI agent in under 400 lines of Python.

The code is on GitHub: https://github.com/hamsiniananya/Voice-Controlled-Local-AI-Agent.git


All opinions are my own. Built as part of an AI engineering assignment.
