Ayush Kumar
Building a Voice-Controlled AI Agent That Runs Entirely on Your Laptop

As part of an internship assignment, I built a voice-controlled AI agent that takes audio input, understands what you're asking, and actually does it: creating files, writing code, summarizing text, or just chatting. Everything runs locally. No external APIs, no data leaving your machine.

This post covers the architecture, the models I picked, and the problems I ran into along the way.


What it does

You speak or type a command, and the system handles the rest:

  1. Converts audio to text
  2. Detects your intent using a local LLM
  3. Executes the action
  4. Shows the result in a simple UI

Some examples of what it can handle:

  • "Create a Python file with bubble sort" — generates the code and saves it
  • "Summarize this text..." — returns a concise summary
  • "Summarize this and save it to summary.txt" — runs both actions in sequence

Architecture

The pipeline is simple by design:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output

I kept it modular so each part can be swapped or improved without touching the rest. The whole thing is four files — stt.py, intent.py, tools.py, and app.py.
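To make the modularity concrete, here is a minimal sketch of how the four files could wire together. The function names and stub bodies are illustrative, not the actual API of stt.py, intent.py, or tools.py:

```python
# Minimal sketch of the pipeline wiring. Each stage is a plain function,
# so any one of them can be swapped without touching the others.

def transcribe(audio_path: str) -> str:
    # stt.py: speech-to-text stage (stubbed here)
    return "create a python file with bubble sort"

def detect_intent(text: str) -> dict:
    # intent.py: LLM-based intent detection (stubbed here)
    return {"intent": "create_file", "args": {"description": text}}

def execute(intent: dict) -> str:
    # tools.py: dispatch to the tool that handles this intent (stubbed here)
    return f"handled {intent['intent']}"

def run_pipeline(audio_path: str) -> str:
    # app.py: glue the stages together in order
    text = transcribe(audio_path)
    intent = detect_intent(text)
    return execute(intent)
```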


Tech Stack

  • Python
  • Streamlit for the UI
  • faster-whisper for speech-to-text, running locally on CPU
  • Ollama with llama3.2 for intent detection
  • sounddevice and scipy for audio recording

Why These Tools

faster-whisper

I started with Whisper through HuggingFace transformers. On my CPU it was taking 30 to 40 seconds per clip, which made it completely unusable for anything interactive. faster-whisper with the base model in int8 mode brings that down to 3 to 6 seconds with similar accuracy. That was the difference between something that felt broken and something that actually worked.
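The faster-whisper setup described above is only a few lines. This is a sketch based on the library's standard usage (base model, CPU, int8 quantization); in a real app you would load the model once at startup rather than per call:

```python
def transcribe(audio_path: str) -> str:
    """Transcribe a clip with faster-whisper's base model in int8 mode on CPU."""
    from faster_whisper import WhisperModel  # lazy import; in practice load once at startup

    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    # transcribe() yields segments lazily; join them into one string
    return " ".join(segment.text.strip() for segment in segments)
```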

Model comparison on my machine:

Model           Size    Avg time on CPU   Accuracy
whisper tiny    75MB    1-2s              Often misses words
whisper base    150MB   3-6s              Good enough for clear speech
whisper small   500MB   8-15s             Better but too slow

Ollama + llama3.2

I wanted intent detection to run locally without any API dependency. llama3.2 through Ollama handles this well: it takes 2 to 4 seconds per request and follows a JSON schema consistently when you give it clear examples in the system prompt. Without examples it gets creative with the structure, which breaks the parser.
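A sketch of what that looks like with the Ollama Python client. The schema and few-shot examples here are illustrative, not the actual prompt from intent.py:

```python
import json

# Few-shot system prompt; the schema and examples are illustrative.
SYSTEM_PROMPT = """You are an intent classifier. Reply with JSON only, no prose.
Schema: {"intent": "<create_file|summarize|chat>", "args": {...}}
Examples:
User: create a python file with bubble sort
{"intent": "create_file", "args": {"filename": "bubble_sort.py", "description": "bubble sort"}}
User: summarize this text: ...
{"intent": "summarize", "args": {"text": "..."}}"""

def detect_intent(command: str) -> dict:
    import ollama  # lazy import so the prompt above is testable without Ollama running

    response = ollama.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": command},
        ],
    )
    return json.loads(response["message"]["content"])
```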

Tool-based execution

Instead of letting the LLM do everything end to end, I separated execution into specific functions. Each intent maps to a function that does one thing. This made the system predictable and much easier to debug when something went wrong.
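The dispatch itself can be as simple as a dict from intent name to function. The tool names and bodies below are illustrative stand-ins for what tools.py does:

```python
# Each intent maps to exactly one function that does one thing.
def create_file(filename: str, content: str) -> str:
    with open(filename, "w") as f:
        f.write(content)
    return f"Created {filename}"

def summarize(text: str) -> str:
    # placeholder: the real tool asks the local LLM for a summary
    return text[:100]

TOOLS = {
    "create_file": create_file,
    "summarize": summarize,
}

def execute(intent: dict) -> str:
    tool = TOOLS.get(intent["intent"])
    if tool is None:
        return "Unknown intent"  # fall back instead of crashing
    return tool(**intent["args"])
```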


Problems I Ran Into

Streamlit state resets

Streamlit reruns the entire script on every button click. So when the user clicked confirm, the transcript and intent result that had already been computed would disappear. The fix was storing everything in st.session_state. Once I did that, the pipeline became stable, but it took me a while to understand why the confirm button wasn't doing anything.
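The pattern boils down to computing a value once and caching it under a key that survives reruns. A sketch with a plain dict standing in for st.session_state (the helper name is mine, not the app's):

```python
def compute_once(state: dict, key: str, compute):
    """Compute a value once and cache it so later reruns reuse it.

    In the app, `state` is st.session_state; a plain dict behaves the same way.
    """
    if key not in state:
        state[key] = compute()
    return state[key]

# In Streamlit this looks like:
#   if "transcript" not in st.session_state:
#       st.session_state["transcript"] = transcribe(audio_path)
#   transcript = st.session_state["transcript"]
# so clicking Confirm reruns the whole script but keeps the transcript.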

Whisper transcribing noise as gibberish

Early on I was getting completely wrong output: random characters, wrong words, sometimes a different language entirely. Two things caused this: recording was starting before I finished speaking, and short clips with mostly silence confused the model. Adding a 3-second countdown before recording starts and enforcing a minimum duration of 5 seconds fixed it almost completely.
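Both fixes fit in the recording function. This is a sketch using sounddevice and scipy's standard APIs; the constants match the values above, but the function itself is illustrative:

```python
import time

MIN_DURATION = 5   # seconds; shorter clips were mostly silence and confused Whisper
COUNTDOWN = 3      # seconds of warning before recording starts

def clamp_duration(requested: float, minimum: float = MIN_DURATION) -> float:
    """Enforce a minimum clip length."""
    return max(requested, minimum)

def record(path: str, duration: float, samplerate: int = 16000) -> None:
    import sounddevice as sd            # lazy imports so the helper above
    from scipy.io.wavfile import write  # stays testable without audio hardware

    duration = clamp_duration(duration)
    for i in range(COUNTDOWN, 0, -1):   # give the speaker time to get ready
        print(f"Recording in {i}...")
        time.sleep(1)
    audio = sd.rec(int(duration * samplerate), samplerate=samplerate, channels=1)
    sd.wait()                           # block until the clip is fully captured
    write(path, samplerate, audio)
```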

LLM returning inconsistent JSON

The whole intent detection pipeline depends on the LLM returning valid JSON. That didn't always happen: sometimes it wrapped the output in markdown fences, sometimes it added an explanation before the JSON. I handled this by stripping markdown before parsing and wrapping the whole thing in a try/except that falls back to a chat intent if parsing fails. The app never crashes now; it just treats anything it can't parse as a general chat message.
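A parser along those lines might look like this. It is a sketch of the approach described above, not the exact code from intent.py:

```python
import json

def parse_intent(raw: str) -> dict:
    """Parse the LLM's reply, tolerating markdown fences and stray prose."""
    text = raw.strip()
    if text.startswith("```"):
        # drop ```json ... ``` fences around the payload
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    # grab the first {...} span in case the model added an explanation first
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1:
        text = text[start:end + 1]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # anything unparseable becomes a plain chat message instead of a crash
        return {"intent": "chat", "args": {"text": raw}}
```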

Compound commands

This was the most interesting part to build. The idea is that a single sentence like "summarize this and save it to summary.txt" should trigger two actions in sequence, where the second action uses the output of the first.

I handled this by changing the intent schema to support a compound flag and a commands array. When the second command depends on the first, it uses a __PREVIOUS_OUTPUT__ placeholder that gets substituted at runtime.
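The runtime substitution is a small loop over the commands array. The tool functions below are hypothetical stubs for illustration; only the placeholder mechanism mirrors the description above:

```python
PLACEHOLDER = "__PREVIOUS_OUTPUT__"

def run_commands(commands: list, tools: dict) -> str:
    """Run a compound command list, feeding each step's output into the next."""
    previous = ""
    for command in commands:
        # substitute the previous step's output wherever the placeholder appears
        args = {
            key: (previous if value == PLACEHOLDER else value)
            for key, value in command["args"].items()
        }
        previous = tools[command["intent"]](**args)
    return previous

# Hypothetical tools for illustration:
def summarize(text: str) -> str:
    return f"summary of: {text}"

def save_file(filename: str, content: str) -> str:
    # the real tool writes to disk; stubbed here
    return f"saved {len(content)} chars to {filename}"
```

So "summarize this and save it to summary.txt" becomes two entries in the commands array, with the second one carrying `"content": "__PREVIOUS_OUTPUT__"`.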


What I Would Do Differently

Intent detection is still prompt-based, which works but is not the most reliable approach. A small fine-tuned classifier would be faster and more consistent for this specific task.

I would also add persistent memory across sessions using something like Chroma or FAISS. Right now the session history resets when you close the app.


Links

GitHub: https://github.com/ErrorAyushh/voice-ai-agent

Demo: https://youtu.be/t32kUIJhZK4


Built as part of the Mem0 AI Intern assignment
