Ritvika Mishra
Local Voice Controlled AI Agent

a voice controlled agent is a piece of software that sits on your machine, listens to what you say, and actually does things in response. not a chatbot that talks back — an agent that acts. it hears intent, decides what to do, and does the thing. speech goes in, the filesystem moves, apps open, files get written, screens get captured. the gap between a thought and a side effect on your computer shrinks to a sentence.

i decided to build one of my own. here's what it can do right now:

  • create files and folders, sandboxed to a dedicated output directory so nothing on the machine gets stomped
  • generate code from a spoken description — "write a python function that reverses a string and save it as reverse.py" — and pop the file straight open in VS Code so you can see what was made
  • open apps on the device — Preview, Spotify, Terminal, Firefox, Chrome, whatever's installed — aliases and fuzzy names handled
  • open websites in the browser, with or without a specific browser named; search Google when asked to "look something up online"
  • open local files by searching for them via macOS Spotlight — "open the DDIA pdf" finds it wherever it lives
  • take screenshots — full screen, a window you pick, or a region you drag — auto-opens in Preview
  • summarize local files by actually reading them (pypdf for PDFs, plain read for text) and feeding the contents to the model
  • general chat when you just want to ask it something and have it respond in words
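a couple of the items above map to single shell commands on macos — spotlight search is `mdfind` and screenshots are `screencapture`. a minimal sketch of how the handlers might wrap them (function names and the result-limiting helper are mine, not the project's; macos only):

```python
import subprocess

def top_results(stdout: str, limit: int) -> list[str]:
    # keep the first few non-empty lines of mdfind's output
    return [line for line in stdout.splitlines() if line][:limit]

def spotlight_search(query: str, limit: int = 5) -> list[str]:
    # `mdfind -name` does a spotlight filename search, e.g. query="DDIA"
    out = subprocess.run(["mdfind", "-name", query],
                         capture_output=True, text=True)
    return top_results(out.stdout, limit)

def screenshot(path: str, interactive: bool = False) -> None:
    # macos `screencapture`; -i lets you drag a region or pick a window
    cmd = ["screencapture"] + (["-i"] if interactive else []) + [path]
    subprocess.run(cmd)
```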

baby steps. nothing close to what i imagine the ideal version would look like — some hardware constraints and a few deliberate tradeoffs carved the scope down to what you see. the sections below are a slow ramble on how the thing got constructed: the questions that surfaced while building, the choices that were made, the moments where self-posed questions looped until something clicked.

i like thinking by asking myself questions and answering them and re-questioning the answers — repeating that until there's some satisfaction (which, there never really is).

implementation was the comparatively easy part (much love to my dear claudette!). the actually interesting work was in shaping the design: the thinking, the elaborate plan-making.


the shape of the thing

the whole system is a pipeline. five layers. audio comes in one end, actions come out the other, and every layer in between does exactly one job and hands its output to the next.

audio bytes
→ [speech-to-text] → text string
→ [intent classifier] → structured JSON plan
→ [orchestrator] → handler calls
→ [tool handlers] → side effects on the machine
→ [UI] → displays every stage back to you

each arrow is a contract. "i promise to give you something of this shape, and you promise to produce something of that shape. neither of us cares how the other works internally." those contracts are the
whole point. they're what lets the pipeline be swappable — pull out any single box, replace its implementation, and nothing else in the system notices.
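those contracts can be written down as types. a sketch of the shapes flowing between layers — the names here are illustrative, not the project's actual types:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    type: str                       # e.g. "create_file", "open_app"
    params: dict = field(default_factory=dict)

@dataclass
class Plan:
    actions: list[Action]

# each layer is a function from one shape to the next; none of them
# knows how its neighbors work internally
def transcribe(audio_path: str) -> str: ...       # speech-to-text
def classify(text: str) -> Plan: ...              # intent classifier
def execute(plan: Plan) -> list[dict]: ...        # orchestrator
```

swap any one function's implementation and, as long as the shapes hold, nothing else notices.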

now layer by layer.

1. audio input

input capture. the browser already knows how to do this — a microphone button captures audio, an upload field takes a file. the web ui hands off bytes. no custom audio code to write. running in a
browser means the browser solves threading, permissions, and buffering — problems that don't belong to the agent.

2. speech-to-text

bytes become words. a whisper model runs locally (faster-whisper, specifically — the CTranslate2-backed variant, which avoids pulling in a heavy deep-learning framework just to run inference). input:
audio path. output: a string. stateless. deterministic-ish. no memory across calls. this layer is essentially a pure function.
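the whole layer fits in a few lines. a sketch using faster-whisper (`pip install faster-whisper`) — model size, device, and compute type here are assumptions, tune for your machine; the import is lazy so the model only loads when the function is actually called:

```python
def join_segments(segments) -> str:
    # whisper yields segments, each carrying a .text field;
    # glue them into one transcript string
    return " ".join(seg.text.strip() for seg in segments)

def transcribe(audio_path: str, model_size: str = "base") -> str:
    from faster_whisper import WhisperModel  # pip install faster-whisper
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path)
    return join_segments(segments)
```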

3. intent classifier

the only layer where real intelligence lives. a local llm reads the transcribed text and produces a structured JSON plan — a list of actions with parameters. this is the hard layer. natural language is
infinite in its phrasings; the output has to be a small, bounded, structured thing. the llm is doing compression here — from an unbounded input space to a finite one.

the classifier does not execute anything. it does not touch the filesystem. it does not know how any of the tools work internally. it only describes what the user wants in a format the rest of the system can read. the understanding of language is cleanly separated from the taking of action.
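on the system side of that boundary, the llm's raw output still has to be validated down to the finite tool set before anything runs. a sketch of what that could look like (the json shape and the fallback-to-chat rule are my assumptions; the tool names mirror the registry below):

```python
import json

TOOLS = {"create_file", "write_code", "summarize",
         "open_app", "screenshot", "general_chat"}

def parse_plan(raw: str) -> list[dict]:
    # the llm is asked to emit json like:
    #   {"actions": [{"type": "write_code", "params": {...}}]}
    # anything outside the known tool set degrades to general_chat
    plan = json.loads(raw)
    actions = []
    for action in plan.get("actions", []):
        if action.get("type") in TOOLS:
            actions.append(action)
        else:
            actions.append({"type": "general_chat",
                            "params": action.get("params", {})})
    return actions
```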

4. orchestrator

the interface between intent and action. the layer that takes the plan and makes it happen. zero intelligence here. no llm calls. the orchestrator is a table lookup:

  REGISTRY = {
      "create_file":  handle_filesystem,
      "write_code":   handle_llm_generate,
      "summarize":    handle_llm_generate,
      "open_app":     handle_open_app,
      "screenshot":   handle_screenshot,
      "general_chat": handle_llm_generate,
  }

action type goes in, handler function comes out. call it. collect the result. move to the next action. when actions depend on each other (generate code before saving it to a file), a tiny set of
ordering rules reshuffles the list. that's the entire orchestrator.

this is called a registry pattern. dispatch tables instead of if/elif chains. the benefit is quiet but enormous — adding a new capability is one entry in the table. the orchestrator itself never changes when the system grows.
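the whole execution loop, sketched. `PRIORITY` is a hypothetical version of the "tiny set of ordering rules" — generation runs before the file write that depends on it:

```python
# actions with priority 0 run first; everything else keeps its order
PRIORITY = {"write_code": 0, "summarize": 0}

def run_plan(actions: list[dict], registry: dict) -> list:
    # reorder, look up, call, collect. no llm calls, no intelligence.
    ordered = sorted(actions, key=lambda a: PRIORITY.get(a["type"], 1))
    results = []
    for action in ordered:
        handler = registry[action["type"]]   # table lookup, nothing more
        results.append(handler(action.get("params", {})))
    return results
```

`sorted` is stable, so actions at the same priority keep the order the classifier gave them.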

5. the UI

a thin layer. reads the results of the pipeline and lays them out: transcription, detected intent, actions taken, final output, per-stage timing, session history. the ui has no opinions about what
anything means — it just formats whatever the pipeline produces.


intents, tools, handlers — three words, three meanings

this distinction unlocks a lot of the design:

  • intent: what the user wants. user-facing. effectively infinite — every new phrasing is a new intent.
  • tool (or capability): what the system can do. system-facing. finite. small.
  • handler: the code that implements a tool. one python function.

a single intent might need multiple tools. many intents collapse onto a single tool. the classifier is what maps from the infinite space of possible intents to the finite space of tools. without that
mapping layer, you'd have to write a handler for every way a user could phrase something — which is impossible, because the set of phrasings is unbounded.
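the shape of that many-to-few mapping, illustrated as data (the mapping itself is the llm's job, not a lookup table — these examples are mine):

```python
# many phrasings collapse onto one tool; one phrasing can
# fan out into several tools that run in sequence
examples = {
    "open spotify":                         [{"type": "open_app"}],
    "can you launch spotify for me":        [{"type": "open_app"}],
    "write a reverse function and save it": [{"type": "write_code"},
                                             {"type": "create_file"}],
}
```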

compress where the complexity actually is. language is complex; keep it in one layer. execution is simple; keep it in another. don't spread intelligence across the system.


what makes a design good

a few questions worth asking any system you build:

  • can a new capability be added without touching existing code?
  • is each capability isolated — can one be removed without breaking others?
  • can any implementation be swapped without cascading changes through the rest of the system?
  • is there a single place where a given concern lives — not smeared across multiple layers?
  • do the boundaries between layers stay clean — no internals of one layer leaking into another?

these aren't independent. they compound. clean interfaces enable isolation. isolation enables swapping. swapping enables growth. a system that answers "yes" to all five can be modified without a
rewrite, and that property — the cost of change — is the actual test of whether the design is doing its job.

the opposite property has a name too: abstraction leakage. when one layer has to know the internals of another to function, the boundary has been violated. every violation makes the next change a little
harder. accumulate enough of them and you can't reason about any piece in isolation — the whole system becomes one tangled thing, and adding a capability means understanding all of it.

thanks for reading ^^
