A detailed walkthrough of architecture, safety constraints, and lessons learned.
Most assistants stop at answering in a chat box. I wanted something clumsier and more honest: you speak, the machine writes files and runs tools on your laptop—and you can see every step. That sounds simple until you remember that speech is messy, models hallucinate structure, and giving an LLM access to your filesystem without guardrails is a bad idea.
This piece walks through how I built exactly that: a voice-controlled agent in Python with Groq (Whisper for speech, a small Llama-class model for reasoning), LangGraph to keep the pipeline explicit, and Streamlit as the front door. Everything that touches disk stays inside a single output/ folder—by design, not by hope.
The problem with “just use voice”
Typed UIs hide nothing: every character is yours. Voice is different. Audio must become text; text must become intent; intent must become something the computer can execute without wiping the wrong directory. Commercial assistants solve this in the cloud with closed stacks. I wanted a transparent pipeline: transcription on screen, intent on screen, the action taken on screen, and the model’s answer or the path of the file it wrote.
The other constraint was hardware. Running a large speech model and a 70B-parameter LLM locally is not realistic on an everyday laptop. Groq’s APIs became the pragmatic choice: hosted Whisper-class transcription and fast chat inference so the project stays about architecture, not about renting a GPU for a weekend.
What the system actually does
You record audio or upload a clip. The app sends it to Groq Whisper (whisper-large-v3-turbo in this build). The transcript goes to an LLM—not to chat freely at first, but to classify what you meant: create a file, write code, summarize (or generate a short article when there is no long passage), or fall back to general conversation. LangGraph implements that as a small state machine: transcribe, classify, optionally pause for human approval if the next step would write to disk, then execute the right tool. The UI shows the transcript, the label for the intent, what the system did, and the final text or file outcome.
Under the hood the chat model defaults to llama-3.1-8b-instant—fast and broadly available on Groq. You can point GROQ_LLM_MODEL at something heavier if your account supports it.
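That override is a one-line pattern; a minimal sketch (the variable name GROQ_LLM_MODEL is from the article, the helper name is mine):

```python
import os

# Default chat model: fast and broadly available on Groq.
DEFAULT_MODEL = "llama-3.1-8b-instant"

def resolve_model() -> str:
    # GROQ_LLM_MODEL, when set, points the agent at a heavier model.
    return os.getenv("GROQ_LLM_MODEL", DEFAULT_MODEL)
```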
A mental model of the pipeline
Think of data moving in one direction: sound → text → structured intent → tools → feedback. Nothing here is magic: each stage is ordinary Python on the other side of an HTTP call. The point of spelling the pipeline out is to show where trust enters the system—at the tool boundary, not inside the microphone.
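The flow can be sketched as ordinary Python stubs (all names are hypothetical; the real steps call Groq over HTTP):

```python
# Stand-ins for the real HTTP-backed steps, showing the one-directional
# flow: sound -> text -> intent -> tool -> feedback.
def transcribe(audio: bytes) -> str:
    return "create a file called notes.txt"   # stub for Groq Whisper

def classify(transcript: str) -> str:
    return "create_file" if "file" in transcript else "general_chat"  # stub for the LLM

def run_tool(intent: str, transcript: str) -> str:
    return f"ran {intent}"                    # stub for the tools layer

def pipeline(audio: bytes) -> dict:
    transcript = transcribe(audio)
    intent = classify(transcript)
    result = run_tool(intent, transcript)
    # Everything the UI shows comes from this one dictionary.
    return {"transcript": transcript, "intent": intent, "result": result}
```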
Why LangGraph, not a single giant prompt
It is tempting to stuff “transcribe, decide, act” into one mega-prompt. That becomes impossible to test and painful to debug. LangGraph models the agent as nodes and edges over a typed AgentState: audio payload, transcript, intent, details the model extracted (filename, language, free text), flags for human-in-the-loop, and the strings you show the user at the end.
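A typed state along those lines might look like this (field names are illustrative, not the repository's exact schema):

```python
from typing import TypedDict

class AgentState(TypedDict, total=False):
    audio_bytes: bytes
    transcript: str
    intent: str
    details: dict          # filename, language, free text extracted by the LLM
    needs_approval: bool   # human-in-the-loop flag
    approved: bool
    result_text: str       # what the UI shows at the end
    result_path: str       # path of a written file, if any
```

Every node reads and writes this one structure, which is what makes each edge in the graph testable in isolation.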
Classification can return a single intent or a compound list—e.g. “summarize this and save it to summary.txt” becomes two steps in order. The important implementation detail: when the first step produces text, the second step that creates a file must receive that text as content, or you get an empty file and a disappointed user. Wiring that through the tools layer was less glamorous than drawing graphs, but it is what made compound commands feel real.
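That chaining rule can be sketched in a few lines (a simplification under my own naming; `tools` maps intent labels to callables):

```python
def run_steps(steps, tools):
    """Run compound intents in order, carrying text between steps."""
    carried_text = None
    for step in steps:
        intent = step["intent"]
        details = dict(step.get("details", {}))
        # If this step writes a file but the user gave no body,
        # use the previous step's output as the content.
        if intent == "create_file" and not details.get("content"):
            details["content"] = carried_text or ""
        carried_text = tools[intent](details)
    return carried_text
```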
Tools, intents, and the sandbox
The LLM never gets to “run Bash.” It only suggests structured actions that Python code interprets. Create file and write code touch the filesystem; summarize may compress a long passage or, if you only gave a short topic, generate a small Markdown article instead of apologizing for empty input. General chat covers everything else.
Paths are resolved with pathlib, and every path is checked to stay under output/. Traversal tricks and silly filenames get rejected before anything is written. Secrets live in .env, not in the article and not in git.
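The sandbox check is the kind of thing that fits in one function. A sketch of the idea (the repository's exact validation may differ):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve a requested filename and refuse anything outside output/."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # Path.relative_to raises ValueError when candidate escapes OUTPUT_DIR,
    # which catches ../ traversal and absolute paths alike.
    try:
        candidate.relative_to(OUTPUT_DIR)
    except ValueError:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}: {filename}")
    return candidate
```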
If you enable confirmation in the UI, create_file and write_code stop at a gate: you approve or cancel, and only then does the graph run the destructive half without re-transcribing. Session memory keeps a short rolling history in Streamlit state so follow-up utterances are not totally amnesiac—enough for a demo, not a database.
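The routing decision behind that gate is small. A sketch under my own names (the article names create_file and write_code as the destructive intents):

```python
DESTRUCTIVE_INTENTS = {"create_file", "write_code"}

def next_node(state: dict) -> str:
    """Destructive intents pause at an approval gate when confirmation
    is enabled; everything else executes directly."""
    if state["intent"] in DESTRUCTIVE_INTENTS and state.get("confirm_enabled"):
        if not state.get("approved"):
            return "await_approval"
    return "execute_tool"
```

Because approval is just a flag in state, the graph can resume at the execute step without re-transcribing the audio.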
When things go wrong
Networks fail. Models return malformed JSON. Audio is silence. The service layer retries with backoff; the classifier falls back to general_chat when JSON parsing fails; the UI shows a short message instead of a traceback. That is not exciting to list, but it is the difference between a prototype and something you dare to show in a screen recording.
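Both fallbacks fit in one function. A sketch of the pattern (call_llm is a hypothetical client; the repository's retry policy may differ):

```python
import json
import time

def classify_with_fallback(call_llm, transcript: str, retries: int = 3) -> str:
    """Retry network failures with exponential backoff; degrade to
    general_chat when the model returns malformed JSON."""
    for attempt in range(retries):
        try:
            raw = call_llm(transcript)
            return json.loads(raw).get("intent", "general_chat")
        except json.JSONDecodeError:
            return "general_chat"            # malformed JSON: degrade, don't crash
        except ConnectionError:
            time.sleep(0.5 * 2 ** attempt)   # 0.5s, 1s, 2s backoff
    return "general_chat"                    # out of retries: safe default
```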
A word on speed
I ran a tiny script against the same stack: short LLM calls averaged on the order of a quarter of a second after warm-up; Whisper on a ~800 KB WAV file sat around one second median over three runs. Those numbers are mine, on my network, on one day—not a universal benchmark. They are enough to say: for interactive use, latency feels closer to “app” than “batch job,” which matters when you are speaking instead of typing.
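The measurement itself was nothing fancier than wall-clocking a callable a few times and taking the median. Roughly (my harness, simplified):

```python
import statistics
import time

def median_latency(fn, runs: int = 3) -> float:
    """Wall-clock fn a few times and return the median in seconds.
    A crude probe, not a benchmark."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()   # e.g. one short LLM call or one Whisper transcription
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```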
What I struggled with
The model does not always label the user’s goal the way a human would. “Write an article about AI” might be classified as a summarization request that expects a long pasted passage—so the pipeline had to learn topic-style generation for short inputs, and chaining between steps so that “save to file” actually receives the generated body.
Streamlit taught me a smaller lesson: never return a widget from a ternary expression and let the result leak into the layout—use plain if / else. That kind of bug looks like random garbage on the page and is hard to explain to anyone watching your demo.
Try it yourself
Clone the repo, create a virtual environment, copy .env.example to .env, add GROQ_API_KEY, then streamlit run app.py. The repo is built for clarity over cleverness: services/ holds the Groq clients, agent/ holds the graph and prompts, tools/ holds the side effects.
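In full, assuming a standard requirements.txt at the repo root (check the README if the layout differs):

```shell
git clone https://github.com/shanttoosh/voice-controlled-ai-agent
cd voice-controlled-ai-agent
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env        # then add GROQ_API_KEY to .env
streamlit run app.py
```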
Closing
Voice-controlled agents are not only about accuracy; they are about visibility. If the user cannot see transcription, intent, and action, you have built a black box with a microphone. This project was an exercise in keeping the box open—and the filesystem narrowed to a single folder—while still relying on capable models I did not have to host myself.
Repository: https://github.com/shanttoosh/voice-controlled-ai-agent
Author: Shanttoosh