<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shanttoosh</title>
    <description>The latest articles on DEV Community by Shanttoosh (@shanttoosh_v).</description>
    <link>https://dev.to/shanttoosh_v</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879393%2F7f44bf58-1d67-4963-8aeb-5ad14ccbaffd.jpg</url>
      <title>DEV Community: Shanttoosh</title>
      <link>https://dev.to/shanttoosh_v</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shanttoosh_v"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled AI Agent with Groq, LangGraph, and Streamlit</title>
      <dc:creator>Shanttoosh</dc:creator>
      <pubDate>Tue, 14 Apr 2026 23:16:53 +0000</pubDate>
      <link>https://dev.to/shanttoosh_v/building-a-voice-controlled-ai-agent-with-groq-langgraph-and-streamlit-289g</link>
      <guid>https://dev.to/shanttoosh_v/building-a-voice-controlled-ai-agent-with-groq-langgraph-and-streamlit-289g</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed walkthrough of architecture, safety constraints, and lessons learned.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most assistants stop at answering in a chat box. I wanted something clumsier and more honest: you &lt;strong&gt;speak&lt;/strong&gt;, the machine &lt;strong&gt;writes files&lt;/strong&gt; and &lt;strong&gt;runs tools&lt;/strong&gt; on your laptop—and you can &lt;strong&gt;see&lt;/strong&gt; every step. That sounds simple until you remember that speech is messy, models hallucinate structure, and giving an LLM access to your filesystem without guardrails is a bad idea.&lt;/p&gt;

&lt;p&gt;This piece walks through how I built exactly that: a &lt;strong&gt;voice-controlled agent&lt;/strong&gt; in Python with &lt;strong&gt;Groq&lt;/strong&gt; (Whisper for speech, a small &lt;strong&gt;Llama&lt;/strong&gt;-class model for reasoning), &lt;strong&gt;LangGraph&lt;/strong&gt; to keep the pipeline explicit, and &lt;strong&gt;Streamlit&lt;/strong&gt; as the front door. Everything that touches disk stays inside a single &lt;strong&gt;&lt;code&gt;output/&lt;/code&gt;&lt;/strong&gt; folder—by design, not by hope.&lt;/p&gt;




&lt;h2&gt;The problem with “just use voice”&lt;/h2&gt;

&lt;p&gt;Typed UIs hide nothing: every character is yours. Voice is different. Audio must become text; text must become &lt;strong&gt;intent&lt;/strong&gt;; intent must become &lt;strong&gt;something the computer can execute&lt;/strong&gt; without wiping the wrong directory. Commercial assistants solve this in the cloud with closed stacks. I wanted a &lt;strong&gt;transparent&lt;/strong&gt; pipeline: transcription on screen, intent on screen, the action taken on screen, and the model’s answer or the path of the file it wrote.&lt;/p&gt;

&lt;p&gt;The other constraint was hardware. Running a large speech model and a 70B-parameter LLM locally is not realistic on an everyday laptop. &lt;strong&gt;Groq’s APIs&lt;/strong&gt; became the pragmatic choice: hosted &lt;strong&gt;Whisper-class&lt;/strong&gt; transcription and fast &lt;strong&gt;chat&lt;/strong&gt; inference so the project stays about &lt;strong&gt;architecture&lt;/strong&gt;, not about renting a GPU for a weekend.&lt;/p&gt;




&lt;h2&gt;What the system actually does&lt;/h2&gt;

&lt;p&gt;You record audio or upload a clip. The app sends it to &lt;strong&gt;Groq Whisper&lt;/strong&gt; (&lt;code&gt;whisper-large-v3-turbo&lt;/code&gt; in this build). The transcript goes to an LLM—not to chat freely at first, but to &lt;strong&gt;classify&lt;/strong&gt; what you meant: create a file, write code, summarize (or generate a short article when there is no long passage), or fall back to general conversation. &lt;strong&gt;LangGraph&lt;/strong&gt; implements that as a small state machine: transcribe, classify, optionally pause for &lt;strong&gt;human approval&lt;/strong&gt; if the next step would write to disk, then execute the right &lt;strong&gt;tool&lt;/strong&gt;. The UI shows the transcript, the label for the intent, what the system did, and the final text or file outcome.&lt;/p&gt;
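
&lt;p&gt;As a rough sketch (not the repo’s actual LangGraph code; the node names and the keyword matching are made up for illustration), the control flow reduces to this:&lt;/p&gt;

```python
# Illustrative stub of the graph's control flow, NOT the real pipeline:
# each node reads and updates a shared state dict, and file-writing
# intents pause for approval before any tool runs.

def transcribe(state):
    # Real build: Groq Whisper (whisper-large-v3-turbo). Stubbed here.
    state["transcript"] = state["audio"].strip().lower()
    return state

def classify(state):
    # Real build: an LLM returns a structured intent label.
    text = state["transcript"]
    if "file" in text or "save" in text:
        state["intent"] = "create_file"
    elif "code" in text:
        state["intent"] = "write_code"
    else:
        state["intent"] = "general_chat"
    return state

DESTRUCTIVE = {"create_file", "write_code"}

def run_pipeline(audio, approved=True):
    state = {"audio": audio}
    state = transcribe(state)
    state = classify(state)
    if state["intent"] in DESTRUCTIVE and not approved:
        state["result"] = "awaiting approval"
        return state
    state["result"] = f"ran tool for intent: {state['intent']}"
    return state
```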

&lt;p&gt;Under the hood the chat model defaults to &lt;strong&gt;&lt;code&gt;llama-3.1-8b-instant&lt;/code&gt;&lt;/strong&gt;—fast and broadly available on Groq. You can point &lt;code&gt;GROQ_LLM_MODEL&lt;/code&gt; at something heavier if your account supports it.&lt;/p&gt;
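
&lt;p&gt;The model override is ordinary environment plumbing. A minimal sketch of that lookup, assuming the variable name from above:&lt;/p&gt;

```python
import os

# Resolve the chat model once at startup: GROQ_LLM_MODEL overrides the
# default, so a heavier model is a config change, not a code change.
DEFAULT_MODEL = "llama-3.1-8b-instant"

def resolve_model(env=None):
    env = env if env is not None else os.environ
    return env.get("GROQ_LLM_MODEL", DEFAULT_MODEL)
```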




&lt;h2&gt;A mental model of the pipeline&lt;/h2&gt;

&lt;p&gt;Think of data moving in one direction: &lt;strong&gt;sound → text → structured intent → tools → feedback&lt;/strong&gt;. The diagram below is the same story in boxes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fshanttoosh%2Fvoice-controlled-ai-agent%2Fmain%2FHighlevel_Architecture.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fshanttoosh%2Fvoice-controlled-ai-agent%2Fmain%2FHighlevel_Architecture.png" alt="High-level architecture" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nothing here is magic: each rectangle is ordinary Python on the other side of an HTTP call. The point of drawing it is to show where &lt;strong&gt;trust&lt;/strong&gt; enters the system—at the tool boundary, not inside the microphone.&lt;/p&gt;




&lt;h2&gt;Why LangGraph, not a single giant prompt&lt;/h2&gt;

&lt;p&gt;It is tempting to stuff “transcribe, decide, act” into one mega-prompt. That becomes impossible to test and painful to debug. &lt;strong&gt;LangGraph&lt;/strong&gt; models the agent as &lt;strong&gt;nodes&lt;/strong&gt; and &lt;strong&gt;edges&lt;/strong&gt; over a typed &lt;strong&gt;&lt;code&gt;AgentState&lt;/code&gt;&lt;/strong&gt;: audio payload, transcript, intent, details the model extracted (filename, language, free text), flags for &lt;strong&gt;human-in-the-loop&lt;/strong&gt;, and the strings you show the user at the end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1cz5w8g0j1e9ry3xe5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1cz5w8g0j1e9ry3xe5a.png" alt="Workflow — LangGraph steps" width="800" height="1305"&gt;&lt;/a&gt;&lt;/p&gt;
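
&lt;p&gt;A sketch of what such a state could look like with &lt;code&gt;TypedDict&lt;/code&gt;; the field names are my paraphrase of the categories above, not the repo’s exact schema:&lt;/p&gt;

```python
from typing import List, Optional, TypedDict

# Sketch of the shared state the graph's nodes read and write.
# Field names are illustrative; the article only names the categories.
class AgentState(TypedDict, total=False):
    audio_bytes: bytes        # raw recording or upload
    transcript: str           # Whisper output
    intent: str               # classifier label, e.g. "create_file"
    filename: Optional[str]   # details the model extracted
    language: Optional[str]
    free_text: Optional[str]
    needs_approval: bool      # human-in-the-loop gate
    approved: bool
    result: str               # strings shown to the user at the end
    steps: List[str]          # compound-intent plan, in order
```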

&lt;p&gt;Classification can return a &lt;strong&gt;single&lt;/strong&gt; intent or a &lt;strong&gt;compound&lt;/strong&gt; list—e.g. “summarize this and save it to &lt;code&gt;summary.txt&lt;/code&gt;” becomes two steps in order. The important implementation detail: when the first step produces text, the second step that &lt;strong&gt;creates a file&lt;/strong&gt; must receive that text as &lt;strong&gt;content&lt;/strong&gt;, or you get an empty file and a disappointed user. Wiring that through the tools layer was less glamorous than drawing graphs, but it is what made compound commands feel real.&lt;/p&gt;
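
&lt;p&gt;The chaining can be sketched like this, with hypothetical function names standing in for the real tools:&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Sketch of compound-intent chaining: each step returns text, and the
# next step receives it as content. Names are illustrative, not the repo's.
def summarize(text):
    return "SUMMARY: " + text[:40]

def create_file(filename, prior=None):
    # The crucial wiring: the file body is the PREVIOUS step's output.
    content = prior if prior else ""
    path = Path(tempfile.mkdtemp()) / filename
    path.write_text(content, encoding="utf-8")
    return str(path)

def run_steps(steps, text):
    carried = text
    outputs = []
    for step in steps:
        if step["intent"] == "summarize":
            carried = summarize(carried)
        elif step["intent"] == "create_file":
            outputs.append(create_file(step["filename"], prior=carried))
    return carried, outputs
```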




&lt;h2&gt;Tools, intents, and the sandbox&lt;/h2&gt;

&lt;p&gt;The LLM never gets to “run Bash.” It only suggests &lt;strong&gt;structured&lt;/strong&gt; actions that Python code interprets. &lt;strong&gt;Create file&lt;/strong&gt; and &lt;strong&gt;write code&lt;/strong&gt; touch the filesystem; &lt;strong&gt;summarize&lt;/strong&gt; may compress a long passage or, if you only gave a short topic, &lt;strong&gt;generate&lt;/strong&gt; a small Markdown article instead of apologizing for empty input. &lt;strong&gt;General chat&lt;/strong&gt; covers everything else.&lt;/p&gt;
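
&lt;p&gt;That boundary is essentially a dispatch table over an allow-list. A minimal sketch (the handler names and the 30-word passage threshold are illustrative):&lt;/p&gt;

```python
# Sketch of the tool boundary: the model proposes a dict like
# {"intent": "...", "text": "..."}; Python only dispatches on a fixed
# allow-list, and unknown intents fall back to chat instead of executing.

def tool_general_chat(details):
    return "chat: " + details.get("text", "")

def tool_summarize(details):
    text = details.get("text", "")
    words = len(text.split())
    if words in range(0, 30):   # short topic, not a long passage
        return "generated article on: " + text
    return "summary of passage"

TOOLS = {
    "general_chat": tool_general_chat,
    "summarize": tool_summarize,
}

def execute(action):
    handler = TOOLS.get(action.get("intent"), tool_general_chat)
    return handler(action)
```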

&lt;p&gt;Paths are resolved with &lt;strong&gt;&lt;code&gt;pathlib&lt;/code&gt;&lt;/strong&gt;, and every path is checked to stay under &lt;strong&gt;&lt;code&gt;output/&lt;/code&gt;&lt;/strong&gt;. Traversal tricks and silly filenames get rejected before anything is written. Secrets live in &lt;strong&gt;&lt;code&gt;.env&lt;/code&gt;&lt;/strong&gt;, not in the article and not in git.&lt;/p&gt;
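
&lt;p&gt;The containment check itself is short. A sketch of the idea, assuming Python 3.9+ for &lt;code&gt;Path.is_relative_to&lt;/code&gt;:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename):
    # Resolve symlinks and ".." BEFORE the containment check, then
    # require the result to stay under output/. Raises on traversal.
    candidate = (OUTPUT_DIR / filename).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):   # Python 3.9+
        raise ValueError(f"refusing to write outside output/: {filename}")
    return candidate
```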

&lt;p&gt;If you enable confirmation in the UI, &lt;strong&gt;create_file&lt;/strong&gt; and &lt;strong&gt;write_code&lt;/strong&gt; stop at a gate: you approve or cancel, and only then does the graph run the destructive half without re-transcribing. Session &lt;strong&gt;memory&lt;/strong&gt; keeps a short rolling history in Streamlit state so follow-up utterances are not totally amnesiac—enough for a demo, not a database.&lt;/p&gt;
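
&lt;p&gt;The gate is a park-and-resume pattern: stash the planned action, run it only on approval. A toy sketch with a plain dict standing in for Streamlit session state:&lt;/p&gt;

```python
# Sketch of the approval gate: the classified action is parked in
# session state; "approve" executes it as-is, "cancel" drops it.
# Nothing is re-transcribed or re-classified on resume.
session = {"pending": None, "log": []}

def request(action):
    if action["intent"] in {"create_file", "write_code"}:
        session["pending"] = action
        return "waiting for approval"
    return run(action)

def run(action):
    session["log"].append(action["intent"])
    return "done: " + action["intent"]

def approve():
    action, session["pending"] = session["pending"], None
    return run(action) if action else "nothing pending"

def cancel():
    session["pending"] = None
    return "cancelled"
```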




&lt;h2&gt;When things go wrong&lt;/h2&gt;

&lt;p&gt;Networks fail. Models return malformed JSON. Audio is silence. The service layer retries with backoff; the classifier falls back to &lt;strong&gt;general_chat&lt;/strong&gt; when JSON parsing fails; the UI shows a short message instead of a traceback. That is not exciting to list, but it is the difference between a prototype and something you dare to show in a screen recording.&lt;/p&gt;
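
&lt;p&gt;Both fallbacks fit in a few lines. A sketch of the pattern (the exception types and retry counts are illustrative, not the repo’s):&lt;/p&gt;

```python
import json
import time

# Two failure handlers: exponential backoff around the API call, and a
# general_chat fallback when the classifier's JSON does not parse.
def with_retries(call, attempts=3, base_delay=0.1):
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

def parse_intent(raw):
    try:
        data = json.loads(raw)
        return data.get("intent", "general_chat")
    except (json.JSONDecodeError, AttributeError):
        return "general_chat"
```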




&lt;h2&gt;A word on speed&lt;/h2&gt;

&lt;p&gt;I ran a tiny script against the same stack: short LLM calls averaged on the order of &lt;strong&gt;a quarter of a second&lt;/strong&gt; after warm-up; &lt;strong&gt;Whisper&lt;/strong&gt; on a &lt;strong&gt;~800 KB&lt;/strong&gt; WAV file sat around &lt;strong&gt;one second&lt;/strong&gt; median over three runs. Those numbers are mine, on my network, on one day—not a universal benchmark. They are enough to say: for interactive use, &lt;strong&gt;latency feels closer to “app” than “batch job,”&lt;/strong&gt; which matters when you are speaking instead of typing.&lt;/p&gt;
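
&lt;p&gt;For reference, the shape of such a measurement (the harness, not my numbers):&lt;/p&gt;

```python
import statistics
import time

# Time the same call a few times after a warm-up run and report the
# median, which resists one slow network hiccup better than the mean.
def measure(call, runs=3):
    call()  # warm-up, excluded from the samples
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```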




&lt;h2&gt;What I struggled with&lt;/h2&gt;

&lt;p&gt;The model does not always label the user’s goal the way a human would. “Write an article about AI” might get classified in a way that expects &lt;strong&gt;summary&lt;/strong&gt; of pasted text—so the pipeline had to learn &lt;strong&gt;topic-style generation&lt;/strong&gt; when the input is not a long passage, and &lt;strong&gt;chaining&lt;/strong&gt; between steps so “save to file” actually receives the generated body.&lt;/p&gt;

&lt;p&gt;Streamlit taught me a smaller lesson: &lt;strong&gt;never&lt;/strong&gt; return a widget from a ternary expression and let the result leak into the layout—use plain &lt;code&gt;if&lt;/code&gt; / &lt;code&gt;else&lt;/code&gt;. That kind of bug looks like random garbage on the page and is hard to explain to anyone watching your demo.&lt;/p&gt;




&lt;h2&gt;Try it yourself&lt;/h2&gt;

&lt;p&gt;Clone the repo, create a virtual environment, install the dependencies, copy &lt;code&gt;.env.example&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt;, add &lt;strong&gt;&lt;code&gt;GROQ_API_KEY&lt;/code&gt;&lt;/strong&gt;, then &lt;code&gt;streamlit run app.py&lt;/code&gt;. The repo is built for clarity over cleverness: &lt;code&gt;services/&lt;/code&gt; holds the Groq clients, &lt;code&gt;agent/&lt;/code&gt; holds the graph and prompts, &lt;code&gt;tools/&lt;/code&gt; holds the side effects.&lt;/p&gt;
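
&lt;p&gt;The same steps as commands; &lt;code&gt;requirements.txt&lt;/code&gt; is the usual convention and may differ in the repo:&lt;/p&gt;

```shell
# Assumed setup, following the article's steps.
git clone https://github.com/shanttoosh/voice-controlled-ai-agent
cd voice-controlled-ai-agent
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env             # then add GROQ_API_KEY=... inside
streamlit run app.py
```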




&lt;h2&gt;Closing&lt;/h2&gt;

&lt;p&gt;Voice-controlled agents are not only about &lt;strong&gt;accuracy&lt;/strong&gt;; they are about &lt;strong&gt;visibility&lt;/strong&gt;. If the user cannot see transcription, intent, and action, you have built a black box with a microphone. This project was an exercise in keeping the box &lt;strong&gt;open&lt;/strong&gt;—and the filesystem &lt;strong&gt;narrowed&lt;/strong&gt; to a single folder—while still relying on capable models I did not have to host myself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;em&gt;&lt;a href="https://github.com/shanttoosh/voice-controlled-ai-agent" rel="noopener noreferrer"&gt;https://github.com/shanttoosh/voice-controlled-ai-agent&lt;/a&gt;&lt;/em&gt;  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;em&gt;Shanttoosh&lt;/em&gt;&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>whisper</category>
      <category>agents</category>
      <category>langchain</category>
    </item>
  </channel>
</rss>
