<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AKSHAT SAXENA</title>
    <description>The latest articles on DEV Community by AKSHAT SAXENA (@akshat_saxena_53bee826693).</description>
    <link>https://dev.to/akshat_saxena_53bee826693</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3715160%2F2745e9b8-572c-4f94-8eec-b64ab6af8d56.jpg</url>
      <title>DEV Community: AKSHAT SAXENA</title>
      <link>https://dev.to/akshat_saxena_53bee826693</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akshat_saxena_53bee826693"/>
    <language>en</language>
    <item>
      <title>🎙️ Building a Local Voice-Controlled AI Agent</title>
      <dc:creator>AKSHAT SAXENA</dc:creator>
      <pubDate>Wed, 15 Apr 2026 08:30:55 +0000</pubDate>
      <link>https://dev.to/akshat_saxena_53bee826693/building-a-local-voice-controlled-ai-agent-4967</link>
      <guid>https://dev.to/akshat_saxena_53bee826693/building-a-local-voice-controlled-ai-agent-4967</guid>
      <description>&lt;p&gt;&lt;strong&gt;I Built a Voice-Controlled AI Agent That Runs Locally — Here's Everything I Learned&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From raw audio to code written on your disk — the architecture, the model choices, and the parts that nearly broke me.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There's a particular kind of frustration that comes from building something that &lt;em&gt;should&lt;/em&gt; work in theory but keeps surprising you in practice. That's the best way I can describe the two days I spent building a voice-controlled AI agent from scratch — one that listens to what you say, figures out what you want, and actually does it. Creates files. Writes code. Summarises documents. Answers questions. All from a single spoken sentence.&lt;/p&gt;

&lt;p&gt;This isn't a tutorial about stringing together three API calls and calling it a day. This is the real story — the architecture decisions, the model tradeoffs, the bugs that made me laugh, and the one problem that took me six hours to solve because I was looking in entirely the wrong place.&lt;/p&gt;

&lt;p&gt;Let's get into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Was Actually Trying to Build
&lt;/h2&gt;

&lt;p&gt;The goal was straightforward on paper: an agent that accepts voice input (either from a microphone or an uploaded audio file), converts it to text, classifies the user's intent, and then executes the right action on the local filesystem — all displayed through a clean web UI.&lt;/p&gt;

&lt;p&gt;The key word is &lt;em&gt;local&lt;/em&gt;. I wanted this to run on a normal laptop, without sending everything to the cloud, without requiring a GPU, and without needing to pay per token just to rename a file. The final stack supports four different LLM backends (Gemini, OpenAI-compatible endpoints, Groq, and Ollama) and two STT backends (local Whisper and Groq's hosted API), so you can tune the privacy/cost/latency tradeoff to whatever your machine and budget allow.&lt;/p&gt;

&lt;p&gt;The pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input → STT → Intent Classification → Tool Dispatch → UI Output
                                   ↓
                          Session Memory (sidebar)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough. Except nothing about implementing it was.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture, Layer by Layer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Layer 1: Getting Audio In
&lt;/h3&gt;

&lt;p&gt;I started with the easiest-looking problem: accepting audio. Streamlit doesn't ship with a microphone component out of the box, so I used &lt;code&gt;streamlit-mic-recorder&lt;/code&gt;, a small community package that wraps the browser's MediaRecorder API. It returns raw WAV bytes, which is exactly what Whisper wants.&lt;/p&gt;

&lt;p&gt;For uploaded files, Streamlit's native &lt;code&gt;file_uploader&lt;/code&gt; handles WAV, MP3, OGG, and M4A just fine. The only thing I had to be careful about was preserving the file extension as a hint to downstream processors — Whisper handles format detection internally, but Groq's STT API needs the filename to include the right extension so it knows what codec to expect.&lt;/p&gt;

&lt;p&gt;One small thing that tripped me up: &lt;code&gt;streamlit-mic-recorder&lt;/code&gt; returns audio in its own dictionary format (&lt;code&gt;recording["bytes"]&lt;/code&gt;), not as a plain bytes object. Reading the source code for two hours before noticing this in the docs felt like a very specific kind of stupidity that I suspect I'm not alone in experiencing.&lt;/p&gt;
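&lt;p&gt;A tiny guard makes that shape difference explicit. This is an illustrative sketch (the helper name is mine, not part of the package's API):&lt;/p&gt;

```python
# streamlit-mic-recorder hands back a dict, not raw bytes.
# Hypothetical helper that normalises either shape to plain bytes.
def audio_bytes_from_recording(recording):
    if recording is None:
        return None  # user hasn't recorded anything yet
    if isinstance(recording, dict):
        return recording.get("bytes")  # the mic-recorder dict shape
    return recording  # already raw bytes, e.g. from file_uploader
```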

&lt;h3&gt;
  
  
  Layer 2: Speech-to-Text
&lt;/h3&gt;

&lt;p&gt;This is where the first serious tradeoff lives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Whisper (local)&lt;/strong&gt; is remarkable for what it is — a model that can run entirely on CPU, handles multiple languages without configuration, and produces transcriptions that are genuinely good even with background noise. The &lt;code&gt;base&lt;/code&gt; model (74M parameters) is the sweet spot for most hardware. On a modern CPU it transcribes a 10-second clip in about 5–8 seconds. That's acceptable. The &lt;code&gt;tiny&lt;/code&gt; model is faster but starts making mistakes on accented speech and technical vocabulary. The &lt;code&gt;small&lt;/code&gt; model is noticeably better but slower — 20 to 30 seconds on CPU starts to feel like waiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Groq's hosted Whisper&lt;/strong&gt; (large-v3) does the same job in under a second. It's cloud-based, which means audio leaves your machine, but the quality is the best available and the latency is almost magical compared to local inference. For anyone who can't run Whisper locally — either because of slow hardware or RAM constraints — this is the practical fallback.&lt;/p&gt;

&lt;p&gt;I made the backend configurable through a single environment variable (&lt;code&gt;STT_BACKEND=whisper&lt;/code&gt; or &lt;code&gt;STT_BACKEND=groq&lt;/code&gt;) so switching is a one-line change in &lt;code&gt;.env&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The abstraction layer is clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_ext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;STT_BACKEND&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_transcribe_groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_ext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;_transcribe_whisper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_ext&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. The rest of the pipeline doesn't need to care.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Intent Classification — The Hard Part
&lt;/h3&gt;

&lt;p&gt;Once you have text, you need to understand what the user actually wants. This is where LLMs earn their place in the pipeline.&lt;/p&gt;

&lt;p&gt;My first instinct was to do this with a simple keyword matcher — if the text contains "create" and "file", route to the file creation tool. This works for the obvious cases and fails spectacularly for everything else. "Can you make a script that creates files in a loop?" triggers the wrong branch. "Write me something that opens a new document" is ambiguous. Natural language is messy.&lt;/p&gt;

&lt;p&gt;So I handed the problem to an LLM with a structured prompt that asks for JSON output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"A function that retries a failed HTTP request up to 3 times"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text_to_summarize"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt is careful about what it asks for. It defines exactly five valid intent strings, gives explicit rules about when to combine them (compound commands like "write a script and save it as utils.py" map to &lt;code&gt;["write_code"]&lt;/code&gt; not &lt;code&gt;["write_code", "create_file"]&lt;/code&gt; — because code writing implies file creation), and tells the model to return nothing but the JSON object.&lt;/p&gt;
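&lt;p&gt;For the curious, the shape of that prompt looks roughly like this (the exact wording below is illustrative, not the production prompt):&lt;/p&gt;

```python
# Illustrative skeleton of the intent-classification prompt;
# the real prompt in the project is longer and more defensive.
INTENT_PROMPT = """You are an intent classifier for a voice agent.
Valid intents: create_file, write_code, summarize, chat, unknown.
Rules:
- Compound commands like "write a script and save it as utils.py"
  map to ["write_code"] only, because writing code implies creating
  the file.
- Respond with ONLY a JSON object with the keys: intents, filename,
  language, description, text_to_summarize. Use null when a field
  does not apply.

User command: {command}
"""

def build_intent_prompt(command: str) -> str:
    return INTENT_PROMPT.format(command=command)
```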

&lt;p&gt;Getting reliable JSON out of LLMs took more iteration than I expected. Gemini, when configured with &lt;code&gt;response_mime_type="application/json"&lt;/code&gt;, is excellent — it almost never wraps output in markdown fences or adds preamble. Other models are less disciplined. My JSON extractor strips fences, searches for the first &lt;code&gt;{...}&lt;/code&gt; block, and parses it — a belt-and-suspenders approach that handles most misbehaviour.&lt;/p&gt;

&lt;p&gt;The bigger challenge was &lt;strong&gt;compound commands&lt;/strong&gt;. Say something like "summarise this article and save it to notes.txt" — the agent needs to recognise two things happening: a summarisation and a file write. The LLM handles this well when prompted correctly, returning &lt;code&gt;["summarize"]&lt;/code&gt; with &lt;code&gt;filename: "notes.txt"&lt;/code&gt;. The tool dispatcher then routes to the summarise tool, which detects the filename and saves automatically.&lt;/p&gt;

&lt;h4&gt;
  
  
  Which LLM to Use?
&lt;/h4&gt;

&lt;p&gt;I tested four backends extensively. Here's my honest take:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 2.5 Flash&lt;/strong&gt; is where I landed as the default recommendation. It's fast (typically under two seconds for intent classification), produces clean structured JSON, has a generous free tier (15 requests per minute, one million tokens per day via Google AI Studio), and handles the kinds of instructions I'm giving it without complaint. The &lt;code&gt;google-generativeai&lt;/code&gt; SDK is well-maintained and the JSON mode is first-class.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Groq with Llama 3 (8B)&lt;/strong&gt; is the speed champion — sub-second responses, genuinely impressive for a hosted service, and the free tier is very usable (6,000 requests per day). The 8B model is slightly less reliable on complex compound commands compared to Gemini, but for straightforward single-intent commands it's excellent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI's GPT-3.5-turbo&lt;/strong&gt; works well but costs money and has no free tier. I kept it in the codebase because many developers already have API credits, and the JSON mode is rock-solid. I also wired in support for any OpenAI-compatible endpoint, which opens the door to OpenRouter: a free account gives you access to Llama, Mistral, Gemma, and Qwen models with no credit card required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt; (local) is the most private option and the one I had originally planned to use as the primary backend. It works beautifully once you have a model pulled. The problem is "once you have a model pulled" — Mistral is 4 GB, Llama 3 is larger, and pulling them requires a fast internet connection and available disk space. For anyone who can't meet those requirements, the cloud backends are the practical answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 4: Tool Execution
&lt;/h3&gt;

&lt;p&gt;Once the intent is classified, one of four tools runs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/strong&gt; — creates an empty file or directory at the specified path inside &lt;code&gt;output/&lt;/code&gt;. There's a path traversal check on every operation: the resolved path must start with the resolved &lt;code&gt;OUTPUT_DIR&lt;/code&gt;, or the operation is rejected. This is non-negotiable safety plumbing.&lt;/p&gt;
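&lt;p&gt;The guard itself is only a few lines. A minimal sketch, assuming &lt;code&gt;output/&lt;/code&gt; as the sandbox root (the helper name and the error type are mine):&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()  # assumed sandbox root

def safe_path(user_path: str) -> Path:
    # Resolve the requested path, then refuse anything whose
    # resolved form does not sit under the resolved OUTPUT_DIR.
    candidate = (OUTPUT_DIR / user_path).resolve()
    if not str(candidate).startswith(str(OUTPUT_DIR)):
        raise ValueError(f"Path escapes sandbox: {user_path}")
    return candidate
```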

&lt;p&gt;&lt;strong&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/strong&gt; — sends a code-generation prompt to the LLM, receives the result, strips any markdown fences if the model got enthusiastic, and writes the file. The prompt is explicit: "Return ONLY the code — no markdown fences, no explanation." Gemini follows this instruction reliably. Some models need gentle reminding via the fence-stripping fallback.&lt;/p&gt;
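&lt;p&gt;The fence-stripping fallback is essentially one regex. A sketch of the idea (names are assumptions, behaviour matches the description above):&lt;/p&gt;

```python
import re

# Matches a whole response wrapped in a markdown code fence,
# with or without a language tag, and captures the body.
_FENCE_RE = re.compile(r"^\s*```[\w+-]*\s*\n(.*?)\n?\s*```\s*$", re.DOTALL)

def strip_fences(text: str) -> str:
    # If the model wrapped its answer in a fence, unwrap it;
    # otherwise return the (trimmed) text untouched.
    match = _FENCE_RE.match(text.strip())
    return match.group(1) if match else text.strip()
```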

&lt;p&gt;&lt;strong&gt;&lt;code&gt;summarize&lt;/code&gt;&lt;/strong&gt; — passes the text to the LLM with a summarisation prompt. If a filename was detected in the original command, it saves the summary to that file automatically. This is the compound command case mentioned above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;chat&lt;/code&gt;&lt;/strong&gt; — just talks. Passes the transcription directly to the LLM and returns the response. No file operations.&lt;/p&gt;

&lt;p&gt;Every tool returns an &lt;code&gt;ActionResult&lt;/code&gt; dataclass with &lt;code&gt;success&lt;/code&gt;, &lt;code&gt;action_taken&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;file_path&lt;/code&gt;, and &lt;code&gt;error&lt;/code&gt;. The UI renders these uniformly regardless of which tool ran.&lt;/p&gt;
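&lt;p&gt;For reference, that dataclass is nothing fancy. A sketch with the five fields named above:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionResult:
    # Uniform result envelope returned by every tool, so the UI
    # can render success and failure the same way for all of them.
    success: bool
    action_taken: str
    output: Optional[str] = None
    file_path: Optional[str] = None
    error: Optional[str] = None
```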

&lt;h3&gt;
  
  
  Layer 5: Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;This was a bonus feature but ended up being one of the things I'm most glad I built.&lt;/p&gt;

&lt;p&gt;Before any file operation executes, if the HITL toggle is on (it's on by default), the UI shows a confirmation card with the detected intent, the planned filename, and the description. The user can approve or cancel.&lt;/p&gt;

&lt;p&gt;This turns out to be genuinely useful — not just as a safety feature, but as a debugging tool. When the LLM misclassifies something, you see it before anything happens. You can cancel and rephrase. It makes the agent feel less like a black box and more like a collaborator.&lt;/p&gt;

&lt;p&gt;The implementation stores the &lt;code&gt;ParsedIntent&lt;/code&gt; object in Streamlit session state between runs and re-uses it when the user confirms. The pipeline generator yields an &lt;code&gt;awaiting_confirmation&lt;/code&gt; stage that pauses execution until the user interacts with the confirmation UI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenges That Actually Hurt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Whisper Memory Spike
&lt;/h3&gt;

&lt;p&gt;The first time I loaded the Whisper &lt;code&gt;small&lt;/code&gt; model during a Streamlit session, it worked. The second time, I got an out-of-memory error. The third time it worked again.&lt;/p&gt;

&lt;p&gt;The issue: I was loading the model inside the transcription function on every call, which meant it was being garbage-collected and reallocated unpredictably. The fix was lazy loading with a module-level singleton:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_whisper_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_load_whisper&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_whisper_model&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_whisper_model&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_whisper_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;whisper&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WHISPER_MODEL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_whisper_model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load once, reuse forever within the process. Memory stable. This is the kind of thing that's obvious in hindsight and invisible until you spend three hours staring at memory profiler output.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streamlit's Execution Model
&lt;/h3&gt;

&lt;p&gt;Streamlit re-runs your entire script on every user interaction. This is elegant for simple apps and a source of creative suffering for anything stateful.&lt;/p&gt;

&lt;p&gt;The pipeline I built is a generator — it &lt;code&gt;yield&lt;/code&gt;s status updates as each stage completes, which lets the UI show live progress. But when the Human-in-the-Loop confirmation splits the pipeline across two Streamlit runs, you can't just hold the generator open. It gets garbage-collected.&lt;/p&gt;

&lt;p&gt;The solution was to decouple the two phases completely. The first run (STT + intent classification) stores its result in &lt;code&gt;st.session_state&lt;/code&gt;. The confirmation UI reads from session state. The second run (tool execution) pulls the stored intent and executes it. No generator spans the boundary — each run is self-contained.&lt;/p&gt;
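&lt;p&gt;The pattern is easier to see stripped of Streamlit specifics. In this sketch a plain dict stands in for &lt;code&gt;st.session_state&lt;/code&gt;, and the function names are illustrative:&lt;/p&gt;

```python
# Two self-contained phases; nothing (no generator, no open
# resource) needs to survive across the rerun boundary.
def first_run(state, transcription, classify):
    # Phase 1: classify the intent and park it for the confirmation UI.
    state["pending_intent"] = classify(transcription)
    state["awaiting_confirmation"] = True

def second_run(state, execute):
    # Phase 2: runs only after the user approves; pulls the stored
    # intent back out of session state and executes it.
    if not state.get("awaiting_confirmation"):
        return None
    intent = state.pop("pending_intent")
    state["awaiting_confirmation"] = False
    return execute(intent)
```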

&lt;p&gt;This is the right pattern for Streamlit, and it took longer to arrive at than I'd like to admit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting LLMs to Always Return Valid JSON
&lt;/h3&gt;

&lt;p&gt;This sounds like a solved problem and mostly is — if you use JSON mode. But not every backend has a native JSON mode, and even models that do occasionally produce something that looks like JSON but isn't: trailing commas, unquoted keys, truncated output because the response hit a token limit.&lt;/p&gt;

&lt;p&gt;My extraction function (&lt;code&gt;_extract_json&lt;/code&gt;) is deliberately robust:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Strip markdown code fences with a regex&lt;/li&gt;
&lt;li&gt;Find the first &lt;code&gt;{...}&lt;/code&gt; block (handles preamble like "Sure! Here's the JSON:")&lt;/li&gt;
&lt;li&gt;Parse it&lt;/li&gt;
&lt;li&gt;If it fails, return an &lt;code&gt;unknown&lt;/code&gt; intent with the raw output as the error&lt;/li&gt;
&lt;/ol&gt;
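&lt;p&gt;The four steps above can be sketched in about a dozen lines (the name and the exact fallback shape are illustrative):&lt;/p&gt;

```python
import json
import re

def extract_json(raw: str) -> dict:
    # 1) strip markdown fence markers, 2) grab the first {...} block,
    # 3) parse it, 4) degrade to an "unknown" intent on any failure.
    cleaned = re.sub(r"```[\w+-]*", "", raw).strip()
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return {"intents": ["unknown"], "error": raw}
```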

&lt;p&gt;The most important lesson: &lt;strong&gt;never crash on bad LLM output&lt;/strong&gt;. Log it, degrade gracefully, show the user something useful. The pipeline continues even if intent classification fails — it just routes to the &lt;code&gt;unknown&lt;/code&gt; handler, which tells the user to try rephrasing.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;streamlit-mic-recorder&lt;/code&gt; Silence Problem
&lt;/h3&gt;

&lt;p&gt;If a user hits record and immediately hits stop without saying anything, the recorded audio is a few hundred milliseconds of silence. Whisper transcribes this as &lt;code&gt;" "&lt;/code&gt; (a single space) or &lt;code&gt;""&lt;/code&gt;. The pipeline then tries to classify an empty string.&lt;/p&gt;

&lt;p&gt;I added a guard: if the transcription is empty or whitespace-only, show a friendly message ("I didn't catch that — please try again") and stop the pipeline. This sounds trivial. It took an embarrassingly long time to track down because Whisper was producing a non-empty string (the single space), which passed the initial &lt;code&gt;if not transcription&lt;/code&gt; check.&lt;/p&gt;

&lt;p&gt;The fix is &lt;code&gt;if not transcribed_text.strip()&lt;/code&gt;. Always strip.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Build the test harness first.&lt;/strong&gt; I wrote &lt;code&gt;test_pipeline.py&lt;/code&gt; — a headless CLI that injects text directly and skips the audio step — halfway through the project. Having it from day one would have saved an enormous amount of time. Testing audio input requires a browser session and an audio recording. Testing intent classification just requires a string. Decouple them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest more in prompt engineering early.&lt;/strong&gt; The intent classification prompt went through about eight revisions. Each revision improved reliability measurably. I wish I had spent the first day on nothing but prompt iteration instead of building UI scaffolding that I later changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add streaming LLM output to the UI.&lt;/strong&gt; For code generation especially, watching the response arrive token by token feels much better than a spinner that says "generating..." for ten seconds. Gemini and OpenAI both support streaming. It's not architecturally difficult — I just ran out of time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Free Tier Situation
&lt;/h2&gt;

&lt;p&gt;One thing I want to be direct about because the confusion here is real:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI has no free tier.&lt;/strong&gt; You need a paid account and purchased credits to use &lt;code&gt;api.openai.com&lt;/code&gt;. Full stop. If you want free access to capable LLMs with an OpenAI-compatible API, use OpenRouter. You sign up, get free initial credits, and can access Llama 3, Mistral, and Gemma models without entering a credit card. The endpoint is drop-in compatible — just change &lt;code&gt;OPENAI_BASE_URL&lt;/code&gt; to &lt;code&gt;https://openrouter.ai/api/v1&lt;/code&gt; and use a free model name like &lt;code&gt;meta-llama/llama-3-8b-instruct:free&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini is genuinely free for this use case.&lt;/strong&gt; 15 requests per minute and one million tokens per day on &lt;code&gt;gemini-2.5-flash&lt;/code&gt; via Google AI Studio. For a voice agent that processes one command at a time, you will never hit the rate limit under normal use. This is why I made Gemini the default backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Groq is also genuinely free.&lt;/strong&gt; 6,000 requests per day on the LLM endpoint, and similar limits for STT. If you want fast cloud inference with no local model setup and no money, Groq + Whisper-large-v3 for STT and Groq + Llama 3 for intent is a fully free, performant stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Piece I'm Proudest Of
&lt;/h2&gt;

&lt;p&gt;The compound command handling. When you say "write a Python retry function and save it to utils/retry.py", the agent correctly identifies that this is a &lt;code&gt;write_code&lt;/code&gt; intent (not &lt;code&gt;write_code&lt;/code&gt; + &lt;code&gt;create_file&lt;/code&gt; — because code writing implies file creation), extracts &lt;code&gt;utils/retry.py&lt;/code&gt; as the filename, infers Python as the language, generates the function, creates the &lt;code&gt;utils/&lt;/code&gt; subdirectory if it doesn't exist, and writes the file. All of that happens from twelve words of speech.&lt;/p&gt;

&lt;p&gt;The path traversal guard runs on &lt;code&gt;utils/retry.py&lt;/code&gt; as the filename, resolves it to an absolute path, and verifies it's inside &lt;code&gt;output/&lt;/code&gt; before touching the filesystem. The subdirectory creation is automatic. The whole operation is atomic from the user's perspective.&lt;/p&gt;

&lt;p&gt;That's the moment where the project felt less like a demo and more like something real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running It Yourself
&lt;/h2&gt;

&lt;p&gt;The project is structured to be cloneable and runnable in under ten minutes, assuming you have Python 3.10+ installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/akshat-2600/voice-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;voice-agent
bash setup.sh          &lt;span class="c"&gt;# creates venv, installs deps, copies .env&lt;/span&gt;
&lt;span class="c"&gt;# Edit .env: add your GOOGLE_API_KEY&lt;/span&gt;
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the LLM, the fastest path to working is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get a free Gemini API key from &lt;a href="https://aistudio.google.com" rel="noopener noreferrer"&gt;aistudio.google.com&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;LLM_BACKEND=gemini&lt;/code&gt; and &lt;code&gt;GOOGLE_API_KEY=your_key&lt;/code&gt; in &lt;code&gt;.env&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For STT, Whisper &lt;code&gt;base&lt;/code&gt; runs on any modern CPU. If you want faster transcription, get a free Groq key from &lt;a href="https://console.groq.com" rel="noopener noreferrer"&gt;console.groq.com&lt;/a&gt; and set &lt;code&gt;STT_BACKEND=groq&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgu2sw3f3yfbigoab4tus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgu2sw3f3yfbigoab4tus.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Building this reinforced something I already suspected: the &lt;em&gt;easy&lt;/em&gt; part of an AI agent is calling the LLM. The hard parts are everything around it — handling unexpected input gracefully, making state management work across a reactive UI framework, deciding where the safety boundaries are and enforcing them consistently, and writing the kind of prompt that reliably extracts structure from natural language.&lt;/p&gt;

&lt;p&gt;The architecture I landed on is deliberately simple. Each layer does one thing: audio comes in, text comes out, intent is classified, tool runs, result is displayed. There's no magic. The LLM is a sophisticated text transformer sitting in the middle of a pipeline that, at its heart, is just a series of function calls.&lt;/p&gt;

&lt;p&gt;If you build something on top of this or run into something I got wrong, I'd genuinely like to hear about it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Built with: Python, Streamlit, OpenAI Whisper, Google Gemini API, Groq, and more coffee than was probably advisable.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>rag</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
