<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishan Naikele</title>
    <description>The latest articles on DEV Community by Ishan Naikele (@ishan_naikele_74a5355f972).</description>
    <link>https://dev.to/ishan_naikele_74a5355f972</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876216%2F2528a182-3cf7-4147-bb59-1a05f4b0855d.png</url>
      <title>DEV Community: Ishan Naikele</title>
      <link>https://dev.to/ishan_naikele_74a5355f972</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishan_naikele_74a5355f972"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled AI Agent with FastAPI, Groq &amp; Streamlit</title>
      <dc:creator>Ishan Naikele</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:31:51 +0000</pubDate>
      <link>https://dev.to/ishan_naikele_74a5355f972/building-a-voice-controlled-ai-agent-with-fastapi-groq-streamlit-5g2l</link>
      <guid>https://dev.to/ishan_naikele_74a5355f972/building-a-voice-controlled-ai-agent-with-fastapi-groq-streamlit-5g2l</guid>
      <description>&lt;p&gt;Ever wanted to just &lt;em&gt;talk&lt;/em&gt; to your computer and have it actually do something useful create files, write code, summarize text? That's exactly what I built for this project.&lt;/p&gt;

&lt;p&gt;This article covers the architecture, the models I picked, the challenges I hit, and the lessons learned.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ Architecture
&lt;/h2&gt;

&lt;p&gt;The system has two parts talking over HTTP:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI backend&lt;/strong&gt; — handles all AI inference and file operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit frontend&lt;/strong&gt; — handles audio input and displays results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every request passes through three stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input
    ↓
[STT]  Groq Whisper-large-v3  →  transcribed text
    ↓
[Intent]  Groq Llama-3.1-8b  →  JSON task list
    ↓
[Execute]  Local tools  →  create file / write code / summarize / chat
    ↓
Display result in UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keeping the backend and frontend separate means I can swap out the UI without touching any AI logic.&lt;/p&gt;
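&lt;p&gt;The pipeline above can be sketched as plain functions wired together (a simplified sketch; the &lt;code&gt;transcribe&lt;/code&gt; and &lt;code&gt;classify&lt;/code&gt; stubs stand in for the real Groq calls, and &lt;code&gt;execute&lt;/code&gt; for the local tool dispatch):&lt;/p&gt;

```python
# Simplified sketch of the request pipeline. In the real app, transcribe()
# calls Groq Whisper-large-v3 and classify() calls Llama-3.1-8b; here they
# are stand-in stubs so the shape of the flow is visible.

def transcribe(audio_bytes: bytes) -> str:
    # Real app: send audio to Groq's Whisper endpoint
    return "create a file named notes.txt"

def classify(text: str) -> dict:
    # Real app: strict-JSON intent classification via Llama-3.1-8b
    return {"tasks": [{"intent": "create_file",
                       "parameters": {"filename": "notes.txt", "content": ""},
                       "confidence": 0.9}]}

def execute(task: dict) -> str:
    # Real app: dispatch to local tools (file writer, code gen, summarizer)
    return f"ran {task['intent']}"

def handle_request(audio_bytes: bytes) -> list[str]:
    text = transcribe(audio_bytes)
    plan = classify(text)
    return [execute(task) for task in plan["tasks"]]
```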




&lt;h2&gt;
  
  
  🤖 Models I Used
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-Text&lt;/td&gt;
&lt;td&gt;whisper-large-v3 (Groq)&lt;/td&gt;
&lt;td&gt;Best open STT model, fast via Groq&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent Classification&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;Small, fast, reliable at JSON output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Generation&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;Fast enough for short scripts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarization&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;Good summary quality at acceptable latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  ⚠️ Why Groq API instead of local Whisper?
&lt;/h3&gt;

&lt;p&gt;The assignment recommended running Whisper locally via HuggingFace. However, &lt;code&gt;whisper-large-v3&lt;/code&gt; needs at least 6GB of GPU VRAM to run at a usable speed. On CPU it takes 30–60 seconds per clip — way too slow for an interactive UI.&lt;/p&gt;

&lt;p&gt;Groq runs the &lt;strong&gt;exact same model&lt;/strong&gt; on their hardware, returning results in ~700 ms. The model is identical; only the compute location differs.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Intent Classification — The Tricky Part
&lt;/h2&gt;

&lt;p&gt;Getting the LLM to output clean, parseable JSON every time was harder than expected. Language models naturally want to add explanations and wrap things in markdown. Both of those break &lt;code&gt;json.loads()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The fix was a very strict system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SYSTEM_PROMPT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
You are a strict JSON routing agent.
Return ONLY valid JSON. No explanation. No markdown. No extra text.

Available intents:
- create_file  → { filename, content }
- write_code   → { filename, language, description }
- summarize    → { text, save_to }
- chat         → { message }

Always return: { &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [ ...task objects... ] }
Each task: { &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confidence&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; }
Multiple commands → multiple tasks in the list.
If unclear → default to &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supporting a &lt;code&gt;tasks&lt;/code&gt; array from day one is what enables &lt;strong&gt;compound commands&lt;/strong&gt; — the model naturally puts two intents in the list when the user says &lt;em&gt;"write a file and summarize it."&lt;/em&gt;&lt;/p&gt;
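&lt;p&gt;To make this concrete, here is a hypothetical response for that compound command and how the &lt;code&gt;tasks&lt;/code&gt; array is consumed (the JSON payload is illustrative, not captured model output):&lt;/p&gt;

```python
import json

# Hypothetical model output for "write a file and summarize it":
# two task objects in a single tasks array.
raw = '''{
  "tasks": [
    {"intent": "write_code",
     "parameters": {"filename": "demo.py", "language": "python",
                    "description": "hello world script"},
     "confidence": 0.92},
    {"intent": "summarize",
     "parameters": {"text": "demo.py contents", "save_to": "summary.txt"},
     "confidence": 0.88}
  ]
}'''

plan = json.loads(raw)
# The executor simply iterates the list, so compound commands need no
# special casing anywhere downstream.
intents = [task["intent"] for task in plan["tasks"]]
```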




&lt;h2&gt;
  
  
  🔒 File Safety
&lt;/h2&gt;

&lt;p&gt;Since the system writes files based on user voice input, path traversal is a real concern. The fix: a sandboxing function that resolves the absolute path and rejects anything outside &lt;code&gt;output/&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abspath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Trailing os.sep is critical — without it, "output_evil/" would pass
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUTPUT_DIR&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sep&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All generated files go into &lt;code&gt;output/&lt;/code&gt;. Nothing else is writable.&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Session Memory
&lt;/h2&gt;

&lt;p&gt;The agent keeps two parallel histories:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Action history&lt;/strong&gt; — timestamped log shown in the UI sidebar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat context&lt;/strong&gt; — last 3 user/assistant pairs sent to the LLM on every classify call&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This means the user can say &lt;em&gt;"now do the same for the other file"&lt;/em&gt; and the model understands the reference. Without it, every request is completely stateless.&lt;/p&gt;
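&lt;p&gt;The rolling chat context can be kept with a bounded deque (a sketch under the assumption that messages are simple role/content dicts; the real app may store richer objects):&lt;/p&gt;

```python
from collections import deque

# Last 3 user/assistant pairs = 6 messages; older entries fall off the left
chat_context = deque(maxlen=6)

def remember(role: str, content: str) -> None:
    chat_context.append({"role": role, "content": content})

# Simulate 5 exchanges (10 messages) — only the last 6 survive
for i in range(5):
    remember("user", f"request {i}")
    remember("assistant", f"reply {i}")

history = list(chat_context)  # prepended to every classify call
```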




&lt;h2&gt;
  
  
  📊 Latency Benchmarks
&lt;/h2&gt;

&lt;p&gt;Averaged across 20 runs with a 5–10 second audio clip:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speech-to-Text&lt;/td&gt;
&lt;td&gt;whisper-large-v3&lt;/td&gt;
&lt;td&gt;~720 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent Classification&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;~380 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Generation&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;~950 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Summarization&lt;/td&gt;
&lt;td&gt;llama-3.1-8b-instant&lt;/td&gt;
&lt;td&gt;~1,350 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total end-to-end for a &lt;code&gt;write_code&lt;/code&gt; request: &lt;strong&gt;~2.0–2.5 seconds&lt;/strong&gt;. Fast enough to feel responsive.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Interesting finding:&lt;/strong&gt; Intent classification is the fastest stage despite being the most "reasoning-heavy" step — because the strict JSON-only prompt forces the model to skip all its natural language preamble. Constraining output format is free speed.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🐛 Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. JSON parse failures&lt;/strong&gt; — Even with a strict prompt, the model occasionally wraps output in markdown fences. Added a fallback that strips backticks before parsing, plus a catch-all that defaults to &lt;code&gt;chat&lt;/code&gt; intent on failure.&lt;/p&gt;
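&lt;p&gt;That fallback might look roughly like this (an illustrative sketch, not the exact production code):&lt;/p&gt;

```python
import json

def parse_intent(raw: str) -> dict:
    """Parse model output, tolerating markdown code fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (and any language tag after it)
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if present
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Catch-all: degrade gracefully to a plain chat intent
        return {"tasks": [{"intent": "chat",
                           "parameters": {"message": raw},
                           "confidence": 0.0}]}
```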

&lt;p&gt;&lt;strong&gt;2. Audio format handling&lt;/strong&gt; — Groq's Whisper API requires the correct file extension. Sending a &lt;code&gt;.wav&lt;/code&gt; file named &lt;code&gt;audio&lt;/code&gt; with no extension caused silent failures. Fix: always preserve the original filename and extension.&lt;/p&gt;
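&lt;p&gt;The guard for that fix can be as small as this (illustrative; in the real app it sits in the upload handler before the file reaches the STT call):&lt;/p&gt;

```python
import os

def ensure_extension(filename: str, default_ext: str = ".wav") -> str:
    """Preserve the uploaded file's extension; supply a default if missing."""
    root, ext = os.path.splitext(filename)
    return filename if ext else root + default_ext
```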

&lt;p&gt;&lt;strong&gt;3. Two-process state&lt;/strong&gt; — Streamlit and FastAPI are separate processes. If the backend restarts, all session history is lost. A future fix would be writing to SQLite on every &lt;code&gt;memory.add()&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Browser mic compatibility&lt;/strong&gt; — The &lt;code&gt;streamlit-audiorecorder&lt;/code&gt; component works great in Chrome but has issues in Firefox and Safari. Documented this in the README.&lt;/p&gt;




&lt;h2&gt;
  
  
  ✨ Bonus Features Built
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Compound commands&lt;/strong&gt; — one audio clip triggers multiple tasks&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Human-in-the-loop&lt;/strong&gt; — optional confirmation before executing file operations&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Session memory&lt;/strong&gt; — rolling chat context + action history sidebar&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Latency benchmarking&lt;/strong&gt; — toggle in settings to show model speeds&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔗 Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;💻 GitHub: &lt;a href="https://github.com/IshanNaikele/voice-agent" rel="noopener noreferrer"&gt;github.com/IshanNaikele/voice-agent&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built with FastAPI · Streamlit · Groq API · Whisper · Llama-3&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
