<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AryanJaitely</title>
    <description>The latest articles on DEV Community by AryanJaitely (@aryanjaitely).</description>
    <link>https://dev.to/aryanjaitely</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875085%2Fe242b972-0c3b-466a-8a28-5c20c9b3cba7.png</url>
      <title>DEV Community: AryanJaitely</title>
      <link>https://dev.to/aryanjaitely</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aryanjaitely"/>
    <language>en</language>
    <item>
      <title>How I Built a Voice-Controlled Local AI Agent</title>
      <dc:creator>AryanJaitely</dc:creator>
      <pubDate>Sun, 12 Apr 2026 15:38:22 +0000</pubDate>
      <link>https://dev.to/aryanjaitely/how-i-built-a-voice-controlled-local-ai-agent-2i1p</link>
      <guid>https://dev.to/aryanjaitely/how-i-built-a-voice-controlled-local-ai-agent-2i1p</guid>
      <description>&lt;h1&gt;
  
  
  How I Built a Voice-Controlled Local AI Agent
&lt;/h1&gt;


&lt;p&gt;&lt;em&gt;From microphone to file creation in under 3 seconds — using Whisper, LLaMA3, and Gradio&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;What if you could just speak to your computer and have it create files, write code, or summarize content — all running locally on your machine? That is exactly what I set out to build for this assignment.&lt;/p&gt;

&lt;p&gt;In this article I will walk through how I built a voice-controlled AI agent from scratch, the architecture decisions I made, the models I chose, and the challenges I faced along the way.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Agent Does
&lt;/h2&gt;

&lt;p&gt;The agent takes voice input (or typed text), runs it through a four-stage pipeline, and returns a result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stage 1 — Audio Input:&lt;/strong&gt; Accept microphone recording or uploaded .wav/.mp3 file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 2 — Speech-to-Text:&lt;/strong&gt; Transcribe the audio using OpenAI Whisper&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 3 — Intent Classification:&lt;/strong&gt; Send the transcription to an LLM to classify intent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage 4 — Tool Execution:&lt;/strong&gt; Run the right tool and save output to an &lt;code&gt;output/&lt;/code&gt; folder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is displayed in a clean Gradio web UI showing each stage of the pipeline.&lt;/p&gt;
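The four stages above can be sketched as a simple pipeline. This is a minimal sketch: `transcribe`, `classify_intent`, and `execute_tool` are hypothetical stubs standing in for the real module functions (which call Whisper and LLaMA 3), not the project's actual code.

```python
# Stub stages standing in for stt.py, intent.py, and tools.py.
# The real functions call Whisper and LLaMA 3; these are placeholders.
def transcribe(audio_path):
    return "create a python file with a retry function"

def classify_intent(text):
    return {"intent": "write_code", "filename": "retry.py",
            "language": "python", "description": text}

def execute_tool(intent):
    return "saved to output/" + intent["filename"]

def run_pipeline(audio_path):
    """Stages 2-4: speech-to-text, intent classification, tool execution."""
    text = transcribe(audio_path)      # Stage 2: speech-to-text
    intent = classify_intent(text)     # Stage 3: intent classification
    result = execute_tool(intent)      # Stage 4: tool execution
    return {"transcription": text, "intent": intent, "result": result}

print(run_pipeline("command.wav")["result"])  # saved to output/retry.py
```

Keeping each stage behind its own function is what makes the backends swappable later on.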




&lt;h2&gt;
  
  
  Architecture Deep Dive
&lt;/h2&gt;

&lt;p&gt;The project is split into four Python modules, each handling one stage of the pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  stt.py — Speech to Text
&lt;/h3&gt;

&lt;p&gt;I used Whisper Large v3 via the Groq API as my primary STT backend. Groq provides incredibly fast inference for free, which made it perfect for this project. The module also supports local Whisper (openai-whisper package) and OpenAI's Whisper-1 API as fallbacks, selectable via an environment variable.&lt;/p&gt;
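The backend switch can be as small as one environment-variable lookup. A minimal sketch, assuming the `STT_BACKEND` variable used in the setup section below; the function name `pick_stt_backend` is illustrative, not the module's real API:

```python
import os

# Pick the STT backend from an environment variable; "groq" is the
# article's default, with local Whisper and OpenAI as fallbacks.
def pick_stt_backend():
    backend = os.environ.get("STT_BACKEND", "groq").lower()
    if backend in ("groq", "local", "openai"):
        return backend
    raise ValueError("unknown STT_BACKEND: " + backend)

os.environ["STT_BACKEND"] = "local"
print(pick_stt_backend())  # local
```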

&lt;h3&gt;
  
  
  intent.py — Intent Classification
&lt;/h3&gt;

&lt;p&gt;This module sends the transcription to an LLM with a carefully crafted system prompt asking it to return structured JSON with four fields: intent, filename, language, and description. I chose a JSON-only output format to make parsing reliable and consistent.&lt;/p&gt;

&lt;p&gt;The module supports Ollama (local), Groq (cloud), and OpenAI. There is also a rule-based fallback using keyword matching in case the LLM is unavailable.&lt;/p&gt;

&lt;p&gt;The four supported intents are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;create_file&lt;/code&gt; — create a new file or folder&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;write_code&lt;/code&gt; — generate code and save it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;summarize&lt;/code&gt; — summarize provided text&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;general_chat&lt;/code&gt; — conversational response&lt;/li&gt;
&lt;/ul&gt;
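The prompt-and-parse side of this can be sketched without any API call. The system prompt wording and the `parse_intent` helper below are illustrative assumptions; only the four field names and intent values come from the article:

```python
import json

VALID_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

# A strict JSON-only system prompt in the spirit described above
# (exact wording is an assumption).
SYSTEM_PROMPT = (
    "You are an intent classifier. Respond ONLY with valid JSON, "
    "no markdown, no explanation. Return exactly these fields: "
    "intent, filename, language, description."
)

def parse_intent(raw_reply):
    """Parse the LLM reply; fall back to general_chat on bad output."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        return {"intent": "general_chat"}
    if data.get("intent") not in VALID_INTENTS:
        data["intent"] = "general_chat"
    return data

reply = '{"intent": "write_code", "filename": "retry.py", "language": "python", "description": "retry helper"}'
print(parse_intent(reply)["intent"])  # write_code
```

Falling back to `general_chat` means a malformed reply degrades to a conversation instead of a crash.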

&lt;h3&gt;
  
  
  tools.py — Tool Execution
&lt;/h3&gt;

&lt;p&gt;Based on the classified intent, the tools module calls the appropriate function. For &lt;code&gt;write_code&lt;/code&gt;, it prompts the LLM a second time with a code-generation system prompt, cleans the output (stripping markdown fences), and saves the result to the &lt;code&gt;output/&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;All file operations are sandboxed using &lt;code&gt;os.path.basename()&lt;/code&gt; to prevent path traversal attacks — no files can ever be written outside the &lt;code&gt;output/&lt;/code&gt; folder.&lt;/p&gt;

&lt;h3&gt;
  
  
  app.py — Gradio UI
&lt;/h3&gt;

&lt;p&gt;I built the frontend with Gradio 4, using a custom dark theme with CSS variables. The UI shows all four pipeline outputs simultaneously: transcribed text, detected intent, action taken, and final result. A session history panel shows the last 5 actions for context.&lt;/p&gt;




&lt;h2&gt;
  
  
  Models I Chose and Why
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Whisper Large v3 via Groq (STT)
&lt;/h3&gt;

&lt;p&gt;I chose Groq's hosted Whisper Large v3 for speech-to-text for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It is extremely fast — transcription takes under 1 second for short commands&lt;/li&gt;
&lt;li&gt;Groq provides a generous free tier with no credit card required&lt;/li&gt;
&lt;li&gt;Whisper Large v3 has excellent accuracy even with accented speech and background noise&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For users who want full offline operation, the &lt;code&gt;openai-whisper&lt;/code&gt; package is also supported as a drop-in replacement.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLaMA 3 8B via Groq (Intent + Code Generation)
&lt;/h3&gt;

&lt;p&gt;For the LLM I used LLaMA 3 8B through Groq's API. The 8B model is fast and capable enough for intent classification and short code generation tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For intent classification → temperature set to &lt;code&gt;0&lt;/code&gt; for deterministic JSON output&lt;/li&gt;
&lt;li&gt;For code generation → temperature set to &lt;code&gt;0.3&lt;/code&gt; for slightly more creative outputs&lt;/li&gt;
&lt;/ul&gt;
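Wiring those two settings into one lookup keeps the call sites simple. The temperatures are the article's; the `TASK_SETTINGS` table and `settings_for` helper are illustrative assumptions:

```python
# Per-task generation settings (temperatures from the article).
TASK_SETTINGS = {
    "intent_classification": {"temperature": 0},
    "code_generation": {"temperature": 0.3},
}

def settings_for(task):
    # Default to deterministic output for any unlisted task.
    return TASK_SETTINGS.get(task, {"temperature": 0})

print(settings_for("code_generation")["temperature"])  # 0.3
```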

&lt;p&gt;For users with a capable local machine, Ollama with &lt;code&gt;llama3&lt;/code&gt; or &lt;code&gt;mistral&lt;/code&gt; is fully supported as a drop-in replacement that runs entirely offline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Getting Reliable JSON from the LLM
&lt;/h3&gt;

&lt;p&gt;The biggest challenge was getting the LLM to consistently return valid JSON for intent classification. The solution was writing a strict system prompt that explicitly states:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Respond ONLY with valid JSON, no markdown, no explanation"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also added a fallback JSON parser that strips markdown fences and uses regex to extract JSON objects from messy responses.&lt;/p&gt;
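A fallback parser along those lines can be sketched as follows; the `extract_json` name and exact regex are illustrative assumptions, not the project's code:

```python
import json
import re

FENCE = "`" * 3  # built here to avoid writing a literal fence in the example

def extract_json(reply):
    """Recover a JSON object from a messy LLM reply: drop markdown
    fences, then pull out the first {...} block with a regex."""
    cleaned = reply.replace(FENCE, "")
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

messy = "Sure! Here you go:\n" + FENCE + 'json\n{"intent": "summarize"}\n' + FENCE
print(extract_json(messy))  # {'intent': 'summarize'}
```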

&lt;h3&gt;
  
  
  2. File Safety
&lt;/h3&gt;

&lt;p&gt;When the LLM suggests a filename, it sometimes includes relative paths like &lt;code&gt;../../etc/passwd&lt;/code&gt;. I solved this with &lt;code&gt;os.path.basename()&lt;/code&gt; which strips any directory components, combined with a character sanitization step that removes unsafe characters.&lt;/p&gt;
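The sanitization step can be sketched like this; the `safe_filename` name and the exact allowed-character set are illustrative assumptions around the `os.path.basename()` core the article describes:

```python
import os
import re

def safe_filename(name):
    """Strip directory components and unsafe characters so generated
    files can only land inside the output/ folder."""
    base = os.path.basename(name)               # drops any path prefix
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base)  # replace unsafe chars
    return base or "untitled.txt"               # never return an empty name

print(safe_filename("../../etc/passwd"))  # passwd
print(safe_filename("my report!.txt"))    # my_report_.txt
```

Note that `basename` alone defeats traversal like `../../etc/passwd`; the character filter handles the rest.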

&lt;h3&gt;
  
  
  3. Audio Format Compatibility
&lt;/h3&gt;

&lt;p&gt;Gradio records audio in WebM format by default on some browsers, but Whisper works best with WAV or MP3. I handled this by detecting the file extension and setting the correct MIME type in the API request.&lt;/p&gt;
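That detection step can be sketched as a small extension-to-MIME lookup; the mapping table and `mime_for` helper are illustrative assumptions:

```python
import os

# Map recorded-audio extensions to the MIME type sent with the STT
# request (WebM is what some browsers' Gradio recordings produce).
AUDIO_MIME = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".webm": "audio/webm",
    ".m4a": "audio/mp4",
    ".ogg": "audio/ogg",
}

def mime_for(path):
    ext = os.path.splitext(path)[1].lower()
    return AUDIO_MIME.get(ext, "application/octet-stream")

print(mime_for("recording.webm"))  # audio/webm
```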

&lt;h3&gt;
  
  
  4. Hardware Constraints
&lt;/h3&gt;

&lt;p&gt;Running Whisper Large locally requires a GPU and significant RAM. I addressed this by making every component swappable via environment variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-end machine → use Groq API for both STT and LLM&lt;/li&gt;
&lt;li&gt;Powerful machine → run everything locally through Ollama + openai-whisper&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example Flow
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User says:&lt;/strong&gt; &lt;em&gt;"Create a Python file with a retry function"&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🎙️  Audio recorded
        ↓
📝  Transcribed: "Create a Python file with a retry function"
        ↓
🧠  Intent: write_code | language: python | filename: retry.py
        ↓
⚡  LLM generates Python retry function code
        ↓
💾  Saved to: output/retry.py
        ↓
✅  UI displays transcription, intent, action, and code preview
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;voice-ai-agent/
├── app.py              # Gradio UI
├── stt.py              # Speech-to-Text (Whisper)
├── intent.py           # Intent classification (LLaMA3)
├── tools.py            # Tool execution
├── requirements.txt    # Dependencies
└── output/             # Generated files (sandboxed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Run It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone the repo&lt;/span&gt;
git clone https://github.com/AryanJaitely/voice-ai-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;voice-ai-agent

&lt;span class="c"&gt;# 2. Install dependencies&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;gradio requests python-dotenv

&lt;span class="c"&gt;# 3. Create .env file&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"STT_BACKEND=groq"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; .env
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"LLM_BACKEND=groq"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"GROQ_API_KEY=your_key_here"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env

&lt;span class="c"&gt;# 4. Run&lt;/span&gt;
python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get a free Groq API key at: &lt;a href="https://console.groq.com" rel="noopener noreferrer"&gt;https://console.groq.com&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building this voice agent taught me a lot about chaining AI models together reliably. The key insight is that each component (STT, intent, tools) should be independently swappable — this makes the system both resilient and flexible for different hardware constraints.&lt;/p&gt;

&lt;p&gt;The full source code is available on GitHub:&lt;br&gt;
👉 &lt;strong&gt;&lt;a href="https://github.com/AryanJaitely/voice-ai-agent" rel="noopener noreferrer"&gt;https://github.com/AryanJaitely/voice-ai-agent&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Python, Gradio, Groq, Whisper &amp;amp; LLaMA 3&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
