<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: VARUN M</title>
    <description>The latest articles on DEV Community by VARUN M (@varun_m_77).</description>
    <link>https://dev.to/varun_m_77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875264%2Fc3e949d6-c1b1-4a56-8a60-c2cd23ca93c3.jpg</url>
      <title>DEV Community: VARUN M</title>
      <link>https://dev.to/varun_m_77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/varun_m_77"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Groq, Ollama, and Gradio</title>
      <dc:creator>VARUN M</dc:creator>
      <pubDate>Sun, 12 Apr 2026 17:31:17 +0000</pubDate>
      <link>https://dev.to/varun_m_77/building-a-voice-controlled-local-ai-agent-with-groq-ollama-and-gradio-137p</link>
      <guid>https://dev.to/varun_m_77/building-a-voice-controlled-local-ai-agent-with-groq-ollama-and-gradio-137p</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent with Groq, Ollama, and Gradio
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;What if you could just speak to your computer and have it write code, summarize text, or create files — all locally on your machine? That's exactly what I built for my internship assignment at Mem0 AI.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through how I designed and built a voice-controlled local AI agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts audio via microphone or file upload&lt;/li&gt;
&lt;li&gt;Transcribes speech to text using Groq Whisper&lt;/li&gt;
&lt;li&gt;Classifies intent using LLaMA 3.3 70B&lt;/li&gt;
&lt;li&gt;Executes local tools (file creation, code generation, summarization, chat)&lt;/li&gt;
&lt;li&gt;Displays everything in a clean Gradio UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Audio Input (Mic / File Upload)&lt;br&gt;
↓&lt;br&gt;
STT: Groq Whisper API (whisper-large-v3)&lt;br&gt;
↓&lt;br&gt;
Intent Classification: Groq API (llama-3.3-70b-versatile)&lt;br&gt;
↓&lt;br&gt;
Tool Execution: Groq API (llama-3.3-70b-versatile)&lt;br&gt;
↓&lt;br&gt;
UI Display: Gradio&lt;/p&gt;

&lt;p&gt;The pipeline is simple and modular — each stage is isolated in its own file (&lt;code&gt;stt.py&lt;/code&gt;, &lt;code&gt;intent.py&lt;/code&gt;, &lt;code&gt;tools.py&lt;/code&gt;), making it easy to swap models or extend functionality.&lt;/p&gt;
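&lt;p&gt;The modular wiring above can be sketched as a small orchestrator. This is a minimal sketch, assuming each stage module exposes a single entry function; the names &lt;code&gt;transcribe&lt;/code&gt;, &lt;code&gt;classify&lt;/code&gt;, and &lt;code&gt;execute&lt;/code&gt; are illustrative, not the repo's exact API:&lt;/p&gt;

```python
def run_pipeline(audio_path, transcribe, classify, execute):
    """Run the STT, intent, and tool stages, passing each result forward."""
    text = transcribe(audio_path)                   # stt.py
    intents = classify(text)                        # intent.py
    results = [execute(i, text) for i in intents]   # tools.py
    return text, intents, results

# Stub stages stand in for the real Groq-backed implementations.
text, intents, results = run_pipeline(
    "clip.wav",
    transcribe=lambda path: "write a hello world script",
    classify=lambda text: ["write_code"],
    execute=lambda intent, text: f"ran {intent}",
)
```

&lt;p&gt;Because each stage is injected rather than hard-wired, swapping a model means changing one function, not the pipeline.&lt;/p&gt;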




&lt;h2&gt;
  
  
  Models Chosen and Why
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speech-to-Text: Groq Whisper (whisper-large-v3)
&lt;/h3&gt;

&lt;p&gt;The assignment recommended running a Hugging Face model like Whisper locally. However, my machine is a MacBook Air with only 8GB of RAM, so running Whisper locally would be slow and unreliable. Instead, I used the Groq API to run Whisper, which is significantly faster (typically under 2 seconds for a 10-second clip) and free on Groq's free tier.&lt;/p&gt;
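&lt;p&gt;The STT call looks roughly like this. It's a hedged sketch using Groq's OpenAI-compatible Python SDK; the file-tuple upload form and the injectable &lt;code&gt;client&lt;/code&gt; parameter are my additions for illustration:&lt;/p&gt;

```python
import os

def transcribe(audio_path, client=None):
    """Send an audio clip to Groq-hosted Whisper and return the transcript."""
    if client is None:
        from groq import Groq  # imported lazily so this sketch loads without the SDK
        client = Groq(api_key=os.environ["GROQ_API_KEY"])
    with open(audio_path, "rb") as f:
        resp = client.audio.transcriptions.create(
            file=(audio_path, f.read()),
            model="whisper-large-v3",
        )
    return resp.text
```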

&lt;h3&gt;
  
  
  Intent Classification: llama-3.3-70b-versatile via Groq
&lt;/h3&gt;

&lt;p&gt;I initially tried using Ollama with &lt;code&gt;llama3.2:1b&lt;/code&gt; locally for intent classification. The problem was that small models struggle to reliably output structured JSON. Switching to &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt; via Groq gave consistent, accurate JSON intent classification every time.&lt;/p&gt;
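&lt;p&gt;Classification with a guard against malformed output can be sketched like this. The prompt wording and the &lt;code&gt;parse_intents&lt;/code&gt; helper are illustrative, not the repo's exact code; invalid JSON falls back to chat mode:&lt;/p&gt;

```python
import json
import os

INTENTS = ["write_code", "create_file", "summarize", "chat"]

def parse_intents(raw):
    """Validate the model's JSON output; fall back to chat mode otherwise."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return ["chat"]
    if isinstance(data, list) and data and all(i in INTENTS for i in data):
        return data
    return ["chat"]

def classify_intent(text):
    from groq import Groq  # imported lazily so this sketch loads without the SDK
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Classify the user's request into a JSON list of intents "
                "drawn from " + str(INTENTS) + ". Respond with JSON only, "
                'e.g. ["summarize", "create_file"].')},
            {"role": "user", "content": text},
        ],
    )
    return parse_intents(resp.choices[0].message.content)
```

&lt;p&gt;Setting &lt;code&gt;temperature=0&lt;/code&gt; helps keep the structured output deterministic.&lt;/p&gt;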

&lt;h3&gt;
  
  
  Tool Execution: llama-3.3-70b-versatile via Groq
&lt;/h3&gt;

&lt;p&gt;All tool execution — code generation, summarization, and chat — also uses the same Groq model. This keeps latency low and quality high without any local GPU requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Supported Intents
&lt;/h2&gt;

&lt;p&gt;The agent supports four core intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;write_code + create_file&lt;/strong&gt; — generates code and saves it to &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;create_file&lt;/strong&gt; — creates a file with specified content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;summarize&lt;/strong&gt; — summarizes provided text (optionally saves to file)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;chat&lt;/strong&gt; — general conversation with memory across the session&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bonus Features Implemented
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compound Commands
&lt;/h3&gt;

&lt;p&gt;The agent handles multi-intent commands in a single audio input. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Summarize this text and save it to summary.txt"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The intent classifier returns &lt;code&gt;["summarize", "create_file"]&lt;/code&gt; and the tools pipeline handles both in sequence.&lt;/p&gt;
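&lt;p&gt;Sequencing compound intents amounts to threading each tool's output into the next. A minimal sketch, with stand-in tool functions instead of the real Groq-backed ones:&lt;/p&gt;

```python
def handle_compound(intents, text, tools):
    """Run classified intents in order, feeding each result to the next tool."""
    result = text
    for intent in intents:
        result = tools[intent](result)
    return result

# Stand-in tools; the real ones call the Groq model and write to output/.
tools = {
    "summarize": lambda text: "summary of: " + text,
    "create_file": lambda content: f"saved ({len(content)} chars)",
}
out = handle_compound(["summarize", "create_file"], "long text...", tools)
```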

&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;Before executing any file operation, the UI shows a confirmation prompt with Confirm and Cancel buttons. This prevents accidental file writes.&lt;/p&gt;
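&lt;p&gt;Independent of Gradio's widget API, the confirm/cancel flow boils down to staging the action and only executing it on an explicit confirm. A sketch with illustrative names and messages:&lt;/p&gt;

```python
pending = {}

def propose(action, payload):
    """Stage a file operation and ask the user to confirm it."""
    pending["action"] = (action, payload)
    return f"Confirm {action}?"

def confirm():
    """Execute the staged action; a real handler would run the tool here."""
    action, payload = pending.pop("action")
    return f"executed {action}"

def cancel():
    """Discard the staged action without executing it."""
    pending.pop("action", None)
    return "cancelled"
```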

&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;Every function is wrapped in try/except blocks. If the STT fails, the error is displayed cleanly. If intent classification returns an unexpected format, it falls back to chat mode.&lt;/p&gt;
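&lt;p&gt;That try/except wrapping can be factored into one helper that turns any stage failure into a clean UI message (a sketch of the idea; the real code wraps each function individually):&lt;/p&gt;

```python
def safe_call(fn, *args, fallback="Something went wrong"):
    """Run a pipeline stage, surfacing any exception as a clean message."""
    try:
        return fn(*args)
    except Exception as exc:
        return f"{fallback}: {exc}"
```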

&lt;h3&gt;
  
  
  Persistent Memory
&lt;/h3&gt;

&lt;p&gt;Session history is saved to a &lt;code&gt;session_history.json&lt;/code&gt; file. Every action is logged with its transcription, intent, and result. The history persists across app restarts and is displayed as a table in the UI.&lt;/p&gt;

&lt;p&gt;Chat context is also maintained within a session — the agent remembers previous messages for coherent multi-turn conversations.&lt;/p&gt;
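&lt;p&gt;Persistence of this kind can be sketched as an append-to-JSON logger. The field names mirror the article; the repo's actual schema may differ:&lt;/p&gt;

```python
import json
import os

def log_action(transcription, intent, result, path="session_history.json"):
    """Append one action record to the JSON history file."""
    entries = []
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    entries.append({"transcription": transcription,
                    "intent": intent,
                    "result": result})
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)
```

&lt;p&gt;Because the whole file is re-read on every append, this stays correct across app restarts without any in-memory state.&lt;/p&gt;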




&lt;h2&gt;
  
  
  Model Benchmarking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Groq Whisper Large v3&lt;/td&gt;
&lt;td&gt;~1.5s&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent&lt;/td&gt;
&lt;td&gt;llama-3.3-70b (Groq)&lt;/td&gt;
&lt;td&gt;~1.2s&lt;/td&gt;
&lt;td&gt;Consistent JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent (attempted)&lt;/td&gt;
&lt;td&gt;llama3.2:1b (Ollama local)&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;td&gt;Inconsistent JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Gen&lt;/td&gt;
&lt;td&gt;llama-3.3-70b (Groq)&lt;/td&gt;
&lt;td&gt;~2-3s&lt;/td&gt;
&lt;td&gt;Clean, executable code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest performance difference was in intent classification. The local &lt;code&gt;llama3.2:1b&lt;/code&gt; model frequently failed to return valid JSON, causing fallback to chat mode. Switching to the 70B model via Groq solved this completely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Model size vs RAM constraints&lt;/strong&gt;&lt;br&gt;
Running 7B+ parameter models locally on 8GB RAM caused slowdowns and timeouts. The solution was offloading STT and LLM inference to Groq's free API while keeping the architecture "local-first" in spirit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Intent classification reliability&lt;/strong&gt;&lt;br&gt;
Small models are not reliable at following structured output instructions. The fix was using a larger, smarter model with a well-crafted system prompt and few-shot examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Browser compatibility&lt;/strong&gt;&lt;br&gt;
Gradio's streaming/generator approach caused Safari to drop WebSocket connections on long requests. Switching to Chrome and using a non-streaming approach with &lt;code&gt;app.queue()&lt;/code&gt; solved the freezing issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Markdown fences in generated code&lt;/strong&gt;&lt;br&gt;
The LLM kept wrapping generated code in markdown fences (&lt;code&gt;```python ... ```&lt;/code&gt;). This was fixed by stripping the fences before writing the code to file.&lt;/p&gt;
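&lt;p&gt;The fence-stripping step can be as simple as dropping a leading and trailing backtick line (a sketch of the idea, not the repo's exact helper):&lt;/p&gt;

```python
def strip_fences(text):
    """Drop leading/trailing backtick fence lines (with optional language tag)."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)
```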




&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/varun-2437/local-ai-voice-agent
cd local-ai-voice-agent/voice-agent
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add your Groq API key to &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;code&gt;GROQ_API_KEY=your_key_here&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve   # in one terminal
python app.py  # in another
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project taught me a lot about building practical AI pipelines — choosing the right model for the right job, handling real hardware constraints, and making systems robust with graceful error handling. The combination of Groq's fast free API and Gradio's simple UI framework made it surprisingly easy to go from idea to working product in a short time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/varun-2437/local-ai-voice-agent" rel="noopener noreferrer"&gt;https://github.com/varun-2437/local-ai-voice-agent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>gradio</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
