<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Amratanshu Dwivedi</title>
    <description>The latest articles on DEV Community by Amratanshu Dwivedi (@amratanshu_dwivedi_305ab4).</description>
    <link>https://dev.to/amratanshu_dwivedi_305ab4</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3882208%2Fa00ec8ed-117e-4676-9edb-e804571166f7.jpg</url>
      <title>DEV Community: Amratanshu Dwivedi</title>
      <link>https://dev.to/amratanshu_dwivedi_305ab4</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amratanshu_dwivedi_305ab4"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled AI Agent with Groq and Gradio</title>
      <dc:creator>Amratanshu Dwivedi</dc:creator>
      <pubDate>Thu, 16 Apr 2026 10:32:28 +0000</pubDate>
      <link>https://dev.to/amratanshu_dwivedi_305ab4/building-a-voice-controlled-ai-agent-with-groq-and-gradio-46do</link>
      <guid>https://dev.to/amratanshu_dwivedi_305ab4/building-a-voice-controlled-ai-agent-with-groq-and-gradio-46do</guid>
      <description>&lt;p&gt;Have you ever wanted to just talk to your computer and have it actually do something — create files, write code, or answer questions? That's exactly what I built for my Mem0 AI internship assignment: a fully working Voice-Controlled AI Agent.&lt;br&gt;
In this article, I'll walk you through what I built, the architecture behind it, why I chose the tools I did, the challenges I faced, and what I learned along the way.&lt;/p&gt;

&lt;p&gt;🎯 What I Built&lt;br&gt;
A voice-controlled AI agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts audio from your microphone or an uploaded audio file&lt;/li&gt;
&lt;li&gt;Transcribes your speech to text using Whisper&lt;/li&gt;
&lt;li&gt;Detects your intent — do you want to create a file, write code, summarize something, or just chat?&lt;/li&gt;
&lt;li&gt;Executes the action on your local machine&lt;/li&gt;
&lt;li&gt;Shows all results in a clean web UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a quick example of the full flow:&lt;/p&gt;

&lt;p&gt;You say: "Write Python code for a retry function"&lt;br&gt;
The agent transcribes it → detects intent as Write Code → generates the Python code using an LLM → saves it to output/retry.py → shows you the result in the UI.&lt;/p&gt;

&lt;p&gt;All of this happens in under 5 seconds.&lt;/p&gt;

&lt;p&gt;🏗️ Architecture&lt;br&gt;
The project is split into 4 clean modules:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;User speaks
    ↓
Audio file (.wav / .mp3)
    ↓
[stt.py] → Groq Whisper Large V3 → Transcribed Text
    ↓
[intent.py] → LLaMA 3.3-70B → Intent Classification (JSON)
    ↓
[tools.py] → Tool Router
    ├── create_file  → Creates empty file in output/
    ├── write_code   → Generates &amp;amp; saves code to output/
    ├── summarize    → Summarizes topic, saves to output/summary.txt
    └── general_chat → LLM conversation response
    ↓
[app.py] → Gradio UI → Displays transcription, intent, action, output
    ↓
Session Memory → Tracks all actions in the session
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Each module has one job and does it well. This separation makes the code easy to debug, easy to extend, and easy to understand.&lt;/p&gt;

&lt;p&gt;🛠️ Tech Stack and Why I Chose It&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Component&lt;/th&gt;&lt;th&gt;Tool&lt;/th&gt;&lt;th&gt;Reason&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;UI&lt;/td&gt;&lt;td&gt;Gradio&lt;/td&gt;&lt;td&gt;Fastest way to build a professional AI UI in Python&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Speech-to-Text&lt;/td&gt;&lt;td&gt;Groq Whisper Large V3&lt;/td&gt;&lt;td&gt;Free, fast, no GPU needed&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;LLM&lt;/td&gt;&lt;td&gt;Groq LLaMA 3.3-70B&lt;/td&gt;&lt;td&gt;Free API, blazing fast, excellent quality&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;File Operations&lt;/td&gt;&lt;td&gt;Python pathlib&lt;/td&gt;&lt;td&gt;Built-in, no extra dependencies&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Environment&lt;/td&gt;&lt;td&gt;python-dotenv&lt;/td&gt;&lt;td&gt;Secure API key management&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Why Groq Instead of a Local Model?&lt;br&gt;
The assignment mentioned using a local model via Ollama or LM Studio. I evaluated this option but chose Groq's API instead for three strong reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;No GPU required. Running Whisper Large V3 or LLaMA 70B locally takes serious hardware: Whisper Large V3 wants a capable GPU, and a 70B model needs tens of gigabytes of VRAM even when quantized. Most student laptops have neither. Groq runs everything on their hardware.&lt;/li&gt;
&lt;li&gt;Speed. Groq uses custom LPU (Language Processing Unit) chips that are genuinely faster than most local GPU setups. Responses come back in 1-2 seconds.&lt;/li&gt;
&lt;li&gt;Free tier. Groq's free tier is generous enough to build and demo this entire project without paying anything.
This is documented in the README as a hardware workaround, as the assignment allows.&lt;/li&gt;
&lt;/ol&gt;
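&lt;p&gt;As a rough sketch, the four-module flow described above can be wired together as a tiny orchestrator. The function names and stub stages here are hypothetical stand-ins, not the repo's exact code; the stages are injected so the wiring can be shown without a Groq API key:&lt;/p&gt;

```python
# Minimal sketch of the four-stage pipeline. Stage functions are injected
# so the wiring can be demonstrated without calling the Groq API.
def run_pipeline(audio_path, transcribe, classify_intent, execute_tool):
    """audio -> text -> intent -> tool result, mirroring stt/intent/tools."""
    text = transcribe(audio_path)      # stt.py: Whisper transcription
    intent = classify_intent(text)     # intent.py: JSON intent dict
    output = execute_tool(intent)      # tools.py: routed action
    return {"transcription": text, "intent": intent["intent"], "output": output}

# Stub stages standing in for the real Groq-backed implementations.
def fake_transcribe(path):
    return "write python code for a retry function"

def fake_classify(text):
    return {"intent": "write_code", "details": text, "filename": "retry.py"}

def fake_execute(intent):
    return f"saved to output/{intent['filename']}"

result = run_pipeline("demo.wav", fake_transcribe, fake_classify, fake_execute)
print(result["output"])  # saved to output/retry.py
```

&lt;p&gt;Swapping the stubs for the real Groq-backed functions keeps app.py itself trivial.&lt;/p&gt;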

&lt;p&gt;🎯 The 4 Supported Intents&lt;br&gt;
The heart of the agent is the intent classifier. I used a structured LLM prompt that forces the model to return clean JSON every time:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;prompt = f"""
Classify this command into ONE of:
create_file, write_code, summarize, general_chat

Command: "{text}"

Respond ONLY with JSON:
{{
  "intent": "write_code",
  "details": "Python retry function",
  "filename": "retry.py"
}}
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This gives the agent four distinct capabilities:&lt;br&gt;
📁 Create File — "Create a text file called meeting notes"&lt;br&gt;
Creates an empty file instantly in the output/ folder.&lt;br&gt;
💻 Write Code — "Write JavaScript code for a todo list"&lt;br&gt;
Generates complete, commented, production-quality code and saves it to a file.&lt;br&gt;
📋 Summarize — "Summarize the benefits of neural networks"&lt;br&gt;
Produces a structured summary with bullet points and saves it to output/summary.txt.&lt;br&gt;
💬 General Chat — "What is the difference between AI and ML?"&lt;br&gt;
Responds conversationally to any question.&lt;/p&gt;
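&lt;p&gt;The routing from classified intent to action can be pictured as a small dispatch table. This is a hedged sketch with simplified stand-in handlers, not the actual tools.py implementation:&lt;/p&gt;

```python
# Dispatch table mapping the four intents to handler functions.
# Handlers here are simplified stand-ins for the real tools.py actions.
def create_file(details, filename):
    return f"created empty file output/{filename}"

def write_code(details, filename):
    return f"generated code for '{details}', saved to output/{filename}"

def summarize(details, filename=None):
    return f"summary of '{details}' saved to output/summary.txt"

def general_chat(details, filename=None):
    return f"chat response to: {details}"

TOOL_ROUTER = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def route(intent_json):
    # Unknown intents fall back to general_chat so the app never crashes.
    handler = TOOL_ROUTER.get(intent_json.get("intent"), general_chat)
    return handler(intent_json.get("details", ""), intent_json.get("filename"))
```

&lt;p&gt;Adding a fifth capability then means writing one handler and one dictionary entry, nothing else.&lt;/p&gt;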

&lt;p&gt;🧠 Bonus: Session Memory&lt;br&gt;
Since I was building this for Mem0 AI — a company that builds memory layers for AI agents — I made sure to implement the memory bonus feature.&lt;br&gt;
Every action the agent takes gets stored in a session history list:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;session_history.append({
    "transcription": transcription,
    "intent": intent_display,
    "action": action_taken,
    "output": final_output,
})
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is displayed in the UI as a running log of everything that happened in the session. It's a simple implementation, but it demonstrates the core concept that Mem0 is built around: agents need memory to be truly useful.&lt;/p&gt;
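&lt;p&gt;Rendering that list as a running log is a few lines of string formatting. A minimal sketch (the real app.py formatting may differ):&lt;/p&gt;

```python
session_history = []

def log_action(transcription, intent, action, output):
    # Append one record per agent action, mirroring the snippet above.
    session_history.append({
        "transcription": transcription,
        "intent": intent,
        "action": action,
        "output": output,
    })

def render_history(history):
    # Newest first, one numbered line per action, for a Gradio text panel.
    lines = []
    for i, entry in enumerate(reversed(history), 1):
        lines.append(f"{i}. [{entry['intent']}] \"{entry['transcription']}\" -> {entry['action']}")
    return "\n".join(lines) if lines else "No actions yet."

log_action("write a retry function", "Write Code", "saved output/retry.py", "def retry(): ...")
print(render_history(session_history))
```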

&lt;p&gt;😤 Challenges I Faced&lt;br&gt;
Challenge 1: The LLM Model Was Decommissioned&lt;br&gt;
Mid-development, I got this error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error code: 400 - The model 'llama3-70b-8192' has been decommissioned
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Groq had retired the model I was using. The fix was simple — update to llama-3.3-70b-versatile — but it taught me an important lesson: always check API deprecation notices and avoid hardcoding model names without a fallback plan.&lt;br&gt;
Challenge 2: Getting Clean JSON from the LLM&lt;br&gt;
The intent classifier needs to return valid JSON every time. But LLMs sometimes wrap their response in markdown code blocks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This breaks json.loads(). My fix was to strip the markdown fences before parsing:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;raw_response = raw_response.replace("```json", "").replace("```", "").strip()
result = json.loads(raw_response)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And I wrapped everything in a try/except that falls back to general_chat if parsing fails — so the app never crashes.&lt;br&gt;
Challenge 3: Keeping File Operations Safe&lt;br&gt;
The assignment required that all file creation be restricted to an output/ folder so the agent can't accidentally overwrite system files. I handled this with Python's pathlib:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;pre&gt;&lt;code&gt;filepath = OUTPUT_DIR / filename  # Always joined under output/
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For ordinary filenames this keeps every write inside output/, no matter what the user says; a filename containing "../" would still need to be rejected before writing.&lt;/p&gt;
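&lt;p&gt;One hardening step worth adding (illustrative only, not from the repo): resolve the joined path and verify it still sits under output/, so a spoken filename containing "../" cannot escape the folder:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

def safe_output_path(filename: str) -> Path:
    """Join filename under output/ and reject anything that escapes it."""
    candidate = (OUTPUT_DIR / filename).resolve()
    try:
        # relative_to raises ValueError when candidate is outside output/.
        candidate.relative_to(OUTPUT_DIR.resolve())
    except ValueError:
        raise ValueError(f"unsafe filename: {filename!r}")
    return candidate

print(safe_output_path("retry.py").name)  # retry.py
```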

&lt;p&gt;✅ What I Learned&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt engineering matters more than model choice.
A well-structured prompt with clear instructions and a defined output format consistently outperforms a bigger model with a vague prompt.&lt;/li&gt;
&lt;li&gt;API-based models are practical for hackathons and assignments.
The assignment encouraged local models, but for rapid prototyping and demo-quality results, a well-chosen API (especially a free one like Groq) is often the better tool.&lt;/li&gt;
&lt;li&gt;Modular code saves enormous time.
Because each module (stt.py, intent.py, tools.py) had a single responsibility, debugging was fast. When the model was decommissioned, I only had to update 2 files.&lt;/li&gt;
&lt;li&gt;Memory is a genuinely hard problem.
Implementing even the simple session history feature made me appreciate why Mem0 exists as a product. True persistent memory — across sessions, across users, with context retrieval — is a deep engineering challenge.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🔗 Links&lt;/p&gt;

&lt;p&gt;GitHub Repository: &lt;a href="https://github.com/Amratanshu-d/voice-ai-agent" rel="noopener noreferrer"&gt;https://github.com/Amratanshu-d/voice-ai-agent&lt;/a&gt;&lt;br&gt;
Video Demo: &lt;a href="https://youtu.be/fbXTjaXM-oI" rel="noopener noreferrer"&gt;https://youtu.be/fbXTjaXM-oI&lt;/a&gt;&lt;br&gt;
Assignment by: Mem0 AI — MLOps and AI Infra Internship&lt;/p&gt;

&lt;p&gt;Built with ❤️ for the Mem0 AI internship assignment. If you're building AI agents and want to add persistent memory to them, check out mem0.ai.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
