<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishaan-Chaturved1</title>
    <description>The latest articles on DEV Community by Ishaan-Chaturved1 (@ishaanchaturved1).</description>
    <link>https://dev.to/ishaanchaturved1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878717%2F8cedae12-4561-41c0-b302-7cc7c831d6c7.png</url>
      <title>DEV Community: Ishaan-Chaturved1</title>
      <link>https://dev.to/ishaanchaturved1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishaanchaturved1"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled AI Agent using AssemblyAI and Groq</title>
      <dc:creator>Ishaan-Chaturved1</dc:creator>
      <pubDate>Tue, 14 Apr 2026 17:25:39 +0000</pubDate>
      <link>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-gg7</link>
      <guid>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-gg7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, I built a voice-controlled AI agent that converts spoken commands into executable actions such as generating code, creating files, and summarizing text.&lt;/p&gt;

&lt;p&gt;The goal was to combine speech processing, language models, and tool execution into a single pipeline that feels like a real-world AI assistant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system follows a simple but powerful pipeline:&lt;/p&gt;

&lt;p&gt;Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output&lt;/p&gt;

&lt;p&gt;Each stage is modular, making the system easy to extend and debug.&lt;/p&gt;
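
&lt;p&gt;As a rough sketch, the pipeline can be expressed as a chain of swappable functions. The stage bodies below are illustrative stubs, not the project's actual code:&lt;/p&gt;

```python
# Illustrative stubs for the pipeline stages; each one is swappable in isolation.

def transcribe(audio_path):
    # The real project calls AssemblyAI here; stubbed for illustration.
    return "create a python file with a fibonacci function"

def detect_intent(text):
    # The real project calls a Groq-hosted LLM here; stubbed for illustration.
    return {"intents": ["write_code", "create_file"], "params": {}}

def execute(plan):
    # Dispatch each detected intent to its tool.
    return [f"ran {intent}" for intent in plan["intents"]]

def run_pipeline(audio_path):
    text = transcribe(audio_path)
    plan = detect_intent(text)
    return execute(plan)

print(run_pipeline("command.wav"))  # one audio file in, a list of results out
```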




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Speech-to-Text: AssemblyAI&lt;/li&gt;
&lt;li&gt;Language Model: Groq (llama-3.1-8b-instant)&lt;/li&gt;
&lt;li&gt;Frontend: Streamlit&lt;/li&gt;
&lt;li&gt;Backend: Python&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Speech-to-Text
&lt;/h3&gt;

&lt;p&gt;The user uploads an audio file, which is transcribed into text using AssemblyAI.&lt;br&gt;
This step converts unstructured voice input into usable text.&lt;/p&gt;
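
&lt;p&gt;Roughly, the transcription step can look like this with the &lt;code&gt;assemblyai&lt;/code&gt; SDK. The environment variable name and the error-handling helper are illustrative, and the network call is wrapped in a function so nothing runs without an API key:&lt;/p&gt;

```python
import os

def check_transcript(status, text, error=None):
    # Fail loudly on transcription errors instead of passing empty text downstream.
    if status == "error":
        raise RuntimeError(f"transcription failed: {error}")
    return text.strip()

def transcribe_upload(audio_path):
    # Network call; needs an AssemblyAI key, so it is defined but not run here.
    import assemblyai as aai  # pip install assemblyai
    aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
    transcript = aai.Transcriber().transcribe(audio_path)
    status = "error" if transcript.status == aai.TranscriptStatus.error else "completed"
    return check_transcript(status, transcript.text or "", transcript.error)
```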


&lt;h3&gt;
  
  
  2. Intent Detection
&lt;/h3&gt;

&lt;p&gt;The transcribed text is sent to a language model hosted on Groq.&lt;/p&gt;

&lt;p&gt;The model analyzes the command and returns structured output like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create_file"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the system to support multiple actions in a single command.&lt;/p&gt;
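
&lt;p&gt;Since LLMs occasionally wrap JSON in extra prose, a defensive parser helps. Here is one possible sketch (not the project's exact code):&lt;/p&gt;

```python
import json

def parse_plan(reply):
    # Extract the first JSON object from the reply, tolerating surrounding prose.
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1:
        return {"intents": [], "params": {}}
    try:
        plan = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return {"intents": [], "params": {}}
    # Guarantee the keys downstream code relies on.
    plan.setdefault("intents", [])
    plan.setdefault("params", {})
    return plan
```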




&lt;h3&gt;
  
  
  3. Tool Execution
&lt;/h3&gt;

&lt;p&gt;Based on detected intents, the system executes actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating code&lt;/li&gt;
&lt;li&gt;Creating files&lt;/li&gt;
&lt;li&gt;Summarizing text&lt;/li&gt;
&lt;li&gt;General chat responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are restricted to a safe &lt;code&gt;output/&lt;/code&gt; directory.&lt;/p&gt;
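
&lt;p&gt;One way to enforce that restriction is to resolve every requested path and reject anything that escapes &lt;code&gt;output/&lt;/code&gt;. A sketch, assuming Python 3.9+ for &lt;code&gt;is_relative_to&lt;/code&gt;:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename):
    # Resolve the requested name inside output/ and reject escapes such as "../".
    candidate = (OUTPUT_DIR / filename).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"refusing to write outside output/: {filename}")
    return candidate
```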




&lt;h3&gt;
  
  
  4. User Interface
&lt;/h3&gt;

&lt;p&gt;The UI is built using Streamlit and shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transcribed text&lt;/li&gt;
&lt;li&gt;Detected intent&lt;/li&gt;
&lt;li&gt;Execution results&lt;/li&gt;
&lt;li&gt;Session history&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Input:
&lt;/h3&gt;

&lt;p&gt;"Create a Python file with a Fibonacci function"&lt;/p&gt;

&lt;h3&gt;
  
  
  Output:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Code is generated&lt;/li&gt;
&lt;li&gt;File is created in the output folder&lt;/li&gt;
&lt;li&gt;Results are displayed in the UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bonus Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compound Commands
&lt;/h3&gt;

&lt;p&gt;The system supports multiple actions in one input:&lt;/p&gt;

&lt;p&gt;"Summarize this text and save it to summary.txt"&lt;/p&gt;




&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;Before file operations, the user is asked to confirm execution.&lt;br&gt;
This adds a layer of safety and control.&lt;/p&gt;
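
&lt;p&gt;In the app this is a Streamlit confirmation step; below is a console-style sketch of the same idea, with the prompt function injectable so the gate can be tested or wired to a UI button (names are illustrative):&lt;/p&gt;

```python
def confirm(prompt, ask=input):
    # 'ask' is injectable for testing; in the app it maps to a confirm button.
    answer = ask(f"{prompt} [y/N] ").strip().lower()
    return answer in ("y", "yes")

def guarded_write(path, content, ask=input):
    # Anything other than an explicit yes cancels the file operation.
    if not confirm(f"Write {path}?", ask):
        return "cancelled"
    # A real implementation would write 'content' to 'path' here.
    return f"wrote {path}"
```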




&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;If intent detection fails, the system falls back to keyword-based classification instead of crashing.&lt;/p&gt;
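
&lt;p&gt;A fallback of this kind can be as simple as keyword matching. The keyword map below is illustrative; the real system's vocabulary may differ:&lt;/p&gt;

```python
# Illustrative keyword-to-intent map used only when the LLM path fails.
KEYWORD_MAP = {
    "write_code": ("code", "function", "script"),
    "create_file": ("file", "save"),
    "summarize": ("summarize", "summary"),
}

def fallback_intents(text):
    lowered = text.lower()
    hits = [intent for intent, words in KEYWORD_MAP.items()
            if any(word in lowered for word in words)]
    return hits or ["chat"]  # default to a plain chat response
```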




&lt;h3&gt;
  
  
  Session Memory
&lt;/h3&gt;

&lt;p&gt;The agent maintains a history of interactions within the session.&lt;/p&gt;
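
&lt;p&gt;In Streamlit this kind of history typically lives in &lt;code&gt;st.session_state&lt;/code&gt;; here is a framework-free sketch of the same idea:&lt;/p&gt;

```python
class SessionMemory:
    """Rolling history of (command, result) turns for the current session."""

    def __init__(self, max_turns=20):
        self.max_turns = max_turns
        self.turns = []

    def add(self, command, result):
        self.turns.append({"command": command, "result": result})
        # Bound memory use by keeping only the most recent turns.
        self.turns = self.turns[-self.max_turns:]

    def transcript(self):
        # Flatten the history into a prompt-friendly string.
        return "\n".join(
            f"User: {t['command']}\nAgent: {t['result']}" for t in self.turns
        )
```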




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Local Model Limitations
&lt;/h3&gt;

&lt;p&gt;Initially, I used local models:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;Whisper (via Hugging Face) for speech-to-text&lt;/li&gt;
&lt;li&gt;Ollama for language models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this approach led to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FFmpeg setup issues on Windows&lt;/li&gt;
&lt;li&gt;High memory usage&lt;/li&gt;
&lt;li&gt;Slow performance on CPU&lt;/li&gt;
&lt;li&gt;Frequent crashes&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Switching to API-based Models
&lt;/h3&gt;

&lt;p&gt;After exploring developer discussions (including Reddit), I switched to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AssemblyAI for STT&lt;/li&gt;
&lt;li&gt;Groq for LLM inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This significantly improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speed&lt;/li&gt;
&lt;li&gt;Stability&lt;/li&gt;
&lt;li&gt;Ease of setup&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Model Deprecation Issues
&lt;/h3&gt;

&lt;p&gt;While using Groq, some models were deprecated during development.&lt;br&gt;
This required updating model names and adapting quickly to API changes.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Output Cleaning
&lt;/h3&gt;

&lt;p&gt;Language models sometimes returned explanations along with code.&lt;br&gt;
This was fixed by enforcing strict prompts and cleaning responses before saving.&lt;/p&gt;
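
&lt;p&gt;One possible cleaning helper keeps only the fenced portion of a reply. This is illustrative, not the project's exact code; the triple-backtick marker is built with &lt;code&gt;chr(96)&lt;/code&gt; so the snippet can itself sit inside a fenced block:&lt;/p&gt;

```python
FENCE = chr(96) * 3  # triple backtick, built indirectly so this snippet
                     # can itself live inside a fenced code block

def extract_code(reply):
    # Keep only the first fenced block, dropping prose and a language tag line.
    if FENCE not in reply:
        return reply.strip()
    block = reply.split(FENCE)[1]
    lines = block.splitlines()
    if lines and lines[0].strip().isalpha():
        lines = lines[1:]
    return "\n".join(lines).strip()
```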




&lt;h2&gt;
  
  
  Model Benchmarking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Stability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM (Local)&lt;/td&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Unstable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM (API)&lt;/td&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;Very Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;API-based models clearly outperformed local setups in this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Shipping a reliable system matters more than insisting on local models&lt;/li&gt;
&lt;li&gt;APIs can significantly improve performance and developer experience&lt;/li&gt;
&lt;li&gt;Fallback mechanisms are essential in AI systems&lt;/li&gt;
&lt;li&gt;Debugging agent pipelines requires step-by-step visibility&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time microphone input&lt;/li&gt;
&lt;li&gt;Persistent memory across sessions&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;More advanced tool integrations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how speech, language models, and execution logic can be combined to build a practical AI agent.&lt;/p&gt;

&lt;p&gt;It also highlights the tradeoffs between local and API-based approaches, and the importance of choosing the right tools based on system constraints.&lt;/p&gt;




</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a Voice-Controlled AI Agent using AssemblyAI and Groq</title>
      <dc:creator>Ishaan-Chaturved1</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:29:12 +0000</pubDate>
      <link>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-3ao6</link>
      <guid>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-3ao6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, I built a voice-controlled AI agent that converts spoken commands into executable actions like generating code and creating files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The system follows a modular pipeline:&lt;/p&gt;

&lt;p&gt;Audio → STT → Intent Detection → Tool Execution → Output&lt;/p&gt;




&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AssemblyAI for speech-to-text&lt;/li&gt;
&lt;li&gt;Groq LLM (llama-3.1-8b-instant) for intent classification&lt;/li&gt;
&lt;li&gt;Streamlit for UI&lt;/li&gt;
&lt;li&gt;Python for backend agent logic&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;User uploads audio&lt;/li&gt;
&lt;li&gt;Audio is transcribed into text&lt;/li&gt;
&lt;li&gt;LLM detects intent (multi-intent supported)&lt;/li&gt;
&lt;li&gt;Agent executes actions&lt;/li&gt;
&lt;li&gt;Output is displayed and files are created&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ollama instability in the local setup&lt;/li&gt;
&lt;li&gt;Model deprecations in Groq&lt;/li&gt;
&lt;li&gt;Handling multi-intent parsing&lt;/li&gt;
&lt;li&gt;Debugging silent failures in Streamlit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Importance of fallback mechanisms&lt;/li&gt;
&lt;li&gt;API-based models are more stable than local inference&lt;/li&gt;
&lt;li&gt;Proper debugging is critical in agent systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add real-time voice input&lt;/li&gt;
&lt;li&gt;Integrate memory and context&lt;/li&gt;
&lt;li&gt;Add RAG for knowledge-based queries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how AI agents can combine speech, reasoning, and actions into a seamless user experience.&lt;/p&gt;




</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
