<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: TARANDEEP SINGH KHURANA</title>
    <description>The latest articles on DEV Community by TARANDEEP SINGH KHURANA (@tarandeep_singhkhurana_8).</description>
    <link>https://dev.to/tarandeep_singhkhurana_8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3880937%2F0267e509-1627-4631-81dd-0350c2b2e0e8.png</url>
      <title>DEV Community: TARANDEEP SINGH KHURANA</title>
      <link>https://dev.to/tarandeep_singhkhurana_8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tarandeep_singhkhurana_8"/>
    <language>en</language>
    <item>
      <title>🎙️ Building a Voice-Controlled AI Agent with Tool Execution</title>
      <dc:creator>TARANDEEP SINGH KHURANA</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:08:54 +0000</pubDate>
      <link>https://dev.to/tarandeep_singhkhurana_8/building-a-voice-controlled-ai-agent-with-tool-execution-5802</link>
      <guid>https://dev.to/tarandeep_singhkhurana_8/building-a-voice-controlled-ai-agent-with-tool-execution-5802</guid>
      <description>&lt;p&gt;In this article, I’ll walk through how I built a &lt;strong&gt;voice-controlled AI agent&lt;/strong&gt; that can understand user commands, decide what action to take, execute tools like file creation or code generation, and respond naturally — all through a simple web interface.&lt;/p&gt;




&lt;h2&gt;🚀 Overview&lt;/h2&gt;

&lt;p&gt;The goal of this project was to move beyond a basic chatbot and build something closer to an &lt;strong&gt;agentic system&lt;/strong&gt; — where the model doesn’t just respond, but &lt;strong&gt;decides what to do&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The agent supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎤 Voice input (record or upload audio)&lt;/li&gt;
&lt;li&gt;🧠 Speech-to-text using OpenAI Whisper&lt;/li&gt;
&lt;li&gt;🤖 LLM-based decision making (no hardcoded intent rules)&lt;/li&gt;
&lt;li&gt;🛠️ Tool execution (file creation, code generation)&lt;/li&gt;
&lt;li&gt;💬 Natural language responses&lt;/li&gt;
&lt;li&gt;🖥️ Interactive UI using Streamlit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🏗️ Architecture&lt;/h2&gt;

&lt;p&gt;At a high level, the system looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User (Voice Input)
        ↓
Speech-to-Text (whisper-1)
        ↓
Agent (LLM decides action)
        ├── If "final" → respond normally
        └── If tool → execute tool
                    ↓
              Tool Result
                    ↓
              Sent back to LLM
                    ↓
          Final Natural Response
                    ↓
            Streamlit UI
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;🧠 Agent Design (Core Idea)&lt;/h2&gt;

&lt;p&gt;Instead of using traditional intent classification, I implemented an &lt;strong&gt;agent loop&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send user input to the LLM&lt;/li&gt;
&lt;li&gt;LLM returns structured JSON like:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;If it's a tool → execute it&lt;/li&gt;
&lt;li&gt;Feed tool result back to the LLM&lt;/li&gt;
&lt;li&gt;LLM generates a final natural response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach removes rigid pipelines and makes the system &lt;strong&gt;fully flexible and extensible&lt;/strong&gt;.&lt;/p&gt;
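&lt;p&gt;The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the exact implementation: &lt;code&gt;llm&lt;/code&gt; stands in for the chat-completion call and &lt;code&gt;tools&lt;/code&gt; for the tool registry, both hypothetical names here:&lt;/p&gt;

```python
import json

def agent_step(user_text, llm, tools):
    """One pass of the agent loop: decide, maybe run a tool, then answer.

    llm is any callable taking a message list and returning a string;
    tools maps an action name to a callable.
    """
    messages = [{"role": "user", "content": user_text}]
    decision = json.loads(llm(messages))  # {"action": ..., "input": ...}
    if decision["action"] == "final":
        return decision["input"]["text"]  # no tool needed, answer directly
    result = tools[decision["action"]](decision["input"])  # run the chosen tool
    messages.append({"role": "system", "content": "Tool result: " + str(result)})
    return llm(messages)  # the LLM phrases the final natural response
```

&lt;p&gt;Adding a new tool is then just one more entry in the registry, with no change to the loop itself.&lt;/p&gt;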




&lt;h2&gt;🛠️ Tools Implemented&lt;/h2&gt;

&lt;p&gt;Currently, the agent supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📁 &lt;code&gt;create_file&lt;/code&gt; → Create files in a restricted folder&lt;/li&gt;
&lt;li&gt;💻 &lt;code&gt;write_code&lt;/code&gt; → Generate and write code files&lt;/li&gt;
&lt;li&gt;💬 General chat fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are sandboxed inside an &lt;code&gt;output/&lt;/code&gt; directory for safety.&lt;/p&gt;
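&lt;p&gt;As a sketch of how that sandbox can be enforced (assuming &lt;code&gt;output/&lt;/code&gt; as the root; the real tool may differ in details):&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def create_file(name, content=""):
    """Write a file, refusing any path that escapes the output/ sandbox."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = (OUTPUT_DIR / name).resolve()
    # resolve() collapses any ../ tricks before we check containment
    if OUTPUT_DIR.resolve() not in target.parents:
        raise ValueError("path escapes the sandbox: " + name)
    target.write_text(content)
    return str(target)
```

&lt;p&gt;Resolving the path &lt;em&gt;before&lt;/em&gt; the containment check is what blocks &lt;code&gt;../&lt;/code&gt; traversal.&lt;/p&gt;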




&lt;h2&gt;🤖 Models Used&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text:&lt;/strong&gt; &lt;code&gt;whisper-1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; &lt;code&gt;gpt-4o-mini&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
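&lt;p&gt;The transcription step itself is small. A sketch in the OpenAI Python SDK v1 style (the &lt;code&gt;client&lt;/code&gt; is passed in rather than constructed inside, which also makes the function easy to stub in tests):&lt;/p&gt;

```python
def transcribe(client, audio_path):
    """Speech-to-text: send an audio file to the whisper-1 endpoint."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text
```

&lt;p&gt;In the app this is called with &lt;code&gt;client = OpenAI()&lt;/code&gt;.&lt;/p&gt;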

&lt;h3&gt;💡 Why gpt-4o-mini?&lt;/h3&gt;

&lt;p&gt;Since I was paying for OpenAI credits myself, the token budget was limited, so I chose a &lt;strong&gt;cost-efficient model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, I do have hands-on experience working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 / GPT-5 series (up to GPT-5.4)&lt;/li&gt;
&lt;li&gt;Claude Sonnet / Opus models&lt;/li&gt;
&lt;li&gt;Other flagship LLMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the choice here was &lt;strong&gt;purely practical&lt;/strong&gt;, not due to lack of exposure.&lt;/p&gt;




&lt;h2&gt;⚠️ Challenges Faced&lt;/h2&gt;

&lt;h3&gt;1. Streamlit (Biggest Pain Point)&lt;/h3&gt;

&lt;p&gt;Honestly, the hardest part of this project was &lt;strong&gt;not the AI — it was Streamlit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Problems I faced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session state management is unintuitive&lt;/li&gt;
&lt;li&gt;Frequent unwanted reruns&lt;/li&gt;
&lt;li&gt;Hard to control UI flow&lt;/li&gt;
&lt;li&gt;Debugging is painful&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2. Audio Input Handling&lt;/h3&gt;

&lt;p&gt;Handling both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🎤 Recorded audio&lt;/li&gt;
&lt;li&gt;📤 Uploaded audio&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…was surprisingly tricky.&lt;/p&gt;

&lt;p&gt;Issues included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audio not updating correctly&lt;/li&gt;
&lt;li&gt;Previous audio persisting&lt;/li&gt;
&lt;li&gt;Preview disappearing unexpectedly&lt;/li&gt;
&lt;li&gt;Send button not triggering properly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Getting this right required careful control of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;session_state&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;widget keys&lt;/li&gt;
&lt;li&gt;rerun timing&lt;/li&gt;
&lt;/ul&gt;
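&lt;p&gt;The pattern that ended up working for me looks roughly like this (a sketch assuming a recent Streamlit with &lt;code&gt;st.audio_input&lt;/code&gt;; the key names are my own):&lt;/p&gt;

```python
import streamlit as st

# Bump the widget key after each send: a fresh key gives a fresh recorder,
# so the previous clip does not persist across reruns.
st.session_state.setdefault("audio_key", 0)

recorded = st.audio_input("Record", key=f"rec_{st.session_state.audio_key}")
uploaded = st.file_uploader("Or upload audio", type=["wav", "mp3"])
audio = recorded if recorded is not None else uploaded

if audio is not None:
    st.audio(audio)  # preview; re-rendered on every rerun
    if st.button("Send"):
        st.session_state["pending_audio"] = audio.getvalue()
        st.session_state.audio_key += 1  # clears the recorder on the next run
        st.rerun()
```

&lt;p&gt;Stashing the bytes in &lt;code&gt;session_state&lt;/code&gt; before &lt;code&gt;st.rerun()&lt;/code&gt; is what keeps the input from vanishing mid-flow.&lt;/p&gt;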




&lt;h2&gt;🧩 Key Learning&lt;/h2&gt;

&lt;p&gt;The biggest takeaway:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Building AI systems is not just about models — it’s about managing state, UI behavior, and system flow.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;strong&gt;agent logic was actually straightforward&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The real complexity came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UI framework limitations&lt;/li&gt;
&lt;li&gt;State synchronization&lt;/li&gt;
&lt;li&gt;Event-driven behavior&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🔥 Example Flow&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a Python file with a retry function"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;System:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio → transcribed to text&lt;/li&gt;
&lt;li&gt;LLM decides → &lt;code&gt;write_code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tool generates Python code&lt;/li&gt;
&lt;li&gt;File saved in &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;LLM explains what was done&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;🚀 Future Improvements&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Streaming responses (ChatGPT-like typing effect)&lt;/li&gt;
&lt;li&gt;More tools (API calls, DB queries, etc.)&lt;/li&gt;
&lt;li&gt;Better UI framework (possibly replacing Streamlit)&lt;/li&gt;
&lt;li&gt;Multi-step reasoning chains&lt;/li&gt;
&lt;li&gt;File preview &amp;amp; download in UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;🧠 Final Thoughts&lt;/h2&gt;

&lt;p&gt;This project was a great exercise in building a &lt;strong&gt;real-world agent system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While LLMs make reasoning easy, the real engineering challenge lies in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system design&lt;/li&gt;
&lt;li&gt;tool orchestration&lt;/li&gt;
&lt;li&gt;UI-state synchronization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And surprisingly… debugging Streamlit 😄&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
