In this article, I'll walk through how I built a voice-controlled AI agent that can understand user commands, decide what action to take, execute tools like file creation or code generation, and respond naturally, all through a simple web interface.
Overview
The goal of this project was to move beyond a basic chatbot and build something closer to an agentic system, where the model doesn't just respond, but decides what to do.
The agent supports:
- Voice input (record or upload audio)
- Speech-to-text using OpenAI Whisper
- LLM-based decision making (no hardcoded intent rules)
- Tool execution (file creation, code generation)
- Natural language responses
- Interactive UI using Streamlit
Architecture
At a high level, the system looks like this:
User (Voice Input)
  ↓
Speech-to-Text (whisper-1)
  ↓
Agent (LLM decides action)
  ├── If "final" → respond normally
  └── If tool → execute tool
  ↓
Tool Result
  ↓
Sent back to LLM
  ↓
Final Natural Response
  ↓
Streamlit UI
Agent Design (Core Idea)
Instead of using traditional intent classification, I implemented an agent loop:
- Send user input to the LLM
- LLM returns structured JSON like:
  {
    "action": "write_code",
    "input": { ... }
  }
- If it's a tool → execute it
- Feed the tool result back to the LLM
- LLM generates a final natural response
This approach removes rigid pipelines and makes the system fully flexible and extensible.
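The loop above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `call_llm` is a stub standing in for the real chat call to gpt-4o-mini, and the `TOOLS` registry would hold the tools described in the next section.

```python
import json

# Stub standing in for the real chat-completions call to gpt-4o-mini.
# The real call would return either a tool decision or a final answer
# as structured JSON.
def call_llm(messages):
    return json.dumps({"action": "final", "input": {"text": "Done."}})

TOOLS = {}  # name -> callable, e.g. {"create_file": create_file}

def agent_loop(user_text, max_steps=5):
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(messages))
        if decision["action"] == "final":
            # "final" means no tool is needed: respond normally
            return decision["input"]["text"]
        # Tool step: execute, then feed the result back to the model
        result = TOOLS[decision["action"]](**decision["input"])
        messages.append({"role": "assistant", "content": f"Tool result: {result}"})
    return "Step limit reached."
```

The step limit is a simple safeguard against the model looping on tool calls forever.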
Tools Implemented
Currently, the agent supports:
- create_file → create files in a restricted folder
- write_code → generate and write code files
- General chat fallback
All file operations are sandboxed inside an output/ directory for safety.
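One way to enforce that sandbox is to resolve every target path and refuse anything that escapes the output/ directory. A minimal sketch (the function name and error handling are illustrative assumptions, not the project's exact code):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def create_file(filename: str, content: str) -> str:
    # Resolve the target so "../" tricks collapse, then verify it
    # still lives inside the sandboxed output/ directory.
    target = (OUTPUT_DIR / filename).resolve()
    if not target.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"Refusing to write outside {OUTPUT_DIR}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return str(target)
```

Resolving before checking matters: a raw string comparison on `filename` would miss path-traversal inputs like `../escape.txt`.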
Models Used
- Speech-to-Text: whisper-1
- LLM: gpt-4o-mini
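After transcription, the transcript has to be packaged into the chat request along with an instruction to answer in the structured JSON format the agent loop expects. A small helper along these lines could do it (the prompt wording and function name are assumptions for illustration):

```python
# Hypothetical system prompt; the project's exact wording may differ.
SYSTEM_PROMPT = (
    "You are a tool-using agent. Reply ONLY with JSON of the form "
    '{"action": "<tool name or final>", "input": {...}}.'
)

def build_messages(transcript, history=None):
    """Assemble the chat payload sent to gpt-4o-mini after transcription."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + (history or [])
        + [{"role": "user", "content": transcript}]
    )
```

Keeping prior turns in `history` is what lets the model see tool results from earlier steps of the loop.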
Why gpt-4o-mini?
Due to limited token budget (I paid for OpenAI credits myself), I had to choose a cost-efficient model.
However, I do have hands-on experience working with:
- GPT-4 / GPT-5 series (up to GPT-5.4)
- Claude Sonnet / Opus models
- Other flagship LLMs
So the choice here was purely practical, not due to lack of exposure.
Challenges Faced
1. Streamlit (Biggest Pain Point)
Honestly, the hardest part of this project was not the AI; it was Streamlit.
Problems I faced:
- Session state management is unintuitive
- Frequent unwanted reruns
- Hard to control UI flow
- Debugging is painful
2. Audio Input Handling
Handling both:
- Recorded audio
- Uploaded audio
…was surprisingly tricky.
Issues included:
- Audio not updating correctly
- Previous audio persisting
- Preview disappearing unexpectedly
- Send button not triggering properly
Getting this right required careful control of:
- session_state
- widget keys
- rerun timing
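Part of what made this manageable was normalizing both audio sources into one shape before anything else touched them. Recorded and uploaded audio both ultimately yield raw bytes, which can be wrapped in a single named buffer for the Whisper call. A sketch (the helper name and default filename are illustrative, not from the project):

```python
import io

def prepare_audio(raw: bytes, filename: str = "input.wav") -> io.BytesIO:
    """Wrap raw audio bytes (recorded or uploaded) in a named buffer.

    The OpenAI SDK accepts a file-like object for transcription;
    giving the buffer a .name lets it infer the audio format.
    """
    buf = io.BytesIO(raw)
    buf.name = filename
    return buf
```

Funneling both input paths through one helper also makes stale-audio bugs easier to reason about, since there is a single place where "the current audio" is constructed.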
Key Learning
The biggest takeaway:
Building AI systems is not just about models; it's about managing state, UI behavior, and system flow.
The agent logic was actually straightforward.
The real complexity came from:
- UI framework limitations
- State synchronization
- Event-driven behavior
Example Flow
User:
"Create a Python file with a retry function"
System:
- Audio → transcribed to text
- LLM decides → write_code
- Tool generates Python code
- File saved in output/
- LLM explains what was done
Future Improvements
- Streaming responses (ChatGPT-like typing effect)
- More tools (API calls, DB queries, etc.)
- Better UI framework (possibly replacing Streamlit)
- Multi-step reasoning chains
- File preview & download in UI
Final Thoughts
This project was a great exercise in building a real-world agent system.
While LLMs make reasoning easy, the real engineering challenge lies in:
- system design
- tool orchestration
- UI-state synchronization
And surprisingly… debugging Streamlit.