DEV Community

TARANDEEP SINGH KHURANA

πŸŽ™οΈ Building a Voice-Controlled AI Agent with Tool Execution

In this article, I’ll walk through how I built a voice-controlled AI agent that can understand user commands, decide what action to take, execute tools like file creation or code generation, and respond naturally, all through a simple web interface.


🚀 Overview

The goal of this project was to move beyond a basic chatbot and build something closer to an agentic system, where the model doesn’t just respond but decides what to do.

The agent supports:

  • 🎤 Voice input (record or upload audio)
  • 🧠 Speech-to-text using OpenAI Whisper
  • 🤖 LLM-based decision making (no hardcoded intent rules)
  • 🛠️ Tool execution (file creation, code generation)
  • 💬 Natural language responses
  • 🖥️ Interactive UI using Streamlit

πŸ—οΈ Architecture

At a high level, the system looks like this:

User (Voice Input)
        ↓
Speech-to-Text (whisper-1)
        ↓
Agent (LLM decides action)
        ├── If "final" → respond normally
        └── If tool → execute tool
                    ↓
              Tool Result
                    ↓
              Sent back to LLM
                    ↓
          Final Natural Response
                    ↓
            Streamlit UI

🧠 Agent Design (Core Idea)

Instead of using traditional intent classification, I implemented an agent loop:

  1. Send user input to the LLM
  2. LLM returns structured JSON like:
   {
     "action": "write_code",
     "input": { ... }
   }
  3. If it's a tool → execute it
  4. Feed the tool result back to the LLM
  5. LLM generates a final natural response

This approach removes rigid pipelines and makes the system fully flexible and extensible.
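The loop above can be sketched in a few lines. This is a simplified illustration, not the project's exact code: `call_llm` is a hypothetical helper that returns the model's JSON decision as a string, and the two tools are stubbed out.

```python
import json

# Stub tool registry; the real tools create files and generate code.
TOOLS = {
    "create_file": lambda args: f"created {args['name']}",
    "write_code": lambda args: f"wrote code to {args['name']}",
}

def agent_step(user_input: str, call_llm) -> str:
    """One pass of the agent loop: ask the LLM for an action,
    run the requested tool, then ask the LLM for a final reply."""
    decision = json.loads(call_llm(user_input))
    if decision["action"] == "final":
        return decision["input"]["text"]
    result = TOOLS[decision["action"]](decision["input"])
    # Feed the tool result back so the LLM can phrase a natural response
    final = json.loads(call_llm(f"Tool result: {result}"))
    return final["input"]["text"]
```

Because the LLM call is injected, the loop is trivially testable with a fake model, and adding a new capability is just one more entry in the tool registry.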


πŸ› οΈ Tools Implemented

Currently, the agent supports:

  • πŸ“ create_file β†’ Create files in a restricted folder
  • πŸ’» write_code β†’ Generate and write code files
  • πŸ’¬ General chat fallback

All file operations are sandboxed inside an output/ directory for safety.
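The sandboxing boils down to resolving every requested path and refusing anything that escapes the output/ directory. A minimal sketch of this idea (function names are illustrative, not the project's exact code):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(name: str) -> Path:
    """Resolve a requested filename inside output/ and reject escapes
    like '../../etc/passwd'."""
    candidate = (OUTPUT_DIR / name).resolve()
    if OUTPUT_DIR != candidate and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {name}")
    return candidate

def create_file(name: str, content: str) -> str:
    """Write content to a sandboxed file and return its path."""
    path = safe_path(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return str(path)
```

Resolving before checking matters: a naive string prefix check passes `output/../secrets`, while `Path.resolve()` normalizes the `..` away first.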


🤖 Models Used

  • Speech-to-Text: whisper-1
  • LLM: gpt-4o-mini
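A minimal sketch of the two model calls, assuming the `openai>=1.0` Python SDK; the client is passed in as a parameter (an `openai.OpenAI` instance) rather than constructed inside, which keeps the functions easy to test with a fake.

```python
def transcribe(client, audio_path: str) -> str:
    """Speech-to-text with whisper-1. `client` is assumed to be an
    openai.OpenAI instance (openai>=1.0 SDK)."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def decide(client, system_prompt: str, user_text: str) -> str:
    """Ask gpt-4o-mini for the agent's next action as structured JSON."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    )
    return resp.choices[0].message.content
```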

💡 Why gpt-4o-mini?

Because I paid for the OpenAI credits myself and had a limited token budget, I chose a cost-efficient model.

However, I do have hands-on experience working with:

  • GPT-4 / GPT-5 series (up to GPT-5.4)
  • Claude Sonnet / Opus models
  • Other flagship LLMs

So the choice here was purely practical, not due to lack of exposure.


⚠️ Challenges Faced

1. Streamlit (Biggest Pain Point)

Honestly, the hardest part of this project was not the AI; it was Streamlit.

Problems I faced:

  • Session state management is unintuitive
  • Frequent unwanted reruns
  • Hard to control UI flow
  • Debugging is painful

2. Audio Input Handling

Handling both:

  • 🎤 Recorded audio
  • 📤 Uploaded audio

…was surprisingly tricky.

Issues included:

  • Audio not updating correctly
  • Previous audio persisting
  • Preview disappearing unexpectedly
  • Send button not triggering properly

Getting this right required careful control of:

  • session_state
  • widget keys
  • rerun timing
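The pattern that eventually worked for me, shown here stripped of Streamlit itself so the logic is visible: store the audio in session state explicitly, and bump a counter used in the widget's key so the recorder/uploader remounts fresh after a send. In the sketch below, `state` stands in for `st.session_state`, and the helper names are my own, not a Streamlit API.

```python
def init_state(state):
    """Set defaults once; safe to call on every rerun."""
    state.setdefault("audio", None)
    state.setdefault("uploader_key", 0)

def store_audio(state, new_audio):
    """Called on every rerun with the widget's current value.
    Keeps the previous clip when the widget returns None."""
    if new_audio is not None:
        state["audio"] = new_audio

def clear_after_send(state):
    """After sending, drop the audio and bump the widget key so the
    old preview does not persist into the next rerun."""
    state["audio"] = None
    state["uploader_key"] += 1
```

In the actual app, `state["uploader_key"]` would feed into the widget's `key=` argument; changing the key is what forces Streamlit to treat it as a brand-new widget instead of replaying the stale value.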

🧩 Key Learning

The biggest takeaway:

Building AI systems is not just about models; it’s about managing state, UI behavior, and system flow.

The agent logic was actually straightforward.

The real complexity came from:

  • UI framework limitations
  • State synchronization
  • Event-driven behavior

🔥 Example Flow

User:

“Create a Python file with a retry function”

System:

  1. Audio → transcribed to text
  2. LLM decides → write_code
  3. Tool generates Python code
  4. File saved in output/
  5. LLM explains what was done

🚀 Future Improvements

  • Streaming responses (ChatGPT-like typing effect)
  • More tools (API calls, DB queries, etc.)
  • Better UI framework (possibly replacing Streamlit)
  • Multi-step reasoning chains
  • File preview & download in UI

🧠 Final Thoughts

This project was a great exercise in building a real-world agent system.

While LLMs make reasoning easy, the real engineering challenge lies in:

  • system design
  • tool orchestration
  • UI-state synchronization

And, surprisingly, debugging Streamlit 😄
