In this article, I'll walk through how I built a voice-controlled AI agent that can understand user commands, decide what action to take, execute tools like file creation or code generation, and respond naturally, all through a simple web interface.
Overview
The goal of this project was to move beyond a basic chatbot and build something closer to an agentic system, where the model doesn't just respond, but decides what to do.
The agent supports:
- Voice input (record or upload audio)
- Speech-to-text using OpenAI Whisper
- LLM-based decision making (no hardcoded intent rules)
- Tool execution (file creation, code generation)
- Natural language responses
- Interactive UI using Streamlit
Architecture
At a high level, the system looks like this:
User (Voice Input)
  ↓
Speech-to-Text (whisper-1)
  ↓
Agent (LLM decides action)
  ├── If "final" → respond normally
  └── If tool → execute tool
  ↓
Tool Result
  ↓
Sent back to LLM
  ↓
Final Natural Response
  ↓
Streamlit UI
Agent Design (Core Idea)
Instead of using traditional intent classification, I implemented an agent loop:
- Send user input to the LLM
- LLM returns structured JSON like:
  {
    "action": "write_code",
    "input": { ... }
  }
- If it's a tool → execute it
- Feed the tool result back to the LLM
- LLM generates a final natural response
This approach removes rigid pipelines and makes the system fully flexible and extensible.
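The loop above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: `call_llm` is a stub standing in for the real chat call to gpt-4o-mini, and the `TOOLS` registry would hold the tools described in the next section.

```python
import json

# Stub standing in for the real chat-completions call to gpt-4o-mini.
# The real call would return either a tool decision or a final answer
# as structured JSON.
def call_llm(messages):
    return json.dumps({"action": "final", "input": {"text": "Done."}})

TOOLS = {}  # name -> callable, e.g. {"create_file": create_file}

def agent_loop(user_text, max_steps=5):
    messages = [{"role": "user", "content": user_text}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(messages))
        if decision["action"] == "final":
            # "final" means no tool is needed: respond normally
            return decision["input"]["text"]
        # Tool step: execute, then feed the result back to the model
        result = TOOLS[decision["action"]](**decision["input"])
        messages.append({"role": "assistant", "content": f"Tool result: {result}"})
    return "Step limit reached."
```

The step limit is a simple safeguard against the model looping on tool calls forever.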
Tools Implemented
Currently, the agent supports:
- create_file → create files in a restricted folder
- write_code → generate and write code files
- General chat fallback
All file operations are sandboxed inside an output/ directory for safety.
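One way to enforce that sandbox is to resolve every target path and refuse anything that escapes the output/ directory. A minimal sketch (the function name and error handling are illustrative assumptions, not the project's exact code):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def create_file(filename: str, content: str) -> str:
    # Resolve the target so "../" tricks collapse, then verify it
    # still lives inside the sandboxed output/ directory.
    target = (OUTPUT_DIR / filename).resolve()
    if not target.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"Refusing to write outside {OUTPUT_DIR}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return str(target)
```

Resolving before checking matters: a raw string comparison on `filename` would miss path-traversal inputs like `../escape.txt`.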
Models Used
- Speech-to-Text: whisper-1
- LLM: gpt-4o-mini
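After transcription, the transcript has to be packaged into the chat request along with an instruction to answer in the structured JSON format the agent loop expects. A small helper along these lines could do it (the prompt wording and function name are assumptions for illustration):

```python
# Hypothetical system prompt; the project's exact wording may differ.
SYSTEM_PROMPT = (
    "You are a tool-using agent. Reply ONLY with JSON of the form "
    '{"action": "<tool name or final>", "input": {...}}.'
)

def build_messages(transcript, history=None):
    """Assemble the chat payload sent to gpt-4o-mini after transcription."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + (history or [])
        + [{"role": "user", "content": transcript}]
    )
```

Keeping prior turns in `history` is what lets the model see tool results from earlier steps of the loop.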
Why gpt-4o-mini?
Due to limited token budget (I paid for OpenAI credits myself), I had to choose a cost-efficient model.
However, I do have hands-on experience working with:
- GPT-4 / GPT-5 series (up to GPT-5.4)
- Claude Sonnet / Opus models
- Other flagship LLMs
So the choice here was purely practical, not due to lack of exposure.
Challenges Faced
1. Streamlit (Biggest Pain Point)
Honestly, the hardest part of this project was not the AI; it was Streamlit.
Problems I faced:
- Session state management is unintuitive
- Frequent unwanted reruns
- Hard to control UI flow
- Debugging is painful
2. Audio Input Handling
Handling both:
- Recorded audio
- Uploaded audio
…was surprisingly tricky.
Issues included:
- Audio not updating correctly
- Previous audio persisting
- Preview disappearing unexpectedly
- Send button not triggering properly
Getting this right required careful control of:
- session_state
- widget keys
- rerun timing
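Part of what made this manageable was normalizing both audio sources into one shape before anything else touched them. Recorded and uploaded audio both ultimately yield raw bytes, which can be wrapped in a single named buffer for the Whisper call. A sketch (the helper name and default filename are illustrative, not from the project):

```python
import io

def prepare_audio(raw: bytes, filename: str = "input.wav") -> io.BytesIO:
    """Wrap raw audio bytes (recorded or uploaded) in a named buffer.

    The OpenAI SDK accepts a file-like object for transcription;
    giving the buffer a .name lets it infer the audio format.
    """
    buf = io.BytesIO(raw)
    buf.name = filename
    return buf
```

Funneling both input paths through one helper also makes stale-audio bugs easier to reason about, since there is a single place where "the current audio" is constructed.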
Key Learning
The biggest takeaway:
Building AI systems is not just about models; it's about managing state, UI behavior, and system flow.
The agent logic was actually straightforward.
The real complexity came from:
- UI framework limitations
- State synchronization
- Event-driven behavior
Example Flow
User:
"Create a Python file with a retry function"
System:
- Audio → transcribed to text
- LLM decides → write_code
- Tool generates Python code
- File saved in output/
- LLM explains what was done
Future Improvements
- Streaming responses (ChatGPT-like typing effect)
- More tools (API calls, DB queries, etc.)
- Better UI framework (possibly replacing Streamlit)
- Multi-step reasoning chains
- File preview & download in UI
Final Thoughts
This project was a great exercise in building a real-world agent system.
While LLMs make reasoning easy, the real engineering challenge lies in:
- system design
- tool orchestration
- UI-state synchronization
And surprisingly… debugging Streamlit.