<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aditya Nagalkar</title>
    <description>The latest articles on DEV Community by Aditya Nagalkar (@aditya_nagalkar_e38db4d7b).</description>
    <link>https://dev.to/aditya_nagalkar_e38db4d7b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875371%2F9e546982-ce90-445e-8ef1-48c712ddd37d.png</url>
      <title>DEV Community: Aditya Nagalkar</title>
      <link>https://dev.to/aditya_nagalkar_e38db4d7b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aditya_nagalkar_e38db4d7b"/>
    <language>en</language>
    <item>
      <title>I Built a Voice-Controlled AI Agent in Python</title>
      <dc:creator>Aditya Nagalkar</dc:creator>
      <pubDate>Sun, 12 Apr 2026 18:44:55 +0000</pubDate>
      <link>https://dev.to/aditya_nagalkar_e38db4d7b/i-built-a-voice-controlled-ai-agent-in-python-2045</link>
      <guid>https://dev.to/aditya_nagalkar_e38db4d7b/i-built-a-voice-controlled-ai-agent-in-python-2045</guid>
      <description>
&lt;p&gt;I Built a Voice-Controlled AI Agent in Python — Here's What Actually Went Wrong&lt;br&gt;
When I got the assignment to build a voice-controlled local AI agent, my first thought was — how hard can it be? Record audio, transcribe it, run some code. Three days later, I had a much more honest answer.&lt;/p&gt;

&lt;p&gt;This is a write-up of what I built, what broke, and what I learned along the way.&lt;/p&gt;

&lt;p&gt;What the Project Does&lt;br&gt;
The final system works like this: you speak a command into the browser, it transcribes your voice using Whisper, an LLM figures out what you meant and which tool to call, and then the agent actually does it — creates files, writes code, summarizes text, or just chats with you. Everything is surfaced in a web UI built with Gradio.&lt;/p&gt;

&lt;p&gt;The pipeline looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Your Voice → Whisper (STT) → LLaMA 3.3 70B (Intent) → Tool Execution → UI
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Simple on paper. Less simple in practice.&lt;/p&gt;

&lt;p&gt;The Architecture: How It Actually Fits Together&lt;br&gt;
I knew from the start that I wanted to keep the codebase modular. Tying LLM reasoning logic directly to Gradio UI components is a fast track to spaghetti code.&lt;/p&gt;

&lt;p&gt;I split the system into a clean frontend and a dedicated core/ backend directory. The flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The Interface (app.py)&lt;br&gt;
This is the Gradio frontend. It handles the multi-modal audio input (microphone, file upload, or raw file paths), manages the session state (like the 8-turn conversation memory and the human-in-the-loop toggle), and renders the final output alongside the latency benchmarks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Ears (core/stt.py)&lt;br&gt;
When app.py captures an audio payload, it passes it here. This module handles routing the audio to the Groq API to hit the whisper-large-v3 model. Because it's API-based, we get sub-second transcription instead of waiting 10–15 seconds for local processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Brain &amp;amp; Dispatcher (core/agent.py)&lt;br&gt;
This is where the magic happens. It takes the transcribed text and the conversation history, packages them into a highly opinionated prompt, and queries llama-3.3-70b-versatile. Its job isn't to execute commands—its only job is to understand the intent and return that strictly formatted JSON array of commands. Once it parses the JSON, it dispatches the instructions to the appropriate tools.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Hands (core/tools.py)&lt;br&gt;
This is the execution layer. It maps the LLM's intents to actual Python functions: create_file, write_code, summarize, or chat. This is also exactly where the safe_filepath() security gatekeeper lives. Before any write_code or create_file intent is executed, this module sanitizes the request to guarantee nothing ever gets written outside the designated output/ folder.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
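&lt;p&gt;As a rough sketch of how those four modules hand off to each other (every stage is stubbed out here; the function names and return values below are illustrative, not the repo's actual signatures):&lt;/p&gt;

```python
# Hypothetical glue sketch of the four-stage pipeline described above.
# Each stage is a stub so the control flow is visible end to end.

def transcribe(audio_path):
    # core/stt.py: would send audio to Groq's whisper-large-v3 (stubbed)
    return "create a file named notes.txt"

def plan(transcript, history):
    # core/agent.py: LLM turns text plus history into an ordered command list (stubbed)
    return [{"intent": "create_file", "filename": "notes.txt",
             "reply": "Created notes.txt"}]

def execute(command):
    # core/tools.py: map each intent to a Python function (stubbed)
    handlers = {"create_file": lambda c: c["reply"]}
    return handlers[command["intent"]](command)

def handle_request(audio_path, history):
    # app.py: the Gradio callback that strings the stages together
    transcript = transcribe(audio_path)
    commands = plan(transcript, history)
    return [execute(cmd) for cmd in commands]

print(handle_request("clip.wav", []))  # ['Created notes.txt']
```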

&lt;p&gt;The Stack &amp;amp; Model Choices&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speech-to-Text: Whisper large-v3 via the Groq API&lt;/li&gt;
&lt;li&gt;LLM: LLaMA 3.3 70B Versatile via the Groq API&lt;/li&gt;
&lt;li&gt;UI: Gradio&lt;/li&gt;
&lt;li&gt;Language: Python&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I chose Groq over running models entirely locally for one crucial reason: speed. Running Whisper large-v3 locally on my machine requires about 3GB of VRAM and takes 5–15 seconds per clip. LLaMA 70B locally requires 40+ GB of RAM. Groq's LPU hardware delivers sub-second transcription and sub-3-second LLM responses, making the agent feel instant. For an agent that's supposed to feel responsive and conversational, that difference makes or breaks the user experience.&lt;/p&gt;

&lt;p&gt;Challenge 1: Getting the LLM to Return Consistent JSON&lt;br&gt;
The first real problem was making the LLM reliably output structured data. I needed it to return a specific schema:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "commands": [
    {
      "intent": "write_code",
      "filename": "retry.py",
      "code": "...",
      "reply": "Created retry.py"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Early versions of my prompt would randomly return extra conversational text before the JSON, wrap it in Markdown blocks, or just spit out a plain sentence. Every time the format broke, the entire pipeline crashed.&lt;/p&gt;

&lt;p&gt;The fix was a three-pronged approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Using Groq's &lt;code&gt;response_format={"type": "json_object"}&lt;/code&gt; parameter to force JSON output.&lt;/li&gt;
&lt;li&gt;Lowering the temperature to 0.3 for more deterministic responses.&lt;/li&gt;
&lt;li&gt;Rewriting the system prompt to be ruthlessly explicit, spelling out exact field names, types, and fallback behaviors for edge cases.&lt;/li&gt;
&lt;/ol&gt;
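&lt;p&gt;Even with forced JSON output, the reply still deserves defensive parsing before it reaches the tool layer. A minimal validation sketch, assuming the &lt;code&gt;{"commands": [...]}&lt;/code&gt; schema shown above (the required-field and intent lists here are assumptions, not the project's exact rules):&lt;/p&gt;

```python
import json

REQUIRED = {"intent"}  # assumed minimum; other fields depend on the intent
VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def parse_commands(raw):
    """Parse and validate the model's JSON reply before dispatching it."""
    data = json.loads(raw)  # json.JSONDecodeError (a ValueError) on malformed JSON
    commands = data.get("commands")
    if not isinstance(commands, list):
        raise ValueError("missing 'commands' array")
    for cmd in commands:
        if not REQUIRED.issubset(cmd):
            raise ValueError(f"command missing fields: {cmd}")
        if cmd["intent"] not in VALID_INTENTS:
            raise ValueError(f"unknown intent: {cmd['intent']}")
    return commands

raw = '{"commands": [{"intent": "chat", "reply": "Hi!"}]}'
print(parse_commands(raw))  # [{'intent': 'chat', 'reply': 'Hi!'}]
```

&lt;p&gt;The Groq call itself (with &lt;code&gt;response_format&lt;/code&gt; and a low temperature) is omitted here; this is only the validation side of the contract.&lt;/p&gt;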

&lt;p&gt;Challenge 2: Compound Commands&lt;br&gt;
The assignment included a bonus feature: supporting multiple commands in a single breath. If a user says, "Create a Python file with a retry function and summarize what it does," that is two distinct intents at once.&lt;/p&gt;

&lt;p&gt;The problem? The LLM would often collapse them into one command, execute them out of order, or just drop the second half entirely.&lt;/p&gt;

&lt;p&gt;I had to structure the prompt to heavily emphasize that the commands array is ordered and every distinct action requires its own entry. Upgrading from llama-3.1-8b-instant to llama-3.3-70b-versatile was the real game-changer here. The smaller model was fast but sloppy with complex, multi-step instructions. The 70B model parses compound commands flawlessly.&lt;/p&gt;
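&lt;p&gt;Once the model returns one entry per action, executing a compound command is just an in-order walk over the array. A sketch of that dispatch loop (the handler mapping below is a stand-in for &lt;code&gt;core/tools.py&lt;/code&gt;, not its real contents):&lt;/p&gt;

```python
def run_commands(commands, handlers):
    """Execute an ordered list of parsed commands, one entry per intent."""
    results = []
    for cmd in commands:  # list order is execution order
        handler = handlers.get(cmd["intent"])
        if handler is None:
            results.append(f"unknown intent: {cmd['intent']}")
            continue
        results.append(handler(cmd))
    return results

# Illustrative handlers and a two-intent compound command.
handlers = {
    "write_code": lambda c: f"wrote {c['filename']}",
    "summarize": lambda c: "summary done",
}
compound = [
    {"intent": "write_code", "filename": "retry.py"},
    {"intent": "summarize"},
]
print(run_commands(compound, handlers))  # ['wrote retry.py', 'summary done']
```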

&lt;p&gt;Challenge 3: The Frontend Was the Hardest Part&lt;br&gt;
I did not expect this going in. Gradio is fantastic for quick demos, but fighting its default styling is genuinely painful.&lt;/p&gt;

&lt;p&gt;I wanted an interface that felt intentional, but Gradio has a persistent dark theme that bleeds into components like the audio widget and buttons. CSS overrides alone don't fully fix it because Gradio relies heavily on internal CSS variables that take priority.&lt;/p&gt;

&lt;p&gt;After about six iterations, here is what kept going wrong:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The audio widget stayed dark even after setting the background to white because it renders its own internal styles.&lt;/li&gt;
&lt;li&gt;The theme toggle I built changed the body element's class, but Gradio scopes its CSS to &lt;code&gt;.gradio-container&lt;/code&gt;, rendering the toggle useless.&lt;/li&gt;
&lt;li&gt;Buttons kept reverting to Gradio's default orange accent regardless of my CSS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The actual solution: Passing a gr.themes.Base(...).set(...) object directly into gr.Blocks(). This sets Gradio's internal design tokens (input_background_fill, button_primary_background_fill, etc.) before any rendering happens. Once those tokens were aligned, my CSS layer on top worked cleanly.&lt;/p&gt;
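&lt;p&gt;A minimal sketch of that token-level approach, following Gradio's theming API (the color palette and component layout here are illustrative, not the project's actual theme):&lt;/p&gt;

```python
import gradio as gr

# Set Gradio's internal design tokens up front, before any rendering.
# Variable names follow Gradio's theming guide; the values are placeholders.
light = gr.themes.Base(primary_hue="blue").set(
    body_background_fill="#ffffff",
    input_background_fill="#ffffff",            # keeps the audio widget light
    button_primary_background_fill="#2563eb",   # overrides the orange accent
    button_primary_text_color="#ffffff",
)

with gr.Blocks(theme=light) as demo:
    audio = gr.Audio(sources=["microphone", "upload"], label="Command")
    run = gr.Button("Run", variant="primary")
```

&lt;p&gt;With the tokens aligned first, any custom CSS passed on top only has to handle cosmetic details instead of fighting the framework's defaults.&lt;/p&gt;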

&lt;p&gt;Challenge 4: File Safety&lt;br&gt;
The assignment required restricting file operations to an output/ folder. My initial implementation just used open(filename, "w"). It didn't take long to realize a user could theoretically say "create a file at ../../.env" and overwrite sensitive system files.&lt;/p&gt;

&lt;p&gt;The fix was building a safe_filepath() function in tools.py. It strips any path separators using os.path.basename(), validates the filename against a regex allowing only safe characters, and checks the extension against a strict allowlist. Nothing gets written outside the designated output/ sandbox.&lt;/p&gt;
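&lt;p&gt;A sketch of that gatekeeper, assuming a simple character regex and extension allowlist (the exact allowed set in the repo may differ):&lt;/p&gt;

```python
import os
import re

OUTPUT_DIR = "output"
ALLOWED_EXTENSIONS = {".py", ".txt", ".md", ".json"}  # illustrative allowlist
SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")

def safe_filepath(filename):
    """Sanitize a requested filename so it can only land inside output/."""
    name = os.path.basename(filename)  # drops any directory components
    # Reject hidden/dot files (catches "../../.env" -> ".env") and odd characters,
    # including backslashes that basename() leaves alone on POSIX.
    if not SAFE_NAME.match(name) or name.startswith("."):
        raise ValueError(f"unsafe filename: {filename!r}")
    ext = os.path.splitext(name)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {ext!r}")
    return os.path.join(OUTPUT_DIR, name)

print(safe_filepath("retry.py"))  # output/retry.py (platform-appropriate separator)
```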

&lt;p&gt;Bonus Features I Added&lt;br&gt;
Human-in-the-loop: A toggleable checkpoint that pauses the pipeline after transcription. You can read exactly what the agent heard before it executes anything — incredibly useful when audio quality is spotty.&lt;/p&gt;
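&lt;p&gt;The checkpoint itself reduces to a gate between transcription and execution. A minimal sketch (hypothetical function names; the real app wires this through Gradio session state rather than callback arguments):&lt;/p&gt;

```python
def handle(audio, hitl_enabled, transcribe, confirm, execute):
    """Pause after transcription when human-in-the-loop is on."""
    transcript = transcribe(audio)
    if hitl_enabled and not confirm(transcript):
        return "cancelled: " + transcript
    return execute(transcript)

result = handle(
    "clip.wav",
    hitl_enabled=True,
    transcribe=lambda a: "delete everything",
    confirm=lambda t: False,      # the user reads the transcript and declines
    execute=lambda t: "ran: " + t,
)
print(result)  # cancelled: delete everything
```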

&lt;p&gt;Session Memory: The agent retains the last 8 conversation turns and passes them as context to the LLM. You can say "now make that async" after generating code, and it understands exactly what "that" refers to.&lt;/p&gt;
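&lt;p&gt;A rolling window like that maps naturally onto a bounded deque. A sketch of the idea (the class and method names are assumptions, not the repo's actual API):&lt;/p&gt;

```python
from collections import deque

MAX_TURNS = 8  # matches the 8-turn window described above

class SessionMemory:
    """Rolling conversation window passed to the LLM as context."""

    def __init__(self):
        self.turns = deque(maxlen=MAX_TURNS)

    def add(self, user_text, agent_reply):
        self.turns.append({"user": user_text, "agent": agent_reply})

    def as_context(self):
        # Oldest turns fall off automatically once the deque is full.
        return list(self.turns)

memory = SessionMemory()
for i in range(10):
    memory.add(f"message {i}", f"reply {i}")
print(len(memory.as_context()))  # 8: only the most recent turns survive
```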

&lt;p&gt;Performance Benchmarking: The UI breaks down STT time, LLM time, and total execution time for every single request.&lt;/p&gt;
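&lt;p&gt;Per-stage timing like that is a small wrapper around &lt;code&gt;time.perf_counter()&lt;/code&gt;. A sketch with stubbed stages standing in for the real STT and LLM calls:&lt;/p&gt;

```python
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for one pipeline stage."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stage stubs; the real calls hit the Groq API.
transcript, stt_s = timed(lambda a: "hello", "clip.wav")
reply, llm_s = timed(lambda t: t.upper(), transcript)
total_s = stt_s + llm_s
print(f"STT: {stt_s:.3f}s  LLM: {llm_s:.3f}s  Total: {total_s:.3f}s")
```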

&lt;p&gt;What I'd Do Differently&lt;br&gt;
If I were starting over, I'd respect prompt engineering from day one. I spent hours debugging JSON parsing errors in Python that were really just ambiguous instruction problems. The model does exactly what you tell it — if your instructions are vague, your output will be too.&lt;/p&gt;

&lt;p&gt;I'd also look into streaming the LLM output directly to the UI. For longer code generation tasks, there's a noticeable pause before the file appears, and streaming would make the agent feel instantly responsive.&lt;/p&gt;

&lt;p&gt;The Code&lt;br&gt;
The full project is open-source on GitHub: Adii22-22/voice-ai-agent&lt;/p&gt;

&lt;p&gt;Setup takes about five minutes: clone the repo, install the dependencies, drop a Groq API key into a &lt;code&gt;.env&lt;/code&gt; file, and run &lt;code&gt;python app.py&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
