How I Built a Voice-Controlled Local AI Agent
#python #ai #machinelearning #gradio
From microphone to file creation in under 3 seconds — using Whisper, LLaMA3, and Gradio
Introduction
What if you could just speak to your computer and have it create files, write code, or summarize content — all running locally on your machine? That is exactly what I set out to build for this assignment.
In this article I will walk through how I built a voice-controlled AI agent from scratch, the architecture decisions I made, the models I chose, and the challenges I faced along the way.
What the Agent Does
The agent takes voice input (or typed text), runs it through a full 4-stage pipeline, and returns a result:
- Stage 1 — Audio Input: Accept microphone recording or uploaded .wav/.mp3 file
- Stage 2 — Speech-to-Text: Transcribe the audio using OpenAI Whisper
- Stage 3 — Intent Classification: Send the transcription to an LLM to classify intent
- Stage 4 — Tool Execution: Run the right tool and save output to an output/ folder
Everything is displayed in a clean Gradio web UI showing each stage of the pipeline.
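Conceptually, the four stages boil down to a single dispatch function. Here is a minimal sketch with simplified stand-in stage bodies (the real modules call Whisper and an LLM, of course; all function names here are illustrative, not the project's actual code):

```python
def transcribe_audio(audio_path):
    # Stage 2 stand-in: the real module sends the file to Whisper
    return f"transcript of {audio_path}"

def classify_intent(text):
    # Stage 3 stand-in: the real module asks the LLM for structured JSON
    lowered = text.lower()
    if "summarize" in lowered:
        return {"intent": "summarize"}
    if "code" in lowered or "function" in lowered:
        return {"intent": "write_code"}
    if "file" in lowered or "folder" in lowered:
        return {"intent": "create_file"}
    return {"intent": "general_chat"}

def execute_tool(intent, text):
    # Stage 4 stand-in: dispatch on the classified intent
    return f"ran {intent['intent']} for: {text}"

def run_pipeline(audio_path=None, typed_text=None):
    # Stage 1: accept either typed text or a recorded audio file
    text = typed_text if typed_text else transcribe_audio(audio_path)
    intent = classify_intent(text)
    result = execute_tool(intent, text)
    return text, intent, result
```

The key property is that each stage is a plain function taking the previous stage's output, which is what makes the backends swappable later.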
Architecture Deep Dive
The project is split into 4 Python modules, each handling one stage of the pipeline.
stt.py — Speech to Text
I used Whisper Large v3 via the Groq API as my primary STT backend. Groq provides incredibly fast inference for free, which made it perfect for this project. The module also supports local Whisper (openai-whisper package) and OpenAI's Whisper-1 API as fallbacks, selectable via an environment variable.
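An env-variable switch for the three backends might look roughly like this. The `STT_BACKEND` values mirror the ones used in the setup steps later in the article, but the function itself is an illustrative sketch, not the project's exact code:

```python
import os

def pick_stt_backend():
    # Select the speech-to-text backend from an environment variable;
    # "groq" is assumed as the default since it is the primary backend.
    backend = os.environ.get("STT_BACKEND", "groq").lower()
    if backend == "groq":
        return "whisper-large-v3 via Groq API"
    if backend == "local":
        return "openai-whisper running on this machine"
    if backend == "openai":
        return "whisper-1 via OpenAI API"
    raise ValueError(f"Unknown STT_BACKEND: {backend}")
```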
intent.py — Intent Classification
This module sends the transcription to an LLM with a carefully crafted system prompt asking it to return structured JSON with four fields: intent, filename, language, and description. I chose a JSON-only output format to make parsing reliable and consistent.
The module supports Ollama (local), Groq (cloud), and OpenAI. There is also a rule-based fallback using keyword matching in case the LLM is unavailable.
The four supported intents are:
- create_file — create a new file or folder
- write_code — generate code and save it
- summarize — summarize provided text
- general_chat — conversational response
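The rule-based fallback can be as simple as keyword matching over these four intents. The keyword lists below are illustrative guesses, not the project's real ones:

```python
# Keyword lists per intent; first match wins, general_chat is the default.
INTENT_KEYWORDS = {
    "create_file": ["create a file", "new file", "make a folder", "create folder"],
    "write_code": ["write code", "function", "script", "program"],
    "summarize": ["summarize", "summary", "tl;dr"],
}

def fallback_intent(text):
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(k in lowered for k in keywords):
            return intent
    return "general_chat"
```

It is far less flexible than the LLM, but it keeps the agent functional when the model is unreachable.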
tools.py — Tool Execution
Based on the classified intent, the tools module calls the appropriate function. For write_code, it prompts the LLM a second time with a code-generation system prompt, cleans the output (stripping markdown fences), and saves the result to the output/ directory.
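The fence-stripping step might look like this; the regex is one reasonable way to do it, not necessarily the project's exact approach:

```python
import re

def strip_code_fences(llm_output):
    # Remove a surrounding ```lang ... ``` fence if the LLM wrapped its
    # code in markdown; otherwise return the text unchanged.
    text = llm_output.strip()
    match = re.match(r"^```[\w+-]*\n(.*?)\n?```$", text, re.DOTALL)
    return match.group(1) if match else text
```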
All file operations are sandboxed using os.path.basename() to prevent path traversal attacks — no files can ever be written outside the output/ folder.
app.py — Gradio UI
I built the frontend with Gradio 4, using a custom dark theme with CSS variables. The UI shows all four pipeline outputs simultaneously: transcribed text, detected intent, action taken, and final result. A session history panel shows the last 5 actions for context.
Models I Chose and Why
Whisper Large v3 via Groq (STT)
I chose Groq's hosted Whisper Large v3 for speech-to-text for three reasons:
- It is extremely fast — transcription takes under 1 second for short commands
- Groq provides a generous free tier with no credit card required
- Whisper Large v3 has excellent accuracy even with accented speech and background noise
For users who want full offline operation, the openai-whisper package is also supported as a drop-in replacement.
LLaMA 3 8B via Groq (Intent + Code Generation)
For the LLM I used LLaMA 3 8B through Groq's API. The 8B model is fast and capable enough for intent classification and short code generation tasks.
- For intent classification → temperature set to 0 for deterministic JSON output
- For code generation → temperature set to 0.3 for slightly more creative outputs
For users with a capable local machine, Ollama with llama3 or mistral is fully supported as a drop-in replacement that runs entirely offline.
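Putting the two temperatures together, here is a hedged sketch of a chat call against Groq's OpenAI-compatible endpoint. The URL and the `llama3-8b-8192` model id reflect Groq's public API at the time of writing; error handling is deliberately minimal:

```python
import json
import os
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def temperature_for(task):
    # 0 for deterministic intent JSON, 0.3 for slightly creative code
    return 0.0 if task == "intent" else 0.3

def ask_llm(system_prompt, user_text, task="intent"):
    payload = json.dumps({
        "model": "llama3-8b-8192",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature_for(task),
    }).encode()
    req = urllib.request.Request(
        GROQ_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {os.environ['GROQ_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same function serves both intent classification and code generation; only the system prompt and temperature change.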
Challenges I Faced
1. Getting Reliable JSON from the LLM
The biggest challenge was getting the LLM to consistently return valid JSON for intent classification. The solution was writing a strict system prompt that explicitly states:
"Respond ONLY with valid JSON, no markdown, no explanation"
I also added a fallback JSON parser that strips markdown fences and uses regex to extract JSON objects from messy responses.
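A sketch of that fallback parser: try strict `json.loads` first, then strip any markdown fences and regex out the first `{...}` block from the messy response:

```python
import json
import re

def parse_llm_json(raw):
    cleaned = raw.strip()
    # Remove ```json ... ``` fences if the model added them anyway
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Last resort: pull the first {...} span out of surrounding chatter
        match = re.search(r"\{.*\}", cleaned, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```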
2. File Safety
When the LLM suggests a filename, it sometimes includes relative paths like ../../etc/passwd. I solved this with os.path.basename() which strips any directory components, combined with a character sanitization step that removes unsafe characters.
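A minimal sketch of that sanitization step (the allowed character set and the `untitled.txt` fallback are illustrative choices):

```python
import os
import re

OUTPUT_DIR = "output"

def safe_output_path(suggested_name):
    name = os.path.basename(suggested_name)       # drops "../../" etc.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)  # keep a safe charset
    if not name or set(name) <= {"."}:
        name = "untitled.txt"                     # refuse empty/dot-only names
    return os.path.join(OUTPUT_DIR, name)
```

With this, even a hostile suggestion like `../../etc/passwd` collapses to `output/passwd`.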
3. Audio Format Compatibility
Gradio records audio in WebM format by default on some browsers, but Whisper works best with WAV or MP3. I handled this by detecting the file extension and setting the correct MIME type in the API request.
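The extension-to-MIME mapping can be a small lookup table; entries beyond `.wav`/`.mp3`/`.webm` are assumptions about formats worth covering:

```python
import os

AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".webm": "audio/webm",
    ".m4a": "audio/mp4",
    ".ogg": "audio/ogg",
}

def audio_mime_type(path):
    # Fall back to a generic binary type for unknown extensions
    ext = os.path.splitext(path)[1].lower()
    return AUDIO_MIME_TYPES.get(ext, "application/octet-stream")
```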
4. Hardware Constraints
Running Whisper Large locally requires a GPU and significant RAM. I addressed this by making every component swappable via environment variables:
- Low-end machine → use Groq API for both STT and LLM
- Powerful machine → run everything locally through Ollama + openai-whisper
Example Flow
User says: "Create a Python file with a retry function"
```
🎙️ Audio recorded
        ↓
📝 Transcribed: "Create a Python file with a retry function"
        ↓
🧠 Intent: write_code | language: python | filename: retry.py
        ↓
⚡ LLM generates Python retry function code
        ↓
💾 Saved to: output/retry.py
        ↓
✅ UI displays transcription, intent, action, and code preview
```
Project Structure
```
voice-ai-agent/
├── app.py             # Gradio UI
├── stt.py             # Speech-to-Text (Whisper)
├── intent.py          # Intent classification (LLaMA3)
├── tools.py           # Tool execution
├── requirements.txt   # Dependencies
└── output/            # Generated files (sandboxed)
```
How to Run It
```bash
# 1. Clone the repo
git clone https://github.com/AryanJaitely/voice-ai-agent.git
cd voice-ai-agent

# 2. Install dependencies
pip install gradio requests python-dotenv

# 3. Create .env file
echo "STT_BACKEND=groq" > .env
echo "LLM_BACKEND=groq" >> .env
echo "GROQ_API_KEY=your_key_here" >> .env

# 4. Run
python app.py
```
Get a free Groq API key at: https://console.groq.com
Conclusion
Building this voice agent taught me a lot about chaining AI models together reliably. The key insight is that each component (STT, intent, tools) should be independently swappable — this makes the system both resilient and flexible for different hardware constraints.
The full source code is available on GitHub:
👉 https://github.com/AryanJaitely/voice-ai-agent
Built with Python, Gradio, Groq, Whisper & LLaMA 3