How I built a fully functional voice-to-action AI agent using Groq, Streamlit, and a modular Python pipeline — including the messy parts nobody talks about.
## Introduction
What if you could just speak a command and have an AI agent write code, summarize text, and save files to your computer — all without typing a single line?
That's exactly what I set out to build: a Voice-Controlled Local AI Agent that listens to your voice (or typed commands), understands your intent, and executes real actions on your machine. Built with Python, Streamlit, and Groq's blazing-fast LLM inference, this project turned out to be one of the most educational builds I've done — full of unexpected challenges and satisfying breakthroughs.
In this article I'll walk you through:
- The full system architecture
- Why I chose Groq over OpenAI
- How I handled compound commands, memory, and graceful degradation
- The model benchmarking system I built (Llama 3.3 70B vs Llama 3.1 8B)
- Every major challenge I hit — and how I solved them
## System Architecture
The agent follows a clean three-stage pipeline:
```
🎙️ Audio Input
        ↓
[Stage 1] Speech-to-Text (STT)
        ↓
[Stage 2] Intent Classification (LLM)
        ↓
[Stage 3] Tool Execution
        ↓
📁 Output (files, code, summaries, chat)
```
Each stage is isolated into its own Python module, making the system easy to swap, extend, or debug independently.
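To make the separation concrete, here is a minimal sketch of how the orchestration chains the three stages. The function bodies are stand-ins (the real modules are described below); only the shape of the pipeline is the point:

```python
# Illustrative sketch of the three-stage pipeline. The real app.py wires
# these stages to Streamlit widgets; the bodies here are stand-ins.

def transcribe_audio(audio_path: str, config: dict) -> str:
    return "write a python retry function"           # stand-in for stt.py

def classify_intent(text: str, config: dict) -> dict:
    return {"intent": "write_code", "commands": []}  # stand-in for intent.py

def execute(parsed: dict, config: dict) -> dict:
    return {"success": True, "message": f"handled {parsed['intent']}"}  # tools.py

def run_pipeline(audio_path: str, config: dict) -> dict:
    text = transcribe_audio(audio_path, config)   # Stage 1: STT
    parsed = classify_intent(text, config)        # Stage 2: intent classification
    return execute(parsed, config)                # Stage 3: tool execution
```

Because each stage is just a function taking the previous stage's output, swapping a backend never touches the other two stages.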
### File Structure

```
voice_agent/
├── app.py            # Streamlit UI + pipeline orchestration
├── stt.py            # Speech-to-Text module (Groq / OpenAI / Whisper)
├── intent.py         # Intent classification via LLM
├── tools.py          # Tool execution engine
├── requirements.txt
└── output/           # All generated files land here
```
## Stage 1: Speech-to-Text (STT)
The STT module (stt.py) supports three backends, selectable from the sidebar:
| Backend | Model | Notes |
|---|---|---|
| Groq API | `whisper-large-v3` | Fast, free, cloud-based |
| OpenAI API | `whisper-1` | Accurate, paid |
| Whisper Local | `base` / `small` | Fully offline, slower |
For most users, Groq's Whisper Large V3 is the best choice — it's free, fast, and handles a wide variety of accents and audio quality. The local Whisper fallback is great for privacy-sensitive scenarios where you don't want audio leaving your machine.
```python
def transcribe_audio(audio_path: str, config: dict) -> str:
    backend = config.get("stt_backend", "Groq API")
    if backend == "Groq API":
        return _groq_stt(audio_path, config.get("api_key", ""))
    elif backend == "OpenAI API":
        return _openai_stt(audio_path, config.get("api_key", ""))
    else:
        return _whisper_local(audio_path)
```
## Stage 2: Intent Classification
This is where the intelligence lives. After transcription, the raw text is sent to an LLM with a carefully engineered system prompt that forces structured JSON output.
### The System Prompt Design

The key insight was to design the prompt around a `commands` array rather than a single intent. This unlocked compound-command support from day one:
```json
{
  "intent": "summarize_and_save",
  "detail": "Summarize text and save to file",
  "commands": [
    {
      "intent": "summarize_and_save",
      "description": "Summarize and write to summary.txt",
      "params": {
        "filename": "summary.txt",
        "text_to_summarize": "..."
      }
    }
  ]
}
```
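The schema is only useful if downstream code can rely on it, so it is worth normalizing whatever the model actually returns. A small sketch, assuming the schema shown here (the real intent.py may differ):

```python
def normalize_intent(parsed: dict) -> dict:
    """Guarantee the parsed result always carries a non-empty commands list.

    Sketch only: the contract is the schema shown above, with top-level
    intent/detail keys plus a commands array.
    """
    parsed.setdefault("intent", "general_chat")
    parsed.setdefault("detail", "")
    if not parsed.get("commands"):
        # Fall back to wrapping the top-level intent as a single command.
        parsed["commands"] = [{
            "intent": parsed["intent"],
            "description": parsed["detail"],
            "params": {},
        }]
    return parsed
```

With this in place, the execution loop can always iterate over `commands` without special-casing older or malformed replies.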
### Supported Intents

| Intent | Action |
|---|---|
| `create_file` | Create a new file or folder |
| `write_code` | Generate and save code to a file |
| `summarize` | Summarize text (output in chat) |
| `summarize_and_save` | Summarize and save result to a file |
| `general_chat` | Answer questions, explain concepts |
### LLM Backends

| Backend | Model | Speed | Cost |
|---|---|---|---|
| Groq API | `llama-3.3-70b-versatile` | ⚡ Very fast | Free tier |
| OpenAI API | `gpt-4o-mini` | Medium | Paid |
| Ollama Local | `llama3.2`, `mistral`, etc. | Slow | Free |
I strongly recommend Groq for this use case. The inference speed is genuinely remarkable — what takes GPT-4o-mini 3–5 seconds takes Groq under a second.
## Stage 3: Tool Execution
The tools.py module routes each classified intent to the correct handler function. Every handler follows the same contract — it receives params, the output directory, and the config, and returns a result dict:
```python
{
    "success": True,
    "message": "Code written to output/retry.py",
    "output": "# Generated code here..."
}
```
The write_code handler is the most interesting — it makes a second LLM call to actually generate the code, instructing the model to return only raw code with no markdown fences or explanation:
```python
def _generate_code(description: str, language: str, config: dict) -> str:
    prompt = f"""Generate clean, well-commented {language} code for:
{description}
Return ONLY the code. No markdown fences. No explanation outside the code."""
    return _llm_query(prompt, config)
```
## Bonus Features Implemented
### 1. Compound Commands
A single voice input can trigger multiple actions. For example:
"Summarize this article and save it to notes.txt"
The LLM detects two intents (summarize + create_file) and the agent executes them sequentially. The pipeline loops through the commands array and processes each one — with Human-in-the-Loop confirmation for file operations.
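Stripped to its essentials, that loop looks something like this sketch. The handlers and the confirmation gate are injected as callables so the sketch stays self-contained; the real app reads them from tools.py and the Streamlit confirmation dialog:

```python
# Intents that write to disk and therefore need HITL confirmation.
FILE_INTENTS = {"create_file", "write_code", "summarize_and_save"}

def run_commands(commands: list, execute, confirm) -> list:
    """Process a parsed commands array sequentially (sketch).

    `execute(cmd)` runs a single command; `confirm(cmd)` is the
    human-in-the-loop gate and returns True/False for file intents.
    """
    results = []
    for cmd in commands:
        if cmd["intent"] in FILE_INTENTS and not confirm(cmd):
            results.append({"success": False, "message": "Cancelled by user"})
            continue
        results.append(execute(cmd))
    return results
```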
### 2. Human-in-the-Loop (HITL)
Before executing any file-writing operation (create_file, write_code, summarize_and_save), the agent pauses and shows a confirmation dialog:
```
⚠️ Confirmation Required
Intent: write_code
Action: Generate Python retry function and save to retry.py
[Parameters preview]
✅ Confirm & Execute    ❌ Cancel
```
This is a critical safety feature. Without it, a misheard command could overwrite important files. Even cancelled actions are logged to the session history for full traceability.
### 3. Graceful Degradation
Real-world voice agents fail constantly — bad audio, network timeouts, quota exceeded, unknown intents. Instead of crashing or showing a raw Python traceback, every failure point is caught and routed to a user-friendly message:
```python
try:
    text = transcribe_audio(audio_path, config)
except Exception as e:
    err = str(e)
    if "api" in err.lower() or "key" in err.lower():
        friendly = "API key error — check your key in the sidebar."
    elif "timeout" in err.lower():
        friendly = "Network error — check your internet connection."
    elif "format" in err.lower():
        friendly = "Audio format unsupported — try WAV or MP3."
    else:
        friendly = f"STT failed: {err}"
```
Unknown intents also degrade gracefully — instead of crashing, they fall back to general_chat and the agent explains what it understood.
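A plain dict lookup with a default makes that fallback explicit. A sketch with stub handlers standing in for the real ones in tools.py:

```python
def _chat(params: dict) -> dict:       # stub handler for the sketch
    return {"success": True, "message": "chat reply"}

def _create(params: dict) -> dict:     # stub handler for the sketch
    return {"success": True, "message": "file created"}

HANDLERS = {"general_chat": _chat, "create_file": _create}

def route(intent: str, params: dict) -> dict:
    # Unknown intents degrade to general_chat instead of raising KeyError.
    handler = HANDLERS.get(intent, _chat)
    return handler(params)
```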
### 4. Session Memory
The agent maintains two types of memory throughout a session:
**Action History** — every command, intent, result, and timestamp is stored and displayed in the History tab with success/failure statistics and intent frequency chips.

**Chat Context** — a rolling window of the last 20 conversation turns is passed to the LLM on every request, enabling follow-up commands like:
"Now make it handle exceptions too"
The LLM remembers what was just generated and adds exception handling to it — without the user repeating themselves.
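A `collections.deque` with `maxlen` gives you the rolling window almost for free. A sketch (the real app keeps this in `st.session_state`):

```python
from collections import deque

class ChatContext:
    """Rolling window of the last N conversation turns (sketch)."""

    def __init__(self, max_turns: int = 20):
        # Old turns fall off automatically once the window is full.
        self.turns = deque(maxlen=max_turns)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def as_messages(self) -> list:
        # Sent to the LLM on every request so follow-ups have context.
        return list(self.turns)
```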
### 5. Model Benchmarking
The benchmarking tab lets you compare two Groq models head-to-head on any prompt:
- Model A: Llama 3.3 70B Versatile — large, powerful, deep reasoning
- Model B: Llama 3.1 8B Instant — small, ultra-fast, great for simple tasks
Both models run on the same Groq API key — no extra cost. The benchmark shows:
- ⏱ Latency in seconds
- 🔤 Total tokens used
- 🚀 Tokens per second
- 🏆 Speed winner highlighted in green
- 📈 Cumulative stats across multiple runs
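The per-run math is just a monotonic clock plus the token count from the API response. A sketch, assuming the model call returns the reply text and a total token count (on the real Groq response that should be `usage.total_tokens`):

```python
import time

def benchmark(call_model, prompt: str) -> dict:
    """Time one model call and derive tokens/sec (sketch).

    `call_model(prompt)` is assumed to return (text, total_tokens).
    """
    start = time.perf_counter()
    text, total_tokens = call_model(prompt)
    latency = time.perf_counter() - start
    return {
        "latency_s": round(latency, 3),
        "total_tokens": total_tokens,
        "tokens_per_s": round(total_tokens / latency, 1) if latency > 0 else 0.0,
        "text": text,
    }
```

Running this once per model on the same prompt gives you the latency and throughput numbers side by side; the winner is just a comparison of the two dicts.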
After running several benchmarks myself, here's what I found:
| Task Type | Winner |
|---|---|
| Complex code generation | Llama 3.3 70B (better quality) |
| Simple Q&A / chat | Llama 3.1 8B (3–4× faster) |
| Summarization | Roughly equal |
## Challenges & How I Solved Them
### Challenge 1: Groq API Key vs OpenAI Key — Two Fields, One Config
**Problem:** The app started with a single API key field, but Groq and OpenAI use different keys. Switching backends required manually changing the key every time.

**Solution:** Added two separate key fields in the sidebar. The active key is resolved automatically based on which backend is selected:
```python
if "Groq" in stt_backend or "Groq" in llm_backend:
    active_api_key = groq_key
elif "OpenAI" in stt_backend or "OpenAI" in llm_backend:
    active_api_key = openai_key
```
---

### Challenge 2: LLM Returns Markdown, Not Pure JSON

**Problem:** Even when instructed to return only JSON, LLMs often wrap their output in markdown fences like `` ```json ... ``` ``.

**Solution:** Built a robust parser that strips fences and extracts the JSON object using regex, with a safe fallback to `general_chat` if parsing still fails:
```python
import json
import re

def _parse_intent(raw: str) -> dict:
    # Strip markdown fences (``` or ```json), then pull out the JSON object.
    cleaned = re.sub(r"```(?:json)?", "", raw).strip().rstrip("`").strip()
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    # Safe fallback: treat the raw reply as general chat.
    return {"intent": "general_chat", "detail": raw[:500], "commands": [...]}
```
---
### Challenge 3: Decommissioned Models Break Silently at Runtime
**Problem:** During development, Groq deprecated `gemma2-9b-it` — a model I was using for benchmarking. The app crashed at runtime with a 400 error, not at startup.
**Solution:** Replaced the decommissioned model with `llama-3.1-8b-instant` (currently active on Groq). The broader lesson: **always wrap model API calls in try/except** and surface the error message directly to the user — never let a model error become a silent failure.
---
### Challenge 4: Streamlit State Management with HITL
**Problem:** The Human-in-the-Loop confirmation requires the app to pause mid-pipeline, show a confirmation UI, and then resume execution on the next Streamlit rerun. Managing this state without losing the action context was tricky.
**Solution:** Used `st.session_state.pending_action` to store the action dict between reruns. When a file intent is detected, the action is saved to state and `st.rerun()` is called. The confirmation UI reads from state and executes (or cancels) on the next button press.
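Stripped of the Streamlit specifics, the pattern is a two-phase state machine. In this sketch a plain dict stands in for `st.session_state`, with a rerun implied between the phases:

```python
# Sketch of the HITL pending-action pattern; a plain dict stands in for
# st.session_state, and a Streamlit rerun happens between the two phases.

def detect_file_intent(state: dict, action: dict) -> None:
    # Phase 1: park the action instead of executing it, then rerun.
    state["pending_action"] = action

def on_confirm(state: dict, execute) -> dict:
    # Phase 2 (next rerun): execute the parked action and clear it.
    action = state.pop("pending_action", None)
    if action is None:
        return {"success": False, "message": "Nothing pending"}
    return execute(action)

def on_cancel(state: dict) -> dict:
    # Cancelled actions are still returned so they can be logged to history.
    action = state.pop("pending_action", None)
    return {"success": False, "message": "Cancelled", "action": action}
```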
---
### Challenge 5: OpenAI Quota Errors in Benchmarking
**Problem:** A perfectly valid OpenAI API key can still fail with a `429 insufficient_quota` error if the account has no billing credits — this confused early testers who thought their key was broken.
**Solution:** The benchmarking tab wraps each model call in try/except and displays the error inline inside the result card rather than crashing the whole tab. This way, if one model fails, the other still shows its result.
---
## Tech Stack Summary
| Component | Technology |
|---|---|
| UI Framework | Streamlit |
| STT (Cloud) | Groq Whisper Large V3 |
| STT (Local) | OpenAI Whisper (local) |
| LLM (Primary) | Groq — Llama 3.3 70B Versatile |
| LLM (Benchmark) | Groq — Llama 3.1 8B Instant |
| Language | Python 3.10+ |
| File output | Local filesystem (`output/` directory) |
---
## What I'd Build Next
- **Text-to-Speech output** — have the agent speak its responses back using Groq's TTS or ElevenLabs
- **File reading intent** — "Read my notes.txt and summarize it"
- **Web search intent** — "Search for the latest Python 3.13 features and save a summary"
- **Scheduled commands** — "Run this every morning at 9am"
- **Multi-agent routing** — route different intent types to specialized sub-agents
---
## Conclusion
Building a voice-controlled AI agent from scratch taught me that the hard parts aren't the AI — it's the **plumbing**: state management, error handling, audio format compatibility, and API quirks. The core pipeline took a day to build. Making it robust, user-friendly, and production-grade took much longer.
If you're building something similar, my biggest advice is: **handle failures first, features second**. A voice agent that crashes on bad audio or an expired API key is worse than no agent at all.
The full source code is available and runs with a single command:
```bash
pip install -r requirements.txt
streamlit run app.py
```