Mahi vignesh Valleti
# Building a Voice-Controlled AI Agent: Architecture, Models & Lessons Learned

How I built a fully functional voice-to-action AI agent using Groq, Streamlit, and a modular Python pipeline — including the messy parts nobody talks about.


## Introduction

What if you could just speak a command and have an AI agent write code, summarize text, and save files to your computer — all without typing a single line?

That's exactly what I set out to build: a Voice-Controlled Local AI Agent that listens to your voice (or typed commands), understands your intent, and executes real actions on your machine. Built with Python, Streamlit, and Groq's blazing-fast LLM inference, this project turned out to be one of the most educational builds I've done — full of unexpected challenges and satisfying breakthroughs.

In this article I'll walk you through:

  • The full system architecture
  • Why I chose Groq over OpenAI
  • How I handled compound commands, memory, and graceful degradation
  • The model benchmarking system I built (Llama 3.3 70B vs Llama 3.1 8B)
  • Every major challenge I hit — and how I solved them

## System Architecture

The agent follows a clean three-stage pipeline:

```text
🎙️ Audio Input
      ↓
[Stage 1] Speech-to-Text (STT)
      ↓
[Stage 2] Intent Classification (LLM)
      ↓
[Stage 3] Tool Execution
      ↓
📁 Output (files, code, summaries, chat)
```

Each stage is isolated into its own Python module, making the system easy to swap, extend, or debug independently.
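To make the chaining concrete, here is a minimal sketch of how the orchestration in `app.py` might wire the three stages together. The function names and stub return values are assumptions for illustration, not the repo's actual code; the stubs stand in for the real `stt.py`, `intent.py`, and `tools.py` modules.

```python
# Stand-ins for the real stt.py / intent.py / tools.py modules,
# so the pipeline shape is runnable on its own.
def transcribe_audio(audio_path, config):
    return "write a hello world script"  # Stage 1 would call an STT backend

def classify_intent(text, config):
    # Stage 2 would call an LLM; here we return a canned commands array
    return {"intent": "write_code", "commands": [
        {"intent": "write_code", "params": {"filename": "hello.py"}}
    ]}

def execute_tool(command, config):
    # Stage 3 would route to a handler in tools.py
    return {"success": True, "message": f"handled {command['intent']}"}

def run_pipeline(audio_path, config):
    """Chain the three stages; each lives in its own module in the real app."""
    text = transcribe_audio(audio_path, config)       # Stage 1: STT
    parsed = classify_intent(text, config)            # Stage 2: intent
    return [execute_tool(c, config) for c in parsed.get("commands", [])]
```

Because each stage only consumes the previous stage's output, any one of them can be swapped (say, a different STT backend) without touching the others.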

### File Structure

```text
voice_agent/
├── app.py          # Streamlit UI + pipeline orchestration
├── stt.py          # Speech-to-Text module (Groq / OpenAI / Whisper)
├── intent.py       # Intent classification via LLM
├── tools.py        # Tool execution engine
├── requirements.txt
└── output/         # All generated files land here
```

## Stage 1: Speech-to-Text (STT)

The STT module (stt.py) supports three backends, selectable from the sidebar:

| Backend | Model | Notes |
|---|---|---|
| Groq API | `whisper-large-v3` | Fast, free, cloud-based |
| OpenAI API | `whisper-1` | Accurate, paid |
| Whisper Local | `base` / `small` | Fully offline, slower |

For most users, Groq's Whisper Large V3 is the best choice — it's free, fast, and handles a wide variety of accents and audio quality. The local Whisper fallback is great for privacy-sensitive scenarios where you don't want audio leaving your machine.

```python
def transcribe_audio(audio_path: str, config: dict) -> str:
    backend = config.get("stt_backend", "Groq API")
    if backend == "Groq API":
        return _groq_stt(audio_path, config.get("api_key", ""))
    elif backend == "OpenAI API":
        return _openai_stt(audio_path, config.get("api_key", ""))
    else:
        return _whisper_local(audio_path)
```

## Stage 2: Intent Classification

This is where the intelligence lives. After transcription, the raw text is sent to an LLM with a carefully engineered system prompt that forces structured JSON output.

### The System Prompt Design

The key insight here was to design the prompt around a commands array rather than a single intent. This unlocked compound command support from day one:

```json
{
  "intent": "summarize_and_save",
  "detail": "Summarize text and save to file",
  "commands": [
    {
      "intent": "summarize_and_save",
      "description": "Summarize and write to summary.txt",
      "params": {
        "filename": "summary.txt",
        "text_to_summarize": "..."
      }
    }
  ]
}
```
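Because the LLM controls this structure, it pays to validate each command before executing it. The sketch below is a hypothetical validator, not code from the repo; the key names mirror the JSON schema above, and anything malformed degrades to `general_chat` rather than crashing.

```python
REQUIRED_KEYS = {"intent", "params"}
KNOWN_INTENTS = {"create_file", "write_code", "summarize",
                 "summarize_and_save", "general_chat"}

def validate_command(cmd: dict) -> dict:
    """Coerce a single command dict from the LLM into a safe, known shape."""
    if not REQUIRED_KEYS.issubset(cmd) or cmd["intent"] not in KNOWN_INTENTS:
        # Unknown or malformed command: degrade to chat instead of crashing
        return {"intent": "general_chat", "params": {"text": str(cmd)}}
    return cmd
```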

### Supported Intents

| Intent | Action |
|---|---|
| `create_file` | Create a new file or folder |
| `write_code` | Generate and save code to a file |
| `summarize` | Summarize text (output in chat) |
| `summarize_and_save` | Summarize and save result to a file |
| `general_chat` | Answer questions, explain concepts |

### LLM Backends

| Backend | Model | Speed | Cost |
|---|---|---|---|
| Groq API | `llama-3.3-70b-versatile` | ⚡ Very fast | Free tier |
| OpenAI API | `gpt-4o-mini` | Medium | Paid |
| Ollama Local | `llama3.2`, `mistral`, etc. | Slow | Free |

I strongly recommend Groq for this use case. The inference speed is genuinely remarkable — what takes GPT-4o-mini 3–5 seconds takes Groq under a second.


## Stage 3: Tool Execution

The tools.py module routes each classified intent to the correct handler function. Every handler follows the same contract — it receives params, the output directory, and the config, and returns a result dict:

```python
{
    "success": True,
    "message": "Code written to output/retry.py",
    "output": "# Generated code here..."
}
```
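One natural way to implement this routing is a dispatch table mapping intent names to handler functions, all sharing the contract above. This is an illustrative sketch, not the actual `tools.py`; the handler names and bodies here are assumptions.

```python
def _create_file(params, output_dir, config):
    # Real handler would touch the filesystem under output_dir
    return {"success": True, "message": f"created {params.get('filename')}"}

def _general_chat(params, output_dir, config):
    # Real handler would call the LLM and return its reply
    return {"success": True, "message": "chat reply", "output": "..."}

HANDLERS = {
    "create_file": _create_file,
    "general_chat": _general_chat,
    # write_code, summarize, summarize_and_save registered the same way
}

def execute(intent, params, output_dir, config):
    """Route an intent to its handler; unknown intents fall back to chat."""
    handler = HANDLERS.get(intent, _general_chat)
    return handler(params, output_dir, config)
```

Because every handler returns the same result dict shape, the UI layer can render any outcome without caring which tool ran.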

The write_code handler is the most interesting — it makes a second LLM call to actually generate the code, instructing the model to return only raw code with no markdown fences or explanation:

```python
def _generate_code(description: str, language: str, config: dict) -> str:
    prompt = f"""Generate clean, well-commented {language} code for:
{description}

Return ONLY the code. No markdown fences. No explanation outside the code."""
    return _llm_query(prompt, config)
```

## Bonus Features Implemented

### 1. Compound Commands

A single voice input can trigger multiple actions. For example:

"Summarize this article and save it to notes.txt"

The LLM detects two intents (summarize + create_file) and the agent executes them sequentially. The pipeline loops through the commands array and processes each one — with Human-in-the-Loop confirmation for file operations.
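The loop itself is simple once the confirmation hook is factored out. Here is a distilled sketch of that pattern (the function names and the injected `execute`/`confirm` callbacks are illustrative assumptions, not the app's exact code):

```python
FILE_INTENTS = {"create_file", "write_code", "summarize_and_save"}

def run_commands(commands, execute, confirm):
    """Run each parsed command in order.

    File-writing intents must pass the confirm() gate (Human-in-the-Loop);
    everything else executes immediately.
    """
    results = []
    for cmd in commands:
        if cmd["intent"] in FILE_INTENTS and not confirm(cmd):
            # Cancelled actions still produce a result, so they get logged
            results.append({"success": False, "message": "cancelled by user"})
            continue
        results.append(execute(cmd))
    return results
```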


### 2. Human-in-the-Loop (HITL)

Before executing any file-writing operation (create_file, write_code, summarize_and_save), the agent pauses and shows a confirmation dialog:

```text
⚠️ Confirmation Required
Intent: write_code
Action: Generate Python retry function and save to retry.py
[Parameters preview]

✅ Confirm & Execute     ❌ Cancel
```

This is a critical safety feature. Without it, a misheard command could overwrite important files. Even cancelled actions are logged to the session history for full traceability.


### 3. Graceful Degradation

Real-world voice agents fail constantly — bad audio, network timeouts, quota exceeded, unknown intents. Instead of crashing or showing a raw Python traceback, every failure point is caught and routed to a user-friendly message:

```python
try:
    text = transcribe_audio(audio_path, config)  # any STT backend
except Exception as e:
    err = str(e)
    if "api" in err.lower() or "key" in err.lower():
        friendly = "API key error — check your key in the sidebar."
    elif "timeout" in err.lower():
        friendly = "Network error — check your internet connection."
    elif "format" in err.lower():
        friendly = "Audio format unsupported — try WAV or MP3."
    else:
        friendly = f"STT failed: {err}"
```

Unknown intents also degrade gracefully — instead of crashing, they fall back to general_chat and the agent explains what it understood.


### 4. Session Memory

The agent maintains two types of memory throughout a session:

Action History — every command, intent, result, and timestamp is stored and displayed in the History tab with success/failure statistics and intent frequency chips.

Chat Context — a rolling window of the last 20 conversation turns is passed to the LLM on every request, enabling follow-up commands like:

"Now make it handle exceptions too"

The LLM remembers what was just generated and adds exception handling to it — without the user repeating themselves.
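Building that rolling window is mostly list slicing. A minimal sketch, assuming the history is stored as a list of chat-style message dicts (the function name is mine, not the repo's):

```python
MAX_TURNS = 20  # rolling context window, as described above

def build_messages(system_prompt, history, user_text):
    """Assemble the LLM request: system prompt + last MAX_TURNS turns + new input.

    Trimming keeps the prompt bounded no matter how long the session runs.
    """
    recent = history[-MAX_TURNS:]
    return ([{"role": "system", "content": system_prompt}]
            + recent
            + [{"role": "user", "content": user_text}])
```

A fixed window is the simplest memory strategy; summarizing older turns instead of dropping them would be a natural next step.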


### 5. Model Benchmarking

The benchmarking tab lets you compare two Groq models head-to-head on any prompt:

  • Model A: Llama 3.3 70B Versatile — large, powerful, deep reasoning
  • Model B: Llama 3.1 8B Instant — small, ultra-fast, great for simple tasks

Both models run on the same Groq API key — no extra cost. The benchmark shows:

  • ⏱ Latency in seconds
  • 🔤 Total tokens used
  • 🚀 Tokens per second
  • 🏆 Speed winner highlighted in green
  • 📈 Cumulative stats across multiple runs
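The stats themselves reduce to a timed call plus two divisions. A sketch of how the benchmark tab might compute them, with the model call injected as a callback so the timing logic stays backend-agnostic (names are illustrative assumptions):

```python
import time

def benchmark(call_model, prompt):
    """Time one model call and derive latency / token-rate stats.

    call_model is expected to return (completion_text, total_tokens),
    e.g. pulled from the API response's usage field.
    """
    start = time.perf_counter()
    text, total_tokens = call_model(prompt)
    latency = time.perf_counter() - start
    return {
        "latency_s": round(latency, 3),
        "total_tokens": total_tokens,
        "tokens_per_s": round(total_tokens / latency, 1) if latency else 0.0,
    }
```

Running this twice, once per model, and comparing `latency_s` gives the "speed winner" highlight.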

After running several benchmarks myself, here's what I found:

| Task Type | Winner |
|---|---|
| Complex code generation | Llama 3.3 70B (better quality) |
| Simple Q&A / chat | Llama 3.1 8B (3–4× faster) |
| Summarization | Roughly equal |

## Challenges & How I Solved Them

### Challenge 1: Groq API Key vs OpenAI Key — Two Fields, One Config

**Problem:** The app started with a single API key field, but Groq and OpenAI use different keys. Switching backends required manually changing the key every time.

**Solution:** Added two separate key fields in the sidebar. The active key is resolved automatically based on which backend is selected:

```python
if "Groq" in stt_backend or "Groq" in llm_backend:
    active_api_key = groq_key
elif "OpenAI" in stt_backend or "OpenAI" in llm_backend:
    active_api_key = openai_key
```

### Challenge 2: LLM Returns Markdown, Not Pure JSON

**Problem:** Even when instructed to return only JSON, LLMs often wrap their output in triple-backtick markdown fences tagged `json`.

**Solution:** Built a robust parser that strips the fences and extracts the JSON object using a regex, with a safe fallback to `general_chat` if parsing still fails:

```python
import json
import re

def _parse_intent(raw: str) -> dict:
    # Strip any markdown fences the model added despite instructions
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Extract the outermost JSON object
    match = re.search(r"\{.*\}", cleaned, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    # Safe fallback: treat the raw text as a chat turn
    return {"intent": "general_chat", "detail": raw[:500], "commands": [...]}
```

---

### Challenge 3: Decommissioned Models Break Silently at Runtime

**Problem:** During development, Groq deprecated `gemma2-9b-it`, a model I was using for benchmarking. The app crashed at runtime with a 400 error, not at startup.

**Solution:** Replaced the decommissioned model with `llama-3.1-8b-instant` (currently active on Groq). The broader lesson: **always wrap model API calls in try/except** and surface the error message directly to the user — never let a model error become a silent failure.
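That lesson fits in a small wrapper. A sketch (my own helper, not code from the repo), with the model call injected so it works for any backend:

```python
def safe_model_call(call, model_id):
    """Run one model call and turn any exception into a visible result.

    A decommissioned model surfaces as an API error (e.g. a 400),
    so the user sees exactly which model failed and why.
    """
    try:
        return {"success": True, "output": call(model_id)}
    except Exception as e:
        return {"success": False,
                "message": f"Model '{model_id}' failed: {e}. "
                           "It may be deprecated; try another model."}
```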

---

### Challenge 4: Streamlit State Management with HITL

**Problem:** The Human-in-the-Loop confirmation requires the app to pause mid-pipeline, show a confirmation UI, and then resume execution on the next Streamlit rerun. Managing this state without losing the action context was tricky.

**Solution:** Used `st.session_state.pending_action` to store the action dict between reruns. When a file intent is detected, the action is saved to state and `st.rerun()` is called. The confirmation UI reads from state and executes (or cancels) on the next button press.
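The pattern is easier to see stripped of UI code. In this sketch a plain dict stands in for `st.session_state` so it runs outside Streamlit; the function names are illustrative, and the real app would call `st.rerun()` where noted:

```python
# st.session_state behaves like a dict that survives reruns;
# a plain dict stands in for it here.
session_state = {}

def handle_intent(action, is_file_intent):
    """First rerun: stash file-writing actions and ask for confirmation."""
    if is_file_intent:
        session_state["pending_action"] = action
        return "awaiting_confirmation"   # real app calls st.rerun() here
    return "executed"

def on_confirm():
    """Next rerun: the Confirm button executes the stashed action."""
    action = session_state.pop("pending_action", None)
    return f"executed {action['intent']}" if action else "nothing pending"
```

Popping the action (rather than just reading it) guarantees a confirmation can only fire once, even across repeated reruns.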

---

### Challenge 5: OpenAI Quota Errors in Benchmarking

**Problem:** A perfectly valid OpenAI API key can still fail with a `429 insufficient_quota` error if the account has no billing credits — this confused early testers, who thought their key was broken.

**Solution:** The benchmarking tab wraps each model call in try/except and displays the error inline inside the result card rather than crashing the whole tab. This way, if one model fails, the other still shows its result.

---

## Tech Stack Summary

| Component | Technology |
|---|---|
| UI Framework | Streamlit |
| STT (Cloud) | Groq Whisper Large V3 |
| STT (Local) | OpenAI Whisper (local) |
| LLM (Primary) | Groq — Llama 3.3 70B Versatile |
| LLM (Benchmark) | Groq — Llama 3.1 8B Instant |
| Language | Python 3.10+ |
| File output | Local filesystem (`output/` directory) |

---

## What I'd Build Next

- **Text-to-Speech output** — have the agent speak its responses back using Groq's TTS or ElevenLabs
- **File reading intent** — "Read my notes.txt and summarize it"
- **Web search intent** — "Search for the latest Python 3.13 features and save a summary"
- **Scheduled commands** — "Run this every morning at 9am"
- **Multi-agent routing** — route different intent types to specialized sub-agents

---

## Conclusion

Building a voice-controlled AI agent from scratch taught me that the hard part isn't the AI — it's the **plumbing**: state management, error handling, audio format compatibility, and API quirks. The core pipeline took a day to build. Making it robust, user-friendly, and production-grade took much longer.

If you're building something similar, my biggest advice is: **handle failures first, features second**. A voice agent that crashes on bad audio or an expired API key is worse than no agent at all.

The full source code is available and runs with a single command:

```bash
pip install -r requirements.txt
streamlit run app.py
```
