Ever wanted to just talk to your computer and have it actually do something useful: create files, write code, summarize text? That's exactly what I built for this project.
This article covers the architecture, the models I picked, the challenges I hit, and the lessons learned.
## Architecture
The system has two parts talking over HTTP:
- FastAPI backend – handles all AI inference and file operations
- Streamlit frontend – handles audio input and displays results
Every request goes through three stages:

```
Audio Input
    ↓
[STT] Groq Whisper-large-v3 → transcribed text
    ↓
[Intent] Groq Llama-3.1-8b → JSON task list
    ↓
[Execute] Local tools → create file / write code / summarize / chat
    ↓
Display result in UI
```
Keeping the backend and frontend separate means I can swap out the UI without touching any AI logic.
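The three-stage flow above can be sketched as a single handler that chains the stages. The function names here are hypothetical stand-ins, and the stage bodies are stubs where the real backend would call Groq and the local tools:

```python
# Hypothetical sketch of the STT -> intent -> execute pipeline.
# Stage functions are stubs standing in for the real Groq calls.

def transcribe(audio_bytes: bytes) -> str:
    # Stub for the Whisper speech-to-text call
    return "create a file named notes.txt"

def classify(text: str) -> dict:
    # Stub for the Llama intent-classification call
    return {"tasks": [{"intent": "create_file",
                       "parameters": {"filename": "notes.txt", "content": ""},
                       "confidence": 0.9}]}

def execute(task: dict) -> str:
    # Stub for the local tool dispatch
    return f"ran {task['intent']}"

def handle_request(audio_bytes: bytes) -> list[str]:
    text = transcribe(audio_bytes)
    plan = classify(text)
    return [execute(task) for task in plan["tasks"]]
```

Because each stage only consumes the previous stage's output, any one of them can be swapped (e.g. local Whisper instead of Groq) without touching the others.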
## Models I Used
| Stage | Model | Why |
|---|---|---|
| Speech-to-Text | whisper-large-v3 (Groq) | Best open STT model, fast via Groq |
| Intent Classification | llama-3.1-8b-instant | Small, fast, reliable at JSON output |
| Code Generation | llama-3.1-8b-instant | Fast enough for short scripts |
| Summarization | llama-3.1-8b-instant | Better quality, acceptable latency |
### ⚠️ Why Groq API instead of local Whisper?
The assignment recommended running Whisper locally via HuggingFace. However, whisper-large-v3 needs at least 6GB of GPU VRAM to run at a usable speed. On CPU it takes 30–60 seconds per clip – far too slow for an interactive UI.
Groq runs the exact same model on their hardware, returning results in ~700 ms. The model is identical; only the compute location differs.
## Intent Classification – The Tricky Part
Getting the LLM to output clean, parseable JSON every time was harder than expected. Language models naturally want to add explanations and wrap things in markdown. Both of those break `json.loads()`.
The fix was a very strict system prompt:
```python
SYSTEM_PROMPT = """
You are a strict JSON routing agent.
Return ONLY valid JSON. No explanation. No markdown. No extra text.

Available intents:
- create_file → { filename, content }
- write_code → { filename, language, description }
- summarize → { text, save_to }
- chat → { message }

Always return: { "tasks": [ ...task objects... ] }
Each task: { "intent", "parameters", "confidence" }

Multiple commands → multiple tasks in the list.
If unclear → default to "chat".
"""
```
Supporting a tasks array from day one is what enables compound commands – the model naturally puts two intents in the list when the user says "write a file and summarize it."
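Even with the strict prompt, the response still has to survive `json.loads()`. A minimal sketch of a defensive parser (the function name `parse_intent` is my own for illustration) that strips markdown fences and falls back to the `chat` intent on failure might look like this:

```python
import json

def parse_intent(raw: str) -> dict:
    """Strip markdown fences, then parse; fall back to chat on failure."""
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line (possibly "```json") and the closing fence
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit("```", 1)[0]
    try:
        data = json.loads(cleaned)
        if isinstance(data, dict) and "tasks" in data:
            return data
    except json.JSONDecodeError:
        pass
    # Catch-all: treat the whole utterance as a chat message
    return {"tasks": [{"intent": "chat",
                       "parameters": {"message": raw},
                       "confidence": 0.0}]}
```

This way a compound command yields a multi-entry `tasks` list, and garbage output degrades gracefully to chat instead of crashing the request.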
## File Safety
Since the system writes files based on user voice input, path traversal is a real concern. The fix: a sandboxing function that resolves the absolute path and rejects anything outside `output/`.
```python
import os

OUTPUT_DIR = os.path.abspath("output")

def _safe_path(filename: str) -> str | None:
    target = os.path.abspath(os.path.join(OUTPUT_DIR, filename))
    # Trailing os.sep is critical – without it, "output_evil/" would pass
    if not target.startswith(OUTPUT_DIR + os.sep):
        return None
    return target
```
All generated files go into output/. Nothing else is writable.
## Session Memory
The agent keeps two parallel histories:
- Action history – timestamped log shown in the UI sidebar
- Chat context – last 3 user/assistant pairs sent to the LLM on every classify call
This means the user can say "now do the same for the other file" and the model understands the reference. Without it, every request is completely stateless.
## Latency Benchmarks
Averaged across 20 runs with a 5–10 second audio clip:
| Stage | Model | Avg Latency |
|---|---|---|
| Speech-to-Text | whisper-large-v3 | ~720 ms |
| Intent Classification | llama-3.1-8b-instant | ~380 ms |
| Code Generation | llama-3.1-8b-instant | ~950 ms |
| Summarization | llama-3.1-8b-instant | ~1,350 ms |
Total end-to-end for a write_code request: ~2.0–2.5 seconds. Fast enough to feel responsive.
Interesting finding: intent classification is the fastest stage despite being the most "reasoning-heavy" step, because the strict JSON-only prompt forces the model to skip all its natural-language preamble. Constraining the output format is free speed.
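Per-stage numbers like these are easy to collect with a small timing wrapper. A minimal sketch (the `timed` helper is my own illustration, not code from the project) using a monotonic clock:

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_ms) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Example: time an arbitrary call; in the app this would wrap each pipeline stage
result, ms = timed(sum, range(1000))
```

`time.perf_counter()` is preferable to `time.time()` here because it is monotonic and high-resolution, so short stage durations aren't distorted by clock adjustments.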
## Challenges I Faced
1. **JSON parse failures** – Even with a strict prompt, the model occasionally wraps output in markdown fences. Added a fallback that strips backticks before parsing, plus a catch-all that defaults to the `chat` intent on failure.
2. **Audio format handling** – Groq's Whisper API requires the correct file extension. Sending a `.wav` file named `audio` with no extension caused silent failures. Fix: always preserve the original filename and extension.
3. **Two-process state** – Streamlit and FastAPI are separate processes. If the backend restarts, all session history is lost. A future fix would be writing to SQLite on every `memory.add()` call.
4. **Browser mic compatibility** – The streamlit-audiorecorder component works great in Chrome but has issues in Firefox and Safari. Documented this in the README.
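The SQLite fix from challenge 3 could look roughly like this. It's a sketch only: the class name, schema, and `add()` signature are my assumptions, chosen to mirror the in-memory `memory.add()` calls so history survives a backend restart:

```python
import sqlite3

class PersistentMemory:
    """Sketch of the proposed fix: mirror every memory.add() into SQLite."""

    def __init__(self, path: str = ":memory:"):
        # Use a real file path (e.g. "output/history.db") to survive restarts;
        # ":memory:" is used here only to keep the demo self-contained.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS history "
            "(ts TEXT DEFAULT CURRENT_TIMESTAMP, role TEXT, content TEXT)"
        )

    def add(self, role: str, content: str) -> None:
        # Parameterized query: voice-derived text is untrusted input
        self.db.execute(
            "INSERT INTO history (role, content) VALUES (?, ?)",
            (role, content),
        )
        self.db.commit()

    def all(self) -> list[tuple[str, str]]:
        return self.db.execute("SELECT role, content FROM history").fetchall()
```

Committing on every `add()` trades a little write latency for durability, which is the right trade-off for a low-traffic interactive agent.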
## Bonus Features Built
- ✅ Compound commands – one audio clip triggers multiple tasks
- ✅ Human-in-the-loop – optional confirmation before executing file operations
- ✅ Session memory – rolling chat context + action history sidebar
- ✅ Latency benchmarking – toggle in settings to show model speeds
## Links

- GitHub: github.com/IshanNaikele/voice-agent

Built with FastAPI · Streamlit · Groq API · Whisper · Llama-3