Introduction
What if you could speak a command and have an AI agent create files, write code, or summarize text on your local machine in real time? That is exactly what I built for the Mem0 AI/ML Generative AI Developer Intern Assignment.
In this article I will walk through the architecture, the models I chose, the challenges I faced, and how I solved them.
What the Agent Does
The Voice AI Agent is a full-stack application that accepts voice input via microphone or audio file upload, converts speech to text using Groq Whisper Large v3, classifies the intent using LLaMA 3.3 70B, executes local tools, and displays the full pipeline in a dark-mode chat UI.
Supported intents:
- create_file - Create a text or markdown file
- write_code - Generate code and save to a file
- summarize - Summarize provided text
- general_chat - Conversational response
- compound - Multiple intents in one command (bonus feature)
Architecture
The system has five main components:
Frontend (index.html) - Vanilla HTML, CSS, and JavaScript. Uses the MediaRecorder API for mic input and supports audio file upload.
Backend (main.py) - FastAPI server exposing POST /process-audio, POST /process-text, GET /history, and GET /benchmark endpoints.
stt.py - Speech-to-Text. Tries local Whisper first, then falls back to Groq Whisper Large v3 (and finally OpenAI).
intent.py - Intent classification using LLaMA 3.3 70B, returns structured JSON with intent, filename, language, content_hint, and confidence.
tools.py - Tool execution handling file creation, code generation, summarization, and general chat. All files are sandboxed inside an output/ folder.
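The output/ sandbox in tools.py can be sketched as a path-resolution check. This is a minimal illustration, not the repo's actual code; `safe_path` is a hypothetical helper name:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside the output/ sandbox.

    Raises ValueError if the path would escape the sandbox,
    e.g. via '..' segments or an absolute path.
    """
    candidate = (OUTPUT_DIR / filename).resolve()
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"Refusing to write outside sandbox: {filename}")
    return candidate
```

Resolving before checking is the key step: `output/../secret.txt` resolves to a path outside the sandbox and is rejected, while nested paths like `notes/todo.md` still pass.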
Pipeline Flow
- Audio blob is sent to the backend
- stt.py transcribes it via Groq Whisper
- intent.py classifies the transcription via LLaMA 3.3 70B
- tools.py executes the correct tool
- The result is sent back and displayed in the chat UI
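The flow above can be sketched as one orchestration function. The real main.py wires these stages through FastAPI endpoints; here each stage is a stub so only the control flow is shown:

```python
# Hypothetical sketch of the STT -> intent -> tool pipeline.
# Each stage is stubbed; the real stages live in stt.py,
# intent.py, and tools.py respectively.

def transcribe(audio_bytes: bytes) -> str:
    return "create a file called notes.txt"  # stub transcription

def classify(text: str) -> dict:
    return {"intent": "create_file", "filename": "notes.txt"}  # stub

def execute(intent: dict) -> dict:
    return {"status": "ok", "file": intent.get("filename")}  # stub

def process_audio(audio_bytes: bytes) -> dict:
    """Run the full pipeline for one request and return every
    intermediate result, so the UI can display each stage."""
    text = transcribe(audio_bytes)
    intent = classify(text)
    result = execute(intent)
    return {"transcription": text, "intent": intent, "result": result}
```

Returning the intermediates (not just the final result) is what lets the chat UI show the whole pipeline per request.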
Models Chosen
STT: Groq Whisper Large v3
My local machine runs Windows with no dedicated GPU. Running whisper-base locally via HuggingFace took 45 to 90 seconds per transcription on CPU, which is completely unusable for real-time interaction.
Groq runs Whisper Large v3 at approximately 100x real-time speed. A 5-second audio clip transcribes in under 200ms.
The code maintains a fallback chain: Local Whisper first, then Groq, then OpenAI. So anyone with a GPU gets the local model automatically.
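The fallback chain boils down to trying backends in order and keeping the first success. A minimal sketch (the backend callables here are stand-ins, not the repo's real functions):

```python
def transcribe_with_fallback(audio_path: str, backends) -> tuple[str, str]:
    """Try each (name, transcribe_fn) pair in order and return the
    provider name plus transcript from the first backend that succeeds.

    backends: list of (name, callable) pairs, e.g. local Whisper,
    then Groq, then OpenAI.
    """
    errors = []
    for name, fn in backends:
        try:
            return name, fn(audio_path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All STT backends failed: " + "; ".join(errors))
```

Collecting per-backend errors before raising makes the "everything failed" case debuggable instead of silent.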
LLM: LLaMA 3.3 70B via Groq
On CPU-only machines, a 70B model via Ollama takes minutes per response. Groq's LLaMA 3.3 70B runs in around 500ms, which feels responsive in the chat UI.
Fallback chain: Ollama if running locally, then Groq LLaMA 3.3 70B, then OpenAI GPT-4o-mini.
Bonus Features Implemented
1. Compound Commands
When a user says "Summarize this text and save it to summary.txt", the LLM detects compound intent and returns both summarize and create_file as sub-intents. The _handle_compound() function chains them together in one request.
2. Human-in-the-Loop
Before any file operation, the backend returns needs_confirmation as true. The frontend shows a confirmation card with Yes and Cancel buttons. The confirmed flag is sent back with the next request.
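The two-phase exchange can be sketched as follows; `handle_request` and the `PENDING` store are hypothetical names for illustration:

```python
PENDING = {}  # request_id -> planned action, awaiting user confirmation

def handle_request(request_id: str, action: dict, confirmed: bool = False) -> dict:
    """First call stores the planned file operation and asks the UI to
    confirm; the second call (confirmed=True) executes it."""
    if not confirmed:
        PENDING[request_id] = action
        return {"needs_confirmation": True, "preview": action}
    planned = PENDING.pop(request_id, action)
    return {"needs_confirmation": False, "result": f"executed {planned['type']}"}
```

Storing the planned action server-side means the confirming request only needs to echo back the id and flag, not the whole command.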
3. Graceful Degradation
Every LLM call is wrapped in try/except. If the audio file is under 5KB (likely silence), the backend rejects it immediately with a clear error message rather than wasting an API call.
4. Session Memory
The SessionMemory class stores the last 10 interactions. The last 3 turns are passed as context to the intent classifier, enabling follow-up commands like "Now save that to a file."
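A `collections.deque` with a max length makes this nearly free to implement. A minimal sketch of what such a class could look like (method names are illustrative):

```python
from collections import deque

class SessionMemory:
    """Keep the last `maxlen` interactions and expose the most recent
    few turns as context for the intent classifier."""

    def __init__(self, maxlen: int = 10):
        self.turns = deque(maxlen=maxlen)  # old turns drop off automatically

    def add(self, user_text: str, agent_result: str) -> None:
        self.turns.append({"user": user_text, "agent": agent_result})

    def recent_context(self, n: int = 3) -> list:
        """Last n turns, oldest first, for the classifier prompt."""
        return list(self.turns)[-n:]
```

The bounded deque is the design choice here: memory can never grow past 10 turns, so a long session cannot bloat the classifier prompt.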
5. Model Benchmarking
Every LLM call logs the provider, duration in ms, prompt length, and response length. The /benchmark endpoint returns aggregated stats and the frontend shows a live benchmark panel.
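One way to capture those four fields on every call is a small decorator feeding a shared log, which /benchmark then aggregates. A sketch under that assumption (the decorator and stats function are hypothetical names):

```python
import time

BENCH_LOG = []  # one entry per LLM call

def benchmarked(provider: str):
    """Decorator that records provider, latency, and payload sizes
    for each wrapped LLM call."""
    def wrap(fn):
        def inner(prompt: str, *args, **kwargs):
            start = time.perf_counter()
            response = fn(prompt, *args, **kwargs)
            BENCH_LOG.append({
                "provider": provider,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "prompt_len": len(prompt),
                "response_len": len(response),
            })
            return response
        return inner
    return wrap

def benchmark_stats() -> dict:
    """Aggregate average latency per provider, as /benchmark might."""
    by_provider = {}
    for entry in BENCH_LOG:
        by_provider.setdefault(entry["provider"], []).append(entry["duration_ms"])
    return {p: sum(d) / len(d) for p, d in by_provider.items()}
```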
Challenges Faced
Challenge 1: Microphone Silent on Windows
The mic was recording only 1.5 to 3.7KB of audio, essentially silence, because Windows had the device's microphone volume set to 0 by default.
Fix: Added a minimum size check in stt.py. Any recording under 5000 bytes is rejected with a clear message telling the user to check their mic volume in Windows Sound Settings.
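The guard amounts to a few lines run before any API call; `validate_audio` is an illustrative name for the check described above:

```python
MIN_AUDIO_BYTES = 5000  # below this, the clip is almost certainly silence

def validate_audio(audio_bytes: bytes) -> None:
    """Reject near-empty recordings before spending an API call."""
    if len(audio_bytes) < MIN_AUDIO_BYTES:
        raise ValueError(
            f"Recording too small ({len(audio_bytes)} bytes). "
            "Check your microphone volume in Windows Sound Settings."
        )
```

A size check is a crude heuristic (a short real utterance could be rejected), but it catches the muted-mic case instantly with a message the user can act on.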
Challenge 2: MediaRecorder MIME Type Differences
Chrome produces audio/webm;codecs=opus, Firefox produces audio/ogg;codecs=opus, and Safari produces audio/mp4. Hardcoding audio/webm caused silent failures on Firefox and Safari.
Fix: Dynamic MIME type selection at runtime using MediaRecorder.isTypeSupported() to pick the best available format.
Challenge 3: LLM JSON Parsing
LLMs sometimes wrap their JSON response in markdown code fences, causing json.loads() to crash.
Fix: Strip markdown fences with regex before parsing, with a secondary regex to extract any JSON object found anywhere in the response string.
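The two-pass parse described above could look something like this sketch:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse JSON from an LLM reply that may be wrapped in ```json
    fences or embedded in surrounding prose."""
    # First pass: strip leading/trailing markdown code fences.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Second pass: grab the first {...} span anywhere in the reply.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        raise
```

The second regex is deliberately greedy so nested objects stay intact; it still fails loudly (re-raising) when no JSON is present at all, which keeps the error visible upstream.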
Challenge 4: Ollama Blocking Startup
On machines without Ollama installed, the connection attempt hangs for 30 seconds before timing out.
Fix: Documented clearly in README to set LLM_PROVIDER=groq in the .env file if not using Ollama. The fallback triggers immediately on ConnectionRefusedError.
Model Benchmarking Results
Tested on Windows 11, Intel i5, no GPU:
| Model | Task | Avg Response Time |
|---|---|---|
| Groq Whisper Large v3 | STT 5s audio | 180ms |
| Groq LLaMA 3.3 70B | Intent classification | 420ms |
| Groq LLaMA 3.3 70B | Code generation | 1200ms |
| Local Whisper Base CPU | STT 5s audio | 65000ms |
| Ollama LLaMA 3.2 CPU | Intent classification | 45000ms |
Per the table, the Groq API served much larger models roughly 100 to 360 times faster than the smaller local models ran on CPU. The tradeoff is needing an internet connection and an API key.
Tech Stack Summary
| Layer | Technology |
|---|---|
| Backend | FastAPI and Uvicorn |
| STT | Groq Whisper Large v3 |
| LLM | Groq LLaMA 3.3 70B |
| Frontend | Vanilla HTML CSS JS |
| Memory | In-session Python store |
| Safety | output/ sandbox folder |
How to Run
Clone the repo and install dependencies:
pip install fastapi uvicorn python-dotenv groq python-multipart
Create a .env file with:
GROQ_API_KEY=your_key_here
LLM_PROVIDER=groq
STT_PROVIDER=groq
Run the server:
python main.py
Then open http://localhost:8000 in your browser. Get a free Groq API key at console.groq.com.
Conclusion
Building this agent taught me a lot about practical tradeoffs between local and cloud-based AI inference, browser audio APIs, and designing resilient fallback systems. The most valuable lesson was that graceful degradation matters more than perfection. A system that fails clearly and recovers cleanly is far better than one that silently produces wrong results.
Built as part of the Mem0 AI/ML Generative AI Developer Intern Assignment.