<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: VARUN M</title>
    <description>The latest articles on DEV Community by VARUN M (@varun_m_77).</description>
    <link>https://dev.to/varun_m_77</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875264%2Fc3e949d6-c1b1-4a56-8a60-c2cd23ca93c3.jpg</url>
      <title>DEV Community: VARUN M</title>
      <link>https://dev.to/varun_m_77</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/varun_m_77"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Groq, Ollama, and Gradio</title>
      <dc:creator>VARUN M</dc:creator>
      <pubDate>Sun, 12 Apr 2026 17:31:17 +0000</pubDate>
      <link>https://dev.to/varun_m_77/building-a-voice-controlled-local-ai-agent-with-groq-ollama-and-gradio-137p</link>
      <guid>https://dev.to/varun_m_77/building-a-voice-controlled-local-ai-agent-with-groq-ollama-and-gradio-137p</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent with Groq, Ollama, and Gradio
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;What if you could just speak to your computer and have it write code, summarize text, or create files — all locally on your machine? That's exactly what I built for my internship assignment at Mem0 AI.&lt;/p&gt;

&lt;p&gt;In this article, I'll walk you through how I designed and built a voice-controlled local AI agent that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts audio via microphone or file upload&lt;/li&gt;
&lt;li&gt;Transcribes speech to text using Groq Whisper&lt;/li&gt;
&lt;li&gt;Classifies intent using LLaMA 3.3 70B&lt;/li&gt;
&lt;li&gt;Executes local tools (file creation, code generation, summarization, chat)&lt;/li&gt;
&lt;li&gt;Displays everything in a clean Gradio UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Audio Input (Mic / File Upload)&lt;br&gt;
↓&lt;br&gt;
STT: Groq Whisper API (whisper-large-v3)&lt;br&gt;
↓&lt;br&gt;
Intent Classification: Groq API (llama-3.3-70b-versatile)&lt;br&gt;
↓&lt;br&gt;
Tool Execution: Groq API (llama-3.3-70b-versatile)&lt;br&gt;
↓&lt;br&gt;
UI Display: Gradio&lt;/p&gt;

&lt;p&gt;The pipeline is simple and modular — each stage is isolated in its own file (&lt;code&gt;stt.py&lt;/code&gt;, &lt;code&gt;intent.py&lt;/code&gt;, &lt;code&gt;tools.py&lt;/code&gt;), making it easy to swap models or extend functionality.&lt;/p&gt;
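&lt;p&gt;The modular wiring above can be sketched as a small orchestrator. This is a minimal sketch, assuming each stage module exposes a single entry function; the names &lt;code&gt;transcribe&lt;/code&gt;, &lt;code&gt;classify&lt;/code&gt;, and &lt;code&gt;execute&lt;/code&gt; are illustrative, not the repo's exact API:&lt;/p&gt;

```python
def run_pipeline(audio_path, transcribe, classify, execute):
    """Run the STT, intent, and tool stages, passing each result forward."""
    text = transcribe(audio_path)                   # stt.py
    intents = classify(text)                        # intent.py
    results = [execute(i, text) for i in intents]   # tools.py
    return text, intents, results

# Stub stages stand in for the real Groq-backed implementations.
text, intents, results = run_pipeline(
    "clip.wav",
    transcribe=lambda path: "write a hello world script",
    classify=lambda text: ["write_code"],
    execute=lambda intent, text: f"ran {intent}",
)
```

&lt;p&gt;Because each stage is injected rather than hard-wired, swapping a model means changing one function, not the pipeline.&lt;/p&gt;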




&lt;h2&gt;
  
  
  Models Chosen and Why
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Speech-to-Text: Groq Whisper (whisper-large-v3)
&lt;/h3&gt;

&lt;p&gt;The assignment recommended running a Hugging Face model like Whisper locally. However, my machine is a MacBook Air with only 8GB of RAM, so running Whisper locally would be slow and unreliable. Instead, I used the Groq API to run Whisper, which is significantly faster (typically under 2 seconds for a 10-second clip) and free on Groq's free tier.&lt;/p&gt;
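&lt;p&gt;The STT call looks roughly like this. It's a hedged sketch using Groq's OpenAI-compatible Python SDK; the file-tuple upload form and the injectable &lt;code&gt;client&lt;/code&gt; parameter are my additions for illustration:&lt;/p&gt;

```python
import os

def transcribe(audio_path, client=None):
    """Send an audio clip to Groq-hosted Whisper and return the transcript."""
    if client is None:
        from groq import Groq  # imported lazily so this sketch loads without the SDK
        client = Groq(api_key=os.environ["GROQ_API_KEY"])
    with open(audio_path, "rb") as f:
        resp = client.audio.transcriptions.create(
            file=(audio_path, f.read()),
            model="whisper-large-v3",
        )
    return resp.text
```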

&lt;h3&gt;
  
  
  Intent Classification: llama-3.3-70b-versatile via Groq
&lt;/h3&gt;

&lt;p&gt;I initially tried using Ollama with &lt;code&gt;llama3.2:1b&lt;/code&gt; locally for intent classification. The problem was that small models struggle to reliably output structured JSON. Switching to &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt; via Groq gave consistent, accurate JSON intent classification every time.&lt;/p&gt;
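&lt;p&gt;Classification with a guard against malformed output can be sketched like this. The prompt wording and the &lt;code&gt;parse_intents&lt;/code&gt; helper are illustrative, not the repo's exact code; invalid JSON falls back to chat mode:&lt;/p&gt;

```python
import json
import os

INTENTS = ["write_code", "create_file", "summarize", "chat"]

def parse_intents(raw):
    """Validate the model's JSON output; fall back to chat mode otherwise."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return ["chat"]
    if isinstance(data, list) and data and all(i in INTENTS for i in data):
        return data
    return ["chat"]

def classify_intent(text):
    from groq import Groq  # imported lazily so this sketch loads without the SDK
    client = Groq(api_key=os.environ["GROQ_API_KEY"])
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Classify the user's request into a JSON list of intents "
                "drawn from " + str(INTENTS) + ". Respond with JSON only, "
                'e.g. ["summarize", "create_file"].')},
            {"role": "user", "content": text},
        ],
    )
    return parse_intents(resp.choices[0].message.content)
```

&lt;p&gt;Setting &lt;code&gt;temperature=0&lt;/code&gt; helps keep the structured output deterministic.&lt;/p&gt;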

&lt;h3&gt;
  
  
  Tool Execution: llama-3.3-70b-versatile via Groq
&lt;/h3&gt;

&lt;p&gt;All tool execution — code generation, summarization, and chat — also uses the same Groq model. This keeps latency low and quality high without any local GPU requirement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Supported Intents
&lt;/h2&gt;

&lt;p&gt;The agent supports four core intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;write_code + create_file&lt;/strong&gt; — generates code and saves it to &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;create_file&lt;/strong&gt; — creates a file with specified content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;summarize&lt;/strong&gt; — summarizes provided text (optionally saves to file)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;chat&lt;/strong&gt; — general conversation with memory across the session&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bonus Features Implemented
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compound Commands
&lt;/h3&gt;

&lt;p&gt;The agent handles multi-intent commands in a single audio input. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Summarize this text and save it to summary.txt"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The intent classifier returns &lt;code&gt;["summarize", "create_file"]&lt;/code&gt; and the tools pipeline handles both in sequence.&lt;/p&gt;
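&lt;p&gt;Sequencing compound intents amounts to threading each tool's output into the next. A minimal sketch, with stand-in tool functions instead of the real Groq-backed ones:&lt;/p&gt;

```python
def handle_compound(intents, text, tools):
    """Run classified intents in order, feeding each result to the next tool."""
    result = text
    for intent in intents:
        result = tools[intent](result)
    return result

# Stand-in tools; the real ones call the Groq model and write to output/.
tools = {
    "summarize": lambda text: "summary of: " + text,
    "create_file": lambda content: f"saved ({len(content)} chars)",
}
out = handle_compound(["summarize", "create_file"], "long text...", tools)
```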

&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;Before executing any file operation, the UI shows a confirmation prompt with Confirm and Cancel buttons. This prevents accidental file writes.&lt;/p&gt;
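&lt;p&gt;Independent of Gradio's widget API, the confirm/cancel flow boils down to staging the action and only executing it on an explicit confirm. A sketch with illustrative names and messages:&lt;/p&gt;

```python
pending = {}

def propose(action, payload):
    """Stage a file operation and ask the user to confirm it."""
    pending["action"] = (action, payload)
    return f"Confirm {action}?"

def confirm():
    """Execute the staged action; a real handler would run the tool here."""
    action, payload = pending.pop("action")
    return f"executed {action}"

def cancel():
    """Discard the staged action without executing it."""
    pending.pop("action", None)
    return "cancelled"
```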

&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;Every function is wrapped in try/except blocks. If the STT fails, the error is displayed cleanly. If intent classification returns an unexpected format, it falls back to chat mode.&lt;/p&gt;
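&lt;p&gt;That try/except wrapping can be factored into one helper that turns any stage failure into a clean UI message (a sketch of the idea; the real code wraps each function individually):&lt;/p&gt;

```python
def safe_call(fn, *args, fallback="Something went wrong"):
    """Run a pipeline stage, surfacing any exception as a clean message."""
    try:
        return fn(*args)
    except Exception as exc:
        return f"{fallback}: {exc}"
```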

&lt;h3&gt;
  
  
  Persistent Memory
&lt;/h3&gt;

&lt;p&gt;Session history is saved to a &lt;code&gt;session_history.json&lt;/code&gt; file. Every action is logged with its transcription, intent, and result. The history persists across app restarts and is displayed as a table in the UI.&lt;/p&gt;

&lt;p&gt;Chat context is also maintained within a session — the agent remembers previous messages for coherent multi-turn conversations.&lt;/p&gt;
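&lt;p&gt;Persistence of this kind can be sketched as an append-to-JSON logger. The field names mirror the article; the repo's actual schema may differ:&lt;/p&gt;

```python
import json
import os

def log_action(transcription, intent, result, path="session_history.json"):
    """Append one action record to the JSON history file."""
    entries = []
    if os.path.exists(path):
        with open(path) as f:
            entries = json.load(f)
    entries.append({"transcription": transcription,
                    "intent": intent,
                    "result": result})
    with open(path, "w") as f:
        json.dump(entries, f, indent=2)
```

&lt;p&gt;Because the whole file is re-read on every append, this stays correct across app restarts without any in-memory state.&lt;/p&gt;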




&lt;h2&gt;
  
  
  Model Benchmarking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;Groq Whisper Large v3&lt;/td&gt;
&lt;td&gt;~1.5s&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent&lt;/td&gt;
&lt;td&gt;llama-3.3-70b (Groq)&lt;/td&gt;
&lt;td&gt;~1.2s&lt;/td&gt;
&lt;td&gt;Consistent JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intent (attempted)&lt;/td&gt;
&lt;td&gt;llama3.2:1b (Ollama local)&lt;/td&gt;
&lt;td&gt;~3s&lt;/td&gt;
&lt;td&gt;Inconsistent JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code Gen&lt;/td&gt;
&lt;td&gt;llama-3.3-70b (Groq)&lt;/td&gt;
&lt;td&gt;~2-3s&lt;/td&gt;
&lt;td&gt;Clean, executable code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest performance difference was in intent classification. The local &lt;code&gt;llama3.2:1b&lt;/code&gt; model frequently failed to return valid JSON, causing fallback to chat mode. Switching to the 70B model via Groq solved this completely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Model size vs RAM constraints&lt;/strong&gt;&lt;br&gt;
Running 7B+ parameter models locally on 8GB RAM caused slowdowns and timeouts. The solution was offloading STT and LLM inference to Groq's free API while keeping the architecture "local-first" in spirit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Intent classification reliability&lt;/strong&gt;&lt;br&gt;
Small models are not reliable at following structured output instructions. The fix was using a larger, smarter model with a well-crafted system prompt and few-shot examples.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Browser compatibility&lt;/strong&gt;&lt;br&gt;
Gradio's streaming/generator approach caused Safari to drop WebSocket connections on long requests. Switching to Chrome and using a non-streaming approach with &lt;code&gt;app.queue()&lt;/code&gt; solved the freezing issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Markdown fences in generated code&lt;/strong&gt;&lt;br&gt;
The LLM kept wrapping generated code in markdown fences (&lt;code&gt;```python ... ```&lt;/code&gt;). This was fixed by stripping the fences before writing the code to file.&lt;/p&gt;
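&lt;p&gt;The fence-stripping step can be as simple as dropping a leading and trailing backtick line (a sketch of the idea, not the repo's exact helper):&lt;/p&gt;

```python
def strip_fences(text):
    """Drop leading/trailing backtick fence lines (with optional language tag)."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)
```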




&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/varun-2437/local-ai-voice-agent
cd local-ai-voice-agent/voice-agent
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Add your Groq API key to &lt;code&gt;.env&lt;/code&gt;:&lt;br&gt;
&lt;code&gt;GROQ_API_KEY=your_key_here&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve   # in one terminal
python app.py  # in another
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project taught me a lot about building practical AI pipelines — choosing the right model for the right job, handling real hardware constraints, and making systems robust with graceful error handling. The combination of Groq's fast free API and Gradio's simple UI framework made it surprisingly easy to go from idea to working product in a short time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/varun-2437/local-ai-voice-agent" rel="noopener noreferrer"&gt;https://github.com/varun-2437/local-ai-voice-agent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>gradio</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
