DEV Community

ayisha

Building a Voice-Controlled Local AI Agent with Groq Whisper, Ollama, and Streamlit

Introduction

I built a voice-controlled AI agent that listens to your voice,
understands your intent, and executes actions on your local machine,
all through a clean web UI. In this article, I'll walk through the
architecture, the models I chose, and the challenges I faced
building it on Windows.

What It Does

You speak a command like "Create a Python file with a retry function"
and the agent:

  1. Transcribes your audio to text
  2. Detects your intent using a local LLM
  3. Executes the right action (generates code, creates files, summarizes text)
  4. Shows everything in a Streamlit UI

Architecture

Audio Input → Groq Whisper STT → Ollama LLM (Intent) → Tool Execution → Streamlit UI

Components:

  • STT: Groq Whisper large-v3 API
  • LLM: llama3.2 via Ollama (runs 100% locally)
  • UI: Streamlit
  • Tools: File creation, code generation, text summarization, general chat
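The flow above can be sketched as one orchestration function. The function and parameter names here are illustrative, not taken from the actual repo; each stage is passed in as a callable so the sketch stays independent of any particular STT or LLM client:

```python
# Illustrative pipeline: audio in, a dict of results out.
def run_pipeline(audio_bytes, transcribe, classify_intent, execute_tool):
    text = transcribe(audio_bytes)        # Groq Whisper STT
    intent = classify_intent(text)        # local LLM via Ollama
    result = execute_tool(intent, text)   # file creation, codegen, etc.
    return {"transcript": text, "intent": intent, "result": result}
```

The returned dict maps naturally onto what the Streamlit UI renders: the transcript, the detected intent, and the tool's output.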

Models I Chose

Speech-to-Text: Groq Whisper

I initially planned to run OpenAI Whisper locally via Hugging Face.
However, local Whisper depends on ffmpeg, which I couldn't get onto
my PATH reliably on Windows. I switched to Groq's hosted Whisper
API, which has a free tier, is fast, and accepts common audio
formats without any local setup.
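For reference, a transcription call with Groq's Python SDK looks roughly like this. It assumes GROQ_API_KEY is set in the environment, and the filename is illustrative:

```python
from groq import Groq  # pip install groq

client = Groq()  # reads GROQ_API_KEY from the environment

# Send a local recording to Groq's hosted Whisper large-v3 model.
with open("recording.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=("recording.wav", audio_file.read()),
        model="whisper-large-v3",
    )

print(transcription.text)
```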

LLM: llama3.2 via Ollama

I chose Ollama for local LLM inference because it's easy to set up
on Windows and runs completely offline. llama3.2 provided a good
balance between speed and accuracy for intent classification.

Intent Classification

The LLM classifies the transcribed command into one of four intents:

  • WRITE_CODE — generates and saves code to output/
  • CREATE_FILE — creates a new file in output/
  • SUMMARIZE — summarizes provided text
  • GENERAL_CHAT — general conversation
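Once an intent label comes back, dispatch can be a plain lookup table. The handler bodies below are hypothetical stand-ins for the real tools; the point is the mapping and the safe fallback:

```python
# Hypothetical tool handlers; the real ones generate code, write files, etc.
def write_code(text): return f"[code generated for: {text}]"
def create_file(text): return f"[file created for: {text}]"
def summarize(text): return f"[summary of: {text}]"
def general_chat(text): return f"[chat reply to: {text}]"

# Map each intent label the LLM can return to its tool.
HANDLERS = {
    "WRITE_CODE": write_code,
    "CREATE_FILE": create_file,
    "SUMMARIZE": summarize,
    "GENERAL_CHAT": general_chat,
}

def dispatch(intent, text):
    # Unknown labels fall back to general chat rather than crashing.
    handler = HANDLERS.get(intent, general_chat)
    return handler(text)
```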

I used structured JSON prompting to get consistent output from the LLM:

```python
SYSTEM_PROMPT = """Classify the intent into one of:
WRITE_CODE, CREATE_FILE, SUMMARIZE, GENERAL_CHAT
Respond in JSON format only."""
```
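A minimal sketch of the classification step, using Ollama's local REST API (`/api/chat` on port 11434) with `format: "json"` to constrain the model to valid JSON. The prompt is extended with an example, and a parsing helper validates the label before anything gets executed:

```python
import json
import urllib.request

VALID_INTENTS = {"WRITE_CODE", "CREATE_FILE", "SUMMARIZE", "GENERAL_CHAT"}

SYSTEM_PROMPT = """Classify the intent into one of:
WRITE_CODE, CREATE_FILE, SUMMARIZE, GENERAL_CHAT
Respond in JSON format only, e.g. {"intent": "WRITE_CODE"}."""

def classify_intent(text, model="llama3.2"):
    # Ollama's chat endpoint; format="json" forces well-formed JSON output.
    payload = json.dumps({
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        "format": "json",
        "stream": False,
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = json.load(resp)["message"]["content"]
    return parse_intent(raw)

def parse_intent(raw):
    # Never trust model output blindly: fall back to general chat on
    # malformed JSON or an unknown label.
    try:
        intent = json.loads(raw).get("intent", "")
    except json.JSONDecodeError:
        return "GENERAL_CHAT"
    return intent if intent in VALID_INTENTS else "GENERAL_CHAT"
```

Validating against a whitelist matters because the dispatcher runs real file operations; an unexpected label should degrade to harmless chat, not crash or misfire.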

Challenges I Faced

1. ffmpeg on Windows

Whisper requires ffmpeg but adding it to PATH on Windows was
problematic due to OneDrive folder paths with spaces. I solved
this by switching to Groq's API entirely.

2. Multiple Python versions

My machine had both Python 3.12 and 3.13 installed. Packages
installed on one version weren't available on the other. I solved
this by always using py -3.12 explicitly.

3. Streamlit state management

Button clicks in Streamlit trigger full page reruns, losing
previous results. I solved this using st.session_state to persist
transcription and intent results across reruns.
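The fix looks roughly like this in Streamlit: initialize the keys once per session, write results into st.session_state instead of local variables, and render from session state so earlier results survive the rerun a button click triggers (the widget labels and keys here are illustrative):

```python
import streamlit as st

# Initialize persistent slots once per session.
if "transcript" not in st.session_state:
    st.session_state.transcript = None
if "intent" not in st.session_state:
    st.session_state.intent = None

if st.button("Transcribe"):
    # Store results in session state; a plain local variable would be
    # lost on the next rerun.
    st.session_state.transcript = "..."  # result of the STT call
    st.session_state.intent = "..."      # result of intent classification

# Render from session state so results persist across later reruns.
if st.session_state.transcript:
    st.write("Transcript:", st.session_state.transcript)
    st.write("Intent:", st.session_state.intent)
```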

4. API Key Security

GitHub Push Protection blocked my push because my Groq API key
was hardcoded. I fixed this by using python-dotenv with a .env
file and environment variables.
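The pattern is standard: keep the key in a gitignored .env file, load it at startup, and read it from the environment. In this sketch python-dotenv is optional at runtime, so the helper degrades gracefully to plain environment variables if it isn't installed:

```python
import os

try:
    # Reads KEY=value pairs from a local .env file into os.environ.
    from dotenv import load_dotenv  # pip install python-dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to variables already set in the environment

def get_groq_api_key():
    key = os.getenv("GROQ_API_KEY")
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; add it to .env or the environment"
        )
    return key
```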

Safety

All file operations are restricted to an output/ folder to prevent
accidental system file overwrites.
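A sketch of that guard using pathlib: resolve the requested path and refuse anything that escapes the output/ directory, for example via ".." components. (The function name is mine, not from the repo.)

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename):
    """Return an absolute path inside output/, or raise if it escapes."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # resolve() collapses ".." segments, so a traversal attempt ends up
    # outside OUTPUT_DIR and fails this check.
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"Refusing to write outside output/: {filename}")
    return candidate
```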

Demo

Watch the full demo here: https://youtu.be/S2PejSQGpAA

GitHub

Full source code: https://github.com/ayisha-parli/voice-agent

Conclusion

Building a voice AI agent with fully local LLM inference is very
achievable with modern tools like Ollama and Groq. The biggest
challenges were Windows-specific setup issues rather than AI
problems. The final system works reliably and can easily be
extended with new intents and tools.
