Overview
I recently built a Voice-Controlled AI Agent that processes both audio and text inputs, understands user intent, and performs meaningful actions through a structured pipeline.
The goal of this project was to design a complete AI system that works locally without relying on paid APIs, while maintaining simplicity and reliability.
Architecture
The system follows this pipeline:
Input → Speech-to-Text → Intent Detection → Action Execution → Output
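As a rough illustration, this pipeline can be sketched as a chain of small functions. Everything here is a hypothetical stand-in, not the project's actual code: the function names are invented, and the speech-to-text step is stubbed out rather than calling Whisper.

```python
# Minimal sketch of the pipeline stages; all names are hypothetical.

def speech_to_text(audio_path: str) -> str:
    """Stub for the transcription step (a local Whisper model in the real system)."""
    raise NotImplementedError("would transcribe the audio file here")

def detect_intent(text: str) -> str:
    """Toy intent detection: a single rule, then a default."""
    if text.lower().startswith("create file"):
        return "create_file"
    return "chat"

def execute_action(intent: str, text: str) -> str:
    """Dispatch the detected intent to an action."""
    if intent == "create_file":
        return "file created"
    return f"chat response to: {text}"

def run_pipeline(user_input: str, is_audio: bool = False) -> str:
    """Input -> (optional) speech-to-text -> intent detection -> action -> output."""
    text = speech_to_text(user_input) if is_audio else user_input
    intent = detect_intent(text)
    return execute_action(intent, text)

print(run_pipeline("create file notes.txt"))  # -> file created
```

Text input skips the transcription stage entirely, which is why both .wav/.mp3 files and plain text can share the rest of the pipeline.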
Key Features
- Supports both audio (.wav, .mp3) and text input
- Speech-to-text using Whisper (local model)
- Intent detection using a hybrid approach (rule-based + LLM fallback)
- Actions supported:
  - File creation
  - Python code generation
  - Text summarization
  - Chat responses
- Compound commands (multiple actions in one input)
- Persistent memory using JSON
- Safe file handling within a dedicated output directory
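The safe file handling mentioned above can be enforced by resolving every requested path and refusing anything that escapes the output directory. A minimal sketch, where the directory name is my assumption, not the project's:

```python
from pathlib import Path

# Hypothetical output directory name; the real project may use a different one.
OUTPUT_DIR = Path("agent_output").resolve()

def safe_write(filename: str, content: str) -> Path:
    """Write content to a file, refusing paths that resolve outside OUTPUT_DIR."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = (OUTPUT_DIR / filename).resolve()
    # resolve() collapses ".." segments, so a traversal attempt like
    # "../escape.txt" ends up outside OUTPUT_DIR and is rejected here.
    if OUTPUT_DIR not in target.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}: {filename}")
    target.write_text(content)
    return target
```

Resolving before checking is the important detail: comparing the raw string would let `../` sneak past the guard.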
Tech Stack
- Python
- Streamlit
- Whisper
- Ollama (Llama3)
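The persistent memory from the feature list can be as simple as a JSON file that is loaded on startup and rewritten after each interaction. A minimal sketch; the file name and record schema are my assumptions:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("memory.json")  # hypothetical file name

def load_memory() -> list:
    """Return past interactions, or an empty history if nothing is saved yet."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def remember(user_input: str, response: str) -> None:
    """Append one interaction and persist the whole history back to disk."""
    history = load_memory()
    history.append({"input": user_input, "response": response})
    MEMORY_FILE.write_text(json.dumps(history, indent=2))
```

Rewriting the whole file on every turn is wasteful at scale, but for a single-user local agent it keeps the memory human-readable and trivially inspectable.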
Challenges
One of the key challenges was handling noisy or unclear speech input. This was addressed by combining rule-based logic with LLM-based intent detection.
Another challenge was ensuring correct intent classification for short inputs, which required prioritizing rules over model responses.
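The rules-first approach described above amounts to: check an explicit keyword table first, and only ask the model when no rule fires. A sketch of that priority ordering, with a hypothetical rule table and the Ollama call stubbed out:

```python
# Rules are checked first; the LLM fallback only runs when no rule matches.
RULES = {  # hypothetical keyword -> intent table
    "create file": "create_file",
    "summarize": "summarize",
    "write code": "generate_code",
}

def llm_intent(text: str) -> str:
    """Placeholder for an Ollama/Llama3 classification call; stubbed here."""
    return "chat"

def detect_intent(text: str) -> str:
    lowered = text.lower()
    for keyword, intent in RULES.items():
        if keyword in lowered:
            return intent  # rule wins, even for very short inputs
    return llm_intent(text)  # no rule matched: fall back to the model
```

Because short commands like "summarize" hit a rule before the model is ever consulted, the ambiguity the LLM sometimes showed on terse inputs never gets a chance to cause a misclassification.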
Learnings
This project helped me understand how real-world AI systems are built beyond just using models, including pipeline design, input validation, and system reliability.
Links
https://github.com/thamizhamudhu/voice-ai-agent/blob/main/README.md