Adarsh Sharma

Building a Voice-Controlled Local AI Agent using Ollama and Whisper

Introduction

In this project, I built a Voice-Controlled Local AI Agent that can accept audio or text input, understand the user's intent, and perform real actions like creating files, generating code, and summarizing text.

Unlike traditional chatbots, this system doesn’t just respond — it acts based on user commands.

System Architecture

The system follows a clear pipeline:

Audio/Text Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output

1. Audio/Text Input: microphone input, uploaded audio files (.wav, .mp3), or direct text.
2. Speech-to-Text (STT): the audio is converted into text by a local model (faster-whisper).
3. Intent Detection: a local LLM (via Ollama) analyzes the text and classifies the user's intent.
4. Tool Execution: based on the detected intent, the system performs actions such as creating files/folders, writing code, or summarizing text.
5. User Interface: a Gradio-based UI displays the transcribed text, detected intent, actions taken, and final output.
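The stages above can be sketched end to end. This is a minimal illustration, not the project's actual code: the function names (`transcribe`, `detect_intent`, `execute`) are assumptions, and the STT and LLM stages are stubbed out where the real system calls faster-whisper and Ollama.

```python
from dataclasses import dataclass

@dataclass
class AgentResult:
    transcript: str   # output of the STT stage
    intent: str       # label chosen by intent detection
    output: str       # result of tool execution

def transcribe(audio_or_text) -> str:
    # Real system: faster-whisper for audio; direct text passes through.
    return audio_or_text if isinstance(audio_or_text, str) else "<audio transcript>"

def detect_intent(text: str) -> str:
    # Real system: a local LLM via Ollama; stubbed with keywords here.
    lowered = text.lower()
    if "summarize" in lowered:
        return "summarize_text"
    if "save" in lowered or "file" in lowered:
        return "write_code_to_new_file"
    return "chat"

def execute(intent: str, text: str) -> str:
    # Real system: file creation, code generation, summarization, etc.
    return f"executed {intent}"

def run_pipeline(user_input) -> AgentResult:
    transcript = transcribe(user_input)
    intent = detect_intent(transcript)
    return AgentResult(transcript, intent, execute(intent, transcript))
```

Keeping each stage behind its own function makes it easy to swap the stubs for the real faster-whisper and Ollama calls without touching the rest of the pipeline.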
🛠️ Technologies Used
- Python – core development
- Gradio – user interface
- faster-whisper – speech-to-text
- Ollama (phi3:mini) – local LLM for intent detection
- Regex + AST – code extraction and validation
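The "Regex + AST" combination could look something like the sketch below: a regex pulls the first fenced code block out of the LLM response, and `ast.parse` rejects anything that is not syntactically valid Python. The function name and regex are my assumptions, not the repository's actual implementation.

```python
import ast
import re
from typing import Optional

TICK = "`"
# Match a fenced block like ```python ... ``` and capture its body.
FENCE = re.compile(TICK * 3 + r"(?:python)?\s*\n(.*?)" + TICK * 3, re.DOTALL)

def extract_code(llm_response: str) -> Optional[str]:
    """Return the first fenced code block if it parses as valid Python."""
    match = FENCE.search(llm_response)
    candidate = match.group(1) if match else llm_response
    try:
        ast.parse(candidate)  # validation: raises SyntaxError on prose/junk
    except SyntaxError:
        return None           # caller can retry or fall back
    return candidate.strip()
```

The AST check is what keeps chatty LLM output ("Here is the code: ...") from ending up inside saved `.py` files.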
🎯 Features
- 🎤 Voice input (microphone + file upload)
- 🧠 Local intent detection using an LLM
- 📁 File and folder creation
- 💻 Code generation and saving
- ✂️ Text summarization
- 🔐 Secure sandbox (output/ directory)
- 🔁 Compound command support
- 🧍 Human-in-the-loop approval
- 🧠 Session memory
- ⏱️ Performance benchmarking
⚠️ Challenges Faced

During development, several challenges came up:

- Intent misclassification: the model sometimes confused creating files with writing code.
- Uncontrolled code generation: the LLM returned explanations along with code, making the saved files messy.
- Path security issues: some commands tried to write outside the allowed directory.
- Compound command handling: executing multiple steps from a single command required careful ordering logic.
✅ Solutions Implemented

To overcome these challenges:

- Added rule-based overrides for more accurate intent detection
- Designed strict prompts to constrain code generation
- Implemented code cleaning using AST parsing
- Built a safe path system restricting operations to output/
- Added path carryover logic for compound commands
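One plausible shape for the rule-based overrides: deterministic keyword rules run first, and the LLM is consulted only when no rule fires. The intent labels and the `ask_llm` fallback here are illustrative assumptions, not the project's exact rule set.

```python
import re

# Ordered list of (pattern, intent) overrides; first match wins.
OVERRIDES = [
    (re.compile(r"\bsave (it|this)? ?to\b|\.py\b"), "write_code_to_new_file"),
    (re.compile(r"\bsummari[sz]e\b"), "summarize_text"),
    (re.compile(r"\b(create|make) (a )?folder\b"), "create_folder"),
]

def classify(text: str, ask_llm=lambda t: "chat") -> str:
    lowered = text.lower()
    for pattern, intent in OVERRIDES:
        if pattern.search(lowered):
            return intent      # rule wins; skip the LLM entirely
    return ask_llm(text)       # ambiguous input: defer to the local model
```

Because the rules are checked before the model, commands like "save it to test_add.py" can no longer be misclassified, which addresses the file-creation vs. code-writing confusion mentioned above.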
🔒 Security Considerations

All file operations are restricted to a dedicated output/ folder.
Any attempt to access paths outside it (such as ../../) is blocked automatically.
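A sandbox like this can be enforced with nothing but `pathlib`: resolve the requested path against the allowed directory and refuse anything that escapes it. This is a sketch under my own naming assumptions (`SANDBOX`, `safe_path`), not the repository's code.

```python
from pathlib import Path

SANDBOX = Path("output").resolve()

def safe_path(requested: str) -> Path:
    """Resolve `requested` inside the sandbox or raise PermissionError."""
    candidate = (SANDBOX / requested).resolve()
    # resolve() collapses any ../ segments, so an escape attempt ends up
    # outside the sandbox and fails this ancestry check.
    if candidate != SANDBOX and SANDBOX not in candidate.parents:
        raise PermissionError(f"blocked path outside sandbox: {requested}")
    return candidate
```

Checking the resolved path (rather than scanning the string for "..") also blocks escapes hidden behind symlinks or redundant separators.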

🚀 Example Workflow

User Input:

Write a Python program to add two numbers and save it to test_add.py

System Execution:

1. Detects the intent (write_code_to_new_file)
2. Generates the Python code
3. Saves the file as output/test_add.py
4. Displays the full pipeline in the UI
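For compound commands with path carryover, one simple approach is to split the request into steps and thread the working directory through them, so a file created after "create folder X" lands inside X. The splitting heuristic and step format below are my assumptions for illustration only.

```python
def run_compound(command: str):
    """Split a compound command on ' and ' and carry the target folder forward."""
    steps = [s.strip() for s in command.split(" and ")]
    current_dir = "output"          # carryover state shared between steps
    actions = []
    for step in steps:
        if step.startswith("create folder "):
            current_dir = f"output/{step.removeprefix('create folder ')}"
            actions.append(("mkdir", current_dir))
        elif step.startswith("save ") or " save " in step:
            filename = step.rsplit(" ", 1)[-1]
            # later steps inherit the folder established earlier
            actions.append(("write", f"{current_dir}/{filename}"))
    return actions
```

Without the carryover state, each step would default back to output/ and "save it to hello.py" would ignore the folder the previous step just created.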
🏁 Conclusion

This project demonstrates how a local AI agent can be built to perform real-world tasks efficiently and securely.

It combines speech processing, language models, and system-level execution into a single interactive application.

🔗 Links
GitHub Repository: https://github.com/adarsh7979s/Voice-controlled-ai-agent
Demo Video: https://youtu.be/PFnSSqCuNd4
