Purpose
I built this project to explore the intersection of voice interfaces and local system automation. The goal was to move beyond simple chatbots and design a hands-free AI agent that understands spoken commands and executes real tasks like generating code, creating files, and summarizing text.
System Architecture
The system is designed as a modular pipeline with four core components:
Frontend: Built using Streamlit for a lightweight, reactive user interface.
Speech-to-Text (STT): Whisper-large-v3 via the Groq API for high-speed transcription.
The Brain (LLM): Llama 3.2 (1B) running locally via Ollama.
Action Layer: Custom Python logic for secure file operations and text processing.
This pipeline ensures a seamless flow from voice input to intent detection and then execution.
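As a sketch of how these stages connect (the `groq` and `ollama` client packages, the model identifiers, and the label set below are assumptions based on the components listed above, not the project's exact code):

```python
# Sketch of the voice -> text -> intent stages of the pipeline.
# Client libraries are imported lazily so the pure helpers run without them.

INTENT_LABELS = ["generate_code", "create_file", "summarize", "chat"]  # illustrative

def build_intent_prompt(transcript: str) -> str:
    """Constrain the small model to answer with a single label."""
    return (
        "Classify the user request into exactly one of these labels: "
        + ", ".join(INTENT_LABELS)
        + ". Reply with the label only.\nRequest: " + transcript
    )

def transcribe(audio_path: str) -> str:
    """Stage 2: speech-to-text via Groq-hosted Whisper."""
    from groq import Groq  # assumed installed; reads GROQ_API_KEY from the environment
    client = Groq()
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f, model="whisper-large-v3"
        )
    return result.text

def detect_intent(transcript: str) -> str:
    """Stage 3: local intent detection via the Ollama-served model."""
    import ollama  # assumed installed; the Ollama daemon must be running
    reply = ollama.chat(
        model="llama3.2:1b",
        messages=[{"role": "user", "content": build_intent_prompt(transcript)}],
    )
    return reply["message"]["content"].strip()
```

The Streamlit frontend would call `transcribe` then `detect_intent`, and route the resulting label to the action layer.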
Strategic Model Selection
I chose Llama 3.2:1B for intent classification because it is exceptionally lightweight and efficient for local execution. Despite its small parameter count, it excels at:
Categorizing complex user intents.
Generating clean, syntactically correct Python code.
Context-aware text summarization.
This model allowed me to build a responsive system that prioritizes user privacy and works without high-end GPU hardware.
Challenges & Workarounds
Solving for Latency
Running Whisper locally on consumer hardware introduced a 10-second lag, which broke the conversational flow.
Workaround: I offloaded STT to the Groq API, reducing latency to near real-time while maintaining a local-first LLM workflow for the thinking process.

Handling "Chatty" LLM Outputs
Small LLMs sometimes provide conversational filler when only a specific label is needed.
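For instance, a small model asked for a bare label may answer "Sure! The intent here is generate_code." A simple scan over a fixed label set (the label names here are illustrative, not the project's exact vocabulary) can recover the clean label:

```python
INTENT_LABELS = ["generate_code", "create_file", "summarize", "chat"]  # illustrative

def extract_intent(raw_reply: str, default: str = "chat") -> str:
    """Pick the first known label mentioned anywhere in the model's reply."""
    lowered = raw_reply.lower()
    for label in INTENT_LABELS:
        if label in lowered:
            return label
    return default  # fall back when the model answers with pure filler

print(extract_intent("Sure! The intent here is generate_code."))  # -> generate_code
```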
Workaround: I implemented structured prompt engineering and keyword-based filtering to extract clean, actionable intent labels from the model's response.

Safety & Security (The Sandbox)
Allowing an AI to write files directly to a system is a major security risk.
Workaround: I implemented a Human-in-the-loop confirmation system. All file operations are restricted to a dedicated directory and require a manual user click before data is written to the disk.
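The path-confinement half of that check can be sketched as follows (the sandbox directory name and the helper are illustrative; the confirmation click itself lives in the Streamlit UI and arrives here as a boolean):

```python
from pathlib import Path

SANDBOX = Path("workspace").resolve()  # illustrative dedicated directory

def safe_write(filename: str, content: str, user_confirmed: bool) -> Path:
    """Write only inside SANDBOX, and only after explicit user confirmation."""
    if not user_confirmed:
        raise PermissionError("User has not confirmed this file operation")
    target = (SANDBOX / filename).resolve()
    # Reject path traversal such as "../../etc/passwd".
    if not target.is_relative_to(SANDBOX) or target == SANDBOX:
        raise PermissionError(f"Refusing to write outside {SANDBOX}")
    SANDBOX.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target
```

Resolving the path before the containment check is what defeats `..` tricks; checking the confirmation flag first ensures nothing touches the disk without the manual click.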
Key Features
Dual Input: Supports both live mic recording and file upload (.wav/.mp3).
Local Intelligence: LLM processing happens entirely via Ollama for privacy.
Automated Workflow: From intent detection to file creation in seconds.
Session Memory: Tracks recent commands for a better user experience.
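Session memory of this kind can be as simple as a bounded deque kept in Streamlit's `st.session_state`; a framework-free sketch (the five-command window and class name are assumptions):

```python
from collections import deque

class SessionMemory:
    """Keep the N most recent (transcript, intent) pairs for display and context."""

    def __init__(self, max_commands: int = 5):
        self.history = deque(maxlen=max_commands)  # oldest entries drop off automatically

    def remember(self, transcript: str, intent: str) -> None:
        self.history.append((transcript, intent))

    def recent(self):
        """Most recent command first, for rendering in the UI."""
        return list(reversed(self.history))
```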
Learnings & Takeaways
This project was a deep dive into designing end-to-end AI pipelines. It taught me how to integrate local and cloud models to balance performance with privacy and how to design systems that are robust, safe, and useful for real-world tasks.
Link
GitHub Repository: https://github.com/Rupali0-lab/voice-ai-agent-/tree/main
Author: Rupali Raj