🚀 Introduction
In this project, I built a voice-controlled AI agent that takes audio input, converts it to text, detects the user's intent, and performs actions such as file creation, code generation, text summarization, and general chat.
🧠 System Architecture
The system follows a simple pipeline:
Audio Input → Speech-to-Text → Intent Detection → Action Execution → UI Output
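The pipeline above can be sketched as a chain of small functions. This is an illustrative skeleton only: the function names are mine, and each stage is a stub standing in for the real component (Whisper, the intent classifier, and so on).

```python
# Illustrative pipeline skeleton; each stage is a placeholder for a
# real component such as Whisper (speech-to-text) or an intent model.
def stt_stage(audio_path: str) -> str:
    """Stand-in for speech-to-text; a real version would call Whisper."""
    return "create a file called notes.txt"  # placeholder transcript

def intent_stage(text: str) -> str:
    """Stand-in for intent detection."""
    return "create_file" if "file" in text.lower() else "general_chat"

def action_stage(intent: str, text: str) -> str:
    """Stand-in for action execution."""
    return f"executed {intent}"

def run_pipeline(audio_path: str) -> dict:
    """Wire the stages together in the order shown in the diagram."""
    text = stt_stage(audio_path)
    intent = intent_stage(text)
    result = action_stage(intent, text)
    return {"text": text, "intent": intent, "result": result}
```

Each stage only consumes the previous stage's output, which keeps the components independently testable and replaceable.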
🔊 Speech-to-Text
I used OpenAI Whisper, running locally, to convert audio into text. Whisper stays accurate across different accents and in noisy conditions.
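A minimal transcription helper might look like the following. It assumes the `openai-whisper` package is installed; the model size and audio path are placeholders, not values from the project.

```python
def transcribe_audio(audio_path: str, model_size: str = "base") -> str:
    """Transcribe a local audio file with OpenAI Whisper (runs fully offline)."""
    import whisper  # imported lazily so the rest of the app loads without it

    model = whisper.load_model(model_size)  # downloads weights on first use
    result = model.transcribe(audio_path)
    return result["text"].strip()

# Example (requires openai-whisper and an audio file):
# print(transcribe_audio("recording.wav"))
```

Larger model sizes (`small`, `medium`) trade speed for accuracy, which matters when running on a laptop without a GPU.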
🤖 Intent Detection
The system analyzes the transcribed text and classifies it into one of four intents:
- Create File
- Write Code
- Summarize Text
- General Chat
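One simple way to implement this classification is keyword matching on the transcript. The keyword lists below are my own illustrative assumptions, not the project's actual rules, with general chat as the fallback when nothing matches.

```python
# Hypothetical keyword table mapping intents to trigger phrases.
INTENT_KEYWORDS = {
    "create_file": ("create a file", "new file", "make a file"),
    "write_code": ("write code", "generate code", "write a function"),
    "summarize": ("summarize", "summary", "tl;dr"),
}

def detect_intent(text: str) -> str:
    """Return the first intent whose keywords appear; default to general chat."""
    lowered = text.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return intent
    return "general_chat"
```

A rule-based classifier like this is easy to debug, but an LLM-based classifier handles paraphrases ("put that in a document for me") far better.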
⚙️ Actions
Based on the detected intent, the system performs one of the following:
- File creation inside a safe output directory
- Code generation and saving into files
- Text summarization
- Chat responses
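Routing an intent to its action can be done with a simple dispatch table. The handlers below are stubs with hypothetical names; in the real system each would do actual work (write the file, call the code generator, and so on).

```python
# Stub handlers; each stands in for a real action implementation.
def create_file(text: str) -> str:
    return "file created"

def write_code(text: str) -> str:
    return "code written"

def summarize(text: str) -> str:
    return "summary ready"

def chat(text: str) -> str:
    return "chat reply"

# Dispatch table: intent name -> handler function.
ACTIONS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
}

def execute(intent: str, text: str) -> str:
    """Look up the handler for an intent; general chat is the fallback."""
    handler = ACTIONS.get(intent, chat)
    return handler(text)
```

Adding a new capability then means writing one handler and registering it in `ACTIONS`, with no changes to the pipeline itself.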
💻 User Interface
I used Streamlit to build a simple, interactive UI that displays:
- Transcribed text
- Detected intent
- Action results
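A sketch of that view layer is below. The structure and labels are my assumptions; separating the data from the rendering keeps the UI logic testable without launching Streamlit.

```python
def build_display(transcript: str, intent: str, result: str) -> dict:
    """Collect the three pieces of output the UI shows, in display order."""
    return {
        "Transcribed text": transcript,
        "Detected intent": intent,
        "Action result": result,
    }

def render(payload: dict) -> None:
    """Render the payload with Streamlit (run via `streamlit run app.py`)."""
    import streamlit as st  # imported lazily so tests run without Streamlit

    st.title("Voice AI Agent")
    for label, value in payload.items():
        st.subheader(label)
        st.write(value)
```

In the app, `render(build_display(...))` would be called once per processed recording.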
⚡ Challenges Faced
- Handling speech recognition errors
- Managing file safety using a restricted output directory
- Designing a clean UI pipeline
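The file-safety challenge above can be handled by resolving every requested path and rejecting anything that escapes the sandbox. The output directory name here is an assumption for illustration.

```python
from pathlib import Path

OUTPUT_DIR = Path("agent_output").resolve()  # assumed sandbox directory name

def safe_path(filename: str) -> Path:
    """Resolve filename inside OUTPUT_DIR, rejecting traversal like '../x'."""
    candidate = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR not in candidate.parents and candidate != OUTPUT_DIR:
        raise ValueError(f"path escapes output directory: {filename}")
    return candidate
```

Because `resolve()` normalizes `..` segments before the containment check, a transcribed command like "create a file called ../../etc/passwd" is rejected rather than written.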
🎯 Conclusion
This project demonstrates how to build a local AI agent that integrates speech processing, NLP, and automation into a single system.
🔗 Links
- GitHub Repository: https://github.com/Vedant-Jagtap/voice-ai-agent.git
- Demo Video: https://youtu.be/KwK0PrQG9Z4?si=bKxDWaHV6tQPZEwH