## Introduction
In this project, I built a Voice AI Assistant that can understand spoken commands and perform actions like creating files, generating code, and summarizing text. The system integrates speech-to-text, natural language understanding, and automation into a single pipeline.
## System Overview
The architecture of the system is as follows:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output
1. The user provides input through voice.
2. The system converts the speech into text.
3. A local LLM analyzes the text to detect the intent.
4. Based on the intent, the system executes the appropriate action.
## Tech Stack
- Python
- AssemblyAI (speech-to-text API)
- Ollama (local LLM, phi model)
- Gradio (user interface)
## Features
- Speech-to-Text (STT)
The system uses AssemblyAI to convert audio input into text. The client polls the API until the transcription is marked as completed.
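The polling step can be sketched as a generic helper. Here `get_status` is a stand-in for the real API call (in the actual pipeline it would be an HTTP GET against the transcript's status endpoint); the function name and parameters are illustrative, not AssemblyAI's SDK:

```python
import time

def poll_until_done(get_status, interval=2.0, timeout=120.0):
    """Call get_status() repeatedly until it reports a terminal state.

    get_status must return a dict with a "status" key, e.g. the parsed
    JSON body of a GET request on the transcript endpoint.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = get_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(result.get("error", "transcription failed"))
        time.sleep(interval)
    raise TimeoutError("transcription did not finish in time")
```

Keeping the polling loop separate from the HTTP details also makes it easy to unit-test with a fake status function.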
- Intent Detection
A local LLM (via Ollama) is used to classify user input into four categories:
- `create_file`
- `write_code`
- `summarize`
- `chat`
To improve reliability, I implemented:
- Prompt engineering for better classification
- Regex-based JSON extraction
- Rule-based validation as a fallback
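A minimal sketch of the extraction-plus-fallback idea: pull the first JSON object out of the raw LLM reply, validate the intent against the known set, and fall back to keyword rules if parsing fails. The keyword rules below are illustrative, not the project's actual ones:

```python
import json
import re

INTENTS = {"create_file", "write_code", "summarize", "chat"}

def extract_intent(llm_output: str, user_text: str) -> str:
    # 1. Try to pull the first JSON object out of the raw LLM output.
    match = re.search(r"\{.*?\}", llm_output, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group())
            intent = data.get("intent")
            if intent in INTENTS:
                return intent
        except json.JSONDecodeError:
            pass
    # 2. Rule-based fallback on the user's own words.
    lowered = user_text.lower()
    if "file" in lowered and "create" in lowered:
        return "create_file"
    if "code" in lowered or "function" in lowered:
        return "write_code"
    if "summar" in lowered:
        return "summarize"
    return "chat"
```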
- Tool Execution
File Creation
Creates files dynamically inside a dedicated output/ folder.
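A sketch of the file-creation tool, assuming a helper named `create_file` (hypothetical name). Keeping only the final path component ensures a spoken or mis-transcribed name can't escape the `output/` folder:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def create_file(name: str, content: str = "") -> Path:
    """Create a file inside the dedicated output/ folder."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    # Keep only the final path component so "../../secret.txt" can't escape.
    safe_name = Path(name).name
    path = OUTPUT_DIR / safe_name
    path.write_text(content, encoding="utf-8")
    return path
```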
Code Generation
Generates Python code based on user instructions using the LLM and cleans the output to remove markdown and explanations.
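The cleanup step can be sketched as follows: keep only the contents of the first fenced code block, falling back to the raw text when the model returned bare code. (The fence marker is built programmatically here just to avoid nesting literal backticks in this example.)

```python
import re

FENCE = "`" * 3  # literal markdown fence marker

def clean_llm_code(raw: str) -> str:
    """Keep only the code inside the first fenced block, if any."""
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, raw, re.DOTALL)
    if match:
        return match.group(1).strip()
    return raw.strip()
```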
Summarization
Summarizes user-provided content using the LLM.
- Dynamic File Handling
Since speech-to-text may introduce formatting issues (e.g., "text dot txt" instead of "text.txt"), I implemented a normalization layer that extracts the correct file name using regex.
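A minimal sketch of such a normalization layer, assuming the common spoken pattern "name dot extension"; the exact rules in the project may differ:

```python
import re

def normalize_filename(spoken: str) -> str:
    """Turn spoken forms like "notes dot txt" into "notes.txt"."""
    text = spoken.strip().lower()
    # Replace the spoken word "dot" between tokens with a literal dot.
    text = re.sub(r"\s+dot\s+", ".", text)
    # Collapse remaining whitespace into underscores for a valid name.
    text = re.sub(r"\s+", "_", text)
    return text
```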
- User Interface
Gradio is used to provide a simple interface where users can upload or record audio and view results including:
- Transcription
- Detected intent
- Action output
## Challenges Faced
- LLM Output Formatting
The local LLM sometimes returned extra text along with the JSON. I solved this by extracting the first valid JSON object with a regex.
- Intent Misclassification
Small models like phi occasionally misclassified inputs. I improved accuracy by adding rule-based validation.
- API Limitations
While experimenting with cloud LLMs, I faced quota limitations. To ensure reliability, I switched to a local LLM using Ollama.
- Speech-to-Text Noise
STT outputs sometimes had spacing and punctuation issues. I handled this by cleaning and normalizing text before processing.
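The cleanup pass can be sketched with two regex substitutions; the exact rules here are illustrative, assuming the typical artifacts of doubled spaces and stray spaces before punctuation:

```python
import re

def clean_transcript(text: str) -> str:
    """Normalize common STT artifacts in a transcript."""
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    text = re.sub(r"\s+([,.!?])", r"\1", text)  # drop space before punctuation
    return text
```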
## Key Learnings
- Building end-to-end AI systems requires combining multiple components.
- LLM outputs are not always reliable and need validation.
- Local models can improve system stability by removing the API dependency.
- Prompt engineering plays a critical role in system performance.
## Conclusion
This project demonstrates how voice interfaces can be integrated with AI systems to automate real-world tasks. By combining STT, LLMs, and tool execution, I built a robust and interactive assistant capable of handling multiple tasks efficiently.
## Links
- GitHub Repository: https://github.com/ktejashwini17/voice-ai-assistant
- Demo Video: https://youtu.be/L5VGOnNkPGw