<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kurella Tejashwini</title>
    <description>The latest articles on DEV Community by Kurella Tejashwini (@tejashwini_kurella).</description>
    <link>https://dev.to/tejashwini_kurella</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876205%2Fc55a4912-40eb-4164-849d-4d48b7bba64f.png</url>
      <title>DEV Community: Kurella Tejashwini</title>
      <link>https://dev.to/tejashwini_kurella</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tejashwini_kurella"/>
    <language>en</language>
    <item>
      <title>🎤 Building a Voice AI Assistant using STT, LLM, and Gradio</title>
      <dc:creator>Kurella Tejashwini</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:35:29 +0000</pubDate>
      <link>https://dev.to/tejashwini_kurella/building-a-voice-ai-assistant-using-stt-llm-and-gradio-o1l</link>
      <guid>https://dev.to/tejashwini_kurella/building-a-voice-ai-assistant-using-stt-llm-and-gradio-o1l</guid>
      <description>&lt;p&gt;🚀 Introduction&lt;/p&gt;

&lt;p&gt;In this project, I built a Voice AI Assistant that can understand spoken commands and perform actions like creating files, generating code, and summarizing text. The system integrates speech-to-text, natural language understanding, and automation into a single pipeline.&lt;/p&gt;

&lt;p&gt;🧠 System Overview&lt;/p&gt;

&lt;p&gt;The architecture of the system is as follows:&lt;/p&gt;

&lt;p&gt;Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output&lt;/p&gt;
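&lt;p&gt;The flow above can be sketched as plain function composition, where each stage is an ordinary function. All function bodies below are illustrative stand-ins, not the project's actual implementation:&lt;/p&gt;

```python
# Minimal sketch of the pipeline: each stage is a plain function,
# so the whole flow is just function composition.
# Every function body here is a placeholder, not real project code.

def transcribe(audio_path):
    # Speech-to-Text stage (in the real system: an STT API call)
    return "create a file called notes.txt"

def detect_intent(text):
    # Intent Detection stage (in the real system: a local LLM call)
    return {"intent": "create_file", "argument": "notes.txt"}

def execute(intent):
    # Tool Execution stage: dispatch on the detected intent
    return f"Executed {intent['intent']} with {intent['argument']}"

def run_pipeline(audio_path):
    text = transcribe(audio_path)
    intent = detect_intent(text)
    return execute(intent)

print(run_pipeline("input.wav"))  # Executed create_file with notes.txt
```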

&lt;p&gt;The user provides input through voice.&lt;br&gt;
The system converts speech into text.&lt;br&gt;
A local LLM analyzes the text to detect intent.&lt;br&gt;
Based on the intent, the system executes the appropriate action.&lt;/p&gt;

&lt;p&gt;🛠 Tech Stack&lt;/p&gt;

&lt;p&gt;Python&lt;br&gt;
AssemblyAI (Speech-to-Text API)&lt;br&gt;
Ollama (Local LLM – phi model)&lt;br&gt;
Gradio (User Interface)&lt;/p&gt;

&lt;p&gt;🎯 Features&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Speech-to-Text (STT)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system uses AssemblyAI to convert audio input into text. Polling is used to repeatedly check whether the transcription has completed.&lt;/p&gt;
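&lt;p&gt;A minimal sketch of that polling loop. The network call is injected as a function so the loop itself stays testable; the status values mirror the queued/processing/completed/error lifecycle of services like AssemblyAI, but &lt;code&gt;fetch_status&lt;/code&gt; and the details are assumptions, not the project's exact code:&lt;/p&gt;

```python
import time

# Hedged sketch of an STT polling loop: submit a job elsewhere, then
# repeatedly fetch its status until it is "completed" or "error".
# fetch_status is a placeholder for the real API request.

def poll_transcript(fetch_status, interval=1.0, max_tries=60):
    for _ in range(max_tries):
        job = fetch_status()
        if job["status"] == "completed":
            return job["text"]
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        time.sleep(interval)
    raise TimeoutError("transcription did not finish in time")

# Example with a fake status sequence standing in for the real API:
states = iter([
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])
print(poll_transcript(lambda: next(states), interval=0.0))  # hello world
```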

&lt;ol start="2"&gt;
&lt;li&gt;Intent Detection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A local LLM (via Ollama) is used to classify user input into four categories:&lt;/p&gt;

&lt;p&gt;create_file&lt;br&gt;
write_code&lt;br&gt;
summarize&lt;br&gt;
chat&lt;/p&gt;

&lt;p&gt;To improve reliability, I implemented:&lt;/p&gt;

&lt;p&gt;Prompt engineering for better classification&lt;br&gt;
Regex-based JSON extraction&lt;br&gt;
Rule-based validation as a fallback&lt;/p&gt;
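&lt;p&gt;A sketch of the regex-based JSON extraction with a fallback, assuming the LLM is asked to answer with a JSON object containing an &lt;code&gt;intent&lt;/code&gt; field (the exact prompt and schema here are illustrative):&lt;/p&gt;

```python
import json
import re

# Hedged sketch of regex-based JSON extraction: small local models
# often wrap their JSON answer in extra prose, so we pull out the
# first {...} span and parse it, defaulting to "chat" on failure.

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def extract_intent(llm_output):
    match = re.search(r"\{.*?\}", llm_output, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if data.get("intent") in VALID_INTENTS:
                return data
        except json.JSONDecodeError:
            pass
    return {"intent": "chat"}  # fallback when no valid JSON is found

raw = 'Sure! Here is the result: {"intent": "create_file", "filename": "notes.txt"}'
print(extract_intent(raw))  # {'intent': 'create_file', 'filename': 'notes.txt'}
```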

&lt;ol start="3"&gt;
&lt;li&gt;Tool Execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;📁 File Creation&lt;/p&gt;

&lt;p&gt;Creates files dynamically inside a dedicated output/ folder.&lt;/p&gt;
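&lt;p&gt;A minimal sketch of such a file-creation tool, assuming the name is sanitized so a transcribed command cannot write outside the output/ folder (the sanitization rule is an assumption of this sketch):&lt;/p&gt;

```python
import os
import re

# Hedged sketch of the file-creation tool: every file lands inside a
# dedicated output/ folder, with the name sanitized first.

OUTPUT_DIR = "output"

def create_file(filename, content=""):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # keep only safe characters and drop any directory components
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", os.path.basename(filename))
    path = os.path.join(OUTPUT_DIR, safe_name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path

print(create_file("notes.txt", "hello"))  # e.g. output/notes.txt on POSIX
```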

&lt;p&gt;💻 Code Generation&lt;/p&gt;

&lt;p&gt;Generates Python code based on user instructions using the LLM and cleans the output to remove markdown and explanations.&lt;/p&gt;
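&lt;p&gt;The cleaning step can be sketched as follows, assuming the LLM wraps its answer in markdown fences (the fence string is built indirectly only so this sketch can itself live inside a code block):&lt;/p&gt;

```python
import re

# Hedged sketch of the output-cleaning step: keep only the body of the
# first fenced code block, dropping surrounding prose and fences.

FENCE = "`" * 3  # a markdown code fence

def clean_code(llm_output):
    pattern = FENCE + r"(?:[a-z]*)\n(.*?)" + FENCE
    match = re.search(pattern, llm_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return llm_output.strip()

raw = "Here is your code:\n" + FENCE + "python\nprint('hi')\n" + FENCE
print(clean_code(raw))  # print('hi')
```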

&lt;p&gt;📝 Summarization&lt;/p&gt;

&lt;p&gt;Summarizes user-provided content using the LLM.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Dynamic File Handling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since speech-to-text may introduce formatting issues (e.g., “text dot txt”), I implemented a normalization layer to extract correct file names using regex.&lt;/p&gt;
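&lt;p&gt;That normalization layer can be sketched like this; the spoken-word mappings and the filename pattern are illustrative assumptions:&lt;/p&gt;

```python
import re

# Hedged sketch of the normalization layer: speech-to-text spells
# punctuation out loud ("notes dot txt"), so map the spoken form back
# to a real filename before extracting it with a regex.

def normalize_filename(text):
    text = text.lower()
    text = re.sub(r"\s+dot\s+", ".", text)         # "notes dot txt" becomes "notes.txt"
    text = re.sub(r"\s+underscore\s+", "_", text)  # "my underscore notes" becomes "my_notes"
    match = re.search(r"[\w.-]+\.[a-z0-9]+", text)
    return match.group(0) if match else None

print(normalize_filename("please create notes dot txt for me"))  # notes.txt
```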

&lt;ol start="5"&gt;
&lt;li&gt;User Interface&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Gradio is used to provide a simple interface where users can upload or record audio and view results including:&lt;/p&gt;

&lt;p&gt;Transcription&lt;br&gt;
Detected intent&lt;br&gt;
Action output&lt;/p&gt;

&lt;p&gt;⚙️ Challenges Faced&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM Output Formatting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The local LLM sometimes returned extra text along with JSON. I solved this by extracting valid JSON using regex.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Intent Misclassification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Small models like phi occasionally misclassified inputs. I improved accuracy by adding rule-based validation.&lt;/p&gt;
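&lt;p&gt;A sketch of that rule-based validation: when obvious keywords in the user's text disagree with the model's prediction, the keyword rule wins. The keyword lists here are illustrative assumptions, not the project's actual rules:&lt;/p&gt;

```python
# Hedged sketch of the rule-based validation layer that backstops the
# small LLM's intent classification.

RULES = {
    "create_file": ("create", "make a file", "new file"),
    "write_code": ("code", "script", "function", "program"),
    "summarize": ("summarize", "summary", "tl;dr"),
}

def validate_intent(text, predicted):
    lowered = text.lower()
    for intent, keywords in RULES.items():
        if any(word in lowered for word in keywords):
            return intent
    return predicted  # no rule fired; trust the model

print(validate_intent("write code to sort a list", "chat"))  # write_code
```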

&lt;ol start="3"&gt;
&lt;li&gt;API Limitations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While experimenting with cloud LLMs, I faced quota limitations. To ensure reliability, I switched to a local LLM using Ollama.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Speech-to-Text Noise&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;STT outputs sometimes had spacing and punctuation issues. I handled this by cleaning and normalizing text before processing.&lt;/p&gt;
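&lt;p&gt;A minimal sketch of that cleanup step, assuming the two most common problems are runs of whitespace and stray spaces before punctuation:&lt;/p&gt;

```python
import re

# Hedged sketch of the transcript-cleaning step applied before intent
# detection: collapse repeated whitespace, remove stray spaces before
# punctuation, and trim the ends.

def clean_transcript(text):
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # no space before punctuation
    return text.strip()

print(clean_transcript("  create   a file ,  please . "))  # create a file, please.
```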

&lt;p&gt;💡 Key Learnings&lt;/p&gt;

&lt;p&gt;Building end-to-end AI systems requires combining multiple components.&lt;br&gt;
LLM outputs are not always reliable and need validation.&lt;br&gt;
Local models can improve system stability by removing the dependency on external APIs.&lt;br&gt;
Prompt engineering plays a critical role in system performance.&lt;/p&gt;

&lt;p&gt;🎯 Conclusion&lt;/p&gt;

&lt;p&gt;This project demonstrates how voice interfaces can be integrated with AI systems to automate real-world tasks. By combining STT, LLMs, and tool execution, I built a robust and interactive assistant capable of handling multiple tasks efficiently.&lt;/p&gt;

&lt;p&gt;🔗 Links&lt;br&gt;
GitHub Repository: &lt;a href="https://github.com/ktejashwini17/voice-ai-assistant" rel="noopener noreferrer"&gt;https://github.com/ktejashwini17/voice-ai-assistant&lt;/a&gt;&lt;br&gt;
Demo Video: &lt;a href="https://youtu.be/L5VGOnNkPGw" rel="noopener noreferrer"&gt;https://youtu.be/L5VGOnNkPGw&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
