Kurella Tejashwini

🎤 Building a Voice AI Assistant using STT, LLM, and Gradio

🚀 Introduction

In this project, I built a Voice AI Assistant that can understand spoken commands and perform actions like creating files, generating code, and summarizing text. The system integrates speech-to-text, natural language understanding, and automation into a single pipeline.

🧠 System Overview

The architecture of the system is as follows:

Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output

The user provides input through voice.
The system converts speech into text.
A local LLM analyzes the text to detect intent.
Based on the intent, the system executes the appropriate action.
🛠 Tech Stack
Python
AssemblyAI (Speech-to-Text API)
Ollama (Local LLM – phi model)
Gradio (User Interface)
🎯 Features

  1. Speech-to-Text (STT)

The system uses AssemblyAI to convert audio input into text, polling the API until the transcription job reports that it has completed.
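The polling loop can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: `get_status` stands in for a request to AssemblyAI's transcript endpoint and is assumed to return a dict with at least a `"status"` key (AssemblyAI jobs move through `queued`/`processing` to `completed` or `error`).

```python
import time

def poll_transcript(get_status, interval=1.0, max_tries=60):
    """Poll a transcription job until it reaches a terminal status.

    get_status: callable standing in for a GET on the transcript
    endpoint, returning e.g. {"status": "processing"} or
    {"status": "completed", "text": "..."}.
    """
    for _ in range(max_tries):
        job = get_status()
        if job["status"] == "completed":
            return job["text"]
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        time.sleep(interval)  # wait before asking again
    raise TimeoutError("transcription did not finish in time")
```

Injecting the fetcher keeps the retry logic testable without hitting the network.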

  2. Intent Detection

A local LLM (via Ollama) is used to classify user input into four categories:

create_file
write_code
summarize
chat

To improve reliability, I implemented:

Prompt engineering for better classification
Regex-based JSON extraction
Rule-based validation as a fallback
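The extraction-plus-fallback idea can be sketched like this. The function name and the keyword rules are illustrative assumptions, not the project's exact implementation: a regex pulls the first JSON object out of the LLM reply, and if that fails or yields an unknown intent, simple keyword rules on the raw user text take over.

```python
import json
import re

INTENTS = {"create_file", "write_code", "summarize", "chat"}

def extract_intent(llm_output, user_text=""):
    # First try: find a JSON object anywhere in the LLM reply.
    match = re.search(r"\{.*?\}", llm_output, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if data.get("intent") in INTENTS:
                return data["intent"]
        except json.JSONDecodeError:
            pass
    # Fallback: rule-based keywords on the user's own words.
    lowered = user_text.lower()
    if "create" in lowered and "file" in lowered:
        return "create_file"
    if "code" in lowered or "function" in lowered:
        return "write_code"
    if "summar" in lowered:
        return "summarize"
    return "chat"
```

The regex tolerates chatty replies like `Sure! {"intent": "write_code"} Hope that helps`, while the rules guarantee a valid intent even when the model returns no usable JSON.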

  3. Tool Execution

📁 File Creation

Creates files dynamically inside a dedicated output/ folder.
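A minimal sketch of that tool (function name assumed, not taken from the repo): the folder is created on demand, and taking only the base name of the requested file keeps everything confined to output/.

```python
from pathlib import Path

def create_file(filename, content=""):
    """Create a file inside a dedicated output/ folder."""
    out_dir = Path("output")
    out_dir.mkdir(exist_ok=True)
    # Path(...).name drops any directory parts, so a transcribed
    # name can never escape the output folder.
    path = out_dir / Path(filename).name
    path.write_text(content, encoding="utf-8")
    return str(path)
```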

💻 Code Generation

Generates Python code from user instructions using the LLM, then cleans the output to remove markdown fences and explanatory text.
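The cleanup step can be sketched as below (a simplified stand-in for the project's own cleaning code): if the reply contains a fenced code block, keep only its body; otherwise return the reply as-is.

```python
import re

def extract_code(llm_reply):
    """Pull code out of an LLM reply, dropping markdown fences
    and any surrounding explanation."""
    fenced = re.search(r"```(?:\w+)?\n(.*?)```", llm_reply, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    # No fence found: assume the whole reply is code.
    return llm_reply.strip()
```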

📝 Summarization

Summarizes user-provided content using the LLM.

  4. Dynamic File Handling

Since speech-to-text may introduce formatting issues (e.g., "text dot txt" instead of "text.txt"), I implemented a normalization layer that extracts correct file names using regex.
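A minimal sketch of such a normalization layer (the function name and exact rules are assumptions, not the repo's code): the spoken word "dot" becomes an extension separator, and leftover spaces are replaced to keep the name filesystem-safe.

```python
import re

def normalize_filename(spoken):
    """Turn a spoken file name into a usable one,
    e.g. "Text dot txt" -> "text.txt"."""
    name = spoken.lower().strip()
    # STT typically renders the extension separator as the word "dot".
    name = re.sub(r"\s*\bdot\b\s*", ".", name)
    # Replace any remaining spaces so the name is filesystem-safe.
    name = re.sub(r"\s+", "_", name)
    return name
```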

  5. User Interface

Gradio is used to provide a simple interface where users can upload or record audio and view results including:

Transcription
Detected intent
Action output
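Wiring those three outputs into Gradio can be sketched like this. The pipeline stages are stubbed out here (the real app calls AssemblyAI and Ollama in their place), and the component layout is an assumption about the app's UI, not a copy of it.

```python
def run_pipeline(audio_path):
    """UI glue: transcribe, classify, execute.

    Stubbed stages; the real assistant calls the STT and
    LLM components in their place.
    """
    transcript = f"(transcript of {audio_path})"
    intent = "chat"
    result = "(assistant reply)"
    return transcript, intent, result

if __name__ == "__main__":
    import gradio as gr

    demo = gr.Interface(
        fn=run_pipeline,
        inputs=gr.Audio(type="filepath"),
        outputs=[
            gr.Textbox(label="Transcription"),
            gr.Textbox(label="Detected intent"),
            gr.Textbox(label="Action output"),
        ],
    )
    demo.launch()
```

Because the handler returns a 3-tuple matching the three output components, Gradio renders transcription, intent, and result side by side after each recording.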
⚙️ Challenges Faced

  1. LLM Output Formatting

The local LLM sometimes returned extra text along with JSON. I solved this by extracting valid JSON using regex.

  2. Intent Misclassification

Small models like phi occasionally misclassified inputs. I improved accuracy by adding rule-based validation.

  3. API Limitations

While experimenting with cloud LLMs, I faced quota limitations. To ensure reliability, I switched to a local LLM using Ollama.

  4. Speech-to-Text Noise

STT outputs sometimes had spacing and punctuation issues. I handled this by cleaning and normalizing text before processing.
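That cleanup can be sketched as a small pre-processing step (a simplified illustration, not the project's exact code): whitespace runs are collapsed and stray spaces before punctuation are removed before the text reaches intent detection.

```python
import re

def clean_transcript(text):
    """Normalize STT output before downstream processing."""
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove stray spaces that STT inserts before punctuation.
    text = re.sub(r"\s+([,.!?])", r"\1", text)
    return text
```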

💡 Key Learnings
Building end-to-end AI systems requires combining multiple components.
LLM outputs are not always reliable and need validation.
Local models can improve system stability by removing API dependency.
Prompt engineering plays a critical role in system performance.
🎯 Conclusion

This project demonstrates how voice interfaces can be integrated with AI systems to automate real-world tasks. By combining STT, LLMs, and tool execution, I built a robust and interactive assistant capable of handling multiple tasks efficiently.

🔗 Links
GitHub Repository: https://github.com/ktejashwini17/voice-ai-assistant
Demo Video: https://youtu.be/L5VGOnNkPGw
