<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kurella Tejashwini</title>
    <description>The latest articles on DEV Community by Kurella Tejashwini (@tejashwini_kurella).</description>
    <link>https://dev.to/tejashwini_kurella</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876205%2Fc55a4912-40eb-4164-849d-4d48b7bba64f.png</url>
      <title>DEV Community: Kurella Tejashwini</title>
      <link>https://dev.to/tejashwini_kurella</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tejashwini_kurella"/>
    <language>en</language>
    <item>
      <title>🎤 Building a Voice AI Assistant using STT, LLM, and Gradio</title>
      <dc:creator>Kurella Tejashwini</dc:creator>
      <pubDate>Mon, 13 Apr 2026 08:35:29 +0000</pubDate>
      <link>https://dev.to/tejashwini_kurella/building-a-voice-ai-assistant-using-stt-llm-and-gradio-o1l</link>
      <guid>https://dev.to/tejashwini_kurella/building-a-voice-ai-assistant-using-stt-llm-and-gradio-o1l</guid>
      <description>&lt;p&gt;🚀 Introduction&lt;/p&gt;

&lt;p&gt;In this project, I built a Voice AI Assistant that can understand spoken commands and perform actions like creating files, generating code, and summarizing text. The system integrates speech-to-text, natural language understanding, and automation into a single pipeline.&lt;/p&gt;

&lt;p&gt;🧠 System Overview&lt;/p&gt;

&lt;p&gt;The architecture of the system is as follows:&lt;/p&gt;

&lt;p&gt;Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output&lt;/p&gt;
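&lt;p&gt;The flow above can be sketched as plain function composition, where each stage is an ordinary function. All function bodies below are illustrative stand-ins, not the project's actual implementation:&lt;/p&gt;

```python
# Minimal sketch of the pipeline: each stage is a plain function,
# so the whole flow is just function composition.
# Every function body here is a placeholder, not real project code.

def transcribe(audio_path):
    # Speech-to-Text stage (in the real system: an STT API call)
    return "create a file called notes.txt"

def detect_intent(text):
    # Intent Detection stage (in the real system: a local LLM call)
    return {"intent": "create_file", "argument": "notes.txt"}

def execute(intent):
    # Tool Execution stage: dispatch on the detected intent
    return f"Executed {intent['intent']} with {intent['argument']}"

def run_pipeline(audio_path):
    text = transcribe(audio_path)
    intent = detect_intent(text)
    return execute(intent)

print(run_pipeline("input.wav"))  # Executed create_file with notes.txt
```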

&lt;p&gt;The user provides input through voice.&lt;br&gt;
The system converts speech into text.&lt;br&gt;
A local LLM analyzes the text to detect intent.&lt;br&gt;
Based on the intent, the system executes the appropriate action.&lt;/p&gt;

&lt;p&gt;🛠 Tech Stack&lt;/p&gt;

&lt;p&gt;Python&lt;br&gt;
AssemblyAI (Speech-to-Text API)&lt;br&gt;
Ollama (Local LLM – phi model)&lt;br&gt;
Gradio (User Interface)&lt;/p&gt;

&lt;p&gt;🎯 Features&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Speech-to-Text (STT)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The system uses AssemblyAI to convert audio input into text. Polling is used to repeatedly check whether the transcription has completed.&lt;/p&gt;
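&lt;p&gt;A minimal sketch of that polling loop. The network call is injected as a function so the loop itself stays testable; the status values mirror the queued/processing/completed/error lifecycle of services like AssemblyAI, but &lt;code&gt;fetch_status&lt;/code&gt; and the details are assumptions, not the project's exact code:&lt;/p&gt;

```python
import time

# Hedged sketch of an STT polling loop: submit a job elsewhere, then
# repeatedly fetch its status until it is "completed" or "error".
# fetch_status is a placeholder for the real API request.

def poll_transcript(fetch_status, interval=1.0, max_tries=60):
    for _ in range(max_tries):
        job = fetch_status()
        if job["status"] == "completed":
            return job["text"]
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "transcription failed"))
        time.sleep(interval)
    raise TimeoutError("transcription did not finish in time")

# Example with a fake status sequence standing in for the real API:
states = iter([
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])
print(poll_transcript(lambda: next(states), interval=0.0))  # hello world
```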

&lt;ol start="2"&gt;
&lt;li&gt;Intent Detection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A local LLM (via Ollama) is used to classify user input into four categories:&lt;/p&gt;

&lt;p&gt;create_file&lt;br&gt;
write_code&lt;br&gt;
summarize&lt;br&gt;
chat&lt;/p&gt;

&lt;p&gt;To improve reliability, I implemented:&lt;/p&gt;

&lt;p&gt;Prompt engineering for better classification&lt;br&gt;
Regex-based JSON extraction&lt;br&gt;
Rule-based validation as a fallback&lt;/p&gt;
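&lt;p&gt;A sketch of the regex-based JSON extraction with a fallback, assuming the LLM is asked to answer with a JSON object containing an &lt;code&gt;intent&lt;/code&gt; field (the exact prompt and schema here are illustrative):&lt;/p&gt;

```python
import json
import re

# Hedged sketch of regex-based JSON extraction: small local models
# often wrap their JSON answer in extra prose, so we pull out the
# first {...} span and parse it, defaulting to "chat" on failure.

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def extract_intent(llm_output):
    match = re.search(r"\{.*?\}", llm_output, re.DOTALL)
    if match:
        try:
            data = json.loads(match.group(0))
            if data.get("intent") in VALID_INTENTS:
                return data
        except json.JSONDecodeError:
            pass
    return {"intent": "chat"}  # fallback when no valid JSON is found

raw = 'Sure! Here is the result: {"intent": "create_file", "filename": "notes.txt"}'
print(extract_intent(raw))  # {'intent': 'create_file', 'filename': 'notes.txt'}
```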

&lt;ol start="3"&gt;
&lt;li&gt;Tool Execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;📁 File Creation&lt;/p&gt;

&lt;p&gt;Creates files dynamically inside a dedicated output/ folder.&lt;/p&gt;
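&lt;p&gt;A minimal sketch of such a file-creation tool, assuming the name is sanitized so a transcribed command cannot write outside the output/ folder (the sanitization rule is an assumption of this sketch):&lt;/p&gt;

```python
import os
import re

# Hedged sketch of the file-creation tool: every file lands inside a
# dedicated output/ folder, with the name sanitized first.

OUTPUT_DIR = "output"

def create_file(filename, content=""):
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # keep only safe characters and drop any directory components
    safe_name = re.sub(r"[^A-Za-z0-9._-]", "_", os.path.basename(filename))
    path = os.path.join(OUTPUT_DIR, safe_name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path

print(create_file("notes.txt", "hello"))  # e.g. output/notes.txt on POSIX
```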

&lt;p&gt;💻 Code Generation&lt;/p&gt;

&lt;p&gt;Generates Python code based on user instructions using the LLM and cleans the output to remove markdown and explanations.&lt;/p&gt;
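&lt;p&gt;The cleaning step can be sketched as follows, assuming the LLM wraps its answer in markdown fences (the fence string is built indirectly only so this sketch can itself live inside a code block):&lt;/p&gt;

```python
import re

# Hedged sketch of the output-cleaning step: keep only the body of the
# first fenced code block, dropping surrounding prose and fences.

FENCE = "`" * 3  # a markdown code fence

def clean_code(llm_output):
    pattern = FENCE + r"(?:[a-z]*)\n(.*?)" + FENCE
    match = re.search(pattern, llm_output, re.DOTALL)
    if match:
        return match.group(1).strip()
    return llm_output.strip()

raw = "Here is your code:\n" + FENCE + "python\nprint('hi')\n" + FENCE
print(clean_code(raw))  # print('hi')
```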

&lt;p&gt;📝 Summarization&lt;/p&gt;

&lt;p&gt;Summarizes user-provided content using the LLM.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Dynamic File Handling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Since speech-to-text may introduce formatting issues (e.g., “text dot txt”), I implemented a normalization layer to extract correct file names using regex.&lt;/p&gt;
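&lt;p&gt;That normalization layer can be sketched like this; the spoken-word mappings and the filename pattern are illustrative assumptions:&lt;/p&gt;

```python
import re

# Hedged sketch of the normalization layer: speech-to-text spells
# punctuation out loud ("notes dot txt"), so map the spoken form back
# to a real filename before extracting it with a regex.

def normalize_filename(text):
    text = text.lower()
    text = re.sub(r"\s+dot\s+", ".", text)         # "notes dot txt" becomes "notes.txt"
    text = re.sub(r"\s+underscore\s+", "_", text)  # "my underscore notes" becomes "my_notes"
    match = re.search(r"[\w.-]+\.[a-z0-9]+", text)
    return match.group(0) if match else None

print(normalize_filename("please create notes dot txt for me"))  # notes.txt
```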

&lt;ol start="5"&gt;
&lt;li&gt;User Interface&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Gradio is used to provide a simple interface where users can upload or record audio and view results including:&lt;/p&gt;

&lt;p&gt;Transcription&lt;br&gt;
Detected intent&lt;br&gt;
Action output&lt;/p&gt;

&lt;p&gt;⚙️ Challenges Faced&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;LLM Output Formatting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The local LLM sometimes returned extra text along with JSON. I solved this by extracting valid JSON using regex.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Intent Misclassification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Small models like phi occasionally misclassified inputs. I improved accuracy by adding rule-based validation.&lt;/p&gt;
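&lt;p&gt;A sketch of that rule-based validation: when obvious keywords in the user's text disagree with the model's prediction, the keyword rule wins. The keyword lists here are illustrative assumptions, not the project's actual rules:&lt;/p&gt;

```python
# Hedged sketch of the rule-based validation layer that backstops the
# small LLM's intent classification.

RULES = {
    "create_file": ("create", "make a file", "new file"),
    "write_code": ("code", "script", "function", "program"),
    "summarize": ("summarize", "summary", "tl;dr"),
}

def validate_intent(text, predicted):
    lowered = text.lower()
    for intent, keywords in RULES.items():
        if any(word in lowered for word in keywords):
            return intent
    return predicted  # no rule fired; trust the model

print(validate_intent("write code to sort a list", "chat"))  # write_code
```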

&lt;ol start="3"&gt;
&lt;li&gt;API Limitations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While experimenting with cloud LLMs, I faced quota limitations. To ensure reliability, I switched to a local LLM using Ollama.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Speech-to-Text Noise&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;STT outputs sometimes had spacing and punctuation issues. I handled this by cleaning and normalizing text before processing.&lt;/p&gt;
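&lt;p&gt;A minimal sketch of that cleanup step, assuming the two most common problems are runs of whitespace and stray spaces before punctuation:&lt;/p&gt;

```python
import re

# Hedged sketch of the transcript-cleaning step applied before intent
# detection: collapse repeated whitespace, remove stray spaces before
# punctuation, and trim the ends.

def clean_transcript(text):
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # no space before punctuation
    return text.strip()

print(clean_transcript("  create   a file ,  please . "))  # create a file, please.
```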

&lt;p&gt;💡 Key Learnings&lt;/p&gt;

&lt;p&gt;Building end-to-end AI systems requires combining multiple components.&lt;br&gt;
LLM outputs are not always reliable and need validation.&lt;br&gt;
Local models can improve system stability by removing the dependency on external APIs.&lt;br&gt;
Prompt engineering plays a critical role in system performance.&lt;/p&gt;

&lt;p&gt;🎯 Conclusion&lt;/p&gt;

&lt;p&gt;This project demonstrates how voice interfaces can be integrated with AI systems to automate real-world tasks. By combining STT, LLMs, and tool execution, I built a robust and interactive assistant capable of handling multiple tasks efficiently.&lt;/p&gt;

&lt;p&gt;🔗 Links&lt;br&gt;
GitHub Repository: &lt;a href="https://github.com/ktejashwini17/voice-ai-assistant" rel="noopener noreferrer"&gt;https://github.com/ktejashwini17/voice-ai-assistant&lt;/a&gt;&lt;br&gt;
Demo Video: &lt;a href="https://youtu.be/L5VGOnNkPGw" rel="noopener noreferrer"&gt;https://youtu.be/L5VGOnNkPGw&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
