<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nayana Shaji Mekkunnel</title>
    <description>The latest articles on DEV Community by Nayana Shaji Mekkunnel (@nayana_shaji_m).</description>
    <link>https://dev.to/nayana_shaji_m</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3875396%2Febbae61f-fee4-4f9f-b8d9-b5f0432db744.png</url>
      <title>DEV Community: Nayana Shaji Mekkunnel</title>
      <link>https://dev.to/nayana_shaji_m</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nayana_shaji_m"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent Using Whisper and Ollama</title>
      <dc:creator>Nayana Shaji Mekkunnel</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:25:33 +0000</pubDate>
      <link>https://dev.to/nayana_shaji_m/building-a-voice-controlled-local-ai-agent-using-whisper-and-ollama-3mca</link>
      <guid>https://dev.to/nayana_shaji_m/building-a-voice-controlled-local-ai-agent-using-whisper-and-ollama-3mca</guid>
      <description>&lt;h1&gt;
  
  
  Building a Voice-Controlled Local AI Agent Using Whisper and Ollama
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Voice interfaces are becoming an important way to interact with intelligent systems. This project explores how to build a local AI agent that can understand spoken commands, interpret user intent, and execute real actions on a system.&lt;/p&gt;

&lt;p&gt;The goal was to create a complete pipeline that takes audio input, converts it into text, analyzes the intent using a language model, and performs tasks such as file creation, code generation, and text summarization through a web interface.&lt;/p&gt;




&lt;h2&gt;
  
  
  System Overview
&lt;/h2&gt;

&lt;p&gt;The system follows a modular pipeline:&lt;/p&gt;

&lt;p&gt;Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output&lt;/p&gt;

&lt;p&gt;Each component is designed to be independent, making the system easier to debug, optimize, and extend.&lt;/p&gt;
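&lt;p&gt;As a rough sketch (not the article's actual code), the stages above can be wired together so that each one remains a swappable function:&lt;/p&gt;

```python
# Minimal pipeline sketch; transcribe, detect_intent, and execute_tool are
# hypothetical stand-ins for the Whisper, intent, and tool-execution stages.

def run_pipeline(audio_bytes, transcribe, detect_intent, execute_tool):
    """Audio in, structured result out; each stage stays independently testable."""
    text = transcribe(audio_bytes)
    intent = detect_intent(text)
    result = execute_tool(intent, text)
    return {"text": text, "intent": intent, "result": result}
```

&lt;p&gt;Keeping the stages as plain callables is what makes each module easy to debug, optimize, and extend on its own.&lt;/p&gt;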




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Audio Input
&lt;/h3&gt;

&lt;p&gt;The system supports two input modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microphone input using Streamlit’s &lt;code&gt;audio_input&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Audio file upload (&lt;code&gt;.wav&lt;/code&gt;, &lt;code&gt;.mp3&lt;/code&gt;, &lt;code&gt;.m4a&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures flexibility for both real-time interaction and testing.&lt;/p&gt;
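&lt;p&gt;In the app these two widgets would be &lt;code&gt;st.audio_input&lt;/code&gt; and &lt;code&gt;st.file_uploader&lt;/code&gt;; a small helper (hypothetical, not from the article) can normalize whichever one the user actually used:&lt;/p&gt;

```python
def select_audio_source(mic_recording, uploaded_file):
    """Return (audio_bytes, source_label) from whichever widget was used.
    Both st.audio_input and st.file_uploader return None until the user
    records or uploads, and a file-like object afterwards."""
    if mic_recording is not None:
        return mic_recording.read(), "microphone"
    if uploaded_file is not None:
        return uploaded_file.read(), "upload"
    return None, None
```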




&lt;h3&gt;
  
  
  2. Speech-to-Text (STT)
&lt;/h3&gt;

&lt;p&gt;Speech is converted to text using OpenAI’s Whisper model running locally.&lt;/p&gt;

&lt;p&gt;To improve performance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A smaller model (&lt;code&gt;tiny&lt;/code&gt;) was used&lt;/li&gt;
&lt;li&gt;The model was cached to avoid reloading on every request&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduced latency significantly while maintaining acceptable accuracy for command-based inputs.&lt;/p&gt;
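&lt;p&gt;The caching idea can be sketched with a generic wrapper. In the Streamlit app this role would likely be played by &lt;code&gt;st.cache_resource&lt;/code&gt; around &lt;code&gt;whisper.load_model("tiny")&lt;/code&gt;; the stdlib stand-in below shows the same load-once behavior:&lt;/p&gt;

```python
import functools

def cached_loader(load_fn):
    """Create a getter that loads a model once per name and reuses it.
    In the Streamlit app, st.cache_resource plays the same role for
    whisper.load_model, avoiding a reload on every request."""
    @functools.lru_cache(maxsize=None)
    def get(name="tiny"):
        return load_fn(name)
    return get
```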




&lt;h3&gt;
  
  
  3. Intent Detection
&lt;/h3&gt;

&lt;p&gt;Intent detection was implemented using a hybrid approach:&lt;/p&gt;

&lt;h4&gt;
  
  
  Rule-Based Classification
&lt;/h4&gt;

&lt;p&gt;Common patterns such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“write code”&lt;/li&gt;
&lt;li&gt;“create file”&lt;/li&gt;
&lt;li&gt;“summarize”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are handled instantly using keyword matching. This avoids unnecessary LLM calls and improves speed.&lt;/p&gt;

&lt;h4&gt;
  
  
  LLM Fallback (Ollama)
&lt;/h4&gt;

&lt;p&gt;For ambiguous inputs, a local LLM (via Ollama) is used to classify intent and extract structured data.&lt;/p&gt;

&lt;p&gt;This combination provides both speed and flexibility.&lt;/p&gt;
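&lt;p&gt;A minimal sketch of the hybrid approach (the keyword table and intent names are illustrative, and &lt;code&gt;llm_classify&lt;/code&gt; stands in for a call to the local model via Ollama):&lt;/p&gt;

```python
# Hypothetical hybrid classifier: keyword rules first, LLM fallback second.
RULES = {
    "write code": "write_code",
    "create file": "create_file",
    "summarize": "summarize",
}

def detect_intent(text, llm_classify=None):
    lowered = text.lower()
    for phrase, intent in RULES.items():
        if phrase in lowered:
            return intent  # fast path, no LLM call needed
    if llm_classify is not None:
        return llm_classify(text)  # ambiguous input: ask the local LLM
    return "unknown"
```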




&lt;h3&gt;
  
  
  4. Filename Extraction
&lt;/h3&gt;

&lt;p&gt;Instead of relying on the LLM, filenames are extracted using regex directly from the transcribed text.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input: “write code in hello.py”&lt;/li&gt;
&lt;li&gt;Extracted: &lt;code&gt;hello.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach avoids inconsistencies and ensures reliable file handling.&lt;/p&gt;
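&lt;p&gt;A regex along these lines would do the extraction (the exact pattern, extension list, and default filename are assumptions, not the article's code):&lt;/p&gt;

```python
import re

# Hypothetical pattern: a word-like name followed by a known extension.
FILENAME_RE = re.compile(r"\b([\w-]+\.(?:py|txt|md|json))\b", re.IGNORECASE)

def extract_filename(text, default="output.py"):
    """Pull the first filename out of the transcript, or fall back to a default."""
    match = FILENAME_RE.search(text)
    return match.group(1) if match else default
```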




&lt;h3&gt;
  
  
  5. Tool Execution
&lt;/h3&gt;

&lt;p&gt;Based on the detected intent, specific actions are triggered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Create File:&lt;/strong&gt; Creates a new file inside a restricted directory
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write Code:&lt;/strong&gt; Generates code using the LLM and writes it to a file
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summarize:&lt;/strong&gt; Returns a shortened version of the input text
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are restricted to an &lt;code&gt;output/&lt;/code&gt; folder to prevent unintended system modifications.&lt;/p&gt;
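&lt;p&gt;One way to enforce the &lt;code&gt;output/&lt;/code&gt; restriction (a sketch; the article does not show its implementation) is to resolve every path and reject anything that escapes the folder:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_path(filename):
    """Resolve filename inside output/ and reject path traversal."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    candidate = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR.resolve() not in candidate.parents:
        raise ValueError("path escapes the output directory")
    return candidate
```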




&lt;h3&gt;
  
  
  6. Code Generation
&lt;/h3&gt;

&lt;p&gt;Code generation is handled using a local LLM (LLaMA via Ollama).&lt;/p&gt;

&lt;p&gt;To ensure clean output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompts explicitly restrict responses to Python code only
&lt;/li&gt;
&lt;li&gt;Post-processing removes markdown, non-ASCII characters, and unwanted prefixes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures that generated code can be written directly to files without errors.&lt;/p&gt;
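&lt;p&gt;The cleanup step might look like this (illustrative rules; the article's exact post-processing is not shown):&lt;/p&gt;

```python
import re

def clean_generated_code(raw):
    """Strip markdown fences, a chatty first line, and non-ASCII characters
    from LLM output so it can be written straight to a .py file."""
    fence = chr(96) * 3  # the backtick fence marker, built indirectly
    text = raw.replace(fence + "python", "").replace(fence, "")
    text = re.sub(r"^(?:Here is|Sure).*\n", "", text.lstrip())
    text = text.encode("ascii", "ignore").decode("ascii")
    return text.strip()
```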




&lt;h3&gt;
  
  
  7. User Interface
&lt;/h3&gt;

&lt;p&gt;The UI is built using Streamlit and displays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transcribed text
&lt;/li&gt;
&lt;li&gt;Detected intent
&lt;/li&gt;
&lt;li&gt;Generated code
&lt;/li&gt;
&lt;li&gt;File content
&lt;/li&gt;
&lt;li&gt;Action result
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provides full transparency into each stage of the pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model Latency
&lt;/h3&gt;

&lt;p&gt;Running Whisper and the LLM locally introduced noticeable delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switched to smaller models
&lt;/li&gt;
&lt;li&gt;Cached model loading
&lt;/li&gt;
&lt;li&gt;Reduced unnecessary LLM calls using rule-based detection
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Incorrect Intent Classification
&lt;/h3&gt;

&lt;p&gt;The LLM sometimes misclassified inputs (e.g., treating code generation as file creation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Added strict prompting rules
&lt;/li&gt;
&lt;li&gt;Introduced rule-based overrides for critical keywords
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Filename Extraction Issues
&lt;/h3&gt;

&lt;p&gt;Initially, filenames were not reliably extracted, leading to incorrect file operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implemented regex-based extraction
&lt;/li&gt;
&lt;li&gt;Added fallback defaults
&lt;/li&gt;
&lt;li&gt;Handled common speech-to-text variations
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. File Overwrite Logic
&lt;/h3&gt;

&lt;p&gt;The system initially failed to write code into existing files because of premature returns in the control flow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensured write operations always execute
&lt;/li&gt;
&lt;li&gt;Separated file existence checks from write logic
&lt;/li&gt;
&lt;/ul&gt;
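
&lt;p&gt;The fixed write path can be sketched like this (hypothetical code matching the behaviour described above): the existence check is kept only for reporting, and the write itself always runs:&lt;/p&gt;

```python
from pathlib import Path

def write_code_file(path, code):
    """Write code to path unconditionally; the existence check is separate
    and only determines what the function reports back."""
    target = Path(path)
    existed = target.exists()  # check first, for reporting only
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(code)    # write always executes
    return "overwritten" if existed else "created"
```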




&lt;h3&gt;
  
  
  5. Noisy LLM Output
&lt;/h3&gt;

&lt;p&gt;Generated code sometimes contained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown formatting
&lt;/li&gt;
&lt;li&gt;Extra text
&lt;/li&gt;
&lt;li&gt;Non-ASCII characters
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cleaned output using regex
&lt;/li&gt;
&lt;li&gt;Enforced strict prompt constraints
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Performance Optimizations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Used Whisper &lt;code&gt;tiny&lt;/code&gt; model for faster transcription
&lt;/li&gt;
&lt;li&gt;Cached models to avoid repeated loading
&lt;/li&gt;
&lt;li&gt;Implemented rule-based intent detection
&lt;/li&gt;
&lt;li&gt;Reduced LLM calls to only necessary cases
&lt;/li&gt;
&lt;li&gt;Used a lighter model (&lt;code&gt;mistral&lt;/code&gt;) for intent classification
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Speech recognition may introduce minor transcription errors
&lt;/li&gt;
&lt;li&gt;Local models require sufficient system resources
&lt;/li&gt;
&lt;li&gt;Summarization is currently a simple placeholder
&lt;/li&gt;
&lt;li&gt;No support for multi-step or compound commands
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Support compound commands (e.g., summarize and save)
&lt;/li&gt;
&lt;li&gt;Add confirmation before file operations
&lt;/li&gt;
&lt;li&gt;Replace the placeholder summarization with LLM-based summarization
&lt;/li&gt;
&lt;li&gt;Maintain session memory and conversation history
&lt;/li&gt;
&lt;li&gt;Improve UI responsiveness and feedback
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how a complete voice-controlled AI agent can be built using local models and simple tools. By combining speech recognition, intent classification, and automated execution, it is possible to create systems that bridge natural language interaction with real-world actions.&lt;/p&gt;

&lt;p&gt;The key takeaway is that combining rule-based logic with LLM capabilities leads to systems that are both efficient and reliable.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
