<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shagun Dubey</title>
    <description>The latest articles on DEV Community by Shagun Dubey (@shagun_dubey_ff9bd2c342ed).</description>
    <link>https://dev.to/shagun_dubey_ff9bd2c342ed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874450%2Fc413fbb1-cfb3-4c15-8eab-245b742c6b30.png</url>
      <title>DEV Community: Shagun Dubey</title>
      <link>https://dev.to/shagun_dubey_ff9bd2c342ed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shagun_dubey_ff9bd2c342ed"/>
    <language>en</language>
    <item>
      <title>Creating an Offline AI Voice Agent Using Whisper and Ollama</title>
      <dc:creator>Shagun Dubey</dc:creator>
      <pubDate>Sun, 12 Apr 2026 06:17:03 +0000</pubDate>
      <link>https://dev.to/shagun_dubey_ff9bd2c342ed/creating-an-offline-ai-voice-agent-using-whisper-and-ollama-294i</link>
      <guid>https://dev.to/shagun_dubey_ff9bd2c342ed/creating-an-offline-ai-voice-agent-using-whisper-and-ollama-294i</guid>
      <description>&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Voice-based artificial intelligence (AI) systems have become ubiquitous in modern devices. In this project, I built an AI Voice Agent that listens to a user's speech, detects their intent, and executes actions such as writing code, summarizing text, and creating files.&lt;/p&gt;

&lt;p&gt;A primary goal was for the agent to run entirely offline, without relying on paid APIs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The architecture consists of five stages:&lt;/p&gt;

&lt;p&gt;Voice Input → Speech-to-Text → Intent Detection → Action Execution → Output&lt;/p&gt;
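&lt;p&gt;The five stages above can be sketched as a pipeline of functions. This is a minimal outline only; the stage functions here are hypothetical stubs standing in for the real implementations described below.&lt;/p&gt;

```python
# Minimal sketch of the five-stage pipeline; each stage is a stub.

def speech_to_text(audio_bytes: bytes) -> str:
    # In the real system this calls Whisper; stubbed here.
    return "create a file"

def detect_intents(text: str) -> list[str]:
    # In the real system this is rule-based matching; stubbed here.
    return ["create_file"] if "create a file" in text else ["chat"]

def execute(intents: list[str], text: str) -> str:
    # In the real system this dispatches to action handlers; stubbed here.
    return ", ".join(f"executed {i}" for i in intents)

def run_pipeline(audio_bytes: bytes) -> str:
    text = speech_to_text(audio_bytes)   # Speech-to-Text
    intents = detect_intents(text)       # Intent Detection
    return execute(intents, text)        # Action Execution -> Output

print(run_pipeline(b"\x00\x01"))  # -> executed create_file
```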

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Voice Input&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Voice input is captured either from an existing recording or via a file upload through a web interface built with the Streamlit framework.&lt;/p&gt;
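&lt;p&gt;A minimal sketch of the upload path: Streamlit's file uploader yields raw bytes, which can be written to a temporary file for the downstream stages. The helper name here is mine, not from the project.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def save_uploaded_audio(data: bytes, suffix: str = ".wav") -> str:
    """Persist uploaded audio bytes to a temp file and return its path.

    In the Streamlit app, `data` would typically come from
    st.file_uploader("Upload audio", type=["wav", "mp3"]).read().
    """
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as f:
        f.write(data)
        return f.name

path = save_uploaded_audio(b"RIFF....WAVE")
print(Path(path).suffix)  # -> .wav
```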

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Speech-to-Text (STT)&lt;/em&gt;&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The input audio is transcribed into text by the Whisper speech-recognition model. FFmpeg is used for audio format conversion.&lt;/p&gt;
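&lt;p&gt;Transcription with the openai-whisper package looks roughly like this. The model size is an assumption, and the extension check is a hypothetical helper; the whisper import is deferred so the pure logic stands alone.&lt;/p&gt;

```python
SUPPORTED = {".wav", ".mp3", ".m4a", ".flac", ".ogg"}

def is_supported_audio(filename: str) -> bool:
    # Hypothetical helper: quick extension check before transcribing.
    return any(filename.lower().endswith(ext) for ext in SUPPORTED)

def transcribe(path: str) -> str:
    # Deferred import so the helper above works without whisper installed.
    import whisper  # pip install openai-whisper; needs ffmpeg on PATH
    model = whisper.load_model("base")  # "base" is an assumed model size
    result = model.transcribe(path)
    return result["text"].strip()
```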

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Intent Detection&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Intent detection is rule-based: the transcript is matched against keyword rules. A single command can carry several intents at once (e.g., “Create a file and generate some code”).&lt;/p&gt;
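&lt;p&gt;A rule-based, multi-intent detector can be as simple as keyword matching over the transcript. The keyword lists below are illustrative, not the project's actual rules.&lt;/p&gt;

```python
# Illustrative keyword rules; the real rule set is project-specific.
INTENT_RULES = {
    "create_file": ("create a file", "make a file", "new file"),
    "generate_code": ("generate", "write code", "code for"),
    "summarize": ("summarize", "summary"),
    "chat": (),  # fallback, matched by default
}

def detect_intents(text: str) -> list[str]:
    t = text.lower()
    hits = [intent for intent, keys in INTENT_RULES.items()
            if any(k in t for k in keys)]
    return hits or ["chat"]  # default to chat when no rule fires

print(detect_intents("Create a file and generate some code"))
# -> ['create_file', 'generate_code']
```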

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Action Execution&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Based on the detected intents, the system executes actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File creation&lt;/li&gt;
&lt;li&gt;Code generation in Python&lt;/li&gt;
&lt;li&gt;Text summarization and explanation&lt;/li&gt;
&lt;li&gt;General chat&lt;/li&gt;
&lt;/ul&gt;
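&lt;p&gt;Each detected intent can then be dispatched to a handler function. A minimal sketch with stub handlers (the names are mine, not the project's):&lt;/p&gt;

```python
from typing import Callable

def create_file(text: str) -> str:
    return "file created"    # stub: the real handler writes to disk

def generate_code(text: str) -> str:
    return "code generated"  # stub: the real handler calls the local LLM

def chat(text: str) -> str:
    return "chat reply"      # stub fallback

HANDLERS: dict[str, Callable[[str], str]] = {
    "create_file": create_file,
    "generate_code": generate_code,
    "chat": chat,
}

def execute_all(intents: list[str], text: str) -> list[str]:
    # Run every detected intent's handler in order; unknown intents
    # fall back to chat.
    return [HANDLERS.get(i, chat)(text) for i in intents]

print(execute_all(["create_file", "generate_code"], "..."))
# -> ['file created', 'code generated']
```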

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Local LLM Integration&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To execute these tasks, the system uses Ollama to run the llama3 large language model locally, so no external API is required.&lt;/p&gt;
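&lt;p&gt;Ollama exposes a local HTTP API, by default on port 11434. A sketch of a non-streaming call using only the standard library; the payload builder is separated so it can be checked without a running server.&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "llama3") -> dict:
    # Non-streaming request body for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def ask_llm(prompt: str) -> str:
    body = json.dumps(build_payload(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires `ollama serve`
        return json.loads(resp.read())["response"]
```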

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Output Layer&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Output is presented to the user in the UI and, when needed, saved to files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technologies Used&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python - Programming language&lt;/li&gt;
&lt;li&gt;Streamlit - Web UI&lt;/li&gt;
&lt;li&gt;Whisper - Speech-to-text conversion&lt;/li&gt;
&lt;li&gt;Ollama (llama3) - Local LLM for code generation and explanations&lt;/li&gt;
&lt;li&gt;FFmpeg - Audio preprocessing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Functionalities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Voice commands (recording or uploading files)&lt;/li&gt;
&lt;li&gt;Multiple-intent detection &amp;amp; execution&lt;/li&gt;
&lt;li&gt;Offline AI functionality via a local LLM&lt;/li&gt;
&lt;li&gt;Code generation&lt;/li&gt;
&lt;li&gt;Confirmation step before file operations&lt;/li&gt;
&lt;li&gt;Session history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What Makes a Local LLM Better?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of relying on third-party services to run models, this project uses Ollama. This approach has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No API fee&lt;/li&gt;
&lt;li&gt;Works offline&lt;/li&gt;
&lt;li&gt;Enhanced data security&lt;/li&gt;
&lt;li&gt;Better control&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges Encountered&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audio format issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early transcription attempts failed because recordings were not in a format Whisper could process; audio had to be converted to a supported format before transcription.&lt;/p&gt;
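&lt;p&gt;One fix is to normalize every recording with FFmpeg before transcription, for example to 16 kHz mono WAV. These parameters are a common choice for Whisper input, not necessarily the project's exact settings.&lt;/p&gt;

```python
import subprocess

def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    # Build the conversion command: 16 kHz sample rate, mono, overwrite.
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def convert(src: str, dst: str) -> None:
    subprocess.run(ffmpeg_cmd(src, dst), check=True)  # needs ffmpeg on PATH
```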

&lt;ul&gt;
&lt;li&gt;Incompatible dependency versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several dependencies, such as NumPy and pandas, had version conflicts that took effort to resolve.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uploading files via Git&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pushing to GitHub initially failed because large files from the virtual environment were being tracked; configuring .gitignore correctly solved this.&lt;/p&gt;
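&lt;p&gt;A typical .gitignore for a project like this excludes the virtual environment and other generated files. This is a sketch; the exact entries depend on the repository layout.&lt;/p&gt;

```gitignore
# Virtual environments (large binaries that should never be committed)
venv/
.venv/

# Python bytecode caches
__pycache__/
*.pyc

# Local secrets and configuration
.env
```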

&lt;ul&gt;
&lt;li&gt;Connecting a local LLM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moving to a local LLM meant setting up Ollama and communicating with it over its local HTTP API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learning Points&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The importance of properly managing Python environments&lt;/li&gt;
&lt;li&gt;Building real-world audio pipelines&lt;/li&gt;
&lt;li&gt;Integrating local AI models into applications&lt;/li&gt;
&lt;li&gt;Designing modular, scalable AI systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This project shows how to design and implement an end-to-end AI solution that runs entirely locally, with no paid APIs. By combining speech recognition, intent detection, and a local LLM, the system handles a practical range of tasks effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Repository&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitHub project repository: &lt;a href="https://github.com/shagundubey48-cmd/ai_voice_agent" rel="noopener noreferrer"&gt;ai_voice_agent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
