<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Donthi Nishitha</title>
    <description>The latest articles on DEV Community by Donthi Nishitha (@donthi_nishitha_41).</description>
    <link>https://dev.to/donthi_nishitha_41</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3876400%2Fe1fb4e30-8a89-40a4-a7c9-9e7925f5687d.jpg</url>
      <title>DEV Community: Donthi Nishitha</title>
      <link>https://dev.to/donthi_nishitha_41</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/donthi_nishitha_41"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Whisper, LLaMA 3 and Streamlit</title>
      <dc:creator>Donthi Nishitha</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:14:53 +0000</pubDate>
      <link>https://dev.to/donthi_nishitha_41/building-a-voice-controlled-local-ai-agent-with-whisper-llama-3-and-streamlit-1hb8</link>
      <guid>https://dev.to/donthi_nishitha_41/building-a-voice-controlled-local-ai-agent-with-whisper-llama-3-and-streamlit-1hb8</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;EchoMemo&lt;/strong&gt; is a voice-controlled local AI agent that runs entirely on your machine. The user gives a command, either through the microphone or by uploading an audio file, and the system automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converts speech to text&lt;/li&gt;
&lt;li&gt;Understands what the user wants to do&lt;/li&gt;
&lt;li&gt;Classifies the intent and executes the matching action&lt;/li&gt;
&lt;li&gt;Shows every step of the pipeline in a clean web UI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No cloud, no API keys, and no internet connection required after setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  What difference does this make?
&lt;/h3&gt;

&lt;p&gt;Unlike most AI tools today, local AI addresses privacy, cost, dependency, and latency all at once. With tools like Whisper for speech recognition and Ollama for running a local LLM, you can run capable AI models directly on your own hardware. Your data never leaves your computer, there are no usage fees, and the system works completely offline.&lt;/p&gt;

&lt;p&gt;This project is a practical demonstration of that idea — a fully functional voice agent built entirely from local, open-source models.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input (mic / file upload)
        │
        ▼
  Speech-to-Text          ← OpenAI Whisper "base" model (fully local)
  [models/stt.py]
        │
        ▼
  Intent Classifier       ← LLaMA 3 via Ollama (fully local)
  [llm/intent_classifier.py]
        │
        ├── create_file    → tools/file_tools.py   → output/
        ├── write_code     → tools/code_tools.py   → output/
        ├── summarize_text → tools/text_tools.py
        └── general_chat   → direct Ollama LLaMA 3 chat
        │
        ▼
  Streamlit UI            ← displays transcription, intent, action, result
  [app.py]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
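&lt;p&gt;The flow above can be condensed into a small driver sketch (hypothetical function names standing in for the real modules in &lt;code&gt;models/&lt;/code&gt;, &lt;code&gt;llm/&lt;/code&gt;, and &lt;code&gt;tools/&lt;/code&gt;; not the repo's exact code):&lt;/p&gt;

```python
# Minimal sketch of the EchoMemo pipeline driver (hypothetical names).
def run_pipeline(audio_path, transcribe, classify_intent, tools, chat):
    """Transcribe audio, classify the intent, and dispatch to a tool."""
    text = transcribe(audio_path)       # models/stt.py (Whisper, local)
    intent = classify_intent(text)      # llm/intent_classifier.py (LLaMA 3)
    handler = tools.get(intent, chat)   # unknown intent falls back to chat
    result = handler(text)
    return {"transcription": text, "intent": intent, "result": result}
```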






&lt;h2&gt;
  
  
  Models I Chose
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Whisper base&lt;/strong&gt; — free, local, accurate, no API key needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLaMA 3 via Ollama&lt;/strong&gt; — fully local LLM, easy to set up and run&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How Each Component Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;models/stt.py&lt;/code&gt;&lt;/strong&gt; — Converts audio to a format Whisper understands using pydub + ffmpeg, runs it through the Whisper base model locally, and returns plain text. Whisper handles accents and multiple audio formats without any API call. Supported inputs: microphone recording, plus WAV, MP3, M4A, OGG, and WEBM files.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;llm/intent_classifier.py&lt;/code&gt;&lt;/strong&gt; — Takes the transcribed text and sends it to LLaMA 3 running locally via Ollama with a strict prompt. The LLM returns one of four intents (&lt;code&gt;create_file&lt;/code&gt;, &lt;code&gt;write_code&lt;/code&gt;, &lt;code&gt;summarize_text&lt;/code&gt;, &lt;code&gt;general_chat&lt;/code&gt;). A normalization function cleans up any freeform LLM reply into the exact intent label.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;File Tool&lt;/strong&gt; — Extracts a filename from the transcription using regex and creates a &lt;code&gt;.txt&lt;/code&gt; file in &lt;code&gt;output/&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Code Tool&lt;/strong&gt; — Sends the request to LLaMA 3, strips markdown fences from the response, and saves a clean &lt;code&gt;.py&lt;/code&gt; file to &lt;code&gt;output/&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text Tool&lt;/strong&gt; — Sends the transcription to LLaMA 3 and returns a 5-bullet summary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streamlit UI&lt;/strong&gt; — Handles mic + file upload, runs the full pipeline, and displays each step's result cleanly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
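&lt;p&gt;As an illustration of the Code Tool's cleanup step, a helper along these lines (a sketch, not the repo's exact code) strips markdown fences from an LLM reply before the result is saved as a &lt;code&gt;.py&lt;/code&gt; file:&lt;/p&gt;

```python
import re

def strip_markdown_fences(reply: str) -> str:
    """Remove the triple-backtick fences an LLM often wraps around code."""
    # `{3} matches the three backticks; the language tag after them is optional.
    match = re.search(r"`{3}[\w+-]*\n(.*?)`{3}", reply, re.DOTALL)
    if match:
        return match.group(1).strip()
    return reply.strip()  # no fences found: return the reply as-is
```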




&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. ffmpeg not found on Windows&lt;/strong&gt;&lt;br&gt;
winget installs ffmpeg to a very long user-specific path instead of a standard location. pydub couldn't find it automatically. Fixed by hardcoding the exact path using &lt;code&gt;AudioSegment.converter&lt;/code&gt; and &lt;code&gt;AudioSegment.ffprobe&lt;/code&gt; directly in &lt;code&gt;stt.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. pydub needing an explicit path in code&lt;/strong&gt;&lt;br&gt;
Even after ffmpeg was installed and working in CMD, the Python/Streamlit process couldn't see it on PATH. Fixed by explicitly prepending the ffmpeg directory to &lt;code&gt;os.environ["PATH"]&lt;/code&gt; inside the code so pydub can find ffmpeg.&lt;/p&gt;
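&lt;p&gt;Both ffmpeg fixes together look roughly like this (the install directory below is a hypothetical placeholder; the real winget path is much longer and user-specific):&lt;/p&gt;

```python
import os

# Hypothetical ffmpeg install directory; replace with your actual winget path.
FFMPEG_DIR = r"C:\ffmpeg\bin"

# Fix 2: make ffmpeg visible on PATH for this Python/Streamlit process.
os.environ["PATH"] = FFMPEG_DIR + os.pathsep + os.environ.get("PATH", "")

# Fix 1: point pydub directly at the binaries (requires pydub installed).
try:
    from pydub import AudioSegment
    AudioSegment.converter = os.path.join(FFMPEG_DIR, "ffmpeg.exe")
    AudioSegment.ffprobe = os.path.join(FFMPEG_DIR, "ffprobe.exe")
except ImportError:
    pass  # pydub not installed in this environment
```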

&lt;p&gt;&lt;strong&gt;3. LLM returning freeform text instead of clean intent labels&lt;/strong&gt;&lt;br&gt;
LLaMA 3 would sometimes reply with things like &lt;em&gt;"The intent is write_code"&lt;/em&gt; or &lt;em&gt;"2. create_file"&lt;/em&gt; instead of just the label. This caused all intent matching to silently fail. Fixed by writing a &lt;code&gt;_normalize_intent()&lt;/code&gt; function that scans the raw reply and maps it to the correct label using keyword matching as fallback.&lt;/p&gt;
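&lt;p&gt;A minimal version of that normalization idea (a sketch; the repo's &lt;code&gt;_normalize_intent()&lt;/code&gt; may differ) first searches the reply for an exact label, then falls back to keyword matching:&lt;/p&gt;

```python
# Sketch of intent normalization with a keyword fallback (hypothetical names).
VALID_INTENTS = ["create_file", "write_code", "summarize_text", "general_chat"]

KEYWORD_FALLBACK = {
    "file": "create_file",
    "code": "write_code",
    "summar": "summarize_text",
}

def normalize_intent(raw_reply: str) -> str:
    """Map a freeform LLM reply like 'The intent is write_code' to a label."""
    reply = raw_reply.lower()
    # First pass: the exact label may appear anywhere in the reply.
    for intent in VALID_INTENTS:
        if intent in reply:
            return intent
    # Fallback: keyword matching when no label appears verbatim.
    for keyword, intent in KEYWORD_FALLBACK.items():
        if keyword in reply:
            return intent
    return "general_chat"  # safe default when nothing matches
```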

&lt;p&gt;&lt;strong&gt;4. Windows file locking (WinError 32)&lt;/strong&gt;&lt;br&gt;
Windows holds a lock on temp files longer than Linux/macOS, so &lt;code&gt;os.remove()&lt;/code&gt; was throwing a &lt;code&gt;PermissionError&lt;/code&gt;. Fixed by wrapping the cleanup in a &lt;code&gt;try/except PermissionError&lt;/code&gt; block and leaving the file to be cleaned up later once Windows releases the lock.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Enhancements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Improve intent detection using few-shot prompting for higher accuracy&lt;/li&gt;
&lt;li&gt;Add support for more intents like &lt;strong&gt;delete file&lt;/strong&gt;, &lt;strong&gt;rename file&lt;/strong&gt;, and &lt;strong&gt;web search&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Build a chat history panel so previous commands are visible in the UI&lt;/li&gt;
&lt;li&gt;Add a visual waveform display for recorded microphone audio&lt;/li&gt;
&lt;li&gt;Support multilingual voice commands using Whisper's built-in language detection&lt;/li&gt;
&lt;li&gt;Package the app as a desktop executable so no terminal setup is needed&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;GitHub Repo:&lt;/strong&gt; 
&lt;a href="https://github.com/donthi-nishitha-4/EchoMemo-Voice-Controlled-Local-AI-Agent" rel="noopener noreferrer"&gt;EchoMemo - Voice controlled Local AI Agent&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎬 &lt;strong&gt;YouTube Demo:&lt;/strong&gt; &lt;a href="https://youtu.be/2ZH3qdoM1FQ?si=n_eKC9tKTa68VlXv" rel="noopener noreferrer"&gt;EchoMemo - Voice Controlled Local AI Agent&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I developed this project as part of an assignment. It gave me solid, hands-on knowledge of tools, pipelining, agents, and debugging. I used ChatGPT and Claude as AI assistants for code baselining, debugging Windows-specific issues, and architectural guidance. All understanding, testing, and final decisions were my own.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Thank you for reading!&lt;/em&gt;&lt;br&gt;
&lt;em&gt;— D. Nishitha&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
