<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rishitha Rao</title>
    <description>The latest articles on DEV Community by Rishitha Rao (@rishitha_rao_7941702a3f90).</description>
    <link>https://dev.to/rishitha_rao_7941702a3f90</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3877000%2F22f520f8-1db7-403a-aea9-4961d4eddf86.png</url>
      <title>DEV Community: Rishitha Rao</title>
      <link>https://dev.to/rishitha_rao_7941702a3f90</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rishitha_rao_7941702a3f90"/>
    <language>en</language>
    <item>
      <title>Voice-Controlled Local AI Agent — How I Built It</title>
      <dc:creator>Rishitha Rao</dc:creator>
      <pubDate>Mon, 13 Apr 2026 15:58:26 +0000</pubDate>
      <link>https://dev.to/rishitha_rao_7941702a3f90/voice-controlled-local-ai-agent-how-i-built-it-2dga</link>
      <guid>https://dev.to/rishitha_rao_7941702a3f90/voice-controlled-local-ai-agent-how-i-built-it-2dga</guid>
      <description>&lt;p&gt;Introduction&lt;/p&gt;

&lt;p&gt;In this project, I built a voice-controlled AI agent that accepts audio input, understands user intent, and executes actions locally on my machine. The agent supports creating files, writing code, summarizing text, and general chat — all triggered by voice commands.&lt;/p&gt;

&lt;p&gt;Architecture&lt;/p&gt;

&lt;p&gt;The system follows a five-stage pipeline:&lt;/p&gt;

&lt;p&gt;Audio Input → Speech-to-Text → Intent Classification → Tool Execution → UI Display&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Audio Input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The app supports two methods — microphone recording using sounddevice and file upload (wav, mp3, m4a).&lt;/p&gt;
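&lt;p&gt;A minimal sketch of the two input paths. sounddevice and scipy are the libraries named in the tech stack; the helper names, the 16 kHz sample rate, and the mono int16 format are my assumptions, not necessarily the project's exact settings.&lt;/p&gt;

```python
import io
import pathlib

# Extensions the post says the upload path accepts.
SUPPORTED_UPLOADS = {".wav", ".mp3", ".m4a"}

def is_supported_upload(filename: str) -> bool:
    """Check an uploaded file against the allowed audio extensions."""
    return pathlib.Path(filename).suffix.lower() in SUPPORTED_UPLOADS

def record_microphone(seconds: int = 5, sample_rate: int = 16000) -> bytes:
    """Record from the default microphone and return WAV bytes."""
    import sounddevice as sd      # third-party, imported lazily
    from scipy.io import wavfile  # the post uses scipy for audio I/O
    frames = sd.rec(int(seconds * sample_rate), samplerate=sample_rate,
                    channels=1, dtype="int16")
    sd.wait()                     # block until the recording finishes
    buf = io.BytesIO()
    wavfile.write(buf, sample_rate, frames)
    return buf.getvalue()
```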

&lt;ol start="2"&gt;
&lt;li&gt;Speech-to-Text&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I used Groq's Whisper Large V3 API for transcription. Initially I tried running Whisper locally, but it gave inaccurate results on my CPU-only 8GB RAM machine, especially with an Indian English accent. Groq's API solved this completely — it is free, fast (2-3 seconds), and highly accurate.&lt;/p&gt;
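&lt;p&gt;The transcription call can be sketched as below, assuming the groq Python SDK's OpenAI-style audio.transcriptions.create endpoint. The function name and the injectable client (handy for testing without an API key) are my additions.&lt;/p&gt;

```python
def transcribe(audio_path, client=None, model="whisper-large-v3"):
    """Transcribe an audio file via Groq's hosted Whisper.
    A client can be injected so the function is testable offline."""
    if client is None:
        from groq import Groq  # third-party, imported lazily
        client = Groq()        # reads GROQ_API_KEY from the environment
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(audio_path, f.read()), model=model)
    return result.text.strip()
```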

&lt;ol start="3"&gt;
&lt;li&gt;Intent Classification&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I used Ollama running llama3.2:1b locally for intent detection. The LLM classifies the transcribed text into one of four intents — create file, write code, summarize, or general chat. It also supports compound commands where multiple intents are detected in a single voice input.&lt;/p&gt;
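&lt;p&gt;A sketch of the classification step, assuming the ollama Python package and snake_case intent labels; the prompt wording and the tolerant parser are mine, not the project's exact code.&lt;/p&gt;

```python
import json

ALLOWED_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intents(raw: str) -> list:
    """Parse the model's JSON reply, keeping only known intents.
    Falls back to general_chat if the reply is not valid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["general_chat"]
    intents = data.get("intents", []) if isinstance(data, dict) else data
    valid = [i for i in intents if i in ALLOWED_INTENTS]
    return valid or ["general_chat"]

def classify(text: str) -> list:
    """Ask a local llama3.2:1b via Ollama to label the transcript."""
    import ollama  # third-party, imported lazily
    system = ('Classify the user request into intents from this set: '
              'create_file, write_code, summarize, general_chat. '
              'Reply with JSON like {"intents": ["write_code"]}.')
    resp = ollama.chat(model="llama3.2:1b", format="json",
                       messages=[{"role": "system", "content": system},
                                 {"role": "user", "content": text}])
    return parse_intents(resp["message"]["content"])
```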

&lt;ol start="4"&gt;
&lt;li&gt;Tool Execution&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on detected intents, the system executes the appropriate action — generating code, creating files, summarizing text, or responding to chat. All files are safely saved inside an output folder.&lt;/p&gt;
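&lt;p&gt;The "safely saved inside an output folder" step can be enforced with a small path check. This is a sketch with names of my choosing, not the project's code.&lt;/p&gt;

```python
import pathlib

OUTPUT_DIR = pathlib.Path("output")

def safe_output_path(filename: str) -> pathlib.Path:
    """Resolve a requested filename inside the output folder, rejecting
    anything (like '../x' or an absolute path) that would escape it."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = (OUTPUT_DIR / filename).resolve()
    if not target.is_relative_to(OUTPUT_DIR.resolve()):
        raise ValueError("refusing to write outside the output folder")
    return target
```

&lt;p&gt;Path.is_relative_to needs Python 3.9+, which the Python 3.11 stack listed below satisfies.&lt;/p&gt;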

&lt;ol start="5"&gt;
&lt;li&gt;UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Built with Streamlit, the interface clearly displays all four pipeline steps — transcription, detected intent, action taken, and final result.&lt;/p&gt;
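&lt;p&gt;Taken together, the stages above can be sketched as one function whose return value is exactly what the UI renders. The stage functions here are hypothetical placeholders for the real implementations.&lt;/p&gt;

```python
# A minimal sketch of the pipeline: each stage consumes the previous
# stage's output, and the returned dict is what the UI displays.
def run_pipeline(audio_path, transcribe, classify_intents, execute):
    """Run audio through STT, intent detection, and tool execution."""
    text = transcribe(audio_path)                  # Speech-to-Text
    intents = classify_intents(text)               # Intent Classification
    results = [execute(i, text) for i in intents]  # Tool Execution
    return {"transcription": text, "intents": intents, "results": results}
```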

&lt;p&gt;Models Chosen&lt;/p&gt;

&lt;p&gt;For Speech-to-Text I used Groq Whisper Large V3 because it is accurate, free, and requires no local compute. For Intent Classification and Code Generation I used Ollama llama3.2:1b because it runs comfortably on 8GB of RAM locally.&lt;/p&gt;

&lt;p&gt;Bonus Features&lt;/p&gt;

&lt;p&gt;I implemented all four bonus features from the requirements. Compound Commands allow multiple intents in one voice input. Human-in-the-Loop adds a confirmation prompt before any file operation. Graceful Error Handling shows all errors clearly in the UI. Session Memory tracks the full history of actions in the sidebar.&lt;/p&gt;

&lt;p&gt;Challenges Faced&lt;/p&gt;

&lt;p&gt;Challenge 1 — Whisper accuracy on CPU&lt;br&gt;
Local Whisper-base gave wrong transcriptions for an Indian English accent. Switching to the Groq Whisper API solved it completely.&lt;/p&gt;

&lt;p&gt;Challenge 2 — RAM limitations&lt;br&gt;
With 8GB of RAM, running Whisper and Ollama together caused memory issues. I solved this by using the Groq API for STT, which uses no local RAM, and by switching from the 3b model to llama3.2:1b, which needs only about 1.3GB.&lt;/p&gt;

&lt;p&gt;Challenge 3 — Streamlit rerun issue&lt;br&gt;
The Human-in-the-Loop confirmation was clearing all variables on rerun. I fixed this by storing the entire pipeline state in st.session_state before showing the confirmation dialog.&lt;/p&gt;

&lt;p&gt;Challenge 4 — Compound command execution&lt;br&gt;
Getting the LLM to return structured JSON for multiple intents required careful prompt engineering with clear examples in the system prompt.&lt;/p&gt;

&lt;p&gt;Tech Stack&lt;/p&gt;

&lt;p&gt;UI — Streamlit&lt;br&gt;
Speech-to-Text — Groq Whisper Large V3 API&lt;br&gt;
LLM — Ollama llama3.2:1b running locally&lt;br&gt;
Audio Recording — sounddevice and scipy&lt;br&gt;
Language — Python 3.11&lt;/p&gt;

&lt;p&gt;GitHub and Demo&lt;/p&gt;

&lt;p&gt;GitHub Repository: &lt;a href="https://github.com/rishithabompelli/voice-agent" rel="noopener noreferrer"&gt;https://github.com/rishithabompelli/voice-agent&lt;/a&gt;&lt;br&gt;
Demo Video: &lt;a href="https://youtu.be/6ulvTsCmlEk" rel="noopener noreferrer"&gt;https://youtu.be/6ulvTsCmlEk&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conclusion&lt;/p&gt;

&lt;p&gt;Building this agent taught me a lot about combining STT, LLM, and tool execution into a clean pipeline. The biggest learning was choosing the right model for the hardware. Running everything locally on 8GB RAM required careful optimization of model sizes and offloading STT to an API. The project gave me hands-on experience with local LLMs, voice processing, and building agentic AI systems.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
