<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adarsh Sharma</title>
    <description>The latest articles on DEV Community by Adarsh Sharma (@adarsh_sharma_d509399e3d2).</description>
    <link>https://dev.to/adarsh_sharma_d509399e3d2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873434%2F56f9e461-b08a-475f-b108-cc4813550426.png</url>
      <title>DEV Community: Adarsh Sharma</title>
      <link>https://dev.to/adarsh_sharma_d509399e3d2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adarsh_sharma_d509399e3d2"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent using Ollama and Whisper</title>
      <dc:creator>Adarsh Sharma</dc:creator>
      <pubDate>Sat, 11 Apr 2026 15:06:47 +0000</pubDate>
      <link>https://dev.to/adarsh_sharma_d509399e3d2/building-a-voice-controlled-local-ai-agent-using-ollama-and-whisper-599</link>
      <guid>https://dev.to/adarsh_sharma_d509399e3d2/building-a-voice-controlled-local-ai-agent-using-ollama-and-whisper-599</guid>
      <description>&lt;p&gt;Introduction&lt;/p&gt;

&lt;p&gt;In this project, I built a Voice-Controlled Local AI Agent that can accept audio or text input, understand the user's intent, and perform real actions like creating files, generating code, and summarizing text.&lt;/p&gt;

&lt;p&gt;Unlike traditional chatbots, this system doesn’t just respond — it acts based on user commands.&lt;/p&gt;
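&lt;p&gt;To make the "understand and act" idea concrete, here is a minimal rule-based intent detector in Python. This is only an illustrative sketch: the actual system asks a local LLM (via Ollama) and layers rule-based overrides on top, and the function and intent names below are hypothetical.&lt;/p&gt;

```python
import re

# Simplified, rule-based stand-in for the project's intent detection.
# The real system uses a local LLM (Ollama) plus rule-based overrides;
# the patterns and intent names here are illustrative only.
RULES = [
    (r"\b(summarize|summary)\b", "summarize_text"),
    (r"\b(write|generate)\b.*\bcode\b|\bprogram\b", "write_code"),
    (r"\bcreate\b.*\bfolder\b", "create_folder"),
    (r"\bcreate\b.*\bfile\b", "create_file"),
]

def detect_intent(command: str) -> str:
    """Return the first matching intent, or 'chat' as a fallback."""
    text = command.lower()
    for pattern, intent in RULES:
        if re.search(pattern, text):
            return intent
    return "chat"
```

&lt;p&gt;Keyword rules like these are what the project calls rule-based overrides; anything they miss is left to the LLM.&lt;/p&gt;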

&lt;p&gt;System Architecture&lt;/p&gt;

&lt;p&gt;The system follows a clear pipeline:&lt;/p&gt;

&lt;p&gt;Audio/Text Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Audio/Text Input&lt;/strong&gt;&lt;br&gt;
The system accepts microphone input, uploaded audio files (.wav, .mp3), and direct text input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt;&lt;br&gt;
Audio is converted into text using a local model (faster-whisper).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent Detection&lt;/strong&gt;&lt;br&gt;
A local LLM (via Ollama) analyzes the text and classifies the user’s intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Execution&lt;/strong&gt;&lt;br&gt;
Based on the detected intent, the system performs actions such as creating files and folders, writing code, and summarizing text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Interface&lt;/strong&gt;&lt;br&gt;
A Gradio-based UI displays the transcribed text, the detected intent, the actions taken, and the final output.&lt;/p&gt;

&lt;p&gt;🛠️ Technologies Used&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Python – core development&lt;/li&gt;
  &lt;li&gt;Gradio – user interface&lt;/li&gt;
  &lt;li&gt;faster-whisper – speech-to-text&lt;/li&gt;
  &lt;li&gt;Ollama (phi3:mini) – local LLM for intent detection&lt;/li&gt;
  &lt;li&gt;Regex + AST – code extraction and validation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🎯 Features&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;🎤 Voice input (microphone + file upload)&lt;/li&gt;
  &lt;li&gt;🧠 Local intent detection using an LLM&lt;/li&gt;
  &lt;li&gt;📁 File and folder creation&lt;/li&gt;
  &lt;li&gt;💻 Code generation and saving&lt;/li&gt;
  &lt;li&gt;✂️ Text summarization&lt;/li&gt;
  &lt;li&gt;🔐 Secure sandbox (output/ directory)&lt;/li&gt;
  &lt;li&gt;🔁 Compound command support&lt;/li&gt;
  &lt;li&gt;🧍 Human-in-the-loop approval&lt;/li&gt;
  &lt;li&gt;🧠 Session memory&lt;/li&gt;
  &lt;li&gt;⏱️ Performance benchmarking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ Challenges Faced&lt;/p&gt;

&lt;p&gt;During development, several challenges came up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent misclassification&lt;/strong&gt; – the model sometimes confused creating files with writing code.&lt;br&gt;
&lt;strong&gt;Uncontrolled code generation&lt;/strong&gt; – the LLM returned explanations alongside the code, leaving messy files.&lt;br&gt;
&lt;strong&gt;Path security issues&lt;/strong&gt; – some commands tried to write outside the allowed directory.&lt;br&gt;
&lt;strong&gt;Compound command handling&lt;/strong&gt; – running multiple steps from a single command required careful execution logic.&lt;/p&gt;

&lt;p&gt;✅ Solutions Implemented&lt;/p&gt;

&lt;p&gt;To overcome these challenges:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Added rule-based overrides for more accurate intent detection&lt;/li&gt;
  &lt;li&gt;Designed strict prompts for controlled code generation&lt;/li&gt;
  &lt;li&gt;Implemented code cleaning using AST parsing&lt;/li&gt;
  &lt;li&gt;Built a safe path system restricting operations to output/&lt;/li&gt;
  &lt;li&gt;Added path carryover logic for compound commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔒 Security Considerations&lt;/p&gt;

&lt;p&gt;All file operations are restricted to a dedicated output/ folder.&lt;br&gt;
Any attempt to access outside paths (like ../../) is automatically blocked.&lt;/p&gt;
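&lt;p&gt;The sandbox check can be sketched in a few lines of Python with pathlib. This is a simplified illustration, not the project's actual code; the helper name is made up.&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(user_path: str) -> Path:
    """Resolve a user-supplied path inside the output/ sandbox.

    Raises ValueError for anything that escapes it (e.g. '../../etc').
    Illustrative sketch only; the helper name is hypothetical.
    """
    candidate = (OUTPUT_DIR / user_path).resolve()
    # Path.is_relative_to is available from Python 3.9 onward
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"Blocked path outside sandbox: {user_path}")
    return candidate
```

&lt;p&gt;Both ../ traversals and absolute paths resolve outside output/ after normalization, so they are rejected before any file is written.&lt;/p&gt;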

&lt;p&gt;🚀 Example Workflow&lt;/p&gt;

&lt;p&gt;User Input:&lt;/p&gt;

&lt;p&gt;Write a Python program to add two numbers and save it to test_add.py&lt;/p&gt;

&lt;p&gt;System Execution:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Detects the intent (write_code_to_new_file)&lt;/li&gt;
  &lt;li&gt;Generates the Python code&lt;/li&gt;
  &lt;li&gt;Saves the file as output/test_add.py&lt;/li&gt;
  &lt;li&gt;Displays the full pipeline in the UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;🏁 Conclusion&lt;/p&gt;

&lt;p&gt;This project demonstrates how a local AI agent can be built to perform real-world tasks efficiently and securely.&lt;/p&gt;

&lt;p&gt;It combines speech processing, language models, and system-level execution into a single interactive application.&lt;/p&gt;
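&lt;p&gt;As a closing illustration, the Regex + AST code cleaning described in the solutions can be sketched like this (a simplified version with made-up names, not the project's exact implementation):&lt;/p&gt;

```python
import ast
import re

# Triple-backtick Markdown fence marker, built indirectly so this
# example can itself be embedded in Markdown without breaking.
FENCE = "`" * 3

def clean_generated_code(llm_output: str) -> str:
    """Extract Python code from an LLM reply and validate it.

    The LLM often wraps code in Markdown fences and adds prose
    around it; keep only the fenced block and confirm it parses.
    Simplified sketch; the function name is hypothetical.
    """
    pattern = FENCE + r"(?:python)?\s*\n(.*?)" + FENCE
    match = re.search(pattern, llm_output, re.DOTALL)
    code = match.group(1) if match else llm_output
    ast.parse(code)  # raises SyntaxError if the extract is not valid Python
    return code.strip()
```

&lt;p&gt;Validating with ast.parse before saving is what keeps explanation text out of the generated files.&lt;/p&gt;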

&lt;p&gt;🔗 Links&lt;br&gt;
GitHub Repository: &lt;a href="https://github.com/adarsh7979s/Voice-controlled-ai-agent" rel="noopener noreferrer"&gt;https://github.com/adarsh7979s/Voice-controlled-ai-agent&lt;/a&gt;&lt;br&gt;
Demo Video: &lt;a href="https://youtu.be/PFnSSqCuNd4" rel="noopener noreferrer"&gt;https://youtu.be/PFnSSqCuNd4&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
