<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sanidhya Shishodia</title>
    <description>The latest articles on DEV Community by Sanidhya Shishodia (@sanidhya_shishodia_401d73).</description>
    <link>https://dev.to/sanidhya_shishodia_401d73</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3812890%2F15134dea-3b42-439a-ac10-6062080a30ee.png</url>
      <title>DEV Community: Sanidhya Shishodia</title>
      <link>https://dev.to/sanidhya_shishodia_401d73</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sanidhya_shishodia_401d73"/>
    <language>en</language>
    <item>
      <title>Building VoxAgent: A Local Voice-Controlled AI Agent with Whisper, Ollama, and Safe File Actions</title>
      <dc:creator>Sanidhya Shishodia</dc:creator>
      <pubDate>Fri, 10 Apr 2026 15:05:55 +0000</pubDate>
      <link>https://dev.to/sanidhya_shishodia_401d73/building-voxagent-a-local-voice-controlled-ai-agent-with-whisper-ollama-and-safe-file-actions-1chm</link>
      <guid>https://dev.to/sanidhya_shishodia_401d73/building-voxagent-a-local-voice-controlled-ai-agent-with-whisper-ollama-and-safe-file-actions-1chm</guid>
      <description>&lt;p&gt;If you ask most AI demos to do something useful, they usually stop right before the interesting part.&lt;/p&gt;

&lt;p&gt;They can transcribe your speech, explain what they think you meant, and generate a polished response. But they often do not cross the line into safe, visible action on a real machine.&lt;/p&gt;

&lt;p&gt;For my Mem0 AI/ML &amp;amp; Generative AI Developer Intern assignment, I wanted to build something more practical: a local-first voice-controlled AI agent that could listen to spoken commands, understand user intent, execute local tools, and expose the whole pipeline in a simple UI.&lt;/p&gt;

&lt;p&gt;That project became &lt;strong&gt;VoxAgent&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What VoxAgent Does
&lt;/h2&gt;

&lt;p&gt;VoxAgent is a local-first AI agent that supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;microphone input&lt;/li&gt;
&lt;li&gt;uploaded audio files&lt;/li&gt;
&lt;li&gt;local speech-to-text&lt;/li&gt;
&lt;li&gt;local intent understanding&lt;/li&gt;
&lt;li&gt;safe file and folder creation&lt;/li&gt;
&lt;li&gt;code generation into files&lt;/li&gt;
&lt;li&gt;text summarization&lt;/li&gt;
&lt;li&gt;general chat&lt;/li&gt;
&lt;li&gt;a UI that shows the full pipeline from audio to action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key requirement was not just to generate responses, but to actually perform useful tasks while staying within safe local boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Idea
&lt;/h2&gt;

&lt;p&gt;The architecture is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accept voice input&lt;/li&gt;
&lt;li&gt;Convert speech to text&lt;/li&gt;
&lt;li&gt;Classify the user’s intent&lt;/li&gt;
&lt;li&gt;Route the request to a local tool&lt;/li&gt;
&lt;li&gt;Show the transcript, decision, and final output in the UI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That sounds straightforward, but the interesting engineering work was in making the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local-first&lt;/li&gt;
&lt;li&gt;safe by default&lt;/li&gt;
&lt;li&gt;resilient when local models are slow or unavailable&lt;/li&gt;
&lt;/ul&gt;
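&lt;p&gt;The five-step pipeline above can be sketched as one orchestration function. This is a minimal illustration with stubbed stages, not VoxAgent's actual code; the function names are mine:&lt;/p&gt;

```python
# Minimal pipeline sketch: each stage is a stub standing in for the real
# component (faster-whisper, Ollama, the tool layer). Names are illustrative.
def transcribe(audio_path):
    return "create a file named notes.txt"  # stand-in for speech-to-text

def plan_intent(transcript):
    # stand-in for the LLM planner; returns a structured action plan
    return {"intent": "create_file", "target": "notes.txt", "confirm": True}

def execute(plan):
    return f"would create output/{plan['target']}"  # stand-in for the tool layer

def run_pipeline(audio_path):
    transcript = transcribe(audio_path)
    plan = plan_intent(transcript)
    result = execute(plan)
    # the UI renders all three stages, not just the final result
    return {"transcript": transcript, "plan": plan, "result": result}
```

&lt;p&gt;Keeping each stage behind its own function boundary is also what makes the fallback and timeout work described later possible.&lt;/p&gt;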

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;p&gt;I used the following stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamlit&lt;/strong&gt; for the frontend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;faster-whisper&lt;/strong&gt; for local speech-to-text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; for local intent routing and generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt; for orchestration and tool execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gave me a stack that was practical to run on a normal laptop without turning the project into a hosted API workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;VoxAgent is split into five layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Audio ingestion
&lt;/h3&gt;

&lt;p&gt;The Streamlit UI accepts audio in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;direct microphone recording&lt;/li&gt;
&lt;li&gt;uploaded &lt;code&gt;.wav&lt;/code&gt;, &lt;code&gt;.mp3&lt;/code&gt;, &lt;code&gt;.m4a&lt;/code&gt;, or &lt;code&gt;.ogg&lt;/code&gt; files&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made the app easier to demo and easier to test. If browser microphone behavior is inconsistent, uploaded audio still works.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Local speech-to-text
&lt;/h3&gt;

&lt;p&gt;For transcription, I used &lt;strong&gt;faster-whisper&lt;/strong&gt; with CPU-friendly defaults:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model: &lt;code&gt;base.en&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;device: &lt;code&gt;cpu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;compute type: &lt;code&gt;int8&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I initially considered using a larger Whisper setup, but in practice local responsiveness matters more than chasing the largest possible model.&lt;/p&gt;
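&lt;p&gt;These defaults map directly onto faster-whisper's &lt;code&gt;WhisperModel&lt;/code&gt; constructor. The wrapper below is an assumed sketch of the wiring, not a snippet from the repo, and requires the &lt;code&gt;faster-whisper&lt;/code&gt; package:&lt;/p&gt;

```python
# CPU-friendly transcription defaults from the post; the wiring is assumed.
STT_SETTINGS = {"model_size": "base.en", "device": "cpu", "compute_type": "int8"}

def transcribe(audio_path, settings=STT_SETTINGS):
    from faster_whisper import WhisperModel  # lazy import: optional dependency
    model = WhisperModel(
        settings["model_size"],
        device=settings["device"],
        compute_type=settings["compute_type"],
    )
    segments, _info = model.transcribe(audio_path)
    return " ".join(segment.text.strip() for segment in segments)
```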

&lt;h3&gt;
  
  
  3. Intent planning with Ollama
&lt;/h3&gt;

&lt;p&gt;Once the transcript is available, it is sent to a local Ollama model. Instead of asking for a free-form answer, I ask the model for a &lt;strong&gt;strict JSON action plan&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That plan includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;intent&lt;/li&gt;
&lt;li&gt;file or folder target if needed&lt;/li&gt;
&lt;li&gt;code generation instruction if needed&lt;/li&gt;
&lt;li&gt;text to summarize if needed&lt;/li&gt;
&lt;li&gt;whether confirmation should be required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This structure makes the agent much easier to reason about than a plain-text routing step.&lt;/p&gt;
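&lt;p&gt;A strict plan is only trustworthy if it is validated before anything executes. A minimal validator could look like this; the field names follow the list above, but the exact schema is an assumption on my part:&lt;/p&gt;

```python
import json

ALLOWED_INTENTS = {"create_file", "write_code", "summarize_text", "general_chat"}

def parse_plan(raw_model_output):
    """Parse the model's JSON plan, rejecting anything off-schema."""
    plan = json.loads(raw_model_output)
    if plan.get("intent") not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {plan.get('intent')!r}")
    # file actions must name a target and default to requiring confirmation
    if plan["intent"] in {"create_file", "write_code"}:
        if not plan.get("target"):
            raise ValueError("file action without a target")
        plan.setdefault("requires_confirmation", True)
    return plan
```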

&lt;h3&gt;
  
  
  4. Tool execution
&lt;/h3&gt;

&lt;p&gt;The agent supports four main intents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;summarize_text&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These actions are executed through a local tool layer, not directly through the UI.&lt;/p&gt;
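&lt;p&gt;A dispatch table is enough to keep that routing explicit. The handlers below are stubs standing in for the real tool layer:&lt;/p&gt;

```python
# Dispatch table mapping each supported intent to a handler. The handlers
# are stubs; in the real app they call the tool layer and the local model.
def create_file(plan):
    return f"created {plan['target']}"

def write_code(plan):
    return f"wrote code into {plan['target']}"

def summarize_text(plan):
    return f"summary of {len(plan['text'])} chars"

def general_chat(plan):
    return "chat reply"

HANDLERS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize_text": summarize_text,
    "general_chat": general_chat,
}

def dispatch(plan):
    handler = HANDLERS.get(plan["intent"])
    if handler is None:
        raise ValueError(f"unsupported intent: {plan['intent']}")
    return handler(plan)
```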

&lt;h3&gt;
  
  
  5. UI and memory
&lt;/h3&gt;

&lt;p&gt;The interface shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcribed text&lt;/li&gt;
&lt;li&gt;detected intents&lt;/li&gt;
&lt;li&gt;planned actions&lt;/li&gt;
&lt;li&gt;final results&lt;/li&gt;
&lt;li&gt;execution notes&lt;/li&gt;
&lt;li&gt;backend information&lt;/li&gt;
&lt;li&gt;timing information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The app also stores lightweight session history in JSON so recent interactions remain visible.&lt;/p&gt;
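&lt;p&gt;That history store can be as simple as appending records to a JSON file and trimming it. A sketch, with an assumed file name and on-disk schema:&lt;/p&gt;

```python
import json
from pathlib import Path

def append_history(record, path="session_history.json", keep_last=20):
    """Append one interaction record, keeping only the most recent entries."""
    history_file = Path(path)
    history = json.loads(history_file.read_text()) if history_file.exists() else []
    history.append(record)
    history = history[-keep_last:]  # bound the file size
    history_file.write_text(json.dumps(history, indent=2))
    return history
```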

&lt;h2&gt;
  
  
  Safety Constraints
&lt;/h2&gt;

&lt;p&gt;This part mattered a lot.&lt;/p&gt;

&lt;p&gt;Once an AI agent can write files locally, the execution boundary has to be extremely clear. I added three explicit safeguards:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. All writes are restricted to &lt;code&gt;output/&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The agent cannot create files outside the repository’s &lt;code&gt;output/&lt;/code&gt; folder.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Path traversal is blocked
&lt;/h3&gt;

&lt;p&gt;Any path containing &lt;code&gt;..&lt;/code&gt;, &lt;code&gt;/&lt;/code&gt;, or &lt;code&gt;\&lt;/code&gt; is rejected before execution.&lt;/p&gt;
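&lt;p&gt;Safeguards 1 and 2 can be enforced by a single gatekeeper that every write goes through. This is a sketch of the idea rather than VoxAgent's exact check:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename):
    """Reject traversal characters, then confirm the path stays inside output/."""
    if ".." in filename or "/" in filename or "\\" in filename:
        raise ValueError(f"rejected unsafe filename: {filename!r}")
    candidate = (OUTPUT_DIR / filename).resolve()
    # belt and braces: the resolved path must still sit under output/
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"path escapes output/: {candidate}")
    return candidate
```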

&lt;h3&gt;
  
  
  3. File actions require confirmation
&lt;/h3&gt;

&lt;p&gt;The UI includes a human-in-the-loop approval checkbox before file creation or code writing happens.&lt;/p&gt;

&lt;p&gt;That kept the project aligned with the assignment and prevented the most obvious local-risk failure modes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Broke During Real Testing
&lt;/h2&gt;

&lt;p&gt;The first version was structurally sound, but real end-to-end testing exposed some weaknesses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Fallback logic was incomplete
&lt;/h3&gt;

&lt;p&gt;At first, I had a rule-based fallback for intent planning if Ollama was unavailable. But some execution paths still depended on the local LLM later in the pipeline.&lt;/p&gt;

&lt;p&gt;That meant the app could recover during planning and still fail during summarization or code generation.&lt;/p&gt;

&lt;p&gt;I fixed this by adding a &lt;strong&gt;local fallback responder&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summarization can fall back locally&lt;/li&gt;
&lt;li&gt;chat can fall back locally&lt;/li&gt;
&lt;li&gt;code generation can fall back to a safe template&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made degraded execution much more predictable.&lt;/p&gt;
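&lt;p&gt;The fallback responder does not need to be clever, only deterministic. For summarization, an extractive first-sentences heuristic is one plausible shape (illustrative, not the exact heuristic in the repo):&lt;/p&gt;

```python
def fallback_summarize(text, max_sentences=2):
    """Deterministic extractive fallback: keep the first couple of sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = sentences[:max_sentences]
    return ". ".join(kept) + ("." if kept else "")
```

&lt;p&gt;A degraded answer that is boring but predictable beats a pipeline that silently hangs on the local model.&lt;/p&gt;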

&lt;h3&gt;
  
  
  Problem 2: Planning latency was too high
&lt;/h3&gt;

&lt;p&gt;In local testing, planning timeouts could make the app feel frozen even when a fallback path was available.&lt;/p&gt;

&lt;p&gt;I improved this by separating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;planner timeout&lt;/li&gt;
&lt;li&gt;generation timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shorter planning timeout lets the UI fall back faster if the local model stalls, while generation still gets its own bounded budget.&lt;/p&gt;
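&lt;p&gt;One way to implement the split is to give each call its own bounded budget and a fallback. The sketch below uses a worker thread via &lt;code&gt;concurrent.futures&lt;/code&gt;; the real implementation may rely on the HTTP client's timeout instead:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

PLANNER_TIMEOUT_S = 5      # fail fast so the UI can fall back quickly
GENERATION_TIMEOUT_S = 60  # generation gets a larger, still bounded, budget

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in a worker thread; return fallback() if it exceeds its budget."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback()
    finally:
        # don't block on a stalled call; let the worker finish in the background
        pool.shutdown(wait=False)
```

&lt;p&gt;Note that the stalled call keeps running in the background; the point is only that the UI stops waiting for it.&lt;/p&gt;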

&lt;h2&gt;
  
  
  Demo Flows I Verified
&lt;/h2&gt;

&lt;p&gt;I verified the app with actual local audio runs, not just unit tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demo 1: Summarization
&lt;/h3&gt;

&lt;p&gt;Voice input:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Summarize this text. Local AI agents combine speech recognition, reasoning, and tool execution to automate tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Observed behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript recognized correctly&lt;/li&gt;
&lt;li&gt;intent detected as &lt;code&gt;summarize_text&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;planner fell back after timeout&lt;/li&gt;
&lt;li&gt;summarization still completed successfully&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Final summary:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Local AI agents use speech recognition, reasoning, and tool execution to automate tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Demo 2: Code generation
&lt;/h3&gt;

&lt;p&gt;Voice input:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Create a Python file named retry helper dot py with a retry function.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Observed behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript normalized to &lt;code&gt;retryhelper.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;intents detected as &lt;code&gt;create_file&lt;/code&gt; and &lt;code&gt;write_code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;file created safely inside &lt;code&gt;output/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;fallback code generation completed successfully&lt;/li&gt;
&lt;/ul&gt;
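&lt;p&gt;The normalization from &lt;code&gt;retry helper dot py&lt;/code&gt; to &lt;code&gt;retryhelper.py&lt;/code&gt; comes from a small spoken-filename step. Here is a sketch of one way to do it; the actual rules in the repo may differ:&lt;/p&gt;

```python
import re

def normalize_spoken_filename(spoken):
    """Turn a spoken filename like 'retry helper dot py' into 'retryhelper.py'."""
    text = spoken.strip().lower()
    text = re.sub(r"\bdot\b", ".", text)  # the spoken word 'dot' becomes a dot
    text = text.replace(" ", "")          # collapse the spaces between words
    return text
```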

&lt;p&gt;Generated output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generated in fallback mode because the local LLM was unavailable.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attempts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;last_error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delay_seconds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="n"&gt;last_error&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Challenges I Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Running everything locally without making it painful
&lt;/h3&gt;

&lt;p&gt;Local AI systems sound attractive, but latency and hardware limits show up immediately. The practical solution was to tune for reliability:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;smaller Whisper model&lt;/li&gt;
&lt;li&gt;CPU-friendly quantization&lt;/li&gt;
&lt;li&gt;bounded timeouts&lt;/li&gt;
&lt;li&gt;local fallback behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Keeping flexibility without sacrificing safety
&lt;/h3&gt;

&lt;p&gt;Natural language is flexible. File system actions are not.&lt;/p&gt;

&lt;p&gt;So the architecture separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcript and interpretation&lt;/li&gt;
&lt;li&gt;validated action planning&lt;/li&gt;
&lt;li&gt;constrained execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation kept the agent much easier to trust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making the system observable
&lt;/h3&gt;

&lt;p&gt;I did not want the UI to behave like a black box. Showing the transcript, action plan, notes, backend used, and timing information made debugging much easier and also made the demo much stronger.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Improve Next
&lt;/h2&gt;

&lt;p&gt;If I continue iterating on VoxAgent, the next improvements would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;richer extraction of filenames and structured parameters&lt;/li&gt;
&lt;li&gt;compound actions like “summarize this and save it to summary.txt”&lt;/li&gt;
&lt;li&gt;stronger local templates for multiple programming languages&lt;/li&gt;
&lt;li&gt;benchmarking different Whisper and Ollama combinations&lt;/li&gt;
&lt;li&gt;better multi-step approval flows before execution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;p&gt;The most useful lesson from this project was that a strong local AI agent is not just a model wrapped in a UI.&lt;/p&gt;

&lt;p&gt;It is a pipeline with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicit safety boundaries&lt;/li&gt;
&lt;li&gt;predictable fallbacks&lt;/li&gt;
&lt;li&gt;observable execution&lt;/li&gt;
&lt;li&gt;practical local defaults&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what I tried to build with VoxAgent.&lt;/p&gt;

&lt;p&gt;If you want to check out the code, the repository is here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/dev-sanidhya/VoxAgent" rel="noopener noreferrer"&gt;https://github.com/dev-sanidhya/VoxAgent&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>nlp</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
