<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishita Raj</title>
    <description>The latest articles on DEV Community by Ishita Raj (@candy_rush).</description>
    <link>https://dev.to/candy_rush</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1841812%2F77691637-044d-4c32-a651-276a464d6ffa.jpg</url>
      <title>DEV Community: Ishita Raj</title>
      <link>https://dev.to/candy_rush</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/candy_rush"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent with Human-in-the-Loop Execution</title>
      <dc:creator>Ishita Raj</dc:creator>
      <pubDate>Mon, 13 Apr 2026 17:38:56 +0000</pubDate>
      <link>https://dev.to/candy_rush/building-a-voice-controlled-local-ai-agent-with-human-in-the-loop-execution-edc</link>
      <guid>https://dev.to/candy_rush/building-a-voice-controlled-local-ai-agent-with-human-in-the-loop-execution-edc</guid>
      <description>&lt;p&gt;Building an AI agent that can write code and modify your file system is exciting, but it’s also a massive security risk if left unchecked. For this project, the goal was to build a voice-controlled local AI agent from scratch that could transcribe audio, understand user intent, and execute file operations—all while being strictly sandboxed and requiring human approval before doing any real damage. &lt;br&gt;
Here is the &lt;a href="https://github.com/Ishitaa7/Voice-Controlled-AI-Agent/tree/master" rel="noopener noreferrer"&gt;link&lt;/a&gt; to the project.&lt;/p&gt;

&lt;p&gt;Here is a breakdown of the architecture, the models I chose, and the engineering challenges I ran into along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The project is built in Python and entirely containerized using Docker. The architecture is built around a central &lt;code&gt;AgentOrchestrator&lt;/code&gt; that manages a linear pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Audio Ingestion:&lt;/strong&gt; A Streamlit frontend captures audio via microphone or file upload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speech-to-Text (STT):&lt;/strong&gt; The audio is transcribed into text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intent &amp;amp; Tool Parsing:&lt;/strong&gt; The text, along with recent chat history, is sent to an LLM. The LLM determines if it should just chat, or if it needs to invoke specific Python functions (tools).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-Loop (HITL):&lt;/strong&gt; If tools are called, execution pauses. The pending actions are saved to a local TinyDB database, and the UI prompts the user to approve or reject the action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safe Execution:&lt;/strong&gt; Once approved, the tools are executed through a &lt;code&gt;SafeExecutor&lt;/code&gt; class. &lt;/li&gt;
&lt;/ol&gt;
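&lt;p&gt;To make the flow concrete, here is a minimal sketch of that pipeline. Everything beyond the &lt;code&gt;AgentOrchestrator&lt;/code&gt; name itself is an illustrative stand-in, not the project's actual classes or method names:&lt;/p&gt;

```python
# Minimal sketch of the linear pipeline described above.
# All provider/store classes here are illustrative stand-ins.
class AgentOrchestrator:
    def __init__(self, stt, llm, approval_store, executor):
        self.stt = stt                    # speech-to-text provider
        self.llm = llm                    # chat / tool-calling provider
        self.approvals = approval_store   # pending-action queue (e.g. TinyDB)
        self.executor = executor          # sandboxed tool runner

    def handle_audio(self, audio_bytes, history):
        text = self.stt.transcribe(audio_bytes)           # 2. STT
        reply, tool_calls = self.llm.chat(text, history)  # 3. intent + tools
        if tool_calls:
            # 4. HITL: park the actions until the user approves them
            self.approvals.save_pending(tool_calls)
            return "Awaiting approval for: " + str(tool_calls)
        return reply                                      # plain chat turn

    def on_approved(self, tool_call):
        # 5. run only through the sandboxed executor
        return self.executor.run(tool_call)
```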

&lt;h3&gt;
  
  
  Security: The SafeExecutor
&lt;/h3&gt;

&lt;p&gt;One of the core requirements was ensuring the LLM couldn't overwrite system files. I implemented a &lt;code&gt;SafeExecutor&lt;/code&gt; wrapper for all file operations (&lt;code&gt;create_file_tool&lt;/code&gt;, &lt;code&gt;write_code_tool&lt;/code&gt;). It uses &lt;code&gt;os.path.abspath&lt;/code&gt; and &lt;code&gt;os.path.commonpath&lt;/code&gt; to resolve requested file paths and verify they strictly reside within a dedicated &lt;code&gt;output/&lt;/code&gt; directory. If the LLM attempts a path traversal attack (e.g., &lt;code&gt;../../etc/passwd&lt;/code&gt;), the executor intercepts and blocks it.&lt;/p&gt;
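&lt;p&gt;A minimal sketch of that path check, assuming an &lt;code&gt;output/&lt;/code&gt; directory relative to the working directory (the helper name is illustrative, not the project's exact &lt;code&gt;SafeExecutor&lt;/code&gt; code):&lt;/p&gt;

```python
import os

# The only directory writes are allowed to touch.
OUTPUT_DIR = os.path.abspath("output")

def is_path_allowed(requested: str) -> bool:
    """Return True only if the resolved path stays inside OUTPUT_DIR."""
    resolved = os.path.abspath(os.path.join(OUTPUT_DIR, requested))
    # abspath collapses '..' segments, so a traversal attempt resolves to a
    # location whose common prefix with OUTPUT_DIR is no longer OUTPUT_DIR.
    return os.path.commonpath([OUTPUT_DIR, resolved]) == OUTPUT_DIR

# 'notes.txt' is accepted; '../../etc/passwd' is rejected.
```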

&lt;h2&gt;
  
  
  Models Chosen
&lt;/h2&gt;

&lt;p&gt;I designed the provider layer using Abstract Base Classes (&lt;code&gt;BaseSTT&lt;/code&gt;, &lt;code&gt;BaseLLM&lt;/code&gt;) so I could easily hot-swap models from the UI. &lt;/p&gt;
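&lt;p&gt;The provider layer can be sketched like this; the method signatures are assumptions, but the pattern is what lets the UI hot-swap implementations behind one interface:&lt;/p&gt;

```python
from abc import ABC, abstractmethod

# Sketch of the provider abstractions described above (signatures assumed).
class BaseSTT(ABC):
    @abstractmethod
    def transcribe(self, audio_bytes: bytes) -> str: ...

class BaseLLM(ABC):
    @abstractmethod
    def chat(self, prompt: str, history: list) -> str: ...

# Concrete providers (local Whisper, Groq, Ollama, ...) subclass these.
# A trivial example implementation:
class EchoLLM(BaseLLM):
    def chat(self, prompt, history):
        return "echo: " + prompt
```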

&lt;p&gt;&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local:&lt;/strong&gt; Hugging Face &lt;code&gt;Transformers&lt;/code&gt; pipeline running &lt;code&gt;openai/whisper-tiny.en&lt;/code&gt;. Chosen for privacy and the ability to run completely offline on CPU/MPS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Fallback:&lt;/strong&gt; Groq's API running &lt;code&gt;whisper-large-v3-turbo&lt;/code&gt;. Chosen for blazing-fast transcription when internet access is available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Large Language Models (LLM)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local:&lt;/strong&gt; Ollama running &lt;code&gt;llama3.2&lt;/code&gt;. The v0.4 Ollama Python client natively supports function calling, making it trivial to pass a list of Python tools and get structured JSON tool calls back without manual prompt engineering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Fallback:&lt;/strong&gt; Groq API running &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;. Chosen for its high reasoning capabilities, specifically for handling complex, multi-step compound commands (e.g., "Summarize this text and save it to two different files").&lt;/li&gt;
&lt;/ul&gt;
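&lt;p&gt;Once the LLM returns structured tool calls, executing them reduces to a lookup in a tool registry. A hedged sketch of that dispatch step, with a stubbed response shape (it mirrors, but is not identical to, the actual client objects):&lt;/p&gt;

```python
# Hypothetical tool registry; the real tools do sandboxed file I/O.
def create_file_tool(filename: str) -> str:
    return f"would create {filename}"

TOOLS = {"create_file_tool": create_file_tool}

def dispatch(tool_calls):
    """Run each structured tool call against the registry."""
    results = []
    for call in tool_calls:
        fn = TOOLS[call["name"]]
        results.append(fn(**call["arguments"]))
    return results

# Stubbed response in the spirit of a native tool-calling API:
response_calls = [
    {"name": "create_file_tool", "arguments": {"filename": "a.txt"}},
]
```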

&lt;h2&gt;
  
  
  Engineering Challenges
&lt;/h2&gt;

&lt;p&gt;Building the "happy path" is easy, but integrating LLMs, audio processing, and Docker brought up several interesting edge cases.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Audio Extension Trap
&lt;/h3&gt;

&lt;p&gt;Initially, when users uploaded &lt;code&gt;.mp3&lt;/code&gt; or &lt;code&gt;.m4a&lt;/code&gt; files via the Streamlit UI, the transcription would return absolute garbage—classic Whisper hallucinations. &lt;br&gt;
&lt;strong&gt;The issue:&lt;/strong&gt; Streamlit passes uploaded files as raw bytes. I was saving those bytes into a temporary &lt;code&gt;.wav&lt;/code&gt; file before passing them to the STT models. Because the file extension didn't match the actual underlying audio codec, the models couldn't extract the audio features properly and effectively transcribed silence. &lt;br&gt;
&lt;strong&gt;The fix:&lt;/strong&gt; I updated the UI to extract the true file extension from the uploaded file and dynamically assign it to the temporary file, allowing the STT engines to decode the formats perfectly.&lt;/p&gt;
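&lt;p&gt;The fix amounts to preserving the real suffix when writing the temp file. A minimal sketch (the function name is hypothetical):&lt;/p&gt;

```python
import os
import tempfile

# Keep the uploaded file's real extension so audio decoders pick the
# right codec (illustrative helper, not the exact UI code).
def save_upload(filename: str, data: bytes) -> str:
    ext = os.path.splitext(filename)[1] or ".wav"  # '.mp3', '.m4a', ...
    with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```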

&lt;h3&gt;
  
  
  2. Tool-Calling Hallucinations
&lt;/h3&gt;

&lt;p&gt;Even modern models like Llama 3.3 struggle with API constraints. Instead of using the native API tool-calling schema (which the backend expects), the model would occasionally output raw XML tags (e.g., &lt;code&gt;&amp;lt;function=create_file&amp;gt;&lt;/code&gt;) or dump concatenated JSON objects directly into its text response like this: &lt;br&gt;
&lt;code&gt;{"type": "function", "name": "create_file_tool"}{"type": "function", "name": "write_code_tool"}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Because this was just raw text, the orchestrator thought it was a general chat response and skipped the tool execution queue.&lt;br&gt;
&lt;strong&gt;The fix:&lt;/strong&gt; I updated the system prompt to explicitly forbid XML tags, and wrote a fallback JSON parser in the orchestrator. If the LLM returns an empty tool array but its text output starts with &lt;code&gt;{&lt;/code&gt;, the parser intercepts it, splits any concatenated &lt;code&gt;}{&lt;/code&gt; blocks, and manually queues them up for Human-in-the-Loop approval.&lt;/p&gt;
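&lt;p&gt;That fallback parser can be sketched as follows. This is the idea, not the exact orchestrator code, and the naive &lt;code&gt;}{&lt;/code&gt; split assumes those braces never appear inside string values:&lt;/p&gt;

```python
import json

def recover_tool_calls(text: str):
    """Fallback parser: salvage tool calls the model dumped as raw text.

    Splits concatenated '}{' blocks and parses them as a JSON array;
    returns [] when the text is an ordinary chat response.
    """
    text = text.strip()
    if not text.startswith("{"):
        return []
    # '}{'-joined objects become a valid JSON array: '},{' between them
    candidates = "[" + text.replace("}{", "},{") + "]"
    try:
        return json.loads(candidates)
    except json.JSONDecodeError:
        return []
```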

&lt;h3&gt;
  
  
  3. Python 3.14 Alphas vs. Legacy Audio Libraries
&lt;/h3&gt;

&lt;p&gt;For package management, I used &lt;code&gt;uv&lt;/code&gt;. I initially set the &lt;code&gt;pyproject.toml&lt;/code&gt; to &lt;code&gt;requires-python = "&amp;gt;=3.12"&lt;/code&gt;. &lt;code&gt;uv&lt;/code&gt; saw this and aggressively downloaded the latest Python 3.14 alpha for the Docker container. &lt;br&gt;
This immediately broke the app because &lt;code&gt;audiorecorder&lt;/code&gt; depends on &lt;code&gt;pydub&lt;/code&gt;, which depends on &lt;code&gt;audioop&lt;/code&gt;, a C-accelerated standard-library module that was removed from Python in version 3.13. The build then crashed while trying to install a backported &lt;code&gt;pyaudioop&lt;/code&gt; package.&lt;br&gt;
&lt;strong&gt;The fix:&lt;/strong&gt; I strictly pinned the project to &lt;code&gt;==3.12.*&lt;/code&gt;, ensuring a stable environment where native audio processing libraries still exist. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By combining strict sandboxing, an intuitive Human-in-the-Loop UI, and robust fallback parsers to handle LLM quirks, the result is a local, voice-controlled developer assistant that is genuinely useful—and more importantly, safe to run on a local machine.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>security</category>
    </item>
  </channel>
</rss>
