<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mansi More</title>
    <description>The latest articles on DEV Community by Mansi More (@mansi2704).</description>
    <link>https://dev.to/mansi2704</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3879103%2F12166f56-f347-48d0-aa59-297c90064842.png</url>
      <title>DEV Community: Mansi More</title>
      <link>https://dev.to/mansi2704</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mansi2704"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent: Architecture, Models, and Lessons Learned</title>
      <dc:creator>Mansi More</dc:creator>
      <pubDate>Tue, 14 Apr 2026 17:58:35 +0000</pubDate>
      <link>https://dev.to/mansi2704/building-a-voice-controlled-local-ai-agent-architecture-models-and-lessons-learned-3ggc</link>
      <guid>https://dev.to/mansi2704/building-a-voice-controlled-local-ai-agent-architecture-models-and-lessons-learned-3ggc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Published by Mansi | April 2026&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/Mansi-2703/VoiceBot" rel="noopener noreferrer"&gt;https://github.com/Mansi-2703/VoiceBot&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Most AI assistants today ship your voice to a remote server, process it in the cloud, and return a response. That works — until you care about privacy, latency, or simply running offline. This project takes a different approach: every component, from speech recognition to code generation, runs entirely on local hardware.&lt;/p&gt;

&lt;p&gt;VoiceBot is a voice-controlled AI agent built as part of the Mem0 AI/ML and Generative AI Developer Intern assignment. The goal was to build a system that accepts audio input, classifies the user's intent, executes the appropriate local tool, and presents the full pipeline result in a clean UI. This article documents the architecture decisions, model choices, implementation challenges, and benchmarked performance of the final system.&lt;/p&gt;


&lt;h2&gt;
  
  
  What the System Does
&lt;/h2&gt;

&lt;p&gt;A user speaks or uploads an audio file. The system transcribes the speech, classifies the intent into one of four categories (create a file, write code, summarize text, or chat), executes the corresponding tool, and displays the result in a Streamlit dashboard.&lt;/p&gt;

&lt;p&gt;A single utterance like "Write a Python function to reverse a string and save it to reverse.py" triggers the full pipeline: transcription, compound intent detection, code generation, and file creation — all without a single outbound API call.&lt;/p&gt;


&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system is composed of four sequential modules, each responsible for a single concern.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Audio Input (mic or file)
        |
  [ STT Module ]          src/stt.py
  OpenAI Whisper (local)
  Returns: transcript string
        |
  [ Intent Classifier ]   src/intent.py
  Mistral 7B via Ollama
  Returns: intent(s) + extracted params + confidence
        |
  [ Tool Executor ]       src/tools.py
  create_file / write_code / summarize_text / general_chat
  Returns: structured result dict
        |
  [ Streamlit UI ]        app.py
  Renders: transcript, intent badge, action, result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline orchestrator in &lt;code&gt;src/pipeline.py&lt;/code&gt; wires these together, manages session history, and handles compound commands — multiple intents detected in a single utterance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Module 1: Speech-to-Text with OpenAI Whisper
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model choice
&lt;/h3&gt;

&lt;p&gt;The STT layer uses OpenAI's Whisper base model, running entirely locally via the &lt;code&gt;openai-whisper&lt;/code&gt; Python package. Whisper is a transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio. The base model is approximately 140 MB and supports 99 languages with automatic language detection.&lt;/p&gt;

&lt;p&gt;The assignment permitted an API-based fallback (Groq or OpenAI) for machines that cannot run Whisper efficiently. The final implementation uses local Whisper because the base model runs acceptably on CPU for the audio lengths typical in voice commands, and keeping the STT local is essential to the privacy-first design goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;transcribe()&lt;/code&gt; function in &lt;code&gt;src/stt.py&lt;/code&gt; accepts either raw bytes (from a microphone recording) or a file path string (from an uploaded file). It writes bytes to a temporary file when needed, calls &lt;code&gt;whisper.load_model("base")&lt;/code&gt;, runs &lt;code&gt;model.transcribe()&lt;/code&gt;, and returns a standardised dict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Error handling catches two failure modes: unintelligible audio (empty or very short transcript) and model loading failures. Both return &lt;code&gt;error&lt;/code&gt; in the dict rather than raising exceptions, so the pipeline can surface a user-friendly warning instead of crashing.&lt;/p&gt;
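&lt;p&gt;Put together, the control flow looks roughly like this. This is a sketch of the behaviour described above, with an injectable model loader so it can run without Whisper installed; the exact signature in &lt;code&gt;src/stt.py&lt;/code&gt; may differ:&lt;/p&gt;

```python
import tempfile

def _default_whisper():
    import whisper  # heavyweight import, deferred until first use
    return whisper.load_model("base")

def transcribe(audio, source="file", load_model=None):
    """Sketch of the bytes-or-path handling described above.

    Signature and field names mirror the article's description but are
    assumptions, not the repo's actual code. `load_model` is injectable
    so tests can substitute a stub for the real Whisper model.
    """
    try:
        if isinstance(audio, bytes):
            # Mic recordings arrive as raw bytes; spill them to a temp file
            # because Whisper's transcribe() expects a path.
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
                tmp.write(audio)
                path = tmp.name
        else:
            path = audio
        model = load_model() if load_model is not None else _default_whisper()
        text = model.transcribe(path)["text"].strip()
        if len(text) >= 2:
            return {"transcript": text, "source": source, "error": None}
        return {"transcript": "", "source": source, "error": "unintelligible audio"}
    except Exception as exc:  # model load or decode failure
        return {"transcript": "", "source": source, "error": str(exc)}
```

&lt;p&gt;Both failure modes end up as a populated &lt;code&gt;error&lt;/code&gt; field rather than an exception, which is what lets the pipeline degrade gracefully.&lt;/p&gt;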

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Latency per minute of audio&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (quad-core)&lt;/td&gt;
&lt;td&gt;15–20 seconds&lt;/td&gt;
&lt;td&gt;Acceptable for short commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU (RTX 3060)&lt;/td&gt;
&lt;td&gt;2–3 seconds&lt;/td&gt;
&lt;td&gt;Near real-time&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For voice commands under 10 seconds, CPU latency is 2–4 seconds, which is acceptable for this use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  Module 2: Intent Classification with Mistral 7B
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model choice
&lt;/h3&gt;

&lt;p&gt;Intent classification uses Mistral 7B, served locally via Ollama. Mistral 7B is a 7-billion-parameter transformer that outperforms larger models on several reasoning benchmarks while fitting in approximately 4.4 GB of memory in 4-bit quantised form. Ollama provides a local HTTP server that exposes a simple REST API, making it straightforward to call from Python without managing model loading directly.&lt;/p&gt;
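&lt;p&gt;To make the "simple REST API" claim concrete: a minimal client for Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint needs only the standard library. The endpoint and the &lt;code&gt;stream&lt;/code&gt; and &lt;code&gt;format&lt;/code&gt; fields are part of Ollama's documented API; the helper names are illustrative:&lt;/p&gt;

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port

def build_payload(prompt, model="mistral"):
    # "stream": False returns a single JSON body; "format": "json" asks
    # Ollama to constrain the model's output to valid JSON.
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "format": "json",
    }).encode("utf-8")

def ask_ollama(prompt, model="mistral"):
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The generated text lives under the "response" key.
        return json.loads(resp.read())["response"]
```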

&lt;p&gt;The alternative considered was a smaller dedicated classifier (e.g., a fine-tuned BERT or DistilBERT). Mistral 7B was chosen instead because it handles the compound command case well — it can decompose "summarize this and save it to summary.txt" into two ordered intents in a single forward pass, without any multi-step prompting chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt design
&lt;/h3&gt;

&lt;p&gt;The system prompt instructs the model to return only valid JSON. This is a known reliability pattern for structured outputs from instruction-tuned models: constraining the output format in the system turn dramatically reduces parse failures compared to asking for JSON in the user turn alone.&lt;/p&gt;

&lt;p&gt;The output schema is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create_file"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.96&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"extracted_params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reverse.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"function to reverse a string"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"create_file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reverse.py"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For single-intent utterances, &lt;code&gt;intents&lt;/code&gt; contains one element. The pipeline loops over the array and dispatches each intent sequentially; this same loop is what implements compound commands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fallback behaviour
&lt;/h3&gt;

&lt;p&gt;If JSON parsing fails (rare but possible with long, ambiguous utterances), the classifier defaults to &lt;code&gt;general_chat&lt;/code&gt; with the raw transcript passed as the message. This ensures the pipeline always produces a response rather than a silent failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Latency per query&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (quad-core)&lt;/td&gt;
&lt;td&gt;8–12 seconds&lt;/td&gt;
&lt;td&gt;Bottleneck on CPU builds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU (RTX 3060)&lt;/td&gt;
&lt;td&gt;1–2 seconds&lt;/td&gt;
&lt;td&gt;Comfortable for interactive use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Module 3: Tool Execution
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;src/tools.py&lt;/code&gt; module implements four tools, each returning a standardised result dict with a &lt;code&gt;success&lt;/code&gt; boolean and a &lt;code&gt;message&lt;/code&gt; string for error surfacing.&lt;/p&gt;

&lt;h3&gt;
  
  
  create_file
&lt;/h3&gt;

&lt;p&gt;Creates a plain text file inside the &lt;code&gt;output/&lt;/code&gt; directory. Path traversal is prevented with an &lt;code&gt;os.path.abspath&lt;/code&gt; check: if the resolved path does not start with the &lt;code&gt;output/&lt;/code&gt; directory's absolute path, the operation is refused and the error is returned to the pipeline. This is a critical security constraint because the intent classifier extracts filenames from raw user speech — an adversarial or malformed input could otherwise write to arbitrary paths.&lt;/p&gt;
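&lt;p&gt;A minimal sketch of that guard (function and constant names are illustrative, not the repo's exact code):&lt;/p&gt;

```python
import os

OUTPUT_DIR = os.path.abspath("output")

def safe_output_path(filename):
    """Resolve filename inside output/ and refuse anything that escapes it.

    Sketch of the abspath guard described above; names are assumptions.
    """
    candidate = os.path.abspath(os.path.join(OUTPUT_DIR, filename))
    # The os.sep suffix also blocks sibling directories like output_evil/
    if candidate.startswith(OUTPUT_DIR + os.sep):
        return candidate
    return None  # caller returns a failure dict to the pipeline
```

&lt;p&gt;Both relative traversal (&lt;code&gt;../etc/passwd&lt;/code&gt;) and absolute paths resolve outside &lt;code&gt;output/&lt;/code&gt; and are rejected.&lt;/p&gt;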

&lt;h3&gt;
  
  
  write_code
&lt;/h3&gt;

&lt;p&gt;Sends a code generation prompt to Mistral 7B via Ollama, asking for clean code matching the user's description in the specified language. The generated code is saved under the extracted filename inside &lt;code&gt;output/&lt;/code&gt;. The result dict includes a &lt;code&gt;code_preview&lt;/code&gt; field (the first 200 characters) for display in the UI, with the full code written to the file.&lt;/p&gt;

&lt;h3&gt;
  
  
  summarize_text
&lt;/h3&gt;

&lt;p&gt;Passes the extracted text to Mistral 7B with a concise summarisation prompt. The result dict contains the summary string. This tool is typically the first step in a compound command like "summarize this and save it to notes.txt", where the pipeline chains &lt;code&gt;summarize_text&lt;/code&gt; output into the &lt;code&gt;content&lt;/code&gt; parameter of &lt;code&gt;create_file&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  general_chat
&lt;/h3&gt;

&lt;p&gt;Sends the user's message to Mistral 7B with a conversational system prompt. The last three session history entries are included as context, giving the agent memory across exchanges within a session. The result dict contains the response string.&lt;/p&gt;
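&lt;p&gt;One simple way to fold that history into the prompt (a sketch; the repo's actual prompt format and field names may differ):&lt;/p&gt;

```python
def build_chat_prompt(message, history, system="You are a helpful local assistant."):
    """Include the last three exchanges as conversational context (sketch)."""
    lines = [system]
    for entry in history[-3:]:  # only the most recent three exchanges
        lines.append(f"User: {entry['transcript']}")
        lines.append(f"Assistant: {entry['response']}")
    lines.append(f"User: {message}")
    return "\n".join(lines)
```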




&lt;h2&gt;
  
  
  Module 4: Pipeline Orchestration
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;src/pipeline.py&lt;/code&gt; contains the &lt;code&gt;run_pipeline()&lt;/code&gt; function, which is the single entry point called by the UI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Transcribe
&lt;/span&gt;    &lt;span class="n"&gt;stt_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stt_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;early_error_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stt_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Classify
&lt;/span&gt;    &lt;span class="n"&gt;intent_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stt_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Execute each intent in order
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;intent_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;intent_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extracted_params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
        &lt;span class="n"&gt;tool_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dispatch_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Append to session history
&lt;/span&gt;    &lt;span class="n"&gt;session_history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({...})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;build_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stt_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;intent_result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session history list holds the last 20 exchanges and is passed to the &lt;code&gt;general_chat&lt;/code&gt; tool on each call, enabling conversational continuity.&lt;/p&gt;

&lt;p&gt;The return dict contains all keys the UI needs: &lt;code&gt;transcript&lt;/code&gt;, &lt;code&gt;intent&lt;/code&gt;, &lt;code&gt;intents&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, &lt;code&gt;action_taken&lt;/code&gt;, &lt;code&gt;actions_taken&lt;/code&gt;, &lt;code&gt;result&lt;/code&gt;, and &lt;code&gt;error&lt;/code&gt;. This flat contract between pipeline and UI avoids the UI needing to know about the internal structure of any individual module.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Streamlit UI
&lt;/h2&gt;

&lt;p&gt;The UI (&lt;code&gt;app.py&lt;/code&gt;) is a two-column Streamlit layout with no third-party UI libraries. The left column contains the audio input widgets and the run button. The right column contains four output cards rendered as raw HTML via &lt;code&gt;st.markdown(..., unsafe_allow_html=True)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Key implementation decisions:&lt;/p&gt;

&lt;p&gt;The output cards bypass Streamlit's default container and expander widgets entirely. Those widgets render a visible header bar that conflicts with the custom card design. Raw HTML divs with inline styles give full control over the visual output.&lt;/p&gt;

&lt;p&gt;The run button uses &lt;code&gt;type="primary"&lt;/code&gt; with a CSS override via &lt;code&gt;button[kind="primary"]&lt;/code&gt; selectors to enforce the dark background. Streamlit's theming system does not expose button background color as a first-class config value, so the override must be injected in the global stylesheet block.&lt;/p&gt;

&lt;p&gt;Session state holds all pipeline results across reruns. Streamlit's execution model reruns the entire script on every interaction, so all mutable state must live in &lt;code&gt;st.session_state&lt;/code&gt;. The pipeline is only called when the run button is clicked, guarded by the &lt;code&gt;if run_button:&lt;/code&gt; block.&lt;/p&gt;

&lt;p&gt;The sidebar displays the last five session history entries in &lt;code&gt;[intent] — transcript preview...&lt;/code&gt; format, giving the user a quick audit trail of their session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Compound Commands
&lt;/h2&gt;

&lt;p&gt;This was the most technically interesting feature to implement. Supporting compound commands required changes at two levels.&lt;/p&gt;

&lt;p&gt;At the intent classifier level, the system prompt was updated to return an array of intents rather than a single string, and to extract parameters for each intent independently. The key prompt instruction is: "If the user's command contains multiple distinct actions, return all of them in the intents array in the order they should be executed."&lt;/p&gt;

&lt;p&gt;At the pipeline level, the sequential loop over &lt;code&gt;intent_result["intents"]&lt;/code&gt; handles execution. The output of one tool can be wired to the input of the next: when &lt;code&gt;summarize_text&lt;/code&gt; precedes &lt;code&gt;create_file&lt;/code&gt;, the pipeline passes the summary string as the &lt;code&gt;content&lt;/code&gt; parameter for file creation, rather than requiring the user to repeat the content.&lt;/p&gt;

&lt;p&gt;Example compound command flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;User&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;this&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;save&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;summary.txt"&lt;/span&gt;

&lt;span class="na"&gt;Intent classifier returns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;intents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summarize_text"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summarize_text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;..."&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;create_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;filename&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary.txt"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="na"&gt;Pipeline executes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Step 1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;summarize_text → summary string&lt;/span&gt;
  &lt;span class="na"&gt;Step 2&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;create_file with content = summary string&lt;/span&gt;
  &lt;span class="na"&gt;Output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output/summary.txt created&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
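&lt;p&gt;The chaining step can be sketched as a loop that carries the previous tool's output forward (identifiers here are illustrative, not the repo's actual code):&lt;/p&gt;

```python
def run_intents(intents, params, tools):
    """Execute intents in order, wiring each tool's output into the next (sketch)."""
    results = []
    carry = None  # output of the previous tool, if any
    for intent in intents:
        p = dict(params.get(intent, {}))
        # Compound-command wiring: a preceding summary becomes the file content
        # unless the classifier already extracted explicit content.
        if intent == "create_file" and carry is not None and "content" not in p:
            p["content"] = carry
        result = tools[intent](**p)
        carry = result.get("output")
        results.append(result)
    return results
```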






&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;The test suite contains over 70 test cases across four test files, organised under &lt;code&gt;tests/&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Module&lt;/th&gt;
&lt;th&gt;Test count&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_stt.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;85%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_intent.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;90%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_tools.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20+&lt;/td&gt;
&lt;td&gt;92%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test_pipeline.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15+&lt;/td&gt;
&lt;td&gt;88%+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;70+&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Unit tests mock external dependencies (Whisper model loading, Ollama HTTP calls) so they run without GPU or Ollama installed. Integration tests run against live services and are marked with &lt;code&gt;@pytest.mark.slow&lt;/code&gt; so they can be excluded from fast CI runs via &lt;code&gt;pytest tests/ -m "not slow"&lt;/code&gt;.&lt;/p&gt;
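&lt;p&gt;The mocking pattern looks roughly like this; the function under test is a toy stand-in for the real classifier, not code from the repo:&lt;/p&gt;

```python
import json
from unittest.mock import MagicMock

def classify_intent(transcript, post):
    """Toy classifier that delegates the HTTP call to `post` so tests can stub it."""
    raw = post({"model": "mistral", "prompt": transcript})
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"intents": ["general_chat"], "extracted_params": {}}

def test_classify_intent_parses_json():
    # No Ollama server needed: the HTTP layer is replaced by a MagicMock.
    fake_post = MagicMock(return_value='{"intents": ["write_code"], "extracted_params": {}}')
    result = classify_intent("write me code", post=fake_post)
    assert result["intents"] == ["write_code"]
    fake_post.assert_called_once()
```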

&lt;p&gt;Security tests verify path traversal prevention: attempts to write to &lt;code&gt;../etc/passwd&lt;/code&gt; or absolute paths outside &lt;code&gt;output/&lt;/code&gt; are confirmed to return &lt;code&gt;success: false&lt;/code&gt; without writing any file.&lt;/p&gt;

&lt;p&gt;The project includes &lt;code&gt;pytest.ini&lt;/code&gt; for configuration and &lt;code&gt;requirements-dev.txt&lt;/code&gt; for test-only dependencies (&lt;code&gt;pytest&lt;/code&gt;, &lt;code&gt;pytest-cov&lt;/code&gt;, &lt;code&gt;pytest-mock&lt;/code&gt;), keeping test infrastructure separate from the production dependency graph.&lt;/p&gt;




&lt;h2&gt;
  
  
  Model Benchmarking
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;benchmark.py&lt;/code&gt; script measures latency and throughput for each model independently and for the full pipeline. Sample results on an RTX 3060:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Speech-to-Text (Whisper base)
  Average latency:    3,200 ms
  Min / Max:          2,800 ms / 3,600 ms
  Samples:            10

Intent Classification (Mistral 7B)
  Average latency:    1,500 ms
  Min / Max:          1,100 ms / 2,100 ms
  Samples:            5

Tool Execution
  create_file:        8 ms
  write_code:         5,320 ms  (includes Mistral generation)
  summarize_text:     7,150 ms  (includes Mistral generation)

Full Pipeline (GPU, warm models)
  Average:            4–8 seconds end-to-end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On CPU-only hardware, full pipeline latency is 25–40 seconds. For a local agent where privacy is the primary constraint, this is acceptable. For production use, GPU acceleration is strongly recommended.&lt;/p&gt;
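&lt;p&gt;The measurement itself needs nothing beyond the standard library. A minimal harness along these lines (a sketch, not the repo's &lt;code&gt;benchmark.py&lt;/code&gt;):&lt;/p&gt;

```python
import statistics
import time

def bench(fn, samples=5):
    """Run fn several times and report avg/min/max latency in milliseconds."""
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return {
        "avg_ms": statistics.mean(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }
```

&lt;p&gt;When benchmarking warm models, discard the first run so cold-start loading does not skew the average.&lt;/p&gt;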




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Structured JSON output from Mistral.&lt;/strong&gt; Getting a local model to reliably return only JSON, without preamble or explanation, required careful system prompt engineering. The final approach uses a strict system prompt, a &lt;code&gt;format: json&lt;/code&gt; parameter in the Ollama API call where supported, and a try/except parser that strips markdown code fences (&lt;code&gt;```json&lt;/code&gt;) before parsing. The fallback to &lt;code&gt;general_chat&lt;/code&gt; on parse failure means the user always gets a response.&lt;/p&gt;
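&lt;p&gt;The fence-stripping parse step can be sketched like this (a guess at the shape, not the exact code):&lt;/p&gt;

```python
import json
import re

def parse_model_json(raw):
    """Strip optional markdown fences around the model's reply, then parse."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to chat so the user always gets a response.
        return {"intents": ["general_chat"], "extracted_params": {}}
```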

&lt;p&gt;&lt;strong&gt;Streamlit's execution model.&lt;/strong&gt; Streamlit reruns the entire script on every widget interaction, including audio recording. This caused the pipeline to re-execute unexpectedly on certain state changes. The fix was to gate all pipeline logic strictly inside &lt;code&gt;if run_button:&lt;/code&gt; and rely on &lt;code&gt;st.session_state&lt;/code&gt; for all persistent state, never on module-level variables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whisper model warm-up.&lt;/strong&gt; The first transcription in a session takes 3–5 seconds longer than subsequent ones because the model must be loaded into memory. For a better user experience, the model could be loaded at startup rather than on the first call. This is a known optimisation left for a future iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path safety for file operations.&lt;/strong&gt; The intent classifier extracts filenames from raw natural language, which means user input directly influences file paths. The &lt;code&gt;os.path.abspath&lt;/code&gt; comparison guard was added after identifying this risk during security test writing — not before. Writing security tests first would have caught this earlier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Local Models Over Cloud APIs
&lt;/h2&gt;

&lt;p&gt;The cost and privacy advantages are concrete. Running Whisper and Mistral locally has zero ongoing cost after the initial model download. An equivalent cloud setup using OpenAI Whisper API and GPT-4o would cost approximately $0.006 per minute of audio plus $5–15 per million tokens for intent classification and tool calls. For a developer running hundreds of test iterations, local models eliminate a meaningful expense and remove the risk of sending sensitive content to third-party servers.&lt;/p&gt;

&lt;p&gt;The tradeoff is hardware requirements. The system needs approximately 8 GB of RAM and 5 GB of disk space, plus a GPU for comfortable interactive latency. On CPU-only machines, the pipeline is functional but slow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;voice-agent/
├── app.py                  Streamlit UI (437 lines)
├── benchmark.py            Performance measurement script
├── requirements.txt        Production dependencies
├── requirements-dev.txt    Test dependencies
├── pytest.ini              Test configuration
├── src/
│   ├── stt.py              Whisper transcription
│   ├── intent.py           Mistral intent classification
│   ├── tools.py            Four tool implementations
│   └── pipeline.py         Orchestration + session memory
├── tests/
│   ├── conftest.py         Pytest fixtures
│   ├── test_stt.py
│   ├── test_intent.py
│   ├── test_tools.py
│   └── test_pipeline.py
└── output/                 All generated files written here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Setup in Five Steps
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# 1. Install Ollama from https://ollama.ai, then pull the model
ollama pull mistral

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start the Ollama server (keep this terminal open)
ollama serve

# 5. Run the application
streamlit run app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;If building this again, I would write the security tests for file operations before writing the &lt;code&gt;create_file&lt;/code&gt; tool, not after. The path traversal risk is obvious in retrospect but was not the first thing I thought about when implementing the feature.&lt;/p&gt;

&lt;p&gt;I would also add a Whisper model warm-up call at application startup — loading the model into a module-level variable so the first user request does not experience the cold-start penalty.&lt;/p&gt;
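&lt;p&gt;That warm-up is only a few lines with &lt;code&gt;functools.lru_cache&lt;/code&gt;; the loader below is a stand-in for the real &lt;code&gt;whisper.load_model&lt;/code&gt; call:&lt;/p&gt;

```python
import functools

def _load_model(name):
    # Stand-in for whisper.load_model(name); swap in the real call.
    return f"loaded:{name}"

@functools.lru_cache(maxsize=1)
def get_model(name="base"):
    """First call loads the model; every later call reuses the cached instance."""
    return _load_model(name)

# Calling get_model() once at application startup absorbs the cold-start
# cost before the first user request arrives.
```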

&lt;p&gt;For the intent classifier, I would explore whether a smaller, fine-tuned model (a DistilBERT classifier trained on 500 examples per intent) could match Mistral 7B's accuracy for the four supported intents at a fraction of the latency. The 8–12 second CPU latency for Mistral is the single biggest usability problem on low-end hardware, and a dedicated small classifier would likely close it to under 1 second.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;VoiceBot demonstrates that a production-quality voice agent — with compound command support, session memory, security constraints, and a full test suite — can be built entirely on local hardware using open-source models. The architecture is modular: each of the four components (STT, intent classification, tool execution, UI) can be replaced independently. Swapping Whisper for a faster local model, or Mistral for a fine-tuned classifier, requires changing a single module without touching the others.&lt;/p&gt;

&lt;p&gt;The full source code is available at &lt;a href="https://github.com/Mansi-2703/VoiceBot" rel="noopener noreferrer"&gt;https://github.com/Mansi-2703/VoiceBot&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built using OpenAI Whisper, Mistral 7B via Ollama, Streamlit, and PyTorch. All models run locally.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
