<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aaditya Kapruwan</title>
    <description>The latest articles on DEV Community by Aaditya Kapruwan (@aaditya_kapruwan).</description>
    <link>https://dev.to/aaditya_kapruwan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871535%2Ff0b751d9-77ee-4b00-ad64-b9c37d47ae33.jpg</url>
      <title>DEV Community: Aaditya Kapruwan</title>
      <link>https://dev.to/aaditya_kapruwan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aaditya_kapruwan"/>
    <language>en</language>
    <item>
      <title>Voice Agent Project</title>
      <dc:creator>Aaditya Kapruwan</dc:creator>
      <pubDate>Sat, 11 Apr 2026 23:09:56 +0000</pubDate>
      <link>https://dev.to/aaditya_kapruwan/voice-agent-project-46ka</link>
      <guid>https://dev.to/aaditya_kapruwan/voice-agent-project-46ka</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff514m8f1e7kwnyk11pqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff514m8f1e7kwnyk11pqc.png" alt="Local Voice Coding Agent" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Building a Local-First AI Agent: Coding with Your Voice&lt;/h1&gt;

&lt;p&gt;I built a local-first AI agent that turns spoken words into real-time actions on your machine—whether it's coding or general file management.&lt;/p&gt;




&lt;h2&gt;The Vision&lt;/h2&gt;

&lt;p&gt;Most coding tools today are cloud-dependent. I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Respects privacy (data stays local)&lt;/li&gt;
&lt;li&gt;Has low latency&lt;/li&gt;
&lt;li&gt;Enables hands-free workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was simple: a lightweight local system capable of handling tasks like saving files, deleting directories, or summarizing documents without sending data to external services.&lt;/p&gt;




&lt;h2&gt;The Tech Stack&lt;/h2&gt;

&lt;p&gt;Building this system required stitching together multiple components that initially didn’t integrate smoothly. To make them work cohesively, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dockerized services for isolation and reliability&lt;/li&gt;
&lt;li&gt;Used JSON as a standard communication format between components&lt;/li&gt;
&lt;/ul&gt;
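&lt;p&gt;As a rough sketch of that JSON contract (the field names here are my illustration, not the project's actual schema), each service could wrap its result in a small envelope and validate what it receives before acting on it:&lt;/p&gt;

```python
import json

# Illustrative JSON envelope passed between services; the field names
# ("source", "kind", "payload") are placeholders, not the real schema.
def make_message(source, kind, payload):
    return json.dumps({"source": source, "kind": kind, "payload": payload})

def read_message(raw):
    # Fail loudly on a malformed envelope instead of passing bad data along.
    msg = json.loads(raw)
    for field in ("source", "kind", "payload"):
        if field not in msg:
            raise ValueError("missing field: " + field)
    return msg
```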

&lt;h3&gt;Components&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Streamlit
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend (STT):&lt;/strong&gt; FastAPI + Faster-Whisper (Dockerized)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Layer:&lt;/strong&gt; Ollama (local models for intent detection and code generation)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action Layer:&lt;/strong&gt; Custom Python functions for system operations
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;4-Layer Architecture&lt;/h2&gt;

&lt;p&gt;The system is divided into four layers to ensure separation of concerns and maintainability.&lt;/p&gt;

&lt;h3&gt;Frontend (Streamlit)&lt;/h3&gt;

&lt;p&gt;Handles mic recording, file uploads, and displays action logs.&lt;/p&gt;

&lt;h3&gt;STT Service (FastAPI + Whisper)&lt;/h3&gt;

&lt;p&gt;Runs in a dedicated Docker container and converts audio into text.&lt;/p&gt;

&lt;h3&gt;AI Layer (Ollama)&lt;/h3&gt;

&lt;p&gt;Processes text to detect intent and generate code or actions.&lt;/p&gt;
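&lt;p&gt;A minimal sketch of what that step might look like against Ollama's HTTP API, assuming a local server on the default port. The intent names and prompt are placeholders rather than the project's real ones, and anything the model returns outside the allowed set falls back to plain chat:&lt;/p&gt;

```python
import json
import urllib.request

# Placeholder intent set; the project's real categories live in intent.py.
ALLOWED_INTENTS = {"chat", "create_file", "delete_file", "generate_code", "summarize"}

def build_prompt(text):
    return (
        "Classify the request into one of: "
        + ", ".join(sorted(ALLOWED_INTENTS))
        + '. Reply only with JSON like {"intent": "..."}. Request: '
        + text
    )

def parse_intent(raw):
    # Treat anything malformed or unexpected as plain chat rather than crashing.
    try:
        intent = json.loads(raw).get("intent")
    except (json.JSONDecodeError, AttributeError):
        return "chat"
    return intent if intent in ALLOWED_INTENTS else "chat"

def classify(text, model="llama3", url="http://localhost:11434/api/generate"):
    # Ollama's generate endpoint; assumes a local server on the default port.
    body = json.dumps({"model": model, "prompt": build_prompt(text), "stream": False})
    req = urllib.request.Request(url, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_intent(json.loads(resp.read())["response"])
```

&lt;p&gt;Validating the model's reply on the way in is what keeps a hallucinated intent from ever reaching the action layer.&lt;/p&gt;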

&lt;h3&gt;Action Layer&lt;/h3&gt;

&lt;p&gt;Executes safe file operations within a controlled output directory.&lt;/p&gt;
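&lt;p&gt;One way such a guardrail can be sketched (an illustration, not the project's actual code): resolve every requested path and refuse anything that lands outside the output directory:&lt;/p&gt;

```python
from pathlib import Path

# Illustrative guardrail: all writes are confined to output/.
OUTPUT_DIR = Path("output").resolve()

def safe_path(name):
    # Resolve ".." and symlink-free tricks, then check containment.
    candidate = (OUTPUT_DIR / name).resolve()
    if OUTPUT_DIR != candidate and OUTPUT_DIR not in candidate.parents:
        raise ValueError("path escapes output/: " + name)
    return candidate
```

&lt;p&gt;Resolving first and comparing against the resolved base is what defeats inputs like &lt;code&gt;../secret.txt&lt;/code&gt;.&lt;/p&gt;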




&lt;h2&gt;The Flow&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Voice → Whisper STT → Text → Ollama → Intent → Action → File Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
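&lt;p&gt;The same flow can be sketched as one function with each stage injected, so the real services (Whisper STT, Ollama) can be swapped for stubs when testing; the stage names are my own:&lt;/p&gt;

```python
# Sketch of the pipeline: each stage is a callable, so real services can
# be replaced by stubs in tests without touching the orchestration logic.
def run_pipeline(audio_bytes, transcribe, classify, execute):
    text = transcribe(audio_bytes)   # Whisper STT container
    intent = classify(text)          # Ollama intent detection
    return execute(intent, text)     # action layer writes inside output/
```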


&lt;h1&gt;Overcoming Challenges&lt;/h1&gt;

&lt;h2&gt;1. The Streamlit "Hang"&lt;/h2&gt;

&lt;p&gt;Streamlit reruns the script on every interaction. Initially, stopping a recording caused the UI to crash or feel stuck. I solved this by deduplicating recordings with session state and a content hash, so each clip is processed only once:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if mic_audio is not None:
    recorded = mic_audio.getvalue()
    if recorded:
        recorded_digest = hashlib.sha1(recorded).hexdigest()
        # Only process if the audio is new
        if recorded_digest != st.session_state.mic_audio_digest:
            st.session_state.mic_audio_bytes = recorded
            st.session_state.mic_audio_ready = True
            st.session_state.mic_audio_digest = recorded_digest
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;2. Service Reliability&lt;/h2&gt;

&lt;p&gt;To prevent the UI from hanging when a backend service was down, I added defensive health checks for all Dockerized components.&lt;/p&gt;
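&lt;p&gt;A sketch of such a check (the function name is mine, not the project's): probe the service with a short timeout and treat any failure as "down" instead of letting the exception bubble into the UI:&lt;/p&gt;

```python
import urllib.request

# Illustrative health probe: never raise and never block past the timeout,
# so a dead backend degrades the UI gracefully instead of hanging it.
def service_healthy(url, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```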

&lt;h1&gt;Lessons Learned&lt;/h1&gt;

&lt;p&gt;Building AI features isn't just about the model; it’s about reliability. My biggest improvements came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict API contracts&lt;/li&gt;
&lt;li&gt;Defensive programming&lt;/li&gt;
&lt;li&gt;Safe execution boundaries (sandboxing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Future Plans&lt;/h1&gt;

&lt;p&gt;I’m looking into integrating Gemma 4 models for better task following and more complex conversation handling.&lt;/p&gt;

&lt;h1&gt;Explore the Code&lt;/h1&gt;

&lt;p&gt;You can check out the full source code and setup instructions here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/fister12" rel="noopener noreferrer"&gt;
        fister12
      &lt;/a&gt; / &lt;a href="https://github.com/fister12/voice-agent" rel="noopener noreferrer"&gt;
        voice-agent
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Voice Agent&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Local-first voice assistant for coding and text workflows.&lt;/p&gt;

&lt;p&gt;It combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streamlit UI for input, status, and results&lt;/li&gt;
&lt;li&gt;FastAPI + faster-whisper STT service in Docker&lt;/li&gt;
&lt;li&gt;Ollama for intent classification and generation&lt;/li&gt;
&lt;li&gt;Safe action executor that only writes inside output/&lt;/li&gt;
&lt;li&gt;Persistent memory and action history for continuity&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Audio input from file upload and microphone recording (when supported by Streamlit)&lt;/li&gt;
&lt;li&gt;Typed command fallback when audio is unavailable&lt;/li&gt;
&lt;li&gt;Intent routing to file creation, code generation, summarization, chat, and compound multi-step actions&lt;/li&gt;
&lt;li&gt;Guardrails to prevent path traversal outside output/&lt;/li&gt;
&lt;li&gt;Persistent SQLite memory in output/memory.db&lt;/li&gt;
&lt;li&gt;Action audit log in output/action_log.jsonl&lt;/li&gt;
&lt;li&gt;Benchmark runner with JSONL result logging and dashboard snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app.py: Streamlit UI and orchestration flow&lt;/li&gt;
&lt;li&gt;stt.py: client for STT HTTP API&lt;/li&gt;
&lt;li&gt;stt_service/app.py: Whisper transcription API (FastAPI)&lt;/li&gt;
&lt;li&gt;intent.py: intent classification + LLM helpers&lt;/li&gt;
&lt;li&gt;tools/actions.py: safe action execution and logging&lt;/li&gt;
&lt;li&gt;memory_store.py: SQLite memory retrieval and storage&lt;/li&gt;
&lt;li&gt;benchmark.py: repeatable intent/STT benchmarking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request flow:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;…&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/fister12/voice-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;I would love to hear your feedback or suggestions for improvement!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>fastapi</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
