<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: rautaditya2606</title>
    <description>The latest articles on DEV Community by rautaditya2606 (@rautaditya2606).</description>
    <link>https://dev.to/rautaditya2606</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2747140%2F696730cd-b32e-4fc6-ad1b-d6c8e7bf9df7.png</url>
      <title>DEV Community: rautaditya2606</title>
      <link>https://dev.to/rautaditya2606</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rautaditya2606"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent on a 4GB GPU</title>
      <dc:creator>rautaditya2606</dc:creator>
      <pubDate>Sun, 12 Apr 2026 20:57:55 +0000</pubDate>
      <link>https://dev.to/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</link>
      <guid>https://dev.to/rautaditya2606/building-a-voice-controlled-local-ai-agent-on-a-4gb-gpu-emc</guid>
      <description>&lt;p&gt;&lt;strong&gt;What I Built&lt;/strong&gt;&lt;br&gt;
I built a voice-controlled local AI agent that transcribes &lt;br&gt;
audio, classifies intent, and executes local tools — all &lt;br&gt;
visible through a transparent pipeline trace in a Gradio UI.&lt;br&gt;
The agent supports four intents: create file, write code, &lt;br&gt;
summarize text, and general chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
STT layer: Groq Whisper-large-v3 handles transcription via API.&lt;br&gt;
I chose Groq over local Whisper because my RTX 3050 (4GB VRAM) &lt;br&gt;
cannot run STT and an LLM simultaneously without OOM errors. &lt;br&gt;
Groq's API is actually faster (~300ms) than local whisper-small &lt;br&gt;
would have been.&lt;/p&gt;
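
&lt;p&gt;For reference, the STT call is only a few lines. A minimal sketch, assuming the official &lt;code&gt;groq&lt;/code&gt; Python SDK (the &lt;code&gt;transcribe&lt;/code&gt; helper name is mine, not the exact code from the repo):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# STT via Groq's hosted Whisper (sketch; assumes GROQ_API_KEY is set).
from groq import Groq

client = Groq()  # picks up GROQ_API_KEY from the environment

def transcribe(audio_path):
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(audio_path, f.read()),
            model="whisper-large-v3",
        )
    return result.text
&lt;/code&gt;&lt;/pre&gt;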

&lt;p&gt;Intent layer: Ollama serves qwen2.5-coder:1.5b locally. The LLM &lt;br&gt;
returns a structured JSON intent that the tool router uses to &lt;br&gt;
decide which action to take.&lt;/p&gt;
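
&lt;p&gt;A minimal sketch of the intent call, assuming the &lt;code&gt;ollama&lt;/code&gt; Python package (the prompt wording and the &lt;code&gt;intent&lt;/code&gt;/&lt;code&gt;args&lt;/code&gt; schema here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Intent classification via local Ollama (sketch).
import json
import ollama

SYSTEM_PROMPT = (
    "Classify the request as one of: create_file, write_code, "
    "summarize, general_chat. Respond with JSON like "
    '{"intent": "create_file", "args": {"filename": "notes.txt"}}'
)

def classify(transcript):
    resp = ollama.chat(
        model="qwen2.5-coder:1.5b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        format="json",  # ask Ollama to constrain output to valid JSON
    )
    return json.loads(resp["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;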

&lt;p&gt;Tool layer: Four tools — create_file, write_code, summarize, &lt;br&gt;
general_chat. All file writes are sandboxed to output/.&lt;/p&gt;
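
&lt;p&gt;The router is a plain dict dispatch, with every write funneled through one path check. A sketch (&lt;code&gt;safe_path&lt;/code&gt; and the handler body are illustrative, but this is the sandboxing idea described under challenges below):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Tool dispatch plus the output/ sandbox check (sketch).
import os

OUTPUT_DIR = os.path.abspath("output")
os.makedirs(OUTPUT_DIR, exist_ok=True)

def safe_path(filename):
    # Normalize, then refuse anything that resolves outside output/.
    target = os.path.abspath(os.path.join(OUTPUT_DIR, filename))
    if not target.startswith(OUTPUT_DIR + os.sep):
        raise ValueError(f"path escapes sandbox: {filename}")
    return target

def create_file(args):
    path = safe_path(args["filename"])
    with open(path, "w") as f:
        f.write(args.get("content", ""))
    return f"wrote {path}"

TOOLS = {"create_file": create_file}  # write_code, summarize, and
                                      # general_chat register the same way

def route(intent):
    return TOOLS[intent["intent"]](intent.get("args", {}))
&lt;/code&gt;&lt;/pre&gt;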

&lt;p&gt;UI layer: Gradio displays transcription, detected intent, action &lt;br&gt;
taken, and a full pipeline trace with per-stage latency.&lt;/p&gt;
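
&lt;p&gt;Wiring it together in Gradio looks roughly like this (a sketch reusing the helpers from the sketches above; note the mic tuple handling described in challenge 4 below):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Gradio wiring (sketch). The mic returns (sample_rate, numpy_array),
# not a file path, so write a temp WAV before calling Groq.
import json
import tempfile

import gradio as gr
import soundfile as sf

def handle(audio):
    sample_rate, data = audio
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        sf.write(tmp.name, data, sample_rate)
        text = transcribe(tmp.name)   # STT sketch above
    intent = classify(text)           # intent sketch above
    action = route(intent)            # tool router sketch above
    return text, json.dumps(intent), action

demo = gr.Interface(
    fn=handle,
    inputs=gr.Audio(sources=["microphone"]),
    outputs=["text", "text", "text"],
)
demo.launch()
&lt;/code&gt;&lt;/pre&gt;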

&lt;p&gt;&lt;strong&gt;Hardware Constraints and Decisions&lt;/strong&gt;&lt;br&gt;
My machine: Intel i5-12500H, RTX 3050 (4GB VRAM), 15GB RAM.&lt;/p&gt;

&lt;p&gt;The core constraint: 4GB VRAM cannot hold both a Whisper model &lt;br&gt;
and an LLM simultaneously.&lt;/p&gt;

&lt;p&gt;Decision 1 — STT via Groq API&lt;br&gt;
Running whisper-small locally uses ~1.5GB VRAM. That leaves &lt;br&gt;
only 2.5GB for the LLM, which isn't enough for a useful model. &lt;br&gt;
Offloading STT to Groq frees the entire 4GB for the LLM and &lt;br&gt;
actually improves latency.&lt;/p&gt;

&lt;p&gt;Decision 2 — qwen2.5-coder:1.5b via Ollama&lt;br&gt;
A 1.5B model at Q4 quantization fits comfortably in ~1.5GB VRAM &lt;br&gt;
(4 bits per weight is roughly 0.75GB of parameters; KV cache and &lt;br&gt;
runtime overhead roughly double that). I initially tried the 7b &lt;br&gt;
variant, but it exceeded available VRAM and caused Ollama to &lt;br&gt;
offload to RAM, significantly slowing inference.&lt;/p&gt;

&lt;p&gt;Decision 3 — Sequential pipeline&lt;br&gt;
STT completes before Ollama is called. This keeps peak VRAM &lt;br&gt;
usage under 2GB at any given time.&lt;/p&gt;
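
&lt;p&gt;The sequential ordering also makes the per-stage trace trivial to produce. A sketch of the timing wrapper (it reuses the helper names from the sketches above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sequential pipeline with per-stage latency (sketch).
# STT finishes before the LLM is invoked, so only one model is hot at a time.
import time

def run_pipeline(audio_path):
    trace = []

    def timed(stage, fn, *args):
        start = time.perf_counter()
        out = fn(*args)
        trace.append((stage, round((time.perf_counter() - start) * 1000, 1)))
        return out

    text = timed("stt", transcribe, audio_path)
    intent = timed("intent", classify, text)
    action = timed("tool", route, intent)
    return action, trace  # trace is a list of (stage, milliseconds) pairs
&lt;/code&gt;&lt;/pre&gt;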

&lt;p&gt;&lt;strong&gt;Challenges I Faced&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;VRAM management&lt;br&gt;
Loading two models simultaneously caused OOM errors. Solved &lt;br&gt;
by switching STT to Groq and keeping only the LLM local.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intent JSON parsing&lt;br&gt;
Ollama sometimes returns malformed JSON or wraps it in &lt;br&gt;
markdown code fences. Solved with a robust parser that &lt;br&gt;
strips fences and falls back to keyword matching if JSON &lt;br&gt;
parsing fails entirely (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output sandboxing&lt;br&gt;
Naive file creation allowed path traversal (e.g. &lt;br&gt;
../../etc/passwd). Solved with path normalization and &lt;br&gt;
checking that the resolved path starts with the output/ &lt;br&gt;
directory.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gradio mic input format&lt;br&gt;
Gradio returns audio as a tuple (sample_rate, numpy_array), &lt;br&gt;
not a file path. Had to write it to a temp file before &lt;br&gt;
passing it to the Groq API (the Gradio sketch above shows &lt;br&gt;
this).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
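
&lt;p&gt;The parser from challenge 2, sketched out (the fence regex and the keyword table are illustrative, not the exact code from the repo):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Robust intent parsing (sketch): strip code fences, then fall back
# to keyword matching if json.loads still fails.
import json
import re

FALLBACK_KEYWORDS = {
    "create_file": ("create", "file"),
    "write_code": ("code", "script", "function"),
    "summarize": ("summarize", "summary"),
}

def parse_intent(raw, transcript):
    # Remove ```json ... ``` fences the model sometimes adds.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        lowered = transcript.lower()
        for intent, words in FALLBACK_KEYWORDS.items():
            if any(w in lowered for w in words):
                return {"intent": intent, "args": {}}
        return {"intent": "general_chat", "args": {}}
&lt;/code&gt;&lt;/pre&gt;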

&lt;p&gt;&lt;strong&gt;What I'd Do Differently at Scale&lt;/strong&gt;&lt;br&gt;
For a production version of this system, I would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace Ollama with Triton Inference Server for proper 
model serving with batching and metrics endpoints.&lt;/li&gt;
&lt;li&gt;Add a message queue (Redis or RabbitMQ) between the UI 
and pipeline so multiple users don't block each other.&lt;/li&gt;
&lt;li&gt;Replace the flat logger with structured JSON logs shipped 
to an observability stack (Grafana + Loki).&lt;/li&gt;
&lt;li&gt;Add model versioning — config.yaml currently hardcodes 
model names. A proper MLOps setup uses a model registry.&lt;/li&gt;
&lt;li&gt;Containerize STT locally using a sidecar so the pipeline 
has no external API dependency in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Links&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI" rel="noopener noreferrer"&gt;https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI&lt;/a&gt;&lt;br&gt;
Demo: &lt;a href="https://youtu.be/rhGIQvi4Y74" rel="noopener noreferrer"&gt;https://youtu.be/rhGIQvi4Y74&lt;/a&gt;  &lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
