DEV Community
rautaditya2606
Building a Voice-Controlled Local AI Agent on a 4GB GPU

What I Built
I built a voice-controlled local AI agent that transcribes
audio, classifies intent, and executes local tools — all
visible through a transparent pipeline trace in a Gradio UI.
The agent supports four intents: create file, write code,
summarize text, and general chat.

Architecture
STT layer: Groq Whisper-large-v3 handles transcription via API.
I chose Groq over local Whisper because my RTX 3050 (4GB VRAM)
cannot run STT and an LLM simultaneously without OOM errors.
Groq's API is actually faster (~300ms) than local whisper-small
would have been.

Intent layer: Ollama serves qwen2.5-coder:1.5b locally. The LLM
returns a structured JSON intent that the tool router uses to
decide which action to take.
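To make the routing concrete, here is a minimal sketch of a dispatch table keyed on the intent field of the LLM's JSON reply. The tool names mirror the post; the handler bodies and the exact JSON shape (`{"intent": ..., "args": ...}`) are illustrative assumptions, not the project's actual schema.

```python
import json

# Stub handlers standing in for the real tools.
def create_file(args): return f"created {args.get('filename')}"
def write_code(args): return "code written"
def summarize(args): return "summary"
def general_chat(args): return "chat reply"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def route(raw_llm_reply: str) -> str:
    # Parse the structured intent and dispatch; unknown intents
    # fall back to general chat.
    intent = json.loads(raw_llm_reply)
    handler = TOOLS.get(intent.get("intent"), general_chat)
    return handler(intent.get("args", {}))
```

Unknown or missing intent names degrade gracefully to `general_chat` instead of raising.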

Tool layer: Four tools — create_file, write_code, summarize,
general_chat. All file writes are sandboxed to output/.

UI layer: Gradio displays transcription, detected intent, action
taken, and a full pipeline trace with per-stage latency.
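The sequential stages and the per-stage latency trace shown in the UI can be sketched like this. The stage functions here are stand-ins for the real Groq, Ollama, and tool calls; only the trace mechanism is the point.

```python
import time

def run_pipeline(payload, stages):
    """Run stages in order, recording (name, latency_ms) for each.

    stages: list of (name, callable) pairs, each callable taking the
    previous stage's output. Stand-in for STT -> intent -> tool.
    """
    trace = []
    data = payload
    for name, fn in stages:
        t0 = time.perf_counter()
        data = fn(data)
        trace.append((name, round((time.perf_counter() - t0) * 1000, 1)))
    return data, trace
```

Because stages run strictly one after another, only one model's working set is resident at a time, which is what keeps peak VRAM low.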

Hardware Constraints and Decisions
My machine: Intel i5-12500H, RTX 3050 (4GB VRAM), 15GB RAM.

The core constraint: 4GB VRAM cannot hold both a Whisper model
and an LLM simultaneously.

Decision 1 — STT via Groq API
Running whisper-small locally uses ~1.5GB VRAM. That leaves
only 2.5GB for the LLM, which isn't enough for a useful model.
Offloading STT to Groq frees the entire 4GB for the LLM and
actually improves latency.

Decision 2 — qwen2.5-coder:1.5b via Ollama
A 1.5B model at Q4 quantization fits comfortably in ~1.5GB VRAM.
I initially tried the 7b variant but it exceeded available VRAM
and caused Ollama to offload to RAM, significantly slowing
inference.

Decision 3 — Sequential pipeline
STT completes before Ollama is called. This keeps peak VRAM
usage under 2GB at any given time.

Challenges I Faced

  1. VRAM management
    Loading two models simultaneously caused OOM errors. Solved
    by switching STT to Groq and keeping only the LLM local.

  2. Intent JSON parsing
    Ollama sometimes returns malformed JSON or wraps it in
    markdown code fences. Solved with a robust parser that
    strips fences and falls back to keyword matching if JSON
    parsing fails entirely.
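A minimal version of such a parser might look like this. The exact keyword list and intent names are assumptions for illustration; the fence-stripping and keyword fallback are the technique described above.

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Strip markdown code fences, try JSON, fall back to keywords."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # LLM output was not valid JSON: guess intent from keywords.
        lowered = raw.lower()
        for keyword, intent in [("file", "create_file"),
                                ("code", "write_code"),
                                ("summar", "summarize")]:
            if keyword in lowered:
                return {"intent": intent, "args": {}}
        return {"intent": "general_chat", "args": {}}
```

The keyword fallback is lossy (it drops the arguments), but it keeps the pipeline from crashing on a malformed reply.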

  3. Output sandboxing
    Naive file creation allowed path traversal (e.g.
    ../../etc/passwd). Solved with path normalization and
    checking that the resolved path starts with the output/
    directory.
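The check can be done with path resolution, roughly as follows (a sketch, not the project's exact code; requires Python 3.9+ for `Path.is_relative_to`):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(user_filename: str) -> Path:
    """Resolve the requested filename and refuse anything that
    escapes the output/ sandbox (e.g. ../../etc/passwd)."""
    candidate = (OUTPUT_DIR / user_filename).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"path escapes sandbox: {user_filename}")
    return candidate
```

Resolving before comparing is the important part: a plain string prefix check on the unresolved path would still let `..` components slip through.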

  4. Gradio mic input format
    Gradio returns microphone audio as a tuple
    (sample_rate, numpy_array), not a file path. I had to write it
    to a temporary file before passing it to the Groq API.
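A sketch of that conversion, assuming mono audio (Gradio's numpy output may be float32, so it is cast to 16-bit PCM first); the function name is my own, not from the project:

```python
import tempfile
import wave

import numpy as np

def audio_tuple_to_wav(sample_rate: int, samples: np.ndarray) -> str:
    """Write a Gradio (sample_rate, samples) tuple to a temp WAV file."""
    if samples.dtype != np.int16:
        # Float samples in [-1, 1] -> 16-bit PCM.
        samples = (samples * 32767).astype(np.int16)
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    with wave.open(tmp.name, "wb") as wf:
        wf.setnchannels(1)       # assumes mono input
        wf.setsampwidth(2)       # 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(samples.tobytes())
    return tmp.name
```

The resulting path can then be opened and uploaded to the transcription endpoint like any other audio file.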

What I'd Do Differently at Scale
For a production version of this system, I would:

  • Replace Ollama with Triton Inference Server for proper model serving with batching and metrics endpoints.
  • Add a message queue (Redis or RabbitMQ) between the UI and pipeline so multiple users don't block each other.
  • Replace the flat logger with structured JSON logs shipped to an observability stack (Grafana + Loki).
  • Add model versioning — config.yaml currently hardcodes model names. A proper MLOps setup uses a model registry.
  • Containerize STT locally using a sidecar so the pipeline has no external API dependency in production.

Links
GitHub: https://github.com/rautaditya2606/Aditya_Raut_Mem0_AI
Demo: https://youtu.be/rhGIQvi4Y74
