DEV Community

AKSHI SHARMA


Building a Voice-Controlled Local AI Agent: Architecture, Models & Lessons Learned

The Goal

I recently built a voice-controlled AI agent for the Mem0 Generative AI internship assignment. The system takes an audio command, converts it to text, classifies the user's intent, and executes the appropriate local action — all displayed in a web UI.

Here's what I built, how it works, and what I learned.


System Architecture

The pipeline has five stages:

Audio → STT → Intent Classification → Tool Execution → UI Display

Each stage is independently swappable — the system degrades gracefully if local hardware can't keep up.

| Stage       | Primary                      |
| ----------- | ---------------------------- |
| Audio Input | Gradio mic / file upload     |
| STT         | Groq Whisper API             |
| Intent      | Groq llama-3.3-70b-versatile |
| Tools       | Python stdlib                |
| UI          | Gradio 6.x                   |
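As a rough sketch, the whole pipeline is a chain of stage functions. The implementations below are stand-ins for illustration — the real system calls Groq Whisper, the Groq LLM, and the tool handlers in their place:

```python
# Minimal sketch of the five-stage pipeline. Every function here is a
# hypothetical stand-in so the flow can be seen end to end.

def transcribe(audio_path: str) -> str:
    # Stand-in for the Groq Whisper API call (STT stage)
    return "write a retry function"

def classify_intent(text: str) -> list[dict]:
    # Stand-in for the LLM call; returns the JSON "actions" list
    return [{"type": "write_code", "details": text}]

def execute(action: dict) -> str:
    # Stand-in for the local tool handlers
    return f"executed {action['type']}"

def run_pipeline(audio_path: str) -> dict:
    text = transcribe(audio_path)                # STT
    actions = classify_intent(text)              # intent classification
    results = [execute(a) for a in actions]      # tool execution
    return {"transcript": text, "results": results}  # rendered in the UI
```

Because each stage is just a function boundary, swapping one implementation for another never touches the rest of the chain.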

Model Choices

Speech-to-Text: Groq Whisper API

I used Groq's hosted Whisper API for transcription. It is near-instant for short clips, requires no local GPU, and has a generous free tier — making it ideal for laptops without powerful hardware.

LLM: Groq llama-3.3-70b-versatile

For intent classification and all text generation (code, summarization, chat), I used Groq's llama-3.3-70b-versatile. Groq's LPU hardware makes inference extremely fast (~200 tokens/s), which feels nearly instant in practice.

For intent classification, I structured the LLM output as JSON:

```json
{
  "actions": [
    {
      "type": "write_code",
      "confidence": "high",
      "details": "Generate a Python retry function",
      "params": {
        "filename": "retry.py",
        "language": "python",
        "description": "a retry decorator with exponential backoff"
      }
    }
  ]
}
```

This made it trivial to support compound commands — the actions array simply holds multiple items.
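With that schema, executing a compound command is just a loop over `actions`. A minimal sketch, with a hypothetical handler registry (the handlers here are placeholders, not the real tools):

```python
import json

# Hypothetical handler registry keyed by action type
HANDLERS = {
    "write_code": lambda params: f"wrote {params.get('filename', 'file')}",
    "summarize": lambda params: "summary generated",
}

def dispatch(llm_output: str) -> list[str]:
    """Run every action in the LLM's JSON response, in order."""
    payload = json.loads(llm_output)
    results = []
    for action in payload.get("actions", []):
        handler = HANDLERS.get(action["type"])
        if handler is None:
            results.append(f"unknown action: {action['type']}")
            continue
        results.append(handler(action.get("params", {})))
    return results
```

A single command produces a one-element list; a compound command produces one result per action, with no special casing.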


Key Implementation Decisions

Safety Sandbox

All file operations are restricted to an output/ directory. I implemented path traversal protection:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def _safe_path(filename: str) -> Path:
    # Keep only the base name, discarding any directory components
    safe = OUTPUT_DIR / Path(filename).name
    resolved = safe.resolve()
    # is_relative_to avoids the prefix pitfall of a plain
    # str.startswith check (e.g. "output_evil" matching "output")
    if not resolved.is_relative_to(OUTPUT_DIR):
        raise ValueError("Path traversal attempt blocked")
    return resolved
```

Human-in-the-Loop

Before executing any file write, the UI asks for confirmation. The pending action is stored in gr.State and executed only on the confirmation click.
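The pattern, stripped of the Gradio wiring, looks like this — a plain dict stands in for `gr.State` here, and the function names are illustrative:

```python
# Sketch of the pending-action (human-in-the-loop) pattern.
# In the real UI the state dict is a gr.State object and confirm()
# is wired to a confirmation button's click event.

def propose(state: dict, action: dict) -> str:
    state["pending"] = action                  # hold the action, don't run it
    return f"Confirm: {action['type']}?"       # shown to the user

def confirm(state: dict) -> str:
    action = state.pop("pending", None)        # consume exactly once
    if action is None:
        return "Nothing to confirm."
    return f"Executed {action['type']}"        # real code writes the file here
```

Popping the pending action on confirmation means a double-click can never execute the same write twice.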

Session Memory

I maintain two separate state structures:

  • Chat context — OpenAI-style message list passed to the LLM on each call
  • Action log — displayed in the UI history panel so users can review what the agent has done
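In sketch form, with illustrative field names (not my exact code):

```python
# Two separate state structures, updated together after each turn.
chat_context = []   # OpenAI-style messages, passed to the LLM on each call
action_log = []     # human-readable entries for the UI history panel

def record_turn(user_text: str, assistant_text: str, action: str) -> None:
    chat_context.append({"role": "user", "content": user_text})
    chat_context.append({"role": "assistant", "content": assistant_text})
    action_log.append(action)
```

Keeping them separate means the LLM sees clean conversational history while the UI shows a concise audit trail, and neither format constrains the other.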

Challenges

1. JSON parsing from LLMs

LLMs sometimes wrap JSON output in markdown fences or add preamble text. I wrote a robust parser that strips fences, attempts json.loads, falls back to regex extraction, and as a last resort defaults to a chat intent.
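A condensed version of that fallback chain might look like this (hypothetical function name, not my exact implementation):

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Parse LLM output into an actions dict, tolerating markdown
    fences and preamble text; falls back to a chat intent."""
    # 1. Strip leading/trailing ```json fences if present
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # 2. Direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # 3. Regex fallback: grab the outermost {...} span and retry
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 4. Last resort: treat the whole reply as a chat response
    return {"actions": [{"type": "chat", "details": raw}]}
```

Each layer only fires when the previous one fails, so well-formed output takes the fast path.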

2. Compound command detection

Getting the LLM to reliably emit two action objects for commands like "summarize this and save to a file" required careful prompt engineering. Showing an explicit schema example in the system prompt dramatically improved reliability.
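The schema example in the system prompt looked roughly like this — the wording below is illustrative, not the prompt I shipped:

```python
# Illustrative system prompt showing the JSON schema plus a compound
# example; the explicit example is what made multi-action output reliable.
SYSTEM_PROMPT = """You are an intent classifier. Respond ONLY with JSON:
{"actions": [{"type": "...", "confidence": "...", "details": "...",
              "params": {...}}]}
If the user asks for several things, emit one action object per request.
Example: "summarize this and save it to a file" ->
{"actions": [{"type": "summarize", ...}, {"type": "write_file", ...}]}"""
```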

3. Gradio version compatibility

Gradio 6 removed several parameters that existed in older versions (`show_download_button`, `GoogleFont`). Updating the code to match the new API was a key debugging step.


What I'd Add Next

  • Streaming LLM responses — show code as it's generated token by token
  • Wake word detection — "Hey Agent" to trigger recording
  • Plugin system — let users define custom tool handlers in YAML
  • Persistent memory — save session history to disk across restarts
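Persistent memory, for instance, could be as small as serializing the two session-state structures to JSON. A sketch under that assumption (the file location and function names are hypothetical):

```python
import json
from pathlib import Path

STATE_FILE = Path("output/session.json")  # hypothetical location

def save_session(chat_context: list, action_log: list) -> None:
    # Persist both state structures in one document
    STATE_FILE.parent.mkdir(exist_ok=True)
    STATE_FILE.write_text(
        json.dumps({"chat": chat_context, "log": action_log}, indent=2))

def load_session() -> tuple[list, list]:
    # Restore prior state, or start fresh on first run
    if not STATE_FILE.exists():
        return [], []
    data = json.loads(STATE_FILE.read_text())
    return data["chat"], data["log"]
```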

Source Code

GitHub: https://github.com/akshisharmaaa/Voice-Controlled-Local-AI-Agent


Thanks for reading! If you're building something similar or have questions about the architecture, drop a comment below.
