S@ndeep Kum@r

Building a Voice-Controlled Local AI Agent with Compound Commands, Memory & Human-in-the-Loop


Published as part of the Mem0 AI MLOps & AI Infra Internship Assignment


Introduction

What if you could speak to your computer and have it write code, create files, or summarize text — with full transparency into every step of the pipeline?

That's exactly what I built: a voice-controlled AI agent that transcribes audio, classifies intent (including compound multi-intent commands), executes the right tools, maintains persistent session memory, and requires human confirmation before touching your filesystem.

In this article I'll walk through the architecture, model choices, the three bonus features I implemented, and the real bugs I hit along the way.


Architecture Overview

The system is a linear pipeline with four stages:

Audio Input
    │
    ▼
Speech-to-Text  (Groq Whisper API — whisper-large-v3)
    │
    ▼
Intent Classification  (Groq llama-3.3-70b — structured JSON)
    │
    ▼
Tool Execution  (write_code | create_file | summarize | general_chat)
    │
    ├── output/                   ← all files created here only
    └── session_history.json      ← persistent memory
    │
    ▼
Gradio UI  (5-step pipeline display + session memory panel)

Each stage is an isolated Python module (stt.py, intent.py, tools.py, memory.py) wired together by a thin orchestrator in app.py. This makes it easy to swap any component independently.


Stage 1: Speech-to-Text — Why Groq Whisper

The assignment recommended running HuggingFace Whisper locally. I tried this first — Whisper large-v3 requires ~4 GB VRAM and runs at about 0.1× realtime on a CPU-only machine. That's a 10-second wait for a 1-second voice clip.

My solution: Groq Whisper API.

Groq runs the same whisper-large-v3 model on their LPU hardware, delivering roughly 200× realtime speed. The API is free, requires one pip install, and the same API key is reused for the LLM too — no extra signup.

from pathlib import Path

from groq import Groq

client = Groq(api_key=GROQ_API_KEY)
with open(audio_path, "rb") as f:
    response = client.audio.transcriptions.create(
        file=(Path(audio_path).name, f.read()),
        model="whisper-large-v3",
        response_format="text",
    )
transcription = response  # returns a plain string when response_format="text"

For anyone wanting a fully local setup, I kept a commented-out _transcribe_local() function using faster-whisper with the base model — much lighter at ~150 MB and fast enough on CPU.

Lesson: Don't let hardware constraints kill your UX. Document the workaround clearly and provide a fallback.


Stage 2: Intent Classification with Compound Command Support

This is the most interesting part. Given a sentence like "Summarize this text and save it to summary.txt", the system needs to identify two intents — not just one.

I used Groq llama-3.3-70b and designed the system prompt to return a JSON array of intent objects instead of a single object. This unlocks compound command support:

[
  {
    "intent": "summarize",
    "text_to_summarize": "this text",
    "save_result_to": "summary.txt"
  },
  {
    "intent": "create_file",
    "filename": "summary.txt",
    "content_hint": "save the summary result"
  }
]
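In practice the model occasionally returns a bare object instead of an array, or wraps the JSON in markdown fences. A small normalizer (illustrative names, not the project's actual code) keeps the executor's input shape predictable:

```python
import json

def parse_intents(raw: str) -> list[dict]:
    """Normalize the LLM's JSON output into a list of intent dicts.

    Handles two common quirks: markdown code fences around the JSON,
    and a bare object returned instead of a one-element array.
    """
    cleaned = raw.strip()
    if cleaned.startswith("```"):
        # Strip ```json ... ``` fences the model sometimes adds
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    parsed = json.loads(cleaned)
    if isinstance(parsed, dict):
        parsed = [parsed]  # single intent -> one-element list
    return parsed
```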

The tool executor then runs each step in order, passing the output of one step as input to the next — so the summary text is automatically injected into the file creation step without any extra user input.

Four supported intents:

  • write_code — generate code and save to file
  • create_file — create a file with generated content
  • summarize — summarize provided text
  • general_chat — conversational fallback
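
Routing from intent string to tool function is naturally a dispatch table with general_chat as the fallback. A minimal sketch: the handler names and the (action, result, err) return shape follow the article's snippets, but the real tools.py may differ:

```python
# Hypothetical handlers -- the real project's signatures may differ.
def tool_write_code(detail, command):   return ("write_code", "generated code", None)
def tool_create_file(detail, command):  return ("create_file", "file created", None)
def tool_summarize(detail, command):    return ("summarize", "summary text", None)
def tool_general_chat(detail, command): return ("general_chat", "chat reply", None)

TOOL_DISPATCH = {
    "write_code":  tool_write_code,
    "create_file": tool_create_file,
    "summarize":   tool_summarize,
}

def execute_tool(intent: str, detail: dict, command: str):
    # Unknown or missing intents fall back to conversational chat
    handler = TOOL_DISPATCH.get(intent, tool_general_chat)
    return handler(detail, command)
```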

Why Groq LLM instead of a local model?

I initially tried Ollama with llama3 locally. It worked but was slow (~8 seconds per classification on CPU). Groq's hosted llama-3.3-70b responds in under 1 second and is free — a clear win for this use case.


Stage 3: Tool Execution with Safety Constraint

All tools that write files use a hard safety rule — every file is written inside the output/ directory only:

def _safe_path(filename: str) -> Path:
    safe_name = Path(filename).name  # strips "../" and any directory components
    return OUTPUT_DIR / safe_name

For compound commands, the executor passes results between steps:

def execute_all_tools(details: list, original_command: str):
    last_result_text = None
    for detail in details:
        intent = detail.get("intent")
        if intent == "create_file" and last_result_text:
            # Inject previous step's output (e.g. summary) into the file
            action, result, err = tool_create_file(
                detail, original_command, injected_content=last_result_text
            )
        else:
            action, result, err = execute_tool(intent, detail, original_command)
        last_result_text = extract_plain_text(result)

Bonus 1: Human-in-the-Loop Confirmation

Before any file operation, the pipeline pauses and shows confirmation buttons in the UI. The file is never created until the user explicitly clicks "Yes, proceed".

This required careful state management in Gradio — I stored the pending intent in a global _pending dict and only executed it on confirmation:

# Stage 1: classify, detect file op, pause
if intent in ("write_code", "create_file"):
    _pending = {"transcription": transcription, "details": details}
    return ..., gr.update(visible=True)   # show confirm buttons

# Stage 2: user clicks confirm
def confirm_execution():
    details = _pending["details"]
    action, result, err = execute_all_tools(details, ...)

Bonus 2: Persistent Session Memory

Every action is logged to output/session_history.json on disk — not just in memory. This means the history panel survives page refreshes and server restarts.

def add_entry(intent, command, action, status):
    history = _load()   # read from disk
    history.append({
        "time": datetime.now().strftime("%H:%M:%S"),
        "date": datetime.now().strftime("%Y-%m-%d"),
        "intent": intent,
        "command": command[:80],
        "action": action,
        "status": status,
    })
    _save(history)      # write back to disk

The UI loads this file on startup so the memory panel is always populated, even after a restart. Cancelled operations are also logged with a 🚫 status.


Bonus 3: Compound Commands

As described in Stage 2, a single voice input can now trigger multiple sequential actions. The intent classifier returns an ordered list of steps, and the tool executor runs them in sequence, piping output between steps automatically.

Example:

"Summarize: AI is transforming every industry — and save it to summary.txt"

Result:

  1. Summarizes the text
  2. Saves the summary to output/summary.txt
  3. Logs both actions as one entry in session memory
  4. Displays both step results in the UI

Challenges I Faced

1. Groq returns a string, not an object for STT

When using response_format="text", the API returns a plain string — not an object with a .text attribute. This caused an AttributeError on first run:

text = response if isinstance(response, str) else response.text

2. Gradio 6 breaking changes

Upgrading to Gradio 6 broke two things: show_copy_button was removed from gr.Textbox, and css moved from gr.Blocks() to demo.launch(). Always check the changelog when upgrading UI frameworks.

3. Gemini free tier quota exhausted immediately

I initially used Google Gemini for the LLM. The free tier quota was exhausted after just a few test calls — the API started returning quota errors reporting a limit of 0. Switching to Groq solved this; their free tier is genuinely generous, allowing thousands of requests per day.

4. Confirmation buttons disappearing too fast

In Gradio 6, toggling visible on individual buttons inside a Row behaved inconsistently. The fix was to toggle the entire gr.Row container's visibility instead of individual buttons.

5. Session history lost on refresh

Storing history in a Python global variable means it resets on every page refresh. Moving to a JSON file on disk solved this completely.


Project Structure

voice-ai-agent/
├── app.py              # Gradio UI + pipeline orchestration
├── src/
│   ├── stt.py          # Speech-to-Text (Groq Whisper)
│   ├── intent.py       # Intent classification + compound command support
│   ├── tools.py        # Tool execution with result chaining
│   └── memory.py       # Persistent session memory (JSON on disk)
├── output/             # All generated files + session_history.json
├── requirements.txt
├── .env.example
└── README.md

Tech Stack Summary

| Component | Tool | Cost |
| --- | --- | --- |
| STT | Groq Whisper (whisper-large-v3) | Free |
| LLM | Groq llama-3.3-70b-versatile | Free |
| UI | Gradio 6 | Free / Open Source |
| Memory | JSON file on disk | Free |
| Language | Python 3.12 | Free |

Total API cost: $0 — one free Groq key handles everything.


What I Would Add Next

  • Graceful degradation — better error messages for unintelligible audio or unmapped intents
  • Model benchmarking — compare Groq vs local Ollama on speed and accuracy
  • More compound patterns — e.g. "Write a retry function and also write tests for it"
  • File editing — e.g. "Open retry.py and add logging to it"

Conclusion

Building this agent taught me that the hardest part of an AI pipeline isn't any single model — it's the glue between components. Making structured JSON output reliable, handling Gradio version changes, managing state for multi-step confirmation flows, and persisting memory across sessions are all engineering challenges that tutorials skip over.

The compound command feature was the most satisfying to build — it turns a simple classifier into something that genuinely understands user intent at a higher level.

Full source code: GitHub
Demo video: Loom Demo


Built with: Python · Gradio · Groq API (Whisper + llama-3.3-70b)
