Building a Voice-Controlled Local AI Agent with Compound Commands, Memory & Human-in-the-Loop
Published as part of the Mem0 AI MLOps & AI Infra Internship Assignment
Introduction
What if you could speak to your computer and have it write code, create files, or summarize text — with full transparency into every step of the pipeline?
That's exactly what I built: a voice-controlled AI agent that transcribes audio, classifies intent (including compound multi-intent commands), executes the right tools, maintains persistent session memory, and requires human confirmation before touching your filesystem.
In this article I'll walk through the architecture, model choices, the three bonus features I implemented, and the real bugs I hit along the way.
Architecture Overview
The system is a linear pipeline with four stages:
Audio Input
│
▼
Speech-to-Text (Groq Whisper API — whisper-large-v3)
│
▼
Intent Classification (Groq llama-3.3-70b — structured JSON)
│
▼
Tool Execution (write_code | create_file | summarize | general_chat)
│
├── output/ ← all files created here only
└── session_history.json ← persistent memory
│
▼
Gradio UI (5-step pipeline display + session memory panel)
Each stage is an isolated Python module (stt.py, intent.py, tools.py, memory.py) wired together by a thin orchestrator in app.py. This makes it easy to swap any component independently.
Stage 1: Speech-to-Text — Why Groq Whisper
The assignment recommended running HuggingFace Whisper locally. I tried this first — Whisper large-v3 requires ~4 GB VRAM and runs at about 0.1× realtime on a CPU-only machine. That's a 10-second wait for a 1-second voice clip.
My solution: Groq Whisper API.
Groq runs the same whisper-large-v3 model on their LPU hardware, delivering roughly 200× realtime speed. The API is free, requires one pip install, and the same API key is reused for the LLM too — no extra signup.
from pathlib import Path
from groq import Groq

client = Groq(api_key=GROQ_API_KEY)
with open(audio_path, "rb") as f:
    response = client.audio.transcriptions.create(
        file=(Path(audio_path).name, f.read()),
        model="whisper-large-v3",
        response_format="text",
    )
transcription = response  # returns a plain string when response_format="text"
For anyone wanting a fully local setup, I kept a commented-out _transcribe_local() function using faster-whisper with the base model — much lighter at ~150 MB and fast enough on CPU.
Lesson: Don't let hardware constraints kill your UX. Document the workaround clearly and provide a fallback.
Stage 2: Intent Classification with Compound Command Support
This is the most interesting part. Given a sentence like "Summarize this text and save it to summary.txt", the system needs to identify two intents — not just one.
I used Groq llama-3.3-70b and designed the system prompt to return a JSON array of intent objects instead of a single object. This unlocks compound command support:
[
  {
    "intent": "summarize",
    "text_to_summarize": "this text",
    "save_result_to": "summary.txt"
  },
  {
    "intent": "create_file",
    "filename": "summary.txt",
    "content_hint": "save the summary result"
  }
]
The tool executor then runs each step in order, passing the output of one step as input to the next — so the summary text is automatically injected into the file creation step without any extra user input.
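In practice the model's reply needs a little normalization before it can be executed: it sometimes wraps the JSON in markdown fences, or returns a single object instead of an array. The helper below is a minimal sketch of that parsing step (the name `parse_intents` is hypothetical, not the function used in `intent.py`):

```python
import json

def parse_intents(raw_reply: str) -> list[dict]:
    """Normalize an LLM reply into an ordered list of intent dicts.

    Hypothetical helper: handles markdown fences around the JSON and
    accepts either a single intent object or an array of them.
    """
    text = raw_reply.strip()
    if text.startswith("```"):
        # Strip the surrounding fence, plus an optional "json" language tag
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    parsed = json.loads(text)
    return parsed if isinstance(parsed, list) else [parsed]
```

With this in place, the executor can always iterate over a list, whether the user issued one command or a compound one.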
Four supported intents:
- write_code — generate code and save to file
- create_file — create a file with generated content
- summarize — summarize provided text
- general_chat — conversational fallback
Why Groq LLM instead of a local model?
I initially tried Ollama with llama3 locally. It worked but was slow (~8 seconds per classification on CPU). Groq's hosted llama-3.3-70b responds in under 1 second and is free — a clear win for this use case.
Stage 3: Tool Execution with Safety Constraint
All tools that write files use a hard safety rule — every file is written inside the output/ directory only:
def _safe_path(filename: str) -> Path:
    safe_name = Path(filename).name  # strips "../" and any directory components
    return OUTPUT_DIR / safe_name
For compound commands, the executor passes results between steps:
def execute_all_tools(details: list, original_command: str):
    last_result_text = None
    for detail in details:
        intent = detail.get("intent")
        if intent == "create_file" and last_result_text:
            # Inject previous step's output (e.g. summary) into the file
            action, result, err = tool_create_file(detail, original_command,
                                                   injected_content=last_result_text)
        else:
            action, result, err = execute_tool(intent, detail, original_command)
        last_result_text = extract_plain_text(result)
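The `extract_plain_text` helper isn't shown above. A minimal sketch, assuming each tool returns either a plain string or a dict with a "text" key (an assumption, since the real return shape isn't shown here), might look like:

```python
def extract_plain_text(result):
    # Hypothetical helper: the real tool return shape may differ.
    if result is None:
        return None
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        # Assumed convention: structured results carry their text under "text"
        return result.get("text")
    return str(result)
```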
Bonus 1: Human-in-the-Loop Confirmation
Before any file operation, the pipeline pauses and shows confirmation buttons in the UI. The file is never created until the user explicitly clicks "Yes, proceed".
This required careful state management in Gradio — I stored the pending intent in a global _pending dict and only executed it on confirmation:
# Stage 1: classify, detect file op, pause
if intent in ("write_code", "create_file"):
    _pending = {"transcription": transcription, "details": details}
    return ..., gr.update(visible=True)  # show confirm buttons

# Stage 2: user clicks confirm
def confirm_execution():
    details = _pending["details"]
    action, result, err = execute_all_tools(details, ...)
Bonus 2: Persistent Session Memory
Every action is logged to output/session_history.json on disk — not just in memory. This means the history panel survives page refreshes and server restarts.
def add_entry(intent, command, action, status):
    history = _load()  # read from disk
    history.append({
        "time": datetime.now().strftime("%H:%M:%S"),
        "date": datetime.now().strftime("%Y-%m-%d"),
        "intent": intent,
        "command": command[:80],
        "action": action,
        "status": status,
    })
    _save(history)  # write back to disk
The UI loads this file on startup so the memory panel is always populated, even after a restart. Cancelled operations are also logged with a 🚫 status.
Bonus 3: Compound Commands
As described in Stage 2, a single voice input can now trigger multiple sequential actions. The intent classifier returns an ordered list of steps, and the tool executor runs them in sequence, piping output between steps automatically.
Example:
"Summarize: AI is transforming every industry — and save it to summary.txt"
Result:
- Summarizes the text
- Saves the summary to output/summary.txt
- Logs both actions as one entry in session memory
- Displays both step results in the UI
Challenges I Faced
1. Groq returns a string, not an object for STT
When using response_format="text", the API returns a plain string — not an object with a .text attribute. This caused an AttributeError on first run. The fix handles both shapes:
text = response if isinstance(response, str) else response.text
2. Gradio 6 breaking changes
Upgrading to Gradio 6 broke two things: show_copy_button was removed from gr.Textbox, and css moved from gr.Blocks() to demo.launch(). Always check the changelog when upgrading UI frameworks.
3. Gemini free tier quota exhausted immediately
I initially used Google Gemini for the LLM. The free tier quota was exhausted after just a few test calls, returning errors that reported a quota limit of 0. Switching to Groq solved this — their free tier is genuinely generous, with thousands of requests per day.
4. Confirmation buttons disappearing too fast
In Gradio 6, toggling visible on individual buttons inside a Row behaved inconsistently. The fix was to toggle the entire gr.Row container's visibility instead of individual buttons.
5. Session history lost on refresh
Storing history in a Python global variable means it resets on every page refresh. Moving to a JSON file on disk solved this completely.
Project Structure
voice-ai-agent/
├── app.py # Gradio UI + pipeline orchestration
├── src/
│ ├── stt.py # Speech-to-Text (Groq Whisper)
│ ├── intent.py # Intent classification + compound command support
│ ├── tools.py # Tool execution with result chaining
│ └── memory.py # Persistent session memory (JSON on disk)
├── output/ # All generated files + session_history.json
├── requirements.txt
├── .env.example
└── README.md
Tech Stack Summary
| Component | Tool | Cost |
|---|---|---|
| STT | Groq Whisper (whisper-large-v3) | Free |
| LLM | Groq llama-3.3-70b-versatile | Free |
| UI | Gradio 6 | Free / Open Source |
| Memory | JSON file on disk | Free |
| Language | Python 3.12 | Free |
Total API cost: $0 — one free Groq key handles everything.
What I Would Add Next
- Graceful degradation — better error messages for unintelligible audio or unmapped intents
- Model benchmarking — compare Groq vs local Ollama on speed and accuracy
- More compound patterns — e.g. "Write a retry function and also write tests for it"
- File editing — "Open retry.py and add logging to it"
Conclusion
Building this agent taught me that the hardest part of an AI pipeline isn't any single model — it's the glue between components. Making structured JSON output reliable, handling Gradio version changes, managing state for multi-step confirmation flows, and persisting memory across sessions are all engineering challenges that tutorials skip over.
The compound command feature was the most satisfying to build — it turns a simple classifier into something that genuinely understands user intent at a higher level.
Full source code: GitHub
Demo video: Loom Demo
Built with: Python · Gradio · Groq API (Whisper + llama-3.3-70b)