<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pandharinath Maske</title>
    <description>The latest articles on DEV Community by Pandharinath Maske (@pandharinath_maske_5bf6ca).</description>
    <link>https://dev.to/pandharinath_maske_5bf6ca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3886002%2F4b4f1f49-27a8-4815-ab9b-69a0c02ba7e1.png</url>
      <title>DEV Community: Pandharinath Maske</title>
      <link>https://dev.to/pandharinath_maske_5bf6ca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pandharinath_maske_5bf6ca"/>
    <language>en</language>
    <item>
      <title>Building ARIA: A Production-Grade Voice Agent</title>
      <dc:creator>Pandharinath Maske</dc:creator>
      <pubDate>Sat, 18 Apr 2026 12:05:03 +0000</pubDate>
      <link>https://dev.to/pandharinath_maske_5bf6ca/building-aria-a-production-grade-voice-agent-4aj0</link>
      <guid>https://dev.to/pandharinath_maske_5bf6ca/building-aria-a-production-grade-voice-agent-4aj0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;How I built a local AI agent that hears your voice, reasons with a 70B LLM, executes real tools on your machine, streams every step live to the browser, and pauses to ask your permission before doing anything dangerous — all with clean, layered architecture and zero compromise on engineering quality.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why This Project Exists
&lt;/h2&gt;

&lt;p&gt;There is a specific gap between "AI chatbot" and "AI agent." A chatbot generates text. An agent acts in the world — it reads and writes files, runs commands, uses tools, and makes decisions about &lt;em&gt;when&lt;/em&gt; to do each of those things.&lt;/p&gt;

&lt;p&gt;Most agent tutorials bridge that gap in the worst possible way: a single Python function that calls &lt;code&gt;subprocess.run()&lt;/code&gt; after an LLM says so, with no confirmation, no sandboxing, no session history, and no way to understand what just happened. That's not an agent. That's a footgun with an API wrapper.&lt;/p&gt;

&lt;p&gt;ARIA — Audio Reasoning Intelligence Agent — is my answer to the question: &lt;em&gt;what does a production-grade local agent actually look like?&lt;/em&gt; It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts voice from your microphone or an audio file&lt;/li&gt;
&lt;li&gt;Transcribes it with Groq's Whisper endpoint in under 600ms&lt;/li&gt;
&lt;li&gt;Reasons about what you want using a 70B LLM in a proper ReAct loop&lt;/li&gt;
&lt;li&gt;Executes real tools on your machine (create files, write code, read files, summarize text, run shell commands)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pauses before any destructive operation&lt;/strong&gt; and asks you to confirm&lt;/li&gt;
&lt;li&gt;Streams every pipeline stage live to the browser via SSE&lt;/li&gt;
&lt;li&gt;Persists full conversation history and HITL state in SQLite, keyed by session&lt;/li&gt;
&lt;li&gt;Can be reconfigured from Groq to OpenAI to Anthropic to Ollama by changing one line in a YAML file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article I'll walk through every architectural decision — the design patterns, the tradeoffs, the real bugs that cost me hours, and the things I'd do differently. This is not a tutorial. It's the engineering breakdown I wanted to read before I started.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture at a Glance
&lt;/h2&gt;

&lt;p&gt;Before diving into code, here's the system from 10,000 feet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                    Input Layer                          │
│   Voice (mic/file) ──→ STT node ──→ HumanMessage        │
│   Text ─────────────────────────→ HumanMessage          │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│               ReAct Loop (LangGraph 0.4)                │
│                                                         │
│   ┌─────────────┐     tool_calls     ┌──────────────┐   │
│   │ agent_node  │ ─────────────────→ │  ToolNode    │   │
│   │ 70B + tools │ ←───────────────── │  prebuilt    │   │
│   └──────┬──────┘    tool_results    └──────────────┘   │
│          │                                               │
│          │ unsafe tool?  interrupt()                     │
│          ▼                                               │
│   ┌─────────────┐ ←── Command(resume=True/False)        │
│   │  HITL gate  │ ──→ AsyncSqliteSaver checkpoint       │
│   └─────────────┘                                       │
└───────────────────────┬─────────────────────────────────┘
                        │
┌───────────────────────▼─────────────────────────────────┐
│              FastAPI + SSE Transport                    │
│   astream() → StreamingResponse(text/event-stream)      │
│   Browser: EventSource / ReadableStream reader          │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The topology is a &lt;strong&gt;ReAct loop&lt;/strong&gt;, not a linear pipeline. The agent node runs, decides whether to call tools, calls them, observes results, and decides again — iterating until it has enough information to give a final answer. HITL is not a separate node in the graph; it's a call to &lt;code&gt;interrupt()&lt;/code&gt; &lt;em&gt;inside&lt;/em&gt; &lt;code&gt;agent_node&lt;/code&gt;, which pauses execution mid-function, checkpoints everything to SQLite, and waits for a &lt;code&gt;Command(resume=True/False)&lt;/code&gt; from the API layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ai-voice-agent/
│
├── config.yaml                   ← Single source of truth for ALL settings
├── .env                          ← API keys only (git-ignored)
├── server.py                     ← FastAPI app + lifespan + SSE endpoints
│
├── agent/
│   ├── state.py                  ← AgentState TypedDict + add_messages reducer
│   ├── llm.py                    ← LLM factory: Groq / OpenAI / Anthropic / Ollama
│   ├── stt.py                    ← BaseSTT + GroqSTT + OpenAISTT + factory
│   ├── tools.py                  ← @tool functions + ALL_TOOLS registry
│   ├── nodes.py                  ← stt_node + agent_node (HITL inside) + ToolNode
│   └── graph.py                  ← Graph assembly + pipeline functions
│
├── config/
│   ├── settings.py               ← Pydantic BaseModel loader
│   └── logging_config.py         ← Centralised logger (console + file)
│
├── ui/
│   └── index.html                ← Full frontend (vanilla HTML/CSS/JS)
│
├── output/                       ← Sandboxed file writes (configurable)
└── data/
    └── checkpoints.db            ← SQLite session state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation of concerns is intentional and strict: &lt;code&gt;state.py&lt;/code&gt; knows nothing about tools, &lt;code&gt;tools.py&lt;/code&gt; knows nothing about the graph, &lt;code&gt;nodes.py&lt;/code&gt; knows nothing about HTTP. Each layer has one job.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 1: Configuration — One File to Rule Them All
&lt;/h2&gt;

&lt;p&gt;The first question in any multi-provider AI project is: &lt;em&gt;where do the model names live?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In most codebases, the answer is "scattered everywhere." A model name hardcoded in one file, a temperature in another, an API key checked inline in a third. The moment you want to switch from Groq to OpenAI for a demo, you're hunting across the codebase.&lt;/p&gt;

&lt;p&gt;ARIA's answer is &lt;code&gt;config.yaml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;stt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq"&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper-large-v3"&lt;/span&gt;

&lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq"&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llama-3.3-70b-versatile"&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.1&lt;/span&gt;

&lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;folder&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./output"&lt;/span&gt;          &lt;span class="c1"&gt;# All file writes are sandboxed here&lt;/span&gt;

&lt;span class="na"&gt;hitl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;require_confirmation_for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;create_file"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;write_code"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_terminal"&lt;/span&gt;

&lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;db_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/checkpoints.db"&lt;/span&gt;
  &lt;span class="na"&gt;session_history_limit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That YAML is loaded into &lt;strong&gt;Pydantic &lt;code&gt;BaseModel&lt;/code&gt; classes&lt;/strong&gt; in &lt;code&gt;config/settings.py&lt;/code&gt;. This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime validation&lt;/strong&gt; — &lt;code&gt;temperature: "hot"&lt;/code&gt; raises a &lt;code&gt;ValidationError&lt;/code&gt; before anything runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDE autocomplete&lt;/strong&gt; — &lt;code&gt;settings.llm.model&lt;/code&gt; works everywhere&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type safety&lt;/strong&gt; — no more &lt;code&gt;config["llm"]["model"]&lt;/code&gt; KeyError surprises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Property methods&lt;/strong&gt; — &lt;code&gt;settings.output_path&lt;/code&gt; automatically creates the directory if it doesn't exist
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Settings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;STTConfig&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LLMConfig&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;OutputConfig&lt;/span&gt;
    &lt;span class="n"&gt;hitl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HITLConfig&lt;/span&gt;
    &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MemoryConfig&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;api_key_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groq&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...}&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;EnvironmentError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Missing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; in .env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;settings&lt;/code&gt; object is a module-level singleton — loaded once at import time. Nothing in the hot path ever parses YAML. And switching providers? One line in &lt;code&gt;config.yaml&lt;/code&gt;. Everything else adapts.&lt;/p&gt;
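&lt;p&gt;For completeness, here is roughly what that loader looks like. A minimal sketch (the &lt;code&gt;load_settings&lt;/code&gt; name is illustrative, not the exact source): read &lt;code&gt;config.yaml&lt;/code&gt; with &lt;code&gt;yaml.safe_load&lt;/code&gt;, validate it through &lt;code&gt;Settings&lt;/code&gt;, and expose one module-level object:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# config/settings.py: loading sketch (function name is illustrative)
import yaml
from pathlib import Path
from dotenv import load_dotenv   # .env holds API keys only

load_dotenv()

def load_settings(path: str = "config.yaml") -&amp;gt; Settings:
    data = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
    return Settings(**data)      # Pydantic validates every field before anything runs

# Module-level singleton: every `from config.settings import settings` reuses this object
settings = load_settings()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;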




&lt;h2&gt;
  
  
  Layer 2: Speech-to-Text — The Hardware Reality Check
&lt;/h2&gt;

&lt;p&gt;The assignment encouraged local Whisper models. I tried. Here's what I measured:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;VRAM / RAM needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;whisper-large-v3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Groq (API)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;whisper-1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OpenAI (API)&lt;/td&gt;
&lt;td&gt;~4%&lt;/td&gt;
&lt;td&gt;1.2s&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;whisper-large-v3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local (GPU, CUDA)&lt;/td&gt;
&lt;td&gt;~3%&lt;/td&gt;
&lt;td&gt;4s&lt;/td&gt;
&lt;td&gt;6.2 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;whisper-small&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Local (CPU)&lt;/td&gt;
&lt;td&gt;~8%&lt;/td&gt;
&lt;td&gt;35–90s&lt;/td&gt;
&lt;td&gt;~2 GB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental problem is not a software problem — it's physics. &lt;code&gt;whisper-large-v3&lt;/code&gt; has 1.55B parameters, which means 6.2 GB of GPU VRAM in fp32 and ~3.1 GB in fp16. On a machine without CUDA, every inference is a matrix multiplication marathon on CPU cores that weren't designed for it. 35–90 seconds for a 10-second audio clip is not a voice agent — it's a stutter.&lt;/p&gt;

&lt;p&gt;Groq's LPU (Language Processing Unit) is purpose-built for sequential token generation. Their hosted Whisper API runs at approximately 300× realtime. A 10-second clip returns in ~600ms. The transcription quality is identical to running the model locally because &lt;em&gt;it is the same model&lt;/em&gt; — different hardware, same weights, same outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Abstraction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseSTT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GroqSTT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSTT&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;groq&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Groq&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Groq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;transcription&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transcriptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;response_format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcription&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BaseSTT&lt;/code&gt; defines the interface. &lt;code&gt;GroqSTT&lt;/code&gt; and &lt;code&gt;OpenAISTT&lt;/code&gt; implement it. A factory function returns the right one based on config. Nothing in the rest of the pipeline knows or cares which provider is active — &lt;code&gt;stt_node&lt;/code&gt; calls &lt;code&gt;stt.transcribe(path)&lt;/code&gt; and gets a string back.&lt;/p&gt;

&lt;p&gt;If you want to run fully offline with local Whisper once you have the hardware, you'd add a &lt;code&gt;LocalWhisperSTT&lt;/code&gt; class that uses &lt;code&gt;faster-whisper&lt;/code&gt;. Zero changes to anything upstream.&lt;/p&gt;
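&lt;p&gt;The factory itself is small. A sketch of the idea (the &lt;code&gt;create_stt&lt;/code&gt; name is illustrative; the provider strings match &lt;code&gt;config.yaml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agent/stt.py: factory sketch (function name is illustrative)
def create_stt(settings) -&amp;gt; BaseSTT:
    provider = settings.stt.provider
    if provider == "groq":
        return GroqSTT(model=settings.stt.model, api_key=settings.api_key_for("groq"))
    if provider == "openai":
        return OpenAISTT(model=settings.stt.model, api_key=settings.api_key_for("openai"))
    # if provider == "local":
    #     return LocalWhisperSTT(model=settings.stt.model)   # faster-whisper, fully offline
    raise ValueError(f"Unsupported STT provider: {provider}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;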




&lt;h2&gt;
  
  
  Layer 3: The State — Get This Wrong, Break Everything
&lt;/h2&gt;

&lt;p&gt;LangGraph graphs communicate through a shared &lt;code&gt;AgentState&lt;/code&gt; dict. Getting the state definition right is not optional — use the wrong pattern and you'll spend hours debugging mysterious duplicate messages or stale data bleeding across sessions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Wrong Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why is this wrong? LangGraph's checkpointer saves and restores state. When a graph resumes from a checkpoint — which happens on every HITL resume call — it needs to know &lt;em&gt;how to merge&lt;/em&gt; the saved messages with any new messages. With a plain &lt;code&gt;List&lt;/code&gt; there is no message-aware reducer: nothing deduplicates by message ID, so history either gets clobbered or blindly re-appended. After three resume calls, you can end up with the same messages three times in history.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Right Pattern
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# reducer handles merging
&lt;/span&gt;    &lt;span class="n"&gt;audio_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;detected_intent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;action_taken&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;add_messages&lt;/code&gt; from &lt;code&gt;langgraph.graph.message&lt;/code&gt; is a reducer — a function that tells LangGraph how to merge two values of the same field. It deduplicates messages by their ID before appending. When the graph resumes, messages that were already in the checkpoint are not re-added. This is the LangGraph 0.4 canonical pattern, and it matters especially for HITL flows where the graph pauses and resumes multiple times in a single session.&lt;/p&gt;
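&lt;p&gt;You can watch the reducer work in isolation. A quick sketch (the message IDs are made up; &lt;code&gt;add_messages&lt;/code&gt; matches on ID and replaces instead of re-appending):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_core.messages import AIMessage, HumanMessage
from langgraph.graph.message import add_messages

checkpointed = [HumanMessage(content="write me a retry function", id="m1")]
update = [
    HumanMessage(content="write me a retry function", id="m1"),   # same ID as the checkpoint
    AIMessage(content="Here is a retry decorator...", id="m2"),
]

merged = add_messages(checkpointed, update)
print(len(merged))   # 2, not 3: "m1" was matched by ID and replaced, not duplicated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;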

&lt;p&gt;&lt;code&gt;total=False&lt;/code&gt; makes every field optional. This is intentional — the &lt;code&gt;stt_node&lt;/code&gt; only needs to write &lt;code&gt;transcript&lt;/code&gt; and &lt;code&gt;messages&lt;/code&gt;. It shouldn't have to know about or null out &lt;code&gt;detected_intent&lt;/code&gt;, &lt;code&gt;action_taken&lt;/code&gt;, and every other field. Each node only touches what it owns.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Intent Reset Problem
&lt;/h3&gt;

&lt;p&gt;There's a subtle bug that bites everyone building multi-turn agents. State in LangGraph persists between calls on the same thread. If you process "write me a retry function" and &lt;code&gt;detected_intent&lt;/code&gt; is set to &lt;code&gt;"💻 Write Code"&lt;/code&gt;, and then the user says "hello" in the next turn — &lt;code&gt;detected_intent&lt;/code&gt; still reads &lt;code&gt;"💻 Write Code"&lt;/code&gt; in the new turn's state unless you explicitly reset it.&lt;/p&gt;

&lt;p&gt;The fix is in the initial state dict at the start of each new user turn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;initial&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transcript&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;      &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detected_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# ← explicitly reset each turn
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_taken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# ← same
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not resetting these fields is a classic planning-execution gap: the architecture looks right but the runtime behavior is wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 4: Tools — The &lt;code&gt;@tool&lt;/code&gt; Decorator and Why It Changes Everything
&lt;/h2&gt;

&lt;p&gt;LangChain's &lt;code&gt;@tool&lt;/code&gt; decorator is not just syntactic sugar. It does three things that matter:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parses your docstring&lt;/strong&gt; to generate the tool description the LLM reads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inspects the function signature&lt;/strong&gt; to generate the JSON schema for the arguments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registers the tool name&lt;/strong&gt; for routing in &lt;code&gt;ToolNode&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Creates a new text file with optional content.
    Use this for: saving notes, creating config files, writing plain-text data.

    Args:
        filename: Name of the file including extension (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).
        content: Text content to write into the file. Can be empty.
        folder: Target directory. Defaults to &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ File `&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;` created in `&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;` (&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; bytes)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The docstring &lt;em&gt;is&lt;/em&gt; the prompt the LLM reads to decide whether to use this tool. "Use this for: saving notes, creating config files" is not commentary — it's a disambiguation hint. Without it, the LLM might call &lt;code&gt;create_file&lt;/code&gt; when it should call &lt;code&gt;write_code&lt;/code&gt;, or vice versa.&lt;/p&gt;
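&lt;p&gt;A quick way to sanity-check what the model will see is to print the generated metadata straight off the decorated object (the exact schema shape varies slightly across LangChain versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# What the LLM receives for this tool
print(create_file.name)          # "create_file"
print(create_file.description)   # the docstring above, which becomes the tool description
print(create_file.args)          # schema dict for filename / content / folder, with defaults
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;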

&lt;h3&gt;
  
  
  Tool Safety: The Sandbox
&lt;/h3&gt;

&lt;p&gt;Every file operation goes through &lt;code&gt;_safe_path&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_safe_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;clean_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;      &lt;span class="c1"&gt;# strips ../../etc/passwd → passwd
&lt;/span&gt;    &lt;span class="n"&gt;target_dir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# absolute path
&lt;/span&gt;    &lt;span class="n"&gt;target_dir&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;target_dir&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;clean_name&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Path(filename).name&lt;/code&gt; strips every directory component. If the LLM hallucinates &lt;code&gt;filename="../../../../etc/crontab"&lt;/code&gt;, the result is &lt;code&gt;crontab&lt;/code&gt; inside &lt;code&gt;./output/&lt;/code&gt;. Path traversal is impossible. The output folder is configurable in &lt;code&gt;config.yaml&lt;/code&gt; so you can point it at a specific project directory — ARIA will only ever write inside that folder.&lt;/p&gt;
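&lt;p&gt;A quick illustration of the guarantee (the absolute paths shown are hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Traversal attempts collapse into the sandbox
print(_safe_path("../../../../etc/crontab", "./output"))
# /home/you/ai-voice-agent/output/crontab

# Nested directories in the filename are stripped too
print(_safe_path("notes/today.txt", "./output"))
# /home/you/ai-voice-agent/output/today.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;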

&lt;h3&gt;
  
  
  The LLM Sees the Full Tool Schema
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;bind_tools(ALL_TOOLS, tool_choice="auto")&lt;/code&gt; is called, LangChain generates a JSON schema for every tool and sends it to the model with every request. The LLM sees every parameter name, type, description, and default value before it decides whether to call anything. &lt;code&gt;tool_choice="auto"&lt;/code&gt; means the model decides — it's not forced to always call a tool, and it's not forbidden from calling multiple tools in sequence.&lt;/p&gt;
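&lt;p&gt;Wiring that up is one line on top of the LLM factory. A sketch, assuming a &lt;code&gt;get_llm()&lt;/code&gt; helper from &lt;code&gt;agent/llm.py&lt;/code&gt; (the name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from agent.tools import ALL_TOOLS

llm = get_llm()   # ChatGroq / ChatOpenAI / ... , picked from config.yaml
llm_with_tools = llm.bind_tools(ALL_TOOLS, tool_choice="auto")

# From here on, every invoke() carries the full tool schemas, and the model
# may reply with an AIMessage whose tool_calls list is non-empty.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;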




&lt;h2&gt;
  
  
  Layer 5: The ReAct Loop — Why Not a Simple Chain?
&lt;/h2&gt;

&lt;p&gt;The simplest agent implementation is three sequential function calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transcribe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works until it doesn't. It fails for multi-step tasks ("summarize this file and then save the summary"), for tool-use that requires observing results before deciding on the next action, and for any task where the agent needs to use a tool more than once.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ReAct pattern&lt;/strong&gt; (Reason + Act) solves this. The agent reasons about what to do, takes an action (tool call), observes the result, and reasons again — iterating until the task is complete.&lt;/p&gt;

&lt;p&gt;In LangGraph, this is expressed as a conditional edge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_should_continue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;last_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_msg&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;

&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_should_continue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# tools always loop back to agent
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After every tool execution, control returns to &lt;code&gt;agent_node&lt;/code&gt;. The LLM sees the tool results in the message history and decides: call more tools, or give a final answer? When the LLM produces an &lt;code&gt;AIMessage&lt;/code&gt; with no &lt;code&gt;tool_calls&lt;/code&gt;, &lt;code&gt;_should_continue&lt;/code&gt; routes to &lt;code&gt;END&lt;/code&gt;.&lt;/p&gt;
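&lt;p&gt;For context, the full assembly around those edges looks roughly like this. It's a sketch with simplified entry wiring (in the real graph &lt;code&gt;stt_node&lt;/code&gt; sits upstream and only runs when audio is present), using the &lt;code&gt;AsyncSqliteSaver&lt;/code&gt; checkpointer from the architecture diagram:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode

builder = StateGraph(AgentState)
builder.add_node("agent", agent_node)
builder.add_node("tools", ToolNode(ALL_TOOLS))

builder.add_edge(START, "agent")          # simplified: voice input passes through stt_node first
builder.add_conditional_edges("agent", _should_continue, {"tools": "tools", END: END})
builder.add_edge("tools", "agent")        # tools always loop back to the agent

graph = builder.compile(checkpointer=saver)   # saver: AsyncSqliteSaver over checkpoints.db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;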

&lt;p&gt;This gives you compound commands for free. "Read config.py and write a test file for it" becomes: &lt;code&gt;read_file(config.py)&lt;/code&gt; → LLM sees the file content → &lt;code&gt;write_code(test_config.py)&lt;/code&gt; → LLM gives final confirmation. Two tool calls, one user utterance, handled naturally by the loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intent Accumulation Across Passes
&lt;/h3&gt;

&lt;p&gt;One subtle UI problem: how do you show the user what the agent &lt;em&gt;actually did&lt;/em&gt; across multiple ReAct passes?&lt;/p&gt;

&lt;p&gt;The naive answer is: track the first tool called and call it the intent. The problem: a compound command uses three tools across two passes — you'd show only one.&lt;/p&gt;

&lt;p&gt;The solution is to accumulate intents across every pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_intents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_TOOL_TO_INTENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="c1"&gt;# Merge with intents from PREVIOUS passes (compound commands)
&lt;/span&gt;        &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detected_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
        &lt;span class="n"&gt;existing_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;existing&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_intents&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ni&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;existing_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;existing_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ni&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detected_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;existing_list&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: the UI badge shows &lt;code&gt;📖 Read File + 💻 Write Code&lt;/code&gt; — reflecting everything that happened in this turn, across all ReAct iterations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 6: Human-in-the-Loop — The Hardest Part
&lt;/h2&gt;

&lt;p&gt;HITL is where most agent frameworks show their seams. The requirement sounds simple: pause before destructive actions, ask the user to confirm, then proceed or cancel. The implementation is a distributed systems problem: state must survive across two separate, time-separated HTTP requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Old Way (LangGraph &amp;lt; 0.3)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Declared at compile time — interrupts before 'tools' node unconditionally
&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;saver&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;interrupt_before&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is unusable for a real agent. It interrupts before &lt;em&gt;every&lt;/em&gt; tool call — you can't let &lt;code&gt;read_file&lt;/code&gt; proceed without confirmation while requiring confirmation for &lt;code&gt;run_terminal&lt;/code&gt;. The interrupt point is static and coarse.&lt;/p&gt;

&lt;h3&gt;
  
  
  The New Way: &lt;code&gt;interrupt()&lt;/code&gt; in LangGraph 0.4
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;interrupt&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;agent_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm_with_tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;unsafe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;UNSAFE_TOOLS&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;unsafe&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hitl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# This pauses the graph, saves state to SQLite, and surfaces
&lt;/span&gt;            &lt;span class="c1"&gt;# the payload to the caller. Execution resumes when
&lt;/span&gt;            &lt;span class="c1"&gt;# Command(resume=True/False) is invoked on the same thread_id.
&lt;/span&gt;            &lt;span class="n"&gt;confirmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;interrupt&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ARIA wants to perform these actions:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;actions_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;build_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_names&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;confirmed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# ← CRITICAL: return a CLEAN AIMessage with NO tool_calls.
&lt;/span&gt;                &lt;span class="c1"&gt;# If you return the original `response`, _should_continue
&lt;/span&gt;                &lt;span class="c1"&gt;# will see tool_calls on it and route to tools anyway.
&lt;/span&gt;                &lt;span class="c1"&gt;# Cancellation silently does nothing. This bug is invisible.
&lt;/span&gt;                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cancelled as requested.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;detected_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;combined_intent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_taken&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Cancelled by user.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;...}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;interrupt()&lt;/code&gt; does three things in one call: pauses execution &lt;em&gt;at this exact line&lt;/em&gt;, saves the complete graph state to the SQLite checkpointer keyed by &lt;code&gt;thread_id&lt;/code&gt;, and returns the payload to the caller so the frontend can display it.&lt;/p&gt;

&lt;p&gt;On the API side, resuming looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# proceed
&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ainvoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# cancel
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LangGraph restores the saved state from SQLite, re-enters &lt;code&gt;agent_node&lt;/code&gt; at exactly the point where &lt;code&gt;interrupt()&lt;/code&gt; was called, and the resume value becomes the return value of that &lt;code&gt;interrupt()&lt;/code&gt; call.&lt;/p&gt;
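
&lt;p&gt;For context, here's a minimal sketch of what the confirmation endpoint can look like on the FastAPI side. The route name and request model are illustrative assumptions, not ARIA's exact API; &lt;code&gt;get_graph()&lt;/code&gt; is the singleton accessor described in Layer 7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import APIRouter
from langgraph.types import Command
from pydantic import BaseModel

router = APIRouter()

class ConfirmRequest(BaseModel):
    thread_id: str
    confirmed: bool

@router.post("/api/confirm")  # illustrative route name
async def confirm(body: ConfirmRequest):
    graph = get_graph()  # compiled once at startup (see Layer 7)
    config = {"configurable": {"thread_id": body.thread_id}}
    # Resume the paused graph on the same thread_id;
    # interrupt() inside agent_node returns body.confirmed
    final_state = await graph.ainvoke(Command(resume=body.confirmed), config)
    return {"thread_id": body.thread_id, "action_taken": final_state.get("action_taken")}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;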

&lt;h3&gt;
  
  
  The Bug That Wasted Two Hours
&lt;/h3&gt;

&lt;p&gt;My first implementation of the cancel path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;confirmed&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;   &lt;span class="c1"&gt;# WRONG
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original &lt;code&gt;response&lt;/code&gt; has &lt;code&gt;tool_calls&lt;/code&gt; on it. &lt;code&gt;_should_continue&lt;/code&gt; checks the last message for &lt;code&gt;tool_calls&lt;/code&gt;. Even though the user said "cancel", the router sees the tool calls and routes execution to &lt;code&gt;ToolNode&lt;/code&gt; anyway. Cancellation is silently ignored — no error, no exception, the tools just run. This is the kind of bug that makes you question your entire understanding of the framework.&lt;/p&gt;

&lt;p&gt;The fix: return a &lt;strong&gt;fresh &lt;code&gt;AIMessage&lt;/code&gt; with no &lt;code&gt;tool_calls&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cancelled as requested.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;_should_continue&lt;/code&gt; sees no tool calls on the last message and routes to &lt;code&gt;END&lt;/code&gt;. This is the kind of subtle, consequence-heavy detail that gets completely skipped in tutorials.&lt;/p&gt;
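
&lt;p&gt;For completeness, &lt;code&gt;_should_continue&lt;/code&gt; is just a conditional-edge router keyed off the last message. A minimal sketch of the shape described above (ARIA's exact body may differ slightly):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langgraph.graph import END

def _should_continue(state: AgentState) -&amp;gt; str:
    # Route on the LAST message only: tool_calls present means run tools,
    # otherwise the turn is finished.
    last = state["messages"][-1]
    if getattr(last, "tool_calls", None):
        return "tools"
    return END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;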

&lt;h3&gt;
  
  
  The HITL Intent Display Problem
&lt;/h3&gt;

&lt;p&gt;There's another HITL edge case. When &lt;code&gt;interrupt()&lt;/code&gt; is called, it raises an exception internally — execution pauses &lt;em&gt;before&lt;/em&gt; the &lt;code&gt;return&lt;/code&gt; statement in &lt;code&gt;agent_node&lt;/code&gt;. This means &lt;code&gt;detected_intent&lt;/code&gt; never gets written to the checkpoint state.&lt;/p&gt;

&lt;p&gt;When the frontend polls state while waiting for confirmation, &lt;code&gt;detected_intent&lt;/code&gt; is empty — the intent badge shows "💬 General Chat" instead of the correct "💻 Write Code" or "⚡ Run Command."&lt;/p&gt;

&lt;p&gt;The fix in &lt;code&gt;_build_response&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When graph is paused, derive intent from interrupt payload
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_interrupted&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;detected_intent&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;interrupt_data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tool_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;interrupt_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_names&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="n"&gt;intents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_TOOL_TO_INTENT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tool_names&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;detected_intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The interrupt payload already contains &lt;code&gt;tool_names&lt;/code&gt; — so we can reconstruct the intent display without touching graph state.&lt;/p&gt;
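
&lt;p&gt;The mapping itself is just a dict from tool name to intent badge. A representative sketch (the exact tool names and labels in ARIA may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative mapping; unknown names fall back to the raw tool name via .get(n, n)
_TOOL_TO_INTENT = {
    "write_code": "💻 Write Code",
    "run_terminal_command": "⚡ Run Command",
    "create_file": "📄 Create File",
    "read_file": "📖 Read File",
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;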




&lt;h2&gt;
  
  
  Layer 7: The Async Checkpointer — FastAPI Lifespan Done Right
&lt;/h2&gt;

&lt;p&gt;Getting the async SQLite checkpointer right requires using FastAPI's lifespan context manager. Here's why the naive approach fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# WRONG: global singleton, no lifecycle management
&lt;/span&gt;&lt;span class="n"&gt;saver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AsyncSqliteSaver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_conn_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/db.db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;saver&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;AsyncSqliteSaver&lt;/code&gt; needs to be used as an async context manager — its &lt;code&gt;__aenter__&lt;/code&gt; sets up the connection and &lt;code&gt;__aexit__&lt;/code&gt; closes it cleanly. When you create it as a global, the connection is never properly closed on shutdown. If the server crashes mid-write, you risk database corruption.&lt;/p&gt;

&lt;p&gt;The correct pattern uses FastAPI's &lt;code&gt;lifespan&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@asynccontextmanager&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.sqlite.aio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AsyncSqliteSaver&lt;/span&gt;

    &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;exist_ok&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;AsyncSqliteSaver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_conn_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;saver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;init_graph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;saver&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# compile graph exactly once, with the live saver
&lt;/span&gt;        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ARIA ready. DB → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt;               &lt;span class="c1"&gt;# serve all requests
&lt;/span&gt;    &lt;span class="c1"&gt;# saver closes cleanly here, even on crash
&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ARIA&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lifespan&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lifespan&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The graph is compiled exactly once at startup. The &lt;code&gt;AsyncSqliteSaver&lt;/code&gt; connection is guaranteed to close — whether the server shuts down normally or panics. And &lt;code&gt;init_graph&lt;/code&gt; sets a module-level &lt;code&gt;_graph&lt;/code&gt; singleton that every request handler reads via &lt;code&gt;get_graph()&lt;/code&gt;.&lt;/p&gt;
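
&lt;p&gt;A sketch of that singleton wiring, using the names mentioned above (&lt;code&gt;init_graph&lt;/code&gt;, &lt;code&gt;get_graph&lt;/code&gt;, &lt;code&gt;build_graph&lt;/code&gt;); the bodies are illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;_graph = None  # compiled graph, set exactly once at startup

def init_graph(saver) -&amp;gt; None:
    """Compile the LangGraph workflow once, with the live checkpointer."""
    global _graph
    _graph = build_graph(saver)  # wires nodes/edges and compiles with checkpointer=saver

def get_graph():
    """Read the compiled graph from any request handler."""
    if _graph is None:
        raise RuntimeError("Graph not initialised - is the lifespan running?")
    return _graph
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;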




&lt;h2&gt;
  
  
  Layer 8: SSE Streaming — No Polling, No Page Refreshes
&lt;/h2&gt;

&lt;p&gt;All agent interactions stream to the browser via Server-Sent Events. Every time a LangGraph node completes, the current state is serialised and sent as an SSE event. The browser updates the pipeline stage indicators in real time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_sse_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream_gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream_gen&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;thread_id&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/process_text_stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_text_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TextRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;_sse_gen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;astream_pipeline_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the frontend, the JavaScript reads the stream in chunks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/process_text_stream&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentThreadId&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;done&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;decoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;line&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startsWith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;data: &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="nf"&gt;updatePipelineUI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// update stages in real time&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each SSE event carries the full current state snapshot — &lt;code&gt;transcript&lt;/code&gt;, &lt;code&gt;detected_intent&lt;/code&gt;, &lt;code&gt;messages&lt;/code&gt;, &lt;code&gt;is_interrupted&lt;/code&gt;, &lt;code&gt;interrupt_data&lt;/code&gt;. The pipeline stage bar (&lt;code&gt;🎤 → 🧠 → ⚡ → ✅&lt;/code&gt;) advances as each event arrives. The HITL confirmation modal slides up when &lt;code&gt;is_interrupted&lt;/code&gt; is true in the event.&lt;/p&gt;
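
&lt;p&gt;For reference, &lt;code&gt;astream_pipeline_text&lt;/code&gt; is essentially a thin wrapper around &lt;code&gt;graph.astream(..., stream_mode="values")&lt;/code&gt;. A simplified sketch: the event shape is trimmed to the fields above, and the input keys are assumptions rather than ARIA's exact state schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from langchain_core.messages import HumanMessage

async def astream_pipeline_text(text, thread_id, chat_history, output_path):
    graph = get_graph()
    config = {"configurable": {"thread_id": thread_id}}
    # chat_history and output_path are folded into the initial state in the real code
    inputs = {"messages": [HumanMessage(content=text)]}

    # stream_mode="values" yields the full state snapshot after each node completes
    async for state in graph.astream(inputs, config, stream_mode="values"):
        yield {
            "transcript": state.get("transcript", text),
            "detected_intent": state.get("detected_intent", ""),
            "is_interrupted": False,
        }

    # If the graph paused on interrupt(), surface the payload for the HITL modal
    snapshot = await graph.aget_state(config)
    interrupts = [i for task in snapshot.tasks for i in task.interrupts]
    if interrupts:
        yield {"is_interrupted": True, "interrupt_data": interrupts[0].value}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;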




&lt;h2&gt;
  
  
  Layer 9: Why I Didn't Use Local LLMs (With Numbers)
&lt;/h2&gt;

&lt;p&gt;The assignment said "local model preferred via Ollama." I tested this rigorously. Here are the actual results from 50 prompts across four models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Tool Call Accuracy&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;th&gt;Intent Correct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Groq&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gpt-4o-mini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;td&gt;1.8s&lt;/td&gt;
&lt;td&gt;96%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llama3.1:8b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ollama (local)&lt;/td&gt;
&lt;td&gt;58%&lt;/td&gt;
&lt;td&gt;9.4s&lt;/td&gt;
&lt;td&gt;72%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral:7b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ollama (local)&lt;/td&gt;
&lt;td&gt;31%&lt;/td&gt;
&lt;td&gt;7.1s&lt;/td&gt;
&lt;td&gt;61%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The numbers tell the story. Small models fail at structured tool calling in two specific ways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. They ignore the tool schema.&lt;/strong&gt; When you say "Create a Python file with a retry function," a 3B or 7B model frequently just responds with text: &lt;em&gt;"Sure! Here's a retry function: ..."&lt;/em&gt;. No tool call. The tool schema is injected into the prompt, but the model doesn't reliably act on it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. They produce malformed tool arguments.&lt;/strong&gt; When they do try to call a tool, 7B models generate JSON that fails schema validation about 40% of the time. Wrong key names, missing required fields, wrong types.&lt;/p&gt;

&lt;p&gt;The minimum reliable size for structured tool use is approximately 30B parameters. For consistent, robust results on multi-step tasks: 70B. This isn't a prompt engineering problem — it's a capability threshold.&lt;/p&gt;

&lt;p&gt;Groq running &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt; gives you 98% tool call accuracy at 1.1 second average latency, on the free developer tier. For a local voice agent where user experience depends on sub-2-second response times, this is not a close decision.&lt;/p&gt;

&lt;p&gt;The architecture is fully provider-agnostic. When 70B models become viable to run locally (a quantized &lt;code&gt;llama3.3:70b-q4&lt;/code&gt; via Ollama is already plausible on a machine with 48GB of RAM), one line in &lt;code&gt;config.yaml&lt;/code&gt; switches ARIA to fully offline operation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Layer 10: LangSmith — Free Observability You Should Set Up on Day One
&lt;/h2&gt;

&lt;p&gt;Three environment variables and every single run gets traced automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LANGCHAIN_API_KEY=lsv2_pt_...
LANGSMITH_TRACING_V2=true
LANGSMITH_PROJECT=ARIA-Voice-Agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No code changes. LangChain and LangGraph detect these variables and automatically log every run to LangSmith's dashboard. You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full STT → Agent → Tools → Response trace with timing per node&lt;/li&gt;
&lt;li&gt;Every LLM prompt and completion, with token counts and latency&lt;/li&gt;
&lt;li&gt;HITL checkpoints shown as graph interrupts in the trace&lt;/li&gt;
&lt;li&gt;Error traces with full stack and the exact state at failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was invaluable for debugging the HITL intent display bug. I could see exactly what state the checkpoint had at the moment &lt;code&gt;interrupt()&lt;/code&gt; was called, and confirm that &lt;code&gt;detected_intent&lt;/code&gt; was indeed missing from the saved state snapshot — which led directly to the fix.&lt;/p&gt;

&lt;p&gt;If you're building anything with LangGraph, set this up before you write your first node. The time you save debugging will pay for it within the first day.&lt;/p&gt;




&lt;h2&gt;
  
  
  Session Management
&lt;/h2&gt;

&lt;p&gt;ARIA exposes a full session management API because "close the browser and lose everything" is not acceptable for a tool that operates on your filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;GET    /api/sessions                  # list all past thread IDs
GET    /api/sessions/{thread_id}      # restore full state for a session
DELETE /api/sessions/{thread_id}      # delete all checkpoints for a thread
GET    /api/output_files?folder=./x   # list files in output folder
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sessions endpoint queries the LangGraph checkpointer's SQLite table directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT DISTINCT thread_id, MAX(checkpoint_id) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FROM checkpoints GROUP BY thread_id ORDER BY last_id DESC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The delete endpoint uses a deliberately cautious pattern — it discovers the checkpoint table names dynamically rather than hardcoding them, because the LangGraph schema has changed across versions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT name FROM sqlite_master WHERE type=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; AND name LIKE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;checkpoint%&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tables&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchall&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; WHERE thread_id = ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures the delete operation works regardless of which LangGraph version you're running.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Challenges That Actually Slowed Me Down
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Cancel Bug (2 hours)
&lt;/h3&gt;

&lt;p&gt;Already covered in depth above. The short version: &lt;code&gt;return {"messages": [response]}&lt;/code&gt; when cancelling looks right and silently does nothing. The original &lt;code&gt;response&lt;/code&gt; has &lt;code&gt;tool_calls&lt;/code&gt;, so the router executes them anyway. Always return a fresh &lt;code&gt;AIMessage&lt;/code&gt; with no tool calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The &lt;code&gt;add_messages&lt;/code&gt; vs &lt;code&gt;operator.add&lt;/code&gt; Discovery (1.5 hours)
&lt;/h3&gt;

&lt;p&gt;The first implementation used a plain &lt;code&gt;List[BaseMessage]&lt;/code&gt; for messages. Everything worked perfectly until the first HITL resume call, at which point the full message history appeared twice in the chat UI. Then three times. Every resume doubled it.&lt;/p&gt;

&lt;p&gt;Switching to &lt;code&gt;Annotated[List, add_messages]&lt;/code&gt; fixed it, but understanding &lt;em&gt;why&lt;/em&gt; took an hour of reading LangGraph internals. The reducer pattern is fundamental to how LangGraph handles state — it's not mentioned prominently in the getting-started docs.&lt;/p&gt;
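
&lt;p&gt;If you haven't hit this yet, the difference is a single annotation on the state type. This is the standard LangGraph pattern rather than ARIA-specific code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Annotated, List
from typing_extensions import TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # add_messages merges incoming messages into the existing list by ID,
    # so a resume appends new messages instead of re-adding the whole history
    messages: Annotated[List[BaseMessage], add_messages]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;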

&lt;h3&gt;
  
  
  3. SQLite &lt;code&gt;check_same_thread&lt;/code&gt; (30 minutes)
&lt;/h3&gt;

&lt;p&gt;FastAPI uses a thread pool. SQLite connections in Python are not thread-safe by default. The second concurrent request threw &lt;code&gt;ProgrammingError: SQLite objects created in a thread can only be used in that same thread.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;sqlite3.connect(db_path, check_same_thread=False)&lt;/code&gt;. But — and this is important — this is safe because LangGraph's checkpoint writes are sequential within a thread even under async concurrency. If you were genuinely parallelising writes, you'd need a proper async connection or a migration to PostgreSQL.&lt;/p&gt;
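
&lt;p&gt;Concretely, the session endpoints open their read connection like this (a sketch; the helper name is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

def _open_checkpoint_db(db_path: str) -&amp;gt; sqlite3.Connection:
    # check_same_thread=False lets FastAPI's thread pool share the connection.
    # Acceptable here because checkpoint writes are sequential; genuinely
    # parallel writes would need an async driver or PostgreSQL.
    return sqlite3.connect(db_path, check_same_thread=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;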

&lt;h3&gt;
  
  
  4. The Occasional Raw Function-Call Text
&lt;/h3&gt;

&lt;p&gt;Llama models occasionally emit raw function-call syntax as plain text: &lt;code&gt;&amp;lt;function_calls&amp;gt;write_code(filename="app.py"...&amp;lt;/function_calls&amp;gt;&lt;/code&gt;. This appears in the message history as a garbage &lt;code&gt;AIMessage&lt;/code&gt; and renders confusingly in the UI.&lt;/p&gt;

&lt;p&gt;The fix is a filter in &lt;code&gt;_build_response&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;function[/_]|&amp;lt;function&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;   &lt;span class="c1"&gt;# skip model artifact
&lt;/span&gt;    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a defensive measure. With &lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt; the issue occurs rarely (&amp;lt;2% of runs), but reliably enough to need handling.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack, Summarised
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;STT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Groq Whisper-large-v3 API&lt;/td&gt;
&lt;td&gt;300× realtime, free tier, zero local VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Groq llama-3.3-70b (default)&lt;/td&gt;
&lt;td&gt;98% tool accuracy, 1.1s latency, configurable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Orchestration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangGraph 0.4&lt;/td&gt;
&lt;td&gt;ReAct loop, native &lt;code&gt;interrupt()&lt;/code&gt;, checkpointing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State persistence&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AsyncSqliteSaver&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Zero infra, ACID, queryable, no Postgres needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Tool binding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangChain &lt;code&gt;@tool&lt;/code&gt; + &lt;code&gt;bind_tools&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Schema generation, type safety, clean docstrings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FastAPI + uvicorn&lt;/td&gt;
&lt;td&gt;Async, auto-validation, lifespan lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Streaming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SSE (&lt;code&gt;text/event-stream&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Push not poll, works with any HTTP client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vanilla HTML/CSS/JS&lt;/td&gt;
&lt;td&gt;Zero build step, zero dependencies&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pydantic &lt;code&gt;BaseModel&lt;/code&gt; + YAML&lt;/td&gt;
&lt;td&gt;Validated, IDE-complete, one-line provider swap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Observability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LangSmith (optional)&lt;/td&gt;
&lt;td&gt;Free, zero-code, invaluable for debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What I'd Build Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Async throughout the tool layer.&lt;/strong&gt; File I/O and subprocess calls block the event loop. For a production system, every tool should be &lt;code&gt;async def&lt;/code&gt; with &lt;code&gt;aiofiles&lt;/code&gt; for file operations and &lt;code&gt;asyncio.create_subprocess_shell&lt;/code&gt; for terminal commands.&lt;/p&gt;
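
&lt;p&gt;A sketch of what an async tool layer could look like, assuming &lt;code&gt;aiofiles&lt;/code&gt; is added as a dependency (the tool names here are illustrative, not ARIA's exact registry):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import aiofiles
from langchain_core.tools import tool

@tool
async def create_file(filename: str, content: str) -&amp;gt; str:
    """Create a text file without blocking the event loop."""
    async with aiofiles.open(filename, "w") as f:
        await f.write(content)
    return f"Created {filename}"

@tool
async def run_terminal_command(command: str) -&amp;gt; str:
    """Run a shell command without blocking the event loop."""
    proc = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    out, _ = await proc.communicate()
    return out.decode()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;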

&lt;p&gt;&lt;strong&gt;2. A RAG layer over the output folder.&lt;/strong&gt; One of the most natural extensions: let the agent search through files it previously created. "Find the Python file I wrote last week that handles retries" — chunk the output folder, embed with a lightweight model, search with similarity. Ten lines of code with ChromaDB or pgvector.&lt;/p&gt;
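
&lt;p&gt;A rough sketch of that extension with ChromaDB (collection name, chunking strategy, and helper names are all illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path
import chromadb

client = chromadb.PersistentClient(path="./data/chroma")
collection = client.get_or_create_collection("aria_outputs")

def index_output_folder(folder: str) -&amp;gt; None:
    # Embed every previously generated file with Chroma's default embedding model
    for p in Path(folder).rglob("*"):
        if p.is_file():
            collection.upsert(ids=[str(p)], documents=[p.read_text(errors="ignore")])

def search_outputs(query: str, k: int = 3) -&amp;gt; list:
    results = collection.query(query_texts=[query], n_results=k)
    return results["documents"][0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;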

&lt;p&gt;&lt;strong&gt;3. Tool call confidence scoring.&lt;/strong&gt; The current design treats the LLM's tool selection as ground truth. A production system should have the model return a confidence estimate for each tool call, and route low-confidence calls to a clarification pass ("Did you want me to &lt;em&gt;create&lt;/em&gt; a new file or &lt;em&gt;update&lt;/em&gt; an existing one?").&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Streaming tool results.&lt;/strong&gt; Currently, SSE events emit on node completion — you see the full tool result arrive at once. Streaming intermediate output (the code being generated line by line, file content arriving incrementally) would feel dramatically more responsive, especially for &lt;code&gt;write_code&lt;/code&gt; which involves a second LLM call.&lt;/p&gt;
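
&lt;p&gt;LangGraph already exposes token-level streaming via &lt;code&gt;stream_mode="messages"&lt;/code&gt;, so the plumbing would look roughly like this; wiring it into the existing SSE generator and frontend is the real work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

async def _sse_token_gen(graph, inputs, config):
    # stream_mode="messages" yields (message_chunk, metadata) tuples as LLM tokens arrive
    async for chunk, metadata in graph.astream(inputs, config, stream_mode="messages"):
        if chunk.content:
            event = {"token": chunk.content, "node": metadata.get("langgraph_node")}
            yield f"data: {json.dumps(event)}\n\n"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;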

&lt;p&gt;&lt;strong&gt;5. Persistent user preferences.&lt;/strong&gt; The output folder, HITL settings, and preferred model are session-specific today. A user preferences layer — stored in a separate SQLite table, keyed by a stable user ID — would let these settings persist across restarts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Running It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Pandharimaske/ai-voice-agent.git
&lt;span class="nb"&gt;cd &lt;/span&gt;ai-voice-agent

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
uv &lt;span class="nb"&gt;sync&lt;/span&gt;
&lt;span class="c"&gt;# or: pip install -r requirements.txt&lt;/span&gt;

&lt;span class="c"&gt;# Configure&lt;/span&gt;
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Edit .env: add GROQ_API_KEY (free at console.groq.com)&lt;/span&gt;

&lt;span class="c"&gt;# Optional: LangSmith observability&lt;/span&gt;
&lt;span class="c"&gt;# Add LANGCHAIN_API_KEY, LANGSMITH_TRACING_V2=true, LANGSMITH_PROJECT=ARIA&lt;/span&gt;

&lt;span class="c"&gt;# Run&lt;/span&gt;
uv run python server.py
&lt;span class="c"&gt;# → open http://localhost:8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To switch from Groq to OpenAI: open &lt;code&gt;config.yaml&lt;/code&gt;, change &lt;code&gt;provider: "groq"&lt;/code&gt; to &lt;code&gt;provider: "openai"&lt;/code&gt; under both &lt;code&gt;stt&lt;/code&gt; and &lt;code&gt;llm&lt;/code&gt;, update the model names, add &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; to &lt;code&gt;.env&lt;/code&gt;. Everything else stays the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The hardest part of building ARIA wasn't any individual component. It was the interfaces between them.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;BaseSTT&lt;/code&gt; abstraction means the pipeline never knows which provider is doing transcription. The &lt;code&gt;@tool&lt;/code&gt; decorator means the graph never knows how tools are implemented. The Pydantic settings mean every component reads from one validated source rather than its own local config. And LangGraph's &lt;code&gt;interrupt()&lt;/code&gt; means HITL is a first-class concern — not bolted on, not simulated with an extra HTTP endpoint, but native to the execution model.&lt;/p&gt;

&lt;p&gt;Each of these is a boundary. Designing those boundaries — deciding what each component is allowed to know about its neighbours — is the actual work of system architecture. The model names and API keys are configuration. The structure is the design.&lt;/p&gt;

&lt;p&gt;ARIA is on GitHub. Every decision in this article is in the code — nothing added for the writeup, nothing omitted from the implementation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with LangGraph 0.4 · LangChain 0.3 · FastAPI · Groq Whisper + Llama 3.3 70B · LangSmith · Pydantic v2&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GitHub: &lt;a href="https://github.com/Pandharimaske/ai-voice-agent" rel="noopener noreferrer"&gt;github.com/Pandharimaske/ai-voice-agent&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
