Adewole Babatunde

Posted on May 25

How 12 AI agent frameworks handle human approval (most badly)


I keep watching teams ship agent systems into production and then discover, on day three, that "the agent needs to wait for a human sometimes" breaks every assumption in their stack. Not because they didn't see it coming; every team plans for human-in-the-loop (HITL). It breaks because most popular agent frameworks reduce "human in the loop" to "block the Python process on input() and hope for the best."

I spent a day auditing the twelve most popular AI-agent frameworks against a strict production rubric. The results aren't kind. Two frameworks come close to passing. The other ten are one production deploy away from breaking.

This post is the receipts.

The rubric (and why it's strict)

A production HITL primitive isn't "the agent can pause for input." That's a 1980s primitive. A production HITL primitive needs six properties:

Axis	What it measures
Durability	Does the agent survive a worker restart during a pending await? Is the paused state stored in durable storage (Postgres), not in-process memory?
Idempotency	If the agent retries after a crash, can the same approval resolve once without double-acting?
Typed I/O	Is the request payload AND the human's response a typed schema (Pydantic / Zod)? Or just `str`?
Channel abstraction	Can you swap channel (terminal → Slack → email → dashboard) without rewriting the agent?
Verifier hook	Is there a built-in slot for an AI quality check on the human's response before resuming?
Default UI	Does the framework ship an admin UI to view, claim, resolve in-flight tasks?

Score each axis from 1 (absent / broken) to 5 (production-ready primitive in core). Max 30. A 0 in the UI column means the framework ships no UI of any kind.

If you think six axes is too strict: the alternative is your agent process dying because a worker rotated mid-await, your retry double-charging a customer, your "approve y/n" prompt happening on stdin in a Slack channel only you can see, and you discovering all of this two weeks after you shipped.
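To make the Durability and Idempotency axes concrete, here is a minimal sketch of what "the same approval resolves once" can look like at the storage layer. It uses the standard library's sqlite3 as a stand-in for Postgres; the table name, column names, and helper functions are all hypothetical, not taken from any framework in this audit.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Create the approvals table. A file path (or Postgres in production) makes it durable."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS approvals (
               idempotency_key TEXT PRIMARY KEY,
               payload TEXT NOT NULL,
               response TEXT
           )"""
    )
    return db

def request_approval(db, key, payload):
    """Record a pending approval; a retry with the same key is a no-op."""
    db.execute(
        "INSERT OR IGNORE INTO approvals (idempotency_key, payload) VALUES (?, ?)",
        (key, json.dumps(payload)),
    )
    db.commit()

def resolve(db, key, response):
    """Store the human's answer only if none exists yet. Returns True on the first resolve."""
    cur = db.execute(
        "UPDATE approvals SET response = ? WHERE idempotency_key = ? AND response IS NULL",
        (json.dumps(response), key),
    )
    db.commit()
    return cur.rowcount == 1

db = open_store()
request_approval(db, "transfer-42", {"amount": 500})
request_approval(db, "transfer-42", {"amount": 500})      # crash-and-retry: ignored
first = resolve(db, "transfer-42", {"approved": True})    # first answer wins
second = resolve(db, "transfer-42", {"approved": False})  # duplicate delivery: rejected
pending_rows = db.execute("SELECT COUNT(*) FROM approvals").fetchone()[0]
```

The agent only acts when resolve() returns True, so a second click or a redelivered webhook cannot trigger the action twice. Because the pending row lives in the database rather than in a Python object, any worker can pick it up after a restart.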

The scorecard

Rank	Framework	Durability	Idempotency	Typed I/O	Channel	Verifier	UI	Total
1	LangGraph	5	3	3	1	1	2	15
2	Pydantic AI	4	4	5	1	1	0	15
3	Mastra	4	3	4	2	1	1	15
4	OpenAI Agents SDK	3	3	3	1	1	0	11
5	LlamaIndex	3	2	3	1	1	0	10
6	Haystack	2	1	2	2	1	1	9
7	Semantic Kernel	2	2	2	1	1	0	8
8	CrewAI	2	1	1	1	1	0	6
9	Claude Agent SDK	1	1	1	1	1	0	5
10	LangChain (legacy)	1	1	1	1	1	0	5
11	AutoGen	1	1	1	1	1	0	5
12	smolagents	1	1	1	1	1	0	5

Nobody scores above 15/30. Four frameworks tie at the bottom on 5/30, each for a different reason, but the headline number is the same.
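If you want to check the arithmetic yourself, the scorecard fits in a small dictionary. This sketch copies the per-axis scores from the table above and recomputes the totals, the leaders, and the bottom tie; the variable names are mine.

```python
# Axis order: durability, idempotency, typed I/O, channel, verifier, UI
SCORES = {
    "LangGraph":          (5, 3, 3, 1, 1, 2),
    "Pydantic AI":        (4, 4, 5, 1, 1, 0),
    "Mastra":             (4, 3, 4, 2, 1, 1),
    "OpenAI Agents SDK":  (3, 3, 3, 1, 1, 0),
    "LlamaIndex":         (3, 2, 3, 1, 1, 0),
    "Haystack":           (2, 1, 2, 2, 1, 1),
    "Semantic Kernel":    (2, 2, 2, 1, 1, 0),
    "CrewAI":             (2, 1, 1, 1, 1, 0),
    "Claude Agent SDK":   (1, 1, 1, 1, 1, 0),
    "LangChain (legacy)": (1, 1, 1, 1, 1, 0),
    "AutoGen":            (1, 1, 1, 1, 1, 0),
    "smolagents":         (1, 1, 1, 1, 1, 0),
}

totals = {name: sum(axes) for name, axes in SCORES.items()}
best, worst = max(totals.values()), min(totals.values())
leaders = [name for name, total in totals.items() if total == best]
bottom = [name for name, total in totals.items() if total == worst]

for name, total in totals.items():
    print(f"{name:<20} {total:>2}/30")
```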

The top: LangGraph, Pydantic AI, Mastra

These three are the closest things to production-ready. They still leave 50% of the rubric to you, but at least the durable pause primitive works.

LangGraph — best-in-class durability, BYO everything else

LangGraph's interrupt() pauses a graph node; resuming is graph.invoke(Command(resume=value), config={...}). Critically, and this is what separates it from the field, when paired with a PostgresSaver checkpointer, the paused thread state lives in Postgres. Any worker can resume it. The docs explicitly walk through worker-restart resilience.

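from langgraph.types import interrupt, Command

def approval_node(state):
    # Pauses the graph here; the dict is surfaced to whoever resumes the thread.
    decision = interrupt({"question": "approve transfer?", "amount": state["amount"]})
    # On resume, interrupt() returns the value passed via Command(resume=...).
    return {"approved": decision}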

What you'll discover the hard way: the docs warn that code before interrupt() runs twice on resume, because the node re-executes from its first line. "Place pure computation before, side effects after." That's the developer's idempotency burden, not the framework's. There's also no channel abstraction: the interrupt() payload is just a dict, and how that becomes a Slack message is entirely on you.
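The "runs twice" behavior is easier to see in a toy model. The sketch below is plain Python, not LangGraph: it assumes a runtime that pauses by raising an exception and resumes by calling the node again from the top, which is the replay shape the warning describes. Pause, run_then_resume, and both nodes are hypothetical names.

```python
class Pause(Exception):
    """Signals that the node is waiting for a human (illustrative only)."""

side_effects = []  # records every side effect so we can count them

def unsafe_node(resume_value=None):
    side_effects.append("email_sent")   # side effect BEFORE the pause point
    if resume_value is None:
        raise Pause("approve transfer?")
    return {"approved": resume_value}

def safe_node(resume_value=None):
    if resume_value is None:            # only pure checks before the pause point
        raise Pause("approve transfer?")
    side_effects.append("email_sent")   # side effect AFTER the human answered
    return {"approved": resume_value}

def run_then_resume(node, answer):
    """First call pauses; the resume re-executes the node from its first line."""
    try:
        return node()
    except Pause:
        return node(resume_value=answer)

unsafe_result = run_then_resume(unsafe_node, True)
unsafe_count = len(side_effects)   # the email went out on both passes

side_effects.clear()
safe_result = run_then_resume(safe_node, True)
safe_count = len(side_effects)     # the email went out once
```

Both nodes return the same result; only the number of emails differs. That is why this class of bug tends to surface in production rather than in tests that only assert on the return value.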

LangGraph Platform ships a task UI. The OSS package does not.

Verdict: Best durability story in the survey. Everything above the storage layer is BYO.

Pydantic AI — best typed API, weakest UI story

Pydantic AI's Deferred Tools is the cleanest API I saw. A tool marked requires_approval=True causes the run to end with a DeferredToolRequests output. You collect approvals as DeferredToolResults and resume by passing both into the next agent run.

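from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolApproved

# Assumes `agent` was created with output_type=[str, DeferredToolRequests],
# so a run can end with pending approvals instead of a final answer.

# tool_plain: this tool takes no RunContext argument.
@agent.tool_plain(requires_approval=True)
async def transfer_funds(amount: int, to: str) -> str:
    return f"sent {amount} to {to}"

result = await agent.run("send $500 to Bob")
if isinstance(result.output, DeferredToolRequests):
    # Approves every pending call; a real system collects these from a human.
    approvals = {call.tool_call_id: ToolApproved() for call in result.output.approvals}
    result = await agent.run(
        message_history=result.all_messages(),
        deferred_tool_results=DeferredToolResults(approvals=approvals),
    )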

Everything is Pydantic-typed. Durability comes via integration with Restate (or Temporal, Prefect), which journals every step including the await. That's powerful, but "we ship a primitive plus a separate runtime you also have to learn" is not zero-config.
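"Journals every step" is the key idea behind durable-execution runtimes, so here is a toy version of it. This is a sketch of the general technique, not Restate's, Temporal's, or Prefect's API: the Journal class and the workflow are hypothetical, and the journal lives in memory where a real runtime would persist it.

```python
class Journal:
    """Record of completed step results, keyed by step name."""

    def __init__(self):
        self.entries = {}

    def step(self, name, fn):
        """Run fn once; on replay, return the recorded result instead of re-running it."""
        if name not in self.entries:
            self.entries[name] = fn()
        return self.entries[name]

executions = []  # tracks which steps actually ran

def fetch_quote():
    executions.append("fetch_quote")
    return 500

def send_transfer():
    executions.append("send_transfer")
    return "sent"

def workflow(journal, approved=None):
    amount = journal.step("fetch_quote", fetch_quote)
    if approved is None:
        return "waiting for approval"   # the await: nothing past this point runs yet
    if not approved:
        return "rejected"
    status = journal.step("send_transfer", send_transfer)
    return f"{status} {amount}"

journal = Journal()                          # persisted by a real runtime
first = workflow(journal)                    # runs fetch_quote, then waits
second = workflow(journal, approved=True)    # replay: fetch_quote is skipped
third = workflow(journal, approved=True)     # duplicate resume: nothing re-runs
```

The workflow function is re-entered from the top on every resume, yet each journaled step executes once. The idempotency burden moves from your code into the runtime, at the price of adopting that runtime.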

Verdict: The typing is exactly right. The rest of the stack is a separate framework.

Mastra — best TypeScript story, channels don't compose

Mastra workflows expose suspend() / resume(). When a step suspends, the workflow snapshot is persisted (PostgreSQL, Upstash Redis, etc.), and resume() can be called from any HTTP endpoint with the matching run ID. Both suspend and resume payloads are Zod-typed.

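import { createStep } from "@mastra/core/workflows";
import { z } from "zod";

const approvalStep = createStep({
  id: "approval",
  inputSchema: z.object({ amount: z.number() }),
  outputSchema: z.object({ approved: z.boolean() }),
  resumeSchema: z.object({ approved: z.boolean() }),
  execute: async ({ resumeData, suspend }) => {
    if (!resumeData) {
      // First pass: persist a snapshot and stop until resume() is called.
      return await suspend({ requestId: crypto.randomUUID() });
    }
    return { approved: resumeData.approved };
  },
});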

The catch: Mastra ships "Channels" (Slack, Discord, Telegram) — but for agents, not workflows. You can't say "suspend this step and route it to #ops on Slack" without writing the glue yourself. The community-maintained assistant-ui/mastra-hitl repo exists precisely because there's no canonical built-in approval UI for production.

Verdict: The best TS option. The last mile is still on you.

The middle: OpenAI Agents SDK, LlamaIndex, Haystack, Semantic Kernel

These four ship a recognizable HITL primitive but trip over a single major axis each.

OpenAI Agents SDK (11/30) has clean needs_approval semantics and a RunState you can serialize. But the SDK assumes you'll bring your own queue, storage, and UI. No channel. No idempotency guarantee beyond what serialize-and-resume gives you.
LlamaIndex (10/30) has elegant event-driven HITL via wait_for_event — but the docs warn: "the runtime pauses by throwing an internal control-flow exception and replays the entire step when the event arrives." Any side effects before the await run twice. Footgun.
Haystack (9/30) has the best-designed confirmation taxonomy (AlwaysAskPolicy, BlockingConfirmationStrategy, RichConsoleUI) — and ships only console UIs. Both built-in implementations block the Python process.
Semantic Kernel (8/30) has two half-built HITL mechanisms (IFunctionInvocationFilter and Process Framework KernelFunction parameters). The community is openly confused about which to use — issue #10832 literally asks "How to support HITL in Agent and Process frameworks."

The bottom: CrewAI, Claude Agent SDK, LangChain legacy, AutoGen, smolagents

This is where it gets bleak. Five frameworks. Each ships some pause-for-human primitive. Each one is input() or a close cousin of it.

CrewAI: `human_input=True` — and then what?

The API is one boolean:

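from crewai import Task

# `researcher` is an Agent defined elsewhere.
task = Task(
    description="Research the latest AI advancements...",
    expected_output="A comprehensive report",
    agent=researcher,
    human_input=True,  # asks for feedback in the terminal before finalizing
)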

The CrewAI community forum has an officially acknowledged thread titled "Human in the loop - workaround" that opens:

"HITL input is handled through the terminal, which does not work for a production web environment."

The recommended workaround is Streamlit or Chainlit — wrap your CrewAI process behind a UI and hijack stdin. The entire production HITL story for CrewAI is "wrap stdin."

LangChain (legacy), AutoGen, smolagents: the `input()` family

LangChain ships HumanInputRun and HumanApprovalCallbackHandler. Both wrap input(). The maintainers' standing recommendation for production HITL is "move to LangGraph" — Discussion #21524, Discussion #28217.
AutoGen's human_input_mode="ALWAYS" is a 2023 primitive still shipping in 2026. Issues #2358 (persistence roadmap) and #5806 (mid-run checkpointing) are both open.
smolagents has step_callbacks and agent.interrupt(). Memory is in-process. Issue #364 tracks a state-serialization request — open, unfilled.

Claude Agent SDK: not actually HITL

The SDK's canUseTool is permission gating, not human-in-the-loop. The callback is synchronous within the agent loop. If you want to pause for a human review, you make canUseTool block on a Promise that resolves when the human answers. There's no built-in story for "human two timezones away approves via Slack." The SDK is honest about this — it's designed for Claude Code's interactive CLI approval, not for distributed agent ops.
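Here is roughly what "block on a Promise" looks like, translated to Python's asyncio. This is not the Claude Agent SDK's API: blocking_permission_gate and human_answers are hypothetical names, and the point is only to show where the pending state lives.

```python
import asyncio

pending = {}  # tool_call_id -> Future; exists only in this process's memory

async def blocking_permission_gate(tool_call_id, tool_name):
    """Hypothetical permission callback: holds the agent loop until a human answers."""
    future = asyncio.get_running_loop().create_future()
    pending[tool_call_id] = future
    try:
        return await future
    finally:
        pending.pop(tool_call_id, None)

def human_answers(tool_call_id, allowed):
    """Called by whatever receives the decision (an HTTP handler, a Slack bot)."""
    pending[tool_call_id].set_result(allowed)

async def main():
    gate = asyncio.create_task(blocking_permission_gate("call-1", "place_order"))
    await asyncio.sleep(0)              # let the gate register its future
    waiting = "call-1" in pending
    human_answers("call-1", True)
    allowed = await gate
    return waiting, allowed, dict(pending)

waiting, allowed, leftover = asyncio.run(main())
```

It works as long as the process stays alive. The pending dictionary is the entire "queue", so a deploy or crash while the approver is asleep loses the request along with the agent's state.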

The pattern: three universal gaps

Across all twelve frameworks, three axes are universally missing or near-missing:

1. Channel abstraction. Only Mastra ships anything resembling a channel adapter system. Even there, channels live on agents, not on workflows. Every other framework hands you a payload and walks away.

2. Verifier hook. Zero out of twelve frameworks ship a slot for "the human's response goes through an AI quality check before resuming the agent." The idea that a junior approver clicking "approve" should be pre-validated by an LLM before the agent trusts it — fully absent from the field.

3. Default UI. Zero OSS dashboards for in-flight HITL tasks. LangGraph Platform has one (paid). Mastra has a dev playground (not production). The rest assume you'll build it.
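The first two gaps are small at the interface level. As a sketch, a channel adapter can be a single-method protocol and a verifier a function that sees both the request and the response. Everything below is hypothetical: the class names, the scripted channels standing in for Slack and email, and the rule-based verifier standing in for an LLM check.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass(frozen=True)
class ApprovalRequest:
    request_id: str
    question: str
    amount: int

@dataclass(frozen=True)
class ApprovalResponse:
    request_id: str
    approved: bool
    reason: str

class Channel(Protocol):
    def deliver(self, request: ApprovalRequest) -> ApprovalResponse: ...

class ScriptedChannel:
    """Test double standing in for Slack or email: returns a canned answer."""

    def __init__(self, name: str, approved: bool, reason: str):
        self.name, self.approved, self.reason = name, approved, reason

    def deliver(self, request: ApprovalRequest) -> ApprovalResponse:
        return ApprovalResponse(request.request_id, self.approved, self.reason)

Verifier = Callable[[ApprovalRequest, ApprovalResponse], bool]

def require_reason_for_large_amounts(request, response):
    """Stand-in verifier: a fixed rule where a real one might call a model."""
    rubber_stamp = response.approved and request.amount >= 1000 and not response.reason.strip()
    return not rubber_stamp

def ask_human(request: ApprovalRequest, channel: Channel, verifier: Verifier):
    """Deliver via any channel, then verify the answer before the agent trusts it."""
    response = channel.deliver(request)
    if not verifier(request, response):
        return ApprovalResponse(request.request_id, False, "rejected by verifier")
    return response

request = ApprovalRequest("req-1", "approve transfer?", 5000)
slack = ScriptedChannel("slack", approved=True, reason="")
email = ScriptedChannel("email", approved=True, reason="matches the signed invoice")
via_slack = ask_human(request, slack, require_reason_for_large_amounts)
via_email = ask_human(request, email, require_reason_for_large_amounts)
```

Swapping Slack for email changes one argument, and the agent code never learns which channel answered. The verifier turns the bare "approve" into a rejection because a large transfer arrived without a reason.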

The aggregate pattern: the industry has reduced HITL to "block on stdin," and ten of twelve frameworks are one production deploy away from breaking.

What "good" actually looks like

A production HITL primitive should hit all six axes of the rubric above:

Persist to durable storage by default — Postgres or equivalent. No in-process state.
Emit a typed payload describing what the human is being asked. Pydantic / Zod.
Route through a pluggable channel adapter — Slack, email, dashboard, SMS. One config knob. Audit row written on every delivery + response.
Accept a typed response from any channel, idempotent on duplicate deliveries.
Optionally pipe the response through a verifier model before resuming the agent.
Expose an OSS admin UI so ops teams can see what's in flight.

No framework in this audit ships all six. So I built one: awaithumans. One function call (await_human / awaitHuman), all six properties in the box, channel adapters for Slack and email shipping today, dashboard included, audit trail by default, optional Claude/OpenAI/Gemini/Azure verifier. Apache 2.0. Python + TypeScript. Adapters for the two frameworks that got close — LangGraph and the broader Pydantic AI / Temporal combo — built in.

It is not the final answer. It's an honest answer to the gap this audit exposes.

Methodology

For each framework I read the official docs, the README, the examples folder, and grepped open GitHub issues for "human", "approval", "interrupt", "input", "pause". Code samples are copied from canonical docs; I didn't make any of them up. Scores are subjective on a 1-5 scale per axis but the scorecard is reproducible — pull the framework, grep for the same terms, you'll find the same gaps. The full per-framework deep dive (with source URLs for every claim) lives in the research notes. Audit date: 2026-05-24; the field is moving fast and at least three of these frameworks have HITL work in flight — see linked issues per section. If your framework is on this list and you've shipped one of the missing axes since the audit date, I'd love to update the scorecard — open an issue at github.com/awaithumans/awaithumans or DM @awaithumans on X.

Sources

Selected, in order of relevance. The full list of ~50 source URLs is in the research repo.

LangGraph — Interrupts · persistence
Pydantic AI — Deferred Tools · Restate integration
Mastra — HITL workflows · community HITL repo
OpenAI Agents SDK — HITL Python · HITL JS
LlamaIndex — HITL workflows
Haystack — HITL docs
Semantic Kernel — Process Framework HITL · issue #10832
CrewAI — community workaround thread
Claude Agent SDK — permissions docs
LangChain (legacy) — Discussion #21524 · Discussion #28217
AutoGen — HITL tutorial · Issue #2358
smolagents — Plan customization · Issue #364

If you maintain one of these frameworks and want to argue for a higher score on any axis, please do: open an issue with evidence and I'll update the scorecard. The point of the audit isn't to dunk; it's to make the gap legible.

P.S. — the browser-agent + awaithumans template repo shows what this looks like wired up: a browser-use agent that calls await_human() before clicking Place Order. Slack DM with cart screenshot, one-tap approve, agent resumes. ~90 lines.

Top comments (2)

Harjot Singh • May 31

This is a great audit because human-approval is the feature everyone bolts on last and it shows. The "most badly" finding tracks with what I see: most frameworks treat approval as a UI prompt (pause, ask, resume) rather than a real control primitive - which breaks the moment you need durable approvals (the run outlives the session), audit trails (who approved what, when), or scoped approval (this action yes, that one no). Approval-as-an-afterthought means it's fragile exactly when it matters: irreversible or expensive actions.

What good approval needs to be is a first-class gate in the execution model: the agent proposes, the action is BLOCKED until an authorized human (or a deterministic policy) approves, and the whole thing is logged. Propose-then-gate, not pause-then-hope. That's the spine of how I handle it in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - consequential actions are gated and logged, not left to a transient prompt. Genuinely valuable comparison - this is the unsexy primitive that decides whether you can trust an agent with real authority. Of the 12, did ANY get it right (durable + auditable + scoped), or is it universally an afterthought? Curious if there's a reference implementation worth pointing people to.
