I keep watching teams ship agent systems into production and then discover, on day three, that "the agent needs to wait for a human sometimes" breaks every assumption in their stack. Not because they didn't see it coming; every team plans for HITL. It's because every popular agent framework reduces "human in the loop" to "block the Python process on input() and hope for the best."
I spent a day auditing the twelve most popular AI-agent frameworks against a strict production rubric. The results aren't kind. Two frameworks pass. Ten are one production deploy away from breaking.
This post is the receipts.
The rubric (and why it's strict)
A production HITL primitive isn't "the agent can pause for input." That's a 1980s primitive. A production HITL primitive needs six properties:
| Axis | What it measures |
|---|---|
| Durability | Does the agent survive a worker restart during a pending await? Is the paused state stored in durable storage (Postgres), not in-process memory? |
| Idempotency | If the agent retries after a crash, can the same approval resolve once without double-acting? |
| Typed I/O | Is the request payload AND the human's response a typed schema (Pydantic / Zod)? Or just str? |
| Channel abstraction | Can you swap channel (terminal → Slack → email → dashboard) without rewriting the agent? |
| Verifier hook | Is there a built-in slot for an AI quality check on the human's response before resuming? |
| Default UI | Does the framework ship an admin UI to view, claim, resolve in-flight tasks? |
Score 1 (absent / broken) to 5 (production-ready primitive in core). Max 30.
If you think six axes is too strict: the alternative is your agent process dying because a worker rotated mid-await, your retry double-charging a customer, your "approve y/n" prompt happening on stdin in a Slack channel only you can see, and you discovering all of this two weeks after you shipped.
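To make the Durability and Idempotency axes concrete, here is a minimal sketch of "resolve once" semantics. It is not taken from any framework in this audit: the table layout and the function names (`open_store`, `request_approval`, `resolve`) are my own placeholders, and SQLite stands in for Postgres so the example runs with the standard library alone.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the approvals table. SQLite stands in for Postgres in this sketch."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS approvals ("
        " task_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " decision TEXT)"
    )
    return db


def request_approval(db, task_id):
    """Record a pending approval; re-requesting after a retry is a no-op."""
    db.execute("INSERT OR IGNORE INTO approvals (task_id) VALUES (?)", (task_id,))
    db.commit()


def resolve(db, task_id, decision):
    """Resolve a pending approval at most once.

    Returns True only for the call that flipped the row out of 'pending',
    so the caller can gate the side effect (for example, the charge) on it.
    """
    cur = db.execute(
        "UPDATE approvals SET status = 'resolved', decision = ?"
        " WHERE task_id = ? AND status = 'pending'",
        (decision, task_id),
    )
    db.commit()
    return cur.rowcount == 1


db = open_store()
request_approval(db, "transfer-42")
request_approval(db, "transfer-42")               # agent retried after a crash: ignored
first = resolve(db, "transfer-42", "approved")    # the real approval
second = resolve(db, "transfer-42", "approved")   # duplicate delivery of the same click
```

Because the paused state lives in a table rather than in a Python frame, any worker can pick it up, and the conditional `UPDATE` means a duplicate approval cannot trigger the action twice.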
The scorecard
| Rank | Framework | Durability | Idempotency | Typed I/O | Channel | Verifier | UI | Total |
|---|---|---|---|---|---|---|---|---|
| 1 | LangGraph | 5 | 3 | 3 | 1 | 1 | 2 | 15 |
| 2 | Pydantic AI | 4 | 4 | 5 | 1 | 1 | 0 | 15 |
| 3 | Mastra | 4 | 3 | 4 | 2 | 1 | 1 | 15 |
| 4 | OpenAI Agents SDK | 3 | 3 | 3 | 1 | 1 | 0 | 11 |
| 5 | LlamaIndex | 3 | 2 | 3 | 1 | 1 | 0 | 10 |
| 6 | Haystack | 2 | 1 | 2 | 2 | 1 | 1 | 9 |
| 7 | Semantic Kernel | 2 | 2 | 2 | 1 | 1 | 0 | 8 |
| 8 | CrewAI | 2 | 1 | 1 | 1 | 1 | 0 | 6 |
| 9 | Claude Agent SDK | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 10 | LangChain (legacy) | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 11 | AutoGen | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
| 12 | smolagents | 1 | 1 | 1 | 1 | 1 | 0 | 5 |
Nobody scores above 15/30. Four frameworks tie at the bottom on 5/30, each for a different reason, but the headline number is the same.
The top: LangGraph, Pydantic AI, Mastra
These three are the closest things to production-ready. They still leave 50% of the rubric to you, but at least the durable pause primitive works.
LangGraph — best-in-class durability, BYO everything else
LangGraph's interrupt() pauses a graph node; resuming is graph.invoke(Command(resume=value), config={...}). Critically, and this is what separates it from the field, when paired with a PostgresSaver checkpointer, the paused thread state lives in Postgres. Any worker can resume it. The docs explicitly walk through worker-restart resilience.
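```python
from langgraph.types import interrupt, Command

def approval_node(state):
    # interrupt() pauses the graph here and surfaces the payload to the caller.
    decision = interrupt({"question": "approve transfer?", "amount": state["amount"]})
    # On resume, `decision` is the value passed in via Command(resume=...).
    return {"approved": decision}
```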
What you'll discover the hard way: the docs warn that code before interrupt() runs twice on resume. "Place pure computation before, side effects after." That's the developer's idempotency burden, not the framework's. There's also no channel abstraction: the interrupt() payload is just a dict, and how that becomes a Slack message is entirely on you.
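The "runs twice on resume" behavior is easier to see in a toy model. The sketch below is a plain-Python simulation of replay semantics, not LangGraph's implementation: `Paused`, the local `interrupt()` stand-in, and `run_with_human` are all hypothetical names I made up to show why the placement of side effects matters.

```python
class Paused(Exception):
    """Raised by the toy interrupt() when no resume value is available yet."""

    def __init__(self, payload):
        super().__init__("paused")
        self.payload = payload


side_effects = []  # stands in for emails sent, charges made, and so on


def interrupt(payload, resume_value=None):
    # Toy stand-in: pause on the first pass, return the answer on the replay.
    if resume_value is None:
        raise Paused(payload)
    return resume_value


def unsafe_node(state, resume_value=None):
    side_effects.append("notify-finance")      # runs on BOTH passes
    decision = interrupt({"amount": state["amount"]}, resume_value)
    return {"approved": decision}


def safe_node(state, resume_value=None):
    decision = interrupt({"amount": state["amount"]}, resume_value)
    side_effects.append("execute-transfer")    # runs only after the resume
    return {"approved": decision}


def run_with_human(node, state, answer):
    """First pass pauses; the second pass re-runs the node from the top."""
    try:
        return node(state)
    except Paused:
        return node(state, resume_value=answer)


run_with_human(unsafe_node, {"amount": 500}, True)
unsafe_count = side_effects.count("notify-finance")

run_with_human(safe_node, {"amount": 500}, True)
safe_count = side_effects.count("execute-transfer")
```

In this model the node placed before the pause fires its side effect twice, while the one placed after fires once, which is the rule the docs ask you to follow by hand.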
LangGraph Platform ships a task UI. The OSS package does not.
Verdict: Best durability story in the survey. Everything above the storage layer is BYO.
Pydantic AI — best typed API, weakest UI story
Pydantic AI's Deferred Tools is the cleanest API I saw. A tool marked requires_approval=True causes the run to end with a DeferredToolRequests output. You collect approvals as DeferredToolResults and resume by passing both into the next agent run.
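```python
from pydantic_ai.tools import DeferredToolRequests, DeferredToolResults, ToolApproved

# Assumes `agent` was created with DeferredToolRequests among its output types,
# and that this snippet runs inside an async function.
# tool_plain is used because this tool takes no RunContext argument.
@agent.tool_plain(requires_approval=True)
async def transfer_funds(amount: int, to: str) -> str:
    return f"sent {amount} to {to}"

result = await agent.run("send $500 to Bob")
if isinstance(result.output, DeferredToolRequests):
    # Approve every pending call; a real handler would ask a human per call.
    approvals = {call.tool_call_id: ToolApproved() for call in result.output.approvals}
    result = await agent.run(
        message_history=result.all_messages(),
        deferred_tool_results=DeferredToolResults(approvals=approvals),
    )
```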
Everything is Pydantic-typed. Durability comes via integration with Restate (or Temporal, Prefect), which journals every step including the await. That's powerful, but "we ship a primitive plus a separate runtime you also have to learn" is not zero-config.
Verdict: The typing is exactly right. The rest of the stack is a separate framework.
Mastra — best TypeScript story, channels don't compose
Mastra workflows expose suspend() / resume(). When a step suspends, the workflow snapshot is persisted (PostgreSQL, Upstash Redis, etc.), and resume() can be called from any HTTP endpoint with the matching run ID. Both suspend and resume payloads are Zod-typed.
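```typescript
import { createStep } from "@mastra/core/workflows";
import { z } from "zod";

const approvalStep = createStep({
  id: "approval",
  inputSchema: z.object({ amount: z.number() }),
  resumeSchema: z.object({ approved: z.boolean() }),
  execute: async ({ resumeData, suspend }) => {
    // First pass: no resume data yet, so persist a snapshot and pause.
    if (!resumeData) {
      await suspend({ requestId: crypto.randomUUID() });
      return;
    }
    // Resumed pass: resumeData has been validated against resumeSchema.
    return { approved: resumeData.approved };
  },
});
```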
The catch: Mastra ships "Channels" (Slack, Discord, Telegram) — but for agents, not workflows. You can't say "suspend this step and route it to #ops on Slack" without writing the glue yourself. The community-maintained assistant-ui/mastra-hitl repo exists precisely because there's no canonical built-in approval UI for production.
Verdict: The best TS option. The last mile is still on you.
The middle: OpenAI Agents SDK, LlamaIndex, Haystack, Semantic Kernel
These four ship a recognizable HITL primitive but trip over a single major axis each.
- OpenAI Agents SDK (11/30) has clean `needs_approval` semantics and a `RunState` you can serialize. But the SDK assumes you'll bring your own queue, storage, and UI. No channel. No idempotency guarantee beyond what serialize-and-resume gives you.
- LlamaIndex (10/30) has elegant event-driven HITL via `wait_for_event`, but the docs warn: "the runtime pauses by throwing an internal control-flow exception and replays the entire step when the event arrives." Any side effects before the await run twice. Footgun.
- Haystack (9/30) has the best-designed confirmation taxonomy (`AlwaysAskPolicy`, `BlockingConfirmationStrategy`, `RichConsoleUI`) and ships only console UIs. Both built-in implementations block the Python process.
- Semantic Kernel (8/30) has two half-built HITL mechanisms (`IFunctionInvocationFilter` and Process Framework `KernelFunction` parameters). The community is openly confused about which to use: issue #10832 literally asks "How to support HITL in Agent and Process frameworks."
The bottom: CrewAI, Claude Agent SDK, LangChain legacy, AutoGen, smolagents
This is where it gets bleak. Five frameworks. Each ships some pause-for-human primitive. Each one is input().
CrewAI: human_input=True — and then what?
The API is one boolean:
```python
from crewai import Task

# `researcher` is an Agent defined elsewhere.
task = Task(
    description="Research the latest AI advancements...",
    expected_output="A comprehensive report",
    agent=researcher,
    human_input=True,
)
```
The CrewAI community forum has an officially acknowledged thread titled "Human in the loop - workaround" that opens:
"HITL input is handled through the terminal, which does not work for a production web environment."
The recommended workaround is Streamlit or Chainlit — wrap your CrewAI process behind a UI and hijack stdin. The entire production HITL story for CrewAI is "wrap stdin."
LangChain (legacy), AutoGen, smolagents: the input() family
- LangChain ships `HumanInputRun` and `HumanApprovalCallbackHandler`. Both wrap `input()`. The maintainers' standing recommendation for production HITL is "move to LangGraph" (Discussion #21524, Discussion #28217).
- AutoGen's `human_input_mode="ALWAYS"` is a 2023 primitive still shipping in 2026. Issues #2358 (persistence roadmap) and #5806 (mid-run checkpointing) are both open.
- smolagents has `step_callbacks` and `agent.interrupt()`. Memory is in-process. Issue #364 tracks a state-serialization request: open, unfilled.
Claude Agent SDK: not actually HITL
The SDK's canUseTool is permission gating, not human-in-the-loop. The callback is synchronous within the agent loop. If you want to pause for a human review, you make canUseTool block on a Promise that resolves when the human answers. There's no built-in story for "human two timezones away approves via Slack." The SDK is honest about this — it's designed for Claude Code's interactive CLI approval, not for distributed agent ops.
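Here is roughly what "block the callback until the human answers" looks like, sketched in Python's asyncio with a Future in place of a Promise. The names `can_use_tool`, `human_answers`, and `pending` are hypothetical and do not match the SDK's actual callback signature; the point is only to show where the waiting state lives.

```python
import asyncio

pending = {}  # request_id -> Future; exists only in this process's memory


async def can_use_tool(request_id, tool_name):
    """Hypothetical permission callback: holds the agent loop until a human answers."""
    fut = asyncio.get_running_loop().create_future()
    pending[request_id] = fut
    try:
        return await fut
    finally:
        pending.pop(request_id, None)


def human_answers(request_id, allow):
    """Called by whatever receives the human's reply (webhook handler, CLI, ...)."""
    fut = pending.get(request_id)
    if fut is not None and not fut.done():
        fut.set_result(allow)


async def demo():
    gate = asyncio.create_task(can_use_tool("req-1", "place_order"))
    await asyncio.sleep(0)            # let the callback register its future
    waiting = "req-1" in pending
    human_answers("req-1", True)
    return waiting, await gate


waiting, allowed = asyncio.run(demo())
```

The pending request is a Future in a dict. If the worker restarts while the reviewer is asleep two timezones away, the dict is gone and so is the request, which is why this pattern scores 1 on durability.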
The pattern: three universal gaps
Across all twelve frameworks, three axes are universally missing or near-missing:
1. Channel abstraction. Only Mastra ships anything resembling a channel adapter system. Even there, channels live on agents, not on workflows. Every other framework hands you a payload and walks away.
2. Verifier hook. Zero out of twelve frameworks ship a slot for "the human's response goes through an AI quality check before resuming the agent." The idea that a junior approver clicking "approve" should be pre-validated by an LLM before the agent trusts it — fully absent from the field.
3. Default UI. Zero OSS dashboards for in-flight HITL tasks. LangGraph Platform has one (paid). Mastra has a dev playground (not production). The rest assume you'll build it.
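For the first two gaps, the shape of the missing abstraction is small. The following is a sketch under my own assumptions, not any framework's API: `Channel`, `TerminalChannel`, `SlackChannel`, `ask_human`, and `accept_response` are illustrative names, the Slack adapter is a stub that records messages instead of calling Slack, and `length_verifier` is a trivial placeholder for an LLM quality check.

```python
from typing import Callable, Optional, Protocol


class Channel(Protocol):
    """Anything that can deliver a question to a human."""

    def deliver(self, task_id: str, question: str) -> None: ...


class TerminalChannel:
    def __init__(self):
        self.sent = []

    def deliver(self, task_id, question):
        self.sent.append(f"[terminal] {task_id}: {question}")


class SlackChannel:
    """Stub: a real adapter would call Slack's API here."""

    def __init__(self, channel_name):
        self.channel_name = channel_name
        self.sent = []

    def deliver(self, task_id, question):
        self.sent.append(f"[{self.channel_name}] {task_id}: {question}")


def length_verifier(response: str) -> bool:
    # Placeholder for an LLM check: reject bare one-word approvals.
    return len(response.split()) >= 3


def ask_human(channel: Channel, task_id: str, question: str) -> None:
    # The agent code depends only on the Channel protocol, not on Slack.
    channel.deliver(task_id, question)


def accept_response(
    response: str, verifier: Optional[Callable[[str], bool]] = None
) -> bool:
    """Run the optional verifier before the agent is allowed to resume."""
    return verifier(response) if verifier else True


slack = SlackChannel("#ops")
ask_human(slack, "transfer-42", "approve transfer of $500?")
ok = accept_response("approved, matches invoice 1187", length_verifier)
rejected = accept_response("ok", length_verifier)
```

Swapping `SlackChannel` for `TerminalChannel` changes one constructor call and nothing in the agent, which is the "one config knob" property the Channel axis measures.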
The aggregate pattern: the industry has reduced HITL to "block on stdin," and ten of twelve frameworks are one production deploy away from breaking.
What "good" actually looks like
A production HITL primitive should hit all six axes of the rubric above:
- Persist to durable storage by default — Postgres or equivalent. No in-process state.
- Emit a typed payload describing what the human is being asked. Pydantic / Zod.
- Route through a pluggable channel adapter — Slack, email, dashboard, SMS. One config knob. Audit row written on every delivery + response.
- Accept a typed response from any channel, idempotent on duplicate deliveries.
- Optionally pipe the response through a verifier model before resuming the agent.
- Expose an OSS admin UI so ops teams can see what's in flight.
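The "typed payload" and "typed response" items can be illustrated without Pydantic or Zod. This is a standard-library sketch with hypothetical names (`ApprovalRequest`, `ApprovalResponse`, `parse_response`); a real implementation would more likely use a schema library, but the contract is the same: free text from a channel never reaches the agent unvalidated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRequest:
    task_id: str
    question: str
    amount_cents: int


@dataclass(frozen=True)
class ApprovalResponse:
    task_id: str
    approved: bool
    approver: str


def parse_response(raw: dict) -> ApprovalResponse:
    """Validate an untrusted channel payload into a typed response."""
    approved = raw.get("approved")
    if not isinstance(approved, bool):
        raise ValueError("approved must be a boolean, not free text")
    task_id, approver = raw.get("task_id"), raw.get("approver")
    if not isinstance(task_id, str) or not isinstance(approver, str):
        raise ValueError("task_id and approver must be strings")
    return ApprovalResponse(task_id=task_id, approved=approved, approver=approver)


request = ApprovalRequest("transfer-42", "approve transfer?", 50_000)
good = parse_response(
    {"task_id": request.task_id, "approved": True, "approver": "dana"}
)

try:
    # A stdin-style "y" is rejected instead of being coerced into an approval.
    parse_response({"task_id": request.task_id, "approved": "y", "approver": "dana"})
    string_rejected = False
except ValueError:
    string_rejected = True
```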
No framework in this audit ships all six. So I built one: awaithumans. One function call (await_human / awaitHuman), all six properties in the box, channel adapters for Slack and email shipping today, dashboard included, audit trail by default, optional Claude/OpenAI/Gemini/Azure verifier. Apache 2.0. Python + TypeScript. Adapters for the two frameworks that got close — LangGraph and the broader Pydantic AI / Temporal combo — built in.
It is not the final answer. It's an honest answer to the gap this audit exposes.
Methodology
For each framework I read the official docs, the README, the examples folder, and grepped open GitHub issues for "human", "approval", "interrupt", "input", "pause". Code samples are copied from canonical docs; I didn't make any of them up. Scores are subjective on a 1-5 scale per axis but the scorecard is reproducible — pull the framework, grep for the same terms, you'll find the same gaps. The full per-framework deep dive (with source URLs for every claim) lives in the research notes. Audit date: 2026-05-24; the field is moving fast and at least three of these frameworks have HITL work in flight — see linked issues per section. If your framework is on this list and you've shipped one of the missing axes since the audit date, I'd love to update the scorecard — open an issue at github.com/awaithumans/awaithumans or DM @awaithumans on X.
Sources
Selected, in order of relevance. The full list of ~50 source URLs is in the research repo.
- LangGraph — Interrupts · persistence
- Pydantic AI — Deferred Tools · Restate integration
- Mastra — HITL workflows · community HITL repo
- OpenAI Agents SDK — HITL Python · HITL JS
- LlamaIndex — HITL workflows
- Haystack — HITL docs
- Semantic Kernel — Process Framework HITL · issue #10832
- CrewAI — community workaround thread
- Claude Agent SDK — permissions docs
- LangChain (legacy) — Discussion #21524 · Discussion #28217
- AutoGen — HITL tutorial · Issue #2358
- smolagents — Plan customization · Issue #364
If you maintain one of these frameworks and want to argue for a higher score on any axis, please do: open an issue with evidence and I'll update the scorecard. The point of the audit isn't to dunk; it's to make the gap legible.
P.S. — the browser-agent + awaithumans template repo shows what this looks like wired up: a browser-use agent that calls await_human() before clicking Place Order. Slack DM with cart screenshot, one-tap approve, agent resumes. ~90 lines.