Agent State Management: How to Build Workflows That Recover Without You
Introduction
There's a comment I keep seeing from practitioners building real agent systems: "The APIs don't fail us as often as our own state does."
Ali Muwwakkil put it plainly in a thread on LLM API reliability:
"Agent reliability often falters not due to API limitations but because of poor state management. Developers sometimes overlook how critical it is to design agents capable of gracefully handling incomplete data states or failed calls."
He's right. And it reframes the API selection problem. You're not just asking "does this API work?" You're asking "when this API returns an ambiguous partial failure at 2am, can my agent figure out what happened and recover from a known state?"
That's a different question. And the answer depends on how well you've designed state into your workflow — and how well the API supports that design.
This guide covers the state management patterns that distinguish agent workflows that recover gracefully from those that require a human to untangle.
Why State Management Is the Multiplier on API Reliability
Consider two failure scenarios:
Scenario A: Your agent calls a payment API. It gets a structured 402 payment_required error with a Retry-After: 60 header. It checkpoints, waits 60 seconds, and retries from the exact pre-call state. No money moves twice. No duplicate charge. The resume is clean.
Scenario B: Your agent calls a CRM API. It gets a 500 Internal Server Error with no body. The contact creation may or may not have succeeded. The agent doesn't know if it should retry (which might duplicate the record) or move on (which might create a broken pipeline downstream).
The API reliability difference is real. But the bigger difference is whether your agent had a state design for Scenario B before it happened.
AN Score data confirms this: providers with high execution scores (Anthropic 8.4, Stripe 8.1, Google AI 8.3) earn those scores precisely because they emit structured errors that enable stateful recovery. Low-scoring providers (HubSpot 4.6, Salesforce 4.8) create state uncertainty as a side effect of their error design.
The defensive code tax estimate — typically 15-20% of agent implementation effort on real production systems — comes almost entirely from state uncertainty at failure boundaries, not from raw API instability.
State Categories in Agent Workflows
Before designing recovery, you need to know what kinds of state your workflow carries.
1. Ephemeral State
Variables in memory: the current plan, intermediate results, context window content. This state dies with the process.
Recovery cost: Recompute. Ephemeral state is fine to lose as long as you can reconstruct it from durable state.
2. Checkpoint State
A snapshot of "where I was" before a significant operation — the task parameters, the stage of execution, the last confirmed outcome.
Recovery cost: Resume from checkpoint. This is the core durability primitive for agent workflows.
3. Durable Side-Effect State
State written to external systems: the database record created, the email sent, the API resource provisioned. You can't roll these back — you can only verify them and decide what to do.
Recovery cost: Verify and branch. After a failure, check whether the side effect happened before deciding whether to retry.
The Four Patterns
Pattern 1: Checkpoint Before Destructive Writes
Before any call that creates a side effect you can't undo, write a checkpoint:
```json
{
  "task_id": "onboarding-abc123",
  "stage": "contact_creation",
  "input": { "email": "user@example.com", "plan": "pro" },
  "status": "in_progress",
  "started_at": "2026-04-03T10:00:00Z"
}
```
On any failure, the recovery path is:
- Read checkpoint
- Verify whether the side effect completed (look up the contact, check idempotency key)
- If completed: mark stage done, advance
- If not completed: retry from input
- If ambiguous: escalate or apply a conservative decision rule (e.g., assume not completed for idempotent operations)
Dependency on the API: This pattern works cleanly when the API supports idempotency keys (Stripe does — pass the same key, get the same result). It degrades when the API provides no duplicate-detection surface (HubSpot creates a new contact on each call with no deduplication by default).
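The recovery steps above can be sketched in code. This is a minimal illustration, assuming a hypothetical client that can look records up by idempotency key and an in-memory dict standing in for a durable checkpoint store; none of these names come from any provider's real SDK.

```python
checkpoints = {}  # in-memory stand-in for a durable checkpoint store

def checkpoint(task_id, **fields):
    checkpoints[task_id] = {**checkpoints.get(task_id, {}), **fields}

def recover_contact_creation(task_id, api):
    """Read the checkpoint, verify the side effect, then branch."""
    state = checkpoints[task_id]
    existing = api.find_contact(idempotency_key=task_id)  # verification read
    if existing is not None:
        checkpoint(task_id, status="done", output=existing)  # completed: advance
        return existing
    # Not completed: safe to retry from the checkpointed input.
    created = api.create_contact(idempotency_key=task_id, **state["input"])
    checkpoint(task_id, status="done", output=created)
    return created
```

The ambiguous branch is deliberately absent here: what counts as "ambiguous" is API-specific, and a conservative rule like "assume not completed" is only safe when the retry carries an idempotency key.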
Pattern 2: Verify After Side Effect, Not Just After Call
The call completing ≠ the side effect committed. Network partitions, database replication lag, and eventual consistency all create windows where your API client got a 200 OK and the world hasn't caught up yet.
The right pattern for consequential side effects:
1. Call create_contact(email=..., idempotency_key=task_id)
2. Wait for 200 OK
3. Call get_contact(email=...) → verify record exists
4. Only advance workflow stage when verification passes
This adds latency but eliminates a class of silent failures that would otherwise surface hours later as mysteriously missing records.
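The four steps can be sketched as a small helper. The client, its methods, and the retry budget are assumptions for illustration; a real integration would use the provider's SDK calls in their place.

```python
import time

def create_and_verify(api, email, task_id, attempts=3, delay=0.0):
    api.create_contact(email=email, idempotency_key=task_id)  # steps 1-2
    for _ in range(attempts):                                 # step 3
        if api.get_contact(email=email) is not None:
            return True  # step 4: verification passed, safe to advance
        time.sleep(delay)  # allow for replication lag before re-reading
    return False  # never verified: treat the state as ambiguous and escalate
```

The bounded retry loop matters: a single verification read can return stale data under replication lag, but an unbounded loop would hang the workflow on a genuinely failed write.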
AN Score connection: The AN Score schema_stability dimension measures how predictable API responses are across calls. High-stability APIs (Stripe, Twilio) make verification reads reliable. Low-stability APIs (some CRM APIs) have cross-object consistency issues where verification reads can return stale data on the same node that just wrote.
Pattern 3: Scope Failure Recovery to the Smallest Possible Unit
The blast radius of a state failure scales with how large a unit of work was in progress when the failure happened.
An agent that creates 500 contact records in one monolithic loop has no recovery path without reading back all 500 records and diffing against input. An agent that creates contacts in checkpointed batches of 10 can resume from the last successful batch checkpoint.
The principle: match your checkpoint frequency to your acceptable rework cost.
For APIs with per-call costs (metered APIs, token-based models), there's a tradeoff between checkpoint overhead and retry cost. A rough heuristic: checkpoint after any non-idempotent write that would cost more to redo than the checkpoint overhead.
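A sketch of checkpointed batch processing under that heuristic. The `process` callable, the store, and the batch size are illustrative; the batch size is the knob that bounds rework cost.

```python
def run_batched(items, process, store, task_id, batch_size=10):
    start = store.get(task_id, 0)  # resume from the last committed batch
    for i in range(start, len(items), batch_size):
        for item in items[i:i + batch_size]:
            process(item)
        store[task_id] = i + batch_size  # checkpoint: batch committed
```

After a crash, re-running the same call skips every batch that was already checkpointed and redoes at most one partial batch, so the worst-case rework is `batch_size` items rather than the whole input.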
Pattern 4: Design Explicit Recovery Paths, Not Just Error Handlers
Most agent implementations handle errors. Fewer implement recovery paths.
The difference:
Error handler:
```python
try:
    result = api.create(...)
except APIError as e:
    log(e)
    raise
```
Recovery path:
```python
checkpoint(task_id, stage="pre_create", input=params)
try:
    result = api.create(..., idempotency_key=task_id)
    verify(result)
    checkpoint(task_id, stage="post_create", output=result)
except AmbiguousStateError:
    existing = api.get(external_id=task_id)
    if existing:
        checkpoint(task_id, stage="post_create", output=existing)
    else:
        raise RetryableError(task_id, "pre_create")
```
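The recovery path assumes two exception types the plain error handler lacks. A minimal sketch of what they might look like, with illustrative names:

```python
class AmbiguousStateError(Exception):
    """The call's outcome is unknown, e.g. a bare 500 with no body."""

class RetryableError(Exception):
    """The task can safely resume from a named checkpoint stage."""
    def __init__(self, task_id, stage):
        super().__init__(f"{task_id}: retry from {stage}")
        self.task_id = task_id
        self.stage = stage
```

Carrying the task ID and stage on the exception lets a supervisor process route the failure back into the recovery function rather than just logging it.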
The recovery path design requires knowing what "ambiguous state" looks like for each API you integrate. That's API-specific knowledge — and it's exactly what failure mode data should surface rather than compress into a single reliability score.
How the API You Choose Shapes Your State Design
The choice of API doesn't just affect latency or cost — it determines how much state complexity you inherit.
Stripe (8.1/10): Idempotency is first-class. Pass the same Idempotency-Key header and get the same result. Payment state transitions are explicit and machine-readable. Verification reads are reliable. Your state design can be minimal because the API contracts handle most of the ambiguity.
Anthropic (8.4/10): Structured errors with explicit Retry-After headers. Streaming responses provide partial completion visibility. Your checkpoint design can be simple because the API tells you exactly where it stopped.
HubSpot (4.6/10): No idempotency keys. Object associations require multi-step sequences that can partially complete. Verification reads can surface stale state due to cross-object consistency lags. Your state design has to compensate for all of this — and that compensation is the defensive code tax.
Salesforce (4.8/10): Bulk API has different state semantics than the REST API. Sandbox/production behavior diverges in ways that make pre-deploy state testing unreliable. SOQL limits create hidden failure modes in complex queries. Heavy state instrumentation required.
The practical implication: when you're designing agent state architecture, the APIs with the lowest AN Scores will demand the most state complexity from you. That's not a coincidence — those scores measure exactly the surface area that creates state uncertainty.
A Minimal State Design for Production Agents
You don't need a full event-sourcing architecture to get production-grade state management. This minimal design handles the majority of failure scenarios:
1. Task Store — persistent key/value store for task state:
task_id → { stage, status, input, output, started_at, updated_at }
2. Checkpoint Function — call before and after any consequential step:
```python
def checkpoint(task_id, stage, status, **kwargs):
    task_store.set(task_id, {
        "stage": stage,
        "status": status,
        "updated_at": now(),
        **kwargs
    })
```
3. Recovery Function — called on any restart or ambiguous failure:
```python
def recover(task_id):
    state = task_store.get(task_id)
    if not state or state["status"] == "complete":
        return None  # nothing to recover
    if state["stage"] == "pre_create":
        # May not have written — safe to retry with idempotency key
        return retry_from("pre_create", state["input"])
    if state["stage"] == "post_create":
        # Write happened — verify before advancing
        return verify_and_advance(state["output"])
```
4. Verification Steps — after any consequential write, confirm before advancing.
This isn't sophisticated. But it handles the common case: ambiguous failures, retries, and mid-workflow restarts. With it, a 2am failure becomes "resume from last checkpoint" rather than "untangle the mess in the morning."
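To make the shape concrete, here is an end-to-end sketch of the design with an in-memory dict standing in for the persistent task store and the retry and verification helpers inlined. Everything here is illustrative, not a production implementation.

```python
task_store = {}  # in-memory stand-in for a persistent key/value store

def checkpoint(task_id, stage, status, **kwargs):
    task_store[task_id] = {"stage": stage, "status": status, **kwargs}

def recover(task_id, api):
    """Resume a task after a restart or an ambiguous failure."""
    state = task_store.get(task_id)
    if not state or state["status"] == "complete":
        return None  # nothing to recover
    if state["stage"] == "pre_create":
        # May not have written; the idempotency key makes the retry safe.
        result = api.create(idempotency_key=task_id, **state["input"])
    else:  # "post_create": the write happened, verify instead of re-creating
        result = api.get(idempotency_key=task_id)
    checkpoint(task_id, "post_create", "complete", output=result)
    return result
```

A crash between the pre-create checkpoint and the API call is the classic 2am scenario: on restart, `recover` replays the create with the same idempotency key and the task completes cleanly.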
What This Means for API Selection
When you're evaluating APIs for an agent workflow that needs to run unattended, these state-management questions should be on your checklist:
- Does it support idempotency keys? (Stripe yes, most CRMs no)
- Are errors structured enough to distinguish retryable from non-retryable failures?
- Are verification reads reliable? (Does get after create return the created object?)
- Are multi-step operations atomic, or can they partially complete?
- Is there explicit state for in-progress operations? (webhooks, async completions)
These questions map directly to the AN Score execution dimension. High execution scores indicate APIs that answer "yes" to most of these. Low scores indicate APIs that leave the answers to you — and every "no" becomes state complexity in your implementation.
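The second checklist question can be sketched as a classifier that maps a structured error to a retry decision. The status codes and the Retry-After header are standard HTTP; the category names and thresholds are assumptions for illustration.

```python
def classify(status_code, headers=None, idempotent=False):
    headers = headers or {}
    if status_code in (429, 503):
        return ("retry", float(headers.get("Retry-After", 1)))
    if 500 <= status_code < 600:
        # Ambiguous: the write may have landed. Retry only if the call
        # carried an idempotency key; otherwise verify the state first.
        return ("retry", 1.0) if idempotent else ("verify", None)
    if 400 <= status_code < 500:
        return ("fail", None)  # client error: retrying will not help
    return ("ok", None)
```

Note the split on 5xx: without an idempotency key, the only safe move after a bare 500 is a verification read, which is exactly the state uncertainty Scenario B describes.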
Summary
State management is not a framework problem. It's a design problem. And how much complexity you inherit from it depends heavily on which APIs you've integrated.
The pattern: checkpoint before writes, verify after side effects, scope failures to small units, design recovery paths explicitly. The API dimension: choose providers whose error design makes recovery paths computable, not guesswork.
The defensive code tax on low-scoring APIs isn't arbitrary. It's the cost of building the state infrastructure that those APIs should have provided but didn't.
Rhumb evaluates 1,000+ APIs on 20 dimensions including execution reliability, error clarity, and idempotency support. Explore the full AN Score methodology →
Related: LLM APIs in Agent Loops: What Actually Breaks at Scale · How APIs Fail When Agents Use Them · The Complete Guide to API Selection for AI Agents