Maya Andersson

Posted on Jul 1 • Originally published at Medium

I reviewed six "operator-ready" checklists for AI agents. None of them define the problem correctly.

#agents #llmevaluation #mlops #agentreliability

The industry has converged on a definition of "operator-ready" that is measurable, deployable, and wrong.

The most cited frameworks are Anthropic's "Building Effective Agents" (December 2024), Hamel Husain's "Your AI product needs evals" (2024), the LangChain eval documentation, NIST AI RMF (2023), Google's responsible AI practices, and OpenAI's model specification (May 2024). They share a common structure: they define reliability as pass-rate on a test set, and readiness as a threshold on that pass-rate.

This is a reasonable definition for production-readiness. It is not a correct definition for operator-readiness.

The distinction is not semantic. It has direct consequences for how you test, what you ship, and what breaks after handoff.

What the existing frameworks get right

Hamel Husain's framework is the most practically useful of the six. His argument that "you cannot improve what you cannot measure" is correct, and his guidance on building eval sets that are representative, diverse, and graded with real human judgment is solid. The Anthropic guide's emphasis on minimal footprint and clear failure modes is well-reasoned for the agent design phase.

The NIST AI RMF is the most complete risk taxonomy. Its four functions (Govern, Map, Measure, Manage) are a useful organizational structure for compliance-conscious deployments.

These are good frameworks. The problem is not that they're wrong. The problem is that they're answering a different question than the one that breaks things in production.

Where they all fall short

Every framework I reviewed defines reliability as a static property: "the agent achieves X% on the eval set." That's a snapshot metric, not a deployment metric.

Operators change things. They add new document types. They expand use cases. They bring their own data. Their users find inputs that your test set never anticipated.

The real question is not "does the agent achieve X% on the eval set." It's "does the agent maintain X% after six weeks of operator usage, on the operator's actual input distribution."

Those are different questions. The first is testable before deployment. The second requires a different kind of eval infrastructure.

The three things the existing frameworks miss

1. Distribution shift is the default condition

Every source I checked treats distribution shift as an edge case to handle. It is not an edge case. It is the default condition of operator deployment.

The operator's data is never exactly your eval data. The operator's users will find inputs you did not anticipate. The operator's business context will evolve. Distribution shift is not a risk you mitigate and move past. It's the ongoing condition of every production deployment.

A framework that doesn't include ongoing distribution shift monitoring as a first-class readiness requirement is describing a static artifact, not a live system.

2. "Pass-rate" conflates several different failure modes

An eval pass-rate in the low nineties can coexist with an operator error rate several times higher on real data. I've seen this across multiple deployments. The reason is that pass-rate is an aggregate: it hides which types of failure make up the misses.

There are at least four different failure modes that look identical in a pass-rate number:

Formatting failures (schema doesn't match, easy to catch and retry)
Content errors on in-distribution inputs (model got the right format, wrong substance)
Content errors on out-of-distribution inputs (distribution shift failures, the most dangerous)
Silent failures (output is wrong but passes all automated checks)

A team with 94% pass-rate and mostly formatting failures is in a very different position from a team with 94% pass-rate and mostly silent content errors. The number looks the same.

Operator-readiness requires disaggregating the failure modes, not aggregating them into a single score.
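A minimal sketch of what disaggregation could look like, assuming each eval case has already been labeled with one of the four failure modes above (or marked as passing). The `FailureMode` enum, the `failure_breakdown` helper, and the two example teams are hypothetical names and data made up for illustration; assigning the labels, especially for silent failures, still needs human review.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    # Hypothetical labels mirroring the four failure modes listed above.
    FORMATTING = "formatting"
    CONTENT_IN_DIST = "content_in_distribution"
    CONTENT_OUT_OF_DIST = "content_out_of_distribution"
    SILENT = "silent"


def failure_breakdown(results):
    """results: list with one entry per eval case, a FailureMode or None (None = pass).

    Returns (pass_rate, {mode: share of all failures}).
    """
    if not results:
        raise ValueError("results must not be empty")
    failures = [r for r in results if r is not None]
    pass_rate = 1 - len(failures) / len(results)
    counts = Counter(failures)
    shares = {}
    if failures:
        shares = {mode: counts[mode] / len(failures) for mode in FailureMode}
    return pass_rate, shares


# Two made-up teams, both at 94% on 100 cases, with very different failure mixes.
team_a = [None] * 94 + [FailureMode.FORMATTING] * 5 + [FailureMode.SILENT] * 1
team_b = [None] * 94 + [FailureMode.FORMATTING] * 1 + [FailureMode.SILENT] * 5

for name, results in (("A", team_a), ("B", team_b)):
    rate, shares = failure_breakdown(results)
    summary = {mode.value: round(share, 2) for mode, share in shares.items()}
    print(f"team {name}: pass_rate={rate:.2f} failures={summary}")
```

Both teams report the same pass-rate, but the breakdown shows that most of team B's failures are the kind no automated check will catch.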

3. The eval-to-deployment gap is structural

The Anthropic guide correctly notes that agents should be evaluated with "realistic and diverse inputs." It does not address what happens when the operator's inputs are more diverse than your test set in ways you couldn't have anticipated.

This is the eval-to-deployment gap, and it is structural. You build the eval set with the data you have. The operator deploys with the data they have. Those two sets overlap imperfectly.

The only way to close this gap is to treat pre-deployment testing on the operator's own corpus as a mandatory step. Not as a quality assurance nicety. As a readiness gate.

Take fifty documents from the operator's actual corpus, review the agent's outputs manually, and compare that accuracy to the eval pass-rate on the same task. If the accuracy on those fifty documents is materially lower than the eval accuracy, the deployment is not ready regardless of what the aggregate pass-rate says.
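The comparison above can be sketched as a small gate function. This is an illustration under stated assumptions, not a prescribed implementation: `corpus_gap` and its parameters are hypothetical names, the reviewer verdicts are assumed to arrive as a list of booleans, and the 5-point default for "materially lower" is borrowed from the threshold used later in this post.

```python
def corpus_gap(eval_pass_rate, reviewed_labels, max_gap_points=5.0, min_docs=50):
    """Compare eval pass-rate with manually reviewed accuracy on operator documents.

    eval_pass_rate: fraction in [0, 1] measured on your own eval set.
    reviewed_labels: list of bools, True when a human reviewer judged the output correct.
    max_gap_points / min_docs: assumed defaults (5 points, 50 documents).
    """
    if len(reviewed_labels) < min_docs:
        raise ValueError(
            f"need at least {min_docs} reviewed documents, got {len(reviewed_labels)}"
        )
    operator_accuracy = sum(reviewed_labels) / len(reviewed_labels)
    # Positive gap means the agent does worse on the operator's data than on the eval set.
    gap_points = round((eval_pass_rate - operator_accuracy) * 100, 2)
    return {
        "operator_accuracy": operator_accuracy,
        "gap_points": gap_points,
        "ready": gap_points <= max_gap_points,
    }


# Made-up example: 94% on the eval set, 41 of 50 operator documents judged correct.
labels = [True] * 41 + [False] * 9
print(corpus_gap(0.94, labels))
```

With only fifty documents the accuracy estimate is noisy, so a result near the threshold is a reason to review more documents rather than a clean pass or fail.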

What a correct operator-readiness definition looks like

An agent is operator-ready when:

Pass-rate on the operator's own corpus sample (minimum 50 documents, reviewed manually) is within 5 percentage points of your own eval pass-rate.
Failure mode distribution is documented: what percentage of failures are formatting errors vs. content errors vs. silent failures.
Distribution shift monitoring is in place: a scheduled re-evaluation on a rolling sample of recent operator inputs, with alerting when the pass-rate drift exceeds a defined threshold.
Failure recovery behavior is tested explicitly: what does the agent do with inputs outside its distribution? Does it fail loudly (flag for review) or fail silently (produce wrong output)?

This is more work than "run the eval suite, check the score." It is also the actual test for whether the agent will maintain its quality guarantees six weeks after handoff.
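The monitoring criterion in the list above (rolling re-evaluation with an alert on drift) could be sketched roughly as follows. `DriftMonitor`, the window size, the minimum sample count, and the 5-point alert threshold are all assumptions chosen for illustration, and the sketch takes for granted that each recent operator input has already been graded as pass or fail by a reviewer or an automated grader.

```python
from collections import deque


class DriftMonitor:
    """Tracks pass/fail on a rolling window of recent operator inputs (hypothetical helper)."""

    def __init__(self, baseline_pass_rate, window_size=200,
                 max_drift_points=5.0, min_samples=50):
        self.baseline = baseline_pass_rate          # pass-rate measured at handoff
        self.window = deque(maxlen=window_size)     # oldest grades fall off automatically
        self.max_drift_points = max_drift_points
        self.min_samples = min_samples

    def record(self, passed):
        self.window.append(bool(passed))

    def check(self):
        """Return (current_rate, drift_points, alert); (None, None, False) until enough samples."""
        if len(self.window) < self.min_samples:
            return None, None, False
        rate = sum(self.window) / len(self.window)
        drift = round((self.baseline - rate) * 100, 2)
        return rate, drift, drift > self.max_drift_points


# Made-up example: baseline 94%, then 60 graded operator inputs with 15 failures.
monitor = DriftMonitor(baseline_pass_rate=0.94)
for passed in [True] * 45 + [False] * 15:
    monitor.record(passed)

rate, drift, alert = monitor.check()
print(f"rolling pass-rate={rate:.2f} drift={drift} points alert={alert}")
```

In a real deployment `check` would run on a schedule and route the alert somewhere a person will see it; how the recent inputs get graded is the expensive part and is left out here.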

FAQ

What's the fastest way to close the eval-to-deployment gap?

Before handoff, run the agent on 50 documents from the operator's actual corpus. Not synthetic data, not your training set. Documents the operator will actually send. Review those outputs manually. Compare the accuracy to your eval accuracy on the same task.

If the gap is more than 5 percentage points, you have a distribution shift problem to characterize before deploying.

Is there a practical threshold for operator-readiness?

There isn't a universal threshold. The right threshold depends on the stakes of failure. For a low-stakes use case (content categorization), 85% might be acceptable. For a high-stakes use case (contract extraction with legal consequences), 85% probably isn't.

What matters more than the threshold is that you're measuring operator accuracy (on the operator's data) rather than eval accuracy (on your data). Those are measured on different populations.

Does ongoing monitoring replace pre-deployment testing?

No. Monitoring catches degradation after deployment. Pre-deployment testing on the operator's corpus catches distribution shift before deployment. Both are necessary. Neither substitutes for the other.

What about fine-tuning on the operator's data?

Fine-tuning is the right long-term answer for persistent distribution shift. It's not the short-term answer for a deployment that needs to go live in two weeks. The short-term answer is: characterize the gap, document the failure modes, set up monitoring, and be transparent with the operator about the limitations on their data distribution.

Open question

The hardest problem I haven't seen solved well: how do you define "operator-ready" for an agent that will serve multiple operators with fundamentally different data distributions?

A financial services operator and a healthcare operator running the same document-extraction agent have different input distributions, different failure modes, and different acceptable error rates. Per-operator eval sets are the correct answer in theory. They're expensive in practice.

Is there a reasonable way to stratify a single eval set across operator types without running a full per-operator measurement pass? I've seen teams try domain-stratified sampling (one slice per major input type), but the slice sizes are never large enough to give statistically stable estimates for the rare input categories that are most likely to cause problems.
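To make the slice-size problem concrete, here is a small calculation using the Wilson score interval for a binomial proportion. The interval is my choice of tool for illustrating the point, not something the teams I mentioned necessarily used, and the slice sizes and the 90% observed pass-rate are made-up numbers.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical slices that all show a 90% observed pass-rate.
for n in (10, 50, 500):
    low, high = wilson_interval(round(0.9 * n), n)
    width = 100 * (high - low)
    print(f"n={n:>3}: {low:.2f} to {high:.2f} (width {width:.0f} points)")
```

For the ten-document slice, a 90% observed pass-rate is consistent with a true rate anywhere from roughly 0.60 to 0.98, which is far too wide to support a readiness decision on a rare input category.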

If you've solved this problem, I'd be interested in how.

Top comments (1)

Luis Cruz • Jul 1

🧠 Central critique

The author’s main point is:

Every “operator-ready checklist” tries to evaluate agents, but none first agree on what an agent is supposed to solve.

So we end up with:

Security checklists
Reliability checklists
UX readiness checklists
Deployment checklists

…but no shared answer to:
👉 What problem is the agent actually responsible for solving?

Without that, evaluation becomes meaningless.

⚠️ The hidden issue in agent engineering

The post is basically arguing that the industry has inverted the order of operations:

What teams do:

Build agent
Add tools
Add memory
Run checklist
Deploy

What they should do:

Define failure domain precisely
Define boundaries of autonomy
Define success conditions
Then design agent + checklist

🔍 Why checklists fail in practice

Most “operator-ready” frameworks assume:

stable environments
stable task definitions
measurable success criteria

But real agents operate in:

ambiguous workflows
shifting context
partially observable outcomes

So a checklist might say:

“Does it have memory?”
“Does it use tools safely?”
“Does it handle errors?”

But none of those matter if:

the agent is solving the wrong abstraction of the problem in the first place.

🧩 Deeper insight: we’re benchmarking the wrong layer

This connects strongly to a broader pattern in AI agent discourse:

We obsess over reliability layers
But ignore problem formulation layers

So we optimize:

prompts
guardrails
tool calling
eval suites

…but we don’t validate:

whether “this should be an agent at all”
whether the workflow is even agentic vs deterministic
whether autonomy is actually needed

🧭 Practical takeaway

A useful reframing from the article is:

Before asking “Is this agent production-ready?”
ask “Is this even an agent-shaped problem?”

If the answer is unclear, no checklist will fix it.