Ethan Walker

Posted on Jul 1

Your CI passed. Your agent isn't operator-ready.

#ai #agents #testing #mlops

We shipped a document-extraction agent to an enterprise customer last quarter. Twelve-week eval. 94% pass rate on our test suite. Three weeks into the pilot, it started generating refunds for invoices it couldn't parse. Silently. No error. No trace. Just wrong output that looked like right output.

Our CI was green the entire time.

The issue was not the model. It was not the prompt. It was the six percent of inputs we hadn't tested, which turned out to be the first thing a real operator's data sent our way.

That's not an edge case. That's what operator-ready means in practice.

What "production-ready" means vs. what "operator-ready" means

Production-ready is an infrastructure concept. Your service is up. It handles load. It restarts on crash. Logs go somewhere. Alerts exist.

Operator-ready is different. It means your agent can be handed to someone who did not build it, running on data you did not design it for, making decisions that have real consequences if they're wrong.

The distinction matters because most eval pipelines are designed for the first. They measure pass rate on a test set. They don't measure what happens when an operator's input distribution is 30% different from your test set, which it always is.

The three gaps that bite in operator handoffs

Gap 1. Validation theater

A Pydantic model with 97% validation success sounds good. Here's what it hides.

The 3% that fail: what does your agent do? If your retry loop fills missing fields with model-inferred defaults, you've built a silent wrong-answer machine. The schema passed. The output is wrong. And you have no log entry flagging it.

Fix: separate the "schema valid" signal from the "content confidence" signal. Log field-level confidence alongside the output. An output is not trusted until both are above threshold.

We added a field_confidence dict to every extraction response. Low-confidence fields trigger a human-review flag, not a retry. That alone caught 14 of the 18 incidents in our first operator month.
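A minimal sketch of what keeping those two signals separate might look like. Only the `field_confidence` dict comes from the description above; the `ExtractionResult` shape, the `gate` helper, and the 0.85 threshold are assumptions for illustration, and plain dataclasses stand in for whatever schema library you use.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; in practice this would be tuned per field and per operator.
CONFIDENCE_THRESHOLD = 0.85


@dataclass
class ExtractionResult:
    fields: dict[str, object]
    field_confidence: dict[str, float]
    schema_valid: bool
    needs_human_review: bool = False
    low_confidence_fields: list[str] = field(default_factory=list)


def gate(result: ExtractionResult, threshold: float = CONFIDENCE_THRESHOLD) -> ExtractionResult:
    """Trust the output only if the schema passed AND every field clears the threshold."""
    # A field with no confidence score counts as 0.0, so it is never silently trusted.
    result.low_confidence_fields = sorted(
        name for name in result.fields
        if result.field_confidence.get(name, 0.0) < threshold
    )
    # Low confidence routes to review; it does not trigger a retry.
    result.needs_human_review = (not result.schema_valid) or bool(result.low_confidence_fields)
    return result


checked = gate(ExtractionResult(
    fields={"invoice_total": "1240.00", "vendor": "Acme"},
    field_confidence={"invoice_total": 0.41, "vendor": 0.97},
    schema_valid=True,
))
print(checked.needs_human_review, checked.low_confidence_fields)
# prints: True ['invoice_total']
```

The point of the example is the last case: the schema is valid, yet the result still goes to a reviewer because one field is below threshold.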

Gap 2. Adversarial input handling

Your test set was built by you or your team. It covers the cases you thought of. An operator's data covers the cases they didn't tell you about.

In our case: multi-page invoices with embedded scanned PDFs. Our test suite had single-page invoices. The agent handled them differently, and "differently" meant "wrong" in ways our eval never measured.

This is not a parsing bug. It's a distribution shift. The correct response is not to fix the parser. It's to test against a sample of the actual operator's data before going live.

Before any operator handoff, we now require 50 documents from the operator's own corpus run through the agent, with manual review of outputs. Not synthetic data. Not our test set. Theirs.

That one change caught the scanned-PDF issue before the pilot started.
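Once the manual review is done, the result can be reduced to a per-field error rate on the operator's corpus. The sketch below assumes each reviewed document is recorded as a dict mapping field name to whether the reviewer marked it correct; that record shape, the `field_error_rates` name, and the sample data are all hypothetical.

```python
from collections import defaultdict

MIN_SAMPLE_SIZE = 50  # the 50-document minimum described above


def field_error_rates(reviews):
    """reviews: one dict per document, mapping field name -> True if the reviewer marked it correct."""
    if len(reviews) < MIN_SAMPLE_SIZE:
        raise ValueError(
            f"need at least {MIN_SAMPLE_SIZE} reviewed documents, got {len(reviews)}"
        )
    seen = defaultdict(int)
    wrong = defaultdict(int)
    for review in reviews:
        for name, is_correct in review.items():
            seen[name] += 1
            if not is_correct:
                wrong[name] += 1
    return {name: wrong[name] / seen[name] for name in seen}


# Made-up review data: 50 documents, 5 with a wrong invoice total.
reviews = (
    [{"invoice_total": True, "vendor": True}] * 45
    + [{"invoice_total": False, "vendor": True}] * 5
)
print(field_error_rates(reviews))
# prints: {'invoice_total': 0.1, 'vendor': 0.0}
```

Comparing these numbers field by field against the same rates on your own test set shows where the operator's data diverges from what you evaluated.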

Gap 3. The audit log that doesn't log what matters

Every engineer's first logging setup captures what the model returned. Almost nobody logs what the model decided not to do.

For an operator deploying an extraction agent inside a compliance workflow, the question isn't just "what did the agent output?" It's also "did the agent flag this document as low confidence?", "did it skip any fields?", and "did it trigger any fallback paths?"

If you can't answer those questions from the trace, you can't support the operator when something goes wrong. And something will go wrong.

Minimum viable operator audit trail:

Output with field-level confidence scores
Fallback path indicator (did it retry? did it degrade?)
Input hash (so you can replay the exact document)
Model version and prompt version at inference time (not just "gpt-4o", the specific deployment)

We built this into a standard trace schema and started injecting it into every response. The overhead is negligible. The debuggability improvement is significant.
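A trace record covering the four items in that list could be sketched as follows. The `OperatorTrace` class, its field names, the `skipped_fields` list, and the version strings are assumptions made for the example, not the schema described above; the hash uses SHA-256 from the standard library.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class OperatorTrace:
    input_sha256: str        # lets you find and replay the exact document
    model_version: str       # the specific deployment, not just the model family
    prompt_version: str
    output: dict
    field_confidence: dict
    fallback_path: str = "none"  # hypothetical values: "none", "retry", "degraded"
    skipped_fields: list = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def build_trace(document_bytes, output, field_confidence, *, model_version,
                prompt_version, fallback_path="none", skipped_fields=()):
    """Build the audit record at inference time, alongside the response."""
    return OperatorTrace(
        input_sha256=hashlib.sha256(document_bytes).hexdigest(),
        model_version=model_version,
        prompt_version=prompt_version,
        output=output,
        field_confidence=field_confidence,
        fallback_path=fallback_path,
        skipped_fields=list(skipped_fields),
    )


trace = build_trace(
    b"example document bytes",
    output={"invoice_total": None},
    field_confidence={"invoice_total": 0.12},
    model_version="example-deployment-v3",   # placeholder identifier
    prompt_version="extract-invoice-v7",     # placeholder identifier
    fallback_path="degraded",
    skipped_fields=["invoice_total"],
)
print(json.dumps(asdict(trace), indent=2))
```

Recording `fallback_path` and `skipped_fields` explicitly is what makes "what did the agent decide not to do" answerable from the trace alone.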

The pre-operator checklist I actually use

Before handing an agent to any operator, I run through this:

Run 50+ samples from the operator's actual data, not our test set. Measure field-level error rate on their corpus specifically. If there's a gap between their corpus accuracy and your test-set accuracy, that gap is your risk.

Search logs for the last 30 days for any output that passed schema validation but triggered downstream errors. These are your silent failures. Fix them before the operator sees them.

Intentionally feed malformed inputs. Verify the agent degrades to a safe fallback, not a wrong output. "I cannot parse this document" is better than a wrong invoice total.

Confirm you can answer "what did the agent do on document X at timestamp Y" in under 5 minutes. If you can't, your audit trail is incomplete and you're not operator-ready regardless of your eval score.

Check the agent's permission scope. Does it have access to resources it doesn't need for this operator's use case? The principle of least privilege applies to agents too.
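Two of the checks above, the silent-failure log search and the malformed-input test, are easy to automate. The sketch below is one way to do it under stated assumptions: `extract_invoice` is a stand-in for the real agent call, `UnparseableDocument` is a hypothetical refusal signal, and the log records are assumed to be dicts with `schema_valid` and `downstream_error` keys.

```python
class UnparseableDocument(Exception):
    """Raised when the agent refuses to extract rather than guess."""


def extract_invoice(raw: bytes) -> dict:
    # Stand-in for the real agent call (hypothetical behavior).
    if not raw.startswith(b"%PDF"):
        raise UnparseableDocument("I cannot parse this document")
    return {"invoice_total": "0.00"}


def unsafe_outputs(extract, malformed_inputs):
    """Return the malformed inputs that produced an output instead of a refusal."""
    leaked = []
    for raw in malformed_inputs:
        try:
            extract(raw)
        except UnparseableDocument:
            continue  # safe fallback: the agent refused
        leaked.append(raw)
    return leaked


def silent_failures(log_records):
    """Records that passed schema validation but still caused a downstream error."""
    return [
        r for r in log_records
        if r.get("schema_valid") and r.get("downstream_error")
    ]


malformed = [b"", b"\x00\x01garbage", b"<html>not an invoice</html>"]
print(unsafe_outputs(extract_invoice, malformed))
# prints: []

logs = [
    {"doc": "a", "schema_valid": True, "downstream_error": None},
    {"doc": "b", "schema_valid": True, "downstream_error": "refund mismatch"},
    {"doc": "c", "schema_valid": False, "downstream_error": "rejected"},
]
print([r["doc"] for r in silent_failures(logs)])
# prints: ['b']
```

An empty list from `unsafe_outputs` is the passing condition; anything in it is a malformed input that the agent answered instead of refusing.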

The number that actually matters

Our eval pass rate was 94%. Our operator-handoff error rate in month one was 8%.

Those two numbers can coexist because they're measuring different things against different data.

After we added the three changes above (field confidence, operator corpus testing, full audit trail), the month-two operator error rate dropped to 1.4%. The eval pass rate barely moved (95%).

The eval score was not the problem. The eval scope was.

What I'd check first

If you've shipped an agent and you're about to hand it to an operator, here's the three-line diagnostic:

Can you answer "what did the agent decide NOT to do on this input" from your trace? If no, your audit trail is incomplete.
Have you run the agent on at least 50 documents from the operator's actual corpus? If no, your pass rate is a test-set metric, not an operator reliability estimate.
What happens when your agent receives input outside its schema? If the answer is "it retries and fills defaults," you have a silent wrong-answer path. Change it to "it flags for human review."

Operator-ready is not a CI check. It's a claim about how the agent behaves on someone else's data, making decisions with real consequences. The eval suite gets you close. These three checks get you there.

Top comments (1)

Alex Shev • Jul 1

This is the distinction a lot of teams miss. CI proves the code passed a known test path; operator readiness proves the agent can handle uncertainty, escalation, bad inputs, and rollback. I like treating the agent as an operational system instead of a clever function call.