Structured output broke on us three times. The third time taught us what "operator-ready" means.

#agents #ai #llm #production

Last quarter we shipped a contract-extraction agent to an enterprise legal team. Schema validation passing at 97%. Human reviewers satisfied with the output quality in testing. Rollout went smoothly.

Then it broke. Three times. In three completely different ways.

The first two failures we fixed with better prompts and stricter schemas. The third one taught us something the first two hadn't: that "operator-ready" is not a technical checklist. It's a claim about your agent's behavior under conditions you didn't design it for.

Failure one: the validation paradox

Week two. A lease agreement came through with a renewal clause formatted as a table instead of prose. Our extractor looked for renewal terms in a specific JSON path. The table format populated the schema differently. Validation passed. The extracted renewal date was off by two years.

The fix was obvious in retrospect: add a canonical-format normalization step before extraction. But the lesson was sharper than that.

Schema validation tells you the shape of the output, not whether the content is correct. A JSON object with the right keys and the right types can still contain wrong values. Our 97% validation success rate was measuring the wrong thing. It was measuring structure conformance, not content accuracy.

After this failure, we separated validation into two signals: schema validity (does the object have the required fields with the right types) and field confidence (do we have evidence that each value is correct). We started logging both. An output is trusted only when the schema check passes and field confidence is above threshold.
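A minimal sketch of how the two signals could be kept apart, in Python. The field names, the 0.8 threshold, and the idea that some upstream step hands us a per-field confidence score are all assumptions for illustration; how we actually compute confidence is outside the scope of this post.

```python
from dataclasses import dataclass

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"renewal_date": str, "term_months": int}

# Assumed threshold; in practice this would be tuned per field and per operator.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class TrustDecision:
    schema_valid: bool
    low_confidence_fields: list[str]

    @property
    def trusted(self) -> bool:
        # Both signals must hold: right shape AND enough evidence per field.
        return self.schema_valid and not self.low_confidence_fields


def check_schema(output: dict) -> bool:
    """Structure only: required keys present with the expected types."""
    return all(
        name in output and isinstance(output[name], expected)
        for name, expected in REQUIRED_FIELDS.items()
    )


def evaluate(output: dict, confidence: dict[str, float]) -> TrustDecision:
    """Combine both signals; a missing confidence score counts as 0.0."""
    low = [
        name
        for name in REQUIRED_FIELDS
        if confidence.get(name, 0.0) < CONFIDENCE_THRESHOLD
    ]
    return TrustDecision(schema_valid=check_schema(output), low_confidence_fields=low)


# A well-formed object whose renewal date has weak supporting evidence.
output = {"renewal_date": "2027-03-01", "term_months": 36}
decision = evaluate(output, {"renewal_date": 0.41, "term_months": 0.93})
print(decision.schema_valid, decision.trusted, decision.low_confidence_fields)
# prints: True False ['renewal_date']
```

The point of returning both values instead of a single pass/fail is that they can be logged and alerted on separately. A falling confidence signal with a steady schema-valid rate is exactly the situation failure one hid from us.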

Failure two: the retry loop that lies

Month one. A particular clause type appeared in a contract format our test set didn't cover. The extractor failed schema validation on the first attempt. Our retry logic kicked in, filled the missing fields with model-inferred defaults, and passed validation on the third try.

The output looked right. The content was wrong. The inferred defaults were plausible values that did not match the actual contract.

No alert fired. No human review was triggered. The error surfaced three weeks later when the legal team flagged a discrepancy in a signed agreement.

This is the retry paradox: the retry loop is supposed to handle uncertainty, but in practice it converts "the model doesn't know" into "the model confidently guessed." The schema never sees the difference.

The fix: when an attempt fails validation because content is missing (not because the format is off), the correct behavior is a human-review flag, not a default fill. "I cannot extract this clause with confidence" is a better output than a wrong value that passes validation.

We changed the retry logic to distinguish format failures (retry and reformat) from content failures (flag for review). The human-review rate went up. The silent error rate went to zero.
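Here is roughly what that split can look like, as a hedged Python sketch. `call_model`, `FormatError`, the field names, and the retry budget are hypothetical stand-ins, and the sketch assumes the extractor returns `None` for any field it could not find rather than inventing a value.

```python
from typing import Callable, Optional

REQUIRED_FIELDS = ("renewal_date", "term_months")  # hypothetical schema
MAX_FORMAT_RETRIES = 2  # assumed retry budget


class FormatError(Exception):
    """Raised when model output could not be parsed into the expected shape."""


def extract_with_review_flag(document: str, call_model: Callable[[str], dict]) -> dict:
    """Retry on format failures only; never fill missing content with guesses."""
    last_error: Optional[str] = None
    for _ in range(1 + MAX_FORMAT_RETRIES):
        try:
            fields = call_model(document)
        except FormatError as exc:
            last_error = str(exc)
            continue  # format failure: safe to retry and reformat

        missing = [name for name in REQUIRED_FIELDS if fields.get(name) is None]
        if missing:
            # Content failure: another retry would only invite a plausible guess.
            return {"status": "needs_review", "missing": missing, "fields": fields}
        return {"status": "ok", "missing": [], "fields": fields}

    # Retry budget exhausted on format errors: escalate instead of guessing.
    return {
        "status": "needs_review",
        "missing": list(REQUIRED_FIELDS),
        "fields": {},
        "error": last_error,
    }


# Stub extractor: one unparseable response, then a parse with a missing field.
calls = {"n": 0}


def stub_model(_document: str) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        raise FormatError("unparseable JSON")
    return {"renewal_date": None, "term_months": 36}


result = extract_with_review_flag("contract text goes here", stub_model)
print(result["status"], result["missing"])
# prints: needs_review ['renewal_date']
```

The format error is retried once and recovers. The missing renewal date is not retried at all. It goes straight to review, which is why the human-review rate rises with this design.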

Failure three: the operator's data

This one took longer to understand.

Six weeks in, a new batch of contracts arrived from a subsidiary the legal team had recently acquired. Different contract structure, different clause naming conventions, different language patterns. Our extraction accuracy dropped from 94% on the training-corpus contracts to 61% on the acquired subsidiary's contracts.

We had not seen a single document from that subsidiary during development. Neither had our test suite.

This is the distribution shift problem. And it is what not being operator-ready actually looks like.

Production-ready means your agent handles the inputs you tested it on. Operator-ready means your agent handles the inputs the operator is actually going to give it. Those are not the same set.

The fix was not a better model or a better prompt. It was a process change: before any operator handoff, run the agent on a sample of the operator's own documents, measure accuracy on that corpus specifically, and establish a baseline before you commit to SLA numbers.

We now require 50 documents from the operator's corpus as part of the pre-handoff checklist. Not synthetic. Not ours. Theirs. If the accuracy on those 50 documents is not close to the accuracy on our training corpus, the handoff gets delayed until we understand why.
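The gate itself is simple enough to sketch in a few lines of Python. The 50-document minimum comes from our checklist. Everything else here is an assumption for illustration: the 5-point tolerance is one possible reading of "close", per-document pass/fail is a simplification of field-level accuracy, and the sample verdicts are made up.

```python
from dataclasses import dataclass

MIN_SAMPLE_SIZE = 50      # from the pre-handoff checklist
MAX_ACCURACY_GAP = 0.05   # assumption: what counts as "close" is a team decision


@dataclass
class HandoffCheck:
    operator_accuracy: float
    gap: float
    ready: bool


def check_operator_corpus(reviewed: list[bool], reference_accuracy: float) -> HandoffCheck:
    """Compare accuracy on the operator's documents against our own baseline.

    `reviewed` holds one manual verdict per operator document (True = correct).
    """
    if len(reviewed) < MIN_SAMPLE_SIZE:
        raise ValueError(
            f"need at least {MIN_SAMPLE_SIZE} operator documents, got {len(reviewed)}"
        )
    accuracy = sum(reviewed) / len(reviewed)
    gap = reference_accuracy - accuracy
    return HandoffCheck(
        operator_accuracy=accuracy,
        gap=gap,
        ready=gap <= MAX_ACCURACY_GAP,
    )


# Made-up verdicts: 31 of 50 operator documents extracted correctly,
# compared against a 94% baseline on our own corpus.
verdicts = [True] * 31 + [False] * 19
check = check_operator_corpus(verdicts, reference_accuracy=0.94)
print(f"accuracy={check.operator_accuracy:.2f} gap={check.gap:.2f} ready={check.ready}")
# prints: accuracy=0.62 gap=0.32 ready=False
```

A failed check does not mean the agent is bad. It means the SLA conversation has to wait until someone has read through the failing documents.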

What these three failures have in common

All three were invisible to our eval suite. All three were visible with the right diagnostic.

The pattern: our eval was measuring our best case (our data, our test set, our format assumptions). Operator-ready means measuring the operator's case. Those are different measurement problems.

The three things we added to our pre-handoff process:

Field-level confidence scoring on every output (not just schema validity)
Content-failure-vs-format-failure separation in retry logic (fail loudly, not silently)
Operator corpus sampling before go-live (50 documents from their actual data, reviewed manually)

None of these are in the standard "production-ready" checklist. They're in the operator-ready checklist.

Where I'd push back on this

The common response to these failures is "just add more training data" or "fine-tune on the operator's corpus." That's the right long-term fix. It's not the short-term answer.

Fine-tuning takes weeks and requires labeling budget. An operator pilot that's already started does not have that runway. The faster path is: understand the distribution shift before you commit to accuracy numbers, not after you've already missed them.

There's also a steelman for the current "validation is enough" approach: for low-stakes use cases with structured, predictable inputs, schema validation really is sufficient. If every contract you're extracting is from the same template, format conformance and content accuracy are highly correlated.

The problem is that enterprise operators rarely have one template. The legal team that deployed our extractor manages contracts from 14 different counterparties, each with their own conventions. Validation-only was always going to break.

The concession I'll make: this is a data problem as much as an engineering problem. The teams that invest in building labeled corpora per operator will have substantially better outcomes than the teams that treat operator-ready as a single deployment decision. We didn't invest in that early enough. The second and third failures were partly the cost of that.

Operator-ready is not a state you reach. It's a process you run.