Four days into a new supplier's first batch, my invoice extraction agent had filed 31 documents with amounts shifted by a decimal place. Nothing raised an error. The downstream system accepted every record. The agent returned a 200 each time.
The demo had run on five clean PDFs. Clear fonts, properly formatted dates, consistent layout. The extraction agent pulled vendor name, amount, due date, line items. Every field populated, every output valid. I ran it for the stakeholder meeting and it looked exactly like something you would ship.
Three months in, the agent had processed around 800 invoices without complaint. Then a new supplier switched to scanned documents. Slightly rotated, thin fonts, OCR doing what it could on degraded source material. The model found text that resembled amounts and dates, and returned confident structured output. 1,247.50 read as 12,475.0. A due date resolved to a valid date three years in the future. The confidence was the problem. The model had no mechanism to say it was uncertain. It just answered.
Nobody caught it for four days.
What I built after
The problem was not the model. The model did what it was designed to do: find structure in text and return it. The problem was that the straight pipeline from input to output had no gate in it.
The fix was not more prompting or a better model. I added a validation layer between the agent output and the downstream system. It runs synchronously, takes about 80ms, and checks four things:
- Every required field is non-null.
- Amounts parse as positive numbers within a configured range for that supplier type.
- Dates fall within a 90-day future window.
- Extracted totals are consistent with line item sums, within a small tolerance.
Anything failing a check routes to a review inbox instead of the queue. A human looks at it, corrects it if needed, marks it resolved. The system logs which check triggered and what the input looked like.
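A minimal sketch of what such a layer could look like, covering the four checks and the routing decision. This is not the production code: the `Extraction` shape, the field names, the check names, the 0.01 tolerance, and the reading of the 90-day window as "between today and 90 days from today" are all assumptions made for illustration. It also assumes commas in amounts are thousands separators.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal, InvalidOperation
from typing import List, Optional, Tuple


@dataclass
class Extraction:
    # Hypothetical shape of the agent's output; field names are assumptions.
    vendor_name: Optional[str]
    amount: Optional[str]             # raw string as extracted
    due_date: Optional[date]
    line_items: Optional[List[str]]   # raw line-item amounts


def parse_amount(raw: Optional[str]) -> Optional[Decimal]:
    """Return a finite Decimal, or None if the value is missing or unparseable."""
    if raw is None:
        return None
    try:
        # Assumption: commas are thousands separators, not decimal marks.
        value = Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None
    return value if value.is_finite() else None


def validate(ex: Extraction, today: date, min_amount: Decimal, max_amount: Decimal,
             tolerance: Decimal = Decimal("0.01"), max_days_ahead: int = 90) -> List[str]:
    """Return the names of the checks that failed; an empty list means pass."""
    failures = []

    # Check 1: every required field is non-null.
    if not ex.vendor_name or ex.amount is None or ex.due_date is None or not ex.line_items:
        failures.append("required_fields")

    # Check 2: amount is a positive number within the supplier's configured range.
    total = parse_amount(ex.amount)
    if total is None or total <= 0 or not (min_amount <= total <= max_amount):
        failures.append("amount_range")

    # Check 3: due date falls within the future window (assumed: today .. today + 90 days).
    latest = today + timedelta(days=max_days_ahead)
    if ex.due_date is None or not (today <= ex.due_date <= latest):
        failures.append("due_date_window")

    # Check 4: extracted total matches the line item sum within a small tolerance.
    items = [parse_amount(raw) for raw in (ex.line_items or [])]
    if (total is None or not items or any(i is None for i in items)
            or abs(sum(items) - total) > tolerance):
        failures.append("total_mismatch")

    return failures


def route(ex: Extraction, today: date, min_amount: Decimal,
          max_amount: Decimal) -> Tuple[str, List[str]]:
    """Send failing records to the review inbox, clean ones to the queue."""
    failures = validate(ex, today, min_amount, max_amount)
    return ("review_inbox" if failures else "queue", failures)
```

With a supplier range of 10 to 5,000, an extraction like the shifted amount from the incident (12,475.0 against line items summing to 1,247.50, due three years out) fails the range, date, and total checks and lands in the review inbox. Returning the list of failed check names is what makes it cheap to log which check triggered.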
In the first week after deployment, the layer caught 23 documents out of about 1,400. Eleven were bad scans. Seven were valid invoices in a format the model had not seen before. Five were duplicates that had slipped through upstream. All 23 would have gone through clean before the layer existed.
The review inbox is not impressive. It is an HTML table and a textarea. It took three hours to build. It has caught every significant extraction failure since I shipped it.
Reliability is the only feature
I run the agent operation at Agent Enterprise (aienterprise.dk), and this pattern shows up in every domain we deploy into. Model capability is mostly not the question. What does not improve automatically is the boundary between what the agent produces and what the downstream system trusts.
Every deployment has its own version of this guard. For a scheduling agent it is a check that the proposed slot is actually open. For a classification agent it is a threshold below which the label goes to review rather than being applied automatically. The pattern is constant. The agent produces an output, and before that output becomes a fact in your system, a deterministic check verifies it is plausible.
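The two other guards mentioned here can be sketched in a few lines each. The function names, the return values, and the 0.85 threshold are hypothetical, chosen only to show the shape of the gate.

```python
from typing import Set, Tuple


def gate_label(label: str, confidence: float, threshold: float = 0.85) -> Tuple[str, str]:
    """Apply a label only at or above the threshold; otherwise send it to review."""
    # Assumption: the classifier exposes a confidence score in [0, 1].
    if confidence >= threshold:
        return ("apply", label)
    return ("review", label)


def gate_slot(proposed_slot: str, open_slots: Set[str]) -> str:
    """Book a proposed slot only if it is actually open; otherwise send it to review."""
    return "book" if proposed_slot in open_slots else "review"
```

In both cases the agent's output is treated as a proposal, and a plain deterministic comparison decides whether it is committed or handed to a human.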
The demo proves the agent can. Production proves it does, correctly, on the bad input, on the rotated scan, at 3am when no one is watching. That second proof is the one your users care about. It is also the one that does not come from the model.
The validation layer is not exciting to ship. It is the right call every time.