Todd Linnertz

The Agent Is 20% of the Work. The Platform Is the Other 80%.

Originally published at devopsdiary.blog. Post F-AID1 in the "Governing AI in the Enterprise" series.

A payroll team shipped a production AI agent last year. Real workload, not a demo: processing 3,000+ emails a day, classifying them, extracting the data, and entering it into the payroll system. Six distinct steps, end to end.

Their test accuracy: 94%. Good enough to ship.

Their production accuracy: 70%.

That's the talk I keep thinking about from AI Dev 26. The drop itself isn't news. What they did about it is.

The accuracy gap has a cause

The 94% looked clean because the test set was curated. It covered the cases the team had thought of. Production didn't care about that. It sent typos. Impossible numbers. Screenshots. Hand-drawn notes. Vague references with no context. Conflicting instructions from two people in the same email thread.

The test distribution and the production distribution weren't the same. They almost never are.

A better model isn't what closed the gap. Shadow testing did: the agent processed real production emails alongside their human team for four weeks, generating payroll entries but not submitting them. Humans reviewed the shadow outputs. Edge cases surfaced. New tests got written.

Final accuracy: 98%. The agent didn't change. The scaffolding around it did.

| Month | Accuracy |
| ----- | -------- |
| M1 | 55% |
| M2 | 97% |
| M3 | 87% |
| M4 | 94% |
| M5 (live) | 70% |
| M6 (shadow) | 98% |

Six months of accuracy data from the payroll agent. The dip at M5 is what shipping without production-distribution testing looks like.
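
The talk didn't show code, but shadow mode is a simple pattern to sketch. Here's a minimal version in Python; every name in it (ShadowStore, classify_and_extract, submit_to_payroll) is mine, not the team's. The shape is what matters: the agent runs on real traffic either way, and one flag is all that stands between its output and the write path.

```python
from dataclasses import dataclass, field

# Shadow-mode sketch. Every name here is illustrative, not from the talk.

@dataclass
class ShadowStore:
    """Holds agent outputs for human review instead of submitting them."""
    records: list = field(default_factory=list)

    def record(self, email: str, entry: dict) -> None:
        self.records.append({"email": email, "entry": entry, "reviewed": False})

def classify_and_extract(email: str) -> dict:
    """Stand-in for the agent: classify the email, extract a payroll entry."""
    return {"type": "timesheet", "hours": 40}  # placeholder output

def submit_to_payroll(entry: dict) -> None:
    """Stand-in for the real write path."""
    print(f"submitted: {entry}")

def handle_email(email: str, store: ShadowStore, shadow: bool = True) -> None:
    entry = classify_and_extract(email)  # the agent runs on real traffic either way
    if shadow:
        store.record(email, entry)       # queued for human review;
        return                           # nothing touches the write path
    submit_to_payroll(entry)             # live mode only

store = ShadowStore()
handle_email("Timesheet attached, 40 hrs for Jo", store)                # shadow
handle_email("Timesheet attached, 40 hrs for Jo", store, shadow=False)  # live
```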

The 20/80 problem

The final slide from that talk had a number I wrote down immediately: agent engine = 20% of the work. The durable system around the agent = 80%.

That ratio feels off if you've spent most of your time thinking about which model to use, how to prompt it, how to evaluate it against a benchmark. Those things matter. They're just not where a production AI project lives or dies.

The 80% is the multi-stage evaluation pipeline. Shadow testing infrastructure. The control tower that gives ops and leadership visibility into what the agent is actually doing. Input governance for the weird formats production throws at you. The routing logic that decides which step of the workflow a given input actually belongs in.

None of that is prompt engineering. All of it is platform work.
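
To make the 80% a little more concrete, here's one way those layers could compose: governance first, then routing, then the agent, then an evaluation gate before anything is written. This is a sketch of the shape, not the payroll team's architecture; every function in it is a stub I made up.

```python
# Hypothetical composition of the platform layers around an agent.
# Every function here is a stub; the shape, not the logic, is the point.

def govern_input(raw: str) -> str | None:
    """Input governance: normalize or reject formats the agent can't handle."""
    text = raw.strip()
    return text or None  # e.g. an empty or image-only email gets rejected

def route(text: str) -> str:
    """Routing: decide which step of the workflow this input belongs to."""
    return "extract" if "timesheet" in text.lower() else "classify"

def run_agent(step: str, text: str) -> dict:
    """The 20%: the model call itself (stubbed)."""
    return {"step": step, "result": text[:40]}

def evaluate(output: dict) -> bool:
    """Evaluation gate: only well-formed outputs reach the write path."""
    return bool(output.get("result"))

def pipeline(raw: str) -> dict | None:
    text = govern_input(raw)
    if text is None:
        return None  # surfaced to ops via the control tower, not silently dropped
    output = run_agent(route(text), text)
    return output if evaluate(output) else None

print(pipeline("Timesheet for week 12: 38.5 hours"))
```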

I've spent 30 years watching organizations adopt new technology and invest heavily in the visible capability while underbuilding the infrastructure that makes it last. The pattern is consistent. AI isn't running a different play.

What breaks without the infrastructure

Enterprise AI conversations split fast once you get past the demo stage. Some teams want to know about governance, evaluation pipelines, how outputs get reviewed before they do anything irreversible. Most are asking which model to use and when they can ship.

A drop like that 94-to-70 one happens regardless. Without a control tower to surface it, teams find out through complaints, not metrics.

That's a platform problem. Someone has to own the pipeline, not just the agent.
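
Owning the pipeline can start smaller than people assume. A rolling accuracy check over human-reviewed samples would have caught that drop in days instead of weeks. The window size and threshold below are assumptions for the sketch, not numbers from the talk.

```python
from collections import deque

# Illustrative accuracy monitor: a rolling window over human-reviewed samples.

def alert(msg: str) -> None:
    print(f"PAGE OPS: {msg}")  # stand-in for a real alerting hook

class AccuracyMonitor:
    def __init__(self, window: int = 200, threshold: float = 0.90):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, agent_output: str, human_label: str) -> None:
        """Feed in each human-reviewed sample as it gets graded."""
        self.results.append(agent_output == human_label)
        self.check()

    def accuracy(self) -> float:
        return sum(self.results) / len(self.results)

    def check(self) -> None:
        # only alert once the window is full, so early noise doesn't page anyone
        if len(self.results) == self.results.maxlen and self.accuracy() < self.threshold:
            alert(f"rolling accuracy {self.accuracy():.0%} is below {self.threshold:.0%}")
```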

The line I can't stop thinking about

Day two had a closing panel. Loose, riffing. One panelist dropped a line that's stuck with me since:

"If you don't own your harness, you don't own your memory."

It took a beat to unpack. Your harness is your evaluation infrastructure: the test pipelines, the shadow mode, the tooling that decides what "good" looks like for your specific agents on your specific workloads. Your memory is what that harness teaches you over time: where your agents fail, which prompts hold up under real traffic, what your actual production distribution looks like.
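
In code terms: the harness is the loop you run, and the memory is the artifact it leaves behind. A minimal sketch, assuming the "memory" is nothing fancier than a JSON file in your repo; the file name and structure are mine, not anyone's real system.

```python
import json
from pathlib import Path

# Harness sketch. The JSON file standing in for "memory" is my assumption;
# the point is that the failure log lives somewhere you control.

MEMORY = Path("eval_memory.json")

def run_harness(agent, cases: list[dict]) -> float:
    """Run the agent over a test set; append every failure to the memory file."""
    memory = json.loads(MEMORY.read_text()) if MEMORY.exists() else []
    passed = 0
    for case in cases:
        got = agent(case["input"])
        if got == case["expected"]:
            passed += 1
        else:
            # the memory: what failed, and what the agent said instead
            memory.append({"input": case["input"],
                           "expected": case["expected"],
                           "got": got})
    MEMORY.write_text(json.dumps(memory, indent=2))
    return passed / len(cases)
```

Trivial on purpose. The value isn't the loop; it's the file it leaves behind, and who gets to read it.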

Outsource the harness to a vendor and the vendor runs your evaluation loop. They see your production failures first. Every edge case your agents surface builds their system's understanding, not yours.

Most teams are focused on which LLM provider to pick, which coding assistant to standardize on. The harness question comes later, usually when a vendor relationship turns complicated and they realize how hard it is to move.

The payroll team built their own. Multi-stage evals, shadow infrastructure, control tower, four weeks of real production traffic before anything touched the write path. That's why they landed at 98%. And that's why the knowledge of how to get there belongs to them.

Twenty percent for the agent. Eighty percent for the system around it. Teams that understand that ratio are the ones shipping agents that stick.
