Four out of five companies now run at least one AI agent in production. By 2027, Gartner expects 40% of those projects to be scrapped.
Read the post-mortems, and you notice something: the model is rarely the cause.
The reasons are boring and structural. Bad data. No way to measure whether the thing works. No visibility into why it does what it does. Software the company rented but never owned. A shiny demo bolted onto a broken process.
None of that is a machine learning problem. It's an engineering discipline problem. And the teams whose AI survives contact with production all do the same three boring things.
I run an AI product engineering shop. We build production systems for mid-market companies, and we've watched this pattern enough times to bet the company on it. Here's the discipline.
1. Evals before features
The single most common mistake: writing the prompt before writing the test suite.
An eval is a set of cases that define what "correct" means for your specific problem. Not a public benchmark. Your problem. The invoices that are actually fraudulent. The tickets that actually need a human. The outputs that would actually get someone fired if they were wrong.
If you can't measure correctness, you can't improve, you can't catch regressions when the model provider ships an update, and you can't tell a stakeholder why you trust the system. You're guessing with extra steps.
We write the evals first. Before the prompt. Here's the shape of it:
# evals run BEFORE you build the agent,
# and again on every change after
# (run, recall, false_positive_rate, precision and Scorecard are your own helpers)
def test_fraud_screening(model, ground_truth):
    results = run(model, ground_truth.invoices)
    rec = recall(results, ground_truth.known_fraud)
    fpr = false_positive_rate(results, ground_truth.known_fraud)
    prec = precision(results, ground_truth.known_fraud)
    # catch the fraud that already cost us money
    assert rec >= 0.95
    # but don't flag everything, or the team ignores you
    assert fpr <= 0.15
    # every decision must be explainable
    assert all(r.has_reason for r in results)
    return Scorecard(recall=rec, fpr=fpr, precision=prec)
The most important rows in that test set are the failures you already know about. If the system can't catch the fraud that has already happened, nothing else matters.
This is also the gate. No eval pass, no ship. It turns "we don't ship broken AI" from a slogan into a build step.
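If you want something you can run as-is, here is a minimal self-contained sketch of the same gate. Everything in it is an assumption for illustration: the `Decision` and `Scorecard` types, the `score` helper, the toy invoice ids, and the reason strings are all hypothetical, and the thresholds are simply the ones from the shape above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    invoice_id: str
    flagged: bool
    reason: str  # empty string means the system gave no explanation


@dataclass(frozen=True)
class Scorecard:
    recall: float
    false_positive_rate: float
    precision: float
    all_explained: bool

    def passes(self, min_recall=0.95, max_fpr=0.15):
        return (self.recall >= min_recall
                and self.false_positive_rate <= max_fpr
                and self.all_explained)


def score(decisions, known_fraud_ids):
    """Compare decisions against the labeled set of known-fraud invoice ids."""
    flagged = {d.invoice_id for d in decisions if d.flagged}
    all_ids = {d.invoice_id for d in decisions}
    clean = all_ids - known_fraud_ids
    true_pos = flagged & known_fraud_ids
    false_pos = flagged & clean
    return Scorecard(
        recall=len(true_pos) / len(known_fraud_ids) if known_fraud_ids else 1.0,
        false_positive_rate=len(false_pos) / len(clean) if clean else 0.0,
        precision=len(true_pos) / len(flagged) if flagged else 1.0,
        all_explained=all(d.reason for d in decisions),
    )


# Toy data (hypothetical): 3 known fraud cases plus 10 clean invoices.
known_fraud = {"inv-1", "inv-2", "inv-3"}
decisions = (
    [Decision(f"inv-{i}", True, "duplicate invoice number") for i in (1, 2, 3)]
    + [Decision("inv-4", True, "new debtor, round amount")]
    + [Decision(f"inv-{i}", False, "matches payment history") for i in range(5, 14)]
)

card = score(decisions, known_fraud)
print(card)
# The gate: a failing scorecard raises, which fails the build step.
assert card.passes(), f"eval gate failed: {card}"
```

In a real pipeline, `decisions` would come from running the model over your labeled invoices, and the final assertion would live in whatever test runner or CI step blocks the release.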
2. Telemetry from day one
"The model is wrong" is an unfalsifiable statement without logs.
Production AI without instrumentation is a black box you can't open. When it makes a bad call, and it will, you have no way to diagnose it, no way to improve it, and no way to know if it's quietly drifting.
Instrument from the first deployment. For every decision, log:
- the input the decision was based on
- the output and the confidence
- the reason codes (which signals drove it)
- the model version that produced it
- whether a human overrode it, and how
That last one is the gold. Every human override is a labeled training example telling you exactly where the system is wrong. Feed it back into the eval set, and the system improves on your real data instead of someone else's averages.
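As a rough sketch of what one logged decision could look like, and how overrides turn into eval cases: the `DecisionRecord` fields mirror the list above, but the field names, the `overrides_to_eval_cases` helper, and the sample data are assumptions for illustration, not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class DecisionRecord:
    decision_id: str
    input_data: dict            # the input the decision was based on
    output: str                 # e.g. "flag" or "pass"
    confidence: float
    reason_codes: list          # which signals drove the decision
    model_version: str
    human_override: Optional[str] = None   # what the human chose instead
    override_note: Optional[str] = None    # how or why they overrode it
    logged_at: str = ""

    def to_json_line(self):
        record = asdict(self)
        record["logged_at"] = self.logged_at or datetime.now(timezone.utc).isoformat()
        return json.dumps(record, sort_keys=True)


def overrides_to_eval_cases(records):
    """Each override becomes a labeled case: the input plus the human's answer."""
    return [
        {"input": r.input_data, "expected": r.human_override, "source": r.decision_id}
        for r in records
        if r.human_override is not None and r.human_override != r.output
    ]


# Hypothetical sample data.
records = [
    DecisionRecord("d-1", {"invoice_id": "inv-7", "amount": 12000}, "pass", 0.62,
                   ["known_debtor"], "model-v1",
                   human_override="flag",
                   override_note="debtor address changed last week"),
    DecisionRecord("d-2", {"invoice_id": "inv-8", "amount": 300}, "pass", 0.97,
                   ["matches_payment_history"], "model-v1"),
]

for r in records:
    print(r.to_json_line())  # in production this would go to your log sink

new_cases = overrides_to_eval_cases(records)
```

One JSON object per line keeps the log easy to append to and easy to replay. The cases in `new_cases` can be added straight to the eval set.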
Telemetry is also what lets you take the human out of the loop safely, one category at a time, as the data earns it. Adoption rate goes up when you sign a contract. Autonomy goes up when the telemetry proves it's safe.
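One way to make "as the data earns it" concrete is a per-category rule over the override log. The sketch below is an assumption, not a recommendation: the category names, the 200-decision minimum, and the 2% override ceiling are hypothetical numbers you would set with whoever owns the risk.

```python
from collections import defaultdict


def autonomy_by_category(records, min_decisions=200, max_override_rate=0.02):
    """Return {category: bool}: True where human review could be dropped.

    `records` is an iterable of (category, was_overridden) pairs taken
    from the decision log.
    """
    totals = defaultdict(int)
    overridden = defaultdict(int)
    for category, was_overridden in records:
        totals[category] += 1
        overridden[category] += int(was_overridden)
    return {
        category: (totals[category] >= min_decisions
                   and overridden[category] / totals[category] <= max_override_rate)
        for category in totals
    }


# Hypothetical log summary.
records = (
    [("repeat_debtor", False)] * 297 + [("repeat_debtor", True)] * 3  # 1% overridden
    + [("new_debtor", False)] * 180 + [("new_debtor", True)] * 20      # 10% overridden
    + [("cross_border", False)] * 50                                   # too few to judge
)
decisions = autonomy_by_category(records)
print(decisions)
```

The minimum-volume check matters as much as the rate: a category with no overrides and only a handful of decisions hasn't proven anything yet.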
3. Owned infrastructure
A lot of "AI solutions" are a subscription that disappears when the vendor pivots. The client owns a login, not a system.
Build it so the client owns the code, the hosting, the data, and the models. Not for ideological reasons. For continuity. If the system breaks at 2am, someone needs to be able to open it. If the vendor vanishes, the business still has to keep running. Owned infrastructure is the difference between a competitive asset and a dependency with a monthly invoice.
This one is less about technique and more about how you structure the engagement. But it's the one buyers feel most when it goes wrong.
What it looks like together
Concrete example. A factoring company screens thousands of invoices a month for fraud, by hand. Two analysts, full time. The off-the-shelf tool they bought scored everything "medium," couldn't explain itself, and never learned from their actual losses. So the team ignored it.
The fix wasn't a smarter model. It was the discipline:
- an eval harness built from their three known fraud cases plus clean invoices, so accuracy was proven before launch
- telemetry on every score, so every flag was explainable to the credit committee
- the whole system in their infrastructure, their data never leaving their environment
The model was the easy part. The engineering around it was the job.
The takeaway
If you're evaluating an AI build, internal or external, ask three questions:
- Do you write evals before features?
- Do you instrument telemetry from day one?
- Do I own the code, the data, and the models?
If the answer is no, no, and no, you don't have an AI product. You have an expensive experiment with a deadline.
Take the eval harness pattern above and use it. It's yours, no strings.
And if you're a mid-market company that wants to see this applied to your own bottleneck, we put the front of our process online as a free tool. It runs an AI diagnostic on your biggest operational bottleneck, no signup, instant result. You can try it at prionation.io. Worst case, you get a free read on where AI would actually help.
