A field guide to the fragilities the agent dashboard does not show.
Third in a short series. The first piece argued that the gap between AI-agents-in-pilots and AI-agents-in-production is being closed by an unglamorous infrastructure rebuild. The second piece showed what that infrastructure actually looks like — process registries, tool gates, audit trails, the boring stuff. This piece is about what the previous two were too polite to say.
I know an engineer who used to work in fraud detection. She told me, once, that the team was proud of how rarely they saw an alert. The dashboard was almost always green. Months would go by without a single high-severity event firing.
What she eventually realised was that the absence of alerts was not the absence of fraud. The absence of alerts was the absence of fraud that her detector was trained to find. The fraud kept happening. It had simply moved.
I have been thinking about that conversation a lot, lately, in the context of AI agents.
*The average is fine. It's the tail that takes you out.*
Here is what most agentic dashboards in 2026 look like. Success rate: 98.7%. P95 latency: 3.2 seconds, well under target. Cost per instance: four cents. The little arrows next to each number are pointing in the right direction. The deck is green. The CFO nods.
This is the friendly hump. It is real. It is also not the whole story.
Consider the actual shape of an agent’s payoffs.

Figure 1. The visible distribution and the part that takes you out.
The hump on the right is what the dashboard measures — the small, frequent wins. The mean of the whole distribution is positive. Looks great in the deck. But the dashboard does not show you the long, ugly tail on the left, and it especially does not show you the spike at the very edge — the one that goes off the chart at minus two million dollars on a Tuesday.
Most days, the agent saves you money. One day a year, it does something irreversible.
The mean does not capture this. Variance does not capture this. Standard deviation, ratio metrics, three-sigma confidence intervals — none of them capture this. All of them assume the world is symmetric. The agentic world is not symmetric. It is a world where the upside is bounded — you can save at most $200 a day; you cannot save $2M a day — and the downside is not. The agent really can lose you $2M in an afternoon if it deletes the wrong table or sends the wrong wire.
This is what statisticians call a fat-tailed distribution. The tech industry has been designing for the friendly hump for thirty years. The collision with this kind of risk shape was always going to be ugly.
A practical implication, free of charge: any KPI that summarises agent performance into a single number is, structurally, lying to you. Not because the people who built it are dishonest — most of them are excellent — but because the math itself does not reduce to a single number when the distribution has a fat tail. If your weekly review shows you the green KPI and not the loss histogram, you are looking at the wrong artifact.
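To make that concrete, here is a minimal sketch. The payoff numbers are entirely made up — chosen only so the rare loss actually shows up in a sample of this size — but the shape is the one described above: bounded wins on almost every run, a rare large loss that no single summary number reflects.

```python
import random

random.seed(7)

# Illustrative, made-up payoff per agent run: small bounded wins almost always,
# one rare large loss. The dollar figures and the probability are arbitrary,
# chosen only so the rare event actually appears in a sample of this size.
def run_outcome() -> float:
    if random.random() < 1 / 5_000:       # the rare bad run
        return -30_000.0
    return random.uniform(5.0, 20.0)       # the friendly hump

outcomes = [run_outcome() for _ in range(100_000)]

success_rate = sum(o > 0 for o in outcomes) / len(outcomes)
mean = sum(outcomes) / len(outcomes)
worst = min(outcomes)

print(f"success rate : {success_rate:.2%}")   # the green KPI on the deck
print(f"mean outcome : {mean:+.2f}")          # still looks fine as a single number
print(f"worst outcome: {worst:+,.0f}")        # the number the summary hides
```

The success rate and the mean both stay green. The last line is the loss histogram's edge, and it is the only one of the three that tells you anything about Tuesday.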
What gets measured causes what gets ignored
There is a specific thing that happens when you put a dashboard up on a wall.
The numbers on the dashboard improve. The numbers not on the dashboard get worse. This is not a metaphor; it is a regularity observed often enough to have earned a name — Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
I have watched it happen with agentic systems. A team measures success rate. Engineers, who are clever, learn to define “success” generously. The success rate goes up. Costs go up too, but those weren’t on the dashboard until later. By the time they appear, the team has six months of “success” history that is impossible to question without admitting the metric was wrong.
Or: a team measures cost per instance. Engineers learn to break a job into smaller instances. Cost per instance goes down. Total cost goes up. The dashboard is happy.
Or: a team measures user satisfaction. Engineers learn to time the survey for after a successful interaction. Satisfaction goes up. The angry users who churn before the survey arrives are invisible.
The agentic era is especially susceptible to this because the metrics are easier to game. An agent’s “success” is partially a matter of LLM-as-judge interpretation. An agent’s “cost” depends on which tokens you count. An agent’s “latency” is meaningful only if you fix the routing. Every degree of freedom in the metric is a place where reality and the dashboard part ways.
What you actually want to know is what the worst output looked like this week. Not the average. The worst.
Pull up the ten ugliest traces from production every Friday and read them. You will learn more in twenty minutes than from a quarter of summary statistics.
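What that Friday pull might look like, as a sketch. The trace store, table name, and columns here are assumptions — substitute whatever your observability stack actually records. The only part that matters is ordering by worst, not sampling at random.

```python
import sqlite3

# Hypothetical trace store. The table name, columns, and the judge_score field
# are assumptions -- map them to whatever your observability stack records.
conn = sqlite3.connect("agent_traces.db")

worst = conn.execute(
    """
    SELECT trace_id, started_at, judge_score, final_output
    FROM traces
    WHERE started_at >= datetime('now', '-7 days')
    ORDER BY judge_score ASC      -- worst first, not a random sample
    LIMIT 10
    """
).fetchall()

for trace_id, started_at, score, output in worst:
    print(f"--- {trace_id} ({started_at}) score={score}")
    print(output[:500])           # enough to read aloud, not just to skim
```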
The cure becomes the disease
You might reasonably ask: if the tail risk is so awful, why not bolt on a bunch of safety gates? Approval workflows. Human checkpoints. Audit logs. Guardrails. The previous piece in this series catalogued them at length.
The answer, the one I want this piece to spend the most time on, is that interventions in complex systems often cause the harm they were meant to prevent. The medical word for this is iatrogenic — from the Greek iatros, “physician,” plus the suffix for causation. Doctors used to bleed patients. The patients got worse. The doctors interpreted the worsening as proof more bleeding was needed.
The agentic version of this is depressingly common.

Figure 2. The intervention becomes the cause.
A safety gate gets added to a workflow. The gate is real, well-intentioned, and at first glance reduces risk. The wait time on approvals starts to grow. As the wait time grows, the operations team — who are measured on throughput, not safety — starts to lean on approvers to sign off faster. The approvers, who are not domain experts and have a queue of forty other items, develop a habit of rubber-stamping. The rubber-stamped items contain the exact failures the gate was designed to catch. The gate now provides a false sense of safety while the system is actually less safe than before, because the people downstream have learned to trust it.
The gate caused the failure mode it was meant to prevent. This is not a hypothetical. Ask anyone who has worked in a regulated industry. The compliance team will tell you about the time the audit checkbox became the only thing that mattered, and the actual quality of the audit declined for five years until something broke loudly enough to reset.
The deeper point: every intervention you make has a second-order effect. You cannot reason about the first-order effect alone. If you add a gate, ask yourself what behaviour the gate will create downstream. If it creates pressure to bypass, you may have made things worse. If it creates a more thoughtful pause, you may have made things better. The same intervention, different outcomes — depending on the context the rest of the system provides.
Any safety mechanism the people downstream have learned to circumvent is, in practice, an unsafe mechanism.
Treat it as such. Remove it, redesign it, or replace it with something that creates the right kind of pressure rather than the wrong kind.
Three responses to the same wave
If the agent is going to face stress — edge cases, unexpected inputs, malformed responses, a tool that comes back garbled at 3 AM — then the right question is not how do we eliminate stress. The stress is the world. You do not get to eliminate it. You only get to choose how your system responds when it arrives.
There are three ways something can respond to stress, and they look identical for a long time before they don’t.

Figure 3. The three response curves. They diverge only when stress is high enough to matter.
The fragile system runs beautifully under normal load. Right up until it doesn’t. There is no warning, because the system was operating well within its capacity all along, and the failure mode is not a gradual degradation but a discontinuity. The agent that has never seen Cyrillic text in its training distribution will handle Latin characters perfectly until the day a Russian customer file shows up, and then it will produce something that looks confident and is very wrong.
The robust system, the one most engineers aim for, takes the stress and shrugs. It survives. It does not get better. It does not get worse. This is fine, but it is also a ceiling. A robust system in 2026 will be a robust system in 2030, but the world in 2030 will have moved.
The third kind — the kind that actually improves under stress — is the rare and worth-aiming-for one. Each edge case it encounters becomes part of its training data. Each failed run becomes a regression test. Each unexpected tool error tightens its retry logic. Stress is not a threat; stress is the input to its improvement.
Most agents in production today are the leftmost curve. They look the same as the others — until they don’t. The teams that operate them mistake the absence of failures for the presence of robustness. These are not the same thing.
A diagnostic, free of charge: if your team’s response to a near-miss is to file a Jira ticket and forget about it, you are running a fragile system. If the response is “let’s add this to the test suite,” you are running a robust one. If the response is “let’s update the agent’s prompt and the eval set and the regression tests in one PR, automatically generated from the trace,” you are running the third kind.
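A sketch of what that third response can look like in code, assuming a hypothetical trace format — every field name here is a placeholder for whatever your tracing system actually emits:

```python
import json
from pathlib import Path

def trace_to_eval_case(trace: dict) -> dict:
    """Turn a failed production trace into a regression case.
    Every field name here is a placeholder for whatever your tracing emits."""
    return {
        "id": trace["trace_id"],
        "input": trace["user_input"],
        "tools_available": trace["tools"],
        "bad_output": trace["final_output"],   # what the agent actually did
        "expected_behaviour": "",              # filled in by a human reviewer
    }

def append_to_suite(trace: dict, suite_path: str = "evals/regressions.jsonl") -> None:
    """Append the case to a JSONL eval set that runs on every prompt change."""
    case = trace_to_eval_case(trace)
    path = Path(suite_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(case) + "\n")
```

The point is not the ten lines of Python. The point is that the trace, the eval case, and the prompt change travel together, so stress arrives as training signal rather than as a ticket.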
The barbell
Here is the strategy I recommend for any team trying to build agents that survive contact with reality. It is not original. The investing world has known about it for decades. It applies to engineering with no modification.

Figure 4. Two extremes, deliberately. The middle is where the demos live and the post-mortems get written.
Take everything your business does. Sort it into the things that matter (a wire transfer, a regulatory filing, a customer’s medical record) and the things that don’t (a rough draft, a tool suggestion, a search query summary).
For the things that matter, do not use an AI agent. Use a deterministic rule. A SQL query. A validation. A handwritten if-then. Boring. Predictable. Auditable. Coded by a human who can be fired if it goes wrong.
For the things that don’t matter, give the agent the most freedom you can. Open-ended exploration. Creative drafting. Multi-step research. The downside is bounded — at worst, the output is bad, and you throw it away. The upside is large, because the agent can do things no rule could anticipate.
What you must not do is the comfortable middle. The “we’ll let the agent decide, but with some guidelines” approach. The “human in the loop, but only sort of” approach. The “rules engine that calls an LLM that calls a rules engine” approach. The middle is where you get the worst of both: the unboundedness of an agent attached to outcomes that matter.
This middle, incidentally, is where most enterprise AI demos live. A loan officer agent that “advises” but the advice gets followed 95% of the time. A pricing agent that “suggests” but the suggestion ships unchanged. A hiring agent that “helps” but no human ever overrides it. These are not the bounded delegations they appear to be. They are the unbounded ones, dressed up in the language of constraint.
The barbell is harder to sell than the middle. The middle sounds reasonable in a meeting. The barbell sounds extreme. But extremity, in this domain, is closer to the shape of reality than reasonableness is. Reality is mostly small frequent events plus rare large ones. The barbell is the strategy that fits that shape.
What the dashboard does not show
I want to come back to the dashboard, because I have a specific complaint about it.
Most agentic dashboards in production are designed to show you what already broke. They are autopsies. The thing that takes down production next quarter is not on this quarter’s dashboard, because the dashboard only knows about failure modes it has already seen.

Figure 5. The KPIs are green. The substrate of incubating failure is not on the deck.
Above the line: the green KPIs that go in the leadership deck. Below the line: the dozens of small near-misses that nobody is paid to look at. The silent retries. The schema drifts. The hallucinated arguments to tool calls that happened to land on a parameter the API was lenient about. The timeouts that succeeded on retry. The user complaints that came in and were quietly closed because they did not match a known issue.
Each of these is, individually, beneath notice. Together, they are the substrate from which Tuesday’s catastrophe will incubate.
There is a specific kind of engineer who loves this stuff. They are the people who, when nothing is on fire, go looking for things that almost caught fire. They tail logs nobody asks them to tail. They run analyses on retry distributions. They set up alerts on metrics nobody else cares about. They are the reason your system has not yet had its Tuesday.
Most organisations, sadly, do not value these engineers. They are not promoted, because their work is invisible. The dashboards, you see, are still all green.
Find the engineer who is looking under the surface, and pay them more. Pay them visibly more. I say this having seen myself in that mirror.
Make it institutional. Or accept that you are betting your firm’s future on the assumption that this Tuesday will not be the wrong Tuesday.
Things that will not save you
A list, in no particular order, of things that are sold as solutions to agentic fragility but mostly are not:
More elaborate prompts. A prompt is a wish. The model is under no obligation to honour your wish. Wishes that worked yesterday will fail tomorrow when the model gets updated, the input distribution shifts, or the moon phase changes. Anything that depends on a prompt being followed cannot be a load-bearing safety mechanism.
Bigger models. A bigger model is more capable. It is also more capable of being wrong with confidence. The error mode of older models was obvious nonsense. The error mode of frontier models is plausible, well-formatted, internally consistent nonsense. The dangerous failures got prettier with scale, not rarer.
More guardrails. Guardrails work the way fences work. They are great until something climbs them, and then they make it harder to see what is happening on the other side. The thing about an adversary — and a misbehaving agent is, in effect, an adversary — is that an adversary will route around the guardrail you specifically built. Guardrails should be one layer in a defence-in-depth, not the whole defence.
Fine-tuning on incidents. You will fine-tune the model on this quarter’s incidents. The model will get better at this quarter’s incidents. Next quarter’s incidents will be different — that is what makes them next quarter’s incidents. Fine-tuning is rear-view-mirror engineering.
Audit logs. An audit log tells you what already happened. It does not stop the next thing from happening. It is a forensic tool, not a preventive one. Treat it accordingly.
A bigger model evaluating the smaller model. The fashionable architecture: a small fast model does the work, a big slow model judges. This works fine until the big slow model has the same blind spot as the small fast one — which it often does, because they share most of their training data. You have not added a check; you have added a correlated one. This is the same statistical mistake that made 2008 worse than it should have been.
A committee. Committees do not reduce risk. They distribute responsibility, which is a different thing. A failure of the committee is no one’s failure in particular, which means no one will be in a position to learn from it. The fastest-improving organisations have decisions owned by named individuals who pay a real price for being wrong.
This is a depressing list. I am sorry. The good news is that most of these are not necessary if you get the architecture right in the first place. The bad news is that getting the architecture right is itself difficult, and most of the people selling you these things have never had to do it.
Things that actually help
A shorter list. Hopefully more useful.
Skin in the game for the people building the system. If the team that ships the agent is also on the pager when it fails — at 3 AM, every time, no exceptions — they will design differently. If the team that ships the agent never feels the failure, they will optimise for the green dashboard and let the variance grow. Skin in the game is not a virtue; it is a feedback mechanism. Without it, no amount of process will produce safe systems.
Slow rollouts on irreversible actions. Reversible actions can be deployed quickly. Irreversible ones — anything involving money, customer records, sent communication — should roll out to 1% of traffic for two weeks, then 5%, then 20%, then 100%. The point is not the percentages. The point is that you give your detection systems enough exposure to find the failure modes before the blast radius is everyone. (A sketch of such a gate follows the list.)
Read the bad traces. Every Friday, every team that runs an agent in production should pull the ten worst traces from the past week and read them out loud. Out loud. Together. You will be amazed how many failure modes you spot when the bad output is in the room with you, and how few you spot in a Jira ticket.
Kill features that are not paying their way. The agentic world has a horrible habit of accumulating capabilities. Every new tool, every new connector, every new sub-agent expands the attack surface. Periodically — quarterly is fine — go through and kill the ones nobody is using. The system gets simpler. The blast radius shrinks. Nothing is lost, because nothing was being used.
Pay people to think about what could go wrong. Most engineering teams pay to ship features. The most expensive engineers, in any system that survives the long run, are the ones who do not ship features and instead think about what would happen if the features already shipped behaved badly. These engineers are unpopular. They are the reason your firm exists in five years.
That is the whole list. Not glamorous. Not new. Five practices, every one of them invented before the LLM was, all of them ignored by approximately 90% of the teams currently shipping agentic systems.
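For the second practice on the list — slow rollouts — here is the promised sketch of a rollout gate. The dates and percentages are placeholders; the deterministic bucketing matters more than the schedule, because it keeps the same entities in the exposed cohort, which makes a failure found at 1% attributable and reproducible.

```python
import hashlib
from datetime import date

# Placeholder schedule; the dates and percentages are illustrative.
ROLLOUT_STAGES = [
    (date(2026, 3, 2), 0.01),
    (date(2026, 3, 16), 0.05),
    (date(2026, 3, 30), 0.20),
    (date(2026, 4, 13), 1.00),
]

def current_fraction(today: date) -> float:
    """Return the fraction of traffic the agent is allowed to touch today."""
    fraction = 0.0
    for start, pct in ROLLOUT_STAGES:
        if today >= start:
            fraction = pct
    return fraction

def agent_handles(entity_id: str, today: date | None = None) -> bool:
    """Deterministic bucketing: the same customer is always in or out of the cohort,
    so failures found at low exposure can be traced and replayed."""
    today = today or date.today()
    bucket = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % 10_000
    return bucket < current_fraction(today) * 10_000
```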
The practices that survive
A small thing to close on.
The practices that worked for previous generations of mission-critical software — version control, code review, slow rollouts, blameless post-mortems, on-call rotations, runbooks, capacity planning — all still work. They have survived for thirty years because they encode something true about how complex systems fail. Things that have lasted that long usually keep lasting. New things usually don’t.
The practices being invented today specifically for AI agents are mostly not in this category. Most of them are someone’s preferred opinion, untested by time. Some will turn out to be useful. Most will turn out to be cargo-culted variations on practices we already had, dressed up in new vocabulary and sold at a premium.
If you have to choose, in 2026, between an engineering practice that has been around since the 1990s and a “novel agentic safety framework” announced last quarter at a conference, bet on the older one. It will be wrong less often. And when it is wrong, it will be wrong in known ways, which is much safer than being wrong in surprising ones.
The previous piece in this series ended on the line that the boring infrastructure takes the 11% to 60%.
I want to add the obvious corollary: getting to 60% on a fragile foundation just means you reach your Tuesday that much faster. The number on the dashboard is not the thing. The shape of the distribution is.
Look at the shape.
The series continues. The next piece will go into specific patterns — the ones that survive Tuesday, and the ones that don’t — and what tells them apart. If you found this useful, the previous two pieces are linked at the top.