Alex Cloudstar

Originally published at alexcloudstar.com

AI Agent Reliability Engineering in 2026: SLOs, Error Budgets, And Failure Modes That Actually Matter

The dashboard said the agent was at 99.4 percent uptime for the quarter. The customer told me, on the same call where I was about to celebrate that number, that the feature had been broken for him for three weeks. He had stopped using it. He was not going to renew. The agent was returning two-hundreds the entire time. The HTTP layer was fine. The thing the agent was supposed to actually do, which was generate a report he could ship to his client, was not working at all. The model had silently regressed when we swapped a cheaper variant in. The pipeline carried on returning success codes for outputs nobody could use.

That call ended my career as a person who measures AI agent reliability with traditional service metrics. The numbers we had been shipping to the leadership deck were technically correct and operationally meaningless. The agent was up. The agent was also broken. Both can be true. The reliability framework I had inherited from a decade of regular service work could not see the difference, and I had to build one that could.

Two years on, the patterns for measuring and improving AI agent reliability have stabilized enough that I trust them. They are not the same as the SRE playbook for normal services, and trying to retrofit one onto the other is the most common reason teams ship reliability dashboards that do not match user reality. This is what actually works.

Why Traditional Reliability Numbers Lie About Agents

The reason a 200 OK does not mean an agent worked is that the agent is doing more than serving a request. It is making decisions. It is calling tools. It is generating outputs that have to be useful, not just well-formed. None of that is captured by an HTTP status code, a latency histogram, or a process uptime number.

A traditional service has a small number of failure modes. The process crashes. The database is unreachable. The deploy was bad. The disk is full. Each of these has a clear signal and a clear remediation. The reliability engineering for these failure modes is mature, and tools like Prometheus and PagerDuty solve most of it.

An agent has all of those failure modes plus a long list of new ones. The model regresses on a class of inputs after a provider-side update. The tool call returns the right shape but the wrong content. The retrieval pipeline pulls a stale document. The prompt template gets an extra newline that breaks the JSON-mode parsing. A schema validator was relaxed during a deploy and now garbage is flowing through. The user phrased the request in a way that hits a known weak spot. None of these surface as 500s. They surface as outputs that look fine to the system and wrong to the user.

The reliability engineering for these failure modes is not as mature as it should be by 2026, but the patterns have started to converge. The headline insight is that you have to measure outcome, not just throughput. A request that succeeds at the HTTP layer and fails at the task layer is still a failure. If your dashboard cannot see that, your dashboard is going to lie to you about the user experience.

The Three Layers Of An Agent SLO

The reliability target for an agent is not one number. It is at least three, stacked, and they have to be tracked separately because they fail in different ways and have different remediations.

Service-level reliability. Did the request hit the agent and come back with a non-error response in a reasonable time. This is the layer your existing tooling already covers. The HTTP success rate, the p95 latency, the deploy success rate. Necessary but not sufficient. A target of 99.5 percent here is conventional and reasonable.

Output validity. Did the agent return something that conforms to the contract it was supposed to return. JSON that parses. Tool calls with the right schema. Outputs that pass the type check before they get rendered. This is the layer where most teams realize the gap exists. A 200 with malformed JSON is not a success. The target here should usually be tighter than the service-level reliability, because the failures here often surface to the user as broken UI. I tend to target 99.9 percent.

Task success. Did the agent actually do the thing the user wanted. This is the layer that takes real eval work to measure, because "did the user get value" is fuzzier than "did the JSON parse." The tools for this in 2026 are evals run on a sample of production traffic, with grading by either a human, a verifier program, or another LLM. The target here is product-dependent, but for serious applications it is rarely below 95 percent and often higher. The same eval discipline I covered in AI evals for solo developers is what makes this measurable in the first place.

The reason all three are needed is that they fail independently. A model regression can collapse task success while service-level reliability stays at 100 percent. A bad deploy can collapse service-level reliability while task success is unaffected on the requests that actually go through. A schema change can collapse output validity while the other two are fine. If you only track one of these, you get a partial view of reality, and partial views are how customers churn while your dashboard is green.
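A minimal sketch of what recording the three layers separately can look like. The `agent`, `validator`, and `grade_task_success` callables are hypothetical stand-ins for the real components, the counters live in memory only, and task success is graded on a sampled fraction of traffic rather than on every request.

```python
import random
from dataclasses import dataclass, field

@dataclass
class AgentSLOTracker:
    """Tracks the three SLO layers separately, because they fail independently."""
    counts: dict = field(default_factory=lambda: {
        "service": {"ok": 0, "total": 0},
        "validity": {"ok": 0, "total": 0},
        "task": {"ok": 0, "total": 0},
    })

    def record(self, layer: str, ok: bool) -> None:
        self.counts[layer]["total"] += 1
        if ok:
            self.counts[layer]["ok"] += 1

    def rate(self, layer: str) -> float:
        c = self.counts[layer]
        return c["ok"] / c["total"] if c["total"] else 1.0

def should_sample(rate: float = 0.05) -> bool:
    # Grade a small fraction of production traffic for task success.
    return random.random() < rate

tracker = AgentSLOTracker()

def handle_request(request, agent, validator, grade_task_success):
    # Layer 1: service-level -- did the call complete without an error?
    try:
        output = agent(request)
        tracker.record("service", True)
    except Exception:
        tracker.record("service", False)
        raise

    # Layer 2: output validity -- does the output conform to the contract?
    tracker.record("validity", validator(output))

    # Layer 3: task success -- graded on a sample, not on every request.
    if should_sample():
        tracker.record("task", grade_task_success(request, output))

    return output
```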

Error Budgets That Match The Reality

The classic SRE error budget assumes that failures are independent, attributable, and roughly evenly distributed in time. None of that is true for agent failures.

A model regression after a provider-side update is not independent. It hits every request in the affected class until you switch models. A retrieval pipeline failure correlates across users who happen to query the same stale documents. A prompt template change ships at one moment and affects every request after it. The error budget burns in spikes, not in smooth curves, and the alerting has to reflect that.

The pattern that has worked is to set separate error budgets for each of the three SLO layers and to track burn rate, not just total burn. A burn rate that goes from 1x to 10x over an hour is the signal that something just broke, even if the absolute burn is still within budget. Alert on the rate, not on the total. The total tells you the story after the fact. The rate tells you the story while there is still time to act.
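A sketch of burn-rate alerting in that spirit. The two-window check and the thresholds are illustrative; the point is that the page fires on how fast the budget is burning, not on how much of it has burned.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Burn rate = observed error rate divided by the error rate the SLO allows.
    1.0 means budget is burning exactly as fast as the budget permits."""
    if requests == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target          # e.g. 0.05 for a 95% task-success SLO
    return (errors / requests) / allowed_error_rate

def should_page(short_window: tuple[int, int], long_window: tuple[int, int],
                slo_target: float, threshold: float = 3.0) -> bool:
    """Alert on the rate, not the total: page when both a short and a long
    window show a high burn rate, which filters out brief blips."""
    return (burn_rate(*short_window, slo_target) >= threshold
            and burn_rate(*long_window, slo_target) >= threshold)

# Example: a regression pushes task failures to 30% over the last five minutes
# and 20% over the last hour, against a 95% task-success SLO.
print(should_page((60, 200), (600, 3000), slo_target=0.95))  # True (6x and 4x burn)
```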

The other adjustment is that the error budget for task success has to be reset more often than for service reliability. A model upgrade, a prompt template change, a tool addition, any of these can shift the underlying success rate. If you carry over a budget calculated against the old behavior, you will spend it in a week and have nothing left for the rest of the month. I tend to reset task success budgets after any meaningful change to the agent's underlying components, with a fresh measurement of the baseline before declaring the new budget.

The last adjustment is that the budget should account for the cost of the failure, not just the count. A failure on a free-tier user is not the same as a failure on an enterprise user. A failure that the user can retry is not the same as a failure that loses their work. Weighted budgets, where high-stakes failures count for more, force the team to triage by impact instead of by volume, and that prioritization is what keeps the worst failures from being deprioritized just because they are rare.
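A sketch of what a weighted budget can look like. The tier and severity labels, and the weights themselves, are illustrative; the real values are a product decision, not a constant.

```python
# Illustrative weights -- high-stakes failures burn more budget than routine ones.
FAILURE_WEIGHTS = {
    ("enterprise", "work_lost"): 20.0,
    ("enterprise", "retryable"): 5.0,
    ("free", "work_lost"): 3.0,
    ("free", "retryable"): 1.0,
}

def weighted_burn(failures: list[tuple[str, str]]) -> float:
    """Each failure is (user_tier, severity); the budget burns by weight, not count."""
    return sum(FAILURE_WEIGHTS.get(f, 1.0) for f in failures)

# Ten retryable free-tier failures burn less budget than one enterprise data-loss failure.
print(weighted_burn([("free", "retryable")] * 10))    # 10.0
print(weighted_burn([("enterprise", "work_lost")]))   # 20.0
```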

The Failure Modes Worth Naming

Treating "the agent is broken" as a single failure mode is what produces incident reviews that go nowhere. The reality is that there are a small number of distinct failure modes, each with their own signal, their own remediation, and their own postmortem shape. Naming them is what lets the team build a runbook.

Model regression. The model you are calling has changed behavior on a class of inputs. The output validity rate or the task success rate drops on the affected bucket. The fix is to pin to a specific model version, switch providers, or roll forward with a new prompt that handles the changed behavior. The detection is your eval running on production traffic and noticing the drop. The runbook step is to compare current outputs against a holdout set from the last known-good period.
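A sketch of that runbook step, assuming a hypothetical eval case shape with an input and a grading callable; the tolerance is a per-product judgment call.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    grade: Callable[[str], bool]   # verifier, human label, or LLM judge wrapped as a callable

def detect_regression(cases: list[EvalCase], model: Callable[[str], str],
                      baseline_pass_rate: float, tolerance: float = 0.03) -> bool:
    """Re-run the holdout set against the current model and compare the pass
    rate to the baseline captured during the last known-good period."""
    passed = sum(1 for c in cases if c.grade(model(c.input)))
    return passed / len(cases) < baseline_pass_rate - tolerance
```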

Tool failure. A tool the agent calls is returning errors, returning the wrong shape, or returning data that is stale or wrong. The output validity may stay high if the agent recovers gracefully. The task success will drop because the agent is operating on bad inputs. Detection is per-tool error rates and per-tool semantic checks. The runbook step is to verify the tool independently of the agent, isolating whether the issue is the tool or the agent's use of it. This is the same observability shape I covered in agent observability and debugging, and most of the recurring tool failures trace back to the tool design choices made before the agent shipped.

Retrieval drift. The retrieval pipeline is returning documents that are stale, irrelevant, or duplicated. The agent's outputs feel slightly off. The user does not always notice individual failures, but renewal numbers slip. Detection requires sampling retrieval results and grading them. The runbook step is to verify the index freshness, the embedding pipeline, and the similarity thresholds.

Prompt regression. A change to the prompt template, often well-intentioned, has broken a class of requests. The window between deploy and detection is the danger zone. Detection is an eval that runs on every prompt change and an alert on task success rate after deploys. The runbook step is to revert the prompt change and triage in a non-production environment.

Schema drift. The agent is returning outputs that pass the looser validators but fail the stricter ones, or that have started to drift from the expected shape. Detection is a strict schema validator running on a sample of production outputs and surfacing drift before the looser one starts letting bad data through. The runbook step is to tighten the validator and rerun the eval.
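A sketch of the strict-validator check, assuming pydantic v2 is available; the `ReportOutput` fields are illustrative stand-ins for whatever contract the agent is supposed to honor.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError

class ReportOutput(BaseModel):
    """Strict contract: unexpected fields are rejected instead of silently passed through."""
    model_config = ConfigDict(extra="forbid")

    title: str = Field(min_length=1)
    summary: str = Field(min_length=1)
    confidence: float = Field(ge=0.0, le=1.0)

def strict_validation_failure_rate(sampled_outputs: list[str]) -> float:
    """Run the strict validator over a sample of raw production outputs and
    report the failure rate, which is the schema-drift signal to alert on."""
    failures = 0
    for raw in sampled_outputs:
        try:
            ReportOutput.model_validate_json(raw)
        except ValidationError:
            failures += 1
    return failures / len(sampled_outputs) if sampled_outputs else 0.0
```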

Provider outage. The provider is returning errors, rate limiting, or timing out. The fallback should pick this up. The signal that something is wrong is the fallback firing rate going up. Detection is the router's own metrics. The runbook step is to verify the fallback is actually working and to switch primary providers if the outage is sustained. The patterns I covered in the LLM router pattern guide are what make this a runbook step instead of an incident.

Cost spike. The bill is climbing faster than the traffic. Something has changed in the cost shape of the work. A new prompt is longer than the old one. A bucket is escalating to the expensive model more often than expected. A user has discovered a way to drive up token usage. Detection is per-bucket and per-user cost dashboards with alerts on derivative changes. The runbook step is to identify the cost source and either contain it, optimize it, or surface it as a billing issue.
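A sketch of a derivative-style cost check: compare the latest window's spend for a bucket against its trailing average and flag when it jumps by more than a factor. The window size, history length, and spike factor are all illustrative.

```python
from collections import defaultdict

class CostSpikeDetector:
    """Tracks per-bucket spend in fixed windows and flags when the latest
    window grows faster than a multiple of the trailing average."""

    def __init__(self, spike_factor: float = 3.0, history: int = 12):
        self.spike_factor = spike_factor
        self.history = history
        self.windows: dict[str, list[float]] = defaultdict(list)

    def record_window(self, bucket: str, spend: float) -> bool:
        past = self.windows[bucket][-self.history:]
        self.windows[bucket].append(spend)
        if len(past) < 3:
            return False   # not enough history to call it a spike
        baseline = sum(past) / len(past)
        return spend > baseline * self.spike_factor

detector = CostSpikeDetector()
for hourly_spend in [4.0, 5.0, 4.5, 5.2, 4.8, 21.0]:
    if detector.record_window("report-generation", hourly_spend):
        print("cost spike in bucket report-generation")   # fires on the 21.0 window
```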

Hallucination. The agent is producing outputs that look right and are wrong. This is the hardest failure mode to detect because the surface signals are clean. Detection requires either a verifier that catches the specific class of hallucination (a tool call that references a non-existent file, a citation that does not match the source, a number that does not appear in the input) or a sampled review by a human. The runbook step is to harden the verifier and to retrain or reprompt against the failure mode.

Each of these has a different signature, a different signal, and a different remediation. The runbook should have a section for each. Pattern matching the symptom to the failure mode is the first step. Without a named failure mode, the team is in "the agent is broken" mode, and that mode does not converge on a fix.

Drills That Find The Bugs Before Production Does

Most agent reliability bugs hide until production traffic finds them. The reason is that the input space is large and the test traffic is usually small. The fix is to run drills that simulate real failure modes and verify the system handles them.

The drills that have caught the most for me.

Provider outage drill. Take the primary provider offline in staging. Run a real traffic pattern. Verify the fallback fires, the latency stays within budget, and the task success rate stays above the SLO. The first time you run this, something will be missing. A key not configured. A timeout set wrong. A fallback model that does not actually exist. Better to find it on Tuesday afternoon than during the actual outage.

Model regression drill. Swap the model behind a bucket to a deliberately weaker variant. Run the eval. Verify the alerting fires before the budget is exhausted. The drill verifies that your eval-based detection is connected to your alerting, which is the part that almost always has a gap.

Tool failure drill. Make a tool return errors, then make it return malformed responses, then make it return slow responses. Each is a different failure shape. Verify the agent handles each gracefully and the metrics surface the failure correctly. The slow-response case in particular tends to cause subtle bugs where requests pile up and timeouts compound.
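A sketch of the fault injection that drives this drill. The wrapper and the failure shapes are illustrative; the point is that the drill can force each shape without modifying the real tool.

```python
import random
import time

def with_fault_injection(tool, mode: str):
    """Wrap a tool callable so a drill can force a specific failure shape."""
    def wrapped(*args, **kwargs):
        if mode == "error":
            raise RuntimeError("injected tool failure")
        if mode == "malformed":
            return {"unexpected": "shape"}          # right type, wrong contract
        if mode == "slow":
            time.sleep(random.uniform(5, 15))       # long enough to hit agent timeouts
        return tool(*args, **kwargs)
    return wrapped

# Drill usage: replay staging traffic three times, once per failure shape, and
# check that per-tool metrics surface the failure and the agent recovers. The
# agent.tools registry here is hypothetical.
# agent.tools["search"] = with_fault_injection(agent.tools["search"], mode="slow")
```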

Cost runaway drill. Simulate a user driving heavy traffic to an expensive bucket. Verify the cost dashboards alert. Verify the rate limiting kicks in before the budget is blown. Verify the postmortem path includes attributing the cost to the user. The first time someone runs this drill, the cost alerts are usually slower than they should be, and the rate limiting is often missing entirely.

Prompt change drill. Ship a prompt change to a staging environment with a deliberately broken section. Verify the eval catches it. Verify the rollout pauses or rolls back automatically. The drill is about verifying that your deployment process for prompt changes is as careful as your deployment process for code changes, which is rarely the case by default.

The shape of the drill is always the same. Force a known failure. Verify the system detects it. Verify the system mitigates it. Verify the runbook for handling it actually works. Repeat on a schedule. The drill calendar is what turns a reliability claim into a reliability fact.

Observability That Connects Layers

The observability that supports all of this has to span the three SLO layers, not just one. A trace that shows the HTTP request and the latency is not enough. The trace has to include the prompt, the tool calls, the retrieval results, the model used, the validator results, and the final output. Without that, debugging a task-level failure means reproducing it manually, which is the slow path.

The minimum I want to see in a production agent trace.

The full prompt that was sent, including the system prompt, the user message, and any context. Redacted as needed for privacy, but not stripped to the point of being unhelpful.

Every tool call, with the tool name, the arguments, the result, and the time taken. Tool calls are where most agent bugs live, and a trace without tool detail is missing the most useful part.

The model used and the version. If the router picked a different model than the default, the reason. The cost incurred. The token counts.

The validator results. Did the output pass schema validation. Did it pass any semantic checks. Did the verifier reject it and trigger a fallback.

The final output that was returned to the user. The thing the user actually saw. Without this, you cannot reproduce the user's experience.

The user identifier and the request bucket. Both are needed for cohort analysis when failures correlate with user segment or with workload type.

The shape that has won is OpenTelemetry traces with custom attributes for the agent-specific fields. The infrastructure for normal services already understands the trace format, and the custom attributes give you the agent-specific context. Most observability platforms can ingest these without bespoke work, and the analysis tools that have grown up around traces work for agent debugging without much adaptation.
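A sketch of what that can look like in Python, assuming the OpenTelemetry SDK and an exporter are configured elsewhere in the application; the attribute names and the `result` dictionary shape are illustrative, not a fixed convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def redact(text: str) -> str:
    # Placeholder for whatever redaction the privacy policy requires.
    return text

def run_agent_request(request, agent):
    with tracer.start_as_current_span("agent.request") as span:
        span.set_attribute("agent.user_id", request["user_id"])
        span.set_attribute("agent.bucket", request["bucket"])
        span.set_attribute("agent.prompt", redact(request["prompt"]))

        result = agent(request)

        # Model, routing, and cost context.
        span.set_attribute("agent.model", result["model"])
        span.set_attribute("agent.model_version", result["model_version"])
        span.set_attribute("agent.router_reason", result.get("router_reason", "default"))
        span.set_attribute("agent.cost_usd", result["cost_usd"])
        span.set_attribute("agent.tokens_in", result["input_tokens"])
        span.set_attribute("agent.tokens_out", result["output_tokens"])

        # Validator results and the output the user actually saw.
        span.set_attribute("agent.validator.schema_ok", result["schema_ok"])
        span.set_attribute("agent.validator.semantic_ok", result["semantic_ok"])
        span.set_attribute("agent.output", redact(result["output"]))

        # One event per tool call: name, arguments, result, timing.
        for call in result["tool_calls"]:
            span.add_event("tool_call", attributes={
                "tool.name": call["name"],
                "tool.arguments": redact(str(call["arguments"])),
                "tool.result": redact(str(call["result"])),
                "tool.duration_ms": call["duration_ms"],
            })

        return result["output"]
```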

The Postmortem Discipline That Actually Helps

Postmortems for agent incidents are different from postmortems for service incidents. The traditional template assumes a deterministic system and a clear root cause. Agent incidents often have several contributing factors and a fuzzy root cause that is more like "the model started doing this for these reasons."

The postmortem fields that have produced useful changes after agent incidents.

Which SLO was breached. Service, output validity, or task success. Each implies a different remediation surface.

Which failure mode it was. From the named list. If it does not fit a named mode, the postmortem produces a new mode and adds it to the list.

The detection lag. The time from when the failure started to when the team knew. Long detection lag is a signal that the metrics or the alerts need work, regardless of what caused the failure.

The mitigation lag. The time from detection to a contained state. Long mitigation lag is a signal that the runbook needs work.

The blast radius. Which users were affected, what they saw, whether they got a clean error or an incorrect output, whether they retried, whether they churned. Agent failures often produce silent damage that the metrics do not capture, and the postmortem has to surface that explicitly.

The eval delta. What the eval looked like before and after the incident. Did the eval catch the failure, did it miss it, did the eval need to be updated. The eval is part of the system. When it fails, that is part of the postmortem.

The followups. Specific, dated, owned. Drills to add. Alerts to tighten. Runbooks to update. Validators to harden. The followups are the only output of the postmortem that changes the system. The narrative is for sharing context. The followups are for fixing things.

The discipline that makes this work is treating the postmortem as the input to the next round of reliability work, not as a closing artifact for the incident. Every incident produces material for the next sprint of reliability improvements. The agents that get more reliable over time are the ones whose teams have a steady drip of these improvements landing.

What Does Not Carry Over From Traditional SRE

A few patterns from the SRE playbook do not work for agents and should be skipped.

Five-nines targets. The math that makes 99.999 percent reliability achievable in traditional services does not work when the underlying model has a non-zero error rate that you do not control. Aim for the highest reliability that the business actually needs and do not chase numbers that the underlying components cannot deliver.

Pure synthetic monitoring. A synthetic prompt run every minute will tell you the agent is alive. It will not tell you the agent is doing useful work on the actual traffic mix you serve. Sample real traffic for the eval signal. Use synthetic monitoring for the service layer only.

Strict deployment gates on latency alone. A change that improves latency by 10 percent and drops task success by 5 percent is a regression, not a win. The deployment gates have to include task success, not just latency.
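A sketch of a gate that encodes exactly that rule; the thresholds are illustrative.

```python
def deploy_gate(baseline: dict, candidate: dict) -> bool:
    """A candidate that wins on latency but loses on task success is a regression."""
    latency_ok = candidate["p95_latency_ms"] <= baseline["p95_latency_ms"] * 1.10
    task_ok = candidate["task_success"] >= baseline["task_success"] - 0.01
    return latency_ok and task_ok

# A 10% latency win with a 5% task-success drop does not pass the gate.
print(deploy_gate({"p95_latency_ms": 2000, "task_success": 0.96},
                  {"p95_latency_ms": 1800, "task_success": 0.91}))   # False
```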

Identical staging environments. A staging environment with a different model, a smaller dataset, or a synthetic traffic generator does not reproduce the failure modes of production. Either invest in staging that mirrors production or accept that some failures will only appear in production and build the rollback story for that case.

Treating the model as infrastructure. It is a dependency, but it is a dependency that changes behavior on its own schedule and that does not have a release notes page that captures all the relevant changes. Pin where you can, monitor where you cannot, and assume the dependency will surprise you on a regular basis.

The summary is that the framework looks similar but the parameters are different. The names of the artifacts (SLO, error budget, postmortem, runbook) carry over. The contents of those artifacts have to be rebuilt for the agent context.

What This Looks Like When It Works

A team running this discipline has a reliability dashboard with three numbers, not one. The service number is high and steady. The output validity number is high and twitches occasionally on schema changes. The task success number is the one with the most history and the most attention, and it is the one the leadership cares about.

The team has a runbook with named failure modes, each with a detection signal and a remediation. New failures get added to the runbook after each postmortem. The runbook is the living artifact, not the dashboard.

The team runs drills on a schedule. Provider outage, model regression, tool failure, cost spike. The drills find one or two issues each time. The drills do not stop. The first time a drill finds nothing in three rounds is the signal that the drills have stopped being aggressive enough.

The team has eval gates on every prompt change, every model change, every tool change. The gates are integrated with the deployment pipeline. A prompt change that fails the eval does not ship.

The team has cost dashboards that surface spikes by bucket and by user. Cost is treated as a reliability concern, because a runaway cost is an outage of the business model, even if the service is up.

The team writes postmortems that produce followups. The followups land in the sprint. The next set of incidents rarely repeats the patterns of the last set, because the patterns get fixed.

This is not glamorous work. It is the same kind of unsexy reliability discipline that has kept normal services up for decades, adapted for the new failure surface that agents introduce. The teams that take it seriously ship products that work for years. The teams that do not get to live the experience I had on that customer call, where the dashboard says one thing and the customer says another and the customer is the one who is right.

The dashboard I run now would have caught that quarter's regression on day three. The customer would not have spent three weeks on a broken feature. The renewal would still have been at risk for other reasons (the product is hard), but it would not have been at risk for that one. That is what reliability engineering for agents buys you. Not perfection. Just the chance to know what is actually happening in time to do something about it. The pattern is the floor, not the ceiling, and every agent product I ship now starts from it.
