Nex Tools

Posted on May 18

Claude Code for Error Budgets: How I Stopped Arguing About Reliability and Started Measuring It

#claudecode #sre #devops #reliability

For the first three years of running production systems I had the same fight with the same people about the same thing. The product team wanted to ship faster. The on-call team wanted to ship slower. Both sides had data. Neither side could prove the other was wrong. The arguments would end in a compromise that nobody felt good about, and the next incident would restart the cycle.

The fix was not better arguments. The fix was an error budget. Once we had a budget, the question stopped being "should we ship this risky change" and started being "do we have budget left to spend." That is a much smaller question with a much clearer answer, and it changes the entire conversation between the people who build features and the people who keep them running.

Setting up an error budget program sounded simple in the SRE book. In practice it took me eighteen months and three failed attempts before I had something that actually worked. The thing that finally made it work was using Claude Code to handle the parts of the program that humans were too inconsistent to handle on their own. Here is the workflow I built and what it taught me about reliability as an engineering discipline rather than a debate topic.

Why Error Budgets Are Hard to Run in Practice

The theory of error budgets is straightforward. You pick a service level objective, usually expressed as a percentage of successful requests over some window. The difference between 100% success and your objective is your budget. When you have budget left, you can ship risky things. When you have burned through your budget, you slow down and focus on reliability work until the budget recovers.
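To make the arithmetic concrete, here is a minimal sketch of a request-based budget. The function names and the example numbers (a 99.9% objective over one million requests) are illustrative assumptions, not values from any real service.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Failed requests the objective tolerates over the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the budget still unspent; negative once overspent."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("no traffic in the window, so there is no budget to spend")
    return 1.0 - failed_requests / budget


# A 99.9% objective over 1,000,000 requests tolerates about 1,000 failures.
# After 250 failures, roughly three quarters of the budget is left.
print(round(error_budget(0.999, 1_000_000)))              # 1000
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```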

The theory is clean. The practice is messy in ways the theory does not warn you about.

The first mess is measurement. Picking the right success criteria turns out to be much harder than it sounds. A request that returned 200 but took 12 seconds is not really a success. A request that returned 500 because the user sent malformed input is not really a failure. A request that succeeded for the user but failed in a way that corrupted internal state is the worst possible outcome and looks like a success in your metrics. Every team that runs an error budget hits these edge cases, and most teams either ignore them or argue about them forever.

The second mess is enforcement. Error budgets only work if the organization actually changes behavior when the budget is exhausted. In practice, the moment the budget runs out is also the moment somebody important wants to ship something important, and the budget gets waived. After this happens three or four times, the budget becomes a number on a dashboard that everyone ignores. The credibility cost of ignoring the budget once is much higher than people realize.

The third mess is cadence. A budget that resets monthly behaves very differently from a budget that resets quarterly, and both behave very differently from a sliding window. Each cadence has different failure modes. The wrong cadence for your traffic patterns can make the budget either too punishing or too lax, and neither extreme produces the cultural changes you wanted.

An error budget is not just a number. It is a contract between the people who build features and the people who run them, and like any contract, the value comes from how rigorously it is enforced rather than how cleverly it is written.

The teams I have seen succeed with error budgets are not the teams with the most sophisticated objectives. They are the teams that wired the budget into their actual release process so that the contract enforced itself. The teams that failed are the ones that left enforcement up to human judgment, because human judgment under pressure always finds a reason to ship.

The Objective Skill

The first skill in my error budget workflow handles objective definition. Given a service and its traffic patterns, the skill proposes a service level objective and the supporting indicators that feed into it.

The skill does not just pick a percentage. It looks at historical traffic, current failure rates, and customer impact patterns to recommend an objective that is achievable but meaningful. An objective of 99.99% sounds impressive but is usually meaningless for a service whose current state is 99.5%. An objective of 99% is useless for a service that already runs at 99.95%. The right objective is one that requires real work to maintain but does not require fantasy.

The skill also defines the success criteria precisely. It does not just say "successful requests." It specifies what success means for this particular service, including the edge cases. A successful request might require a 2xx status code, a response time under a specific threshold, and the absence of any internal error logs correlated with that request ID. The precision is important because vagueness is where the arguments start.
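A success definition like that can be written down as a small predicate. This is a sketch only: the `Request` fields, the `is_success` name, and the 1,000 ms default threshold are hypothetical, and a real service would pull them from its own objective document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    status: int
    latency_ms: float
    # Count of internal error log lines correlated with request_id.
    internal_errors: int


def is_success(req: Request, latency_threshold_ms: float = 1000.0) -> bool:
    """A request counts as good only if all three criteria hold."""
    return (
        200 <= req.status < 300
        and req.latency_ms < latency_threshold_ms
        and req.internal_errors == 0
    )


# A 200 that took 12 seconds is not a success under this definition.
slow = Request("req-1", status=200, latency_ms=12_000, internal_errors=0)
print(is_success(slow))  # False
```

Writing the definition as code has a side benefit: there is exactly one place to change when the team renegotiates what "success" means.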

The output is a structured objective document that can be reviewed, debated, and eventually signed off by both the engineering team and the product team. The document is the foundation of the budget program. Without precise definitions, every subsequent decision becomes a fight about what the words mean.

The Burn Rate Skill

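The second skill tracks burn rate in real time. The burn rate is the speed at which the budget is being consumed relative to the pace that would spend it exactly by the end of the window. A burn rate of 1 means the budget runs out precisely when the window closes.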

A budget that burns at exactly the expected rate is not interesting. A budget that burns at three times the expected rate is a warning. A budget that burns at ten times the expected rate is an active fire. The skill watches the burn rate continuously and surfaces deviations from expected behavior.
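As a rough sketch, burn rate can be computed as the observed error rate divided by the error rate the objective allows. The 3x and 10x cut-offs mirror the levels named above; the function names and labels are assumptions made for illustration.

```python
def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / (1.0 - slo)


def classify(rate: float) -> str:
    """Map a burn rate onto the three levels described in the text."""
    if rate >= 10.0:
        return "fire"
    if rate >= 3.0:
        return "warning"
    return "ok"


# With a 99.9% objective, 0.5% of requests failing is about a 5x burn.
print(classify(burn_rate(failed=50, total=10_000, slo=0.999)))  # warning
```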

The interesting design choice is the smoothing. A naive burn rate calculation produces wild swings every time a single bad request comes in. A heavily smoothed calculation hides real problems for hours. The skill uses multiple time horizons in parallel. A one-hour view catches active fires. A six-hour view catches sustained degradations. A 24-hour view catches gradual drift that nobody would notice in shorter windows.
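One way to picture the parallel horizons is to compute the same burn rate over the last 60, 360, and 1,440 one-minute buckets. The sketch below assumes per-minute `(failed, total)` counts, oldest first; the bucket layout and names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# Horizon name -> number of one-minute buckets it covers.
WINDOWS_MINUTES = {"1h": 60, "6h": 360, "24h": 1440}


def multi_window_burn(
    buckets: Sequence[Tuple[int, int]], slo: float
) -> Dict[str, float]:
    """Burn rate per horizon from per-minute (failed, total) counts."""
    allowed = 1.0 - slo
    rates = {}
    for name, minutes in WINDOWS_MINUTES.items():
        recent = buckets[-minutes:]  # uses whatever history exists
        failed = sum(f for f, _ in recent)
        total = sum(t for _, t in recent)
        rates[name] = (failed / total) / allowed if total else 0.0
    return rates


# 23 quiet hours followed by one hour with 1% of requests failing.
history = [(0, 100)] * 1380 + [(1, 100)] * 60
for horizon, rate in multi_window_burn(history, slo=0.999).items():
    print(horizon, round(rate, 2))
```

In this example the one-hour view shows a burn of about 10x while the 24-hour view stays under 1x, which is the gap between an active fire and slow drift that the parallel horizons are meant to expose.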

When the burn rate crosses a threshold, the skill alerts. The alert is not just a number. It includes the context needed to act: which endpoint is contributing most to the burn, which deploy correlated with the change in burn rate, and whether the burn is concentrated in a single tenant or distributed across the user base. The context turns the alert from a notification into a starting point for investigation.

The investigation often connects directly to a log analysis pass. If you have set up the workflow I described in Claude Code for Log Analysis, the burn rate alert can hand off straight into pattern detection, which compresses the time from "budget is burning" to "we know why" by a meaningful margin.

The Policy Skill

The third skill enforces the budget policy in the release pipeline. The skill sits between the deploy command and the actual deploy and checks whether the current budget state permits the release.

The policy is configurable per service. A typical configuration might say that any deploy is permitted while the budget is above 50%, that only low-risk deploys are permitted between 25% and 50%, and that no non-critical deploys are permitted below 25%. The thresholds and the risk classifications are defined in the objective document so that the policy is mechanical rather than negotiable.
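A gate with those thresholds might look like the sketch below. The 50% and 25% values come from the example configuration above; the `Risk` enum, the function name, the boundary handling, and the choice to let critical deploys through at every budget level are assumptions.

```python
from enum import Enum
from typing import Tuple


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


def deploy_permitted(
    budget_remaining: float,
    risk: Risk,
    critical: bool = False,
    open_above: float = 0.50,
    freeze_below: float = 0.25,
) -> Tuple[bool, str]:
    """Return (permitted, reason) for a deploy given the budget state."""
    if critical:
        # Assumption: critical deploys are never blocked by the budget.
        return True, "critical deploy"
    if budget_remaining > open_above:
        return True, "budget above open threshold"
    if budget_remaining >= freeze_below:
        if risk is Risk.LOW:
            return True, "low-risk deploy in restricted band"
        return False, "only low-risk deploys permitted in restricted band"
    return False, "budget below freeze threshold"


print(deploy_permitted(0.40, Risk.HIGH))  # blocked in the 25-50% band
print(deploy_permitted(0.40, Risk.LOW))   # permitted
```

Returning the reason alongside the decision keeps the gate explainable, so a blocked deploy comes with the rule that blocked it.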

The mechanical enforcement is the entire point. When the budget gets low and the policy blocks a deploy, the response is not a debate about whether to override the policy. The response is a question about whether to spend the remaining budget on this specific change. If the answer is yes, the change ships and the budget burns further. If the answer is no, the change waits. Either way, the budget stays meaningful.

The skill also produces an audit trail. Every deploy that was permitted under the policy is logged with the budget state at the time. Every deploy that was blocked is logged with the reason. The audit trail makes it possible to look back at a quarter and see exactly how the budget was spent and whether the spending decisions were the right ones in retrospect.
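For illustration, an audit entry can be as simple as one JSON line per deploy decision. The record fields and the `audit_line` helper below are hypothetical; the only requirement taken from the text is that each entry captures the decision, the budget state at the time, and the reason.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class DeployAuditRecord:
    service: str
    deploy_id: str
    permitted: bool
    budget_remaining: float  # fraction of budget left at decision time
    reason: str
    timestamp: str           # ISO 8601, UTC


def audit_line(
    service: str,
    deploy_id: str,
    permitted: bool,
    budget_remaining: float,
    reason: str,
    now: Optional[datetime] = None,
) -> str:
    """Serialize one deploy decision as a single JSON line."""
    moment = now or datetime.now(timezone.utc)
    record = DeployAuditRecord(
        service, deploy_id, permitted, budget_remaining, reason,
        moment.isoformat(),
    )
    return json.dumps(asdict(record), sort_keys=True)


print(audit_line("checkout", "deploy-123", False, 0.18,
                 "budget below freeze threshold"))
```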

The Postmortem Skill

The fourth skill connects error budget consumption to postmortem actions. After every significant budget burn, the skill produces a draft postmortem that documents what happened, how much budget was consumed, and what changes would prevent the burn from recurring.

The draft is not a finished postmortem. It is a structured starting point. The skill fills in the data sections automatically, including the timeline, the affected metrics, and the related deploys. The human writes the analysis sections, which are the parts that actually require judgment. The split between mechanical sections and judgment sections cuts the time to produce a postmortem roughly in half without reducing its quality.

The postmortem also includes a budget impact summary. The summary expresses the incident in terms of how much of the quarterly budget it consumed and how that affects the remaining release capacity for the quarter. The budget framing makes the postmortem read differently. Instead of saying "this incident lasted 47 minutes," the postmortem says "this incident consumed 18% of the quarterly budget." The second framing leads to different priorities about prevention.
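The conversion from minutes to budget is simple arithmetic. The sketch below uses a time-based budget and assumes a 99.8% objective over a 90-day quarter with every request failing during the incident. Those inputs are illustrative assumptions, chosen only because they land near the 18% figure; the actual objective behind that number is not stated here.

```python
def budget_consumed_fraction(
    outage_minutes: float,
    slo: float,
    window_days: int = 90,
    failure_fraction: float = 1.0,
) -> float:
    """Share of a time-based error budget consumed by one incident.

    failure_fraction is the portion of requests failing during the
    incident (1.0 means a full outage).
    """
    window_minutes = window_days * 24 * 60
    budget_minutes = (1.0 - slo) * window_minutes
    return outage_minutes * failure_fraction / budget_minutes


# 47 minutes of full outage against a 99.8% quarterly objective.
share = budget_consumed_fraction(47, slo=0.998)
print(f"{share:.0%} of the quarterly budget")  # 18% of the quarterly budget
```

The same 47 minutes against a 99.9% objective would consume about 36% of the quarter, which is why the budget framing changes priorities in a way the raw duration does not.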

For incident response context that pairs naturally with this postmortem workflow, the system I described in Claude Code for Incident Response handles the live response side and feeds directly into the postmortem skill once the incident is closed.

How the Workflow Runs in Practice

The workflow runs continuously rather than on demand. The burn rate skill is always watching. The policy skill is always sitting in front of the deploy pipeline. The postmortem skill triggers automatically when a burn crosses a threshold.

When the burn rate alerts, my first move is to check the context the alert provides: the endpoint, the correlated deploy, and the affected user segment. Most of the time the context points directly at the cause. If the alert correlates with a recent deploy, the deploy is probably the cause and the response is a rollback. If the alert correlates with a tenant spike, the cause is probably load and the response is a scaling decision.

When the policy skill blocks a deploy, the response is a conversation rather than a fight. The conversation is about whether the change is important enough to spend remaining budget on, given that the budget cannot be replenished mid-quarter. Sometimes the answer is yes and the team takes ownership of the increased risk. Sometimes the answer is no and the change moves to the next quarter. Either answer is fine because both are deliberate.

When the postmortem skill produces a draft, my first move is to fill in the human analysis sections. The data is already there. The narrative, the root cause, the action items are the parts that need judgment. The structured starting point makes it easier to focus on the parts that matter rather than getting bogged down in data assembly.

The objective skill runs once per service when the service is onboarded and then again at quarterly reviews. The review cycle keeps the objectives calibrated to actual traffic and actual customer expectations. A service whose traffic has grown 10x in a year usually needs a tighter objective than the one it was launched with.

What This Workflow Did to My Practice

The most visible change is that the arguments stopped. The product team and the engineering team no longer fight about whether to ship a particular change. They look at the budget, they look at the change, and they make a decision. The decision is not always the one I would have preferred, but the decision-making process is much faster and much less politically expensive than the old arguments were.

The second change is that incidents feel different. An incident that would previously have produced a vague sense of "things were bad for a while" now produces a precise statement about how much budget was consumed and what that means for the rest of the quarter. The precision makes prevention work easier to justify, because the cost of the incident is no longer abstract.

The third change is in how the team thinks about reliability investments. Before the budget program, reliability work was something to argue for during planning. After the budget program, reliability work happens whenever the budget gets tight, because the alternative is freezing deploys. The forcing function is mechanical rather than political, which means the work actually happens.

The fourth change, which I did not expect, is in how features get scoped. The product team has started asking about expected error budget impact early in the design process. A feature that requires a risky new dependency now gets weighed against the budget cost of integrating it, not just the engineering cost of building it. The conversation about scope is informed by reliability data instead of opinions.

For the broader set of workflows that connect to this one, my Claude Code Practical Workflows series on DEV.to covers everything from observability through incident response, refactoring, migrations, and security. The error budget workflow ties many of them together because the budget is the unifying measurement that tells you whether the rest of the practice is working.

FAQ

What if the team does not want to commit to a service level objective?

That resistance is usually about fear of being held to an unrealistic number rather than disagreement with the concept. The objective skill helps because it grounds the objective in actual traffic and current failure rates, which makes the number defensible. Once the team sees that the proposed objective is achievable, the resistance usually fades.

How do I handle services with very low traffic?

Low-traffic services have noisy budget calculations because a single bad request consumes a much larger percentage of the budget. The skill handles this by using longer windows for low-traffic services and by combining related services into a single budget where appropriate. A budget calculated over 1,000 requests per quarter is meaningful in a way that a budget calculated over 50 is not.
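The noise is easy to see with a little arithmetic. The sketch below assumes a 99% objective purely for illustration and shows how much of the budget a single failed request consumes at different traffic volumes.

```python
def single_failure_share(slo: float, total_requests: int) -> float:
    """Fraction of the error budget consumed by one failed request."""
    budget_requests = (1.0 - slo) * total_requests
    if budget_requests <= 0:
        raise ValueError("budget is zero; check slo and total_requests")
    return 1.0 / budget_requests


# At a 99% objective, 50 requests allow half a failure, so a single bad
# request overspends the budget twice over. At 1,000 requests the same
# failure costs about a tenth of the budget.
for total in (50, 1_000):
    print(total, f"{single_failure_share(0.99, total):.0%}")
```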

What happens when the budget is exhausted halfway through the quarter?

The policy skill blocks non-critical deploys for the rest of the quarter. The team uses the time to do the reliability work that the burn rate revealed. This is the intended behavior. If exhausting the budget produces no behavioral change, the budget program is not working, and it is the program that needs to be reconsidered, not the budget.

Can I run multiple budgets per service?

Yes. A single service might have a budget for availability and a separate budget for latency. The skills handle multiple budgets per service and produce aggregated views for cases where the budgets need to be reasoned about together. Most teams start with a single availability budget and add additional budgets only when the first one is operating well.

How do I get product team buy-in?

The biggest unlock is reframing the budget as a permission slip rather than a restriction. The budget is what gives the product team the right to ship risky changes. Without the budget, every risky change has to be argued individually. With the budget, the team can ship anything that fits within the available capacity. Most product teams respond well to this framing once they see that the budget enables faster shipping when reliability is healthy.

The error budget workflow is the piece of my SRE practice that I would recommend to any team that has ongoing tension between product and engineering about reliability. The tension is real and it does not resolve itself through arguments. It resolves through a contract that enforces itself mechanically, and the workflow I described is how I made that contract operational. The investment is significant. The payoff is that reliability stops being a source of conflict and becomes a source of shared planning, which is the version of the relationship that healthy teams have.