The Lie of the Global Average: Why Taming Complex SLIs Requires Bucketing

Observability

As an engineering manager responsible for keeping serverless, event-driven systems alive, I’ve learned to fear green dashboards.

There is no worse feeling than seeing your main status page happily report “99.9% up” while Slack fills with screenshots from customers who can’t check out, can’t pay, or can’t log in, and a senior leader is asking, “Why does everything look fine if the business is clearly on fire?”

That’s the core problem:

  • Global availability is a vanity metric.
  • It smooths out the spikes.
  • It tells you the system is fine, but it doesn’t tell you if the business is fine.

In site reliability engineering (SRE), we like tidy global SLIs because they compress a lot of behaviour into one number. But that compression is exactly how pain gets hidden.

If you have 1,000,000 requests and 1,000 fail, the math says you’re 99.9% successful. Everyone relaxes.

But if those 1,000 failures are all POST /submit-payment or POST /confirm-order, you’re not reliable. You’re losing money and trust.

Reliability is not a percentage. It is a relationship with your users. And global averages destroy that relationship.

The Antidote: Bucketing (Adding Dimensionality)

The only honest way to deal with this is to stop worshipping global averages and start segmenting your SLIs along the dimensions that actually threaten your business.

Call it bucketing, slicing, or adding dimensionality; the idea is simple:

  • Stop asking, “Is the site OK?”
  • Start asking, “Who is it broken for, and where?”

A global SLI is like a city-wide traffic report:

“Traffic is moving at 40 mph on average.”

That’s a nice number, but it’s useless if I’m sitting in standstill traffic on the only bridge into the city.

A bucketed SLI is more honest:

“The highway is clear, but the bridge to downtown is blocked.”

Same city. Same “average speed.” Completely different lived reality.

In real systems, three buckets almost always matter.

Bucket A: Reads vs Writes: The Hidden Fire

Most systems are heavily skewed toward reads. In a typical API:

  • 90–95% of traffic is GET: browsing, listing, fetching.
  • 5–10% is POST/PUT: creating orders, payments, sign-ups, profile changes.

The trap

Imagine this:

  • 95% of your traffic is product browsing (GET /products, GET /search).
  • 5% is checkout and payment (POST /checkout, POST /payment).

Now a bad deploy or downstream issue makes every payment call fail.

From the user’s perspective:

  • Browsing looks fine.
  • Every attempt to give you money fails.

From the global SLI’s perspective:

  • 5% of all requests are failing.
  • You’re still at 95% “availability.”
  • Depending on your alert thresholds, you might not even get paged.

This is how you end up in the classic “everything is green” screenshot while Support is drowning and Finance is asking why conversion just fell off a cliff.
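
To make the trap concrete, here is a tiny sketch with made-up request counts in the same 95/5 proportions: the global number looks comfortable while the write bucket is fully on fire.

```python
# Hypothetical traffic: browsing is fine, every payment call fails.
buckets = {
    "read":  {"total": 950_000, "errors": 0},
    "write": {"total": 50_000,  "errors": 50_000},
}

total = sum(b["total"] for b in buckets.values())
errors = sum(b["errors"] for b in buckets.values())
print(f"global availability: {100 * (1 - errors / total):.1f}%")  # 95.0%

for name, b in buckets.items():
    print(f"{name} availability: {100 * (1 - b['errors'] / b['total']):.1f}%")
# read availability: 100.0%
# write availability: 0.0%
```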

The fix

You split the SLI:

  • availability_read: success rate for read-only requests.
  • availability_write: success rate for state-changing requests.

Suddenly a total outage on the write path shows up as 0% availability for writes, not “a small blip.” You can:

  • Alert specifically on write failures.
  • Tie that bucket to more conservative error budgets.
  • Treat it as a higher-severity incident even if the homepage still loads.
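
Here is a minimal instrumentation sketch, assuming the Python prometheus_client library; the metric name, label names, and the read/write classification rule are illustrative, not a prescription.

```python
from prometheus_client import Counter

# One counter, labelled by request class and outcome, so read and write
# availability can be computed separately over any SLO window.
REQUESTS = Counter(
    "http_requests_total",
    "Requests by read/write class and outcome",
    ["request_class", "outcome"],
)

READ_METHODS = {"GET", "HEAD", "OPTIONS"}

def record_request(method: str, status_code: int) -> None:
    request_class = "read" if method.upper() in READ_METHODS else "write"
    outcome = "success" if status_code < 500 else "error"
    REQUESTS.labels(request_class=request_class, outcome=outcome).inc()

# availability_write = success(write) / total(write) over the SLO window,
# independent of how much read traffic is flowing around it.
```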

The business impact

Reads going down is annoying. Writes going down is existential.

  • If people can’t browse, they might come back later.
  • If they can browse but can’t pay, they leave angry and they tell others.

Buckets make that difference painfully visible.

Bucket B: Mobile vs Web: The Client Reality

Web clients and mobile clients live in different universes:

  • Web tends to run on stable connections, with up-to-date JS, easy rollbacks.
  • Mobile runs on flaky 4G, old app versions, and aggressive retry logic that can turn a subtle bug into a DDoS-shaped traffic pattern.

The trap

You ship a change that:

  • Works fine for web checkout.
  • Breaks a specific flow on the iOS app hitting POST /checkout with an older payload shape.

Global metrics barely blink:

  • Web traffic dominates volume.
  • Retries hide some of the errors.
  • The average success rate looks “acceptably noisy.”

Meanwhile:

  • Your App Store rating is sliding.
  • Support is logging “mobile checkout broken” tickets.
  • Product is asking, “Why didn’t we catch this before it hit customers?”

The fix

You bucket by client type:

  • client_type = web
  • client_type = ios
  • client_type = android

You don’t need per-device madness. You need just enough segmentation to see when one channel is quietly dying while the others hide it.

Once you do this, you can ask:

  • “What is write availability for iOS for the checkout journey?”
  • “How does Android latency for search compare to web?”
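
A sketch of how that classification might look in code. The X-Client-Platform header is hypothetical; the point is that the bucket is derived once, server-side, and clamped to a tiny set of allowed values.

```python
ALLOWED_CLIENT_TYPES = {"web", "ios", "android"}

def client_type(headers: dict) -> str:
    """Map a request to one of a small, fixed set of client buckets."""
    # Prefer an explicit client header if your apps send one (hypothetical name).
    platform = headers.get("X-Client-Platform", "").lower()
    if platform in ALLOWED_CLIENT_TYPES:
        return platform
    # Fall back to a coarse User-Agent check; default to web.
    user_agent = headers.get("User-Agent", "").lower()
    if "iphone" in user_agent or "ipad" in user_agent:
        return "ios"
    if "android" in user_agent:
        return "android"
    return "web"

# client_type(...) becomes one more label on the same SLI counters, which is
# exactly what lets you ask "write availability for iOS on /checkout".
```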

The business impact

Now when you get paged, it’s not:

“High error rate on /checkout.”

It’s:

“Write availability for iOS clients on /checkout has dropped below SLO.”

That alert:

  • Tells on-call who is broken.
  • Points directly to where to start looking.
  • Stops the “blame the network” dance and focuses everyone on the right API, payload, or versioning issue.

Bucket C: Premium vs Standard: The Revenue Bucket

This is where engineering stops talking about “traffic” and starts talking about revenue.

Not all users are equal:

  • A single enterprise customer might be worth more than 10,000 free-tier users.
  • A VIP credit-card holder being unable to transact has a different blast radius than a trial user who can’t update a profile picture.

The trap

Without buckets, a retry storm from:

  • 50,000 free users hitting a low-value feature

can drown out:

  • 50 failures from your top 10 enterprise customers hitting your highest-margin features.

On a global SLI chart, it all collapses into one line. If the total error rate is “within budget,” you might be technically winning while strategically losing.

The fix

You tag traffic with user tier:

  • user_tier = premium / enterprise
  • user_tier = standard / free

Then you define separate SLOs:

  • Premium checkout success: 99.99%
  • Standard/free checkout success: 99.5%

Same system. Different promises.

And more importantly: different reactions when the budget burns. If premium write availability wobbles, you slow down changes or roll back quickly. If free tier browsing is a bit flaky but within tolerance, you don’t knee-jerk into a full change freeze.
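
A sketch of what “different promises, different reactions” can look like in code, using the targets above; the error-budget arithmetic is deliberately simplified.

```python
# Tier-specific SLO targets for the same journey.
SLO_TARGETS = {
    ("checkout", "premium"):  0.9999,  # 99.99%
    ("checkout", "standard"): 0.995,   # 99.5%
}

def budget_burned(journey: str, tier: str, success: int, total: int) -> float:
    """Fraction of the error budget consumed in the current window."""
    target = SLO_TARGETS[(journey, tier)]
    allowed_errors = (1 - target) * total
    actual_errors = total - success
    return actual_errors / allowed_errors if allowed_errors else float("inf")

# The same 20 failed checkouts out of 100,000 land very differently per tier:
print(budget_burned("checkout", "premium",  success=99_980, total=100_000))  # 2.0  -> page, roll back
print(budget_burned("checkout", "standard", success=99_980, total=100_000))  # 0.04 -> within tolerance
```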

The business impact

This is how you align engineering anxiety with where the company actually makes money.

It also changes roadmap debates. Once you show Product and Sales a graph labeled “Enterprise checkout success”, nobody argues that a free-tier bug and an enterprise bug are equivalent priorities anymore.

The Senior Caveat: Bucketing, SLOs, and the Cost of Cardinality

At this point, every experienced engineer is thinking:

“If I bucket by method, client, region, and tier… won’t this explode my Prometheus / Datadog bill?”

Yes. It can.

Bucketing adds cardinality. If you:

  • Put user_id or request_id into labels, or
  • Try to slice by every endpoint and every dimension

you will:

  • Melt your metrics backend, or
  • Hand Finance a monitoring bill that looks like a production incident.

The art of bucketing is restraint:

  • You don’t need a bucket for every variable.
  • You need a bucket for every distinct failure domain you care about.

In practice, that means:

  • Be ruthless about which labels are allowed on “SLO grade” metrics.
  • Keep high cardinality detail (like raw logs) in cheaper systems, and only promote a small set of aggregated counters/gauges as SLIs.
  • Review your metric schema regularly with both SREs and Finance in the room.

If bucketing costs you nothing, you’re probably slicing too coarsely to catch real failures; if it costs a fortune, you’ve sliced too fine. A senior team treats cardinality as part of the reliability design, not an afterthought.
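
One cheap way to enforce that restraint is a hard allow-list of labels on SLO-grade metrics. A minimal sketch, with illustrative label names:

```python
# Labels permitted on SLO-grade metrics. Anything else (user_id, request_id,
# raw URL paths) belongs in logs or traces, not in the metrics backend.
SLO_GRADE_LABELS = {"request_class", "client_type", "user_tier", "journey"}

def validate_slo_labels(labels: dict) -> dict:
    unexpected = set(labels) - SLO_GRADE_LABELS
    if unexpected:
        raise ValueError(
            f"labels {sorted(unexpected)} are not allowed on SLO-grade metrics; "
            "keep high-cardinality detail in cheaper systems"
        )
    return labels

validate_slo_labels({"request_class": "write", "client_type": "ios", "user_tier": "premium"})  # ok
# validate_slo_labels({"user_id": "12345"})  # raises ValueError
```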

Conclusion: Silence Is Golden

When you bucket SLIs along the fault lines that actually matter (reads vs writes, mobile vs web, premium vs standard), something interesting happens:

  • The noise stops.
  • Vague “high error rate” pages that send you spelunking through logs get replaced with alerts like: “Write availability for iOS premium users in checkout has dropped below SLO.”

That’s an honest alert. It tells you:

  • Who is affected.
  • Where to look.
  • How worried the business should be.

It’s also an alert your engineers will respect. When every page comes with a clear, specific blast radius, the pager stops feeling like a random punishment generator and starts feeling like what it should have been all along: a surgical tool.

As a manager, that’s the difference between a team that dreads the pager and one that trusts it. Honest alerts are a retention tool as much as a reliability tool.

Most of these opinions come from incidents I’d rather not repeat, not from slides. Revisiting Google’s “SRE: Measuring and Managing Reliability” course recently mainly gave me sharper language for what the on-call rotation already knew.

Reliability isn’t about the nines you show the board; it’s about the promises you keep to your users and about whether your teams can keep those promises without burning out.

So the next time your dashboard says 99.9% green, don’t congratulate yourself.

Ask a harder question:

“Who is stuck on the bridge?”

Because if you don’t know, your global average is lying to you.
