Gbenga Kusade

SRE in Action: Understanding How Real Teams Use SLOs, SLIs, and Error Budgets to Stay Reliable Through Case Studies - Part 1

When people talk about Site Reliability Engineering (SRE), they often share abstract principles about SLIs, SLOs, and error budgets. But here's the problem: understanding the concepts isn't the same as knowing how to apply them.

The truth is, reliability challenges look radically different depending on where you sit. This article presents two SRE implementations from completely different perspectives, as a step-by-step walkthrough for beginners:

  • For startups (CompanyA): it's about moving fast without breaking everything as you scale.
  • For enterprises (CompanyB): it's about coordinating dozens of teams who can't agree on what "reliable" even means.

Both need SRE principles. But the implementations couldn't be more different.

Let's dive in.


CASE 1: How a FinTech Startup Moved from Firefighting to Measurable Reliability


What You Will Learn

By the end of this case study, you will understand:

  • How to identify what metrics actually matter to users (SLIs)
  • How to set realistic reliability targets (SLOs)
  • What an error budget is and why it is your secret weapon
  • How to balance shipping features with maintaining reliability

No complex theory, just a startup's journey from chaos to confidence.


Meet CompanyA: A Growing Startup with Growing Pains

CompanyA is an x-year-old fintech startup providing digital wallets and payment APIs to small businesses across Africa. They recently crossed 1 million users. Exciting news! But with growth came pain.

CompanyA Tech Stack

  • Frontend: React web app
  • Backend: Containerized services on AWS ECS
  • Database: PostgreSQL on RDS
  • Infrastructure: API Gateway + CloudFront CDN

The Problem

During high-traffic periods (Black Friday, salary week, etc.), things started breaking:

  • Payment success rates dropped to 96%
  • Users complained: "Transfers hang for minutes!"
  • Engineers were burning out from constant alerts
  • Every issue felt equally urgent
  • No one could agree on what "reliable" meant

If this sounds familiar, let us see how it was fixed. We will walk through the process step by step.


Step 1: Understanding SLIs (Service Level Indicators)

What is an SLI?

Think of SLIs as the vital signs of your service. Just like a doctor checks your heart rate and blood pressure, SLIs tell you what your users are actually experiencing.

CompanyA's Journey: Finding What Matters

The team started by asking: What does success look like from a user's perspective?

They mapped out the critical user journey:

[Image: the critical user journey]

For each step, they asked: What metric shows if this step is working well?

The SLIs They Chose

[Image: the SLIs CompanyA chose]

Why 95th Percentile?

Instead of looking at average response time (which hides problems), the 95th percentile shows: 95% of users experience this speed or better.

For example:

  • Average latency: 1.5s (looks good!)
  • 95th percentile: 5s (problem! 5% of users wait too long)

The 95th percentile catches issues that averages hide, as the sketch below illustrates.
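
To see the difference concretely, here is a minimal Python sketch (the latency numbers are made up for illustration, not CompanyA's data) comparing the mean with the 95th percentile of a request-latency sample:

# Compare the average with the 95th percentile on a toy latency sample.
# The numbers are illustrative only.
import statistics

# 95 fast requests (~1s) and 5 slow ones (~10s)
latencies_s = [1.0] * 95 + [10.0] * 5

average = statistics.mean(latencies_s)
p95 = statistics.quantiles(latencies_s, n=100)[94]  # 95th percentile

print(f"Average latency: {average:.2f}s")   # 1.45s, looks fine
print(f"95th percentile: {p95:.2f}s")       # ~9.5s, the slow tail is visible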

Step 2: Setting SLOs (Service Level Objectives)

What's an SLO?

An SLO is your reliability target: a specific, measurable goal you commit to internally. It answers: How reliable should this service be?

CompanyA's Approach: Data-Driven Targets

They did not guess. They analyzed 3 months of real data to see:

  • What reliability they were currently achieving
  • Where users dropped off
  • What was realistically achievable

Here is what they decided:

[Image: the SLOs CompanyA decided on]

Critical Decision: Why Not 100%?

CompanyA learned that chasing 100% is a trap:

  • It is impossibly expensive
  • It slows innovation to a crawl
  • Real-world systems have dependencies that fail

Instead, they accepted 0.1% failure (about 43 minutes of downtime per month). This is not giving up; it is being realistic.

The Golden Question

When setting SLOs, ask yourself:

  • What is the minimum reliability that keeps users happy and the business healthy?

Set the bar too low -> users leave. Set the bar too high -> you never ship features.

Step 3: Understanding Error Budgets

What is an Error Budget?

This is the game-changer concept. Your error budget is the amount of unreliability you can afford before breaking your SLO.

Think of it like this:

  • SLO says: "Be available 99.9% of the time"
  • That means you can be down 0.1% of the time
  • That 0.1% is your error budget

CompanyA's Error Budget Calculation

For their 99.9% availability SLO over 30 days:

Total time in month = 30 days × 24 hours × 60 minutes = 43,200 minutes
Allowed downtime (0.1%) = 43.2 minutes

This is their error budget: 43.2 minutes of downtime per month
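
As a quick sanity check, here is a small Python helper (an illustrative sketch, not CompanyA's actual tooling) that turns any availability SLO into a monthly error budget in minutes:

# Convert an availability SLO into a monthly error budget in minutes.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60      # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo)    # the slice you are allowed to fail

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes for a 99.9% SLO
print(round(error_budget_minutes(0.995), 1))   # 216.0 minutes for a 99.5% SLO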

The Policy That Changed Everything

CompanyA created a simple rule:

If the Error budget is > 50% remaining -> Ship new features confidently

If the Error budget is 25-50% remaining  -> Review what's burning the budget, slow down risky changes

If the Error budget is < 25% remaining -> FREEZE new features, focus only on reliability

If the Error budget is exhausted (100% used) -> Complete feature freeze until the budget recovers the next month
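
Expressed as code, the policy is just a threshold check on the remaining budget. A minimal Python sketch that mirrors the four tiers above (nothing beyond them is implied):

# Map the remaining error budget (in percent) to a release decision,
# following the four-tier policy above.
def release_decision(budget_remaining_pct: float) -> str:
    if budget_remaining_pct <= 0:
        return "Complete feature freeze until the budget recovers next month"
    if budget_remaining_pct < 25:
        return "Freeze new features, focus only on reliability"
    if budget_remaining_pct <= 50:
        return "Review what's burning the budget, slow down risky changes"
    return "Ship new features confidently"

print(release_decision(80))   # Ship new features confidently
print(release_decision(40))   # Review what's burning the budget, slow down risky changes
print(release_decision(10))   # Freeze new features, focus only on reliability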

Why This Matters

[Image: why the error budget matters]

The error budget gave engineers objective authority to say "not now" when reliability was at risk.


Step 4: Building Visibility with Dashboards

Making It Real: The Dashboard

CompanyA built a simple Grafana dashboard that everyone (engineers, product managers, and executives alike) could understand:

[Image: the SLO dashboard]

The Architecture Behind It

[Image: the architecture behind the dashboard]

What Made This Dashboard Effective

  1. Single source of truth: No more "it works on my machine"
  2. Real-time visibility: See problems as they happen
  3. Clear status: Green/Yellow/Red, no ambiguity
  4. Actionable alerts: Only fires when SLOs are actually at risk
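
The exact queries depend on your metrics stack, but conceptually every panel boils down to "good events divided by total events" plus "how much budget is left". A hypothetical sketch of that math, assuming you can count successful and total requests for the current window:

# Hypothetical dashboard math: availability SLI and error budget remaining.
# The request counts are made-up inputs; in practice they come from your
# metrics store (e.g. Prometheus counters).
SLO = 0.999

successful_requests = 9_996_000
total_requests = 10_000_000

availability = successful_requests / total_requests      # the SLI
allowed_failures = total_requests * (1 - SLO)             # budget, in requests
actual_failures = total_requests - successful_requests
budget_remaining_pct = 100 * (1 - actual_failures / allowed_failures)

print(f"Availability SLI: {availability:.4%}")                   # 99.9600%
print(f"Error budget remaining: {budget_remaining_pct:.1f}%")    # 60.0%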

Step 5: Using Data to Drive Improvements

Problem 1: Slow API Response Times

What the data showed:

  • P95 latency: 3.2s (breaching the 2.5s SLO)
  • Happened during peak traffic
  • Correlated with database query spikes

Root cause investigation:

  • Database connection pool maxing out
  • Repeated queries for the same data
  • No caching layer

Fixes implemented:

  1. Increased connection pool size
  2. Added Redis caching for frequent queries
  3. Implemented query result pagination

Result: P95 latency dropped to 1.8s
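
For the caching fix, the usual pattern is a read-through cache in front of the hot queries. A minimal sketch using redis-py, where get_balance_from_db is a hypothetical stand-in for the expensive PostgreSQL query:

# Read-through cache sketch: serve hot reads from Redis, fall back to the DB.
# get_balance_from_db is a hypothetical placeholder for the real query.
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def get_balance_from_db(user_id: str) -> dict:
    # Placeholder for the expensive database query.
    return {"user_id": user_id, "balance": 0}

def get_balance(user_id: str, ttl_seconds: int = 30) -> dict:
    key = f"balance:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                        # cache hit: skip the DB
    result = get_balance_from_db(user_id)                # cache miss: query once...
    cache.setex(key, ttl_seconds, json.dumps(result))    # ...and store for next time
    return result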

Problem 2: Success Rate Drops During Promotions

What the data showed:

  • Success rate dropped to 96% during Black Friday
  • 40% of the error budget consumed in one week
  • Alerts everywhere

Root cause investigation:

  • System couldn't handle 5x normal traffic
  • No load testing before promotional launches
  • Services crashed under load

Fixes implemented:

  1. Added load testing to CI/CD pipeline
  2. Implemented auto-scaling based on request rate
  3. Added circuit breakers to prevent cascade failures

Result: Next promotion maintained 99.9% success rate
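
A circuit breaker can be as simple as counting consecutive failures and refusing calls for a cool-off period once a threshold is hit, so a struggling dependency is not hammered into a full outage. A bare-bones sketch (thresholds and timings are illustrative):

# Minimal circuit breaker: after too many consecutive failures, stop calling
# the dependency for a cool-off period instead of letting failures cascade.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency failing, call skipped")
            self.opened_at = None       # cool-off over, allow a retry
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0               # any success resets the count
        return result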

Problem 3: Alert Fatigue

What the data showed:

  • On-call engineers getting 50+ pages per week
  • 80% of alerts resolved themselves
  • Team morale was suffering

Root cause investigation:

  • Alerts triggered on any spike, not SLO breaches
  • No distinction between warning and critical
  • Noisy metrics creating false positives

Fixes implemented:

  1. Rewrote Prometheus alerting rules to focus on SLO breaches
  2. Added "sustained for 10 minutes" threshold
  3. Differentiated between "watch" and "act now" alerts

Result: Pages reduced by 65%, MTTR improved by 40%
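
The "sustained for 10 minutes" rule is what killed most of the noise: a single bad data point no longer pages anyone. In Prometheus that is the for: clause on an alerting rule; the same idea in a standalone Python sketch (window size and SLO threshold are illustrative):

# Page only when the SLI has breached the SLO for a sustained window,
# not on a single bad sample. Window and threshold are illustrative.
from collections import deque

SLO = 0.999
WINDOW = 10   # consecutive 1-minute samples that must breach before paging

recent = deque(maxlen=WINDOW)

def record_sample(availability: float) -> bool:
    """Return True if this sample should trigger a page."""
    recent.append(availability < SLO)
    return len(recent) == WINDOW and all(recent)

# One bad minute alone does not page:
print(record_sample(0.95))          # False

# Ten consecutive bad minutes do:
for _ in range(9):
    fired = record_sample(0.95)
print(fired)                        # True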


Step 6: Understanding SLAs (Service Level Agreements)

What's an SLA?

An SLA is your external promise to customers, usually with financial consequences if you break it.

CompanyA's SLA Design

They created customer-facing SLAs backed by internal SLOs:

Internal SLO: 99.9% availability
Customer SLA: 99.5% availability (with buffer!)

Why the buffer?
- Protects against measurement differences  
- Gives room for maintenance windows
- SLO breach doesn't automatically mean SLA breach

The Customer Agreement

CompanyA Payment API - Service Level Agreement

Availability Guarantee: 99.5% uptime monthly
Sample Credit Structure:
- 99.0-99.49% uptime → 10% credit
- 98.0-98.99% uptime → 25% credit  
- Below 98.0% uptime → 50% credit

Exclusions:
- Scheduled maintenance (announced 48hrs ahead)
- Customer's infrastructure issues
- Force majeure events
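
As a sketch, the credit tiers above translate into a few comparisons (the function below mirrors the sample structure and nothing more; uptime_pct is the measured monthly uptime):

# Map measured monthly uptime (%) to the service credit (%) owed,
# following the sample credit structure above.
def service_credit_pct(uptime_pct: float) -> float:
    if uptime_pct >= 99.5:
        return 0.0     # SLA met, no credit owed
    if uptime_pct >= 99.0:
        return 10.0
    if uptime_pct >= 98.0:
        return 25.0
    return 50.0

print(service_credit_pct(99.7))   # 0.0
print(service_credit_pct(99.2))   # 10.0
print(service_credit_pct(97.5))   # 50.0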

Key Lesson

Your SLAs should be less strict than your SLOs.

This buffer means:

  • You have time to fix issues before customers are affected
  • You're not paying out credits for every tiny breach
  • You can meet customer expectations consistently

The Results: 6 Months Later

Metrics

[Image: the metrics after six months]

Cultural Wins

[Image: the cultural wins]


Key Takeaways for Your Team

  • Start Small: You don't need to instrument everything. Pick one critical service and one user journey.
  • Use Real Data: Don't guess at SLOs. Look at 2-3 months of actual performance and user behavior.
  • Your Error Budgets Are Power: They transform reliability from a vague concept into something you can negotiate with data.
  • Dashboards Create Alignment: When everyone sees the same numbers, conversations shift from blame to improvement.
  • SLAs Need Buffers: Your internal targets (SLOs) should be stricter than your customer promises (SLAs).
  • Perfect Is the Enemy of Reliable: Chasing 100% uptime kills innovation. Accept some failure, measure it, and stay within budget.

Want To Try This With Your Team?

The four-week template below can help:

Week 1: Define

  • Pick one critical service
  • Map the user journey
  • Choose 2-3 key SLIs
  • Pull historical data

Week 2: Measure

  • Set realistic SLOs based on data
  • Calculate error budgets
  • Set up basic monitoring

Week 3: Monitor

  • Build a simple dashboard
  • Review daily: Are we within SLOs?
  • Document what's consuming the budget

Week 4: Improve

  • Hold a team review
  • Pick the top budget-burner
  • Plan improvements
  • Iterate!

Resources to Go Deeper


Up Next!

In the next (and final) case study, we will explore how CompanyB, a large telecom organization with both legacy and modern systems, applied these same principles across multiple teams and vendors.

Watch Out!


For your questions, thoughts, additions, or suggestions, please share in the comments section!
