Gbenga Kusade

SRE in Action: Understanding How Real Teams Use SLOs, SLIs, and Error Budgets to Stay Reliable Through Case Studies - Part 1

When people talk about Site Reliability Engineering (SRE), they often share abstract principles about SLIs, SLOs, and error budgets. But here's the problem: understanding the concepts isn't the same as knowing how to apply them.

The truth is, reliability challenges look radically different depending on where you sit. This article presents two SRE implementations from completely different perspectives, as a step-by-step walkthrough for beginners:

  • For startups (CompanyA): it's about moving fast without breaking everything as you scale.
  • For enterprises (CompanyB): it's about coordinating dozens of teams who can't agree on what "reliable" even means.

Both need SRE principles. But the implementations couldn't be more different.

Let's dive in.


CASE 1: How a FinTech Startup Moved from Firefighting to Measurable Reliability


What You Will Learn

By the end of this case study, you will understand:

  • How to identify what metrics actually matter to users (SLIs)
  • How to set realistic reliability targets (SLOs)
  • What an error budget is and why it is your secret weapon
  • How to balance shipping features with maintaining reliability

No complex theory, just a startup's journey from chaos to confidence.


Meet CompanyA: A Growing Startup with Growing Pains

CompanyA is an x-year-old fintech startup providing digital wallets and payment APIs to small businesses across Africa. They recently crossed 1 million users. Exciting news! But with growth came pain.

CompanyA Tech Stack

  • Frontend: React web app
  • Backend: Containerized services on AWS ECS
  • Database: PostgreSQL on RDS
  • Infrastructure: API Gateway + CloudFront CDN

The Problem

During high-traffic periods (Black Friday, salary week, etc.), things started breaking:

  • Payment success rates dropped to 96%
  • Users complained: "Transfers hang for minutes!"
  • Engineers were burning out from constant alerts
  • Every issue felt equally urgent
  • No one could agree on what "reliable" meant

If this sounds familiar, let us see how it was fixed. We will walk through the process step by step.


Step 1: Understanding SLIs (Service Level Indicators)

What is an SLI?

Think of SLIs as the vital signs of your service. Just like a doctor checks your heart rate and blood pressure, SLIs tell you what your users are actually experiencing.

CompanyA's Journey: Finding What Matters

The team started by asking: What does success look like from a user's perspective?

They mapped out the critical user journey:

[Image: the critical user journey]

For each step, they asked: What metric shows if this step is working well?

The SLIs They Chose

[Image: the SLIs CompanyA chose]

Why 95th Percentile?

Instead of looking at average response time (which hides problems), the 95th percentile shows: 95% of users experience this speed or better.

For example:

  • Average latency: 1.5s (looks good!)
  • 95th percentile: 5s (problem! 5% of users wait too long)

The 95th percentile catches issues that averages hide, as the sketch below illustrates.
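
To see the difference concretely, here is a minimal Python sketch (the latency numbers are made up for illustration, not CompanyA's data) comparing the mean with the 95th percentile of a request-latency sample:

# Compare the average with the 95th percentile on a toy latency sample.
# The numbers are illustrative only.
import statistics

# 95 fast requests (~1s) and 5 slow ones (~10s)
latencies_s = [1.0] * 95 + [10.0] * 5

average = statistics.mean(latencies_s)
p95 = statistics.quantiles(latencies_s, n=100)[94]  # 95th percentile

print(f"Average latency: {average:.2f}s")   # 1.45s, looks fine
print(f"95th percentile: {p95:.2f}s")       # ~9.5s, the slow tail is visible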

Step 2: Setting SLOs (Service Level Objectives)

What's an SLO?

An SLO is your reliability target: a specific, measurable goal you commit to internally. It answers: How reliable should this service be?

CompanyA's Approach: Data-Driven Targets

They did not guess. They analyzed 3 months of real data to see:

  • What reliability they were currently achieving
  • Where users dropped off
  • What was realistically achievable

Here is what they decided:

[Image: the SLOs CompanyA decided on]

Critical Decision: Why Not 100%?

CompanyA learned that chasing 100% is a trap:

  • It is impossibly expensive
  • It slows innovation to a crawl
  • Real-world systems have dependencies that fail

Instead, they accepted 0.1% failure (about 43 minutes of downtime per month). This is not giving up; it is being realistic.

The Golden Question

When setting SLOs, ask yourself:

  • What is the minimum reliability that keeps users happy and the business healthy?

Set the bar too low -> users leave. Set the bar too high -> you never ship features.

Step 3: Understanding Error Budgets

What is an Error Budget?

This is the game-changer concept. Your error budget is the amount of unreliability you can afford before breaking your SLO.

Think of it like this:

  • SLO says: "Be available 99.9% of the time"
  • That means you can be down 0.1% of the time
  • That 0.1% is your error budget

CompanyA's Error Budget Calculation

For their 99.9% availability SLO over 30 days:

Total time in month = 30 days × 24 hours × 60 minutes = 43,200 minutes
Allowed downtime (0.1%) = 43.2 minutes

This is their error budget: 43.2 minutes of downtime per month
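
As a quick sanity check, here is a small Python helper (an illustrative sketch, not CompanyA's actual tooling) that turns any availability SLO into a monthly error budget in minutes:

# Convert an availability SLO into a monthly error budget in minutes.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60      # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo)    # the slice you are allowed to fail

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes for a 99.9% SLO
print(round(error_budget_minutes(0.995), 1))   # 216.0 minutes for a 99.5% SLO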

The Policy That Changed Everything

CompanyA created a simple rule:

If the Error budget is > 50% remaining -> Ship new features confidently

If the Error budget is 25-50% remaining  -> Review what's burning the budget, slow down risky changes

If the Error budget is < 25% remaining -> FREEZE new features, focus only on reliability

If the Error budget is exhausted (100% used) -> Complete feature freeze until the budget recovers the next month
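
Expressed as code, the policy is just a threshold check on the remaining budget. A minimal Python sketch that mirrors the four tiers above (nothing beyond them is implied):

# Map the remaining error budget (in percent) to a release decision,
# following the four-tier policy above.
def release_decision(budget_remaining_pct: float) -> str:
    if budget_remaining_pct <= 0:
        return "Complete feature freeze until the budget recovers next month"
    if budget_remaining_pct < 25:
        return "Freeze new features, focus only on reliability"
    if budget_remaining_pct <= 50:
        return "Review what's burning the budget, slow down risky changes"
    return "Ship new features confidently"

print(release_decision(80))   # Ship new features confidently
print(release_decision(40))   # Review what's burning the budget, slow down risky changes
print(release_decision(10))   # Freeze new features, focus only on reliability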

Why This Matters

[Image: why the error budget matters]

The error budget gave engineers objective authority to say "not now" when reliability was at risk.


Step 4: Building Visibility with Dashboards

Making It Real: The Dashboard

CompanyA built a simple Grafana dashboard that everyone (engineers, product managers, and executives alike) could understand:

[Image: the SLO dashboard]

The Architecture Behind It

[Image: the architecture behind the dashboard]

What Made This Dashboard Effective

  1. Single source of truth: No more "it works on my machine"
  2. Real-time visibility: See problems as they happen
  3. Clear status: Green/Yellow/Red, no ambiguity
  4. Actionable alerts: Only fires when SLOs are actually at risk
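
The exact queries depend on your metrics stack, but conceptually every panel boils down to "good events divided by total events" plus "how much budget is left". A hypothetical sketch of that math, assuming you can count successful and total requests for the current window:

# Hypothetical dashboard math: availability SLI and error budget remaining.
# The request counts are made-up inputs; in practice they come from your
# metrics store (e.g. Prometheus counters).
SLO = 0.999

successful_requests = 9_996_000
total_requests = 10_000_000

availability = successful_requests / total_requests      # the SLI
allowed_failures = total_requests * (1 - SLO)             # budget, in requests
actual_failures = total_requests - successful_requests
budget_remaining_pct = 100 * (1 - actual_failures / allowed_failures)

print(f"Availability SLI: {availability:.4%}")                   # 99.9600%
print(f"Error budget remaining: {budget_remaining_pct:.1f}%")    # 60.0%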

Step 5: Using Data to Drive Improvements

Problem 1: Slow API Response Times

What the data showed:

  • P95 latency: 3.2s (breaching the 2.5s SLO)
  • Happened during peak traffic
  • Correlated with database query spikes

Root cause investigation:

  • Database connection pool maxing out
  • Repeated queries for the same data
  • No caching layer

Fixes implemented:

  1. Increased connection pool size
  2. Added Redis caching for frequent queries
  3. Implemented query result pagination

Result: P95 latency dropped to 1.8s
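
For the caching fix, the usual pattern is a read-through cache in front of the hot queries. A minimal sketch using redis-py, where get_balance_from_db is a hypothetical stand-in for the expensive PostgreSQL query:

# Read-through cache sketch: serve hot reads from Redis, fall back to the DB.
# get_balance_from_db is a hypothetical placeholder for the real query.
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def get_balance_from_db(user_id: str) -> dict:
    # Placeholder for the expensive database query.
    return {"user_id": user_id, "balance": 0}

def get_balance(user_id: str, ttl_seconds: int = 30) -> dict:
    key = f"balance:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                        # cache hit: skip the DB
    result = get_balance_from_db(user_id)                # cache miss: query once...
    cache.setex(key, ttl_seconds, json.dumps(result))    # ...and store for next time
    return result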

Problem 2: Success Rate Drops During Promotions

What the data showed:

  • Success rate dropped to 96% during Black Friday
  • 40% of the error budget consumed in one week
  • Alerts everywhere

Root cause investigation:

  • System couldn't handle 5x normal traffic
  • No load testing before promotional launches
  • Services crashed under load

Fixes implemented:

  1. Added load testing to CI/CD pipeline
  2. Implemented auto-scaling based on request rate
  3. Added circuit breakers to prevent cascade failures

Result: Next promotion maintained 99.9% success rate
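
A circuit breaker can be as simple as counting consecutive failures and refusing calls for a cool-off period once a threshold is hit, so a struggling dependency is not hammered into a full outage. A bare-bones sketch (thresholds and timings are illustrative):

# Minimal circuit breaker: after too many consecutive failures, stop calling
# the dependency for a cool-off period instead of letting failures cascade.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("circuit open: dependency failing, call skipped")
            self.opened_at = None       # cool-off over, allow a retry
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0               # any success resets the count
        return result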

Problem 3: Alert Fatigue

What the data showed:

  • On-call engineers getting 50+ pages per week
  • 80% of alerts resolved themselves
  • Team morale was suffering

Root cause investigation:

  • Alerts triggered on any spike, not SLO breaches
  • No distinction between warning and critical
  • Noisy metrics creating false positives

Fixes implemented:

  1. Rewrote Prometheus alerting rules to focus on SLO breaches
  2. Added "sustained for 10 minutes" threshold
  3. Differentiated between "watch" and "act now" alerts

Result: Pages reduced by 65%, MTTR improved by 40%
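
The "sustained for 10 minutes" rule is what killed most of the noise: a single bad data point no longer pages anyone. In Prometheus that is the for: clause on an alerting rule; the same idea in a standalone Python sketch (window size and SLO threshold are illustrative):

# Page only when the SLI has breached the SLO for a sustained window,
# not on a single bad sample. Window and threshold are illustrative.
from collections import deque

SLO = 0.999
WINDOW = 10   # consecutive 1-minute samples that must breach before paging

recent = deque(maxlen=WINDOW)

def record_sample(availability: float) -> bool:
    """Return True if this sample should trigger a page."""
    recent.append(availability < SLO)
    return len(recent) == WINDOW and all(recent)

# One bad minute alone does not page:
print(record_sample(0.95))          # False

# Ten consecutive bad minutes do:
for _ in range(9):
    fired = record_sample(0.95)
print(fired)                        # True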


Step 6: Understanding SLAs (Service Level Agreements)

What's an SLA?

An SLA is your external promise to customers, usually with financial consequences if you break it.

CompanyA's SLA Design

They created customer-facing SLAs backed by internal SLOs:

Internal SLO: 99.9% availability
Customer SLA: 99.5% availability (with buffer!)

Why the buffer?
- Protects against measurement differences  
- Gives room for maintenance windows
- SLO breach doesn't automatically mean SLA breach

The Customer Agreement

CompanyA Payment API - Service Level Agreement

Availability Guarantee: 99.5% uptime monthly
Sample Credit Structure:
- 99.0-99.49% uptime → 10% credit
- 98.0-98.99% uptime → 25% credit  
- Below 98.0% uptime → 50% credit

Exclusions:
- Scheduled maintenance (announced 48hrs ahead)
- Customer's infrastructure issues
- Force majeure events
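
As a sketch, the credit tiers above translate into a few comparisons (the function below mirrors the sample structure and nothing more; uptime_pct is the measured monthly uptime):

# Map measured monthly uptime (%) to the service credit (%) owed,
# following the sample credit structure above.
def service_credit_pct(uptime_pct: float) -> float:
    if uptime_pct >= 99.5:
        return 0.0     # SLA met, no credit owed
    if uptime_pct >= 99.0:
        return 10.0
    if uptime_pct >= 98.0:
        return 25.0
    return 50.0

print(service_credit_pct(99.7))   # 0.0
print(service_credit_pct(99.2))   # 10.0
print(service_credit_pct(97.5))   # 50.0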

Key Lesson

Your SLAs should be less strict than your SLOs.

This buffer means:

  • You have time to fix issues before customers are affected
  • You're not paying out credits for every tiny breach
  • You can meet customer expectations consistently

The Results: 6 Months Later

Metrics

[Image: the metrics after six months]

Cultural Wins

[Image: the cultural wins]


Key Takeaways for Your Team

  • Start Small: You don't need to instrument everything. Pick one critical service and one user journey.
  • Use Real Data: Don't guess at SLOs. Look at 2-3 months of actual performance and user behavior.
  • Your Error Budgets Are Power: They transform reliability from a vague concept into something you can negotiate with data.
  • Dashboards Create Alignment: When everyone sees the same numbers, conversations shift from blame to improvement.
  • SLAs Need Buffers: Your internal targets (SLOs) should be stricter than your customer promises (SLAs).
  • Perfect Is the Enemy of Reliable: Chasing 100% uptime kills innovation. Accept some failure, measure it, and stay within budget.

Want To Try This With Your Team?

The four-week template below can help:

Week 1: Define

  • Pick one critical service
  • Map the user journey
  • Choose 2-3 key SLIs
  • Pull historical data

Week 2: Measure

  • Set realistic SLOs based on data
  • Calculate error budgets
  • Set up basic monitoring

Week 3: Monitor

  • Build a simple dashboard
  • Review daily: Are we within SLOs?
  • Document what's consuming the budget

Week 4: Improve

  • Hold a team review
  • Pick the top budget-burner
  • Plan improvements
  • Iterate!

Resources to Go Deeper


Up Next!

In the next (and final) case study, we will explore how CompanyB, a large telecom organization with both legacy and modern systems, applied these same principles across multiple teams and vendors.

Watch Out!


For your questions, thoughts, additions, or suggestions, please share in the comments section!
