When people talk about Site Reliability Engineering (SRE), they often share abstract principles about SLIs, SLOs, and error budgets. But here's the problem: understanding the concepts isn't the same as knowing how to apply them.
The truth is, reliability challenges look radically different depending on where you sit. This article presents two SRE implementations from completely different perspectives, written as a complete walkthrough for beginners:
- For startups (CompanyA): it's about moving fast without breaking everything as you scale.
- For enterprises (CompanyB): it's about coordinating dozens of teams who can't agree on what "reliable" even means.
Both need SRE principles. But the implementation couldn't be more different.
Let's dive in.
CASE 1: How a FinTech Startup Moved from Firefighting to Measurable Reliability
What You Will Learn
By the end of this case study, you will understand:
- How to identify what metrics actually matter to users (SLIs)
- How to set realistic reliability targets (SLOs)
- What an error budget is and why it is your secret weapon
- How to balance shipping features with maintaining reliability
No complex theory, just a startup's journey from chaos to confidence.
Meet CompanyA: A Growing Startup with Growing Pains
CompanyA is an x-year-old fintech startup providing digital wallets and payment APIs to small businesses across Africa. They recently crossed 1 million users (exciting news!), but with growth came pain.
CompanyA Tech Stack
- Frontend: React web app
- Backend: Containerized services on AWS ECS
- Database: PostgreSQL on RDS
- Infrastructure: API Gateway + CloudFront CDN
The Problem
During high-traffic periods (Black Friday, salary week, etc.), things started breaking:
- Payment success rates dropped to 96%
- Users complained: "Transfers hang for minutes!"
- Engineers were burning out from constant alerts
- Every issue felt equally urgent
- No one could agree on what "reliable" meant
If this sounds familiar, let us see how CompanyA fixed it. We will walk through the process step by step.
Step 1: Understanding SLIs (Service Level Indicators)
What is an SLI?
Think of SLIs as the vital signs of your service. Just like a doctor checks your heart rate and blood pressure, SLIs tell you what your users are actually experiencing.
CompanyA's Journey: Finding What Matters
The team started by asking: What does success look like from a user's perspective?
They mapped out the critical user journey:
For each step, they asked: What metric shows if this step is working well?
The SLIs They Chose
Why 95th Percentile?
Instead of looking at average response time (which hides problems), the 95th percentile shows: 95% of users experience this speed or better.
For Example:
- Average latency: 1.5s (looks good!)
- 95th percentile: 5s (problem! 5% of users wait too long)

The 95th percentile catches issues that averages hide.
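To make this concrete, here is a minimal Python sketch comparing the average against the 95th percentile for the same set of requests. The latency samples and the simple nearest-rank percentile method are illustrative assumptions, not CompanyA's data or tooling:

```python
# A minimal sketch showing how an average can hide the slow tail
# that a 95th-percentile view exposes. Numbers are made up.
import statistics

def p95(values):
    """Return the 95th percentile of a list of latency samples (seconds)."""
    ordered = sorted(values)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]

# Hypothetical samples: most requests are fast, a handful are very slow.
latencies = [0.3] * 90 + [0.8] * 4 + [5.2, 5.5, 6.0, 6.4, 7.1, 8.0]

print(f"Average latency: {statistics.mean(latencies):.2f}s")  # 0.68s, looks healthy
print(f"P95 latency:     {p95(latencies):.2f}s")              # 5.20s, reveals the slow tail
```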
Step 2: Setting SLOs (Service Level Objectives)
What's an SLO?
An SLO is your reliability target: a specific, measurable goal you commit to internally. It answers: How reliable should this service be?
CompanyA's Approach: Data-Driven Targets
They did not guess. They analyzed 3 months of real data to see:
- What reliability they were currently achieving
- Where users dropped off
- What was realistically achievable
Here is what they decided:
Critical Decision: Why Not 100%?
CompanyA learned that chasing 100% is a trap:
- It is impossibly expensive
- It slows innovation to a crawl
- Real-world systems have dependencies that fail
Instead, they accepted 0.1% failure (about 43 minutes of downtime per month). This is not giving up; it is being realistic.
The Golden Question
When setting SLOs, ask yourself:
- What is the minimum reliability that keeps users happy and the business healthy?

Set the bar too low -> users leave. Set the bar too high -> you never ship features.
Step 3: Understanding Error Budgets
What is an Error Budget?
This is the game-changer concept. Your error budget is the amount of unreliability you can afford before breaking your SLO.
Think of it like this:
- SLO says: "Be available 99.9% of the time"
- That means you can be down 0.1% of the time
- That 0.1% is your error budget
CompanyA's Error Budget Calculation
For their 99.9% availability SLO over 30 days:
Total time in month = 30 days × 24 hours × 60 minutes = 43,200 minutes
Allowed downtime (0.1%) = 43.2 minutes
This is their error budget: 43.2 minutes of downtime per month
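If you want to reproduce this arithmetic for other SLOs or windows, a tiny helper like the sketch below will do. The function name is mine; the 99.5% value anticipates the customer SLA discussed later in this article:

```python
# A minimal sketch of CompanyA's error-budget arithmetic: the budget is
# simply the share of the window the SLO allows you to be down.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (in minutes) for a given availability SLO and window."""
    total_minutes = window_days * 24 * 60
    return round(total_minutes * (1 - slo), 1)

print(error_budget_minutes(0.999))   # 43.2 minutes per 30-day month
print(error_budget_minutes(0.995))   # 216.0 minutes, the looser customer SLA
```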
The Policy That Changed Everything
CompanyA created a simple rule:
- Error budget > 50% remaining -> Ship new features confidently
- Error budget 25-50% remaining -> Review what's burning the budget, slow down risky changes
- Error budget < 25% remaining -> FREEZE new features, focus only on reliability
- Error budget exhausted (100% used) -> Complete feature freeze until the budget recovers the next month
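Expressed as code, the policy is just a lookup on the remaining budget. The sketch below is an illustrative Python translation of the rules above (the function name is mine), not CompanyA's actual tooling:

```python
# A minimal sketch mapping the fraction of error budget remaining to an action,
# using the thresholds from CompanyA's policy above.

def release_policy(budget_remaining: float) -> str:
    """budget_remaining is a fraction between 0.0 (exhausted) and 1.0 (untouched)."""
    if budget_remaining > 0.50:
        return "Ship new features confidently"
    if budget_remaining >= 0.25:
        return "Review what is burning the budget; slow down risky changes"
    if budget_remaining > 0.0:
        return "Freeze new features; focus only on reliability"
    return "Complete feature freeze until the budget recovers next month"

print(release_policy(0.62))  # ship confidently
print(release_policy(0.10))  # freeze new features
```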
Why This Matters
The error budget gave engineers objective authority to say "not now" when reliability was at risk.
Step 4: Building Visibility with Dashboards
Making It Real: The Dashboard
CompanyA built a simple Grafana dashboard that everyone, from engineers to product managers to executives, could understand:
The Architecture Behind It
What Made This Dashboard Effective
- Single source of truth: No more "it works on my machine"
- Real-time visibility: See problems as they happen
- Clear status: Green/Yellow/Red, no ambiguity
- Actionable alerts: Only fires when SLOs are actually at risk
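The green/yellow/red status logic can be as simple as comparing the measured SLI against the SLO and the remaining error budget. CompanyA built theirs in Grafana; the Python sketch below only illustrates the idea, and the exact thresholds are assumptions:

```python
# A minimal sketch of a dashboard panel's status logic: combine the current
# SLI reading with how much error budget is left to pick a color.

def panel_status(measured_sli: float, slo: float, budget_remaining: float) -> str:
    if measured_sli >= slo and budget_remaining > 0.50:
        return "GREEN"   # meeting the SLO with plenty of budget left
    if measured_sli >= slo or budget_remaining > 0.25:
        return "YELLOW"  # either a brief breach or the budget is burning fast
    return "RED"         # breaching the SLO with little budget left

print(panel_status(measured_sli=0.9993, slo=0.999, budget_remaining=0.70))  # GREEN
print(panel_status(measured_sli=0.9985, slo=0.999, budget_remaining=0.15))  # RED
```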
Step 5: Using Data to Drive Improvements
Problem 1: Slow API Response Times
What the data showed:
- P95 latency: 3.2s (breaching the 2.5s SLO)
- Happened during peak traffic
- Correlated with database query spikes
Root cause investigation:
- Database connection pool maxing out
- Repeated queries for the same data
- No caching layer
Fixes implemented:
- Increased connection pool size
- Added Redis caching for frequent queries
- Implemented query result pagination

Result: P95 latency dropped to 1.8s
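For reference, a read-through cache along the lines of the Redis fix might look like the sketch below. The helper name, key format, and 60-second TTL are assumptions for illustration, not CompanyA's actual code:

```python
# A minimal read-through cache sketch: check Redis before hitting the database,
# and cache results briefly to absorb repeated queries for the same data.
import json
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def fetch_wallet_from_db(wallet_id: str) -> dict:
    # Placeholder standing in for the real PostgreSQL query.
    return {"wallet_id": wallet_id, "balance": 0}

def get_wallet_balance(wallet_id: str) -> dict:
    key = f"wallet:{wallet_id}:balance"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # cache hit: skip the database
    result = fetch_wallet_from_db(wallet_id)   # cache miss: query the database
    cache.setex(key, 60, json.dumps(result))   # cache for 60s (assumed TTL)
    return result
```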
Problem 2: Success Rate Drops During Promotions
What the data showed:
- Success rate dropped to 96% during Black Friday
- Error budget consumed 40% in one week
- Alerts everywhere
Root cause investigation:
- System couldn't handle 5x normal traffic
- No load testing before promotional launches
- Services crashed under load
Fixes implemented:
- Added load testing to CI/CD pipeline
- Implemented auto-scaling based on request rate
- Added circuit breakers to prevent cascade failures

Result: Next promotion maintained 99.9% success rate
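A circuit breaker can be surprisingly little code. The sketch below illustrates the pattern with assumed thresholds (5 failures, 30-second cool-off); it is not CompanyA's implementation:

```python
# A minimal circuit-breaker sketch: after a run of failures, fail fast for a
# cool-off period instead of piling more load onto a struggling dependency.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cool-off over: allow a trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0
        return result
```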
Problem 3: Alert Fatigue
What the data showed:
- On-call engineers getting 50+ pages per week
- 80% of alerts resolved themselves
- Team morale was suffering
Root cause investigation:
- Alerts triggered on any spike, not SLO breaches
- No distinction between warning and critical
- Noisy metrics creating false positives
Fixes implemented:
- Rewrote Prometheus alerting rules to focus on SLO breaches
- Added "sustained for 10 minutes" threshold
- Differentiated between "watch" and "act now" alerts
Result: Pages reduced by 65%, MTTR improved by 40%
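CompanyA expressed this logic in Prometheus alerting rules; the Python sketch below only illustrates the underlying idea of paging when a breach is sustained across a whole window rather than on a single spike. The one-sample-per-minute cadence is an assumption:

```python
# A minimal sketch of "sustained for 10 minutes" alerting: page on-call only
# when every SLI sample in the window breaches the SLO, not on one bad minute.
from collections import deque

class SustainedBreachAlert:
    def __init__(self, slo: float = 0.999, window_minutes: int = 10):
        self.slo = slo
        self.samples = deque(maxlen=window_minutes)  # one SLI sample per minute

    def record(self, sli_value: float) -> bool:
        """Return True (page on-call) only if the whole window breaches the SLO."""
        self.samples.append(sli_value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s < self.slo for s in self.samples)
```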
Step 6: Understanding SLAs (Service Level Agreements)
What's an SLA?
An SLA is your external promise to customers, usually with financial consequences if you break it.
CompanyA's SLA Design
They created customer-facing SLAs backed by internal SLOs:
Internal SLO: 99.9% availability
Customer SLA: 99.5% availability (with buffer!)
Why the buffer?
- Protects against measurement differences
- Gives room for maintenance windows
- SLO breach doesn't automatically mean SLA breach
The Customer Agreement
CompanyA Payment API - Service Level Agreement
Availability Guarantee: 99.5% uptime monthly
Sample Credit Structure:
- 99.0-99.49% uptime → 10% credit
- 98.0-98.99% uptime → 25% credit
- Below 98.0% uptime → 50% credit
Exclusions:
- Scheduled maintenance (announced 48hrs ahead)
- Customer's infrastructure issues
- Force majeure events
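Applying the credit structure is straightforward. Here is an illustrative Python sketch that maps a month's measured uptime to the tiers above; it assumes exclusions such as announced maintenance have already been subtracted:

```python
# A minimal sketch of the sample credit structure from the agreement above.

def sla_credit_percent(uptime_percent: float) -> int:
    if uptime_percent >= 99.5:
        return 0      # SLA met: no credit owed
    if uptime_percent >= 99.0:
        return 10
    if uptime_percent >= 98.0:
        return 25
    return 50

print(sla_credit_percent(99.72))  # 0  (within the 99.5% guarantee)
print(sla_credit_percent(98.4))   # 25
```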
Key Lesson
Your SLAs should be less strict than your SLOs.
This buffer means:
- You have time to fix issues before customers are affected
- You're not paying out credits for every tiny breach
- You can meet customer expectations consistently
The Results: 6 Months Later
Metrics
Cultural Wins
Key Takeaways for Your Team
- Start Small: You don't need to instrument everything. Pick one critical service and one user journey.
- Use Real Data: Don't guess at SLOs. Look at 2-3 months of actual performance and user behavior.
- Your Error Budgets Are Power: They transform reliability from a vague concept into something you can negotiate with data.
- Dashboards Create Alignment: When everyone sees the same numbers, conversations shift from blame to improvement.
- SLAs Need Buffers: Your internal targets (SLOs) should be stricter than your customer promises (SLAs).
- Perfect Is the Enemy of Reliable: Chasing 100% uptime kills innovation. Accept some failure, measure it, and stay within budget.
Want To Try This With Your Team?
The template below can help:
Week 1: Define
- Pick one critical service
- Map the user journey
- Choose 2-3 key SLIs
- Pull historical data
Week 2: Measure
- Set realistic SLOs based on data
- Calculate error budgets
- Set up basic monitoring
Week 3: Monitor
- Build a simple dashboard
- Review daily: Are we within SLOs?
- Document what's consuming the budget
Week 4: Improve
- Hold a team review
- Pick the top budget-burner
- Plan improvements
- Iterate!
Resources to Go Deeper
- Google SRE Book - Free, comprehensive
- Google SRE Workbook - Practical exercises
- Prometheus Best Practices - Metrics collection
- Grafana SLO Plugin - Dashboard tooling
Up Next!
In the next and final case study, we will explore how CompanyB, a large telecom organization with both legacy and modern systems, applied these same principles across multiple teams and vendors.
Watch Out!
For your questions, thoughts, additions, or suggestions, please share in the comments section!