Anderson Leite

Your SLA (maybe?) is a Lie: Why most companies get RTO, RPO and Service Level Agreements wrong

This article is a follow-up to "Multi-Cloud and Return to On-Prem Aren't Your Silver Bullets"

After publishing my analysis of the recent AWS outage and the multi-cloud/on-prem debate, I received several direct messages with variations of the same question: "How do we calculate our SLAs properly when we depend on third parties?"

Do you know how to do it? Most companies don't. They pick aspirational numbers that sound good in sales decks without understanding the mathematical reality of their dependencies. Worse, many don't realize that their promised SLAs are literally impossible given their suppliers' SLAs.

Let me show you why your 99.99% uptime promise might be mathematically impossible, and how to fix it.
 

The Fundamental Misunderstanding

 
Let's start with definitions that actually matter in practice. First, a graphical representation:

RPOxRTOxSLA
Note: I asked Google's Nano Banana to generate an image, but couldn't get it to fix the closing parentheses after "service level agreement" and "recovery time objective", so bear with me.
 
Recovery Time Objective (RTO): The maximum acceptable time your system can be down before you've failed your business requirements. This is measured in actual time units (minutes, hours) and represents the point where business damage becomes unacceptable.

Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. RPO of zero means no data loss is acceptable.

Service Level Agreement (SLA): The contractual commitment you make to your customers about availability, typically expressed as a percentage (99.9%, 99.99%, etc.) and measured over a defined period (usually monthly or annually).

Here's what most companies get wrong: These three metrics are interconnected, and your suppliers' numbers directly constrain yours.
 

The Math Nobody Wants to Do

 

Understanding Availability Percentages

 
Let's make this concrete. Here's what different availability percentages actually mean:

| Availability | Downtime/Month | Downtime/Year | What This Really Means |
|---|---|---|---|
| 99% ("two nines") | 7.2 hours | 3.65 days | Unacceptable for most production systems |
| 99.5% | 3.6 hours | 1.83 days | Still pretty rough |
| 99.9% ("three nines") | 43.2 minutes | 8.76 hours | AWS IAM control plane |
| 99.95% | 21.6 minutes | 4.38 hours | Common for business services |
| 99.99% ("four nines") | 4.32 minutes | 52.56 minutes | Premium tier promise |
| 99.999% ("five nines") | 26 seconds | 5.26 minutes | Telecom-grade (very expensive) |

Most companies promise 99.9% or 99.99% without understanding what it takes to achieve it.
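To make the conversion explicit, here's a small Python sketch (my own illustration, using a 30-day month like the table above) that turns an availability percentage into allowed downtime:

```python
def allowed_downtime(availability_pct: float) -> tuple[float, float]:
    """Return (minutes of downtime allowed per month, hours per year)."""
    unavailability = 1 - availability_pct / 100
    monthly_minutes = unavailability * 30 * 24 * 60   # 30-day month
    annual_hours = unavailability * 365 * 24
    return monthly_minutes, annual_hours

for pct in (99.0, 99.9, 99.99, 99.999):
    per_month, per_year = allowed_downtime(pct)
    print(f"{pct}%: {per_month:.1f} min/month, {per_year:.2f} h/year")
```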
 

The Dependency Chain Problem

 
Here's where it gets real: Your availability is constrained by your dependencies.

If your application depends on AWS (99.9% SLA), a database service (99.95% SLA), and a payment gateway (99.9% SLA), your theoretical maximum availability is the product of these dependencies:

Your Max Availability = 0.999 × 0.9995 × 0.999 = 0.9975 = 99.75%

You cannot promise 99.99% uptime if your dependencies only give you 99.75%.

And this assumes perfect implementation on your part: No bugs, no deployment issues, no configuration errors. In reality, you need to account for your own operational reliability as well.
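That multiplication is trivial to automate. Here's a minimal Python sketch (the function name is mine) that computes the availability of components in series:

```python
from math import prod

def serial_availability(slas: list[float]) -> float:
    """Components in series: the request fails if any one of them is down."""
    return prod(slas)

# The three-dependency example from above: AWS, database, payment gateway
print(f"{serial_availability([0.999, 0.9995, 0.999]):.4%}")  # ~99.7502%
```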
 

The Real-World Example

 
Let's build a realistic scenario for a SaaS application:

Your Dependencies:

  1. AWS EC2/EKS (data plane): 99.99% SLA
  2. AWS RDS Multi-AZ: 99.95% SLA
  3. AWS S3: 99.99% designed availability (note: the SLA credits only kick in below that, and the famous eleven nines are durability, not availability)
  4. Your CDN (Cloudflare): 100% SLA (with credits, not guaranteed uptime)
  5. Auth0 for authentication: 99.99% SLA
  6. Stripe for payments: 99.99% SLA
  7. SendGrid for emails: 99.95% SLA

Theoretical maximum availability:

0.9999 × 0.9995 × 0.9999 × 1.0 × 0.9999 × 0.9999 × 0.9995 ≈ 0.9986 = 99.86%

But wait, there's more! We also need to account for:

  • Your application code and bugs (let's say 99.9% operational reliability)
  • Your deployment processes (let's say 99.95% - one bad deploy per year)
  • Your monitoring and response time (99.9%)
0.9986 × 0.999 × 0.9995 × 0.999 ≈ 0.9961 = 99.61%

If you promised customers a 99.9% SLA, the math says you can't reliably meet it. If you promised 99.99%, you're in breach before you even start.
 

The RTO/RPO Cascade Effect

 
RTO and RPO have similar cascade problems that most organizations ignore.

 

RTO Cascade Example

 
Your RTO is not just "how long until we restart the server." It's the sum of:

  1. Detection Time: How long until you know something is wrong?

    • Monitoring delay: 1-5 minutes
    • Alert processing: 1-2 minutes
    • Initial investigation: 5-15 minutes
  2. Diagnosis Time: Understanding what failed and why

    • Simple issues: 5-10 minutes
    • Complex issues: 30-120 minutes
    • Third-party dependency issues: Unknown (you're waiting on them)
  3. Decision Time: Deciding on the recovery approach

    • Clear runbook: 2-5 minutes
    • Novel scenario: 10-30 minutes
    • Need management approval: Add 15-60 minutes
  4. Execution Time: Actually performing the recovery

    • Restart service: 2-5 minutes
    • Failover to backup region: 10-30 minutes
    • Restore from backup: 30 minutes to hours
    • Rebuild infrastructure: Hours to days
  5. Validation Time: Confirming the system is actually healthy

    • Automated health checks: 2-5 minutes
    • Manual verification: 5-15 minutes
    • Customer validation: Ongoing

 
Real RTO for a "simple" database failover:

Detection (5 min) + Diagnosis (10 min) + Decision (5 min) + 
Execution (15 min) + Validation (5 min) = 40 minutes minimum

And this is for a well-practiced scenario with good runbooks. Add third-party dependencies, and your RTO balloons.

If an AWS RDS failover plus the surrounding recovery work takes 30 minutes in a bad case, and you need 15 minutes to validate and restore service, your minimum RTO is 45 minutes, no matter how fast your team is.
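As a sketch, you can model this as a sum of per-stage ranges in Python (the figures mirror the worked example above; plug in your own measurements):

```python
# (best_case, worst_case) in minutes for each recovery stage -- illustrative
stages = {
    "detection":  (2, 5),
    "diagnosis":  (5, 10),
    "decision":   (2, 5),
    "execution":  (5, 15),   # e.g. failover to a standby
    "validation": (2, 5),
}

best = sum(lo for lo, _ in stages.values())
worst = sum(hi for _, hi in stages.values())
print(f"RTO range: {best}-{worst} minutes")  # commit to the worst case, not the best
```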

 

RPO Cascade Example

 
RPO calculations are often even worse. Consider this common architecture:

  1. Application writes to primary database (replicated to standby with ~1 second lag)
  2. Database backed up to S3 every hour
  3. S3 replicated to another region (asynchronous, ~15 minute lag)

 
Your RPO in different failure scenarios:

  • Primary DB failure, standby healthy: ~1 second (replication lag)
  • Primary DB failure, standby corrupted: Up to 1 hour (last backup)
  • Regional failure: Up to 1 hour 15 minutes (backup + cross-region lag)
  • S3 regional failure during restore: Potentially hours (need to restore from alternate region)

 
You cannot promise "near-zero RPO" if your backup strategy involves hourly snapshots.
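The same reasoning in a Python sketch (scenario names and figures are the illustrative ones above): your committed RPO must cover the worst realistic scenario, not the happy path:

```python
# Worst-case data loss per failure scenario, in minutes -- illustrative
rpo_by_scenario = {
    "primary fails, standby healthy": 1 / 60,   # ~1 second of replication lag
    "primary fails, standby corrupted": 60,     # back to the last hourly backup
    "regional failure": 75,                     # backup age + cross-region lag
}

committed_rpo = max(rpo_by_scenario.values())
print(f"Committed RPO: {committed_rpo:.0f} minutes")  # 75
```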

 

The Supplier SLA Fine Print

 
Here's what makes this even more complex: Not all SLA breaches are created equal.

AWS's Actual Commitments

 
Looking at the October 2025 AWS outage, many services were down for 14+ hours. Let's look at what AWS actually promises:

AWS EC2 (Instance-level):

  • SLA: 99.99% monthly uptime at the region level (the instance-level SLA is only 99.5%)
  • Credit: 10% if below 99.99%, 30% if below 99.0%
  • What they don't cover: Control plane unavailability (can't launch new instances)

AWS RDS:

  • Multi-AZ: 99.95% monthly uptime
  • Single-AZ: No SLA
  • What they don't cover: Performance degradation, replication lag

AWS DynamoDB:

  • Global Tables: 99.999% monthly uptime
  • Standard Tables: 99.99% monthly uptime
  • What this covers: Data plane only (read/write requests)
  • What they don't cover: DNS resolution failures (as we saw in the outage)

The Critical Detail: During the October outage, DynamoDB's data plane technically met its SLA because the DNS issue prevented requests from reaching it. No requests = no failed requests = SLA maintained. This is technically correct but useless to customers who couldn't connect.

 

Understanding SLA Credits

 
Even when suppliers breach SLAs, the remediation is limited:

AWS SLA Credits:

  • 10% credit for 99.0-99.99% availability (depending on service)
  • 30% credit for 95.0-99.0% availability
  • Maximum credit: 100% of monthly service charges

What this means in practice:

  • If you spend $10,000/month on AWS and they have a catastrophic failure, your maximum compensation is $10,000
  • If that outage cost your business $500,000 in lost revenue, you're not covered
  • Credits are NOT automatic: You must claim them within 30 days

This is why your SLA to customers cannot simply pass through supplier SLAs.
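To make the asymmetry concrete, a quick back-of-the-envelope calculation in Python (using the figures from the example above):

```python
monthly_aws_bill = 10_000        # USD
outage_business_loss = 500_000   # USD in lost revenue during the outage
max_credit_rate = 1.0            # best case: 100% of monthly service charges

max_compensation = monthly_aws_bill * max_credit_rate
uncovered = outage_business_loss - max_compensation
print(f"Uncovered loss: ${uncovered:,}")  # $490,000 -- that's your risk, not AWS's
```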

 

How to Actually Calculate Your SLAs

 

Step 1: Map Your Critical Path

Identify every component required for your core service to function:

Customer Request
    ↓
CDN/Load Balancer (Cloudflare: 100%*)
    ↓
Application Server (AWS EKS: 99.99%)
    ↓
├─ Authentication (Auth0: 99.99%)
├─ Database (AWS RDS: 99.95%)
├─ Cache (AWS ElastiCache: 99.99%)
├─ Object Storage (AWS S3: 99.99%)
└─ Payment Processing (Stripe: 99.99%)

Step 2: Calculate Theoretical Maximum

Multiply all dependencies in the critical path:

Base = 1.0 × 0.9999 × 0.9999 × 0.9995 × 0.9999 × 0.9999 × 0.9999
Base ≈ 0.9990 = 99.90%

Step 3: Apply Operational Reality Multiplier

Account for your own operations:

  • Application reliability: 99.9% (assumes mature application with good testing)
  • Deployment safety: 99.95% (assumes good CI/CD practices)
  • Configuration management: 99.9% (assumes IaC and proper change management)
  • Human error factor: 99.95% (assumes good runbooks and training)
Operational = 0.999 × 0.9995 × 0.999 × 0.9995 = 99.70%

Step 4: Calculate Realistic Maximum

Realistic Maximum = Base × Operational
                  = 0.9990 × 0.9970
                  = 0.9960 = 99.60%

Step 5: Add Safety Margin

Never promise your theoretical maximum. Add a safety margin of at least 0.5-1%:

Safe Customer SLA = 99.60% - 1.0% = 98.60%

Round down to the nearest standard tier: 98.5%.

This is your honest, achievable SLA.
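Putting Steps 1-5 together, here's a minimal Python sketch of the whole calculation (dependency and operational figures are the illustrative ones above):

```python
from math import prod

# Steps 1-2: critical-path dependency SLAs (CDN, EKS, Auth0, RDS, cache, S3, Stripe)
dependencies = [1.0, 0.9999, 0.9999, 0.9995, 0.9999, 0.9999, 0.9999]
# Step 3: app reliability, deployment safety, config management, human error
operational = [0.999, 0.9995, 0.999, 0.9995]

base = prod(dependencies)              # ~99.90%
realistic = base * prod(operational)   # Step 4: ~99.60%
safe = realistic - 0.01                # Step 5: 1% safety margin

standard_tiers = [0.9995, 0.999, 0.995, 0.99, 0.985, 0.98]
committed = max(t for t in standard_tiers if t <= safe)  # round down
print(f"Base {base:.4%} | realistic {realistic:.4%} | commit {committed:.2%}")
```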

Step 6: Calculate RTO Based on Dependencies

Map out your recovery time for each component failure:

| Component | Detection | Diagnosis | Decision | Execution | Validation | Total RTO |
|---|---|---|---|---|---|---|
| App Server | 2 min | 5 min | 2 min | 5 min | 3 min | 17 min |
| Database | 2 min | 10 min | 5 min | 30 min | 5 min | 52 min |
| Cache | 2 min | 5 min | 2 min | 10 min | 3 min | 22 min |
| Auth Provider | 2 min | 5 min | 2 min | Supplier-dependent | 5 min | Unknown |
| Payment Gateway | 2 min | 5 min | 2 min | Supplier-dependent | 5 min | Unknown |
Your committed RTO must be the longest of these: assume 60 minutes minimum to cover database failures.

For third-party services you don't control, you need to add their published RTOs (if they have them) or make conservative estimates based on historical performance.
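A sketch of that rule (component figures from the table above; the supplier estimates are placeholders you would replace with published RTOs or historical data):

```python
component_rto_min = {"app_server": 17, "database": 52, "cache": 22}
supplier_rto_min = {"auth_provider": 45, "payment_gateway": 45}  # conservative guesses

worst_case = max({**component_rto_min, **supplier_rto_min}.values())
safety_buffer = 10  # minutes
print(f"Committed RTO: {worst_case + safety_buffer} minutes")  # 62
```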

Step 7: Calculate RPO Based on Backup Strategy

Identify your data protection mechanisms:

  1. Database replication: 1-5 second lag (real-time-ish)
  2. Database automated backups: Every 6 hours
  3. Transaction log shipping: Every 15 minutes
  4. Cross-region replication: 30 minute lag

Worst-case RPO by scenario:

  • Local failure with healthy replica: ~5 seconds
  • Local failure requiring backup restore: Up to 15 minutes (last transaction log)
  • Regional failure: Up to 30 minutes (cross-region replication lag)

Your RPO promise should be: 30 minutes (worst realistic case)

If you need better RPO, you need to invest in more frequent backups or real-time replication.

 

The Management Conversation

Here's how to present this to leadership:
 

The Bad News

 

There is no way to drop this bomb softly, so rip the band-aid off fully: "Our current SLA promise of 99.99% uptime (52 minutes of downtime per year) is mathematically impossible given our dependencies. Our realistic maximum is 99.6%, which allows about 2.9 hours of downtime per month (roughly 35 hours per year)."
 

The Options

 
Option 1: Reduce SLA to realistic levels

  • Pros: Honest, achievable, reduces legal liability
  • Cons: May impact sales, customer confidence
  • Cost: Minimal
  • Recommendation: Most honest approach

Option 2: Over-engineer for the promised SLA

  • Pros: Maintain customer promise
  • Cons: Significant cost increase
  • Cost: Estimate 3-5x infrastructure costs for true 99.99%
  • Requirements:
    • Multi-region active-active (can you afford it? If you haven't read it yet, have a look at my previous article about it, here)
    • Eliminate single points of failure
    • Automated failover for all components
    • 24/7 on-call engineering team
    • Extensive monitoring and alerting

Option 3: Implement tiered SLAs

  • Pros: Different customer needs met at different price points
  • Cons: Increased complexity
  • Example:
    • Basic: 99.0% ($X/month)
    • Standard: 99.5% ($2X/month)
    • Premium: 99.9% ($5X/month)
    • Enterprise: 99.95% ($10X/month with dedicated support)

 

The Questions to Ask

 
Before committing to any SLA, leadership must answer:

  1. What is the actual business cost per hour of downtime?

    • Lost revenue
    • Customer churn
    • Regulatory penalties
    • Reputation damage
  2. What is the cost to achieve each availability tier?

    • Infrastructure costs
    • Engineering effort
    • Operational complexity
    • Monitoring and tooling
  3. What are our legal obligations if we breach SLA?

    • Service credits
    • Contract penalties
    • Regulatory fines
  4. What is our competitor landscape?

    • What SLAs do competitors offer?
    • What can we realistically achieve that's better?
  5. What is our customer expectation vs. requirement?

    • Customers may ask for 99.99% but function fine with 99.5%
    • Usage patterns matter (B2B vs. B2C, business hours vs. 24/7)

 

Practical Implementation Strategy

 

Phase 1: Audit (Month 1)

  1. Document all dependencies:

    • List every third-party service
    • Record their published SLAs
    • Identify services without SLAs (risk!)
  2. Measure actual performance:

    • Review 12 months of uptime data
    • Calculate real availability
    • Identify patterns (time of day, day of week)
  3. Map failure scenarios:

    • What happens when each component fails?
    • What's the recovery process?
    • Test your assumptions (actually run drills)

 

Phase 2: Calculate (Month 2)

  1. Build dependency tree: Create visual map of all dependencies
  2. Calculate theoretical maximum: Use the formulas above
  3. Apply operational reality: Be honest about your capabilities
  4. Determine safe promise: Add appropriate safety margin
  5. Calculate financial impact: Model costs of breaches vs. engineering improvements

 

Phase 3: Communicate (Month 3)

  1. Internal stakeholders:

    • Present findings to leadership
    • Show options and trade-offs
    • Get buy-in on approach
  2. Customer communication (if changing SLAs):

    • Plan communication strategy
    • Offer transitional options
    • Consider grandfathering existing customers
  3. Sales enablement:

    • Train sales on realistic promises
    • Create competitive positioning
    • Develop value-based messaging

 

Phase 4: Implement (Months 4-6)

  1. Update contracts: Legal review of new SLA terms
  2. Implement monitoring: Track against new SLAs
  3. Create runbooks: Document recovery procedures for each scenario
  4. Train teams: Ensure everyone understands the commitments
  5. Establish SLA reporting: Regular reporting on performance vs. commitment

 

Phase 5: Iterate (Ongoing)

  1. Monthly SLA review: Track performance
  2. Quarterly dependency audit: Has anything changed?
  3. Annual SLA assessment: Should commitments be adjusted?
  4. Continuous improvement: Address recurring failure modes

 

The Hard Truth About High Availability

 

Achieving true high availability is expensive. Here's what it actually takes:

 

For 99.9% (Three Nines)

  • Multi-AZ deployment within one region
  • Automated failover
  • Good monitoring
  • On-call rotation
  • Rough cost multiplier: 1.5-2x vs. single-AZ

For 99.99% (Four Nines)

  • Multi-region active-active or hot standby
  • Zero single points of failure
  • Automated everything
  • 24/7 operations team
  • Regular disaster recovery drills
  • Advanced monitoring and observability
  • Rough cost multiplier: 3-5x vs. single-region

For 99.999% (Five Nines)

  • Multi-region active-active mandatory
  • Chaos engineering as standard practice
  • Dedicated SRE team
  • Custom infrastructure tooling
  • Extensive automation
  • Global follow-the-sun support
  • Rough cost multiplier: 5-10x vs. single-region
  • Reality check: Few companies actually need this

 

Red Flags in SLA Discussions

 

Watch out for these warning signs:
 

🚩 "We'll promise 99.99% and just work really hard"

  • Hope is not a strategy
  • Availability requires engineering, not effort

🚩 "Our competitors promise 99.99%, so we have to match"

  • Your competitors might be lying too
  • Or they're losing money on it

🚩 "We've never had downtime, so 99.99% is safe"

  • Past performance doesn't guarantee future results
  • You need data, not anecdotes

🚩 "The SLA is just for sales, it doesn't really matter"

  • Legal liability says otherwise
  • Trust once broken is hard to rebuild

🚩 "We'll figure out the technical details after the contract is signed"

  • You're writing checks your infrastructure can't cash
  • This is how companies go out of business

 

Conclusion: Be Honest or Be Prepared

 

You have two choices:
 

  1. Promise what you can actually deliver (and have data to prove it)
  2. Engineer and pay for what you promise (and have budget to support it)

There is no third option. The math doesn't care about your sales targets.
 

The companies that succeed long-term are those that:

  • Understand their true dependencies
  • Calculate realistic SLAs based on supplier commitments
  • Invest appropriately in availability architecture
  • Communicate honestly with customers
  • Track and improve continuously  

The AWS outage taught us that even giant cloud providers with unlimited resources operate at 99.9% for many services. If AWS can't promise 99.99% for everything, why do you think you can?

 
Start with honesty. Build from there.


 

Appendix: SLA Calculation Worksheet

Use this worksheet to calculate your realistic SLA:

Step 1: List Critical Dependencies
1. _________________ (____% SLA)
2. _________________ (____% SLA)
3. _________________ (____% SLA)
4. _________________ (____% SLA)
5. _________________ (____% SLA)

Step 2: Calculate Dependency Chain
Base Availability = (Dep1) × (Dep2) × (Dep3) × (Dep4) × (Dep5)
                  = _______% 

Step 3: Apply Operational Factors
Application Reliability: _____%
Deployment Safety: _____%
Configuration Management: _____%
Human Error Factor: _____%

Operational Multiplier = _____%

Step 4: Calculate Realistic Maximum
Realistic Maximum = Base × Operational = _____%

Step 5: Add Safety Margin
Safety Margin: ____% (recommend 0.5-1%)
Safe SLA = Realistic Maximum - Safety Margin = _____%

Step 6: Map to Standard Tier
Your achievable SLA tier: _____%

Step 7: Calculate Allowed Downtime
Monthly downtime allowed: ______ minutes
Annual downtime allowed: ______ hours

Step 8: Determine RTO
Longest component recovery time: ______ minutes
Safety buffer: ______ minutes
Committed RTO: ______ minutes

Step 9: Determine RPO
Worst-case data loss: ______ minutes
Safety buffer: ______ minutes
Committed RPO: ______ minutes

Step 10: Financial Impact
Monthly revenue at risk per hour of downtime: $_______
Annual cost of SLA breaches at this tier: $_______
Cost to improve by one SLA tier: $_______
ROI calculation: _______

Is your company making impossible promises to customers? Have you run the math on your SLAs? Share your experiences in the comments.

Tags: #sla #rto #rpo #sre #devops #availability #cloudcomputing #systemsdesign #infrastructure #businesscontinuity
