Anderson Leite

Your SLA (maybe?) is a Lie: Why most companies get RTO, RPO and Service Level Agreements wrong

This article is a follow-up to "Multi-Cloud and Return to On-Prem Aren't Your Silver Bullets"

After publishing my analysis of the recent AWS outage and the multi-cloud/on-prem debate, I received several direct messages with variations of the same question: "How do we calculate our SLAs properly when we depend on third parties?"

Do you know how to do it? Most companies don't. They pick aspirational numbers that sound good in sales decks without understanding the mathematical reality of their dependencies. Worse, many don't realize that their promised SLAs are literally impossible given their suppliers' SLAs.

Let me show you why your 99.99% uptime promise might be mathematically impossible, and how to fix it.
 

The Fundamental Misunderstanding

 
Let's start with definitions that actually matter in practice. First, a graphical representation:

RPOxRTOxSLA
Note: I asked Google's Nano Banana to generate an image, but couldn't get it to fix the closing parentheses after "service level agreement" and "recovery time objective", so bear with me.
 
Recovery Time Objective (RTO): The maximum acceptable time your system can be down before you've failed your business requirements. This is measured in actual time units (minutes, hours) and represents the point where business damage becomes unacceptable.

Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. An RPO of 1 hour means you can tolerate losing up to 1 hour of data. RPO of zero means no data loss is acceptable.

Service Level Agreement (SLA): The contractual commitment you make to your customers about availability, typically expressed as a percentage (99.9%, 99.99%, etc.) and measured over a defined period (usually monthly or annually).

Here's what most companies get wrong: These three metrics are interconnected, and your suppliers' numbers directly constrain yours.
 

The Math Nobody Wants to Do

 

Understanding Availability Percentages

 
Let's make this concrete. Here's what different availability percentages actually mean:

| Availability | Downtime/Month | Downtime/Year | What This Really Means |
|---|---|---|---|
| 99% ("two nines") | 7.2 hours | 3.65 days | Unacceptable for most production systems |
| 99.5% | 3.6 hours | 1.83 days | Still pretty rough |
| 99.9% ("three nines") | 43.2 minutes | 8.76 hours | AWS IAM control plane |
| 99.95% | 21.6 minutes | 4.38 hours | Common for business services |
| 99.99% ("four nines") | 4.32 minutes | 52.56 minutes | Premium tier promise |
| 99.999% ("five nines") | 26 seconds | 5.26 minutes | Telecom-grade (very expensive) |

Most companies promise 99.9% or 99.99% without understanding what it takes to achieve it.
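To make the conversion explicit, here's a small Python sketch (my own illustration, using a 30-day month like the table above) that turns an availability percentage into allowed downtime:

```python
def allowed_downtime(availability_pct: float) -> tuple[float, float]:
    """Return (minutes of downtime allowed per month, hours per year)."""
    unavailability = 1 - availability_pct / 100
    monthly_minutes = unavailability * 30 * 24 * 60   # 30-day month
    annual_hours = unavailability * 365 * 24
    return monthly_minutes, annual_hours

for pct in (99.0, 99.9, 99.99, 99.999):
    per_month, per_year = allowed_downtime(pct)
    print(f"{pct}%: {per_month:.1f} min/month, {per_year:.2f} h/year")
```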
 

The Dependency Chain Problem

 
Here's where it gets real: Your availability is constrained by your dependencies.

If your application depends on AWS (99.9% SLA), a database service (99.95% SLA), and a payment gateway (99.9% SLA), your theoretical maximum availability is the product of these dependencies:

Your Max Availability = 0.999 × 0.9995 × 0.999 = 0.9975 = 99.75%

You cannot promise 99.99% uptime if your dependencies only give you 99.75%.

And this assumes perfect implementation on your part: No bugs, no deployment issues, no configuration errors. In reality, you need to account for your own operational reliability as well.
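That multiplication is trivial to automate. Here's a minimal Python sketch (the function name is mine) that computes the availability of components in series:

```python
from math import prod

def serial_availability(slas: list[float]) -> float:
    """Components in series: the request fails if any one of them is down."""
    return prod(slas)

# The three-dependency example from above: AWS, database, payment gateway
print(f"{serial_availability([0.999, 0.9995, 0.999]):.4%}")  # ~99.7502%
```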
 

The Real-World Example

 
Let's build a realistic scenario for a SaaS application:

Your Dependencies:

  1. AWS EC2/EKS (data plane): 99.99% SLA
  2. AWS RDS Multi-AZ: 99.95% SLA
  3. AWS S3: 99.99% designed availability (note: the SLA credits only kick in below that, and the famous eleven nines are durability, not availability)
  4. Your CDN (Cloudflare): 100% SLA (with credits, not guaranteed uptime)
  5. Auth0 for authentication: 99.99% SLA
  6. Stripe for payments: 99.99% SLA
  7. SendGrid for emails: 99.95% SLA

Theoretical maximum availability:

0.9999 × 0.9995 × 0.9999 × 1.0 × 0.9999 × 0.9999 × 0.9995 ≈ 0.9986 = 99.86%

But wait, there's more! We also need to account for:

  • Your application code and bugs (let's say 99.9% operational reliability)
  • Your deployment processes (let's say 99.95% - one bad deploy per year)
  • Your monitoring and response time (99.9%)
0.9986 × 0.999 × 0.9995 × 0.999 ≈ 0.9961 = 99.61%

If you promised customers a 99.9% SLA, the math says you can't reliably meet it. If you promised 99.99%, you're in breach before you even start.
 

The RTO/RPO Cascade Effect

 
RTO and RPO have similar cascade problems that most organizations ignore.

 

RTO Cascade Example

 
Your RTO is not just "how long until we restart the server." It's the sum of:

  1. Detection Time: How long until you know something is wrong?

    • Monitoring delay: 1-5 minutes
    • Alert processing: 1-2 minutes
    • Initial investigation: 5-15 minutes
  2. Diagnosis Time: Understanding what failed and why

    • Simple issues: 5-10 minutes
    • Complex issues: 30-120 minutes
    • Third-party dependency issues: Unknown (you're waiting on them)
  3. Decision Time: Deciding on the recovery approach

    • Clear runbook: 2-5 minutes
    • Novel scenario: 10-30 minutes
    • Need management approval: Add 15-60 minutes
  4. Execution Time: Actually performing the recovery

    • Restart service: 2-5 minutes
    • Failover to backup region: 10-30 minutes
    • Restore from backup: 30 minutes to hours
    • Rebuild infrastructure: Hours to days
  5. Validation Time: Confirming the system is actually healthy

    • Automated health checks: 2-5 minutes
    • Manual verification: 5-15 minutes
    • Customer validation: Ongoing

 
Real RTO for a "simple" database failover:

Detection (5 min) + Diagnosis (10 min) + Decision (5 min) + 
Execution (15 min) + Validation (5 min) = 40 minutes minimum

And this is for a well-practiced scenario with good runbooks. Add third-party dependencies, and your RTO balloons.

If an AWS RDS failover plus the surrounding recovery work takes 30 minutes in a bad case, and you need 15 minutes to validate and restore service, your minimum RTO is 45 minutes, no matter how fast your team is.
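As a sketch, you can model this as a sum of per-stage ranges in Python (the figures mirror the worked example above; plug in your own measurements):

```python
# (best_case, worst_case) in minutes for each recovery stage -- illustrative
stages = {
    "detection":  (2, 5),
    "diagnosis":  (5, 10),
    "decision":   (2, 5),
    "execution":  (5, 15),   # e.g. failover to a standby
    "validation": (2, 5),
}

best = sum(lo for lo, _ in stages.values())
worst = sum(hi for _, hi in stages.values())
print(f"RTO range: {best}-{worst} minutes")  # commit to the worst case, not the best
```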

 

RPO Cascade Example

 
RPO calculations are often even worse. Consider this common architecture:

  1. Application writes to primary database (replicated to standby with ~1 second lag)
  2. Database backed up to S3 every hour
  3. S3 replicated to another region (asynchronous, ~15 minute lag)

 
Your RPO in different failure scenarios:

  • Primary DB failure, standby healthy: ~1 second (replication lag)
  • Primary DB failure, standby corrupted: Up to 1 hour (last backup)
  • Regional failure: Up to 1 hour 15 minutes (backup + cross-region lag)
  • S3 regional failure during restore: Potentially hours (need to restore from alternate region)

 
You cannot promise "near-zero RPO" if your backup strategy involves hourly snapshots.
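The same reasoning in a Python sketch (scenario names and figures are the illustrative ones above): your committed RPO must cover the worst realistic scenario, not the happy path:

```python
# Worst-case data loss per failure scenario, in minutes -- illustrative
rpo_by_scenario = {
    "primary fails, standby healthy": 1 / 60,   # ~1 second of replication lag
    "primary fails, standby corrupted": 60,     # back to the last hourly backup
    "regional failure": 75,                     # backup age + cross-region lag
}

committed_rpo = max(rpo_by_scenario.values())
print(f"Committed RPO: {committed_rpo:.0f} minutes")  # 75
```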

 

The Supplier SLA Fine Print

 
Here's what makes this even more complex: Not all SLA breaches are created equal.

AWS's Actual Commitments

 
Looking at the October 2025 AWS outage, many services were down for 14+ hours. Let's look at what AWS actually promises:

AWS EC2 (Instance-level):

  • SLA: 99.99% monthly uptime at the region level (the instance-level SLA is only 99.5%)
  • Credit: 10% if below 99.99%, 30% if below 99.0%
  • What they don't cover: Control plane unavailability (can't launch new instances)

AWS RDS:

  • Multi-AZ: 99.95% monthly uptime
  • Single-AZ: No SLA
  • What they don't cover: Performance degradation, replication lag

AWS DynamoDB:

  • Global Tables: 99.999% monthly uptime
  • Standard Tables: 99.99% monthly uptime
  • What this covers: Data plane only (read/write requests)
  • What they don't cover: DNS resolution failures (as we saw in the outage)

The Critical Detail: During the October outage, DynamoDB's data plane technically met its SLA because the DNS issue prevented requests from reaching it. No requests = no failed requests = SLA maintained. This is technically correct but useless to customers who couldn't connect.

 

Understanding SLA Credits

 
Even when suppliers breach SLAs, the remediation is limited:

AWS SLA Credits:

  • 10% credit for 99.0-99.99% availability (depending on service)
  • 30% credit for 95.0-99.0% availability
  • Maximum credit: 100% of monthly service charges

What this means in practice:

  • If you spend $10,000/month on AWS and they have a catastrophic failure, your maximum compensation is $10,000
  • If that outage cost your business $500,000 in lost revenue, you're not covered
  • Credits are NOT automatic: You must claim them within 30 days

This is why your SLA to customers cannot simply pass through supplier SLAs.
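To make the asymmetry concrete, a quick back-of-the-envelope calculation in Python (using the figures from the example above):

```python
monthly_aws_bill = 10_000        # USD
outage_business_loss = 500_000   # USD in lost revenue during the outage
max_credit_rate = 1.0            # best case: 100% of monthly service charges

max_compensation = monthly_aws_bill * max_credit_rate
uncovered = outage_business_loss - max_compensation
print(f"Uncovered loss: ${uncovered:,}")  # $490,000 -- that's your risk, not AWS's
```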

 

How to Actually Calculate Your SLAs

 

Step 1: Map Your Critical Path

Identify every component required for your core service to function:

Customer Request
    ↓
CDN/Load Balancer (Cloudflare: 100%*)
    ↓
Application Server (AWS EKS: 99.99%)
    ↓
├─ Authentication (Auth0: 99.99%)
├─ Database (AWS RDS: 99.95%)
├─ Cache (AWS ElastiCache: 99.99%)
├─ Object Storage (AWS S3: 99.99%)
└─ Payment Processing (Stripe: 99.99%)

Step 2: Calculate Theoretical Maximum

Multiply all dependencies in the critical path:

Base = 1.0 × 0.9999 × 0.9999 × 0.9995 × 0.9999 × 0.9999 × 0.9999
Base ≈ 0.9990 = 99.90%

Step 3: Apply Operational Reality Multiplier

Account for your own operations:

  • Application reliability: 99.9% (assumes mature application with good testing)
  • Deployment safety: 99.95% (assumes good CI/CD practices)
  • Configuration management: 99.9% (assumes IaC and proper change management)
  • Human error factor: 99.95% (assumes good runbooks and training)
Operational = 0.999 × 0.9995 × 0.999 × 0.9995 = 99.70%

Step 4: Calculate Realistic Maximum

Realistic Maximum = Base × Operational
                  = 0.9990 × 0.9970
                  = 0.9960 = 99.60%

Step 5: Add Safety Margin

Never promise your theoretical maximum. Add a safety margin of at least 0.5-1%:

Safe Customer SLA = 99.60% - 1.0% = 98.60%

Round down to the nearest standard tier: 98.5%.

This is your honest, achievable SLA.
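Putting Steps 1-5 together, here's a minimal Python sketch of the whole calculation (dependency and operational figures are the illustrative ones above):

```python
from math import prod

# Steps 1-2: critical-path dependency SLAs (CDN, EKS, Auth0, RDS, cache, S3, Stripe)
dependencies = [1.0, 0.9999, 0.9999, 0.9995, 0.9999, 0.9999, 0.9999]
# Step 3: app reliability, deployment safety, config management, human error
operational = [0.999, 0.9995, 0.999, 0.9995]

base = prod(dependencies)              # ~99.90%
realistic = base * prod(operational)   # Step 4: ~99.60%
safe = realistic - 0.01                # Step 5: 1% safety margin

standard_tiers = [0.9995, 0.999, 0.995, 0.99, 0.985, 0.98]
committed = max(t for t in standard_tiers if t <= safe)  # round down
print(f"Base {base:.4%} | realistic {realistic:.4%} | commit {committed:.2%}")
```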

Step 6: Calculate RTO Based on Dependencies

Map out your recovery time for each component failure:

| Component | Detection | Diagnosis | Decision | Execution | Validation | Total RTO |
|---|---|---|---|---|---|---|
| App Server | 2 min | 5 min | 2 min | 5 min | 3 min | 17 min |
| Database | 2 min | 10 min | 5 min | 30 min | 5 min | 52 min |
| Cache | 2 min | 5 min | 2 min | 10 min | 3 min | 22 min |
| Auth Provider | 2 min | 5 min | 2 min | Supplier-dependent | 5 min | Unknown |
| Payment Gateway | 2 min | 5 min | 2 min | Supplier-dependent | 5 min | Unknown |
Your committed RTO must be the longest of these: assume 60 minutes minimum to cover database failures.

For third-party services you don't control, you need to add their published RTOs (if they have them) or make conservative estimates based on historical performance.
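A sketch of that rule (component figures from the table above; the supplier estimates are placeholders you would replace with published RTOs or historical data):

```python
component_rto_min = {"app_server": 17, "database": 52, "cache": 22}
supplier_rto_min = {"auth_provider": 45, "payment_gateway": 45}  # conservative guesses

worst_case = max({**component_rto_min, **supplier_rto_min}.values())
safety_buffer = 10  # minutes
print(f"Committed RTO: {worst_case + safety_buffer} minutes")  # 62
```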

Step 7: Calculate RPO Based on Backup Strategy

Identify your data protection mechanisms:

  1. Database replication: 1-5 second lag (real-time-ish)
  2. Database automated backups: Every 6 hours
  3. Transaction log shipping: Every 15 minutes
  4. Cross-region replication: 30 minute lag

Worst-case RPO by scenario:

  • Local failure with healthy replica: ~5 seconds
  • Local failure requiring backup restore: Up to 15 minutes (last transaction log)
  • Regional failure: Up to 30 minutes (cross-region replication lag)

Your RPO promise should be: 30 minutes (worst realistic case)

If you need better RPO, you need to invest in more frequent backups or real-time replication.

 

The Management Conversation

Here's how to present this to leadership:
 

The Bad News

 

There is no way to drop this bomb softly, so rip the band-aid off fully: "Our current SLA promise of 99.99% uptime (52 minutes of downtime per year) is mathematically impossible given our dependencies. Our realistic maximum is 99.6%, which allows about 2.9 hours of downtime per month (roughly 35 hours per year)."
 

The Options

 
Option 1: Reduce SLA to realistic levels

  • Pros: Honest, achievable, reduces legal liability
  • Cons: May impact sales, customer confidence
  • Cost: Minimal
  • Recommendation: Most honest approach

Option 2: Over-engineer for the promised SLA

  • Pros: Maintain customer promise
  • Cons: Significant cost increase
  • Cost: Estimate 3-5x infrastructure costs for true 99.99%
  • Requirements:
    • Multi-region active-active (can you afford it? If you haven't read it yet, have a look at my previous article about it, here)
    • Eliminate single points of failure
    • Automated failover for all components
    • 24/7 on-call engineering team
    • Extensive monitoring and alerting

Option 3: Implement tiered SLAs

  • Pros: Different customer needs met at different price points
  • Cons: Increased complexity
  • Example:
    • Basic: 99.0% ($X/month)
    • Standard: 99.5% ($2X/month)
    • Premium: 99.9% ($5X/month)
    • Enterprise: 99.95% ($10X/month with dedicated support)

 

The Questions to Ask

 
Before committing to any SLA, leadership must answer:

  1. What is the actual business cost per hour of downtime?

    • Lost revenue
    • Customer churn
    • Regulatory penalties
    • Reputation damage
  2. What is the cost to achieve each availability tier?

    • Infrastructure costs
    • Engineering effort
    • Operational complexity
    • Monitoring and tooling
  3. What are our legal obligations if we breach SLA?

    • Service credits
    • Contract penalties
    • Regulatory fines
  4. What is our competitor landscape?

    • What SLAs do competitors offer?
    • What can we realistically achieve that's better?
  5. What is our customer expectation vs. requirement?

    • Customers may ask for 99.99% but function fine with 99.5%
    • Usage patterns matter (B2B vs. B2C, business hours vs. 24/7)

 

Practical Implementation Strategy

 

Phase 1: Audit (Month 1)

  1. Document all dependencies:

    • List every third-party service
    • Record their published SLAs
    • Identify services without SLAs (risk!)
  2. Measure actual performance:

    • Review 12 months of uptime data
    • Calculate real availability
    • Identify patterns (time of day, day of week)
  3. Map failure scenarios:

    • What happens when each component fails?
    • What's the recovery process?
    • Test your assumptions (actually run drills)

 

Phase 2: Calculate (Month 2)

  1. Build dependency tree: Create visual map of all dependencies
  2. Calculate theoretical maximum: Use the formulas above
  3. Apply operational reality: Be honest about your capabilities
  4. Determine safe promise: Add appropriate safety margin
  5. Calculate financial impact: Model costs of breaches vs. engineering improvements

 

Phase 3: Communicate (Month 3)

  1. Internal stakeholders:

    • Present findings to leadership
    • Show options and trade-offs
    • Get buy-in on approach
  2. Customer communication (if changing SLAs):

    • Plan communication strategy
    • Offer transitional options
    • Consider grandfathering existing customers
  3. Sales enablement:

    • Train sales on realistic promises
    • Create competitive positioning
    • Develop value-based messaging

 

Phase 4: Implement (Months 4-6)

  1. Update contracts: Legal review of new SLA terms
  2. Implement monitoring: Track against new SLAs
  3. Create runbooks: Document recovery procedures for each scenario
  4. Train teams: Ensure everyone understands the commitments
  5. Establish SLA reporting: Regular reporting on performance vs. commitment

 

Phase 5: Iterate (Ongoing)

  1. Monthly SLA review: Track performance
  2. Quarterly dependency audit: Has anything changed?
  3. Annual SLA assessment: Should commitments be adjusted?
  4. Continuous improvement: Address recurring failure modes

 

The Hard Truth About High Availability

 

Achieving true high availability is expensive. Here's what it actually takes:

 

For 99.9% (Three Nines)

  • Multi-AZ deployment within one region
  • Automated failover
  • Good monitoring
  • On-call rotation
  • Rough cost multiplier: 1.5-2x vs. single-AZ

For 99.99% (Four Nines)

  • Multi-region active-active or hot standby
  • Zero single points of failure
  • Automated everything
  • 24/7 operations team
  • Regular disaster recovery drills
  • Advanced monitoring and observability
  • Rough cost multiplier: 3-5x vs. single-region

For 99.999% (Five Nines)

  • Multi-region active-active mandatory
  • Chaos engineering as standard practice
  • Dedicated SRE team
  • Custom infrastructure tooling
  • Extensive automation
  • Global follow-the-sun support
  • Rough cost multiplier: 5-10x vs. single-region
  • Reality check: Few companies actually need this

 

Red Flags in SLA Discussions

 

Watch out for these warning signs:
 

🚩 "We'll promise 99.99% and just work really hard"

  • Hope is not a strategy
  • Availability requires engineering, not effort

🚩 "Our competitors promise 99.99%, so we have to match"

  • Your competitors might be lying too
  • Or they're losing money on it

🚩 "We've never had downtime, so 99.99% is safe"

  • Past performance doesn't guarantee future results
  • You need data, not anecdotes

🚩 "The SLA is just for sales, it doesn't really matter"

  • Legal liability says otherwise
  • Trust once broken is hard to rebuild

🚩 "We'll figure out the technical details after the contract is signed"

  • You're writing checks your infrastructure can't cash
  • This is how companies go out of business

 

Conclusion: Be Honest or Be Prepared

 

You have two choices:
 

  1. Promise what you can actually deliver (and have data to prove it)
  2. Engineer and pay for what you promise (and have budget to support it)

There is no third option. The math doesn't care about your sales targets.
 

The companies that succeed long-term are those that:

  • Understand their true dependencies
  • Calculate realistic SLAs based on supplier commitments
  • Invest appropriately in availability architecture
  • Communicate honestly with customers
  • Track and improve continuously  

The AWS outage taught us that even giant cloud providers with unlimited resources operate at 99.9% for many services. If AWS can't promise 99.99% for everything, why do you think you can?

 
Start with honesty. Build from there.


 

Appendix: SLA Calculation Worksheet

Use this worksheet to calculate your realistic SLA:

Step 1: List Critical Dependencies
1. _________________ (____% SLA)
2. _________________ (____% SLA)
3. _________________ (____% SLA)
4. _________________ (____% SLA)
5. _________________ (____% SLA)

Step 2: Calculate Dependency Chain
Base Availability = (Dep1) × (Dep2) × (Dep3) × (Dep4) × (Dep5)
                  = _______% 

Step 3: Apply Operational Factors
Application Reliability: _____%
Deployment Safety: _____%
Configuration Management: _____%
Human Error Factor: _____%

Operational Multiplier = _____%

Step 4: Calculate Realistic Maximum
Realistic Maximum = Base × Operational = _____%

Step 5: Add Safety Margin
Safety Margin: ____% (recommend 0.5-1%)
Safe SLA = Realistic Maximum - Safety Margin = _____%

Step 6: Map to Standard Tier
Your achievable SLA tier: _____%

Step 7: Calculate Allowed Downtime
Monthly downtime allowed: ______ minutes
Annual downtime allowed: ______ hours

Step 8: Determine RTO
Longest component recovery time: ______ minutes
Safety buffer: ______ minutes
Committed RTO: ______ minutes

Step 9: Determine RPO
Worst-case data loss: ______ minutes
Safety buffer: ______ minutes
Committed RPO: ______ minutes

Step 10: Financial Impact
Monthly revenue at risk per hour of downtime: $_______
Annual cost of SLA breaches at this tier: $_______
Cost to improve by one SLA tier: $_______
ROI calculation: _______

Is your company making impossible promises to customers? Have you run the math on your SLAs? Share your experiences in the comments.

Tags: #sla #rto #rpo #sre #devops #availability #cloudcomputing #systemsdesign #infrastructure #businesscontinuity
