**Quick answer: Only the SLA actually entitles you to money when your vendor goes down. The SLO is the vendor's internal target, almost always tighter than the SLA they sold you, and the gap between the two is the vendor's safety margin and your blind spot. Atlassian's April 2022 outage took 13 days to fully restore, and the contract caps service credits at 100% of one month's fees. If you want to know whether the vendor hit the number, you need an independent SLI source. We will not redo the credit-mechanics walkthrough here. The vendor SLA receipts and the file-a-claim deep dive live in our hosting SLA verification post.**
The three letters in 60 seconds
The three letters get treated as interchangeable in vendor marketing. They are not. From a customer's point of view, only one of them comes with a credit you can actually claim, and the other two exist mostly to set expectations the vendor never formally agreed to. The customer-side definitions are short.
SLI: what the vendor measures
A Service Level Indicator is the metric the vendor measures their own service against. The Google SRE Book's SLO chapter is the canonical reference ( sre.google/sre-book/service-level-objectives ). It frames the SLI as a quantitative measure of some aspect of the service: request latency, error rate, throughput, or availability. For a customer-facing SLA the SLI is almost always some flavor of monthly uptime percentage. AWS EC2's Region-Level SLA, for example, defines an availability incident as all your running instances across two or more Availability Zones in the same region losing external connectivity, with monthly uptime calculated as the percentage of minutes in the month during which that condition held. The SLI is just the measurement. It does not commit the vendor to anything by itself.
SLO: what the vendor commits internally
A Service Level Objective is the internal target the vendor's engineers are paged against. The Google SRE Book SLO chapter recommends keeping a safety margin by setting an internal SLO tighter than the SLO advertised to users, so engineering teams have room to recover before the contractual line is crossed. A vendor might run Confluence Cloud internally to a 99.99% SLO while publishing 99.9% in the SLA. Missing the SLO triggers an internal incident review. It does not entitle the customer to anything. You never see the SLO number, and the SLO has no contractual force outside the vendor's engineering org.
SLA: what the vendor commits to YOU contractually
A Service Level Agreement is the only one of the three the vendor signed in writing with you. Microsoft's own Azure reliability documentation is explicit on the contractual binding ( learn.microsoft.com/en-us/azure/reliability/concept-service-level-agreements ): the SLA is the formal financial commitment, defining the uptime target, the credit schedule, and the claim process. If the vendor misses the SLO but hits the SLA, you get nothing. If they miss the SLA, you get a service credit, almost never a cash refund. The SLA is the only one of the three letters that pays out, and only if you file the claim with evidence inside the window.
Why the gap between SLO and SLA matters
Most articles about SLA vs SLO vs SLI are written for vendor SREs. The structural insight that matters from a customer angle is simpler. The SLO is tighter than the SLA. The space between them is where most outages live, and you have no contractual claim on that space at all.
The vendor's SLO is tighter than the SLA they sold you
The Google SRE Book SLO chapter ( sre.google/sre-book/service-level-objectives ) recommends keeping a safety margin by setting an internal SLO tighter than the SLO advertised to users. The chapter's primary distinction between SLO and SLA is consequences: if there is no explicit consequence for missing the target, you are looking at an SLO, not an SLA. The chapter frames the internal SLO as the operational alarm, and the SLA as the legal floor. The vendor knows this. Their SREs are paged on the SLO. Their lawyers wrote the SLA. The customer mostly sees marketing copy that quotes the looser of the two and presents it as a guarantee.
The gap is the vendor's safety margin
A concrete example. Suppose a vendor publishes a 99.9% monthly SLA and runs internally to a 99.99% SLO. 99.9% allows roughly 43 minutes 12 seconds of downtime per 30-day month. 99.99% allows about 4 minutes 19 seconds. The 0.09-percentage-point gap is the vendor's monthly safety margin: roughly 39 minutes that the vendor can burn through quietly without owing anyone a credit. Most months the vendor consumes some of that buffer. Their SREs handle the noisy incidents internally. You see an ordinary green status page and no credit ever fires. The system works as designed. The design is just not optimized for the customer's benefit.
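The budget math above is mechanical enough to script. A minimal sketch (plain Python, no dependencies; the function name is ours, not from any SLA tooling) that computes the downtime a monthly uptime percentage permits, and the SLO/SLA gap:

```python
def allowed_downtime_seconds(sla_pct: float, days: int = 30) -> float:
    """Seconds of downtime a monthly uptime SLA permits over `days` days."""
    return days * 24 * 3600 * (1 - sla_pct / 100)

sla_budget = allowed_downtime_seconds(99.9)    # ~2592 s = 43 min 12 s
slo_budget = allowed_downtime_seconds(99.99)   # ~259 s  =  4 min 19 s
margin_minutes = (sla_budget - slo_budget) / 60  # ~38.9 min: the quiet buffer
```

Run it against your own vendor's published percentage before assuming the marketing number and the contract number describe the same amount of downtime.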
The gap is your blind spot
You care about the SLA, but the vendor's status page and marketing pages talk in SLO-shaped language. "We aim for 99.99%" is an SLO sentence. "Our service is guaranteed at 99.9% per the SLA" is an SLA sentence. The two look similar in marketing copy and lead you to believe you have a tighter contractual claim than you actually have. Reading the actual SLA legal page is the only way to know what was signed. The number you saw in marketing is almost never the number in the legal document, and the legal document is the only one a court or a credit-claim adjudicator cares about.
When 13 days of downtime gets you a $0 credit
The cleanest case study of why the SLA gap matters is Atlassian's April 2022 Cloud outage. On April 5, 2022, a faulty maintenance script ran inside Atlassian Cloud and permanently deleted the active customer data of 883 customer sites belonging to 775 customers, instead of the legacy data it was supposed to target. Atlassian published a detailed post-incident review on their engineering blog ( atlassian.com/blog/atlassian-engineering/post-incident-review-april-2022-outage ) describing the cascade of small mistakes that produced the incident. Pragmatic Engineer's reporting at the time ( newsletter.pragmaticengineer.com/p/scoop-atlassian ) added the customer-side timeline.
The recovery was slow. The first restored customer sites came back on April 8, three days after the incident began. Full restoration of all affected sites was completed on April 18, 13 days in. For a Confluence or Jira tenant, those 13 days were not a degraded experience. They were a complete outage of business-critical tooling for product teams, support teams, and engineering teams who lived inside those products every day.
Now read the SLA. Atlassian publishes their service level agreement at atlassian.com/legal/sla. At time of writing, Atlassian's SLA covers Premium (99.9%) and Enterprise (99.95%) tiers only, caps service credits at 100% of the affected Cloud Product's monthly invoice, and requires customers to file a credit claim within 15 days of the end of the calendar month in which the failure occurred. The cap is the ceiling on the financial obligation. It does not scale with the duration of the outage past that ceiling.
Run the math on a typical affected tenant. A Confluence Standard tenant at $10 per user per month with 50 seats bills $500 a month. Even a 100% credit on one month's fees caps the recovery at $500. 13 days of business-critical tooling unavailable cost most affected teams orders of magnitude more than $500: blocked product launches, missed support SLAs of their own, customer churn, contractor hours rebuilding wikis from cached pages. The contract Atlassian signed was honored. The contract was not designed to compensate you for the downstream impact of a 13-day outage.
The punchline is not that Atlassian acted in bad faith. They published a long, candid post-incident review and they paid credits per the contract they signed. The punchline is that reading the SLA before signing would have flagged the credit cap and the claim window, and an affected team would at least have known what their contractual exposure looked like in the worst case. The lost productivity was uninsurable under the contract the vendor signed, and that is the structural truth across most enterprise SaaS SLAs, not just Atlassian's.
We are not redoing the vendor SLA receipts table here. For a verbatim seven-vendor table with credit mechanics and claim deadlines drawn directly from each vendor's legal page, see the seven-vendor SLA receipts table.
How to read your vendor's SLA in 5 minutes
Most SLA legal pages run two to four pages of dense formatting. You do not need to read every clause. There are five things that actually matter, and they are usually in the same order across vendors. Skim the headline percentage, then go to the parts that determine what the percentage covers and how you collect.
Find the carve-out section first
Every SLA is mostly about what is excluded, not what is covered. Microsoft's Azure SLA reading guide ( learn.microsoft.com/en-us/azure/reliability/concept-service-level-agreements ) walks through the standard exclusion structure: scheduled maintenance, force majeure, customer-caused issues, third-party network failures, and beta features. Atlassian's legal page uses the same structure. AWS does too. The carve-outs are not bad faith. They are standard. The point is that a 99.99% SLA with broad carve-outs (every weekly maintenance window, every partner outage, every beta-tagged feature) is materially weaker than a 99.9% SLA with narrow carve-outs.
Find the credit cap
Almost every SLA caps the credit at 100% of one month's fees on the affected service, and most credit schedules max out well before that. On a Hostinger shared plan at $4 a month, the 5% credit for a missed-SLA month is 20 cents. On a Kinsta plan at $40 a month, a one-hour outage credit is $4. The dollar amounts are small at the bottom of the market and not much larger at the enterprise end relative to the cost of an extended outage. The credit is rarely the leverage. The documented record that you are owed it is.
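The cap logic is simple enough to sketch. Assuming a schedule where a missed month earns a percentage credit that can never exceed 100% of the monthly fee (the percentages below are illustrative, back-derived from the dollar figures above, not quoted from any vendor's legal page):

```python
def service_credit(monthly_fee: float, credit_pct: float, cap_pct: float = 100.0) -> float:
    """Credit owed for a missed-SLA month: a percentage of the fee, never above the cap."""
    return round(min(credit_pct, cap_pct) / 100 * monthly_fee, 2)

service_credit(4.0, 5)     # Hostinger-style shared plan: $0.20
service_credit(40.0, 10)   # Kinsta-style one-hour outage tier: $4.00
service_credit(500.0, 250) # cap binds: never more than the $500 monthly fee
```

The point of running the numbers is exactly what the paragraph above says: the dollar output is small everywhere, so the credit is documentation leverage, not compensation.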
Find the claim deadline
Claim windows vary widely. Cloudflare Business is 5 business days. OVHcloud allows 60 calendar days. 30 days is the most common window across SaaS and hosting. You are required to file inside the window, with evidence, or the credit is forfeited regardless of how clear the outage was. If you do not have monitoring running before the outage, by the time you notice the analytics decline the next week, the Cloudflare window has already closed.
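For windows measured in calendar days from the end of the month (the Atlassian-style shape; a business-day window like Cloudflare's would additionally need a workday calendar), the filing deadline is a short date calculation. A sketch, with a function name of our own invention:

```python
import calendar
from datetime import date, timedelta

def claim_deadline(outage_day: date, window_days: int) -> date:
    """Last day to file: `window_days` calendar days after the end of the outage month."""
    month_end = date(outage_day.year, outage_day.month,
                     calendar.monthrange(outage_day.year, outage_day.month)[1])
    return month_end + timedelta(days=window_days)

claim_deadline(date(2022, 4, 5), 15)  # 15-day window after an April outage: 2022-05-15
```

Put the computed date in a calendar reminder the day the outage ends, not the day you get around to the paperwork.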
Find the SLI definition
The SLA points back to a specific SLI, and the SLI definition determines what counts as "up." AWS's EC2 SLA ( aws.amazon.com/compute/sla ) defines the SLI at the region level, not the instance level, for the headline number. Vercel's SLA ( vercel.com/legal/sla ) excludes Hobby and Pro entirely from any uptime SLI. If your single EC2 instance was down but the rest of your multi-AZ deployment was still externally reachable, the AWS Region-Level SLA was not breached and you have no claim against the 99.99% commitment. The SLI definition is where the percentage actually gets calculated, and where most surprises live.
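A simplified reading of that region-level condition: a minute counts against the SLI only when the deployment spans two or more AZs and every running instance has lost external reachability. A sketch under that reading (the data shape is assumed for illustration, not an AWS API):

```python
def region_minute_unavailable(az_instance_up: dict[str, list[bool]]) -> bool:
    """True when a multi-AZ deployment has no externally reachable instance this minute."""
    # The region-level SLA only applies to deployments spanning two or more AZs.
    spans_multiple_azs = len([az for az, inst in az_instance_up.items() if inst]) >= 2
    all_down = not any(up for inst in az_instance_up.values() for up in inst)
    return spans_multiple_azs and all_down

# One instance down, the other AZ still reachable: SLA untouched, no claim.
region_minute_unavailable({"us-east-1a": [False], "us-east-1b": [True, True]})  # False
```

This is exactly the surprise the paragraph describes: an outage that feels total from one instance's point of view can score as 100% available under the contractual SLI.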
Hand-off to the deep dive
Once you know the carve-outs, the cap, the deadline, and the SLI definition, the remaining work is filing the claim with evidence. For credit-claim mechanics across seven major hosts, including the verbatim evidence the SLAs demand and the per-host claim deadlines, see the credit-claim mechanics walkthrough.
How to track the SLI yourself
The SLA is enforceable, but only with evidence. The vendor calculates the SLI from their own infrastructure, and they get to decide what the headline number was for the month. To file a credible claim, or to know whether the vendor is quietly consuming the SLO/SLA gap month after month, you need an independent SLI source measured from outside the vendor's network. This is also why standard monitoring misses real outages when it is configured wrong: low-frequency probes, a single probe region, no coverage of auth-protected pages.
Why vendor self-reporting can't be the only source
AWS's own post-mortem on the December 7, 2021 us-east-1 outage ( aws.amazon.com/message/12721 ) admits a 52-minute gap between when the outage began and when their Service Health Dashboard reflected it. AWS is the most resourced cloud vendor on Earth and their own status page lagged a major incident by nearly an hour. The structural reason is that the status page links to SLA financial exposure and the publishing decision sits with operations, not real-time automation. The full reporting and quotes from former AWS engineers on this live in our hosting SLA verification post. For the SLA-tracking purposes here, the takeaway is enough: the vendor's self-reported SLI is not, by itself, evidence you can rely on.
What an independent monitor needs to do
An SLI tracker for a third-party vendor needs four things: probe cadence at less than half the SLA error budget (a 5-minute interval is the slowest you should run for a 99.9% monthly SLA, which leaves about 43 minutes of allowed downtime), multi-region coverage so a single-region blip does not look like a vendor outage, calendar-month rollups because every published SLA is calculated on a calendar-month basis, and incident history that retains at least the longest claim window in your stack (60 days for OVHcloud) so the data is still around when you file.
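Those requirements reduce to a small rollup: bucket timestamped probe results by calendar month, count an interval as down only when probes from every region failed (so a single-region blip does not score as a vendor outage), and report the monthly percentage. A sketch over an assumed `(timestamp, region, ok)` record shape, not any particular monitoring tool's export format:

```python
from collections import defaultdict
from datetime import datetime

def monthly_uptime(probes: list[tuple[datetime, str, bool]]) -> dict[tuple[int, int], float]:
    """Calendar-month uptime % from multi-region probe results.

    An interval counts as down only when ALL regions' probes failed,
    so a single-region network blip is not scored as a vendor outage.
    """
    by_interval = defaultdict(list)
    for ts, region, ok in probes:
        month = (ts.year, ts.month)
        interval = ts.replace(second=0, microsecond=0)  # bucket to the probe minute
        by_interval[(month, interval)].append(ok)

    months = defaultdict(lambda: [0, 0])  # month -> [down intervals, total intervals]
    for (month, _), results in by_interval.items():
        months[month][0] += not any(results)  # down only if every region failed
        months[month][1] += 1
    return {m: 100 * (1 - down / total) for m, (down, total) in months.items()}
```

Note the uptime here is over sampled intervals, which is why the probe cadence requirement above matters: a 5-minute grid can miss short outages entirely, so it understates downtime, never overstates your claim.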
How Velprove fits
Velprove is an independent SLI source for your own SLO tracking. We are not an SLO management tool. We do not let you define formal SLOs, calculate error budgets, or alert on burn-rate windows. We probe the URLs you give us from outside the vendor's network from 5 global regions, including a browser login monitor for auth-protected pages, and store timestamped probe results rolled up to monthly uptime percentages and incident counts. The SLO definition stays with you, in whatever spreadsheet or internal tracker you already use. Velprove provides the measurement.
The free plan is sized for exactly this: 10 monitors, 5-minute intervals on HTTP, 1 browser login monitor at 15-minute intervals, monitoring from 5 global regions, 30-day incident history, email alerts, and SSL certificate monitoring. Multi-step API monitors with up to 3 steps are included. No credit card required. The Free plan is enough to track an SLI for a small set of vendors and produce calendar-month rollups you can attach to a credit claim.
When the SLA is too weak to bother tracking
Not every SLA is worth the effort. Some plans have no SLA at all, and some SLAs have carve-outs broad enough to make the headline percentage close to meaningless. In those cases the honest answer is to monitor for your own awareness, not for credit recovery.
Vercel Hobby and Pro: no SLA at all
Vercel's SLA ( vercel.com/legal/sla ) is titled "Enterprise Service Level Agreement" and applies only to Enterprise customers. Hobby and Pro have no published uptime SLA at all. If you are on Vercel Hobby or Pro, you have no contractual recourse for downtime, period. You are still welcome to monitor the site for your own benefit, but there is no credit claim to file when an outage happens.
Most consumer SaaS: 99.9% with carve-outs that gut the number
A common pattern across mid-market SaaS: published 99.9%, scheduled maintenance excluded, third-party services excluded, beta features excluded, customer-side network issues excluded. By the time the carve-outs apply, the effective SLA may cover a quite narrow slice of real failures. The 99.9% headline is accurate as a marketing number and weak as a contractual one. You can still track the SLI honestly. The credit math just is not going to be the reason you bother.
When to renegotiate or migrate
Renegotiation is rare and almost always reserved for Enterprise tiers with named account teams. For most plans the realistic options are accept the SLA as a best-effort dependency or migrate to a vendor with a tighter contract. If you are weighing monitoring tools as part of that decision, our review of how the major monitoring tools compare covers what matters when you are picking the SLI source itself.
Frequently Asked Questions
What is the difference between SLA, SLO, and SLI?
SLI is the metric the vendor measures (e.g., monthly uptime percentage); SLO is the internal target the vendor's engineers are paged against (typically tighter than the SLA); SLA is the contractual commitment to the customer that triggers a service credit if missed. Of the three, only the SLA is enforceable by the customer. The SLO is the vendor's internal goal. The SLI is the input both the SLO and the SLA are calculated from. The Google SRE Book SLO chapter (sre.google/sre-book/service-level-objectives) is the canonical reference for the distinction.
Is SLO part of SLA?
No, but they are related. The SLA contains an SLO-shaped commitment (e.g., "99.9% monthly uptime") that the vendor signs in writing. The vendor's actual internal SLO is typically tighter than the published SLA number, treated as a private operational target. The two are distinct documents. The SLA is legal. The SLO is operational. The customer only sees the SLA.
What is an example of SLI vs SLO?
SLI: AWS EC2's Region-Level SLA defines an availability incident as all your running instances across two or more Availability Zones in the same region losing external connectivity, with monthly uptime calculated as the percentage of minutes in the month during which that condition held. SLO: AWS's internal target for that SLI, set tighter than the published 99.99% SLA so engineers are paged before the SLA is breached. The SLI is the measurement. The SLO is the goal. The SLA wraps a customer-facing version of the SLO into a contract. The AWS Compute SLA (aws.amazon.com/compute/sla) is the public-facing example.
Are SLAs legally binding?
Yes, but the binding is narrower than most customers assume. The SLA commits the vendor to a specific service credit if a specific SLI is missed by a specific amount, claimed within a specific window. It does not commit the vendor to compensate the customer for downstream damages, lost revenue, or reputational harm. Most SLAs explicitly disclaim consequential damages. The credit is almost always capped at no more than one month's fees on the affected service. Microsoft's Azure SLA reading guide (learn.microsoft.com/en-us/azure/reliability/concept-service-level-agreements) is a clean reference for the legal shape.
What happens when an SLO is missed but the SLA is met?
Nothing happens to the customer. The vendor's internal SREs are paged and the engineering team works to recover, but no service credit is owed because the SLA threshold was not crossed. This is the gap that makes the SLO/SLA distinction matter to customers. The vendor consumed their internal safety margin. The customer experienced degraded service. The contract is silent.
How do I measure my own SLI for a third-party vendor?
Run an independent monitor that probes the vendor from outside their network on a fixed schedule, captures timestamps and response codes, and stores monthly rollups. The monitor needs multi-region coverage so a regional probe failure does not look like a vendor outage, and a probe interval at less than half the SLA error budget. Velprove's free plan covers exactly this surface: 10 HTTP monitors at 5-minute intervals from 5 global regions, 1 browser login monitor at 15-minute intervals, 30-day incident history, no credit card required. Calendar-month rollups for credit claims included.
What's the most important thing to read in an SLA before signing?
The carve-out section, not the headline percentage. Every major SLA excludes scheduled maintenance, force majeure, customer-caused issues, third-party network failures, and beta features. The exclusions determine what the percentage actually covers. A 99.99% SLA with broad carve-outs is materially weaker than a 99.9% SLA with narrow carve-outs. After the carve-outs, read the credit cap (almost always capped at 100% of one month's fees) and the claim window (5 to 60 days, varies wildly).
Want to track your vendor's SLI honestly?
The SLA is the only one of the three that pays out, and only if you have the evidence. Velprove's free plan is an independent SLI source for your own SLO tracking: 10 HTTP monitors, 1 browser login monitor every 15 minutes, 5 global regions, 30-day incident history, no credit card required. We are not enterprise SLA tracking software. We are the calendar-month receipts you bring when the credit window opens. Start a free Velprove account.