DEV Community: AlertSleep

How We Handle SSL Certificate Expiration Alerts at Scale

AlertSleep — Fri, 17 Apr 2026 07:10:00 +0000

It was a Tuesday morning in June 2021. LinkedIn — a platform used daily by hundreds of millions of professionals — went partially down. Not because of a DDoS attack, a bad deploy, or a database failure. Their SSL certificate had expired.

The issue was resolved within hours, but the damage was done: broken links, frustrated users, and a very public reminder that one of the most preventable failures in infrastructure still catches well-resourced engineering teams off guard. LinkedIn was not alone. Microsoft Teams suffered a similar SSL expiry incident in 2020. Spotify has had certificate-related hiccups. Even government sites regularly show up in breach reports because of expired certs.

If it can happen to them, it can happen to you.

What SSL Certificates Actually Are (and Why They Expire)

An SSL/TLS certificate is a cryptographically signed document that proves your server is who it says it is. It binds your domain name to a public key, and a trusted Certificate Authority (CA) vouches for that binding.

There are three main validation levels:

DV (Domain Validation) — Cheapest and fastest. CA only verifies you control the domain. Used by most personal sites and small services. 90-day Let's Encrypt certs fall here.
OV (Organization Validation) — CA verifies the organization's legal existence. Common for company sites.
EV (Extended Validation) — Strictest vetting. Used by banks and payment platforms.

Historically, SSL certificates were issued for 1–2 year terms. In 2020, Apple, Google, and Mozilla enforced a hard cap of 398 days for certificates trusted in their browsers. Let's Encrypt, meanwhile, had already popularized 90-day certificates, arguing that shorter lifespans reduce the damage window if a certificate is compromised.

The result: certificates expire faster than ever, and the margin for error is shrinking.

Why Manual Tracking Fails

When a team has two or three certificates, the spreadsheet approach works fine. Someone adds a row, sets a calendar reminder, done.

Then the company grows. Suddenly you have:

A wildcard cert for *.yourdomain.com
A separate cert for api.yourdomain.com managed by a different team
A staging cert someone set up and forgot about
A cert for a third-party integration endpoint you technically own
Let's Encrypt auto-renew that "should be working" but nobody has verified in six months

The spreadsheet becomes stale. Calendar reminders get snoozed. The person who set up the cert leaves the company. Auto-renewal fails silently because the DNS challenge no longer resolves correctly after a migration.

This is not a people problem. It is a systems problem. Manual tracking does not scale.

The Alert Timeline That Actually Works

After dealing with enough SSL-related incidents, the SRE community has largely converged on a tiered alerting strategy:

Days Until Expiry	Alert Type	Who Gets Notified
60 days	Awareness ping	Primary engineer / infra team
30 days	Action required	Team lead + primary engineer
14 days	Escalation	Manager + entire team
7 days	All-hands	Engineering leadership
1 day	Emergency	PagerDuty / on-call rotation

The 60-day notification is intentionally low-urgency. It gives the responsible party time to renew without pressure. By the time you hit 7 days, something has already gone wrong in your process — the earlier alerts were missed or ignored. The 1-day alert should be treated like a production incident.

The key insight: alert early enough that the first notification is never urgent. If your team is routinely panicking at 7 days or fewer, your alert window is too short.
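The timeline above maps naturally onto a small lookup. This is a minimal sketch in Python, assuming the thresholds from the table; the tier names and the `alert_tier` function are illustrative, not part of any standard tool:

```python
from typing import Optional

# Thresholds from the table above, ordered from most to least urgent.
# The tier labels are illustrative names, not a standard.
ALERT_TIERS = [
    (1, "emergency"),
    (7, "all-hands"),
    (14, "escalation"),
    (30, "action-required"),
    (60, "awareness"),
]


def alert_tier(days_remaining: int) -> Optional[str]:
    """Return the most urgent tier whose threshold has been reached, or None."""
    for threshold, tier in ALERT_TIERS:
        if days_remaining <= threshold:
            return tier
    return None


for days in (90, 45, 10, 0):
    print(days, alert_tier(days))
```

Checking the most urgent tier first means an already-expired certificate (negative days) still lands in the emergency tier rather than falling through.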

Checking SSL Expiry: Code Examples

Using openssl CLI

# Check expiry date for a domain
echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null \
  | openssl x509 -noout -dates

# Output:
# notBefore=Jan  1 00:00:00 2025 GMT
# notAfter=Mar 31 23:59:59 2025 GMT

To get the number of days remaining:

EXPIRY=$(echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null \
  | openssl x509 -noout -enddate | cut -d= -f2)

EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || date -jf "%b %d %T %Y %Z" "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))

echo "Days until expiry: $DAYS_LEFT"

Using Node.js

const tls = require('tls');

function checkSSLExpiry(hostname, port = 443) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect({ host: hostname, port, servername: hostname }, () => {
      const cert = socket.getPeerCertificate();
      socket.end();

      const expiryDate = new Date(cert.valid_to);
      const now = new Date();
      const daysRemaining = Math.floor((expiryDate - now) / (1000 * 60 * 60 * 24));

      resolve({ hostname, expiryDate, daysRemaining });
    });

    socket.on('error', reject);
  });
}

checkSSLExpiry('yourdomain.com')
  .then(info => {
    console.log(`${info.hostname}: ${info.daysRemaining} days remaining`);

    if (info.daysRemaining <= 7) {
      console.error('CRITICAL: Certificate expires in 7 days or fewer!');
    } else if (info.daysRemaining <= 30) {
      console.warn('WARNING: Certificate expires soon.');
    }
  })
  // Connection and TLS errors (including an already-expired cert) land here
  .catch(err => console.error(`SSL check failed: ${err.message}`));

Using Python

import ssl
import socket
from datetime import datetime, timezone

def check_ssl_expiry(hostname: str, port: int = 443) -> dict:
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            cert = ssock.getpeercert()

    expiry_str = cert['notAfter']
    expiry_date = datetime.strptime(expiry_str, '%b %d %H:%M:%S %Y %Z').replace(tzinfo=timezone.utc)
    days_remaining = (expiry_date - datetime.now(tz=timezone.utc)).days

    return {'hostname': hostname, 'days_remaining': days_remaining}

result = check_ssl_expiry('yourdomain.com')
print(f"{result['hostname']}: {result['days_remaining']} days remaining")
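One thing worth knowing about the Python version: with the default context, a certificate that has already expired fails the handshake with `ssl.SSLCertVerificationError` instead of returning a negative number, so callers may want to catch that case. The date arithmetic itself can be pulled into a pure function that is testable without a network connection. A small sketch, where `days_until_expiry` is a hypothetical helper built on the standard library's `ssl.cert_time_to_seconds`:

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter string."""
    # ssl.cert_time_to_seconds parses the "Mar 31 23:59:59 2025 GMT" format
    # returned by getpeercert() and yields seconds since the epoch (UTC).
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expiry - now).days


# A fixed "now" keeps the example repeatable.
now = datetime(2025, 3, 1, tzinfo=timezone.utc)
print(days_until_expiry("Mar 31 23:59:59 2025 GMT", now))
```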

Automating Checks with a Cron Job

A simple cron-based approach for teams managing a small number of domains:

#!/bin/bash
# /usr/local/bin/check-ssl-certs.sh

DOMAINS=("yourdomain.com" "api.yourdomain.com" "dashboard.yourdomain.com")
ALERT_EMAIL="infra-team@yourcompany.com"
WARN_DAYS=30

for DOMAIN in "${DOMAINS[@]}"; do
  EXPIRY=$(echo | openssl s_client -connect "${DOMAIN}:443" -servername "${DOMAIN}" 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)

  if [ -z "$EXPIRY" ]; then
    echo "ERROR: Could not retrieve cert for ${DOMAIN}" \
      | mail -s "SSL Check Failed: ${DOMAIN}" "$ALERT_EMAIL"
    continue
  fi

  DAYS_LEFT=$(( ($(date -d "$EXPIRY" +%s) - $(date +%s)) / 86400 ))

  if [ "$DAYS_LEFT" -le "$WARN_DAYS" ]; then
    echo "SSL cert for ${DOMAIN} expires in ${DAYS_LEFT} days (${EXPIRY})" \
      | mail -s "SSL Warning: ${DOMAIN} expires in ${DAYS_LEFT} days" "$ALERT_EMAIL"
  fi
done

Add to crontab to run daily at 8 AM:

0 8 * * * /usr/local/bin/check-ssl-certs.sh

This gets you to a functional baseline. The limitation: it only works when your cron runner is healthy, and it has no concept of alert escalation or historical tracking.

External Monitoring as a Safety Net

Self-hosted cron jobs are a good first layer. They are not sufficient on their own. The machine running your cron job could be the same machine whose cert expires. Or the job runs but silently fails because your SMTP relay is down.

External monitoring services check your SSL certificates from outside your infrastructure, on a schedule, and alert you through independent channels (email, Slack, PagerDuty, SMS). This separation is the point — if your infrastructure has a problem, you still get notified.

AlertSleep is one example: it monitors SSL certificates continuously, tracks expiry dates across all your domains, and fires alerts at configurable thresholds — without requiring you to manage any infrastructure for the monitoring itself. For teams that want visibility without operational overhead, this kind of external check is a meaningful complement to internal automation.

Managing SSL at Scale: 50+ Certificates

When you cross the threshold of managing 50 or more certificates, new problems emerge.

Build a certificate inventory. Know which cert covers which domain, when it was issued, when it expires, who owns renewal, and whether it auto-renews. A simple internal wiki page is better than nothing. A proper certificate management tool is better still.

Wildcard certificates need special attention. A *.yourdomain.com wildcard might cover dozens of subdomains. If it expires, all of them break simultaneously. The blast radius of a wildcard expiry is much larger than a single-domain cert.

Treat auto-renewal as a process, not a guarantee. Let's Encrypt auto-renewal via certbot or ACME clients is reliable under normal conditions. It fails when DNS records change, when ports 80/443 are firewalled during the renewal window, or when the renewal configuration drifts after infrastructure changes. Verify that auto-renewal is actually succeeding, not just scheduled.

Use centralized alerting. Sending expiry alerts directly to individual engineers does not work at scale. Route all SSL alerts to a shared channel (Slack #infra-alerts) and a ticketing system. Coverage should not depend on any single person being available.
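The inventory and the auto-renewal check can start as a few lines of code. The sketch below is an assumption-heavy illustration: `CertRecord`, `needs_attention`, the sample domains, and the 30-day renewal window are all made up for the example, so adjust them to match how your ACME client is actually configured.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CertRecord:
    domain: str        # e.g. "*.yourdomain.com" for a wildcard
    expires: date
    owner: str         # a team or role, not an individual
    auto_renews: bool


def needs_attention(cert: CertRecord, today: date,
                    renew_window_days: int = 30, warn_days: int = 60) -> bool:
    """Flag certs due for manual renewal, or whose auto-renewal looks stuck."""
    days_left = (cert.expires - today).days
    if cert.auto_renews:
        # An auto-renewing cert should have been replaced before getting this close.
        return days_left < renew_window_days
    return days_left <= warn_days


# Hypothetical inventory entries for illustration only.
inventory = [
    CertRecord("*.yourdomain.com", date(2026, 9, 1), "infra-team", auto_renews=False),
    CertRecord("api.yourdomain.com", date(2026, 5, 2), "api-team", auto_renews=True),
    CertRecord("staging.yourdomain.com", date(2026, 6, 1), "infra-team", auto_renews=False),
]

today = date(2026, 4, 17)
for cert in sorted(inventory, key=lambda c: c.expires):
    if needs_attention(cert, today):
        days = (cert.expires - today).days
        print(f"{cert.domain}: {days} days left, owner {cert.owner}")
```

Treating "auto-renews but is still close to expiry" as its own condition is what turns auto-renewal from an assumption into something you verify.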

Closing Thoughts

SSL certificate expiration is a solved problem. The tools exist, the alert timelines are well-established, and the failure modes are well-documented. What makes it persistent as an incident cause is the gap between knowing what to do and actually having it in place.

The LinkedIn outage in 2021 was not a failure of knowledge. It was a failure of process. Somewhere in the chain, a certificate slipped through without the right person getting the right alert at the right time.

The fix is not complicated: external monitoring as your safety net, tiered alerts with enough lead time to act calmly, and an inventory that does not live in one person's head.

The goal is to make certificate expiry the most boring part of your infrastructure. An alert fires at 60 days, someone renews, done. No incident, no postmortem, no Tuesday morning scramble.

Setting up SSL monitoring for the first time? AlertSleep's SSL monitoring handles the external check layer and alert routing out of the box — worth a look before you build your own.

Cron Expression Cheat Sheet: Every Pattern You'll Actually Use

AlertSleep — Thu, 16 Apr 2026 10:00:00 +0000

You've seen it before. You need to schedule a job to run every weekday at 7:30 AM, so you open a tab, search "cron expression weekdays", stare at five cryptic fields, and immediately second-guess yourself.

Is it 30 7 * * 1-5 or 30 7 * * MON-FRI? Does */15 mean every 15 minutes starting from zero, or every 15th minute? And why does the internet have seventeen different answers?

Cron expressions are one of those things that look like line noise the first hundred times and suddenly click on the hundred and first. This article skips the theory lecture and goes straight to the patterns you'll actually reach for — explained clearly, with copy-paste syntax ready to go.

The Five Fields (Say Them Out Loud Once)

Every standard cron expression is exactly five fields separated by spaces:

┌───────────── minute        (0–59)
│ ┌─────────── hour          (0–23)
│ │ ┌───────── day of month  (1–31)
│ │ │ ┌─────── month         (1–12)
│ │ │ │ ┌───── day of week   (0–7, where 0 and 7 are both Sunday)
│ │ │ │ │
* * * * *

Read it left to right: minute → hour → day → month → weekday.

Drill that order into your head and half the confusion disappears. The other half disappears once you know what the special characters actually do.

Special Characters

Character	Meaning	Example
`*`	Every possible value	`* * * * *` = every minute
`,`	List of values	`0,30 * * * *` = at :00 and :30
`-`	Range of values	`0 9-17 * * *` = every hour from 9 AM to 5 PM
`/`	Step (every N)	`*/15 * * * *` = every 15 minutes

That's it. Four characters. Everything else is just combining them.
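If it helps to see the four characters mechanically, here is a small Python sketch that expands a single field into the values it matches. `expand_field` is a hypothetical helper written for this article, not a real cron library, and it only handles numeric fields (no names like MON or JAN):

```python
def expand_field(field: str, low: int, high: int) -> list[int]:
    """Expand one cron field (e.g. "*/15", "0,30", "9-17") into sorted values."""
    values = set()
    for part in field.split(","):              # "," separates list items
        rng, _, step = part.partition("/")     # "/" introduces a step
        step_n = int(step) if step else 1
        if rng == "*":                         # "*" covers the field's whole range
            start, end = low, high
        elif "-" in rng:                       # "-" gives an inclusive range
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:                                  # a single value
            start = end = int(rng)
        values.update(range(start, end + 1, step_n))
    return sorted(values)


print(expand_field("*/15", 0, 59))   # minute field
print(expand_field("9-17", 0, 23))   # hour field
print(expand_field("0,30", 0, 59))   # minute field
```

Note that a step counts from the start of the range, which is why `*/15` in the minute field always begins at zero.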

The Cheat Sheet

Every minute

* * * * *

You probably won't run jobs this frequently in production, but great for testing that your scheduler is alive.

Every 5 minutes

*/5 * * * *

The / means "every N units." So */5 in the minute field = 0, 5, 10, 15 ... 55. Works the same way in any field.

Every 15 minutes

*/15 * * * *

Classic for polling jobs, cache warming, or health checks you want more granular than hourly.

Every 30 minutes

*/30 * * * *

Or equivalently: 0,30 * * * *. Both hit :00 and :30 of every hour.

Every hour (on the hour)

0 * * * *

Note the 0 in the minute field. * * * * * is every minute. 0 * * * * is every hour at minute zero.

Every hour at :30

30 * * * *

Useful when you want to offset from other hourly jobs to spread load.

Every 6 hours

0 */6 * * *

Runs at midnight, 6 AM, noon, 6 PM.

Daily at midnight

0 0 * * *

Most common cron pattern in existence. Daily reports, log rotation, database backups.

Daily at 2 AM

0 2 * * *

2 AM is a popular "quiet hours" slot. Low user traffic, indexes done rebuilding.

Daily at 8:30 AM

30 8 * * *

Minute field first, then hour. Easy to flip — don't.

Weekdays only (Monday–Friday)

0 9 * * 1-5

The 1-5 in the weekday field means Monday through Friday. Run a standup digest, a daily business report, anything that shouldn't fire on weekends.

Weekdays at 9 AM and 5 PM

0 9,17 * * 1-5

Combining the comma list and the weekday range. Two fires per day, only on business days.

Weekends only

0 10 * * 6,7

Saturday and Sunday at 10 AM. Useful for batch jobs you want to keep out of the business week. If your scheduler rejects 7 for Sunday, use 6,0 instead.

Every Monday at 9 AM

0 9 * * 1

Weekly summaries, reports, cleanup jobs. Day 1 = Monday.

First day of every month at midnight

0 0 1 * *

Monthly invoicing, reports, archiving. The 1 is in the day-of-month field (third position).

First Monday of the month

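0 9 1-7 * * [ "$(date +\%u)" = 1 ] && your-command

Restricting both day fields does not work here: in standard (Vixie-style) cron, 0 9 1-7 * 1 fires when either field matches, so it runs on each of the first seven days and on every Monday. The expression above limits cron to the first seven days and lets the shell test keep only the Monday (date +%u prints 1 for Monday; the % must be escaped as \% inside a crontab). Some schedulers combine the two fields differently, so verify this with your cron implementation.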
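One way to see how the two day fields interact is to enumerate the matching dates. This is a minimal sketch, assuming Vixie-style cron, where restricted day-of-month and day-of-week fields are combined with OR. `matching_days` is a hypothetical helper, not part of any cron library:

```python
import calendar
from datetime import date


def matching_days(year: int, month: int, dom: set[int], dow: set[int], mode: str) -> list[int]:
    """Days of the month on which a job would run when both day fields are
    restricted, under "or" (Vixie-style) or "and" semantics.
    `dow` uses cron numbering: 0 = Sunday ... 6 = Saturday."""
    days = []
    last_day = calendar.monthrange(year, month)[1]
    for day in range(1, last_day + 1):
        cron_dow = date(year, month, day).isoweekday() % 7  # Sunday: 7 -> 0
        in_dom, in_dow = day in dom, cron_dow in dow
        matched = (in_dom or in_dow) if mode == "or" else (in_dom and in_dow)
        if matched:
            days.append(day)
    return days


first_week, monday = set(range(1, 8)), {1}
print("or: ", matching_days(2026, 5, first_week, monday, "or"))
print("and:", matching_days(2026, 5, first_week, monday, "and"))
```

For May 2026 the OR rule yields the first seven days plus every later Monday, while the AND rule yields only the 4th, which is the first Monday of that month.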

Specific date once a year (e.g., January 1st)

0 0 1 1 *

Midnight on the 1st of January. Happy New Year, cron job.

Day-of-Week Reference

Value	Day
0 or 7	Sunday
1	Monday
2	Tuesday
3	Wednesday
4	Thursday
5	Friday
6	Saturday

Both 0 and 7 are Sunday — a legacy quirk. Use whichever your tool accepts.

Month Reference

Value	Month
1	January
2	February
3	March
...	...
12	December

Some implementations accept JAN, FEB, MON, TUE, etc. as aliases. Check your platform's docs.

Quick Test: Can You Read These?

Before looking at the answers, try reading each expression aloud:

*/10 * * * *
0 0 * * 0
0 12 * * 1-5
0 0 1,15 * *
30 23 * * *

Answers:

Every 10 minutes
Every Sunday at midnight
Noon on weekdays
Midnight on the 1st and 15th of every month
11:30 PM every day

If you got those, you know cron.

If You Don't Want to Memorize All This

The honest truth: nobody remembers every pattern. Even experienced engineers double-check their expressions before deploying a job that runs monthly.

If you want a visual way to build and validate cron expressions without trial and error, AlertSleep's cron expression generator lets you select human-readable options (every weekday, first of the month, etc.) and generates the correct syntax automatically. Useful for the edge cases that are easy to get wrong.

Quick-Reference Summary Table

Goal	Cron Expression
Every minute	`* * * * *`
Every 5 minutes	`*/5 * * * *`
Every 15 minutes	`*/15 * * * *`
Every 30 minutes	`*/30 * * * *`
Every hour	`0 * * * *`
Every hour at :30	`30 * * * *`
Every 6 hours	`0 */6 * * *`
Daily at midnight	`0 0 * * *`
Daily at 2 AM	`0 2 * * *`
Daily at 8:30 AM	`30 8 * * *`
Weekdays at 9 AM	`0 9 * * 1-5`
Every Monday at 9 AM	`0 9 * * 1`
First of month at midnight	`0 0 1 * *`
Every Sunday at midnight	`0 0 * * 0`
January 1st at midnight	`0 0 1 1 *`
Weekdays at 9 AM and 5 PM	`0 9,17 * * 1-5`

One Last Thing

When you deploy a cron job, always verify the timezone your scheduler uses. A job set to 0 2 * * * on a UTC server fires at 2 AM UTC — which might be 9 PM, 6 AM, or some other local time depending on where your users are. Always check. Always document it in a comment next to your cron expression.

# Runs daily at 2 AM UTC (previous evening: 10 PM EDT / 7 PM PDT, or 9 PM EST / 6 PM PST)
0 2 * * * /usr/local/bin/run-backup.sh
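To work out what a UTC schedule means locally, including the daylight-saving shift, you can let the standard library do the conversion. A short sketch using Python's `zoneinfo` (it relies on the system's time zone database being installed); `local_fire_time` is a hypothetical helper name:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_fire_time(utc_hour: int, utc_minute: int,
                    on: tuple[int, int, int], tz_name: str) -> str:
    """Local wall-clock time at which a UTC-scheduled job fires on a given date."""
    year, month, day = on
    fire = datetime(year, month, day, utc_hour, utc_minute, tzinfo=timezone.utc)
    return fire.astimezone(ZoneInfo(tz_name)).strftime("%Y-%m-%d %H:%M %Z")


# The same "0 2 * * *" job, checked on a summer date and a winter date.
for run_date in ((2026, 7, 1), (2026, 1, 15)):
    for zone in ("America/New_York", "America/Los_Angeles"):
        print(zone, local_fire_time(2, 0, run_date, zone))
```

The local hour moves by one between summer and winter, which is a good reason to write the UTC time in the comment and treat local times as approximate.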

Future-you will thank present-you.

Found this useful? Save the summary table and never Google "cron expression weekdays" again.

Building a Status Page From Scratch vs Using a Service: A Cost Analysis

AlertSleep — Tue, 14 Apr 2026 12:34:59 +0000

Your users know your app is down before you do.

They see the spinning loader, the 502 error, the silence where data should be. And they have nowhere to go for answers. So they flood your support inbox, post on Twitter, and quietly decide to check out your competitor.

A status page changes that dynamic completely. It's not just a "we're working on it" page — it's a trust instrument. It tells users: we see what you see, we're on it, here's what we know.

But here's the question every engineering team eventually faces: do you build it, or do you buy it?

Let me break down the real costs of both.

What a Status Page Actually Needs to Do

Before comparing options, let's align on minimum viable functionality:

Show current status of each component (API, dashboard, payments, etc.)
Display active incidents with live updates
Historical uptime data (last 30-90 days)
Subscriber notifications (email, SMS) when incidents are created or resolved
Maintenance window announcements
Public URL that stays up even when your app is down

That last point is critical and often overlooked: your status page must be hosted independently from your main infrastructure. A status page that goes down with your app is worse than no status page at all.

Option A: Build It Yourself

What you're actually building

Most teams underestimate the scope. A status page isn't a static HTML file — it's a small application:

Frontend:

Component status grid with color states (operational / degraded / outage)
Incident timeline with markdown support
Uptime history graph (requires storing and querying ping data)
Subscriber signup form

Backend:

API to update component status
Incident management CRUD
Email/SMS notification system (integrate Mailgun, SendGrid, Twilio)
Webhook receiver (if you want auto-updates from your monitoring tool)

Infrastructure:

Hosted separately from your main stack (different cloud region, different provider)
Must stay online during your worst outages

Realistic time estimate

Task	Hours
Basic frontend (React/Vue)	8–16 hrs
Backend API	8–12 hrs
Email notifications	4–6 hrs
SMS notifications	3–5 hrs
Historical uptime graph	6–10 hrs
Separate hosting setup	2–4 hrs
Testing & polish	4–8 hrs
Total	35–61 hrs

At a conservative $75/hr developer rate, that's $2,600 – $4,600 before the first user sees it.

Ongoing costs

~2–4 hours/month maintenance
Hosting: $5–20/month (Fly.io, Railway, Render)
Email service: $0–15/month (SendGrid free tier runs out)
Total recurring: $60–$420/year for hosting and email (maintenance hours not included)

The hidden cost nobody accounts for

Your status page will have its first real test during your worst incident. When your database is on fire and every engineer is in a war room call, someone also has to update the status page.

If that status page is your own codebase — with its own deployment pipeline, its own bugs, its own "why isn't the email sending" moments — you've just doubled the cognitive load during the exact moment you can least afford it.

Option B: Use a Service

The main players in 2026

Service	Free Tier	Paid Starts At	Notes
Atlassian Statuspage	No	$29/mo	Industry standard, complex
Better Uptime	Limited	$20/mo	Good UX, integrated monitoring
Instatus	Yes (limited)	$20/mo	Clean, fast
AlertSleep	Yes	Paid plans available	Integrated with uptime monitoring
Cachet (self-hosted)	Free	Hosting costs	Open source, DIY maintenance

What you get immediately

Status page live in 10 minutes
Subscriber management handled
Hosted on separate, reliable infrastructure
Incident management UI (no code required)
Uptime history auto-populated from monitoring checks
Mobile app for on-call updates

The real cost comparison

Scenario	Build	Buy (mid-tier)
Initial setup cost	$2,600–$4,600	$0
Time to launch	1–2 weeks	10 minutes
Monthly recurring	$5–35/mo	$20–29/mo
Year 1 total	$2,700–$5,000	$240–$350
Year 2 total	$60–$420	$240–$350
Break-even	~Year 9 at best	n/a

On hosting and email costs alone, the build option becomes cheaper around year 9 in the best case (the lowest build cost against the priciest service), and never at the high end of the build range. Neither figure accounts for the ongoing maintenance hours, the engineering time spent on the status page instead of your core product, or the incidents that went poorly because the status page had a bug.
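If you want to plug in your own numbers, the break-even point is a one-line division. A rough sketch using the ranges from the table above; `break_even_year` is a hypothetical helper, and like the table it ignores maintenance labor:

```python
import math
from typing import Optional


def break_even_year(build_initial: float, build_yearly: float,
                    buy_yearly: float) -> Optional[int]:
    """First year in which cumulative build cost is at or below cumulative
    buy cost, or None if building never catches up."""
    saving_per_year = buy_yearly - build_yearly
    if saving_per_year <= 0:
        return None
    return math.ceil(build_initial / saving_per_year)


# Best case for building: cheapest build, priciest service.
print(break_even_year(2600, 60, 350))
# Worst case for building: priciest build, cheapest service.
print(break_even_year(4600, 420, 240))
# Midpoints of each range.
print(break_even_year(3600, 240, 295))
```

The spread between the best case and the midpoint is the real message: the answer depends far more on your recurring costs than on the initial build.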

When Building Makes Sense

There are legitimate reasons to build your own:

Build if:

You're a platform company where the status page is part of your product (think Vercel, Heroku)
You need deep integration with proprietary internal tooling
You have dedicated SRE resources with time to maintain it
You have specific branding/white-label requirements that no service offers
You're already building a monitoring platform yourself

Buy if:

You have fewer than 10 engineers
You need it working before your next launch
Your team is already stretched thin
You've had a public incident and need to restore user trust quickly

The Architecture Decision Nobody Talks About

Whether you build or buy, there's one architectural decision that matters more than everything else:

Your status page data must come from external monitoring, not internal reporting.

If your status page only shows "down" when your own systems detect and report it, you have a problem: your systems might be down in a way that prevents them from self-reporting.

The right architecture:

External monitor (different cloud, different region)
    ↓ detects outage
    ↓ triggers alert
    ↓ auto-creates incident on status page
    ↓ notifies subscribers
    ↓ engineers get paged

Not:

Your app
    ↓ is down
    ↓ engineer notices 20 minutes later
    ↓ manually logs into status page
    ↓ manually creates incident
    ↓ users have been confused for 20 minutes

This is why integrated solutions — where your uptime monitoring and status page share data — tend to work better in practice.
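The "right architecture" flow can be sketched as a single handler that receives results from an external monitor. Everything here is hypothetical: `StatusPage`, `Incident`, and the in-memory lists stand in for whatever your status page and notification service actually provide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    component: str
    opened_at: datetime
    updates: list[str] = field(default_factory=list)
    resolved: bool = False


class StatusPage:
    """Hypothetical in-memory stand-in for a status page backend."""

    def __init__(self) -> None:
        self.open_incidents: dict[str, Incident] = {}
        self.notifications: list[str] = []  # messages "sent" to subscribers

    def handle_check(self, component: str, is_up: bool, now: datetime) -> None:
        """Process one result from an external monitor."""
        incident = self.open_incidents.get(component)
        if not is_up and incident is None:
            # First failing check: open an incident and tell subscribers.
            incident = Incident(component, now, [f"{component} is failing external checks"])
            self.open_incidents[component] = incident
            self.notifications.append(f"Investigating: {component}")
        elif is_up and incident is not None:
            # Checks pass again: resolve and close the incident.
            incident.resolved = True
            incident.updates.append("Checks passing again")
            del self.open_incidents[component]
            self.notifications.append(f"Resolved: {component}")


page = StatusPage()
now = datetime(2026, 4, 14, 12, 0, tzinfo=timezone.utc)
for result in (True, False, False, True):
    page.handle_check("api", result, now)
print(page.notifications)
```

Repeated failures do not open duplicate incidents, so subscribers get one "investigating" and one "resolved" message per outage. A real setup would likely also require several consecutive failures before opening an incident.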

The Recommendation

For most teams: buy, don't build.

Not because building is wrong — building is often the right answer for product problems. But a status page is infrastructure, not product. It should be invisible when things are working and bulletproof when things aren't.

The engineering time you'd spend building a status page is almost certainly better spent on the features that make outages less frequent in the first place.

Start with a free tier, get it live this week, and revisit when you've outgrown it.

What's your current setup? If you're still manually emailing users during incidents, it's worth spending 10 minutes setting up something better. Tools like AlertSleep let you connect uptime monitoring directly to a public status page — so when a check fails, the incident is created automatically.

Drop your status page setup in the comments — curious what the dev.to community is using.

What 99.9% Uptime Actually Means: 8.7 Hours of Downtime Per Year

AlertSleep — Sun, 12 Apr 2026 06:07:48 +0000

You've seen it everywhere. On hosting pages, SaaS pricing tables, cloud provider dashboards:

"99.9% uptime guaranteed"

Sounds impressive. Almost perfect. Like, what's 0.1%?

A lot, actually. Let me show you the math — and more importantly, what it means for your users, your revenue, and your sleep schedule.

The Math Nobody Does

99.9% uptime means your service is unavailable for 0.1% of the time.

Here's what 0.1% looks like across different time windows:

Time Period	Allowed Downtime
Per day	1 minute 26 seconds
Per week	10 minutes 4 seconds
Per month	43 minutes 49 seconds
Per year	8 hours 45 minutes

That last one is the one that should make you pause. 8 hours and 45 minutes of downtime per year — and your SLA is technically fine the whole time.

The Full SLA Cheat Sheet

Most people only know the "three nines" (99.9%). Here's the complete picture:

SLA	Downtime/Year	Downtime/Month	Downtime/Day
99%	3 days 15 hrs	7 hrs 18 min	14 min 24 sec
99.5%	1 day 19 hrs	3 hrs 39 min	7 min 12 sec
99.9%	8 hrs 45 min	43 min 49 sec	1 min 26 sec
99.95%	4 hrs 22 min	21 min 54 sec	43 sec
99.99%	52 min 35 sec	4 min 22 sec	8.6 sec
99.999%	5 min 15 sec	26 sec	0.86 sec

The jump from 99.9% to 99.99% — one extra "9" — reduces your annual downtime budget from 8.7 hours to 52 minutes. That's a 10x difference.

Calculate Your Own Uptime

The formula is simple:

Downtime = Total Time × (1 - Uptime %)

For example, a year has 365.25 × 24 = 8,766 hours.

At 99.9%:

8,766 hours × 0.001 = 8.766 hours ≈ 8 hrs 45 min

Or in JavaScript, if you want to build it yourself:

function calculateDowntime(uptimePercent, periodHours) {
  const downtimeRatio = 1 - (uptimePercent / 100);
  const downtimeHours = periodHours * downtimeRatio;
  const downtimeMinutes = downtimeHours * 60;
  const downtimeSeconds = downtimeMinutes * 60;

  return {
    hours: Math.floor(downtimeHours),
    minutes: Math.floor(downtimeMinutes % 60),
    seconds: Math.floor(downtimeSeconds % 60),
  };
}

// 99.9% uptime over a year (8766 hours)
console.log(calculateDowntime(99.9, 8766));
// → { hours: 8, minutes: 45, seconds: 57 }

If you'd rather skip the math, tools like AlertSleep's uptime calculator let you punch in any percentage and get the breakdown instantly.

"But Our SLA Excludes Planned Maintenance"

This is the clause that quietly turns "99.9%" into "something much lower."

Many SLAs include language like:

"Uptime calculations exclude scheduled maintenance windows, force majeure events, and incidents caused by the customer."

In practice, this means a vendor can take their service down for a 4-hour maintenance window every month and still advertise "99.9% uptime" — because those hours simply don't count.
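To put a number on that, here is a quick Python sketch. It assumes the SLA percentage is measured only over the non-excluded hours and that the vendor uses its full downtime allowance; `effective_uptime` is a hypothetical helper and the month length is the 730.5-hour average (365.25 × 24 / 12):

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5 hours in an average month


def effective_uptime(sla_percent: float, excluded_hours: float,
                     period_hours: float = HOURS_PER_MONTH) -> float:
    """Worst-case real availability (%) when excluded maintenance hours
    come on top of the downtime the SLA already permits."""
    counted_hours = period_hours - excluded_hours
    allowed_downtime = counted_hours * (1 - sla_percent / 100)
    return 100 * (1 - (excluded_hours + allowed_downtime) / period_hours)


# A "99.9%" SLA with a 4-hour excluded maintenance window every month.
print(round(effective_uptime(99.9, 4), 2))
```

Under those assumptions the availability your users actually experience is closer to 99.35% than 99.9%.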

Always check:

Does the SLA count maintenance windows as downtime?
How much advance notice is required for scheduled maintenance?
What's the compensation if they breach the SLA? (Hint: it's usually service credits, not money)

What Does Downtime Actually Cost?

Here's where it gets real. Abstract percentages become concrete when you map them to your business.

A rough formula for sizing the impact:

Cost of Downtime = Lost Revenue/hr + Productivity Cost/hr + Reputation Damage

For an e-commerce site doing $100k/day in revenue:

Revenue per hour = $100,000 / 24 ≈ $4,166/hr

At 99.9% uptime → 8.75 hours of downtime/year
→ Lost revenue: 8.75 × $4,166 ≈ $36,500/year

And that's before counting the customer support tickets, the social media complaints, and the users who never come back.
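The same arithmetic in Python, if you want to try it with your own figures. This sketch covers only the lost-revenue term and assumes revenue is spread evenly across all 24 hours, which is rarely true in practice; `annual_lost_revenue` is a hypothetical helper:

```python
def annual_lost_revenue(daily_revenue: float, uptime_percent: float,
                        hours_per_year: float = 365.25 * 24) -> float:
    """Revenue lost per year if the full downtime budget is used."""
    revenue_per_hour = daily_revenue / 24
    downtime_hours = hours_per_year * (1 - uptime_percent / 100)
    return revenue_per_hour * downtime_hours


for sla in (99.5, 99.9, 99.99):
    print(f"{sla}%: ${annual_lost_revenue(100_000, sla):,.0f}/year")
```

If your outages tend to cluster around peak traffic, the real figure will be higher than this even-spread estimate.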

The "Five Nines" Problem

You'll sometimes see "five nines" (99.999%) thrown around by cloud providers. It sounds incredible — only 5 minutes of downtime per year.

But here's the uncomfortable truth: achieving five nines is mostly about architecture, not monitoring.

Five nines requires:

Multi-region active-active deployments
Zero-downtime deployments (blue/green or canary)
Automatic failover with sub-second detection
Chaos engineering to test failure scenarios

Most startups and even mid-size companies realistically operate at 99.5% to 99.95%. And that's fine — if you know it and plan for it.

The Difference Between Measured and Actual Uptime

Here's a subtle but important distinction.

Your hosting provider might achieve 99.99% uptime at the infrastructure level. But your application might only hit 99.5% because of:

Memory leaks that require weekly restarts
Slow database queries that cause timeouts (HTTP 504 — is that "downtime"?)
Third-party API dependencies that go down
SSL certificate expiry (this kills more sites than you'd think)
Your own deployment going wrong at 2am

Your uptime is only as good as the weakest link in the chain. And the only way to know your real uptime — not your provider's uptime — is to monitor from the outside.
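In fact the chain is usually a bit weaker than its weakest link. If your application needs every dependency to be up, and their failures are independent (both assumptions, and the percentages below are made-up examples), the availabilities multiply:

```python
from math import prod


def composite_availability(component_percents: list[float]) -> float:
    """Availability (%) of a system that needs every component to be up,
    assuming the components fail independently."""
    return 100 * prod(p / 100 for p in component_percents)


# Hypothetical chain: hosting, database, third-party API.
chain = [99.99, 99.9, 99.5]
print(round(composite_availability(chain), 2))
```

With these example numbers the combined figure comes out around 99.39%, below the 99.5% of the least reliable dependency.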

What to Actually Monitor

Most developers start monitoring too late and measure too little. Here's a baseline:

Minimum viable monitoring:

[ ] HTTP status check every 1-5 minutes
[ ] Response time tracking (a 200 that takes 30s is worse than a fast 503)
[ ] SSL certificate expiry alert (set to 30 days before)
[ ] Domain expiration alert (set to 60 days before)

Level up:

[ ] Multi-region checks (your site might be down only in the US East)
[ ] API endpoint monitoring (not just the homepage)
[ ] Port monitoring for non-HTTP services

Alert channels that actually wake you up:

SMS/phone call for critical alerts (email is too easy to miss at 3am)
Slack/Teams for the team
Status page for your users so they know you know

The Real Takeaway

99.9% uptime is not "always online." It's a budget — a budget of how much downtime your users are willing to accept before they find an alternative.

The question isn't "what SLA does my provider offer?" The question is:

What uptime does your business actually need — and how will you know when you're not hitting it?

The first step is measuring. You can't improve what you can't see.
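Once you have external check results, measuring is simple. A minimal sketch, assuming each check is recorded as a pass/fail boolean at a fixed interval; `measured_uptime` and `meets_target` are hypothetical helpers, and the 50 failed checks are invented sample data:

```python
def measured_uptime(check_results: list[bool]) -> float:
    """Percentage of external checks that succeeded."""
    if not check_results:
        raise ValueError("no checks recorded")
    return 100 * sum(check_results) / len(check_results)


def meets_target(check_results: list[bool], target_percent: float) -> bool:
    """True if measured uptime is at or above the target."""
    return measured_uptime(check_results) >= target_percent


# 30 days of one-minute checks (43,200 total) with 50 failures.
results = [True] * (43_200 - 50) + [False] * 50
print(f"{measured_uptime(results):.3f}%")
print("meets 99.9%:", meets_target(results, 99.9))
```

Fifty failed minutes in a 30-day window is enough to miss a 99.9% target, since the budget at that level is about 43 minutes.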

If you're building something people depend on, set up external uptime monitoring today — not after the first outage. Tools like AlertSleep start free and take about 2 minutes to configure.

What SLA does your app target? And are you actually measuring it? Drop it in the comments.