Your server returns a 200 OK. Your monitoring dashboard shows green. But users are complaining the site is broken.
Welcome to the gray zone between "up" and "down": degraded performance.
The Gray Zone
Consider these scenarios:
Memory leak: Your app slowly consumes more memory. Response times creep from 200ms to 8 seconds over days. At no point does it "go down." But users leave.
Third-party failure: Your payment provider is having issues. Your site loads perfectly but 40% of checkouts fail. Monitoring says everything is fine.
Regional CDN issue: Your CDN has problems in Asia. US and EU users are fine. Asian users see 20-second load times. Your monitoring server is in the US, so it reports 100% uptime.
In all three cases, traditional monitoring reports "UP" ✅
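To make the regional case concrete, here is a minimal sketch of checking probe results per region instead of trusting a single vantage point. The `ProbeResult` type, the `degraded_regions` helper, the region names, and the 3-second cutoff are all illustrative assumptions, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One synthetic check from one location (hypothetical structure)."""
    region: str
    status_code: int
    latency_s: float


def degraded_regions(results, max_latency_s=3.0):
    """Return regions that answered without an error status but too slowly."""
    return sorted(
        r.region
        for r in results
        if r.status_code < 400 and r.latency_s > max_latency_s
    )


# Sample data mirroring the CDN scenario: every probe gets a 200,
# but one region is far slower than the others.
probes = [
    ProbeResult("us-east", 200, 0.2),
    ProbeResult("eu-west", 200, 0.3),
    ProbeResult("ap-southeast", 200, 20.0),
]

print(degraded_regions(probes))  # ['ap-southeast']
```

A monitor that only sees the `us-east` probe would have nothing to report here, which is exactly the blind spot described above.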
Why This Matters for SLAs
Many SLAs define uptime as "responds with a non-error status code." Under that definition, a 200 OK that takes 30 seconds still counts as "up."
| Scenario | SLA Status | User Experience |
|---|---|---|
| 200 in 200ms | ✅ Up | ✅ Good |
| 200 in 15 seconds | ✅ Up | ❌ Terrible |
| 200 with empty data | ✅ Up | ❌ Broken |
| 503 error | ❌ Down | ❌ Down |
Two of the four scenarios are "up" by the SLA definition but functionally broken.
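The table can be expressed as two small functions, which makes the mismatch easy to see. This is only a sketch: the function names, the `has_data` flag, and the 3-second cutoff are assumptions chosen for illustration, and the latency used for the empty-data row is made up because the table does not give one.

```python
def sla_status(status_code):
    """SLA view: any non-error status code counts as up."""
    return "up" if status_code < 400 else "down"


def user_experience(status_code, latency_s, has_data, max_latency_s=3.0):
    """User view: also considers speed and whether the response is usable."""
    if status_code >= 400:
        return "down"
    if not has_data:
        return "broken"
    if latency_s > max_latency_s:
        return "terrible"
    return "good"


# (status code, latency in seconds, response contains data)
rows = [
    (200, 0.2, True),
    (200, 15.0, True),
    (200, 0.2, False),  # latency here is an assumed value
    (503, 0.2, False),
]

for status, latency, has_data in rows:
    print(status, sla_status(status), user_experience(status, latency, has_data))
```

The two middle rows come out as `up` from `sla_status` while `user_experience` reports `terrible` and `broken`.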
How to Monitor for Both
Layer 1: Uptime checks (catch total outages)
Layer 2: Response time thresholds (alert when responses are consistently slower than 3 seconds)
Layer 3: Multi-step flow monitoring (checks complete user journeys, such as a full checkout)
Layer 4: SSL/cert monitoring (prevents outages caused by expired certificates)
Layer 5: Visual monitoring (catches UI degradation that still returns 200 OK)
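As one possible reading of "consistently" in Layer 2, the sketch below fires only when every sample in a sliding window exceeds the threshold, so a single slow request does not page anyone. The class name, the window size, and the all-samples rule are assumptions for illustration; a real setup might use percentiles or a time-based window instead.

```python
from collections import deque


class SustainedLatencyAlert:
    """Signals when the last `window` samples are all above `threshold_s`."""

    def __init__(self, threshold_s=3.0, window=5):
        self.threshold_s = threshold_s
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def record(self, latency_s):
        """Add a sample; return True if the whole window is over threshold."""
        self.samples.append(latency_s)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold_s for s in self.samples)


alert = SustainedLatencyAlert(threshold_s=3.0, window=3)
for latency in [4.0, 5.0, 1.0, 4.0, 4.0, 4.0]:
    if alert.record(latency):
        print(f"ALERT: last {alert.samples.maxlen} responses exceeded "
              f"{alert.threshold_s}s")
```

With this input the alert fires once, on the final sample, because the earlier 1.0-second response breaks the first run of slow requests.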
The most expensive incidents aren't total outages. They're degradation events that go undetected for hours because monitoring says "everything is fine."
Full deep-dive with real-world examples from Cloudflare, GitHub, and Stripe: Read the complete guide