<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luke · Software Developer</title>
    <description>The latest articles on DEV Community by Luke · Software Developer (@lideroocom).</description>
    <link>https://dev.to/lideroocom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3671276%2Fad12e546-b8d2-493c-a4dd-fbabf6e61748.jpg</url>
      <title>DEV Community: Luke · Software Developer</title>
      <link>https://dev.to/lideroocom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lideroocom"/>
    <language>en</language>
    <item>
      <title>Your Monitoring Says 'UP' But Your Users Say 'Broken'</title>
      <dc:creator>Luke · Software Developer</dc:creator>
      <pubDate>Mon, 16 Feb 2026 07:43:53 +0000</pubDate>
      <link>https://dev.to/lideroocom/your-monitoring-says-up-but-your-users-say-broken-38m7</link>
      <guid>https://dev.to/lideroocom/your-monitoring-says-up-but-your-users-say-broken-38m7</guid>
      <description>&lt;p&gt;Your server returns a 200 OK. Your monitoring dashboard shows green. But users are complaining the site is broken.&lt;/p&gt;

&lt;p&gt;Welcome to the gray zone between downtime and degraded performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gray Zone
&lt;/h2&gt;

&lt;p&gt;Consider these scenarios:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory leak:&lt;/strong&gt; Your app slowly consumes more memory. Response times creep from 200ms to 8 seconds over days. At no point does it "go down." But users leave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party failure:&lt;/strong&gt; Your payment provider is having issues. Your site loads perfectly but 40% of checkouts fail. Monitoring says everything is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional CDN issue:&lt;/strong&gt; Your CDN has problems in Asia. US and EU users are fine. Asian users see 20-second load times. Your monitoring server is in the US, so it reports 100% uptime.&lt;/p&gt;

&lt;p&gt;In all three cases, traditional monitoring reports "UP" ✅&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for SLAs
&lt;/h2&gt;

&lt;p&gt;Most SLAs define uptime as "responds with a non-error status code." A 200 OK that takes 30 seconds still counts as "up."&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;SLA Status&lt;/th&gt;
&lt;th&gt;User Experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;200 in 200ms&lt;/td&gt;
&lt;td&gt;✅ Up&lt;/td&gt;
&lt;td&gt;✅ Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200 in 15 seconds&lt;/td&gt;
&lt;td&gt;✅ Up&lt;/td&gt;
&lt;td&gt;❌ Terrible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;200 with empty data&lt;/td&gt;
&lt;td&gt;✅ Up&lt;/td&gt;
&lt;td&gt;❌ Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;503 error&lt;/td&gt;
&lt;td&gt;❌ Down&lt;/td&gt;
&lt;td&gt;❌ Down&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two of the four scenarios are "up" by the SLA definition but functionally broken for users.&lt;/p&gt;
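
&lt;p&gt;A minimal sketch of that gap, judging the same probe result by the SLA rule and by what a user would accept. The helper names and the 3-second cutoff are illustrative assumptions, not any tool's API:&lt;/p&gt;

```python
SLOW_SECONDS = 3.0  # illustrative cutoff, matching the article's Layer 2

def sla_up(status_code):
    # Typical SLA rule: any 2xx or 3xx status counts as "up".
    return status_code // 100 in (2, 3)

def user_ok(status_code, seconds, has_data):
    # Users also need a timely response that actually contains data.
    slow = seconds > SLOW_SECONDS
    return sla_up(status_code) and not slow and has_data

# The table's four rows as (status, seconds, has_data):
rows = [(200, 0.2, True), (200, 15.0, True), (200, 0.2, False), (503, 0.2, True)]
verdicts = [(sla_up(s), user_ok(s, t, d)) for s, t, d in rows]
```

&lt;p&gt;Run against the table's rows, the SLA rule calls three of the four "up" while only the first passes the user test.&lt;/p&gt;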

&lt;h2&gt;
  
  
  How to Monitor for Both
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Layer 1:&lt;/strong&gt; Uptime checks (catches total outages)&lt;br&gt;
&lt;strong&gt;Layer 2:&lt;/strong&gt; Response time thresholds — alert when consistently &amp;gt; 3 seconds&lt;br&gt;
&lt;strong&gt;Layer 3:&lt;/strong&gt; Multi-step flow monitoring — check complete user journeys&lt;br&gt;
&lt;strong&gt;Layer 4:&lt;/strong&gt; SSL/cert monitoring — prevents a specific downtime type&lt;br&gt;
&lt;strong&gt;Layer 5:&lt;/strong&gt; Visual monitoring — catches UI degradation that returns 200 OK&lt;/p&gt;
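
&lt;p&gt;Layer 2's "consistently" matters: alerting on a single slow response pages people for noise. One way to sketch it is a consecutive-slow-checks rule; the class name and thresholds below are illustrative assumptions, not a particular monitoring product:&lt;/p&gt;

```python
from collections import deque

SLOW_SECONDS = 3.0   # Layer 2 threshold from the article
CONSECUTIVE = 3      # assumed streak length before alerting

class DegradationDetector:
    def __init__(self, threshold=SLOW_SECONDS, streak=CONSECUTIVE):
        self.threshold = threshold
        self.recent = deque(maxlen=streak)  # rolling window of response times

    def record(self, seconds):
        # Returns True only when the window is full and every check is slow,
        # so one outlier never triggers an alert on its own.
        self.recent.append(seconds)
        full = len(self.recent) == self.recent.maxlen
        return full and all(s > self.threshold for s in self.recent)
```

&lt;p&gt;A single 8-second blip stays quiet; three slow checks in a row raise the flag.&lt;/p&gt;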

&lt;p&gt;The most expensive incidents aren't total outages. They're degradation events that go undetected for hours because monitoring says "everything is fine."&lt;/p&gt;




&lt;p&gt;Full deep-dive with real-world examples from Cloudflare, GitHub, and Stripe: &lt;a href="https://perkydash.com/guides/downtime-vs-degradation" rel="noopener noreferrer"&gt;Read the complete guide&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Uptime Monitoring Won't Save You: A Guide to API-Based Auth Flow Monitoring</title>
      <dc:creator>Luke · Software Developer</dc:creator>
      <pubDate>Thu, 22 Jan 2026 17:38:55 +0000</pubDate>
      <link>https://dev.to/lideroocom/uptime-monitoring-wont-save-you-a-guide-to-api-based-auth-flow-monitoring-2k46</link>
      <guid>https://dev.to/lideroocom/uptime-monitoring-wont-save-you-a-guide-to-api-based-auth-flow-monitoring-2k46</guid>
      <description>&lt;h2&gt;
  
  
  The 3 AM Wake-Up Call
&lt;/h2&gt;

&lt;p&gt;Your phone buzzes. Then again. And again.&lt;/p&gt;

&lt;p&gt;"Can't log in to your app. Just spins forever."&lt;/p&gt;

&lt;p&gt;You check your uptime dashboard: &lt;strong&gt;100% green&lt;/strong&gt;. Server responding perfectly.&lt;/p&gt;

&lt;p&gt;But your auth API timed out. Or your token validation broke. Or your Redis session store went down.&lt;/p&gt;

&lt;p&gt;Your users are locked out - and your "uptime monitoring" didn't catch it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Uptime Monitoring Misses Auth Failures
&lt;/h2&gt;

&lt;p&gt;Traditional uptime monitoring checks one thing: &lt;strong&gt;does the server respond to HTTP requests?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your authentication involves multiple steps:&lt;/p&gt;

&lt;p&gt;User submits credentials&lt;br&gt;
↓&lt;br&gt;
Auth endpoint receives request&lt;br&gt;
↓&lt;br&gt;
Credentials validated (database/auth provider)&lt;br&gt;
↓&lt;br&gt;
Token/session generated&lt;br&gt;
↓&lt;br&gt;
Token returned to client&lt;br&gt;
↓&lt;br&gt;
Subsequent requests authenticated&lt;/p&gt;

&lt;p&gt;If &lt;strong&gt;ANY&lt;/strong&gt; step fails, users can't log in. But your homepage still returns 200 OK.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common auth failures invisible to uptime monitoring:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Auth API timeout (server up, endpoint slow)&lt;/li&gt;
&lt;li&gt;Database connection pool exhausted&lt;/li&gt;
&lt;li&gt;Auth provider outage (Auth0, Firebase, Okta)&lt;/li&gt;
&lt;li&gt;Token generation failure&lt;/li&gt;
&lt;li&gt;Rate limiting triggered&lt;/li&gt;
&lt;li&gt;Session store unavailable (Redis down)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Breaks SaaS Authentication
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Your Auth Provider
&lt;/h3&gt;

&lt;p&gt;Using Auth0, Okta, Firebase Auth? Their outage = your outage. You've outsourced auth, which means you've outsourced a critical failure point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Your Auth API
&lt;/h3&gt;

&lt;p&gt;Even with custom auth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database connection issues&lt;/li&gt;
&lt;li&gt;Memory/CPU exhaustion&lt;/li&gt;
&lt;li&gt;Deployment bugs&lt;/li&gt;
&lt;li&gt;Rate limiting triggered&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Token/Session Infrastructure
&lt;/h3&gt;

&lt;p&gt;JWT signing failures. Redis down. Token validation errors. These cause mysterious auth failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Third-Party OAuth
&lt;/h3&gt;

&lt;p&gt;Google, Microsoft, GitHub OAuth - if their token endpoint is slow or down, your "Login with Google" breaks.&lt;/p&gt;




&lt;h2&gt;
  
  
  API-Based Auth Flow Monitoring
&lt;/h2&gt;

&lt;p&gt;The solution: monitor the actual auth flow at the API level.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example: Login → Authenticated Request Flow
&lt;/h3&gt;

&lt;p&gt;Step 1: POST /api/auth/login&lt;/p&gt;

&lt;p&gt;Body: { email: '&lt;a href="mailto:test@example.com"&gt;test@example.com&lt;/a&gt;', password: 'test-password' }&lt;br&gt;
Validate: Status = 200&lt;br&gt;
Extract: $.token as 'auth_token'&lt;/p&gt;

&lt;p&gt;Step 2: GET /api/user/profile&lt;/p&gt;

&lt;p&gt;Headers: Authorization: Bearer {{auth_token}}&lt;br&gt;
Validate: Status = 200&lt;br&gt;
Validate: Response contains user data&lt;/p&gt;

&lt;p&gt;Step 3: POST /api/auth/logout (cleanup)&lt;/p&gt;

&lt;p&gt;Headers: Authorization: Bearer {{auth_token}}&lt;br&gt;
Validate: Status = 200&lt;/p&gt;
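
&lt;p&gt;The three steps above can be sketched in code. The endpoints and field names come from the example flow; the in-memory fake backend exists only so the sketch runs without a server, and everything about it is hypothetical:&lt;/p&gt;

```python
import json

def fake_api(method, path, headers=None, body=None):
    # Hypothetical stand-in backend so the flow logic is runnable offline.
    if (method, path) == ("POST", "/api/auth/login"):
        creds = json.loads(body)
        if creds.get("email") and creds.get("password"):
            return 200, {"token": "tok-123"}
        return 401, {}
    token_ok = headers and headers.get("Authorization") == "Bearer tok-123"
    if (method, path) == ("GET", "/api/user/profile"):
        return (200, {"user": {"id": "123"}}) if token_ok else (401, {})
    if (method, path) == ("POST", "/api/auth/logout"):
        return (200, {}) if token_ok else (401, {})
    return 404, {}

def run_auth_flow(api):
    # Step 1: login, assert 200, extract $.token as auth_token
    status, data = api("POST", "/api/auth/login",
                       body=json.dumps({"email": "test@example.com",
                                        "password": "test-password"}))
    assert status == 200, "login failed"
    auth = {"Authorization": "Bearer " + data["token"]}

    # Step 2: authenticated request, assert 200 and user data present
    status, data = api("GET", "/api/user/profile", headers=auth)
    assert status == 200 and "user" in data, "profile failed"

    # Step 3: logout cleanup, assert 200
    status, _ = api("POST", "/api/auth/logout", headers=auth)
    assert status == 200, "logout failed"
    return True
```

&lt;p&gt;Pointed at a real API instead of the fake, any break in the chain (login, token validation, backend data) surfaces as a failed step.&lt;/p&gt;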

&lt;h3&gt;
  
  
  What this catches:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Auth endpoint failures&lt;/li&gt;
&lt;li&gt;Token generation issues&lt;/li&gt;
&lt;li&gt;Token validation failures&lt;/li&gt;
&lt;li&gt;Database/backend issues&lt;/li&gt;
&lt;li&gt;Rate limiting on auth endpoints&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Setting Up Auth Flow Monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Create Test Account
&lt;/h3&gt;

&lt;p&gt;Create a dedicated monitoring user:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;test-monitor@yourdomain.com&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Strong, unique password&lt;/li&gt;
&lt;li&gt;Minimal permissions (read-only if possible)&lt;/li&gt;
&lt;li&gt;Excluded from analytics/billing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Don't use real user credentials. Don't use admin credentials.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Document Your Auth Flow
&lt;/h3&gt;

&lt;p&gt;Before configuring monitoring:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;POST /api/auth/login&lt;br&gt;
Body: { "email": "...", "password": "..." }&lt;br&gt;
Response: { "token": "eyJ...", "user": { "id": "123" } }&lt;br&gt;
Authenticated requests:&lt;br&gt;
Header: Authorization: Bearer &amp;lt;token&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Configure Multi-Step API Monitor
&lt;/h3&gt;

&lt;p&gt;Most monitoring tools support this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create "Process Flow" or "Multi-step API" monitor&lt;/li&gt;
&lt;li&gt;Add a login step that extracts the auth token&lt;/li&gt;
&lt;li&gt;Add authenticated request step&lt;/li&gt;
&lt;li&gt;Set assertions for each step&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 4: Alert Configuration
&lt;/h3&gt;

&lt;p&gt;For auth issues, speed matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minute 0-5:&lt;/strong&gt; Email + Slack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minute 5-15:&lt;/strong&gt; SMS if unacknowledged
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minute 15+:&lt;/strong&gt; Page the team&lt;/li&gt;
&lt;/ul&gt;
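
&lt;p&gt;One way to sketch that escalation schedule. The channel names and the acknowledgement flag are assumptions, not a specific alerting tool's API, and here paging is skipped once someone has acknowledged:&lt;/p&gt;

```python
def escalation_channels(minutes_open, acknowledged):
    channels = ["email", "slack"]             # minute 0-5: email + Slack
    if not acknowledged and minutes_open >= 5:
        channels.append("sms")                # minute 5-15: SMS if unacknowledged
    if not acknowledged and minutes_open >= 15:
        channels.append("pager")              # minute 15+: page the team
    return channels
```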

&lt;p&gt;Auth affects 100% of users. Alert aggressively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Gotchas
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test Account Rate Limiting
&lt;/h3&gt;

&lt;p&gt;Your test account logging in every 5 minutes = 288 logins/day.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whitelist test account from rate limiting&lt;/li&gt;
&lt;li&gt;Whitelist monitoring IPs&lt;/li&gt;
&lt;li&gt;Set test account password to never expire&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  False Positives
&lt;/h3&gt;

&lt;p&gt;Auth flow monitoring tends to produce more false positives than simple uptime checks. To reduce them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retry once before alerting&lt;/li&gt;
&lt;li&gt;Check from multiple locations&lt;/li&gt;
&lt;li&gt;Validate specific response content&lt;/li&gt;
&lt;/ul&gt;
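
&lt;p&gt;The "retry once before alerting" rule can be sketched as a small wrapper around whatever check function you already run (the names here are illustrative):&lt;/p&gt;

```python
def check_with_retry(run_check, retries=1):
    # Only report failure when the check and its immediate retries all fail,
    # so a single flaky probe does not page anyone.
    attempts = 1 + retries
    for _ in range(attempts):
        if run_check():
            return True   # healthy: no alert
    return False          # failed every attempt: alert

# Example: the first probe fails, the retry succeeds, so no alert fires.
flaky = iter([False, True])
no_alert = check_with_retry(lambda: next(flaky))
```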




&lt;h2&gt;
  
  
  Emergency Playbook
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Minute 0-3: Verify
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Try logging in manually (incognito browser)&lt;/li&gt;
&lt;li&gt;Check auth provider status page&lt;/li&gt;
&lt;li&gt;Check recent deployments&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Minute 3-5: Communicate
&lt;/h3&gt;

&lt;p&gt;Before fixing, communicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Post to your status page&lt;/li&gt;
&lt;li&gt;"We're aware some users cannot log in. Investigating."&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Minute 5-15: Diagnose
&lt;/h3&gt;

&lt;p&gt;Check in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Auth provider status&lt;/li&gt;
&lt;li&gt;Your auth API logs&lt;/li&gt;
&lt;li&gt;Database connectivity&lt;/li&gt;
&lt;li&gt;Recent deployments&lt;/li&gt;
&lt;li&gt;Rate limiting/WAF logs&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  After Resolution
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Update status page&lt;/li&gt;
&lt;li&gt;Email affected users&lt;/li&gt;
&lt;li&gt;Post-mortem: how to detect faster?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Your authentication is the gate to everything. When it's broken, nothing else matters.&lt;/p&gt;

&lt;p&gt;Traditional uptime monitoring won't catch auth issues. You need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor the actual auth flow&lt;/li&gt;
&lt;li&gt;Test with a real, dedicated monitoring account&lt;/li&gt;
&lt;li&gt;Verify authenticated requests work&lt;/li&gt;
&lt;li&gt;Alert fast and communicate faster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set this up today. Your 3 AM self will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What's your auth monitoring setup? Have you been bitten by "uptime fine, login broken"? Let me know in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>saas</category>
      <category>monitoring</category>
      <category>webdev</category>
      <category>devops</category>
    </item>
    <item>
      <title>Why “99.9% uptime” doesn’t mean your users are fine</title>
      <dc:creator>Luke · Software Developer</dc:creator>
      <pubDate>Fri, 19 Dec 2025 21:05:59 +0000</pubDate>
      <link>https://dev.to/lideroocom/why-999-uptime-doesnt-mean-your-users-are-fine-3bf7</link>
      <guid>https://dev.to/lideroocom/why-999-uptime-doesnt-mean-your-users-are-fine-3bf7</guid>
      <description>&lt;p&gt;For years, uptime has been treated as the ultimate signal of reliability.&lt;/p&gt;

&lt;p&gt;If a dashboard shows &lt;strong&gt;99.9% uptime&lt;/strong&gt;, everything must be fine.&lt;br&gt;&lt;br&gt;
Servers respond. Checks are green. Alerts are silent.&lt;/p&gt;

&lt;p&gt;And yet, users complain.&lt;/p&gt;

&lt;p&gt;Pages load but don’t render correctly.&lt;br&gt;&lt;br&gt;
Critical actions fail.&lt;br&gt;&lt;br&gt;
Performance is inconsistent depending on where users are located.&lt;/p&gt;

&lt;p&gt;From a monitoring perspective, everything looks “up”.&lt;br&gt;&lt;br&gt;
From a user’s perspective, the product feels broken.&lt;/p&gt;

&lt;p&gt;This disconnect is more common than most teams realize.&lt;/p&gt;




&lt;h2&gt;
  
  
  Uptime is an infrastructure metric
&lt;/h2&gt;

&lt;p&gt;Uptime answers a very specific question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Does a server respond to a request?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;It doesn’t tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether the page actually renders&lt;/li&gt;
&lt;li&gt;whether critical user flows work&lt;/li&gt;
&lt;li&gt;whether the experience is usable&lt;/li&gt;
&lt;li&gt;whether users in different regions see the same thing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uptime is necessary, but it’s only a &lt;strong&gt;baseline&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Treating it as a proxy for user experience is where problems begin.&lt;/p&gt;




&lt;h2&gt;
  
  
  When everything is “up” but nothing works
&lt;/h2&gt;

&lt;p&gt;Many real incidents don’t show up as downtime:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A frontend deploy introduces a JavaScript error
&lt;/li&gt;
&lt;li&gt;An API responds, but returns incorrect data
&lt;/li&gt;
&lt;li&gt;A checkout page loads but fails silently
&lt;/li&gt;
&lt;li&gt;A CSS issue breaks layout on specific devices
&lt;/li&gt;
&lt;li&gt;A feature flag misconfiguration affects only part of the audience
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From the outside, the site is reachable.&lt;br&gt;&lt;br&gt;
From the inside, dashboards stay green.&lt;/p&gt;

&lt;p&gt;From the user’s point of view, the product is unusable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The regional blind spot
&lt;/h2&gt;

&lt;p&gt;Another common failure mode is &lt;strong&gt;regional availability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A site may be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fully accessible from one country&lt;/li&gt;
&lt;li&gt;slow or unreachable from another&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CDNs, DNS resolution, routing paths, and ISPs all play a role here.&lt;/p&gt;

&lt;p&gt;Centralized monitoring often checks from a limited set of locations.&lt;br&gt;&lt;br&gt;
If those locations are healthy, the issue stays invisible.&lt;/p&gt;

&lt;p&gt;This is why teams hear:  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I can’t reproduce it.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And users keep experiencing problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why teams struggle to communicate incidents
&lt;/h2&gt;

&lt;p&gt;When availability issues are unclear, communication breaks down too.&lt;/p&gt;

&lt;p&gt;Teams fall back to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;replying to individual support tickets&lt;/li&gt;
&lt;li&gt;posting updates in chat tools&lt;/li&gt;
&lt;li&gt;sending ad-hoc emails&lt;/li&gt;
&lt;li&gt;answering “is it down?” repeatedly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s no single source of truth.&lt;br&gt;&lt;br&gt;
Users don’t know where to look.&lt;br&gt;&lt;br&gt;
Support load increases exactly when teams are already under pressure.&lt;/p&gt;

&lt;p&gt;The problem isn’t just technical.&lt;br&gt;&lt;br&gt;
It’s about &lt;strong&gt;shared understanding&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What actually helps
&lt;/h2&gt;

&lt;p&gt;Teams that handle incidents well tend to focus on a few principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Think in terms of &lt;strong&gt;availability&lt;/strong&gt;, not just uptime
&lt;/li&gt;
&lt;li&gt;Look at systems from the &lt;strong&gt;user’s perspective&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Verify reachability from outside their own environment
&lt;/li&gt;
&lt;li&gt;Detect user-facing breakage, not just server response
&lt;/li&gt;
&lt;li&gt;Communicate clearly and consistently
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monitoring becomes less about collecting metrics&lt;br&gt;&lt;br&gt;
and more about &lt;strong&gt;reducing uncertainty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you want a deeper look at how uptime differs from real availability,&lt;br&gt;&lt;br&gt;
this guide explores the topic in more detail:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://perkydash.com/guides/why-uptime-is-not-enough" rel="noopener noreferrer"&gt;https://perkydash.com/guides/why-uptime-is-not-enough&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick checks still matter
&lt;/h2&gt;

&lt;p&gt;Sometimes, teams don’t need a full dashboard or historical data.&lt;/p&gt;

&lt;p&gt;They just need a fast answer to a simple question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Is the site reachable for users right now?&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A quick external check can help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirm or rule out availability issues&lt;/li&gt;
&lt;li&gt;validate user reports&lt;/li&gt;
&lt;li&gt;decide whether deeper investigation is needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools that check reachability from the outside are useful exactly because&lt;br&gt;&lt;br&gt;
they step outside internal networks, cached DNS, and existing sessions.&lt;/p&gt;

&lt;p&gt;Here’s a small free tool that does just that:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://perkydash.com/tools/uptime-check" rel="noopener noreferrer"&gt;https://perkydash.com/tools/uptime-check&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Availability is the real goal
&lt;/h2&gt;

&lt;p&gt;Uptime should be treated as a &lt;strong&gt;baseline&lt;/strong&gt;, not a success metric.&lt;/p&gt;

&lt;p&gt;What users care about is whether they can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;access the product&lt;/li&gt;
&lt;li&gt;use it as expected&lt;/li&gt;
&lt;li&gt;complete what they came to do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When teams shift their mindset from uptime to availability,&lt;br&gt;&lt;br&gt;
they start seeing issues earlier, communicating better,&lt;br&gt;&lt;br&gt;
and making decisions with more confidence.&lt;/p&gt;

&lt;p&gt;Green dashboards are reassuring.&lt;br&gt;&lt;br&gt;
Understanding what users actually experience is far more valuable.&lt;/p&gt;

</description>
      <category>saas</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
