Why your 99.9% uptime means nothing to frustrated users
Picture this: your dashboards show green across the board, uptime sits at 99.9%, but support tickets keep flooding in about "random failures" and "the app being slow sometimes." You're dealing with intermittent outages, and they're probably costing you more than you think.
Unlike dramatic server crashes that wake everyone up at 3 AM, intermittent failures are sneaky. They show up as occasional API timeouts, random connection drops, or that payment form that works fine when you test it but fails for real users.
The real damage of "minor" issues
Complete outages hurt, but they're honest about it. Your monitoring screams, your team jumps into action, and you fix the problem. Intermittent issues are different beasts entirely.
They chip away at user trust one failed request at a time. Users start refreshing pages "just to be sure." They avoid using your app during certain hours. Eventually, they find alternatives that "just work."
For SaaS platforms, this translates to increased churn rates. E-commerce sites lose revenue during checkout flows. The business impact compounds because these problems often get brushed off as "network issues" until the damage is done.
Root causes that actually matter
Resource exhaustion patterns
Most intermittent failures trace back to resources that temporarily run dry:
- Connection pools filling during traffic spikes
- Memory gradually climbing until garbage collection blocks requests
- Database connections timing out under load
The pattern is always the same: everything works until it doesn't, then magically recovers when conditions change.
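The exhaust-then-recover pattern above can be sketched with a toy pool. This is a hypothetical illustration (a `queue.Queue` standing in for a real connection pool), not any particular driver's API:

```python
import queue

# Toy connection pool: a bounded queue of "connections" (illustrative only).
POOL_SIZE = 5
pool = queue.Queue(maxsize=POOL_SIZE)
for i in range(POOL_SIZE):
    pool.put(f"conn-{i}")

def handle_request(acquire_timeout=0.01):
    """Acquire a connection or fail, the way a real pool does under load."""
    try:
        conn = pool.get(timeout=acquire_timeout)
    except queue.Empty:
        return "timeout"  # the intermittent failure users actually see
    # ... do work with conn, then return it to the pool ...
    pool.put(conn)
    return "ok"

# Normal load: every request returns its connection; everything looks healthy.
print([handle_request() for _ in range(10)])  # all "ok"

# Traffic spike: five slow requests hold all connections at once.
held = [pool.get() for _ in range(POOL_SIZE)]
print(handle_request())  # "timeout"

# Conditions change: one connection frees up and requests "magically" recover.
pool.put(held.pop())
print(handle_request())  # "ok"
```

Nothing here is broken in the usual sense; the pool simply runs dry under concurrency and refills when load drops, which is exactly why the failure never reproduces on a quiet test machine.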
Network instability you can't see
Network problems degrade service gradually rather than failing outright. At just 2% packet loss, connections start timing out at random. When link utilization approaches 80%, latency spikes push requests past application timeouts.
Your load balancer health checks pass while real user requests fail. This monitoring blind spot makes network-related intermittent issues especially painful to track down.
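A quick back-of-envelope shows why "only" 2% packet loss feels so random. The packets-per-request figure below is an illustrative assumption, not a measured value:

```python
# Why 2% packet loss causes random slowness: assume a request/response
# exchange touches ~20 packets (illustrative number for a small API call).
loss_rate = 0.02
packets_per_request = 20

# Probability that at least one packet in the exchange is lost. TCP
# retransmits it, so the request succeeds, but the retransmit adds latency.
p_hit = 1 - (1 - loss_rate) ** packets_per_request
print(f"{p_hit:.0%} of requests see at least one retransmit")  # ~33%
```

Roughly a third of requests pay a retransmit penalty, while a single-packet health check almost always sails through, which is how the blind spot stays hidden.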
Dependency cascade effects
Modern apps depend on everything: databases, APIs, CDNs, third-party services. When dependencies become unreliable, they don't fail cleanly. They become slow or intermittently unavailable.
Database replica lag creates read inconsistencies. API rate limiting causes random failures. CDN issues affect specific regions. Each dependency multiplies your potential failure points.
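One defensive pattern against slow-but-not-dead dependencies is a hard timeout with a fallback. This is a minimal sketch with hypothetical names; production services would use an async client or library-level timeouts rather than spawning a thread per call:

```python
import concurrent.futures
import time

def call_with_timeout(fn, timeout_s, fallback):
    # One worker thread per call keeps the sketch simple; a real service
    # would reuse a pool or rely on a client with built-in timeouts.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # degrade gracefully instead of hanging
    finally:
        pool.shutdown(wait=False)

# A dependency that has become intermittently slow (simulated replica lag):
def flaky_recommendations():
    time.sleep(0.3)
    return ["item-1"]

print(call_with_timeout(flaky_recommendations, timeout_s=0.05, fallback=[]))
# [] -- the request degrades instead of hanging on the slow dependency
```

The point is to bound the blast radius: a degraded recommendations service should cost you a widget, not the whole page.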
Detection strategies that work
Monitor error rates, not just uptime
Track HTTP 5xx responses, database connection failures, API timeouts, and background job failures across different time scales. A 2% error rate averaged over an hour might be acceptable, but consistent 5-minute spikes indicate serious problems.
```yaml
# Example Prometheus alert for intermittent failures: fire when 5xx
# responses exceed 2% of total traffic for 2 minutes straight.
- alert: IntermittentAPIFailures
  expr: |
    rate(http_requests_total{status=~"5.."}[5m])
      / rate(http_requests_total[5m]) > 0.02
  for: 2m
  annotations:
    summary: "API error rate spike detected"
```
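The same idea in plain numbers, with illustrative per-window ratios: an hourly average can look acceptable while short windows breach the threshold.

```python
# Illustrative per-5-minute error ratios over one hour (not real data).
windows = [0.002, 0.001, 0.003, 0.045, 0.002, 0.001,
           0.002, 0.051, 0.002, 0.001, 0.003, 0.002]

hourly_avg = sum(windows) / len(windows)
spikes = [w for w in windows if w > 0.02]
print(f"hourly average: {hourly_avg:.1%}")        # ~1% -- looks "acceptable"
print(f"5-minute windows over 2%: {len(spikes)}")  # 2 -- the real problem
```

Averaged over the hour this system passes; viewed at 5-minute resolution, two windows show a 20x error spike that real users felt.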
Implement distributed tracing
Intermittent failures in microservice architectures need request tracing across services. Tools like Jaeger or Zipkin reveal which service becomes unreliable and how failures propagate.
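The core idea can be shown in miniature. This is a hypothetical sketch of trace-context propagation, not the OpenTelemetry API; real systems export spans to a collector and view them in Jaeger or Zipkin:

```python
import time
import uuid

def make_headers(parent=None):
    # Child spans inherit the parent's trace_id so one request is one trace.
    trace_id = parent["trace_id"] if parent else uuid.uuid4().hex
    return {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16]}

spans = []  # in a real system these are exported to a tracing backend

def traced_call(name, headers, work):
    start = time.time()
    result = work()
    spans.append({"name": name, "trace_id": headers["trace_id"],
                  "span_id": headers["span_id"],
                  "duration_ms": (time.time() - start) * 1000})
    return result

# One user request flows through two services under a single trace_id,
# so a slow downstream hop is attributable to the right service.
root = make_headers()
traced_call("api-gateway", root, lambda: None)
traced_call("payment-service", make_headers(parent=root),
            lambda: time.sleep(0.05))
```

Because both spans share a trace ID, you can sort every slow request's spans by duration and see which hop actually burned the time, instead of guessing from per-service averages.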
Real user monitoring beats synthetic tests
Synthetic monitoring misses issues that only affect specific user patterns or regions. RUM shows real problems: certain workflows failing more often, regional issues, or time-based patterns.
Case study: fixing checkout failures
A client lost revenue to intermittent payment failures occurring 3-5% of the time during peak hours. Traditional monitoring showed healthy services and normal database performance.
We implemented end-to-end request tracing that revealed the real culprit: database connection pool exhaustion during traffic spikes. The payment service couldn't get connections fast enough, causing checkout timeouts.
After optimizing connection pooling:
- Intermittent failures dropped from 3-5% to under 0.1%
- Peak period revenue increased by 12%
- Customer cart abandonment due to payment issues nearly disappeared
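A rough way to sanity-check pool sizing is Little's law: connections needed is roughly peak request rate times average time each request holds a connection. The numbers below are illustrative, not the client's actual figures:

```python
# Rough pool sizing via Little's law (illustrative numbers):
# connections needed ~= peak requests/sec * avg DB time per request (sec).
peak_rps = 400
avg_db_seconds = 0.025   # 25 ms of DB work per checkout request

needed = peak_rps * avg_db_seconds
headroom = 1.5           # spike buffer; tune from traced peak load
pool_size = int(needed * headroom)
print(pool_size)         # 15 connections
```

The exact multiplier matters less than doing the arithmetic at all: a pool sized for average load will, by construction, run dry at peak.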
Key takeaways
- Monitor what matters: Error rates and user experience metrics beat server uptime
- Don't dismiss unreproducible issues: They often indicate systemic problems
- Fix causes, not symptoms: Restarting services masks underlying issues
- Implement comprehensive observability: Logs, metrics, and traces across your entire stack
Intermittent outages aren't minor annoyances. They're canaries in the coal mine, warning you about systemic issues before they become catastrophic failures. The teams that take them seriously build more reliable systems and keep happier users.
Originally published on binadit.com