Tracking down mysterious production outages in distributed systems
Picture this: your dashboards are all green, CPU and memory look healthy, database connections are stable. Then your app goes down for three minutes. When it recovers, there's nothing useful in the logs. Sound familiar?
These phantom outages plague high-availability systems because failures cascade through interconnected components in ways that traditional monitoring misses. Here's how to actually debug them.
Why distributed systems fail randomly
The "randomness" isn't actually random. It's cascading failures triggered by timing, load patterns, or external factors your monitoring doesn't capture.
Consider this scenario: your database connection pool hits capacity, making app threads wait. Health checks time out because the app can't respond fast enough. Your load balancer removes the server, pushing more traffic onto the remaining nodes. The cascade completes in seconds, but the root cause (maybe a gradual memory leak) built up over hours.
Your 30-second monitoring intervals miss the brief spikes that actually trigger failures.
Debugging strategy that works
Increase monitoring resolution
Drop your scrape intervals to 5-10 seconds during failure periods:
```yaml
# prometheus.yml
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:
  - job_name: 'api-servers'
    scrape_interval: 5s
    static_configs:
      - targets: ['api1:9090', 'api2:9090']
```
Add distributed tracing
Random failures often hide in service-to-service interactions. Instrument these critical paths:
- Load balancer to application servers
- Database queries and connection acquisition
- External API calls
- Cache operations and background jobs
Jaeger and Zipkin reveal where failures propagate across service boundaries.
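If a full tracer isn't wired up yet, the core idea can be sketched in plain Python (the `span` helper and in-memory `spans` list below are illustrative, not a real tracing API): time each operation and tag it with a shared trace ID so cross-service timings can be lined up afterwards.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # in a real system these would be exported to Jaeger or Zipkin

@contextmanager
def span(name, trace_id):
    # Record the duration of one operation, tagged with a shared trace ID
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

trace_id = str(uuid.uuid4())
with span("handle_request", trace_id):
    with span("db.query", trace_id):
        time.sleep(0.01)  # stand-in for a database call
```

Sorting all spans sharing a trace ID by duration quickly shows which hop in the request path ate the time budget.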
Correlate logs by timestamp
Aggregate logs from all components and look for patterns during failure windows:
```json
{
  "timestamp": "2024-01-15T14:30:45Z",
  "service": "api",
  "level": "error",
  "message": "DB connection timeout",
  "request_id": "req-123",
  "pool_active": 48,
  "pool_max": 50
}
```
Look for database slow queries before app timeouts, memory issues during traffic spikes, or background jobs consuming resources at peak times.
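A minimal sketch of that correlation step, assuming the aggregated logs have already been parsed into dicts with a `ts` datetime field (the `window` helper and field names are illustrative): filter everything to a window around the failure and sort by time, regardless of which service emitted it.

```python
from datetime import datetime, timedelta

def ts(s):
    # Parse an ISO 8601 timestamp like the log entries above
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def window(entries, failure_time, seconds=60):
    """Return entries within +/- `seconds` of the failure, ordered by time."""
    lo = failure_time - timedelta(seconds=seconds)
    hi = failure_time + timedelta(seconds=seconds)
    hits = [e for e in entries if lo <= e["ts"] <= hi]
    return sorted(hits, key=lambda e: e["ts"])

logs = [
    {"ts": ts("2024-01-15T14:30:40Z"), "service": "db", "message": "slow query: 4.8s"},
    {"ts": ts("2024-01-15T14:30:45Z"), "service": "api", "message": "DB connection timeout"},
    {"ts": ts("2024-01-15T12:00:00Z"), "service": "worker", "message": "job done"},
]

timeline = window(logs, ts("2024-01-15T14:30:45Z"))
# The slow query lands just before the app timeout; unrelated entries drop out
```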
Building resilient systems
Circuit breakers prevent cascades
Fail fast when downstream services become unavailable:
```python
import time

class CircuitBreakerOpen(Exception):
    pass

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure = None
        self.state = 'CLOSED'

    def call(self, func, *args, **kwargs):
        if self.state == 'OPEN':
            if time.time() - self.last_failure > self.timeout:
                self.state = 'HALF_OPEN'  # allow one trial call through
            else:
                raise CircuitBreakerOpen()
        try:
            result = func(*args, **kwargs)
            self.reset()
            return result
        except Exception:
            self.record_failure()
            raise

    def record_failure(self):
        self.failure_count += 1
        self.last_failure = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = 'OPEN'  # stop calling the failing dependency

    def reset(self):
        self.failure_count = 0
        self.state = 'CLOSED'
```
Set aggressive timeouts
Most cascading failures happen because systems wait too long for unresponsive dependencies:
```python
import requests
from requests.adapters import HTTPAdapter
from sqlalchemy import create_engine
from urllib3.util.retry import Retry

# Database connections
pool = create_engine(
    database_url,
    pool_size=20,
    pool_timeout=3,     # short timeout for acquiring a connection
    pool_recycle=3600,
    pool_pre_ping=True,
)

# HTTP clients with retries
session = requests.Session()
retry_strategy = Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[500, 502, 503, 504],
)
session.mount('https://', HTTPAdapter(max_retries=retry_strategy))

# Always use timeouts
response = session.get(url, timeout=(2, 8))  # 2s connect, 8s read
```
Design for graceful degradation
Instead of complete outages, reduce functionality when components fail. Cache critical data, implement read-only modes, or disable non-essential features.
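One common shape for this, sketched with a hypothetical `fetch_with_fallback` helper: keep the last good response cached, and when the live dependency fails, serve the stale copy instead of an error page.

```python
import time

cache = {}  # last known good responses, keyed by query name

def fetch_with_fallback(key, fetch_live):
    """Try the live dependency; on failure, serve stale cached data instead of erroring."""
    try:
        value = fetch_live(key)
        cache[key] = (value, time.time())
        return value, "live"
    except Exception:
        if key in cache:
            value, _ = cache[key]
            return value, "stale"  # degraded, but not an outage
        raise  # no fallback available; surface the failure

def flaky(key):
    raise TimeoutError("upstream down")

cache["prices"] = ({"AAPL": 190.0}, time.time())
value, source = fetch_with_fallback("prices", flaky)
```

In practice you would also cap how stale a cached value may be before refusing to serve it.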
Testing failure scenarios
Use chaos engineering to trigger failures systematically:
- Inject database delays
- Limit available memory
- Simulate network partitions
- Throttle API responses
This reveals failure modes before they surprise you in production.
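Even without dedicated chaos tooling, a quick in-process experiment is possible with a hypothetical wrapper that delays a fraction of calls, which is enough to exercise your timeout and circuit-breaker behavior under load tests.

```python
import random
import time

def inject_latency(func, p=0.3, delay=0.05, rng=random.random):
    """Wrap a callable so a fraction p of calls see extra delay."""
    def wrapped(*args, **kwargs):
        if rng() < p:
            time.sleep(delay)  # simulated slow dependency
        return func(*args, **kwargs)
    return wrapped

# p=1.0 forces the delay on every call, useful for deterministic tests
slow_query = inject_latency(lambda: "rows", p=1.0, delay=0.01)
```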
Key takeaways
Random downtime isn't random. It's cascading failures in complex systems where timing and load create perfect storms. Fix it by increasing observability, implementing circuit breakers, setting proper timeouts, and testing failure scenarios deliberately.
The goal isn't preventing all failures, but preventing small failures from becoming outages.
Originally published on binadit.com