Steven Leggett

5 Production Incidents Every DevOps Engineer Should Know How to Debug

It's 2 AM. Your phone is screaming. The dashboard is red. Users are tweeting.

You have been on call long enough to know that the gap between "I think I know what's wrong" and "I know exactly what's wrong" can cost your company thousands of dollars per minute. The engineers who close that gap fast are not smarter than everyone else. They have just seen these patterns before.

Here are 5 production incidents that every DevOps engineer will encounter at some point - what they look like, why they happen, and how to debug them.


1. "No Space Left on Device"

The story

A developer was chasing a gnarly bug in production. To get more visibility, they temporarily cranked the application log level to DEBUG. They fixed the bug, merged the PR, and completely forgot to revert the log level setting.

Three weeks later, at 3 AM on a Tuesday, your monitoring fires. Every service on that host is returning 500s. The database is refusing writes. Nothing makes sense until you SSH in and run the one command that tells you everything:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        80G   80G     0 100% /

Full. The disk is completely full. /var/log has grown to 64GB of verbose debug output nobody was watching.

Why it happens

Debug logging is chatty by design. It logs every function call, every query parameter, every header. In a high-traffic service, debug logs can generate gigabytes per hour. Combine that with a missing or misconfigured log rotation policy and you have a slow-motion disaster playing out in the background while everyone is focused on feature work.

How to debug it

# Step 1: Confirm the problem
df -h

# Step 2: Find the culprit
du -sh /var/log/*
du -sh /var/log/nginx/*

# Step 3: Immediate relief - clear old compressed logs
find /var/log -name "*.gz" -mtime +7 -delete

# Step 4: Truncate (don't delete) the active log file
truncate -s 0 /var/log/myapp/app.log

# Step 5: Check logrotate config
cat /etc/logrotate.d/myapp
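For reference, a healthy policy for an app log like this might look like the following - the path, retention count, and size cap are illustrative values to adapt, not a drop-in config:

```
/var/log/myapp/*.log {
    daily
    rotate 7          # keep one week of rotated logs
    maxsize 500M      # rotate early if the file blows past 500M
    compress
    delaycompress
    missingok
    notifempty
    copytruncate      # truncate in place so the app keeps its file handle
}
```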

The fix

Set the log level back to INFO or WARN. Fix your logrotate config to enforce retention limits. Then add a disk space alert at 80% - not 95%. By the time you hit 95%, you probably have minutes, not hours.
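That alert can be a handful of lines if your stack has nothing built in. A minimal sketch using only the Python standard library (the 80% threshold and the mount point are assumptions to adapt):

```python
import shutil

def disk_usage_pct(path="/"):
    """Return used space for the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

# Wire this into whatever alerting you already run on a schedule
if disk_usage_pct("/") > 80:
    print("ALERT: root filesystem above 80% - investigate before it fills")
```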

Key takeaway

Your logging infrastructure needs to be monitored too. The tool you use to diagnose outages can itself cause outages if you ignore it.


2. Database Connection Pool Exhaustion

The story

Traffic is normal. CPU is normal. The database server is completely idle. But your application is throwing errors that look like this:

Error: Cannot acquire connection from pool
TimeoutError: timeout of 5000ms exceeded

And users are getting 503s.

This one is maddening the first time you see it because every instinct tells you to look at the database. The database is fine. The problem is how your application is managing its connection to the database.

Why it happens

Most database drivers give you a connection pool - a fixed set of reusable connections shared across your application's threads or async workers. When a request needs to run a query, it borrows a connection from the pool. When it's done, it returns the connection.

The failure mode that is easy to miss: what happens when a request throws an exception before it returns the connection?

// This is a leak
async function getUser(id) {
  const conn = await pool.acquire();
  const result = await conn.query('SELECT * FROM users WHERE id = $1', [id]);
  // If the query throws, conn is never released
  conn.release();
  return result;
}

// This is correct
async function getUser(id) {
  const conn = await pool.acquire();
  try {
    return await conn.query('SELECT * FROM users WHERE id = $1', [id]);
  } finally {
    conn.release(); // Always runs, even on error
  }
}

Under normal traffic, the leak is slow enough that the pool replenishes. Under higher load, or when errors spike, the pool drains faster than it fills. Then everything queues up waiting for a connection that never comes back.

How to debug it

-- On PostgreSQL: see all active connections
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- See who is holding connections longest
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

Look for connections stuck in idle in transaction state. That is almost always a leak. The connection was borrowed, a transaction started, and it was never committed or rolled back.

The fix

Audit your error handling paths. Every connection acquire must have a matching release in a finally block or equivalent. Set an acquisition timeout on your pool (for example connectionTimeoutMillis in node-postgres) so requests fail fast instead of queueing forever, and set PostgreSQL's idle_in_transaction_session_timeout so abandoned transactions get killed server-side. Add an alert when active connection count exceeds 80% of pool size.
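The always-release guarantee that the finally block gives you in JavaScript is usually expressed as a context manager in Python. A sketch against a toy pool - the Pool class here is a stand-in for a real driver, not an actual library API:

```python
from contextlib import contextmanager

class Pool:
    """Toy stand-in for a real connection pool (illustration only)."""
    def __init__(self):
        self.in_use = 0
    def acquire(self):
        self.in_use += 1
        return object()  # a real pool returns a connection
    def release(self, conn):
        self.in_use -= 1

@contextmanager
def connection(pool):
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)  # always runs, even if the body raises

pool = Pool()
# Even when the "query" raises, the connection goes back to the pool
try:
    with connection(pool) as conn:
        raise RuntimeError("query failed")
except RuntimeError:
    pass
assert pool.in_use == 0  # no leak
```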

Key takeaway

Low database CPU during an "outage" is a red flag pointing to connection management, not query performance. Always check pg_stat_activity before assuming the database is healthy.


3. Kubernetes CrashLoopBackOff - The Missing Secret

The story

You deploy a new version of your application to Kubernetes. Instead of the pods coming up healthy, you see this in kubectl get pods:

NAME                          READY   STATUS             RESTARTS   AGE
myapp-7d9f8b-xkj2p            0/1     CrashLoopBackOff   4          3m

The pod starts, crashes almost immediately, Kubernetes restarts it, it crashes again, and the backoff timer grows exponentially. Within 10 minutes the pod is waiting 5 minutes between restart attempts.
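The growth of that backoff timer is easy to see in a quick sketch, using the kubelet's defaults (a 10-second base delay that doubles per restart and caps at five minutes - hedge: these are the documented defaults, not something you configure per pod):

```python
def backoff_delays(restarts, base=10, cap=300):
    """Delays (seconds) Kubernetes waits before each successive restart."""
    return [min(base * 2**i, cap) for i in range(restarts)]

print(backoff_delays(6))  # [10, 20, 40, 80, 160, 300]
```

Summing those delays shows why the pod is idling five minutes between attempts within roughly ten minutes of the first crash.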

Why it happens

This specific variant is one of the more frustrating ones: the app crashes on startup because it cannot find a required configuration value. It is looking for a secret - maybe a database password, maybe an API key - via an environment variable mounted from a Kubernetes Secret.

But the Secret does not exist in this namespace.

kubectl logs myapp-7d9f8b-xkj2p
Error: Required environment variable DATABASE_PASSWORD is not set
Process exited with code 1
kubectl describe pod myapp-7d9f8b-xkj2p
Events:
  Warning  Failed    2m   kubelet  Error: secret "myapp-credentials" not found

The deployment referenced a Secret that was never created in the target namespace. It exists in staging. It does not exist in production. The deployment YAML was copy-pasted and nobody noticed.

How to debug it

# Step 1: Get the actual error
kubectl logs <pod-name>
kubectl logs <pod-name> --previous  # Logs from the crashed instance

# Step 2: Describe the pod for Kubernetes-level events
kubectl describe pod <pod-name>

# Step 3: Check if the secret exists
kubectl get secrets -n <namespace>

# Step 4: Verify the secret has the expected keys
kubectl describe secret myapp-credentials

The fix

Create the missing Secret in the correct namespace. For the longer-term fix, use a tool like helm diff, kubectl diff, or a GitOps pipeline that validates all referenced resources exist before allowing a deployment to proceed.
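To make the validation idea concrete, here is a sketch of the check such a pipeline performs: walk a parsed pod spec and report Secrets that are referenced but absent. The helper name and the manifest fragment are hypothetical; a real pipeline would parse your YAML and query the cluster for the existing-secrets set.

```python
def missing_secret_refs(pod_spec, existing_secrets):
    """Return names of Secrets referenced by env vars but not present."""
    referenced = set()
    for container in pod_spec.get("containers", []):
        # env entries referencing a single key of a Secret
        for env in container.get("env", []):
            ref = env.get("valueFrom", {}).get("secretKeyRef")
            if ref:
                referenced.add(ref["name"])
        # envFrom entries importing a whole Secret
        for env_from in container.get("envFrom", []):
            if "secretRef" in env_from:
                referenced.add(env_from["secretRef"]["name"])
    return sorted(referenced - set(existing_secrets))

spec = {"containers": [{"name": "myapp", "env": [
    {"name": "DATABASE_PASSWORD",
     "valueFrom": {"secretKeyRef": {"name": "myapp-credentials",
                                    "key": "password"}}}]}]}

print(missing_secret_refs(spec, {"some-other-secret"}))  # ['myapp-credentials']
```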

Key takeaway

CrashLoopBackOff means the pod keeps dying. kubectl logs --previous shows why it died. kubectl describe pod shows what Kubernetes tried and failed to do. Always check both.


4. The Node.js Memory Leak (WebSocket Edition)

The story

Your Node.js service is running fine after deployment. Memory usage is at 200MB, which is normal. Over the next 18 hours, you watch it creep up. 300MB. 400MB. 600MB. Then the process gets OOMKilled by the container runtime and restarts. The whole cycle starts again.

You check your code for obvious leaks - giant arrays, global caches growing unbounded. Nothing jumps out. This one hides.

Why it happens

The classic Node.js memory leak pattern that catches even experienced engineers: adding event listeners inside a function that gets called repeatedly, without removing them.

// This leaks memory on every new WebSocket connection
function setupWebSocket(socket) {
  // A fresh listener is added to process on every call, and its
  // closure captures socket - so the socket stays reachable
  // even after the connection closes
  process.on('SIGTERM', () => {
    socket.close();
  });

  socket.on('message', handleMessage);
}

Every time a new WebSocket connection comes in, a new listener is added to process. When the connection closes, the listener is not removed. The listener holds a reference to the socket object. The socket object cannot be garbage collected. After thousands of connections, you have thousands of dead socket references sitting in memory.
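The retention mechanics are not Node-specific. A small Python sketch - with a hypothetical handlers list standing in for process's internal listener table - shows the same effect: the closure pins the socket until the listener is removed.

```python
import gc
import weakref

handlers = []  # stands in for process's internal listener list

class Socket:
    """Toy stand-in for a WebSocket connection."""
    def close(self):
        pass

def setup_websocket(sock):
    # The closure captures sock, just like the SIGTERM arrow function
    handlers.append(lambda: sock.close())

sock = Socket()
probe = weakref.ref(sock)  # lets us observe when the socket is collected
setup_websocket(sock)
del sock                   # the "connection closed" moment
gc.collect()
print(probe() is None)     # False - the handler still pins the socket

handlers.clear()           # the missing removeListener step
gc.collect()
print(probe() is None)     # True - now it can be collected
```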

Node.js will even warn you about this - but the warning often gets lost in log noise:

MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
11 SIGTERM listeners added to [process]. Use emitter.setMaxListeners() to increase limit

That warning is not something to suppress. It is a canary telling you there is a leak.

How to debug it

# Get a heap snapshot from a running Node.js process
# (the process must be started with a snapshot signal enabled)
node --heapsnapshot-signal=SIGUSR2 app.js
kill -USR2 <pid>   # writes a .heapsnapshot file to the working directory

# Or via the Node.js inspector
node --inspect app.js
# Then open chrome://inspect and take a heap snapshot

Look for object types with counts growing over time. In this case you would see Socket instances accumulating far beyond the number of active connections.

The clinic.js tool is excellent for this:

npx clinic heapprofiler -- node app.js

The fix

Always clean up listeners when the associated resource goes away:

function setupWebSocket(socket) {
  const cleanup = () => socket.close();
  process.on('SIGTERM', cleanup);

  socket.on('close', () => {
    // Remove the listener when the connection closes
    process.removeListener('SIGTERM', cleanup);
  });
}

Key takeaway

Memory leaks in Node.js are almost always about retaining references longer than necessary. Event listeners are the most common culprit. Take MaxListenersExceededWarning seriously - it is not noise.


5. Cache Stampede After Redis Restart

The story

Your Redis cache went down for planned maintenance. You brought it back up. Simple, right?

Sixty seconds later your database server is on fire. CPU is pegged at 100%. Query latency went from 5ms to 8 seconds. The database is drowning.

What happened? The restart left the cache completely empty, so every single request was a miss at the same moment - and every application server tried to rebuild the cache simultaneously by hitting the database.

This is a cache stampede, also called a thundering herd.

Why it happens

Consider what happens when your cache is empty after a restart and you have 50 application servers:

  1. Request comes in for /api/products
  2. All 50 servers check the cache - cache miss
  3. All 50 servers query the database for product data
  4. All 50 servers write the result back to cache
  5. 49 of those database queries were wasted
  6. Under high traffic, "50 servers" becomes "50,000 requests per second"

The database - which normally handles 200 queries per second because the cache absorbs the rest - suddenly receives 20,000 queries per second. It collapses.

How to debug it

The diagnosis is usually visible in the metrics:

Cache hit rate: dropped from 95% to 0%
Database connections: spiked from 50 to 800 in 30 seconds
Database CPU: 100%
API P99 latency: 50ms -> 12,000ms

Correlate the timeline with the Redis restart event. If the metrics cliff happened right when Redis came back up, you have your answer.

The fix

Several strategies exist, and production systems often use multiple in combination:

Cache locking (mutex pattern): Only one process populates a cache key. Others wait.

import redis
import time

def get_with_lock(key, fetch_fn, ttl=300):
    r = redis.Redis()
    value = r.get(key)
    if value:
        return value

    # Try to acquire a lock
    lock_key = f"lock:{key}"
    if r.set(lock_key, "1", nx=True, ex=10):
        # We got the lock - populate the cache
        try:
            value = fetch_fn()
            r.setex(key, ttl, value)
            return value
        finally:
            r.delete(lock_key)
    else:
        # Someone else holds the lock - wait briefly, then re-read.
        # (Production code would retry in a loop with a deadline and
        # fall back to fetch_fn if the value never appears.)
        time.sleep(0.1)
        return r.get(key)

TTL jitter: Add random variance to cache expiration times so keys do not all expire simultaneously.

import random

base_ttl = 300
jitter = random.randint(-30, 30)
r.setex(key, base_ttl + jitter, value)

Probabilistic early expiration: Proactively refresh cache entries before they expire, based on how expensive the recomputation is.
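One well-known version of this is "XFetch" (probabilistic early recomputation): each read may volunteer to refresh a key before its TTL is up, with a probability that rises as expiry approaches and as the recomputation cost grows. A sketch of the decision function - the name and parameters here are illustrative, not a library API:

```python
import math
import random
import time

def should_refresh_early(expiry_ts, delta, beta=1.0, now=None):
    """XFetch-style check.

    expiry_ts: when the cached value expires (unix timestamp)
    delta:     how long the value took to recompute last time (seconds)
    beta:      > 1 refreshes more eagerly, < 1 less eagerly
    """
    now = time.time() if now is None else now
    # -log(u) for u in (0, 1] is an exponential random variable, so
    # expensive-to-rebuild values get refreshed earlier on average.
    u = 1.0 - random.random()  # in (0, 1], avoids log(0)
    return now - delta * beta * math.log(u) >= expiry_ts
```

In practice you would call this on every cache hit and kick off a single background refresh when it returns True, so the value is rebuilt before anyone ever sees a miss.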

Key takeaway

A cache restart is not a safe non-event. Treat cache warming as part of your maintenance procedure. Use TTL jitter by default - it costs nothing and prevents a whole class of stampede failures.


The Pattern Across All Five

Look at what these incidents have in common:

  • The symptoms lied. Disk full causing database errors. Connection pool causing "database" problems. A cache issue causing what looks like a database overload.
  • The actual cause was one layer removed from where the pain was visible.
  • All five are preventable with the right monitoring thresholds, code patterns, and configuration choices.
  • All five are faster to debug if you have seen them before.

That last point is the crux of it. Incident response speed is largely pattern recognition. The engineer who has seen a connection pool exhaustion before spots the idle database CPU and goes straight to pg_stat_activity. The one who has not seen it spends an hour tuning query indexes that are not the problem.


Practice Before the Pager Goes Off

If you want to build that pattern recognition without waiting for production to teach you the hard way, I built youbrokeprod.com - a free browser game where you investigate production outages step by step.

Each scenario drops you into a live incident: you run commands, read logs, check metrics, and work toward a diagnosis. No signup required to try it. The game currently has 10 scenarios across beginner, intermediate, and advanced difficulty - including all five incidents described in this post.

The goal is simple: make the muscle memory of incident debugging feel familiar before it is your on-call rotation on the line.


What production incidents have scarred you the most? Drop them in the comments - there are 44 scenarios in the backlog and the most painful real-world ones make the best levels.
