Peter

The Spot Instance That Killed Our Payments Service (And Why It Took Us 47 Minutes to Find It)

It started at 1:49 AM.

PagerDuty fired — payments-service entering CrashLoopBackOff, 3 replicas simultaneously. On-call engineer paged. I joined the incident bridge 4 minutes later.

By 2:36 AM, we had the fix deployed. 47 minutes of debugging for a 2-line YAML change.

This is the postmortem. Not of the incident itself — those exist internally — but of the investigation. Every wrong turn, every wasted minute, and the exact signals that eventually cracked it.


## The First 10 Minutes: The Obvious Wrong Answer

When pods crash simultaneously right after a deployment, the deployment is guilty until proven innocent. That's the right instinct most of the time. So the first 10 minutes were spent here:

```shell
kubectl rollout history deployment/payments-service -n production
kubectl describe deployment/payments-service -n production
```

The last deployment had gone out at 7:52 PM — over 6 hours earlier. The pods had been healthy for 6 hours since that deploy. This should have ruled out the deployment immediately, but we didn't internalize
it fast enough. We spent another 5 minutes reviewing the deployment diff anyway, looking for a subtle config change that could have caused a delayed failure.

Nothing.
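In hindsight, the diff review itself could have been one command. A hypothetical helper (the revision numbers are placeholders you'd read off `rollout history` first; the namespace is ours):

```shell
# Diff the recorded pod templates of two Deployment revisions.
# Writes to /tmp so it stays POSIX-shell compatible.
diff_revisions() {
  dep="$1"; old="$2"; new="$3"
  kubectl rollout history "deployment/$dep" -n production --revision="$old" > /tmp/rev-old.txt
  kubectl rollout history "deployment/$dep" -n production --revision="$new" > /tmp/rev-new.txt
  diff /tmp/rev-old.txt /tmp/rev-new.txt
}
# usage: diff_revisions payments-service 41 42   (hypothetical revision numbers)
```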

Time lost: ~12 minutes.


## The Next 15 Minutes: Log Archaeology

With deployment ruled out, we went to logs.

```shell
kubectl logs payments-service-7d9f8b-xkp2q --previous -n production
kubectl logs payments-service-7d9f8b-mn4lw --previous -n production
```

The logs showed the service starting up normally, attempting a Redis connection, and then… nothing. Process killed. No error, no panic, no stack trace. Just termination.

This looked like an OOMKill at first. So we checked resource limits:

```shell
kubectl describe pod payments-service-7d9f8b-xkp2q -n production | grep -A5 Limits
```

Memory usage was at 180Mi against a 512Mi limit. Not OOM.
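Headroom against the limit suggested no OOM, but the authoritative signal is the container's lastState in the pod JSON (kubectl get pod -o json), not a point-in-time usage reading. A sketch of what a real OOMKill reports there — exit code 137 and an explicit reason — using an inline sample so it's greppable offline:

```shell
# Minimal sample of a pod's lastState after an actual OOMKill.
cat > pod-status-sample.json <<'EOF'
{"lastState":{"terminated":{"reason":"OOMKilled","exitCode":137}}}
EOF
# An OOMKilled container says so explicitly; ours reported no such reason.
grep -o '"reason":"[^"]*"' pod-status-sample.json
```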

We pulled metrics from Prometheus looking for a memory spike. Nothing unusual.
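The Prometheus check was roughly this PromQL — the cAdvisor metric name is standard, but the label names are assumptions that vary by setup:

```
max_over_time(
  container_memory_working_set_bytes{namespace="production", pod=~"payments-service.*"}[1h]
)
```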

Time lost: ~15 more minutes. Now 27 minutes in.


## Minute 27: The Event Log (Should Have Started Here)

This is the moment in every postmortem where I think: why didn't we start with events?

```shell
kubectl get events -n production --sort-by='.lastTimestamp' | tail -30
```

There it was:

```
LAST SEEN   TYPE      REASON      OBJECT                        MESSAGE
2m          Warning   BackOff     pod/payments-service-7d9f8b   Back-off restarting failed container
4m          Warning   Unhealthy   pod/payments-service-7d9f8b   Liveness probe failed: Get "http://10.0.1.12:8080/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
```

Liveness probe failure. The probe was timing out.

But here's where we made our second mistake: we assumed the liveness endpoint itself was broken. Maybe the service wasn't starting correctly. Maybe there was a dependency it couldn't reach. We spent another
10 minutes trying to curl the healthz endpoint manually, exec'ing into a running pod to see if the endpoint responded.

It did. The endpoint worked fine.
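For the record, the manual check was essentially this (pod name from the logs earlier; it assumes curl exists in the container image). The timing output is the part that actually matters — a 200 that takes three seconds still fails a 2-second probe:

```shell
# Probe the endpoint by hand, reporting status code and total response time.
check_healthz() {
  pod="$1"
  kubectl exec -n production "$pod" -- \
    curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://localhost:8080/healthz
}
```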

Time lost: ~10 more minutes. Now 37 minutes in.


## Minute 37: The Node Event Nobody Checked

One of the engineers on the call said something offhand: "Wait, when did this start? 1:49? Can you check if anything happened on the node around then?"

```shell
kubectl get events -n kube-system --sort-by='.lastTimestamp' | grep -E "Node|node"
```

```
52m   Normal   NodeNotReady   node/ip-10-0-1-5   Node ip-10-0-1-5 status is now: NodeNotReady
49m   Normal   Starting       node/ip-10-0-1-5   Starting kubelet
47m   Normal   NodeReady      node/ip-10-0-1-5   Node ip-10-0-1-5 status is now: NodeReady
```

A node had cycled at 1:47 AM — 2 minutes before the crash loop started.

That node was running redis-cache-0.
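Finding out what a recycled node was hosting is one field selector away — a sketch, with the node name taken from the events above:

```shell
# List everything scheduled on a given node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
# usage: pods_on_node ip-10-0-1-5
```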

Redis had restarted when the node recycled. And Redis, spinning up fresh, takes about 3-4 seconds before it starts accepting connections.

The payments-service liveness probe hits /healthz, which internally checks Redis connectivity. The probe has a 2-second timeout. Redis is taking 3-4 seconds to warm up. The probe fails. Kubernetes kills the pod. Repeat.

The deployment was innocent. The pods were healthy. The liveness probe was doing exactly what it was configured to do. It was just configured too aggressively for the actual startup behavior of its dependencies.
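The arithmetic, with Kubernetes' documented defaults filled in where our config didn't override them (periodSeconds=10, failureThreshold=3 — assumptions, since the post's YAML doesn't show them):

```shell
# Probe budget vs. dependency warmup.
timeout=2; warmup=4; period=10; failures=3
if [ "$warmup" -gt "$timeout" ]; then
  echo "probe fails: dependency warmup ${warmup}s exceeds timeout ${timeout}s"
fi
echo "kill decision after ~$((period * failures))s of consecutive failures"
```

Any probe whose timeout is shorter than the warmup of a dependency it transitively checks will fail every attempt during that window, no matter how healthy the pod is.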

The fix:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 10        # was: 2
  initialDelaySeconds: 15   # was: 0
```

Deploy. Pods stabilize. Incident resolved at 2:36 AM.
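A startupProbe is worth considering alongside this fix — it grants a grace window at container boot without loosening the steady-state check. A sketch (same endpoint assumed; note it only covers boot, so the mid-life Redis restart still needs the longer liveness timeout):

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 6       # up to ~30s of grace while dependencies warm up
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  timeoutSeconds: 10
```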


## The Real Postmortem: Why Did This Take 47 Minutes?

The fix was 2 lines of YAML. The investigation took 47 minutes. That ratio is the actual problem.

Here's the map of where the time went:

| Phase | Time Spent | Why |
|---|---|---|
| Deployment investigation | 12 min | Correct first instinct, but over-indexed |
| Log archaeology | 15 min | Logs showed the symptom (killed process), not the cause |
| Healthz endpoint testing | 10 min | Found the mechanism but not the root cause |
| Node event discovery | 7 min | The actual signal — found last |
| Diagnosis + fix | 3 min | Trivial once cause was known |

The signals that cracked it were all there from minute zero:

  1. k8s events showing Liveness probe failed: context deadline exceeded
  2. Node events in kube-system showing the recycle at 1:47 AM
  3. Redis pod restart time correlating with the node event
  4. Deployment history showing the last deploy was 6 hours prior (early exoneration)

If you had read these four things in that order, the investigation is 5 minutes, not 47. The problem is that these signals live in three different places, and under pressure at 2 AM, humans don't naturally
start with the most diagnostic view. We start with the most familiar one (logs) and dig deeper instead of wider.
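That ordering can be baked into a runbook script so nobody has to remember it at 2 AM. A sketch — the namespace is a parameter, and nothing here is specific to our cluster:

```shell
# First-five-minutes triage, in the order that would have cracked this
# incident: events first, infrastructure events second, deploy history last.
first_five() {
  ns="$1"
  echo "== namespace events (most diagnostic view) =="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -30
  echo "== infrastructure events (nodes, kubelet) =="
  kubectl get events -n kube-system --sort-by='.lastTimestamp' | tail -30
  echo "== recent deployments (exonerate early) =="
  kubectl rollout history deployment -n "$ns"
}
# usage: first_five production
```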


## What Changes With Automated Investigation

We've been building Causa partly because of incidents exactly like this one.

When this alert fired, Causa's investigation loop would have:

  1. Received the PagerDuty signal
  2. Pulled the k8s events for the affected pods — immediately seeing the liveness probe timeout
  3. Correlated the pod restart timestamps against node events in kube-system — finding the 1:47 recycle
  4. Identified redis-cache-0 as running on the recycled node
  5. Checked recent deployments and exonerated them (6 hours ago, no correlation)
  6. Ran eBPF traces to confirm Redis connection latency in the 3-4 second range

And posted this to Slack in under 60 seconds:

```
ROOT CAUSE: Liveness probe timeout (2s) on payments-service is failing
because redis-cache-0 requires 3-4s to accept connections after the
spot instance recycle at 01:47 UTC.

EVIDENCE:
• Node ip-10-0-1-5 recycled at 01:47 UTC (2 min before crash loop start)
• redis-cache-0 was on ip-10-0-1-5 — restarted at same time
• Liveness probe timeout: 2s — insufficient for Redis warmup
• Last deployment: 6h ago — NOT correlated

FIX: Increase livenessProbe.timeoutSeconds to 10, add initialDelaySeconds: 15

CONFIDENCE: High (4 corroborating signals)
```

That's not guesswork and it's not a summary of the alert we already saw. It's the actual investigation — reading the right signals in the right order, without the human bottleneck.

The on-call engineer still makes the call. They still deploy the fix. But they do it with the full picture in front of them in minute one, not minute 47.


## Three Changes We Made After This Incident

1. Start every investigation with events, not logs.

kubectl get events --sort-by='.lastTimestamp' is now the first command in our runbook. Logs show what happened to the process. Events show what Kubernetes did about it. Start wider, then drill down.

2. Node events are infrastructure events — check kube-system.
Pod-level debugging often misses infrastructure-level causes. If you're not checking kubectl get events -n kube-system, you're missing a category of signals.

3. Timestamp correlation before hypothesis formation.
Before you start testing a hypothesis (broken deployment, bad code, OOM), check when things happened. The exact timing often rules out 80% of likely causes before you investigate them.


## Final Word

The best SREs I know aren't necessarily the fastest debuggers in isolation. They're the ones who've internalized a sequence — who know which signals to read in which order, and who don't get anchored on the first hypothesis.

That sequence can be automated. Not to replace the engineer's judgment, but to surface the right information before the judgment is needed.

If your team runs Kubernetes in production and you've had an incident that took longer than it should have, Causa is worth 5 minutes of your time. Free tier, one install command, works with whatever alerting you already have.


Have a war story like this one? I'd genuinely like to hear it — drop it in the comments.
