Mehmet TURAÇ

Posted on Jun 16

Great Stack to Doesn't Work #10 — Season Finale: "When PagerDuty Calls at 3 AM"

#devops #backend #discuss #sre

A survival guide for when everything goes wrong in production.

This episode is different. No tutorials. No configuration guides. No "here's how the technology works."

This is seven incidents. Seven nights where someone's phone rang at a terrible hour. Seven postmortems where the root cause was never just one thing.

Each incident ties back to something we covered in Episodes 1-9. Because production doesn't read your documentation. It combines failure modes in ways you didn't plan for.

Incident 1: Split-Brain — Two Masters, Two Datasets

Time: 02:17 AM, Thursday

What happened:

PostgreSQL cluster with streaming replication. One primary, two replicas. The network between the primary and the replicas experienced a 45-second partition — just long enough for the replicas to lose contact with the primary.

The failover system (Patroni) promoted Replica-1 to primary. But the original primary didn't know it had been demoted. When the partition healed after those 45 seconds, two nodes both believed they were the primary. Both were accepting writes.

For 8 minutes, two masters served two different sets of writes. The application load balancer was sending reads to both and writes to whichever responded first. 2,400 orders were created on the original primary. 1,800 orders were created on the new primary. 340 of them conflicted — same order IDs, different data.

02:25 — Monitoring detected replication lag anomaly (lag was negative, which should be impossible).

02:28 — On-call engineer logged in. Saw two nodes reporting as primary. Immediately realized: split-brain.

02:30 — Fenced the original primary (shut down PostgreSQL, blocked network access) to stop the bleeding.

02:31 to 04:45 — Reconciliation. Exported the WAL from both nodes after the split point. Compared transaction logs. Identified 340 conflicting writes. Manually resolved each one. Replayed non-conflicting writes from the fenced primary onto the surviving primary.

Root cause: Patroni's fencing relied on a watchdog timer, and that watchdog was not running. The old primary should have been automatically fenced (shut down) when it couldn't reach the DCS (Distributed Configuration Store). But the watchdog had been disabled during a maintenance window two weeks earlier and never re-enabled.

Lessons:

Automatic fencing is not optional. STONITH (Shoot The Other Node In The Head) exists for a reason. (#1: PostgreSQL)
Post-maintenance checklists must verify every disabled safety mechanism is re-enabled.
Monitor for "impossible" states. Negative replication lag, two primaries — these should be hard alerts. (#7: Observability)
8 minutes of undetected split-brain created more than 2 hours of manual reconciliation (02:31 to 04:45). Prevention is far cheaper than recovery.
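
The "impossible states" lesson can be sketched as a small check that a monitor runs against a cluster snapshot. This is a minimal sketch, not Patroni's API: the `NodeStatus` shape and the idea that lag arrives as a signed byte count are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one cluster member, as a monitor might collect it."""
    name: str
    role: str  # "primary" or "replica"
    replication_lag_bytes: Optional[int] = None  # None for a primary


def impossible_states(nodes: List[NodeStatus]) -> List[str]:
    """Return one hard-alert message per state that should never occur."""
    alerts = []

    # More than one node claiming to be primary is split-brain by definition.
    primaries = sorted(n.name for n in nodes if n.role == "primary")
    if len(primaries) > 1:
        alerts.append("split-brain: multiple primaries: " + ", ".join(primaries))

    # A replica cannot be ahead of the node it replicates from.
    for n in nodes:
        lag = n.replication_lag_bytes
        if n.role == "replica" and lag is not None and lag < 0:
            alerts.append(f"negative replication lag on {n.name}: {lag}")

    return alerts
```

The point of the design is that these checks page immediately instead of waiting for a threshold to be crossed for N minutes. There is no healthy reading of "two primaries."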

Incident 2: "Just a Config Change" — 4 Hours of Downtime

Time: 23:45, Tuesday

What happened:

An engineer updated a Kubernetes ConfigMap that contained the database connection string. The change was minor: updating the connection pool size from 20 to 50 to handle increased traffic. The ConfigMap was applied. Pods were restarted to pick up the new config.

But the ConfigMap YAML had a typo. Not in the pool size, but in the database hostname. A trailing space: "db-host.internal " instead of "db-host.internal". DNS resolution failed for the hostname with a space. Every pod restarted, read the new config, failed to connect to the database, and entered CrashLoopBackOff.

23:47 — All pods in CrashLoopBackOff. Error rate: 100%. All traffic returning 503.

23:48 — PagerDuty fired. On-call engineer opened the alert.

23:52 — Checked pod logs: connection refused: host not found. Checked the ConfigMap. Didn't see the trailing space (it's invisible in most terminals).

00:05 — Tried rolling back the deployment. But the deployment hadn't changed; only the ConfigMap had. kubectl rollout undo restored the previous pod template, which still referenced the same (broken) ConfigMap. Pods still crashed.

00:15 — Someone suggested checking the raw ConfigMap YAML. kubectl get configmap db-config -o yaml showed the trailing space in the hostname.

00:17 — Fixed the typo. Applied. Pods restarted. Service restored.

00:17 to 03:45 — Cleaning up. The orders placed during the 32-minute outage had not been processed (no database connection = no processing). Queue replay from Kafka. Customer notifications. Incident report.

Total downtime: 32 minutes. Total recovery effort: 4 hours.

Root cause: ConfigMap changes bypass all CI/CD validation. No unit test. No integration test. No canary. No approval gate. A single character in a YAML file took down the entire platform.
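
A validation step for this class of typo does not need to be elaborate. Here is a minimal sketch of a pre-apply check, assuming the ConfigMap data has already been loaded into a dict; the key name `DB_HOST` and the function name are hypothetical.

```python
import re
from typing import Dict, List

# RFC 1123-style hostname: dot-separated labels of letters, digits, and hyphens.
_LABEL = r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_HOSTNAME = re.compile(rf"{_LABEL}(\.{_LABEL})*")


def validate_hostnames(data: Dict[str, str], hostname_keys: List[str]) -> List[str]:
    """Return human-readable errors for the keys that should hold hostnames."""
    errors = []
    for key in hostname_keys:
        value = data.get(key, "")
        if value != value.strip():
            # repr() makes the invisible character visible in CI output.
            errors.append(f"{key}: leading or trailing whitespace in {value!r}")
        elif not _HOSTNAME.fullmatch(value):
            errors.append(f"{key}: not a valid hostname: {value!r}")
    return errors
```

Run against the broken config, `validate_hostnames({"DB_HOST": "db-host.internal "}, ["DB_HOST"])` reports the whitespace with the value quoted, which is exactly what the terminal hid during the incident.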

Lessons:

ConfigMap changes are deployments. Treat them with the same rigor: code review, validation, canary rollout. (#6: CI/CD)
Use ConfigMap immutability or versioned ConfigMaps. Instead of updating in-place, create a new ConfigMap with a version suffix and update the deployment to reference it. Now kubectl rollout undo actually works.
Validate connection strings before deploying them. A pre-deploy script that attempts a TCP connection to the database hostname catches this instantly.
Kubernetes' CrashLoopBackOff for config errors is indistinguishable from application bugs in logs. The connection string looked correct until you diffed it byte-by-byte. (#4: Kubernetes)
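
One way to get versioned ConfigMaps is to derive the name suffix from the content, so that any change (even one trailing space) produces a new name and the Deployment has to be updated to reference it. A minimal sketch, assuming a 10-character SHA-256 prefix is an acceptable suffix; the function name is hypothetical, and tools such as Kustomize's configMapGenerator apply the same idea with their own hash format.

```python
import hashlib
import json
from typing import Dict


def versioned_name(base: str, data: Dict[str, str]) -> str:
    """Build a ConfigMap name whose suffix changes whenever the data changes."""
    # Canonical JSON: sorted keys and fixed separators, so key order is irrelevant.
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:10]
    return f"{base}-{digest}"
```

Because the Deployment's pod template now names a specific ConfigMap version, a config change becomes a new rollout revision, and rolling back the Deployment points the pods at the previous ConfigMap again.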

Incident 3: Cache Invalidation — 6 Hours Undetected

Time: Discovered at 08:30 AM, Wednesday. Started at 02:15 AM.

What happened:

A nightly batch job updated product prices in the database at 02:15. The cache invalidation hook was supposed to delete the affected Redis keys so the next read would fetch fresh prices. The hook ran, but a Redis cluster failover had happened at 02:10, 5 minutes before the batch job. The invalidation commands were sent to the old primary, which was now a replica. Redis replicas do not forward writes, so the client had to find the new primary and resend the DEL commands there, and 12 of those resends timed out.

Those 12 keys were never invalidated. 12 products showed stale prices — specifically, yesterday's prices before a 15% discount was applied. Customers buying those products paid full price.

08:30 — Customer support received complaints: "The website shows a discount but I was charged full price." That description turned out to be imprecise. The website showed the old price (from cache), while the checkout flow read from the database (correct discounted price). Either way, the displayed price and the charged price were different.

08:45 — Engineering confirmed: Redis cached prices were stale for 12 products. Manual invalidation fixed it immediately.

09:00 to 12:00 — Identified all affected orders (1,847). Calculated price differences. Issued partial refunds.

6 hours of stale cache. Zero alerts fired because:

Cache hit ratio was 99.8% (great!)
Error rate was 0% (no errors — wrong prices aren't errors)
Latency was normal
No healthcheck verifies that cached data matches source data

Root cause: Cache invalidation during a Redis failover window is unreliable. The client library retried the timed-out commands once but not enough times to succeed after the failover completed.
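
Retrying once is rarely enough to outlast a failover. A minimal sketch of invalidation that retries with exponential backoff and reports what it could not delete is shown below. The `delete` and `exists` callables stand in for whatever the Redis client provides, and the attempt count and delays are assumptions, not values from the incident.

```python
import time
from typing import Callable, Iterable, List


def invalidate_with_retry(
    keys: Iterable[str],
    delete: Callable[[str], None],
    exists: Callable[[str], bool],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Delete each key, retrying with backoff; return the keys still cached."""
    pending = list(keys)
    for attempt in range(attempts):
        still_pending = []
        for key in pending:
            try:
                delete(key)
                # Verify instead of trusting that the delete reached the primary.
                if exists(key):
                    still_pending.append(key)
            except (TimeoutError, ConnectionError):
                still_pending.append(key)
        pending = still_pending
        if not pending:
            break
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return pending
```

A non-empty return value is the signal that was missing that night: the batch job can fail loudly or page someone instead of finishing "successfully" with 12 stale keys.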

Lessons:

Cache invalidation is not fire-and-forget. Verify that invalidation succeeded, especially during infrastructure events. (#3: Redis)
Monitor data freshness, not just cache metrics. A check that compares a sample of cached values against the database every 5 minutes would have caught this in 5 minutes instead of 6 hours.
TTLs are your safety net. If these cache keys had a 1-hour TTL, the stale data would have self-corrected by 03:15. The keys had no TTL because "we invalidate on change." (#3: Redis)
Financial impact from stale cache: $23,400 in refunds. Cost of a 1-hour TTL on price keys: zero.
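
The data freshness check from the lessons above can be sketched in a few lines. This assumes the cache and the database can both be read as simple key-to-value mappings for the sampled keys; in a real system those would be a Redis client and a SQL query, and the function name is hypothetical.

```python
import random
from typing import List, Mapping, Optional


def find_stale_keys(
    cache: Mapping[str, str],
    source: Mapping[str, str],
    sample_size: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Compare a random sample of cached values against the source of truth."""
    rng = rng or random.Random()
    keys = sorted(cache.keys())
    # Check everything when the cache is small, otherwise a random sample.
    sample = keys if len(keys) <= sample_size else rng.sample(keys, sample_size)
    return sorted(k for k in sample if source.get(k) != cache[k])
```

Scheduled every 5 minutes, a non-empty result becomes an alert. Sampling keeps the database load bounded, at the cost of possibly missing a small set of stale keys on any single run, which is one more reason to keep a TTL as the backstop.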

Incident 4: DNS Propagation — Two Regions Couldn't See Each Other

Time: 14:20, Monday

What happened:

Multi-region deployment. US-East and EU-West. Service discovery via internal DNS (Route 53 private hosted zones). An infrastructure change updated the DNS records for the payment service in EU-West — new IP addresses after a cluster migration.

US-East's DNS resolver cached the old IP addresses. TTL was set to 300 seconds (5 minutes). But the resolver had its own caching layer that didn't respect TTL strictly — it held entries for up to 15 minutes under load.

For 15 minutes, US-East couldn't reach EU-West's payment service. The old IPs pointed to decommissioned nodes. Connection timeout. Every US-East order that required the EU payment provider failed.

14:20 — Error alerts: payment service connection timeouts from US-East.

14:25 — On-call checked EU-West: payment service healthy, responding to local requests.

14:30 — Checked DNS from US-East: resolving to old IPs. TTL had expired but the resolver was still serving cached entries.

14:35 — Flushed the DNS resolver cache on US-East nodes. Connections restored.

15 minutes of cross-region payment failures. 3,200 failed orders.

Root cause: DNS TTLs are a suggestion, not a guarantee. Resolvers, operating systems, and applications all cache DNS at different layers, and none of them are obligated to respect the TTL exactly.

Lessons:

When changing DNS records, plan for stale cache. Lower the TTL to 30 seconds 24 hours before the change. Make the change. Wait for the old TTL period. Raise the TTL back. (#8: Load Balancer)
Application-level DNS caching (JVM's networkaddress.cache.ttl, Python's resolver, Go's resolver) adds another layer. Some frameworks cache DNS for the lifetime of the process. Know your runtime's DNS behavior.
Connection pooling with health checks detects stale DNS faster than waiting for TTL. If the pool detects dead connections, it re-resolves DNS and connects to the new IPs.
Cross-region dependencies should have circuit breakers. If US-East can't reach EU-West's payment service, fall back to a US payment provider or queue the request for retry. (#9: Distributed Tracing)
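
The circuit breaker from the last lesson can be sketched as follows. This is a simplified illustration, not a production library: the thresholds are assumptions, and `primary` and `fallback` are hypothetical callables (for example, "call the EU payment service" and "queue the request for retry").

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback instead."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # After reset_after seconds, let a trial call through (half-open).
        return self._clock() - self._opened_at < self.reset_after

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()
        try:
            result = primary()
        except (TimeoutError, ConnectionError):
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

With this in place, US-East would have stopped waiting on connection timeouts to decommissioned IPs after a few failures and gone straight to the fallback until a trial call succeeded.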

Incident 5: Memory Leak — The Restart That Became a Ritual

Time: Ongoing, discovered during a capacity planning review

What happened:

This isn't a 3 AM incident. It's worse — it's a slow-motion failure that everyone adapted to.

A Node.js service had a memory leak. Not dramatic — about 50 MB per day. The container's memory limit was 2 GB. Every 3 weeks, memory usage hit the limit, the container was OOMKilled, Kubernetes restarted it, and memory dropped back to 400 MB.

The on-call runbook said: "If the order-enrichment service restarts, check logs for OOMKilled. This is expected. No action needed."

For 8 months, this was "normal." A production service crashing every 3 weeks was documented and accepted. Nobody investigated the root cause because the symptom was managed.

Then traffic doubled after a marketing campaign. Memory growth accelerated to 100 MB per day. Restarts went from every 3 weeks to every 10 days to every 5 days. Then a traffic spike pushed memory growth to 200 MB in one day. The service restarted during peak hours. The cold start took 45 seconds. During those 45 seconds, 3,000 requests queued. When the service came back, it processed the queue, allocating memory rapidly, and hit the limit again within 2 hours. Restart loop.

Root cause: An event listener was being registered on every request but never removed. Each listener held a reference to the request context, preventing garbage collection. After 500,000 requests, 500,000 dead listeners consumed 1.6 GB of memory.

The fix: One line — remove the event listener in the response handler.

Lessons:

A crash that "nobody needs to investigate" is a crash waiting to get worse. (#5: Linux, #4: Kubernetes)
Memory usage over time should be a standard dashboard. A monotonically increasing line is never healthy, even if it's slow.
"The runbook says it's expected" is not an acceptable state for any production failure. If the runbook normalizes a crash, the runbook is wrong.
Node.js memory profiling (--inspect, Chrome DevTools heap snapshots) would have found the listener leak in 30 minutes. 8 months of "managed failure" cost far more.
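
The "monotonically increasing line" lesson can be turned into an alert with basic arithmetic: fit a line through recent memory samples and project when it crosses the limit. A minimal sketch, assuming samples arrive as (day, megabytes) pairs and that 2 GB means 2048 MB; the function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (time in days, memory in MB)


def growth_per_day(samples: Sequence[Sample]) -> float:
    """Least-squares slope of the samples, in MB per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    if var_x == 0:
        raise ValueError("samples must cover more than one point in time")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return cov / var_x


def days_until_limit(samples: Sequence[Sample], limit_mb: float) -> Optional[float]:
    """Project days from the latest sample until the limit; None if not growing."""
    slope = growth_per_day(samples)
    if slope <= 0:
        return None
    latest_mb = samples[-1][1]
    return max(0.0, (limit_mb - latest_mb) / slope)
```

Alerting when the projection drops below some horizon (say, 14 days) turns "it will OOM eventually" into a ticket with a date on it, long before the restart shows up in a runbook.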

Incident 6: Triple Deploy — Three Teams, No Communication

Time: 16:45, Friday (naturally)

What happened:

Three teams deployed simultaneously on a Friday afternoon. None of them knew the others were deploying.

Team A deployed a new version of the API gateway with updated rate limiting rules.
Team B deployed a database migration that added a column and backfilled it, creating heavy write load for 20 minutes.
Team C deployed a new version of the search service with an updated Elasticsearch mapping.

Individually, each deployment was tested and safe. Together:

16:45 — Team B's migration started. Database write IOPS tripled. Query latency increased from 5ms to 80ms.

16:47 — Team A's new rate limiting rules used a Redis counter per user per endpoint. The database latency caused more retries from the frontend. Each retry incremented the Redis counters again, and the extra Redis load on top of the slow database pushed overall request processing time up even further.

16:48 — Team C's Elasticsearch mapping change triggered a re-index. Elasticsearch CPU hit 95%. Search queries started timing out.

16:50 — The combination: slow database + increased Redis load + dead search = cascading user-facing degradation. Error rate hit 8%. Latency P99 hit 12 seconds.

16:55 — PagerDuty fired. On-call engineer saw errors everywhere and couldn't identify a single root cause because there wasn't one. There were three.

17:00 to 17:45 — Each team independently rolled back, blaming the other teams' deployments. By 17:45, all three had rolled back and the system was stable. But now nobody knew which deployment was actually problematic, because all three were fine in isolation.

The following Monday: They redeployed one at a time, with 30-minute gaps. Each deployment succeeded without issues. The problem was the interaction, not any individual change.

Root cause: No deployment coordination. No shared deployment calendar. No system-wide view of concurrent changes.

Lessons:

Deploy freezes on Fridays exist for a reason. (#6: CI/CD)
A shared deployment channel (Slack, dedicated dashboard) where teams announce deployments prevents collisions. The cost: 30 seconds to post "deploying search service v2.4." The savings: 2 hours of incident response.
Canary deployments detect individual deployment problems. They don't detect interaction problems between simultaneous deployments. (#6: CI/CD)
Observability across services, not just within services, would have shown the three simultaneous changes in a single timeline. (#7: Observability)
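
A shared deployment calendar only helps if something reads it. Here is a minimal sketch of a collision check that a deploy tool could run before starting a rollout. The `Deployment` record and the idea that teams register a start and an expected end time are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass(frozen=True)
class Deployment:
    """Hypothetical calendar entry: who deploys what, and when."""
    team: str
    service: str
    start: datetime
    end: datetime


def overlapping(deploys: List[Deployment]) -> List[Tuple[Deployment, Deployment]]:
    """Return every pair of deployments whose time windows overlap."""
    ordered = sorted(deploys, key=lambda d: d.start)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start >= first.end:
                break  # sorted by start, so nothing later can overlap `first`
            pairs.append((first, second))
    return pairs
```

Fed the Friday schedule (16:45, 16:47, and 16:48 starts with windows of several minutes each), this returns three overlapping pairs. Fed Monday's one-at-a-time schedule with 30-minute gaps, it returns none.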

Incident 7: Token Expired — 45 Minutes Without the Ability to Deploy

Time: 09:15 AM, Wednesday (during an active incident)

What happened:

The search service had a bug that caused it to return empty results for queries containing non-ASCII characters. The fix, a one-line encoding change, was ready by 09:20. The engineer pushed to the branch, opened a PR, got approval, merged.

The CI/CD pipeline started. Build succeeded. Tests passed. Push to container registry... failed. Error: "authentication denied."

The GitHub App token used by CI/CD to push images to the container registry had expired 3 days ago. Nobody noticed because the last deployment was 5 days ago. The expiring-credentials alert existed but was routed to a Slack channel that the platform team had archived last month during a channel cleanup.

09:35 — The fix was merged. The pipeline couldn't deploy.

09:38 — Platform team alerted. They logged into the CI system to regenerate the token.

09:45 — The CI system's admin interface required MFA. The MFA recovery codes were in a shared password manager. The shared password manager required its own MFA. The person with the recovery codes was in a meeting.

10:00 — Token regenerated. Pipeline restarted. Image pushed. Deployment started.

10:05 — Search service deployed with the fix. Incident resolved.

45 minutes of deployment inability during an active user-facing incident. The bug fix was ready at 09:20. Users experienced empty search results until 10:05.

Root cause: Expired CI/CD credential. Failed alerting (archived channel). MFA chain requiring a specific person.

Lessons:

CI/CD credentials are critical infrastructure. Monitor expiration dates with 30-day, 14-day, and 3-day warnings sent to a channel that can't be archived. (#6: CI/CD)
Emergency deployment path: have a documented manual deployment procedure that doesn't depend on CI/CD. A shell script, a documented kubectl sequence, anything. When the pipeline is down, you need an alternative.
MFA recovery access should be available to at least 2 people on every team. Single-person dependencies for infrastructure access are single points of failure.
The credential had been expiring with a 90-day cycle for 2 years. Nobody had automated the rotation because "someone always renews it." Until nobody did.
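
The 30-day, 14-day, and 3-day warnings from the first lesson reduce to a date comparison. A minimal sketch, assuming the expiry date of each credential is recorded somewhere a daily job can read it; the function name and message format are hypothetical.

```python
from datetime import date
from typing import Optional, Sequence


def expiry_warning(
    expires_on: date,
    today: date,
    thresholds: Sequence[int] = (30, 14, 3),
) -> Optional[str]:
    """Return a warning message if the credential is near or past expiry."""
    days_left = (expires_on - today).days
    if days_left < 0:
        return f"EXPIRED {-days_left} day(s) ago"
    # Every threshold at or above days_left has been crossed; report the tightest.
    crossed = [t for t in thresholds if days_left <= t]
    if not crossed:
        return None
    return f"expires in {days_left} day(s) (inside {min(crossed)}-day warning window)"
```

The check itself was never the hard part in this incident. The delivery was: the message needs to land somewhere that pages a human, and the job should also fail loudly when its destination channel no longer exists.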

The Pattern Across All Seven

Every incident shared three characteristics:

1. The failure was predictable. Split-brain during network partitions. Config typos without validation. Cache staleness during failover. DNS propagation delays. Memory leaks without monitoring. Deploy collisions without coordination. Token expiration without alerting. None of these are novel failure modes. All of them are documented. All of them have known mitigations.

2. The mitigation existed but was disabled, misconfigured, or ignored. The watchdog was turned off. The TTL wasn't set. The alert went to an archived channel. The runbook said "expected, no action needed." The tools were there. The process around the tools wasn't.

3. The blast radius was determined by detection time. The split-brain was detected in 8 minutes — painful but contained. The cache staleness went undetected for 6 hours — expensive. The memory leak was "managed" for 8 months — deeply wasteful. The faster you detect, the smaller the damage.

What Production Actually Teaches You

Production doesn't care about your architecture diagrams. It doesn't care that you used Kubernetes, or that your CI/CD pipeline has 14 stages, or that your observability stack cost $40,000 per month.

Production cares about:

Can you detect the problem? If your monitoring doesn't alert on data freshness, you won't know your cache is stale for 6 hours.
Can you diagnose the problem? If three teams deploy simultaneously, can you see all three changes in a single timeline?
Can you fix the problem? If your CI/CD token is expired, can you still deploy the hotfix?
Can you prevent the recurrence? If you write a postmortem but don't implement the action items, the same incident will happen again. And it will be worse, because now you can't say you didn't know.

Every technology in this series — PostgreSQL, Kafka, Redis, Kubernetes, Linux, CI/CD, observability, load balancers, distributed tracing — is a tool. Tools don't prevent incidents. Processes prevent incidents. Tools help you detect and recover.

The teams that have fewer incidents aren't using better technology. They're using the same technology with better processes: deployment coordination, credential rotation, data freshness monitoring, chaos testing, and postmortems that actually lead to changes.

End of Season 1

This has been "Great Stack to Doesn't Work" — a survival guide for when everything goes wrong in production.

Ten episodes. Nine bonus pieces. Zero best practices listicles. Because production isn't a list of best practices. It's a series of judgments you make at 3 AM when the system is broken and the documentation is wrong.

The only real best practice: when your phone rings at 3 AM, be someone who's read the failure modes before they happened. That's what this series was for.

Thanks for reading. See you in Season 2.

Great Stack to Doesn't Work — Season 1 Complete
Published: June 1 – July 7, 2026

Over to You

What's the most memorable 3 AM incident you've responded to? Which of the 7 incidents in this article resonated the most with your experience?

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.

This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.
