I’ve responded to more production incidents than I care to count. Some were five-minute fixes. Others kept me up for days. But every single one taught me something about how systems actually break — not how we think they break.
Here are the patterns I wish I’d recognized earlier.
1. Your Monitoring Tells You What Broke, Not Why
For the first twenty incidents I handled, I trusted my dashboards completely. CPU spiked? Must be a resource problem. Database slow? Must need more capacity.
I was treating symptoms, not causes.
Real example: We had API latency alerts firing. Dashboards showed database query times were normal, CPU was fine, and network looked good. Spent two hours checking everything the monitors told us to check.
The actual problem? A third-party service we called was timing out silently, and our retry logic was backing up requests. Our monitoring couldn’t see it because we weren’t measuring the right thing — external dependency health.
The lesson: Monitor dependencies as aggressively as you monitor your own services. If you call it, you need visibility into it.
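To make that concrete, here is a minimal sketch of what "visibility into it" can look like: wrap every outbound call with its own latency and error metrics. It assumes the prometheus_client and requests libraries; the metric and function names are just illustrative.

```python
# Sketch: instrument external dependencies the same way you instrument yourself.
import time
import requests
from prometheus_client import Counter, Histogram

DEP_LATENCY = Histogram(
    "external_dependency_latency_seconds",
    "Latency of calls to external dependencies",
    ["dependency"],
)
DEP_ERRORS = Counter(
    "external_dependency_errors_total",
    "Failed calls to external dependencies",
    ["dependency"],
)

def call_dependency(name: str, url: str, timeout: float = 2.0):
    """Make an outbound call and record its latency and failures per dependency."""
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp
    except requests.RequestException:
        DEP_ERRORS.labels(dependency=name).inc()
        raise
    finally:
        DEP_LATENCY.labels(dependency=name).observe(time.monotonic() - start)
```

With that in place, "the third party is timing out" shows up on a dashboard instead of being invisible until someone guesses it.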
2. Timeouts Are Your Friend Until They’re Not
Early in my SRE journey, I set generous timeouts everywhere. “Better to wait than to fail fast,” I thought.
That approach nearly took down our entire service during a database incident.
When our primary database started struggling, our application waited patiently — 30-second timeouts on every query. Requests piled up. Thread pools exhausted. Memory leaked. What started as a database performance issue cascaded into a complete service outage.
The lesson: Aggressive timeouts with proper circuit breakers beat patient waiting every time. Fail fast, fail explicitly, and give your system room to breathe.
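As an illustration, here is a toy circuit breaker in Python: after a handful of consecutive failures it opens and rejects calls immediately until a cooldown passes, so a struggling dependency can't quietly exhaust your thread pools. The thresholds are placeholders, not recommendations.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: fail fast while a dependency is known-bad."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # seconds before allowing a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: let one trial call through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```

Pair this with short timeouts on the wrapped call and a slow database produces quick, explicit errors instead of a pile-up.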
3. Autoscaling Saves You Until It Kills You

I wrote about this in detail after our AWS autoscaling incident, but it’s worth repeating: automation that works 99% of the time can make the 1% catastrophic.
During a regional AWS issue, our autoscaling detected unhealthy instances and kept spinning up replacements — in the same failing region. We burned through our service limits trying to “fix” a problem that wasn’t ours to fix.
The lesson: Every automation needs a kill switch. Know how to disable autoscaling, circuit breakers, and retry logic when the system’s fundamental assumptions are wrong.
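One way to do that, sketched below, is to gate every automated remediation behind a flag a human can flip. The file path and function names here are hypothetical; a feature-flag service or a parameter store entry works the same way.

```python
import os

# Hypothetical kill switch: if this file exists, automation stands down.
KILL_SWITCH_PATH = "/etc/ops/disable-auto-remediation"

def auto_remediation_enabled() -> bool:
    return not os.path.exists(KILL_SWITCH_PATH)

def replace_unhealthy_instance(instance_id: str) -> None:
    if not auto_remediation_enabled():
        print(f"kill switch set, skipping replacement of {instance_id}")
        return
    print(f"replacing {instance_id}")  # stand-in for the real remediation call
```

The point isn't the mechanism; it's that "turn the automation off" is a documented, one-step action anyone on call can take.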
4. The Absence of Errors Is Not Health
This one hurt. We had a payment processing service that looked perfect: no errors, latency within SLO, all green dashboards.

Turns out it had silently stopped processing payments three hours earlier due to a config change. No errors because no requests were reaching the payment logic. Everything looked healthy because we were measuring the wrong thing.
The lesson: Measure business-level metrics, not just technical ones. For a payment service, track “successful payments per minute,” not just “HTTP 200 responses.”
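A rough sketch of what that looks like in code, assuming prometheus_client; the metric names and the charge() helper are illustrative:

```python
from prometheus_client import Counter

PAYMENTS_SUCCEEDED = Counter(
    "payments_succeeded_total", "Payments that completed end to end"
)
PAYMENTS_FAILED = Counter(
    "payments_failed_total", "Payments that errored or were rejected"
)

def charge(payment):
    """Stand-in for the real payment gateway call."""

def process_payment(payment):
    try:
        charge(payment)
        PAYMENTS_SUCCEEDED.inc()
    except Exception:
        PAYMENTS_FAILED.inc()
        raise

# Then alert when the business rate flatlines, e.g. in Prometheus:
#   rate(payments_succeeded_total[5m]) == 0
```

If requests never reach the payment logic at all, this rate drops to zero even while every HTTP dashboard stays green.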
5. Your Biggest Risk Is What Changed Recently
I could probably retire if I had a dollar for every incident that started with “we didn’t change anything” and ended with “oh wait, we deployed this yesterday.”
The pattern is always the same:
- Deploy goes out Friday afternoon
- Looks fine for 24 hours
- Something tips over Sunday night
- Monday morning panic

The lesson: Keep an audit trail of everything: deployments, config changes, infrastructure modifications. When things break, start with “what changed?” not “what’s wrong?”
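One low-tech way to get that audit trail is an append-only, structured change log that every deploy, config change, and infrastructure modification writes to. The path and fields below are illustrative:

```python
import json
import time

AUDIT_LOG = "/var/log/change-audit.jsonl"  # hypothetical location

def record_change(kind: str, description: str, actor: str) -> None:
    """Append one structured record per change: deploy, config, or infra."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "kind": kind,
        "description": description,
        "actor": actor,
    }
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")

# During an incident, tailing this file answers "what changed?" in seconds.
```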
6. Redundancy Only Works If You Test It
We had multi-region redundancy. Database replicas. Backup systems. All the boxes checked.
Then our primary region had issues, and we discovered our failover hadn’t been tested in eight months. It didn’t work. The configurations had drifted. The DNS setup was stale.
Our redundancy was theoretical, not actual.
The lesson: Chaos engineering isn’t optional. If you haven’t tested your failover in the last 90 days, assume it doesn’t work.
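If you want to enforce that, even a small check like the one below (the file path and format are assumptions) can fail CI or page someone when the last recorded drill is older than 90 days:

```python
import datetime
import sys

LAST_DRILL_FILE = "last_failover_drill.txt"  # contains an ISO date, e.g. 2024-01-15
MAX_AGE_DAYS = 90

def main() -> int:
    with open(LAST_DRILL_FILE) as f:
        last_drill = datetime.date.fromisoformat(f.read().strip())
    age = (datetime.date.today() - last_drill).days
    if age > MAX_AGE_DAYS:
        print(f"Last failover drill was {age} days ago; assume failover does not work.")
        return 1
    print(f"Last failover drill was {age} days ago.")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```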
7. Logs Are Useless Until You Need Them Desperately
I used to think comprehensive logging was overkill. “We’ll add logging when we need it.”
Then I’d be in the middle of an incident, desperately needing to know what happened five minutes ago, and our logs would tell me nothing useful.
The lesson: Log liberally with structured data. When you’re debugging at 2 AM, you’ll want timestamps, request IDs, user context, and state changes, not generic “something happened” messages.
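Here is a minimal sketch of what I mean by structured: one JSON object per line carrying the fields you will actually want at 2 AM. The field names are illustrative.

```python
import json
import logging
import sys

logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
logger.addHandler(handler)

def log_event(event: str, request_id: str, user_id: str, **fields) -> None:
    """Emit one timestamped, machine-parseable record per interesting state change."""
    logger.info(json.dumps({
        "event": event,
        "request_id": request_id,
        "user_id": user_id,
        **fields,
    }))

log_event("order_state_change", request_id="req-123", user_id="u-42",
          old_state="pending", new_state="paid")
```

During an incident you can grep by request ID and reconstruct exactly what a single request did, instead of guessing from "something happened" messages.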
8. The Hardest Incidents Are Silent Degradations
Sudden failures are obvious. Silent degradations are insidious.
We once had a memory leak that took three weeks to notice. Performance degraded so gradually that users complained about “feeling slower” but nothing triggered alerts. By the time we caught it, we were running at 40% capacity with no idea why.
The lesson: Track trends, not just thresholds. If your P95 latency has been creeping up for two weeks, that’s an incident waiting to happen.
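A rough way to catch that creep, sketched here with made-up numbers: fit a line to a window of recent P95 samples and flag a sustained upward slope, even when every individual sample is still under the alert threshold.

```python
from statistics import linear_regression  # Python 3.10+

def p95_is_creeping(samples_ms: list[float], max_slope_ms_per_sample: float = 0.5) -> bool:
    """Return True if p95 latency is trending upward faster than the allowed slope."""
    xs = list(range(len(samples_ms)))
    slope, _intercept = linear_regression(xs, samples_ms)
    return slope > max_slope_ms_per_sample

# Two weeks of daily p95s that never cross a 500 ms threshold but keep climbing:
daily_p95 = [210, 215, 222, 230, 236, 245, 251, 260, 268, 277, 285, 293, 301, 310]
print(p95_is_creeping(daily_p95))  # True
```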
9. Your Recovery Plan Assumes Too Much
Every recovery plan I’ve written assumed we’d have:
- Access to all our systems
- Working communication channels
- The right people available
- Documentation that’s current

Reality is messier. I’ve debugged incidents where Slack was down, our monitoring was affected by the same issue breaking production, and the person who built the system was on vacation.
The lesson: Your incident response plan should work when everything is broken, including your incident response tools.
10. Post-Mortems Without Action Items Are Therapy Sessions
I’ve sat through dozens of post-mortems that ended with “we learned a lot” and zero concrete changes.
The incidents that don’t repeat are the ones where we:
- Wrote down specific action items
- Assigned owners with deadlines
- Actually followed through

The lesson: Every post-mortem should produce at least one pull request. If you’re not changing code, monitoring, or process, you’re not really learning.
What This Means for How You Build
These patterns have fundamentally changed how I approach system design:
I design for failure, not uptime. Every component assumes its dependencies will fail and handles it gracefully.
I measure what matters to users, not just what’s easy to measure technically.
I automate carefully, with kill switches and manual overrides for when my assumptions are wrong.
The Meta-Lesson
The biggest thing 100+ incidents taught me? Production will humble you. The system you think is rock-solid will break in ways you never imagined. The edge case you dismissed will become your 2 AM wake-up call.
But each incident makes you better. You learn what actually matters versus what you thought mattered. You build better systems because you’ve seen how the old ones broke.
That’s worth a few sleepless nights.
