What DDIA taught me about reliability

#architecture #distributedsystems #sre #systemdesign

I used to think a reliable system was simply one that doesn't crash.

Designing Data-Intensive Applications (DDIA) challenges that from its first pages.

Failures are expected. Databases go down. Networks fail. Third-party APIs stop responding.
What matters is not whether failures happen — it's how the system behaves when they do.

This hit me immediately when I thought about a service instance I've been working with.

If it suddenly became unavailable — what would happen?
Would the app degrade gracefully, serving cached data or a fallback?
Or would the entire request chain fail?

The honest answer: I hadn't thought about it carefully enough.
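
To make the question concrete for myself, here is a minimal sketch of what "degrade gracefully" could look like in Python. It is not from the book, and every name in it (`FallbackCache`, `fetch_with_fallback`, the `fetch` callable, the five-minute `max_age_seconds`) is a hypothetical I made up for illustration. It assumes the dependency signals trouble by raising `ConnectionError` or `TimeoutError`.

```python
import time
from typing import Callable, Optional, Tuple


class FallbackCache:
    """Remembers the last good value per key (hypothetical helper)."""

    def __init__(self, max_age_seconds: float = 300.0) -> None:
        # Assumption: data older than this is too stale to serve at all.
        self.max_age_seconds = max_age_seconds
        self._entries: dict = {}

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (time.monotonic(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.max_age_seconds:
            return None
        return value


def fetch_with_fallback(
    key: str,
    fetch: Callable[[str], str],
    cache: FallbackCache,
    default: str,
) -> Tuple[str, str]:
    """Return (value, source), where source is 'live', 'cache' or 'default'."""
    try:
        value = fetch(key)
    except (ConnectionError, TimeoutError):
        # The dependency is unavailable: serve the last good value if we
        # have one, otherwise a static default, instead of failing the
        # whole request chain.
        cached = cache.get(key)
        if cached is not None:
            return cached, "cache"
        return default, "default"
    # Success: remember the value for the next outage.
    cache.put(key, value)
    return value, "live"
```

Returning the source alongside the value is a deliberate choice in this sketch: the caller can tell the user (or a dashboard) that the data may be stale, rather than silently pretending everything is fine.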

Key takeaway from chapter 1:

Reliability is not about preventing every fault.
It's about designing systems that keep operating when faults occur, so that a fault in one component doesn't become a failure of the whole system.

Hardware faults, software errors, human mistakes — they're not edge cases.
They're the default state you design around.