Designer, developer, & entrepreneur. Founder of Screenity + other ventures.
For a long time, I thought my job as a developer was to make systems work.
Now I believe my real job is to make systems fail clearly.
This article is about a shift that quietly changed how I design architectures, APIs, and even teams — especially when building complex products that look fine… right until they don’t.
This is not a beginner post.
This is about decisions you only care about after you’ve shipped, broken things, fixed them at 3 a.m., and promised yourself “never again.”
The Problem: Complexity Hides Behind “Working Code”
Most systems don’t fail because of bad code.
They fail because:
- assumptions are implicit
- constraints are undocumented
- failure modes are invisible
- success paths are optimized while failure paths are ignored
In early versions of one of my products, everything “worked”:
- APIs responded
- UI was smooth
- metrics were green
Until a single edge case cascaded into:
- partial data corruption
- retries amplifying load
- logs that told stories, not truth
The system didn’t fail loudly.
It failed politely.
That was the real problem.
The Shift: Designing for Failure First
Instead of asking:
“How do we make this scalable?”
I started asking:
“How does this break — and how do we know immediately?”
This led me to adopt a few non-negotiable design principles.
1. Constraints Are Features, Not Limitations
Every complex system has constraints.
The mistake is pretending they don’t exist.
Examples of explicit constraints I now write down before coding:
- Maximum request size (hard fail, not best-effort)
- Acceptable staleness of data
- Timeout budgets per dependency
- Retry limits (with exponential backoff or none at all)
- Ownership boundaries (this service does not fix that service’s bugs)
If a constraint isn’t explicit, it becomes folklore.
Folklore doesn’t survive outages.
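To make that concrete, here is a minimal sketch of constraints written down as code rather than kept in people's heads. Everything in it is an assumption for illustration: the `Constraints` fields, their default values, and the `TransientError` type are hypothetical, not taken from any real product.

```python
from __future__ import annotations

import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Constraints:
    """Hypothetical constraint set; every default is illustrative."""

    max_request_bytes: int = 1_000_000  # hard fail above this size
    max_staleness_s: float = 30.0       # acceptable staleness of data
    dependency_timeout_s: float = 2.0   # timeout budget per dependency
    max_retries: int = 3                # 0 means no retries at all
    base_backoff_s: float = 0.1         # first retry delay, doubled each time


class ConstraintViolation(Exception):
    """Raised when an explicit constraint is broken."""


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def check_request_size(body: bytes, c: Constraints) -> None:
    # Hard fail, not best-effort: oversized requests are never truncated.
    if len(body) > c.max_request_bytes:
        raise ConstraintViolation(
            f"request is {len(body)} bytes; limit is {c.max_request_bytes}"
        )


def backoff_delays(c: Constraints) -> list[float]:
    """Delay before each retry: base * 2**attempt, at most max_retries entries."""
    return [c.base_backoff_s * (2 ** attempt) for attempt in range(c.max_retries)]


def call_with_retries(
    fn: Callable[[], T],
    c: Constraints,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then retry transient failures within the retry limit."""
    last_error: TransientError | None = None
    for delay in [None, *backoff_delays(c)]:
        if delay is not None:
            sleep(delay)
        try:
            return fn()
        except TransientError as exc:
            last_error = exc  # anything else propagates immediately
    assert last_error is not None
    raise last_error
```

Because the retry limit and backoff live in one frozen object, "how many times do we retry?" has exactly one answer, and changing it is a reviewed code change rather than folklore.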
2. Failure Modes Must Be Named
If you can’t name how something fails, you can’t reason about it.
I now document failure modes like this:
- Upstream unavailable → return cached degraded response
- Partial write success → emit compensating event
- Client misuse → reject loudly with actionable error
- Unknown state → stop processing, alert humans
This isn’t pessimism.
This is engineering honesty.
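The four failure modes above can be sketched as a small playbook in Python. This is one possible encoding, not a prescribed one: the enum names, the `PartialWriteError` type, and the exception-to-mode mapping in `classify` are all assumptions made for the example.

```python
from enum import Enum


class FailureMode(Enum):
    UPSTREAM_UNAVAILABLE = "upstream unavailable"
    PARTIAL_WRITE = "partial write success"
    CLIENT_MISUSE = "client misuse"
    UNKNOWN_STATE = "unknown state"


class Reaction(Enum):
    SERVE_CACHED_DEGRADED = "return cached degraded response"
    EMIT_COMPENSATING_EVENT = "emit compensating event"
    REJECT_WITH_ACTIONABLE_ERROR = "reject loudly with actionable error"
    STOP_AND_ALERT = "stop processing, alert humans"


class PartialWriteError(Exception):
    """Hypothetical error raised when only part of a write succeeded."""


PLAYBOOK = {
    FailureMode.UPSTREAM_UNAVAILABLE: Reaction.SERVE_CACHED_DEGRADED,
    FailureMode.PARTIAL_WRITE: Reaction.EMIT_COMPENSATING_EVENT,
    FailureMode.CLIENT_MISUSE: Reaction.REJECT_WITH_ACTIONABLE_ERROR,
    FailureMode.UNKNOWN_STATE: Reaction.STOP_AND_ALERT,
}

# Fail at import time if someone names a failure mode without a reaction.
_missing = set(FailureMode) - set(PLAYBOOK)
if _missing:
    raise RuntimeError(f"failure modes without a reaction: {_missing}")


def classify(exc: Exception) -> FailureMode:
    """Map an exception to a named failure mode (illustrative mapping only)."""
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureMode.UPSTREAM_UNAVAILABLE
    if isinstance(exc, PartialWriteError):
        return FailureMode.PARTIAL_WRITE
    if isinstance(exc, ValueError):
        return FailureMode.CLIENT_MISUSE
    # Anything unrecognized is treated as unknown state, never guessed at.
    return FailureMode.UNKNOWN_STATE


def reaction_for(exc: Exception) -> Reaction:
    return PLAYBOOK[classify(exc)]
```

The useful property is the default branch: an exception nobody anticipated lands in `UNKNOWN_STATE` and stops processing, instead of being quietly handled as if it were something familiar.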
3. Observability Is Not Logging
Logs are narratives.
Metrics are aggregates.
Traces are timelines.
None of them alone tells the whole truth.
For critical paths, I ask:
- What signal tells me this is broken?
- How long between breakage and detection?
- Can I tell who is affected without guessing?
If the answer is “we’ll inspect logs,” the system is lying to me.
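As a rough illustration, the three questions above can be answered by one small signal object instead of a log search. This is an in-memory sketch under stated assumptions: `CriticalPathSignal`, its threshold, and its window are hypothetical, and a real system would use its own metrics or alerting stack.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CriticalPathSignal:
    """Hypothetical signal for one critical path; values are illustrative."""

    threshold: int = 3      # failures inside the window that mean "broken"
    window_s: float = 60.0  # how far back a failure still counts
    _failures: list[tuple[float, str]] = field(default_factory=list, init=False)

    def record_failure(self, timestamp: float, user_id: str) -> None:
        self._failures.append((timestamp, user_id))

    def _recent(self, now: float) -> list[tuple[float, str]]:
        return [(t, u) for t, u in self._failures if now - self.window_s <= t <= now]

    def is_broken(self, now: float) -> bool:
        """What signal tells me this is broken?"""
        return len(self._recent(now)) >= self.threshold

    def seconds_since_first_failure(self, now: float) -> float | None:
        """How long between breakage and detection (None if nothing recent)?"""
        recent = self._recent(now)
        if not recent:
            return None
        return now - min(t for t, _ in recent)

    def affected_users(self, now: float) -> set[str]:
        """Who is affected, without guessing?"""
        return {u for _, u in self._recent(now)}
```

A short usage example:

```python
signal = CriticalPathSignal(threshold=2, window_s=10.0)
signal.record_failure(100.0, "alice")
signal.record_failure(104.0, "bob")
print(signal.is_broken(105.0), signal.affected_users(105.0))
```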
4. APIs Should Be Unforgiving (to Protect the System)
“Be liberal in what you accept” sounds nice — until it becomes technical debt with interest.
I now design APIs that:
- validate aggressively
- reject ambiguous input
- return errors that explain what to fix, not just what failed
Kind APIs protect users.
Strict APIs protect systems.
Great APIs do both.
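Here is a minimal sketch of what strict and kind can look like together. The endpoint, the `CreateExport` payload, its allowed formats, and the limit range are all hypothetical; the point is only that every rejection says what to fix.

```python
from dataclasses import dataclass

ALLOWED_KEYS = ("user_id", "format", "limit")
ALLOWED_FORMATS = ("csv", "json")
MAX_LIMIT = 1000


class ValidationError(Exception):
    """Carries what failed and what the caller should do about it."""

    def __init__(self, field: str, problem: str, fix: str) -> None:
        super().__init__(f"{field}: {problem}. {fix}")
        self.field = field
        self.problem = problem
        self.fix = fix


@dataclass(frozen=True)
class CreateExport:
    user_id: str
    format: str
    limit: int


def parse_create_export(payload: dict) -> CreateExport:
    unknown = sorted(set(payload) - set(ALLOWED_KEYS))
    if unknown:
        raise ValidationError(
            ", ".join(unknown), "unknown field",
            f"Send only these fields: {', '.join(ALLOWED_KEYS)}",
        )
    for key in ALLOWED_KEYS:
        if key not in payload:
            raise ValidationError(key, "missing field", f"Add '{key}' to the request")

    user_id = payload["user_id"]
    if not isinstance(user_id, str) or not user_id.strip():
        raise ValidationError("user_id", "must be a non-empty string",
                              "Pass the user's ID as a string")

    fmt = payload["format"]
    # No silent normalization: "CSV" is rejected rather than lowercased.
    if fmt not in ALLOWED_FORMATS:
        raise ValidationError("format", f"got {fmt!r}",
                              f"Use one of: {', '.join(ALLOWED_FORMATS)}")

    limit = payload["limit"]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(limit, bool) or not isinstance(limit, int):
        raise ValidationError("limit", f"got {limit!r}",
                              "Send an integer, not a string or boolean")
    if not 1 <= limit <= MAX_LIMIT:
        raise ValidationError("limit", f"got {limit}",
                              f"Use a value between 1 and {MAX_LIMIT}")

    return CreateExport(user_id=user_id, format=fmt, limit=limit)
```

Rejecting `"CSV"` or `"10"` feels harsh, but each accepted variant becomes behavior some client depends on, which is where the interest on that debt comes from.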
5. Teams Are Part of the Architecture
This one took me the longest to accept.
If:
- ownership is fuzzy
- responsibility is shared by everyone
- failures are “someone else’s layer”
Then the system will reflect that ambiguity.
Clear ownership boundaries reduce:
- silent failures
- duplicated fixes
- emotional load during incidents
Technical architecture and social architecture are inseparable.
What Changed for Me
After adopting this mindset:
- incidents became rarer and, more importantly, shorter
- debugging went from “what is happening?” to “this exact thing failed”
- onboarding new developers became faster
- my own cognitive load dropped significantly
The system didn’t become simpler.
It became more honest.
Conclusion
Complexity is unavoidable.
Confusion is optional.
Designing for failure doesn’t make you negative.
It makes you reliable.
If your system fails:
- clearly
- quickly
- and in ways you already understand
You’re doing something right.