DEV Community

Yashodhan Singh


Resilience > Perfection

Software engineering is mostly about decisions under uncertainty.

Every framework you choose, every abstraction you build, every quick fix you push to production is one of those decisions.

The hard part isn’t writing code. It’s making decisions without perfect information.

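A useful mental model for this is the Rumsfeld Matrix, named after Donald Rumsfeld’s 2002 remarks about known knowns, known unknowns, and unknown unknowns. With a fourth quadrant added by later commentators, it splits knowledge into four categories: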

Rumsfeld Matrix

  • Known Knowns: things we know
  • Known Unknowns: things we know we don’t know
  • Unknown Knowns: things we know but have forgotten or buried
  • Unknown Unknowns: things we don’t even realize we don’t know

Here’s how those quadrants show up in engineering work.


1. Known Knowns

“Use the linter.”

Examples

  • Choosing between a for loop and .map
  • Configuring a database connection string

These are the obvious decisions. You already understand the trade-offs, and the outcome is predictable. The best way to handle known knowns is to automate them. Style guides, linters, and conventions remove the need to think about them at all, freeing engineers to focus on the problems that actually require judgment.


2. Known Unknowns

“Let’s spike it.”

Examples

  • Which database fits a multi-tenant SaaS?
  • How do we meet GDPR logging requirements?

This is where you’re aware of the gap in your knowledge. You know you don’t know, which means you can plan for it. Known unknowns are best approached with structured exploration: build a proof of concept, run a benchmark, or consult someone with domain expertise. The key is to turn uncertainty into learning, and to capture that knowledge so it doesn’t slip back into ambiguity later.


3. Unknown Knowns

“If it works, don’t touch it.”

Examples

  • A scaling fix an engineer added three years ago
  • An API that looks weird but has a buried reason

This is where legacy code thrives. The system works, but the original context has been forgotten. Maybe someone added a hack to prevent cascading failures, or maybe the odd-looking API was designed around an old constraint that isn’t obvious today. Unknown knowns create risk because you might break something you didn’t even realize had a purpose.

The best antidote is to preserve knowledge with documentation, architecture decision records (ADRs), and mentorship so that wisdom doesn’t quietly disappear over time. This doesn’t mean you should never touch legacy code. It means you approach it with care, seeking to understand before rewriting. Refactoring from knowledge leads to progress; refactoring from ignorance often reopens old wounds.


4. Unknown Unknowns

“We’ll find out in production.”

Examples

  • A user signs up with 🚀💥🔥 as their username and the system breaks
  • A sudden traffic spike from going viral melts your caching layer
  • A timezone edge case shifts reports by a full day

These are the landmines. Unknown unknowns are the problems you don’t even know exist until they explode in your face. You thought you had tested thoroughly, but reality exposes blind spots you never considered. The emoji username bug is a perfect case: Unicode normalization, surrogate pairs, and multi-byte storage were never even on your radar.
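
To make the emoji case concrete, here is a minimal Python sketch. The helper name `describe_length` is made up for illustration. It shows how the same three-emoji username has three different "lengths" depending on what you count, and how two strings that look identical can still compare as different until they are normalized.

```python
import unicodedata


def describe_length(text: str) -> dict[str, int]:
    """Count the same string three ways (hypothetical helper)."""
    return {
        # What Python's len() reports.
        "code_points": len(text),
        # What a byte-limited storage column has to hold.
        "utf8_bytes": len(text.encode("utf-8")),
        # What JavaScript's .length reports (surrogate pairs count twice).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


username = "🚀💥🔥"
lengths = describe_length(username)
print(lengths)  # {'code_points': 3, 'utf8_bytes': 12, 'utf16_code_units': 6}

# Two spellings of "é": one precomposed code point vs. "e" plus a combining accent.
composed = "\u00e9"
decomposed = "e\u0301"
print(composed == decomposed)  # False: different code points
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A validation rule like "usernames are at most 8 characters" can pass in one layer and fail in another simply because each layer counts differently.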

Unknown unknowns can’t be eliminated, but you can prepare for them: observability, tests with odd inputs, chaos drills, and a blameless culture make your systems and your team more adaptive when the unexpected happens.
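
The timezone example from the list above is easy to reproduce, which also makes it easy to turn into an odd-input test once it has bitten you. This sketch assumes a fixed UTC-8 offset as a stand-in for a real zone; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset stands in for a real zone such as US Pacific.
# A production system would use zoneinfo to get daylight saving rules right.
PACIFIC = timezone(timedelta(hours=-8))

# An event recorded at 02:00 UTC on 1 January.
event_utc = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)

# The same instant lands on different calendar days depending on the zone.
utc_report_day = event_utc.date()
local_report_day = event_utc.astimezone(PACIFIC).date()

print(utc_report_day)    # 2024-01-01
print(local_report_day)  # 2023-12-31
```

If one service groups by the UTC date and another by the local date, the same event shows up in two different daily reports.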


How to Build Resilient Software

Resilience is the art of surviving the unknown. It doesn’t mean overengineering or locking everything down. It means buying yourself flexibility so that when surprises happen, they don’t turn into catastrophes.

Practical patterns

  • Preserve optionality: When you change a database schema, don’t flip everything at once. Support old and new fields side by side, migrate in phases, and only remove the old path when you’re confident nothing depends on it.
  • Capture history before optimizing: If you don’t have a fully fleshed-out auditing system, start by keeping every version of what matters, like storing every CSV upload. It may feel heavy-handed, but it means you won’t lose information. Later, when you understand the use cases better, you can design a slimmer system.
  • Fail gracefully: Don’t assume success. Timeouts, retries, circuit breakers, and clear error handling keep failures local instead of global.
  • Embrace feedback loops: Logging, monitoring, and alerts aren’t just for outages. They help you discover edge cases you didn’t anticipate and feed those back into design and testing.
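
As one illustration of failing gracefully, here is a rough sketch of a circuit breaker. The class, its parameters, and the `flaky` function are all hypothetical, and a real service would more likely use an established library, but the sketch shows the core idea: after repeated failures, stop calling the dependency for a while instead of piling on.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency is cooling down")
            # Cool-down elapsed: allow one trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the count
        return result


def flaky() -> str:
    raise TimeoutError("upstream timed out")


breaker = CircuitBreaker(max_failures=2, reset_after=60.0)
outcomes = []
for _ in range(3):
    try:
        breaker.call(flaky)
    except CircuitOpenError:
        outcomes.append("skipped")
    except TimeoutError:
        outcomes.append("failed")
print(outcomes)  # ['failed', 'failed', 'skipped']
```

The third call never reaches the failing dependency, which is what keeps a local failure from becoming a global one.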

Resilience rarely looks elegant at first. As engineers, we often aim for the most optimal, minimal, or clean design. But resilience requires flexibility, and flexibility often comes with a few rough edges. Temporary duplication, layered schemas, storing too much data, or supporting multiple code paths might feel like antipatterns. In reality, they are often what keep systems safe until the unknowns are better understood.
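
Supporting old and new fields side by side can look like the sketch below. The field names (`first_name`, `last_name`, `display_name`) and both functions are invented for illustration. The point is that readers tolerate either shape and writers produce both until the migration is finished.

```python
from typing import Any, Mapping


def read_display_name(row: Mapping[str, Any]) -> str:
    """Prefer the new field; fall back to the old pair during the migration."""
    new_value = row.get("display_name")
    if new_value:
        return new_value
    parts = [row.get("first_name"), row.get("last_name")]
    return " ".join(p for p in parts if p)


def write_display_name(row: dict[str, Any], first: str, last: str) -> None:
    """Write both shapes so old and new readers keep working."""
    row["first_name"] = first
    row["last_name"] = last
    row["display_name"] = f"{first} {last}".strip()


legacy_row = {"first_name": "Ada", "last_name": "Lovelace"}  # not yet migrated
migrated_row: dict[str, Any] = {}
write_display_name(migrated_row, "Grace", "Hopper")

print(read_display_name(legacy_row))    # Ada Lovelace
print(read_display_name(migrated_row))  # Grace Hopper
```

The duplication is deliberate and temporary: once nothing reads the old fields, the fallback branch and the extra writes can be deleted.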

Resilience isn’t about perfection. It’s about giving your system the ability to bend without breaking, and your team the confidence to make changes even in the face of uncertainty.


Why This Matters

Many of the most painful failures come from the bottom half of the matrix: forgotten wisdom and surprise landmines. By naming these categories, the Rumsfeld Matrix gives us a way to think about uncertainty in engineering.

Great engineers don’t try to erase uncertainty. They accept that surprises will always show up and build systems and teams that can adapt when they do.
