We often joke that software is usually implemented in two steps: the first 80% of the time is spent on making it work, and then the second 80% of the time is spent on making it work well. People mistake demos, proofs of concept, and walking skeletons for products because the optimistic path is often realized in full, so under ideal lab conditions, a PoC behaves just like the full product.
At Saleor, where I serve as CTO, we spend a significant part of our engineering effort embracing the different failure states and making sure the unhappy paths are covered as well as the happy ones.
Embracing the Failure
Because it does not matter how good your software is or how expensive your hardware is, something will eventually break. The only systems that never break are ones that are never used. Amazon's AWS spends more money on preventive measures than you will ever be able to, and yet, a major outage took out the entire us-east-1 region just last October. In 2016, the world's largest particle accelerator, CERN's Large Hadron Collider, was taken offline by a single weasel. Google's Chromecast service was down for days because someone forgot to renew an intermediate CA certificate, something that needs to be done once every 10 years.
The question is not if but when. Reliability is both about pushing that point as far out as practically possible and about planning for what happens when it inevitably arrives. And both suffer from brutally diminishing returns.
Every additional "nine" in your uptime—getting from 90% to 99%, from 99% to 99.9%, and so on—requires ten times the resources of the previous one. Getting from one nine to two is usually trivial and gives you roughly 33 days of additional uptime per year. The next step is ten times as much work for only 3.3 additional days. Then it's even more expensive and results in just under 8 hours of additional uptime. You then get to 47 minutes, 4.7 minutes, 28 seconds, and so on. At some point the cost of getting to the next step exceeds the losses from being unavailable.
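The arithmetic behind those diminishing returns is easy to check yourself (assuming a 365-day year):

```python
# Yearly downtime at each availability level and the uptime gained
# by adding one more "nine" of availability.
SECONDS_PER_YEAR = 365 * 24 * 3600

def downtime_seconds(nines: int) -> float:
    """Seconds of downtime per year at an availability of 1 - 10**-nines."""
    return SECONDS_PER_YEAR * 10 ** -nines

for nines in range(1, 7):
    gain_h = (downtime_seconds(nines) - downtime_seconds(nines + 1)) / 3600
    print(f"{nines} -> {nines + 1} nines: gains {gain_h:.2f} hours of uptime/year")
```

Each step recovers exactly one tenth of the downtime the previous step did, while costing roughly ten times as much.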
It's similar with your firefighting tools. With relatively simple measures, you can cut the time a simple fix takes from multiple business days down to one. It takes more expensive tools, stricter procedures, and paid on-call duty to guarantee a same-day attempt at a fix. Shortening it further requires investing in even more specialized (and costlier) tools, better training for engineers, and a lot of upfront work on observability. And again, at some point the cost of lowering the downtime even further is guaranteed to exceed the cost of any prevented downtime.
Some of the component failures you'll encounter will be self-inflicted, because one day you'll discover that a database server needs to be brought offline to upgrade it to a newer version that fixes a critical CVE.
Given all the above, the pragmatic approach dictates that instead of trying to achieve the impossible, we should build systems that anticipate failures, and, ideally, recover from them without human intervention. While every component's availability is capped by the product of all the SLOs of its direct dependencies, the larger system can be built to tolerate at least some failing components.
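That availability ceiling is simply the product of the dependency SLOs, assuming their failures are independent. A quick sketch with made-up numbers:

```python
from math import prod

# A service that must reach all of its dependencies on every request
# is only as available as the product of their SLOs (illustrative values).
dependency_slos = {"database": 0.999, "auth": 0.9995, "payment_api": 0.995}

ceiling = prod(dependency_slos.values())
print(f"availability ceiling: {ceiling:.4%}")
```

Even with three well-behaved dependencies, the hard ceiling lands below the weakest SLO, which is why tolerating failing components beats chaining them.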
The CAP Theorem
The CAP theorem dictates that any distributed stateful system can achieve at most two of the three guarantees: consistency, availability, and partition tolerance.
What is a distributed stateful system? Anything that stores any data and consists of more than one component. A shell script accessing a database is such a system, and so is a Kubernetes service talking to a serverless database.
The consistency guarantee demands that every time the system returns data, it either returns the most up-to-date data, or the read fails. Under no circumstances can the system return a stale copy as doing so could break an even larger system for which your system is a dependency.
The availability guarantee dictates that if the system receives a request, it must not fail to provide a response.
Partition tolerance means the system needs to remain fully operational even if some of its components are unable to communicate with some other components.
I think it's clear that it's impossible for a system to always return the latest data and never return an error while it can't reach its main database. That's why you can only pick two of the virtues and in most cases it's only practical to achieve one.
It's Systems All the Way Down
It's also important to note that any complex solution is usually a multitude of smaller systems in a trench coat. You can have systems within systems and you can pick different corners of the CAP triangle for every individual subsystem.
A practical example may be an online store that uses an external system to figure out if a given order qualifies for free shipping. The free shipping decision is delegated to a third-party system, a black box only accessed through its API. The order lines and the cost of regular shipping are stored in some sort of a database, and the storefront is backed by a web service that needs to return the valid shipping methods.
Now we have the following systems:
- The external shipping discount service, a black box that may provide any combination of the CAP guarantees. Whatever it does is beyond our control.
- Our internal free shipping eligibility service that depends on the database (as it needs to be able to send the cart contents) and the external service (as it needs to receive the response).
- Our public web service that tells the storefront what shipping methods are available that depends on our internal free shipping eligibility service and the database (to figure out the cost of regular shipping).
- The entire store that depends on the storefront running in the client's browser being able to communicate, over the internet, with our public web service.
Since we can't do much about the external system (and if it goes down, fixing it is beyond our reach), we can make the pragmatic decision to make any system that depends on it focus on partition tolerance. For example, we could decide that if the external system can't be reached, any order is eligible for free shipping. This way, when the external system inevitably goes down, we can err on the side of generosity and lose some money on shipping but keep our store transactional (which usually more than makes up for the shipping cost). We could also decide the opposite, that if the service is down, no order can be shipped for free, potentially upsetting some customers, but still taking orders from everyone else.
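The generous fallback can be as small as a time-boxed call wrapped in a try/except. In this sketch, `check_free_shipping` is a stand-in for whatever client the third-party service exposes, not a real API:

```python
# Partition-tolerant fallback: if the external eligibility service
# fails or times out, default to free shipping and keep taking orders.

def free_shipping_eligible(cart, check_free_shipping, timeout=0.5):
    """Ask the external service; on any failure, err on the side of generosity."""
    try:
        return check_free_shipping(cart, timeout=timeout)
    except Exception:  # timeouts, connection errors, bad responses...
        return True  # lose some shipping margin, keep the store transactional

# Usage: a dependency that is down no longer takes the store with it.
def broken_service(cart, timeout):
    raise TimeoutError("external dependency is down")

print(free_shipping_eligible({"total": 40}, broken_service))  # True
```

Flipping the `return True` to `return False` implements the opposite policy from the paragraph above: no free shipping while the service is down, but orders still flow.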
Better Fault Tolerance
I think it's clear that whichever way we choose is preferable to the entire store becoming unavailable and thus accepting no orders at all.
If we broaden partition tolerance into general fault tolerance, we can design systems that are internally as fault-tolerant as is pragmatic and externally as available as practically possible. This prevents cracks from propagating from component to component, which gives the larger system a chance of staying transactional even while some of its individual subsystems struggle to stay online.
Fault tolerance can be achieved through documented fallbacks and software design patterns. It's a process that needs to start during the design stages as it's not easy to bolt onto an existing system. All external communication has to be safeguarded and time-boxed, with timeouts short enough not to grind the larger system to a halt. Repeated failures can temporarily eliminate the external dependency through patterns like the circuit breaker.
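A minimal sketch of the circuit breaker pattern: count consecutive failures, and once a threshold is crossed, fail fast for a cooldown period instead of letting every caller wait on a dead dependency. The thresholds here are illustrative:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive errors, short-circuit calls for reset_after seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

The fail-fast `RuntimeError` returns in microseconds instead of holding a connection open until the timeout, which is what keeps one slow dependency from grinding the larger system to a halt.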
High availability is usually achieved through redundancy. If a single component has a 1% chance of randomly failing, adding a second duplicate as a fallback reduces that chance to 0.01%. With proper load balancing it also provides additional capacity and is a first step toward auto-scaling. Of course, failure is rarely truly random and is often tied to the underlying hardware or other components, so those, too, may need to be made redundant. Multi-zone or multi-region deployments and database clustering are all tools that let you lower the chance of things going south at the expense of hard-earned cash.
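The redundancy math only holds if the replicas really do fail independently, which is exactly what shared hardware breaks:

```python
# Probability that all replicas fail at once, assuming independence.
def combined_failure(p: float, replicas: int) -> float:
    """p is the failure probability of a single replica."""
    return p ** replicas

# One more replica buys two extra zeros; correlated failures buy nothing.
print(f"{combined_failure(0.01, 2):.4%}")  # 0.0100%
```

When both replicas sit on the same rack or behind the same certificate, the effective probability collapses back toward the single-replica 1%, which is the argument for multi-zone and multi-region deployments.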
It's up to you to figure out the sweet spot that offers you relative peace of mind while still keeping the operational expenses below the potential losses.
Self-Healing Systems
Given that we can't fully prevent components from failing, what if we at least eliminated the necessity of a human tending to them once they do? A self-healing system is one that is designed to recover from failures without external intervention. I'm not talking about self-adapting code paths that the prophets of AGI promise, I'm talking about automatic retry mechanisms, dead letter queues for unprocessable events, and robust work queues that guarantee at-least-once delivery.
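A sketch of that combination: retry each message with exponential backoff, and if it keeps failing, park it in a dead letter queue for later inspection instead of blocking the rest of the stream. Names and limits are illustrative:

```python
import time

def process_with_retries(message, handler, dead_letters,
                         attempts=3, base_delay=0.1):
    """Retry handler(message) with exponential backoff; park it on repeated failure."""
    for attempt in range(attempts):
        try:
            return handler(message)
        except Exception:
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...
    dead_letters.append(message)  # give up; a human (or a cron job) inspects later
```

The dead letter queue is what makes the failure predictable: one poisonous message costs a bounded number of retries, then moves aside, and the queue keeps draining without anyone being paged.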
A good system is one that fails in a predictable manner and recovers from the failure in a similarly predictable manner. Eventual consistency is much easier to achieve than immediate consistency. Exactly-once delivery is often impossible to guarantee but at-least-once beats at-most-once under most circumstances.
Design your systems with idempotency in mind so it's safe to retry partial successes. Use fair queues to prevent a single noisy task from adding hours of wait time for all its neighbors. Treat every component as if it were malfunctioning or outright malicious and ask yourself, "How can I have the system not only tolerate this but also fully recover?"
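Idempotency can be as simple as remembering which request keys have already been applied, so a retried operation becomes a safe no-op. This in-memory ledger is only an illustration of the idea, not production code:

```python
class PaymentLedger:
    """Charges keyed by an idempotency key, so retries can never double-charge."""

    def __init__(self):
        self.balance = 0
        self.seen = set()  # idempotency keys of already-applied requests

    def charge(self, idempotency_key: str, amount: int) -> int:
        if idempotency_key in self.seen:
            return self.balance  # replayed request: safe no-op
        self.seen.add(idempotency_key)
        self.balance += amount
        return self.balance
```

With this property in place, the retry loop from the previous section can fire the same message two or three times and the outcome is identical to delivering it exactly once.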
Perhaps the most extreme version of this is the Chaos Monkey from Netflix, a tool designed to break your system's components in a controllable yet unpredictable way. The engineers behind Chaos Monkey theorized that in a system designed around reliability, the actions of the Monkey should be completely invisible from the outermost system's perspective. True, with the asterisk that if you get anything wrong, your services are down and you're losing money. Perhaps not everyone can afford that.
And to get it right is often more about being smart than clever. The self-healing part could be as easy as implementing a health check and restarting the component. Or it could mean dropping the cache if you're unable to deserialize its contents, because maybe you forgot that caches can persist across schema changes. Or even restarting your HTTP server every 27 requests while you're figuring out why the 29th request always causes it to crash. Observe your systems and learn from their failures, adding preventive measures for similar classes of future problems.
Remain Vigilant
In 2026, perhaps more than ever, remain vigilant. With the advent of generative AI, some parts of your service will likely end up being written by an LLM. That model, like all models, was trained on a large corpus of code, both purely commercial and Open Source. You have to remember that most of this code, even if it didn't completely neglect its reliability engineering homework, may have vastly different assumptions about where it stands with regard to the CAP theorem.
You cannot blindly transplant code from one project to another, from an AI chatbot, or from a StackOverflow answer, without also consciously asking yourself, "How does this code anticipate and deal with failures? And does it fit my goals for this particular subsystem?"
Happy failures. Farewell and until next time!