DEV Community

Patryk Zawadzki for Saleor Commerce

Building Reliable Software: The Trap of Convenience

When I started learning to program a PC (as opposed to an Amiga), it was still the 20th century. In the days of yore, when electricity was a novel concept and computer screens had to be illuminated by candlelight in the evenings, we'd use languages like C or Pascal. While the standard libraries of those languages provided most of the needed primitives, it was by no means a "batteries included" situation. And even where a standard library solution existed, we'd still drop to inline assembly for performance-critical sections, because those computers were definitely not fast enough to spare any CPU cycles. PCs were also still similar enough that they used the same CPU architecture, and thus the same machine code, so the assembly sections were not that hard to maintain.

Today, the x86 architecture comes with dozens of optional extensions, and you're not even guaranteed to encounter an "Intel" machine (the 64-bit variant is technically referred to as "amd64"). RISC CPUs are coming back to reclaim computing thanks to the efforts of Apple (Apple Silicon), Amazon (AWS Graviton), Microsoft (Azure Cobalt), and other ARM licensees. In 2026, writing assembly code is something you only do if there is absolutely no other solution. The number of versions of that inline section keeps growing with every new CPU family. Meanwhile, modern compilers have become so good at optimizing the resulting machine code, and computers so fast, that manual optimization is usually not worth the effort. Unless it absolutely is.

So modern programming languages split. There are systems programming languages that optimize for performance, with an extra focus on safety, like Rust. And there are application programming languages that optimize for "productivity", that is, the speed at which we produce useful software rather than the speed at which said software runs.

Productivity Through Convenience

Productivity demands higher-order abstractions. Instead of representing how the underlying hardware works, programming languages and their libraries model how people think about problems.

Thanks to this, instead of writing several pages of C code to allocate a send buffer and a receive buffer, open a socket, set its options, resolve the target hostname, establish a connection, and so on, you can fetch and parse a web resource in a few lines of Python:

import requests

def fetch_json(url):
    data = requests.get(url)
    return data.json()

With just a few keystrokes, I can achieve what used to take me hours to type out. Thanks to both Python (and C#, and TypeScript, and likely also your favorite language) and the requests library (and its equivalents) being open-source and available for everyone to use for free, we can all, collectively and individually, build more complex systems with less effort.

Except that, as I mentioned in my previous post, it's systems all the way down. And all those systems make a (conscious or not) choice on what it means to be a reliable tool.

As a fun exercise, look at the above example and try to figure out what the biggest problem with that bit of code is. It certainly works for the happy path, which would make it pass a lot of the unit tests!

Let's walk through several (but not all, the complete list would be way too long) of the things that can go wrong in just two lines of code.

A Litany of Failure Modes

data = requests.get(url)
return data.json()
  1. An error will be raised if the URL is not a valid URL (or not even a string, yay modern languages!).
  2. An error will be raised if the target hostname cannot be resolved (either the domain does not exist, or your DNS server cannot be reached).
  3. If the target hostname resolves to multiple IP (IPv6 or IPv4) addresses, each one is tried in sequence until one accepts the connection on the destination port. Since no timeout is specified, the default system TCP timeout (6 connection attempts totalling about 127 seconds on modern Linux systems) applies to each individual IP address. If none of the IP addresses ends up accepting, an error is raised.
  4. If the target system does not speak our desired protocol and responds with random gibberish, an error is raised.
  5. An error is raised if the protocol is secure (like HTTPS) and the target server does not offer any of the TLS variants we trust.
  6. If the protocol is secure (like HTTPS) and the target system responds with an invalid TLS certificate (either broken, expired, or not trusted by any of the certificates our system trusts), an error is raised.
  7. An error is raised if the certificate is valid but does not match the target hostname.
  8. If the target system stops responding, an error is raised. Since no timeout is specified, the default system TCP read timeout is used (60 seconds on modern Linux systems). If the target system sends anything during that time window, the system timer will be reset as the read was successful.
  9. If the response is a valid HTTP redirect response, the process is restarted from step 1, using the redirect URL as the new target URL.
  10. If we somehow get to this point (as unlikely as it seems) and the response is not a valid JSON string, an error is raised.
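Most of the failure modes above share one partial mitigation: never rely on the system defaults. As a minimal sketch (the timeout values are illustrative, not recommendations), the original example could at least make its connect and read timeouts explicit:

```python
import requests

def fetch_json(url):
    # (connect timeout, read timeout) in seconds; the values are illustrative.
    response = requests.get(url, timeout=(3.05, 10))
    # Surface HTTP 4xx/5xx responses as exceptions instead of parsing them.
    response.raise_for_status()
    return response.json()
```

Note that the read timeout bounds each individual socket read, not the whole transfer, so this alone does not cap the total duration of the request.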

Different parts of the above are problematic, or extremely problematic, depending on what your code is attempting to achieve.

If your goal is to download a movie to watch, preserve artifacts of a system you're about to delete, or create complete copies of websites for a project like the Internet Archive's Wayback Machine, then chances are you want the code to take all the necessary time, perhaps even multiple attempts instead of giving up on the first transient error. The desired outcome is to access the resource at all costs.

But if your goal is to figure out if an order is eligible for free delivery, you probably don't want to keep the user waiting for literally minutes just because some external server crashed. By the time you send your fallback response, the user will be nowhere to be found, having abandoned their order and long since closed the browser tab.

A web API that takes minutes to respond is a useless one, but the same is true for many other use cases. Imagine having to stand in front of an ATM for several minutes before the machine finally spits your card out with an error. All the while people behind you start to make arrangements for your upcoming funeral.

Failing Fast

If you guessed that the problem with the code is that it could crash, you probably guessed wrong. I'm going with "probably", because I don't know what your use case is. But most systems handle failing fast rather gracefully.

A simple try/except block wrapped around the call to our function could take care of specifying the fallback behavior. And even if that is absent, the underlying framework is likely built to withstand the failure and return an error instead of crashing like in the old times. What it can't do is rewind the time it took the code to fail.
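As a sketch of that fallback behavior, assuming the naive fetch_json from earlier and a hypothetical free-delivery endpoint (the field names are made up for illustration):

```python
import requests

def fetch_json(url):
    # The naive version from the original example, without timeouts.
    data = requests.get(url)
    return data.json()

def is_eligible_for_free_delivery(order_total, rules_url):
    try:
        rules = fetch_json(rules_url)
        return order_total >= rules["free_delivery_threshold"]
    except (requests.exceptions.RequestException, ValueError, KeyError):
        # Graceful fallback: assume no free delivery. But nothing here
        # rewinds the minutes the request may have spent failing first.
        return False
```

The except branch keeps the system standing, but without explicit timeouts it can still take minutes to be reached.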

Resisting Abuse

You can't have reliability without at least some resilience (though I guess failing reliably is also a form of consistency). So you need to teach the system how to defend itself against undesirable behaviors. Some of them outright malicious, some not.

In the above example, an extremely malicious behavior would be to take a domain name and configure its zone record to resolve to 511 different IP addresses, all from non-routable network segments such as 192.168.0.0/16.

Or to have a domain resolve to 9 non-routable IPs and one that returns an HTTP redirect to the same domain.

Or to point the URL to a server that streams the response by sending one byte every 50 seconds, thus never triggering a read timeout.

If those numbers sound oddly specific, it's because we did all of those things internally at Saleor. I don't remember why 511 was the number of IPs we went for; maybe Cloudflare didn't allow more records to be added, or maybe it didn't matter because no one was going to wait for that test to time out anyway.
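A per-read timeout alone can't defend against the trickle attack, because every received byte resets the clock. What can is a wall-clock deadline on the whole transfer, checked between chunks. A sketch, assuming requests and an arbitrary 10-second budget:

```python
import time
import requests

def fetch_with_deadline(url, deadline=10.0):
    start = time.monotonic()
    # stream=True lets us read the body in chunks and enforce our own limit.
    with requests.get(url, timeout=(3.05, 5), stream=True) as response:
        response.raise_for_status()
        chunks = []
        for chunk in response.iter_content(chunk_size=8192):
            if time.monotonic() - start > deadline:
                raise TimeoutError("total transfer deadline exceeded")
            chunks.append(chunk)
    return b"".join(chunks)
```

The check only runs between chunks, so the worst case is roughly the deadline plus one read timeout; still, a server dribbling one byte at a time can no longer hold the connection open indefinitely.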

But a malicious actor could also ask your system to access a URL of an internal system they can't reach directly. If the URL comes from an untrusted source, it could be used to probe your internal network for open ports, based on the error codes you send back. And if your system is foolish enough to show the entire "unexpected response" from such a URL, it could also be used to steal your credentials.

Did you know that any EC2 instance on AWS can make a request to http://169.254.169.254/latest/meta-data/ to learn about its own roles? And that a subsequent call to http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-id>/ returns both an AWS access key and its corresponding secret? Yikes on leaking that!
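One layer of defense against this class of attacks is to resolve the hostname yourself and refuse private, loopback, and link-local addresses before ever opening a connection. A sketch using only the standard library (and deliberately incomplete: a real defense must also re-check after every redirect and pin the resolved address to avoid DNS rebinding):

```python
import ipaddress
import socket

def assert_public_host(hostname):
    """Raise ValueError if the hostname resolves to any non-public address."""
    for *_, sockaddr in socket.getaddrinfo(hostname, None):
        address = ipaddress.ip_address(sockaddr[0])
        # is_link_local covers 169.254.169.254, the cloud metadata endpoint.
        if address.is_private or address.is_loopback or address.is_link_local:
            raise ValueError(f"refusing to connect to {hostname} ({address})")
```

ipaddress.ip_address handles both IPv4 and IPv6, so the same check rejects ::1 and the fc00::/7 private range as well.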

Final Thoughts

Convenience is the biggest pitfall of modern high-abstraction productivity. All the important bits and compromises are buried deep in the convenience layers, making it impossible to reason about systems without popping the hood. Meanwhile, your IDE, your code review tools, and your whiteboard interviews surface the types of problems that, in the grand scheme of things, don't matter all that much: the ones your system can recover from automatically.

If you ever find yourself accessing the great unknown from Python code, take a look at the requests-hardened wrapper we created for requests. It makes it safe to point the library at untrusted URLs from code that doesn't have forever to wait for the outcome. It also works around a DoS potential in Python's standard library that we reported responsibly; it's only public because the maintainers asked us to make it public.

Make sure your team doesn't mistake simple code for simplicity of the underlying systems. The reassurance that comes from complexity becoming invisible is a false one.

Happy failures. Farewell and until next time!
