I’ve been reading Designing Data-Intensive Applications, and it’s making me more aware of the assumptions we rarely question in everyday engineering.
What stood out to me is not how technical the book is, but how it slowly changes the way you think about systems. It brings attention to assumptions that often fade into the background because things usually work. As developers, we rely heavily on what the book refers to as engineering folklore. These are ideas that get passed around because they tend to hold in practice, not necessarily because we fully understand where they break down.
As I kept reading, I noticed how often the book encourages looking past surface-level behavior. Many systems look similar from the outside. They expose similar APIs, use familiar tools, and behave well under normal conditions. But that superficial similarity can be misleading. Once something deviates from what we expect, systems that appear similar can behave very differently.
Most of these assumptions are reasonable in isolation. The challenge is that they become more fragile once systems grow and start interacting.
That gap between what we assume and what actually happens is where many issues start.
The Leap Second Incident
Before reading this book, I had not heard about the leap second incident. After coming across it, I got curious and looked into what actually happened and how systems adapted afterward.
The leap second incident most commonly refers to the widespread technical disruptions that occurred on June 30, 2012, when an extra second was added to Coordinated Universal Time (UTC) to keep it aligned with the Earth's rotation. This introduced a rare situation where a minute contained 61 seconds, creating the timestamp 23:59:60.
In theory, this adjustment is well-defined. In practice, many systems were not designed to handle time appearing to pause, repeat, or move backward.
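To make that concrete, here is a minimal sketch (mine, not the book's; the function name is invented) of how strict timestamp handling can choke on a leap second. The perfectly legal UTC timestamp 23:59:60 is rejected as invalid input because the code quietly assumes seconds only run from 00 to 59:

```python
from datetime import datetime

def parse_utc(ts: str) -> datetime:
    # Quietly assumes every minute has exactly 60 seconds, numbered 00-59.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")

print(parse_utc("2012-06-30T23:59:59"))  # fine

try:
    parse_utc("2012-06-30T23:59:60")     # the leap second itself
except ValueError as err:
    # Python's datetime does not support leap seconds, so a perfectly
    # legal UTC timestamp is rejected as invalid input.
    print("rejected:", err)
```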
One of the most severe issues came from a bug in the Linux kernel’s high-resolution timer. When the leap second occurred, some systems entered a live-lock state. CPU usage spiked to 100 percent as processes became trapped in tight loops, preventing the system from making progress.
Because so much infrastructure shared similar assumptions about time, the effects spread quickly. Major websites such as Reddit, LinkedIn, Mozilla, Yelp, and others experienced partial or total outages. Reddit, for example, was unavailable for over an hour. Outside of web services, the Amadeus airline reservation system failed, leading to hundreds of flight delays in Australia.
Applications built on Java-based systems, including technologies like Cassandra and Hadoop, were especially vulnerable because they depended on the same underlying Linux timers. Once time behaved unexpectedly at the operating system level, higher-level software inherited the problem.
What stood out to me is that different parts of systems reacted differently. Some components attempted to move forward in time. Others paused or retried, waiting for time to advance. That inconsistency created situations that closely resembled deadlocks or live-locks, not because of incorrect locking logic, but because time itself had become an unreliable shared dependency.
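One concrete way this shows up in application code, as a hedged sketch rather than anything from the book: timeout and deadline logic built on the wall clock silently assumes time only moves forward, while a monotonic clock exists precisely because that assumption is unsafe.

```python
import time

def wait_until_deadline_wall_clock(timeout_s: float) -> None:
    # Fragile: if the wall clock is stepped backward or stalls (NTP
    # corrections, leap second handling), this loop can spin or wait
    # far longer than intended.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        time.sleep(0.01)

def wait_until_deadline_monotonic(timeout_s: float) -> None:
    # time.monotonic() is guaranteed never to go backward, so the
    # deadline is unaffected by wall-clock adjustments.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        time.sleep(0.01)
```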
After the incident, engineers did not just patch bugs. They rethought how to handle time.
Google and Meta adopted an approach known as leap smearing. Instead of inserting the extra second all at once, they gradually spread it over several hours. From the system’s perspective, time never jumps or stalls. It simply runs slightly slower for a short window.
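As a rough sketch of the idea (real implementations differ in window length, smear shape, and where the window is centered), a linear smear makes the clock run fractionally slow so the extra second is absorbed without ever being reported:

```python
WINDOW_S = 24 * 60 * 60  # assume a 24-hour smear window for illustration

def smeared_seconds(elapsed_real_s: float) -> float:
    """Map real seconds since the window start to reported clock seconds.

    Inside the window the clock ticks at WINDOW_S / (WINDOW_S + 1) of real
    speed, so after WINDOW_S + 1 real seconds it reports exactly WINDOW_S:
    the inserted leap second has been spread across the whole window.
    """
    if elapsed_real_s <= WINDOW_S + 1:
        return elapsed_real_s * WINDOW_S / (WINDOW_S + 1)
    # After the window, the clock runs at normal speed again.
    return WINDOW_S + (elapsed_real_s - (WINDOW_S + 1))

# The reported clock never jumps, repeats, or shows 23:59:60; it simply
# runs about 12 parts per million slow for one day.
print(smeared_seconds(WINDOW_S + 1))  # 86400.0
```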
On the operating system side, Linux timekeeping was improved to better handle leap seconds without threads spinning or stalling unexpectedly. Over time, the risks became significant enough that international timekeeping bodies voted in 2022 to abolish leap seconds entirely by 2035. As of now, no new leap seconds are scheduled.
What stayed with me is that the problem was not just the extra second. It was the assumption that time is always consistent and universally agreed upon across systems.
Faults, Failures, and Reality at Scale
One of the most important distinctions the book makes is between a fault and a failure. A fault happens when one part of the system deviates from its expected behavior. A failure happens when the system as a whole stops providing the service the user expects. That difference matters more than it might sound.
Faults are normal. Disks fail. Networks slow down. Clocks drift. Humans make mistakes. In large systems, these are not exceptions. They are expected. Reliable systems are not the ones where nothing ever goes wrong. They are the ones designed to continue working even when parts of the system misbehave.
The book gives a simple example that makes this idea more concrete. Hard disks are reported to have a mean time to failure of about 10 to 50 years. That sounds reassuring until you operate thousands of disks. At that scale, disk failures are not rare events. They happen regularly. What feels unusual on a single machine becomes normal behavior in a large system.
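The arithmetic behind that is worth doing once. With a fleet size I am picking purely for illustration, a very respectable per-disk MTTF still translates into a failure roughly every day:

```python
# Back-of-the-envelope estimate; the MTTF and fleet size are
# illustrative assumptions, not measurements.
MTTF_YEARS = 30       # somewhere inside the quoted 10-50 year range
FLEET_SIZE = 10_000   # disks in a single large storage cluster

failures_per_disk_per_day = 1 / (MTTF_YEARS * 365)
expected_failures_per_day = FLEET_SIZE * failures_per_disk_per_day

print(f"{expected_failures_per_day:.2f} expected disk failures per day")
# ~0.91: roughly one dead disk per day, from hardware that looks
# extremely reliable when you only ever operate one machine.
```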
This same idea shows up in fan-out patterns, where a single event triggers many downstream actions. These patterns are powerful, but they also increase the impact of small issues. A delay or error in one place can quickly spread across multiple services. When you add impedance mismatch between systems that were never designed to fit together naturally, complexity grows even faster. Over time, systems accumulate complexity that makes them harder to understand, operate, and evolve safely.
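A rough way to see how fan-out amplifies small issues (the success rate and fan-out sizes here are made up for illustration): even highly reliable downstream calls compound badly once a single event touches many of them.

```python
# Illustrative only: how per-call reliability compounds under fan-out.
per_call_success = 0.999  # each downstream call succeeds 99.9% of the time

for fan_out in (1, 10, 100, 1000):
    all_succeed = per_call_success ** fan_out
    print(f"fan-out {fan_out:>4}: P(every call succeeds) = {all_succeed:.1%}")

# fan-out  100: ~90.5% -> one event in ten hits at least one error
# fan-out 1000: ~36.8% -> most events now see a fault somewhere
```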
Performance Is Not Just About Speed
Another subtle lesson that stood out to me is the difference between latency and response time. These terms are often used interchangeably, but they are not the same. Response time is what the user experiences. It includes processing time, network delays, and time spent waiting in queues. Latency is the time a request spends waiting before it is even handled.
From the user’s perspective, these distinctions do not matter. What matters is how the system feels. A system can be technically fast and still feel slow if requests spend too much time waiting.
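As a small sketch of the distinction (the simulated numbers are invented and only meant to show its shape): each request has a queueing delay before it is handled and a processing time once it is, and the user only ever experiences the sum.

```python
import random
import statistics

random.seed(1)

# Simulated requests: queue wait (latency) vs. processing (service) time.
samples = []
for _ in range(1_000):
    queue_wait_ms = random.expovariate(1 / 20)  # time waiting to be handled
    service_ms = random.gauss(30, 5)            # time doing the actual work
    samples.append((queue_wait_ms, service_ms))

latencies = [q for q, _ in samples]
response_times = [q + s for q, s in samples]    # what the user experiences

print(f"median latency:       {statistics.median(latencies):6.1f} ms")
print(f"median response time: {statistics.median(response_times):6.1f} ms")
print(f"p99 response time:    {statistics.quantiles(response_times, n=100)[98]:6.1f} ms")
# The processing work is steady; the long tail comes mostly from time
# spent waiting in the queue.
```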
The book also challenged some common assumptions about performance. Counterintuitively, the performance advantage of in-memory databases is not mainly that they avoid disk reads. Often, they are faster because they avoid the overhead of encoding in-memory data structures into formats suitable for disk storage. That detail changes how you think about optimization and where performance bottlenecks really come from.
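A small, hedged way to see that overhead in isolation (a micro-illustration, not how any particular database works): round-tripping a record through a disk-friendly encoding costs far more than reading a field straight out of memory, even though no disk is involved at all.

```python
import json
import timeit

record = {"user_id": 42, "name": "Ada", "tags": ["a", "b", "c"], "score": 3.14}

in_memory = timeit.timeit(lambda: record["score"], number=100_000)
round_trip = timeit.timeit(
    lambda: json.loads(json.dumps(record))["score"], number=100_000
)

print(f"plain dict access:        {in_memory:.3f} s")
print(f"encode/decode round trip: {round_trip:.3f} s")
# The gap is pure serialization work; no disk was touched in either case.
```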
Designing Systems for People
As the chapters progress, the book keeps coming back to reliability, scalability, and maintainability. What I appreciate is that these are not framed as purely technical goals. They are human goals.
Systems need to be operable so teams can keep them running. They need to be simple enough for new engineers to understand. They need to be evolvable so they can adapt as requirements change.
Systems that accumulate unnecessary complexity do not just slow machines down. They slow people down. They make learning harder, change riskier, and mistakes more expensive. Simplicity here does not mean fewer features; it means fewer hidden assumptions and less unnecessary complexity.
Key Takeaway
Considering edge cases is not a new concept in programming. Most developers learn early to handle null values, invalid input, and obvious boundary conditions. That part is expected.
What becomes less obvious, especially as systems grow, is a different category of edge cases. These are not logical mistakes, but rather assumptions that are often taken for granted.
Time is assumed to move forward smoothly, despite the fact that it can drift, stall, or repeat. Disks are often assumed to fail infrequently, yet failures are statistically expected. Networks are often treated as either up or down, while in practice, failure is usually partial and shows up as latency, packet loss, or general degradation. Messages are assumed to arrive once and in order, although duplication and reordering are common.
At a small scale, these assumptions usually work well enough. At a larger scale, they start to interact.
A clock drifting slightly, a disk failing occasionally, or a message arriving late does not feel like a problem on its own. But when these small deviations happen across many machines and services, they stop being edge cases and start becoming normal behavior.
That is often where faults turn into failures.
The takeaway here is not to try to handle every possible scenario. That is not realistic. It is to be more intentional about which assumptions a system depends on, and what happens when those assumptions stop holding.
Designing for the happy path is necessary. Designing with deviation in mind is what keeps systems stable over time.
If this resonated with you, I would be interested to hear which assumption surprised you the most when it broke in a real system.