Romario Da Silva
Designing Data-Intensive Applications — Chapter 1: Reliable, Scalable, and Maintainable Applications

This post is part of a series summarizing key ideas from Designing Data-Intensive Applications by Martin Kleppmann.

In most data-intensive applications, a few standard components appear repeatedly: databases, caches, search indexes, stream processors, and batch processors. Collectively, these are known as data systems.

Each of these systems has its own characteristics, strengths, and weaknesses. To make informed decisions—like which database to use in a given scenario or which caching strategy fits best—we need to understand how these mechanisms work, what they excel at, and where they fall short.

That’s where Designing Data-Intensive Applications by Martin Kleppmann comes in. The book’s goal is to establish a set of principles that help us make better design decisions. But before applying those principles, we first need to clarify what exactly we’re optimizing for with each choice.


The Blurring Lines Between Data Systems

Traditionally, different data systems had clearly defined roles. Databases, message queues, and caches each served distinct purposes. Today, however, those boundaries are much blurrier. Redis, for example, is often used as a messaging system through its publish/subscribe channels, while Kafka provides database-like durability guarantees.
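
As a quick illustration of that blurring, here is a minimal sketch of Redis playing the role of a message broker through pub/sub. It assumes a local Redis server and the redis-py client, and the "orders" channel name is just a placeholder:

```python
import redis

# One connection for publishing; subscribers use a dedicated PubSub object.
r = redis.Redis(host="localhost", port=6379)

# Subscriber side (in a real setup this would live in a separate process or service).
sub = r.pubsub()
sub.subscribe("orders")
sub.get_message(timeout=1.0)  # consume the subscription confirmation before publishing

# Publisher side: fire-and-forget message onto the channel.
r.publish("orders", "order-created:42")

# Consume messages as they arrive.
for message in sub.listen():
    if message["type"] == "message":
        print("received:", message["data"])
        break  # stop after one message for this sketch
```

Redis doesn't keep these messages around for subscribers that aren't connected, which is exactly the kind of trade-off you need to understand before leaning on a cache as a queue.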

As developers, we now combine these once-specialized systems—each far more general-purpose than before—into composite systems that process data in specific ways to meet our business needs. In doing so, we essentially become data system designers ourselves.

This shift means we face many of the same challenges that the creators of these systems encountered:
How do we ensure data remains safe after a crash?
How do we maintain consistency at the level our use case requires?
In a sense, we’re solving the same kinds of problems—just at a different level of abstraction.


Reliability — Building from Unreliable Parts

The book defines reliability not as having “reliable data,” but as building fault-tolerant systems—systems that continue to function correctly even when software, hardware, or human errors (or even malicious actions) occur.

It’s important to distinguish between a fault and a failure.
A fault happens when a component behaves unexpectedly.
A failure occurs when that fault affects the entire system and users notice the problem.

The goal of fault-tolerant design is to prevent faults from turning into failures.
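
As a tiny sketch of that idea, a retry with backoff can absorb a transient fault in a dependency before it ever surfaces to the user as a failure. The flaky fetch_profile function below is a made-up stand-in for any unreliable component:

```python
import random
import time

def fetch_profile(user_id: int) -> dict:
    """Hypothetical dependency that fails intermittently (a fault)."""
    if random.random() < 0.3:  # simulate a transient fault
        raise ConnectionError("upstream temporarily unavailable")
    return {"id": user_id, "name": "example"}

def fetch_with_retries(user_id: int, attempts: int = 3) -> dict:
    for attempt in range(1, attempts + 1):
        try:
            return fetch_profile(user_id)   # the fault may occur here...
        except ConnectionError:
            if attempt == attempts:
                raise                       # ...and only becomes a failure once retries run out
            time.sleep(0.1 * 2 ** attempt)  # back off before trying again

print(fetch_with_retries(42))
```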

Since faults are inevitable, reliability comes from building reliable systems out of unreliable parts. One of the best ways to do this is by deliberately inducing faults to test your assumptions.

Netflix’s “Chaos Monkey” is a great example. In their 2011 blog post, they explain how they randomly terminate production servers to ensure systems can handle unexpected failures. The goal is to avoid the nightmare scenario: a real outage affecting customers while executives are on the call and engineers are troubleshooting under pressure for the first time. By creating controlled failures, Netflix can safely observe how their systems behave and train teams to respond effectively when real faults occur.
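
To make that concrete, here is a toy fault-injection sketch in the same spirit (not Netflix's actual tooling): start a few worker processes, terminate one at random, and check what is still serving:

```python
import multiprocessing
import random
import time

def worker() -> None:
    while True:
        time.sleep(0.5)  # stand-in for serving requests

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=worker) for _ in range(4)]
    for p in workers:
        p.start()

    victim = random.choice(workers)  # pick a random "server"...
    victim.terminate()               # ...and induce the fault deliberately
    victim.join()

    alive = [p for p in workers if p.is_alive()]
    print(f"{len(alive)} of {len(workers)} workers still serving")

    for p in alive:                  # clean up the rest
        p.terminate()
        p.join()
```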


Scalability — Understanding and Managing Load

Next comes scalability. To scale a system, you first need to define what its load actually means. Is it the number of records processed per second? Active users? Requests per minute? You need this definition before deciding how to handle growth.

For each system, some load metrics make more sense than others. Kleppmann uses Twitter as an example: reading tweets happens far more often than writing them. Users typically read dozens of tweets for every one they post.

When someone tweets, Twitter must make that post appear on followers’ timelines within seconds. Because writes are less frequent, it makes sense to do most of the heavy lifting during the write—this makes reads much faster.

To achieve this, Twitter fans out tweets to followers as they’re written, rather than making every read request fetch tweets from a single database. Otherwise, the database would quickly become a bottleneck. Of course, this approach has limits—the “celebrity problem” being one. Fanning out a single tweet to millions of followers is inefficient, so Twitter uses a different read workflow for those cases.
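
A minimal sketch of fan-out-on-write with in-memory Python structures looks something like this; a production system would push to per-follower caches instead, and the follower graph below is made up for illustration:

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "bob": ["carol"]}  # author -> list of followers
timelines = defaultdict(list)                              # per-user home timeline cache

def post_tweet(author: str, text: str) -> None:
    """Heavy lifting at write time: push the tweet into every follower's timeline."""
    for follower in followers.get(author, []):
        timelines[follower].append((author, text))

def read_timeline(user: str) -> list:
    """Reads are cheap: just return the precomputed timeline."""
    return timelines[user]

post_tweet("alice", "hello world")
print(read_timeline("carol"))  # [('alice', 'hello world')]
```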


Once you’ve defined your load, the next step is to understand how your system performs under it. A common metric here is response time (or latency).

Response times are usually expressed as percentiles. The median (p50) is the time within which half of all requests complete; similarly, saying that 95% of requests finish in under 200 ms is a statement about the 95th percentile (p95). Amazon, for instance, monitors latency at the 99.9th percentile, because even if only 1 in 1,000 users experiences slow responses, those users may be high-value customers handling large transactions.
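
Given a list of measured response times, computing these percentiles is straightforward; the numbers below are made up for illustration:

```python
import numpy as np

response_times_ms = [12, 15, 18, 22, 25, 31, 40, 55, 90, 480]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(response_times_ms, p):.1f} ms")
```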

At high percentiles, queuing delays often dominate. Servers can only process a limited number of requests in parallel, and a few slow ones can block the rest—this is called head-of-line blocking. Even fast requests end up waiting, making the entire system feel sluggish. For this reason, it’s crucial to measure response times from the client’s perspective, not just on the server.
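
A toy single-server queue makes the effect visible: one slow request inflates the client-observed response time of everything queued behind it, even though the server handles those later requests quickly. The request durations below are invented, and all requests are assumed to arrive at the same moment:

```python
service_times_ms = [10, 10, 10, 500, 10, 10, 10]  # the 4th request is the slow one

clock = 0.0
for i, service in enumerate(service_times_ms):
    wait = clock       # time spent queued behind earlier requests
    clock += service   # the server handles one request at a time
    print(f"request {i}: server saw {service} ms, client saw {wait + service:.0f} ms")
```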

Once you have accurate measurements, you can plan how to handle increased load without sacrificing performance.

The two main strategies are scaling up (using a more powerful machine) and scaling out (adding more machines). Scaling up is simpler but quickly becomes expensive. Scaling out is more complex—especially for stateful systems like databases—but it’s often the only practical path as demand grows.

Fortunately, as distributed systems tools and abstractions improve, we can expect more data systems to be designed as distributed by default.
