<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Romario Da Silva</title>
    <description>The latest articles on DEV Community by Romario Da Silva (@romario_dasilva_cbef4e6a).</description>
    <link>https://dev.to/romario_dasilva_cbef4e6a</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3587573%2F72e8f683-56cc-4d4c-91cb-cb6dffacb4f5.png</url>
      <title>DEV Community: Romario Da Silva</title>
      <link>https://dev.to/romario_dasilva_cbef4e6a</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/romario_dasilva_cbef4e6a"/>
    <language>en</language>
    <item>
      <title>Designing Data-Intensive Applications — Chapter 2: Data Models and Query Languages</title>
      <dc:creator>Romario Da Silva</dc:creator>
      <pubDate>Tue, 11 Nov 2025 22:15:55 +0000</pubDate>
      <link>https://dev.to/romario_dasilva_cbef4e6a/designing-data-intensive-applications-chapter-2-data-models-and-query-languages-4b2m</link>
      <guid>https://dev.to/romario_dasilva_cbef4e6a/designing-data-intensive-applications-chapter-2-data-models-and-query-languages-4b2m</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;This post is part of a series summarizing key ideas from Designing Data-Intensive Applications by Martin Kleppmann.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layers of Abstraction
&lt;/h3&gt;

&lt;p&gt;Data models work as a sort of interface between different layers of abstraction.  &lt;/p&gt;

&lt;p&gt;Each model is an interface that enables the layers of abstraction to communicate with one another. &lt;/p&gt;

&lt;p&gt;Besides that, each model also limits what the layer above it can do. For example, if you store your data in a relational database, that decision comes with constraints on what you can or should do with it.&lt;/p&gt;

&lt;p&gt;As application developers, we model real-world concepts as objects, classes, and data structures. &lt;/p&gt;

&lt;p&gt;These objects are then modeled as tables, JSON documents, or some other format in order to be stored.&lt;/p&gt;

&lt;p&gt;Engineers who create those data storage systems need to figure out how to represent that information in terms of bytes, in memory or on disk. &lt;/p&gt;

&lt;p&gt;Hardware engineers come up with ways to represent those bytes with electrical currents.&lt;/p&gt;




&lt;h3&gt;
  
  
  Relational, Document, and Hybrid Models
&lt;/h3&gt;

&lt;p&gt;The relational model has clearly won the quest for the best way to model data. It’s been used since the 1970s. Despite that, in the 2010s, driven by the need for greater scalability, specialized query operations that aren’t well supported by the relational model, or simply more flexibility, NoSQL databases became popular.&lt;/p&gt;

&lt;p&gt;A common criticism of the SQL/relational model comes from the way most applications are written — using the OOP paradigm. In these cases, there’s a need for a translation layer between objects and tables. ORMs are commonly used to ease that process. This mismatch is often called the &lt;em&gt;impedance mismatch&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;JSON is often seen as a more natural alternative that solves the impedance mismatch problem. However, JSON comes with its own issues, which we’ll discuss later.&lt;/p&gt;

&lt;p&gt;When working with document databases, data locality is a big advantage. All the data is stored in one place, so you don’t need to look around in different tables. It’s important to note, though, that when you fetch a document, you’re fetching it entirely. So if you only need a small piece of information within a large document, that can be wasteful. For that reason, it’s generally recommended to keep documents fairly small.&lt;/p&gt;
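
&lt;p&gt;As a rough sketch of that trade-off (the table and column names are made up for illustration, and a JSON column stands in for a document store): the relational model splits the profile across joined tables, while the document version keeps the whole profile together in one place.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Relational: a one-to-many profile split across two tables (hypothetical schema)
CREATE TABLE users (
  user_id integer PRIMARY KEY,
  name    text
);
CREATE TABLE positions (
  position_id integer PRIMARY KEY,
  user_id     integer REFERENCES users (user_id),
  title       text,
  company     text
);

-- Document-style: the whole profile stored together, giving locality on reads
CREATE TABLE user_profiles (
  user_id integer PRIMARY KEY,
  profile json  -- e.g. {"name": "Mark", "positions": [{"title": "Engineer", "company": "Acme"}]}
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;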

&lt;p&gt;On the other hand, having IDs and foreign keys in relational databases can be extremely useful for many reasons. They make it easier to avoid ambiguity, simplify updates (you only need to update one place), and provide better search capabilities.&lt;/p&gt;

&lt;p&gt;Using an ID instead of plain text (relational model instead of document model) reduces data duplication. On top of that, since IDs don’t have real meaning for humans, they don’t need to change. Any piece of information that carries human meaning might need to change at some point. If you go for the document-based approach, you risk write overheads and inconsistencies while updating that information. Removing that duplication is what’s called &lt;em&gt;normalization&lt;/em&gt; in databases.&lt;/p&gt;
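
&lt;p&gt;A minimal sketch of that normalization, with hypothetical table names: the human-meaningful city name lives in exactly one row, and every profile points to it through an ID.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Normalized: one row owns the name; profiles reference it by ID (hypothetical schema)
CREATE TABLE cities (
  city_id integer PRIMARY KEY,
  name    text
);
CREATE TABLE profiles (
  profile_id integer PRIMARY KEY,
  city_id    integer REFERENCES cities (city_id)
);

-- Renaming the city touches a single row instead of every profile that mentions it
UPDATE cities SET name = 'Greater London' WHERE city_id = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;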

&lt;p&gt;To fetch normalized data, you need to perform joins across different tables. Document-based databases usually don’t have good support for joins. Also, data tends to get more and more interconnected as new features are added. In this sense, relational databases give us a stronger framework for dealing with that.&lt;/p&gt;
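
&lt;p&gt;Continuing the sketch above, reading the data back with its human-readable city name requires a join, which is exactly the operation document databases tend to support poorly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Resolve the city_id reference back into a human-readable name at read time
SELECT profiles.profile_id, cities.name AS city_name
  FROM profiles
  JOIN cities ON profiles.city_id = cities.city_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;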

&lt;p&gt;Regarding schema flexibility, it’s inevitable that your data will take on some structure eventually; in that sense, the document model isn’t entirely &lt;em&gt;schemaless&lt;/em&gt;. The question is whether the database enforces that schema on write (&lt;em&gt;schema-on-write&lt;/em&gt;), or your application enforces it on read (&lt;em&gt;schema-on-read&lt;/em&gt;). Both approaches have trade-offs, and we can draw an interesting parallel here with statically vs. dynamically typed programming languages. The schema-on-read approach is usually helpful when you have very heterogeneous data.&lt;/p&gt;
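
&lt;p&gt;A small illustration of the schema-on-write side of that trade-off (the column names are hypothetical, and &lt;code&gt;split_part&lt;/code&gt; is PostgreSQL-specific): the relational database makes the structural change an explicit migration, whereas a schema-on-read document store would simply start writing documents with the new field and leave the application to handle old ones.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Schema-on-write: the database enforces the new structure and old rows are backfilled up front
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;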

&lt;p&gt;Modern relational databases often include some form of storage locality, similar to document databases, and the column-family concept used by databases like Cassandra serves a similar locality-management purpose. It seems, then, that we’re moving toward a hybrid approach between the two models.&lt;/p&gt;




&lt;h3&gt;
  
  
  Declarative vs Imperative Models and Historical Context
&lt;/h3&gt;

&lt;p&gt;The author illustrates the difference between imperative and declarative query languages. The key distinction is that, with a declarative approach, you specify only the result you want — not how to get it. This allows the database optimizer to perform its own optimizations whenever it deems necessary. Once again, the declarative approach can be compared to depending on an abstraction rather than an implementation.&lt;/p&gt;
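
&lt;p&gt;For example (the &lt;code&gt;animals&lt;/code&gt; table is just an illustration), the declarative version only states the desired result, while the imperative equivalent would loop over every record in application code and filter by hand:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Declarative: what we want, with no instructions about how to find it
SELECT * FROM animals WHERE family = 'Sharks';
-- The optimizer is free to use an index on family, scan rows in any order,
-- or parallelize the work, without the query having to change.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;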

&lt;p&gt;Returning to imperative vs. declarative query languages: since SQL (a relational model and declarative language) doesn’t let users enforce a specific algorithm or execution plan, query optimizers can perform optimizations that are often not possible with imperative languages. This also makes SQL better suited for distributed systems, where data is processed across multiple machines in parallel and controlling every detail of execution is difficult.&lt;/p&gt;

&lt;p&gt;In a way, the NoSQL debate is revisiting a discussion that computer science already faced in the 1970s. At that time, the most popular database system was IBM IMS, which implemented what was called the &lt;em&gt;hierarchical model&lt;/em&gt; — a model that shares many similarities with the JSON model used by document databases today. And, just like document databases, it struggled to represent many-to-one or many-to-many relationships.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmujl8cnu013w0u1lqg8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmujl8cnu013w0u1lqg8.png" alt=" " width="800" height="1041"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To address those limitations, two new models were widely discussed: the &lt;em&gt;network model&lt;/em&gt; and the &lt;em&gt;relational model.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The idea behind the network model was to expand on the hierarchical model. It kept the tree-like structure that worked well for one-to-many relationships but allowed a node to have more than one parent. The links between records worked similarly to pointers. To access a specific record, developers had to navigate through these links using what were called &lt;em&gt;access paths.&lt;/em&gt; In practice, this made the code for querying and updating databases quite complex and inflexible.&lt;/p&gt;

&lt;p&gt;The relational model, on the other hand, introduced the idea of query optimizers — systems capable of finding the most efficient way to retrieve data, or the best “access paths.” These optimizers are complex pieces of software, but they only need to be built once.&lt;/p&gt;

&lt;p&gt;This mirrors the difference between imperative and declarative query languages: while the network model required developers to know exactly how to reach a given record (imperative), the relational model allowed them to simply state what data they wanted (declarative).&lt;/p&gt;




&lt;h3&gt;
  
  
  Graph Databases and Complex Relationships
&lt;/h3&gt;

&lt;p&gt;Although relational databases can naturally model many-to-many relationships, when your needs for complex relationships grow, it’s often recommended to use a graph model instead (think of social networks or even Obsidian note links). Graphs can connect different types of objects as well.&lt;/p&gt;

&lt;p&gt;Graphs offer a more intuitive way to represent complex real-world relationships. To understand how they’re structured, we can think of them in terms of a relational database.&lt;/p&gt;

&lt;p&gt;We have &lt;strong&gt;vertices&lt;/strong&gt;, which contain an identifier and properties:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;vertices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;vertex_id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And &lt;strong&gt;edges&lt;/strong&gt;, which connect vertices. Every edge must have a tail vertex and a head vertex. Each edge represents a relationship identified by a label. Since every edge has both a head and a tail, there’s a sense of direction in the relationship.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;edge_id&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tail_vertex&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;vertices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertex_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;head_vertex&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;vertices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vertex_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;label&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s say we need to model which foods a person is allowed to consume. We could have vertices representing people, allergens, and foods. We could connect &lt;em&gt;Mark&lt;/em&gt; to &lt;em&gt;gluten&lt;/em&gt; with an edge labeled &lt;strong&gt;"allergic to"&lt;/strong&gt;, and &lt;em&gt;gluten&lt;/em&gt; to &lt;em&gt;bread&lt;/em&gt; with an edge labeled &lt;strong&gt;"found in"&lt;/strong&gt;. From there, we could infer that Mark is allergic to gluten found in bread.&lt;/p&gt;
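
&lt;p&gt;Using the relational representation above, that small graph could be populated like this (the IDs, labels, and property values are made up for the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Vertices for the person, the allergen, and the food
INSERT INTO vertices (vertex_id, properties) VALUES
  (1, '{"type": "person",   "name": "Mark"}'),
  (2, '{"type": "allergen", "name": "gluten"}'),
  (3, '{"type": "food",     "name": "bread"}');

-- Directed, labeled edges: Mark -&amp;gt; gluten ("allergic_to"), gluten -&amp;gt; bread ("found_in")
INSERT INTO edges (edge_id, tail_vertex, head_vertex, label, properties) VALUES
  (1, 1, 2, 'allergic_to', '{}'),
  (2, 2, 3, 'found_in',    '{}');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;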

&lt;p&gt;Now, imagine that in the future we’d like to add Mark’s address to the graph. We could simply create a &lt;em&gt;city&lt;/em&gt; vertex, say &lt;strong&gt;"London"&lt;/strong&gt;, and add an edge between &lt;em&gt;Mark&lt;/em&gt; and &lt;em&gt;London&lt;/em&gt; labeled &lt;strong&gt;"lives in"&lt;/strong&gt;. This flexibility — or &lt;em&gt;evolvability&lt;/em&gt; — is one of the key advantages of graph models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Neo4j&lt;/strong&gt; is a popular graph database that uses &lt;strong&gt;Cypher&lt;/strong&gt; as its declarative query language. Here’s an example of how to create a graph with Cypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;NAmerica:&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'North America'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="py"&gt;type:&lt;/span&gt;&lt;span class="s1"&gt;'continent'&lt;/span&gt;&lt;span class="ss"&gt;}),&lt;/span&gt;
  &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;USA:&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'United States'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="py"&gt;type:&lt;/span&gt;&lt;span class="s1"&gt;'country'&lt;/span&gt;&lt;span class="ss"&gt;}),&lt;/span&gt;
  &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;Idaho:&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'Idaho'&lt;/span&gt;&lt;span class="ss"&gt;,&lt;/span&gt; &lt;span class="py"&gt;type:&lt;/span&gt;&lt;span class="s1"&gt;'state'&lt;/span&gt;&lt;span class="ss"&gt;}),&lt;/span&gt;
  &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;Lucy:&lt;/span&gt;&lt;span class="n"&gt;Person&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'Lucy'&lt;/span&gt;&lt;span class="ss"&gt;}),&lt;/span&gt;
  &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Idaho&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:WITHIN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;USA&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:WITHIN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NAmerica&lt;/span&gt;&lt;span class="ss"&gt;),&lt;/span&gt;
  &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lucy&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:BORN_IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Idaho&lt;/span&gt;&lt;span class="ss"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now suppose we want to find the names of all people who emigrated from the United States to Europe. To do that, we’d need to query people born in a city within the U.S. who currently live in Europe. Here’s how that query might look in Cypher:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cypher"&gt;&lt;code&gt;&lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:BORN_IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:WITHIN&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;us:&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'United States'&lt;/span&gt;&lt;span class="ss"&gt;}),&lt;/span&gt;
  &lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="n"&gt;person&lt;/span&gt;&lt;span class="ss"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:LIVES_IN&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;()&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="ss"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;:WITHIN&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mf"&gt;0.&lt;/span&gt;&lt;span class="n"&gt;.&lt;/span&gt;&lt;span class="ss"&gt;]&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="ss"&gt;(&lt;/span&gt;&lt;span class="py"&gt;eu:&lt;/span&gt;&lt;span class="n"&gt;Location&lt;/span&gt; &lt;span class="ss"&gt;{&lt;/span&gt;&lt;span class="py"&gt;name:&lt;/span&gt;&lt;span class="s1"&gt;'Europe'&lt;/span&gt;&lt;span class="ss"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;person.name&lt;/span&gt;&lt;span class="ss"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a structure like this, there are many possible ways to reach the same result. As is common with declarative query languages, the query optimizer decides the most efficient way to get there.&lt;/p&gt;

&lt;p&gt;You could achieve the same result using a relational database, but the query would need recursion, since you don’t know how many joins are required or how long you’d need to traverse the data before finding the answer. While it’s possible, it’s much more complicated to represent this kind of traversal with SQL.&lt;/p&gt;
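
&lt;p&gt;A heavily abbreviated sketch of that traversal in SQL (assuming PostgreSQL-style recursive CTEs, the &lt;code&gt;vertices&lt;/code&gt;/&lt;code&gt;edges&lt;/code&gt; tables from earlier, and JSON properties queried with &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt;): the recursion first expands "within the United States" and "within Europe" to every contained location, and only then do the final joins run.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH RECURSIVE
  -- All locations contained (directly or transitively) in the United States
  in_usa(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties-&amp;gt;&amp;gt;'name' = 'United States'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
  ),
  -- All locations contained (directly or transitively) in Europe
  in_europe(vertex_id) AS (
      SELECT vertex_id FROM vertices WHERE properties-&amp;gt;&amp;gt;'name' = 'Europe'
    UNION
      SELECT edges.tail_vertex FROM edges
        JOIN in_europe ON edges.head_vertex = in_europe.vertex_id
        WHERE edges.label = 'within'
  )
SELECT vertices.properties-&amp;gt;&amp;gt;'name' AS name
  FROM vertices
  JOIN edges AS born_in  ON born_in.tail_vertex  = vertices.vertex_id
  JOIN in_usa            ON born_in.head_vertex  = in_usa.vertex_id
  JOIN edges AS lives_in ON lives_in.tail_vertex = vertices.vertex_id
  JOIN in_europe         ON lives_in.head_vertex = in_europe.vertex_id
  WHERE born_in.label = 'born_in' AND lives_in.label = 'lives_in';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;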

&lt;p&gt;In the book’s example, the same query that takes &lt;strong&gt;4 lines&lt;/strong&gt; in Cypher requires &lt;strong&gt;29 lines&lt;/strong&gt; in SQL. It’s clear that some data models are simply better suited for certain use cases than others.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Designing Data-Intensive Applications — Chapter 1: Reliable, Scalable, and Maintainable Applications</title>
      <dc:creator>Romario Da Silva</dc:creator>
      <pubDate>Wed, 29 Oct 2025 20:29:59 +0000</pubDate>
      <link>https://dev.to/romario_dasilva_cbef4e6a/designing-data-intensive-applications-chapter-1-foundations-f8o</link>
      <guid>https://dev.to/romario_dasilva_cbef4e6a/designing-data-intensive-applications-chapter-1-foundations-f8o</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;This post is part of a series summarizing key ideas from Designing Data-Intensive Applications by Martin Kleppmann.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In most data-intensive applications, a few standard components appear repeatedly: databases, caches, search indexes, stream processors, and batch processors. Collectively, these are known as data systems.&lt;/p&gt;

&lt;p&gt;Each of these systems has its own characteristics, strengths, and weaknesses. To make informed decisions—like which database to use in a given scenario or which caching strategy fits best—we need to understand how these mechanisms work, what they excel at, and where they fall short.&lt;/p&gt;

&lt;p&gt;That’s where Designing Data-Intensive Applications by Martin Kleppmann comes in. The book’s goal is to establish a set of principles that help us make better design decisions. But before applying those principles, we first need to clarify what exactly we’re optimizing for with each choice.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Blurring Lines Between Data Systems
&lt;/h3&gt;

&lt;p&gt;Traditionally, different data systems had clearly defined roles. Databases, message queues, and caches each served distinct purposes. Today, however, those boundaries are much blurrier. Redis, for example, is often used as a messaging system through its publish/subscribe channels, while Kafka provides database-like durability guarantees.&lt;/p&gt;

&lt;p&gt;As developers, we now combine these once-specialized systems—each far more general-purpose than before—into composite systems that process data in specific ways to meet our business needs. In doing so, we essentially become data system designers ourselves.&lt;/p&gt;

&lt;p&gt;This shift means we face many of the same challenges that the creators of these systems encountered:&lt;br&gt;
How do we ensure data remains safe after a crash?&lt;br&gt;
How do we maintain consistency at the level our use case requires?&lt;br&gt;
In a sense, we’re solving the same kinds of problems—just at a different level of abstraction.&lt;/p&gt;




&lt;h3&gt;
  
  
  Reliability — Building from Unreliable Parts
&lt;/h3&gt;

&lt;p&gt;The book defines reliability not as having “reliable data,” but as building fault-tolerant systems—systems that continue to function correctly even when software, hardware, or human errors (or even malicious actions) occur.&lt;/p&gt;

&lt;p&gt;It’s important to distinguish between a fault and a failure.&lt;br&gt;
A fault happens when a component behaves unexpectedly.&lt;br&gt;
A failure occurs when that fault affects the entire system and users notice the problem.&lt;/p&gt;

&lt;p&gt;The goal of fault-tolerant design is to prevent faults from turning into failures.&lt;/p&gt;

&lt;p&gt;Since faults are inevitable, reliability comes from building reliable systems out of unreliable parts. One of the best ways to do this is by deliberately inducing faults to test your assumptions.&lt;/p&gt;

&lt;p&gt;Netflix’s “Chaos Monkey” is a great example. In their &lt;a href="//techblog.netflix.com/2011/07/netflix-simian-army.html"&gt;2011 blog post&lt;/a&gt;, they explain how they randomly terminate production servers to ensure systems can handle unexpected failures. The goal is to avoid the nightmare scenario: a real outage affecting customers while executives are on the call and engineers are troubleshooting under pressure for the first time. By creating controlled failures, Netflix can safely observe how their systems behave—and train teams to respond effectively when real faults occur.&lt;/p&gt;




&lt;h3&gt;
  
  
  Scalability — Understanding and Managing Load
&lt;/h3&gt;

&lt;p&gt;Next comes scalability. To scale a system, you first need to define what its load actually means. Is it the number of records processed per second? Active users? Requests per minute? You need this definition before deciding how to handle growth.&lt;/p&gt;

&lt;p&gt;Each system has load metrics that make more sense than others. Kleppmann uses Twitter as an example: reading tweets happens far more often than writing them. Users typically read dozens of tweets for every one they post.&lt;/p&gt;

&lt;p&gt;When someone tweets, Twitter must make that post appear on followers’ timelines within seconds. Because writes are less frequent, it makes sense to do most of the heavy lifting during the write—this makes reads much faster.&lt;/p&gt;

&lt;p&gt;To achieve this, Twitter fans out tweets to followers as they’re written, rather than making every read request fetch tweets from a single database. Otherwise, the database would quickly become a bottleneck. Of course, this approach has limits—the “celebrity problem” being one. Fanning out a single tweet to millions of followers is inefficient, so Twitter uses a different read workflow for those cases.&lt;/p&gt;
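
&lt;p&gt;For contrast, here is a sketch of the pull-based read path that fan-out replaces (table names follow the book’s example; &lt;code&gt;current_user&lt;/code&gt; is a placeholder): every timeline load would run a join like this over the global tweets table, which is what makes the database the bottleneck.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Assemble a timeline on every read instead of precomputing it on write
SELECT tweets.*, users.*
  FROM tweets
  JOIN users   ON tweets.sender_id    = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;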




&lt;p&gt;Once you’ve defined your load, the next step is to understand how your system performs under it. A common metric here is response time (or latency).&lt;/p&gt;

&lt;p&gt;Response times are usually expressed as percentiles. For example, you might say that 95% of requests are processed in under 200 ms. The median (p50) means half of all requests complete within a given time. Amazon, for instance, monitors latency at the 99.9th percentile—because even if only 1 in 1,000 users experience slow responses, those users may represent high-value customers handling large transactions.&lt;/p&gt;
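
&lt;p&gt;As a small illustration (assuming a PostgreSQL-style &lt;code&gt;requests&lt;/code&gt; table that logs one &lt;code&gt;duration_ms&lt;/code&gt; value per request), those percentiles can be computed straight from the recorded measurements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Median, 95th, and 99.9th percentile response times over logged requests
SELECT
  percentile_cont(0.5)   WITHIN GROUP (ORDER BY duration_ms) AS p50,
  percentile_cont(0.95)  WITHIN GROUP (ORDER BY duration_ms) AS p95,
  percentile_cont(0.999) WITHIN GROUP (ORDER BY duration_ms) AS p999
FROM requests;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;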

&lt;p&gt;At high percentiles, queuing delays often dominate. Servers can only process a limited number of requests in parallel, and a few slow ones can block the rest—this is called head-of-line blocking. Even fast requests end up waiting, making the entire system feel sluggish. For this reason, it’s crucial to measure response times from the client’s perspective, not just on the server.&lt;/p&gt;

&lt;p&gt;Once you have accurate measurements, you can plan how to handle increased load without sacrificing performance.&lt;/p&gt;

&lt;p&gt;The two main strategies are scaling up (using a more powerful machine) and scaling out (adding more machines). Scaling up is simpler but quickly becomes expensive. Scaling out is more complex—especially for stateful systems like databases—but it’s often the only practical path as demand grows.&lt;/p&gt;

&lt;p&gt;Fortunately, as distributed systems tools and abstractions improve, we can expect more data systems to be designed as distributed by default.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>architecture</category>
      <category>dataengineering</category>
      <category>database</category>
    </item>
  </channel>
</rss>
