This post is part of a series summarizing key ideas from Designing Data-Intensive Applications by Martin Kleppmann.
Layers of Abstraction
Data models work as a sort of interface between different layers of abstraction. Each model enables adjacent layers to communicate with one another.
Besides that, each model also limits what the layers above it can do. For example, if you store your data in a relational database, that decision comes with limitations on what you can or should do with it.
As application developers, we model real-world concepts into objects, classes, and data structures.
These objects are then modeled as tables, JSON documents, or some other format in order to be stored.
Engineers who build those data storage systems need to figure out how to represent that information in terms of bytes, in memory or on disk.
Hardware engineers come up with ways to represent those bytes with electrical currents.
Relational, Document, and Hybrid Models
The relational model has clearly won the quest for the best way to model data, having been in use since the 1970s. Despite that, NoSQL databases became popular in the 2010s, driven by the need for greater scalability, specialized query operations that the relational model doesn't support well, or simply more flexibility.
A common criticism of the SQL/relational model comes from the way most applications are written — using the OOP paradigm. In these cases, there’s a need for a translation layer between objects and tables. ORMs are commonly used to ease that process. This mismatch is often called the impedance mismatch.
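To make the mismatch concrete, here's a minimal sketch of how a single profile object might be spread across tables; the table and column names here are hypothetical, not from the book:

-- A single in-memory "profile" object ends up split across
-- multiple tables, connected by foreign keys.
CREATE TABLE users (
  user_id   integer PRIMARY KEY,
  full_name text
);

CREATE TABLE positions (
  position_id  integer PRIMARY KEY,
  user_id      integer REFERENCES users (user_id),
  job_title    text,
  organization text
);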
JSON is often seen as a more natural alternative that solves the impedance mismatch problem. However, JSON comes with its own issues, which we’ll discuss later.
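For comparison, the same profile as a self-contained JSON document might look roughly like this (the field names are illustrative):

{
  "user_id": 251,
  "full_name": "Lucy",
  "positions": [
    {"job_title": "Engineer", "organization": "Example Corp"}
  ]
}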
When working with document databases, data locality is a big advantage: all the data is stored in one place, so you don't need to look it up across different tables. It's important to note, though, that when you fetch a document, you fetch it in its entirety. So if you only need a small piece of information within a large document, the document model can be wasteful. For that reason, it's generally recommended to keep documents fairly small.
On the other hand, having IDs and foreign keys in relational databases can be extremely useful for many reasons. They make it easier to avoid ambiguity, simplify updates (you only need to update one place), and provide better search capabilities.
Using an ID instead of plain text (relational model instead of document model) reduces data duplication. On top of that, since IDs don’t have real meaning for humans, they don’t need to change. Any piece of information that carries human meaning might need to change at some point. If you go for the document-based approach, you risk write overheads and inconsistencies while updating that information. Removing that duplication is what’s called normalization in databases.
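As a sketch of what that normalization looks like, assuming a hypothetical regions table: the human-meaningful name is stored exactly once, and every user points to it by ID.

CREATE TABLE regions (
  region_id   integer PRIMARY KEY,
  region_name text  -- e.g. 'Greater Seattle Area', stored in one place
);

-- Users reference the region by ID instead of duplicating the text.
ALTER TABLE users ADD COLUMN region_id integer REFERENCES regions (region_id);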
To fetch normalized data, you need to perform joins across different tables. Document-based databases usually don’t have good support for joins. Also, data tends to get more and more interconnected as new features are added. In this sense, relational databases give us a stronger framework for dealing with that.
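Fetching the denormalized view back is then a join; continuing the hypothetical schema above:

SELECT users.full_name, regions.region_name
FROM users
JOIN regions ON regions.region_id = users.region_id;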
Regarding schema flexibility, it’s inevitable that your data will take on some structure eventually; in that sense, the document model isn’t entirely schemaless. The question is whether the database enforces that schema on write, or your application enforces it on read (schema-on-read). Both approaches have trade-offs, and we can draw an interesting parallel here with dynamically vs. statically typed programming languages. The schema-on-read approach is usually helpful when you have very heterogeneous data.
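The book illustrates schema-on-write with a migration that splits a user's name into separate fields (adapted here to the hypothetical users table from earlier); a schema-on-read application would instead just start writing documents with the new field and handle old documents when reading them:

ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(full_name, ' ', 1);  -- PostgreSQL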
Some modern databases implementing the relational model also offer storage locality, similar to document databases (Google Spanner, for example, lets the schema declare that a table's rows be interleaved within a parent table, and the column-family concept in the Bigtable data model, used by Cassandra and HBase, serves a similar purpose). It seems, though, that we're moving toward a hybrid of the two models.
Declarative vs Imperative Models and Historical Context
The author illustrates the difference between imperative and declarative query languages. The key distinction is that, with a declarative approach, you specify only the result you want — not how to get it. This allows the database optimizer to perform its own optimizations whenever it deems necessary. Once again, the declarative approach can be compared to depending on an abstraction rather than an implementation.
Since SQL (a declarative language built on the relational model) doesn't let users enforce a specific algorithm or execution plan, query optimizers can perform optimizations that are often not possible with imperative languages. This also makes SQL better suited for distributed systems, where data is processed across multiple machines in parallel and controlling every detail of execution is difficult.
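The book's running example of this contrast is a query for all sharks: declaratively, you state only the condition and let the optimizer pick indexes and execution order, whereas an imperative version would loop over every row and filter by hand.

SELECT * FROM animals WHERE family = 'Sharks';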
In a way, the NoSQL debate is revisiting a discussion that computer science already faced in the 1970s. At that time, the most popular database system was IBM IMS, which implemented what was called the hierarchical model — a model that shares many similarities with the JSON model used by document databases today. And, just like document databases, it struggled to represent many-to-one or many-to-many relationships.
To address those limitations, two new models were widely discussed: the network model and the relational model.
The idea behind the network model was to expand on the hierarchical model. It kept the tree-like structure that worked well for one-to-many relationships but allowed a node to have more than one parent. The links between records worked similarly to pointers. To access a specific record, developers had to navigate through these links using what were called access paths. In practice, this made the code for querying and updating databases quite complex and inflexible.
The relational model, on the other hand, introduced the idea of query optimizers: systems capable of finding the most efficient way to retrieve data, or the best "access paths." These optimizers are complex pieces of software, but they only need to be built once, and every application that uses the database benefits from them.
This mirrors the difference between imperative and declarative query languages: while the network model required developers to know exactly how to reach a given record (imperative), the relational model allowed them to simply state what data they wanted (declarative).
Graph Databases and Complex Relationships
Although relational databases can naturally model many-to-many relationships, when your needs for complex relationships grow, it’s often recommended to use a graph model instead (think of social networks or even Obsidian note links). Graphs can connect different types of objects as well.
Graphs offer a more intuitive way to represent complex real-world relationships. To understand how they’re structured, we can think of them in terms of a relational database.
We have vertices, which contain an identifier and properties:
CREATE TABLE vertices (
  vertex_id  integer PRIMARY KEY,
  properties json
);
And edges, which connect vertices. Every edge must have a tail vertex and a head vertex. Each edge represents a relationship identified by a label. Since every edge has both a head and a tail, there’s a sense of direction in the relationship.
CREATE TABLE edges (
  edge_id     integer PRIMARY KEY,
  tail_vertex integer REFERENCES vertices (vertex_id),
  head_vertex integer REFERENCES vertices (vertex_id),
  label       text,
  properties  json
);
Let’s say we need to model which foods a person is allowed to consume. We could have vertices representing people, allergens, and foods. We could connect Mark to gluten with an edge labeled "allergic to", and gluten to bread with an edge labeled "found in". From there, we could infer that Mark is allergic to gluten found in bread.
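Using the relational representation above, that example might be stored like this (the IDs are illustrative):

INSERT INTO vertices (vertex_id, properties) VALUES
  (1, '{"name": "Mark", "type": "person"}'),
  (2, '{"name": "gluten", "type": "allergen"}'),
  (3, '{"name": "bread", "type": "food"}');

INSERT INTO edges (edge_id, tail_vertex, head_vertex, label) VALUES
  (1, 1, 2, 'allergic to'),  -- Mark -> gluten
  (2, 2, 3, 'found in');     -- gluten -> bread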
Now, imagine that in the future we’d like to add Mark’s address to the graph. We could simply create a city vertex, say "London", and add an edge between Mark and London labeled "lives in". This flexibility — or evolvability — is one of the key advantages of graph models.
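Notice that the later addition is just two more rows, with no schema migration:

-- Added later: Mark's address. No schema change required.
INSERT INTO vertices (vertex_id, properties) VALUES
  (4, '{"name": "London", "type": "city"}');

INSERT INTO edges (edge_id, tail_vertex, head_vertex, label) VALUES
  (3, 1, 4, 'lives in');  -- Mark -> London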
Neo4j is a popular graph database that uses Cypher as its declarative query language. Here’s an example of how to create a graph with Cypher:
CREATE
  (NAmerica:Location {name:'North America', type:'continent'}),
  (USA:Location {name:'United States', type:'country'}),
  (Idaho:Location {name:'Idaho', type:'state'}),
  (Lucy:Person {name:'Lucy'}),
  (Idaho)-[:WITHIN]->(USA)-[:WITHIN]->(NAmerica),
  (Lucy)-[:BORN_IN]->(Idaho);
Now suppose we want to find the names of all people who emigrated from the United States to Europe. To do that, we’d need to query people born in a city within the U.S. who currently live in Europe. Here’s how that query might look in Cypher:
MATCH
  (person)-[:BORN_IN]->()-[:WITHIN*0..]->(us:Location {name:'United States'}),
  (person)-[:LIVES_IN]->()-[:WITHIN*0..]->(eu:Location {name:'Europe'})
RETURN person.name;
In a structure like this, there are many possible ways to reach the same result. As is common with declarative query languages, the query optimizer decides the most efficient way to get there.
You could achieve the same result with a relational database, but the query would need recursion, since you don't know in advance how many joins are required, or how far you'd need to traverse the data before finding the answer. While possible, this kind of traversal is much more cumbersome to express in SQL.
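For a flavor of what that recursion looks like, here's an abbreviated sketch of just the first step of the book's SQL version: collecting every location inside the United States with a recursive common table expression, assuming the vertices/edges tables from earlier with 'within'-labeled edges (the full query repeats this for Europe and then joins both sets against people).

WITH RECURSIVE in_usa(vertex_id) AS (
    -- Start with the vertex for the United States itself...
    SELECT vertex_id FROM vertices
    WHERE properties->>'name' = 'United States'
  UNION
    -- ...then repeatedly follow incoming 'within' edges.
    SELECT edges.tail_vertex FROM edges
    JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
    WHERE edges.label = 'within'
)
SELECT vertex_id FROM in_usa;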
In the book’s example, the same query that takes 4 lines in Cypher requires 29 lines in SQL. It’s clear that some data models are simply better suited for certain use cases than others.
