Do you think schema flexibility justifies using NoSQL? Think twice.

Ernesto Campohermoso — Fri, 27 Dec 2024 14:05:55 +0000

Introduction

In the world of software development, there is a common belief that implementing a NoSQL database is justified solely by its schema flexibility. However, this perspective can be misleading if we overlook fundamental aspects such as the CAP theorem and the differences between Schema on Write and Schema on Read. As Martin Kleppmann explains in his book Designing Data-Intensive Applications, the choice of a database should be made with a deep understanding of the requirements for consistency, availability, and partition tolerance, while keeping in mind the schema with which the data will be managed.

In this article, you will learn:

What the CAP theorem entails and how it influences your application's architecture.
Why eventual consistency is not always desirable, especially when handling critical information like financial transactions.
The role partition tolerance plays and the trade-offs in consistency within large-scale distributed systems.
The real difference between document-oriented and relational databases, and how Schema on Read and Schema on Write are not as distinct as they seem, even at the code level.

1. CAP Theorem

The CAP theorem states that a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. Because network partitions cannot be ruled out in practice, the real choice is between consistency and availability while a partition lasts:

Consistency: All nodes see the same data at the same time.
Availability: Every request receives a response, even if some nodes fail, although that response may not reflect the latest write.
Partition Tolerance: The system continues to function despite communication failures between nodes.

When designing a large-scale application, you must decide which of these elements are critical and which can be sacrificed. This balance guides the choice between relational databases (emphasizing strong consistency) and many NoSQL databases (leaning toward availability and partition tolerance, but with eventual consistency).
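To make the trade-off concrete, here is a minimal sketch of how a consistency-first (CP) and an availability-first (AP) system might react to the same write during a partition. The `TinyCluster` and `Replica` classes are hypothetical toy models written for this illustration, not the API of any real database.

```javascript
// Toy model: two replicas of a single value plus a flag simulating a network partition.
// All names are hypothetical; this is not any real database's API.
class Replica {
  constructor(name) {
    this.name = name;
    this.value = 0;
  }
}

class TinyCluster {
  constructor(mode) {
    this.mode = mode; // 'CP' or 'AP'
    this.replicas = [new Replica('a'), new Replica('b')];
    this.partitioned = false;
  }

  // Write through one replica; replicate to the other only if the network allows it.
  write(replicaIndex, value) {
    if (this.partitioned && this.mode === 'CP') {
      // CP: refuse the write rather than let the replicas diverge.
      return { ok: false, reason: 'unavailable during partition' };
    }
    this.replicas[replicaIndex].value = value;
    if (!this.partitioned) {
      for (const replica of this.replicas) replica.value = value;
    }
    return { ok: true };
  }

  read(replicaIndex) {
    return this.replicas[replicaIndex].value;
  }
}

const cp = new TinyCluster('CP');
const ap = new TinyCluster('AP');
cp.partitioned = true;
ap.partitioned = true;

const cpResult = cp.write(0, 42); // rejected: the replicas stay identical
const apResult = ap.write(0, 42); // accepted: the replicas now disagree

console.log(cpResult.ok, cp.read(0), cp.read(1)); // false 0 0
console.log(apResult.ok, ap.read(0), ap.read(1)); // true 42 0
```

The CP cluster gives up availability to keep both replicas identical, while the AP cluster keeps answering and leaves the replicas to be reconciled later.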

2. Eventual Consistency and When It’s Not Enough

Eventual consistency means that, once updates stop arriving, all nodes in a distributed system converge to the same state, but reads made in the meantime may return stale data. It works well for social networks or applications where slight delays in data updates do not compromise business integrity.

However, when handling money or banking operations, eventual consistency becomes a risk. In such cases, strong consistency is required, ensuring that every transaction is immediately reflected across the system without the possibility of temporary discrepancies.

Case Study

Imagine a financial institution using MongoDB configured for eventual consistency, so that some reads are served by nodes that lag behind. A user withdraws all the funds in their account, leaving it at zero, but that update doesn’t propagate to all nodes immediately. If a credit card payment is processed at the same moment against a node that still shows the old balance, the customer’s balance could go negative.

For this reason, systems managing money typically require strong consistency as an essential prerequisite.
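The scenario can be reproduced with a small simulation. This is only a sketch: `primary`, `secondary`, and `withdraw` are hypothetical in-memory stand-ins, not MongoDB calls, and the starting balance of 100 is an assumed value.

```javascript
// Hypothetical in-memory replicas of one account; not MongoDB's API.
const primary = { balance: 100 };
const secondary = { balance: 100 }; // lags behind the primary

// Withdraw only if the replica we read from shows enough funds.
function withdraw(readFrom, amount) {
  if (readFrom.balance < amount) {
    return false; // rejected: insufficient funds according to this replica
  }
  primary.balance -= amount; // writes always go to the primary
  return true;
}

// 1. The user withdraws everything; the check reads the primary.
const first = withdraw(primary, 100);

// 2. Before replication catches up, a card payment is checked against the stale secondary.
const second = withdraw(secondary, 50);

console.log(first, second, primary.balance); // true true -50

// 3. A check that reads the primary (strong consistency) rejects the same payment.
const strong = withdraw(primary, 50);
console.log(strong); // false
```

The second withdrawal succeeds only because its balance check ran against outdated data, which is the kind of temporary discrepancy that strong consistency rules out.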

3. Large Data Volumes and Trade-offs for Partition Tolerance

When dealing with large volumes of data, the system must be spread across many nodes, so network partitions become unavoidable. Many distributed architectures therefore choose to sacrifice a certain degree of consistency in order to stay available when partitions occur. If a node fails to respond or there are network issues, the application can still function with the remaining nodes.

This trade-off is essential for services with millions of simultaneous users or global systems spanning multiple geographic regions. But it’s not always the right choice: if your application demands absolute precision and cannot tolerate outdated data, adopting a model that sacrifices consistency can be detrimental.

4. Document-Oriented vs Relational: Schema on Read vs Schema on Write

In a document-oriented model (typically NoSQL), the term Schema on Read is used: the structure of documents is not rigidly defined when data is written, and the schema is interpreted, and validated if at all, by the code that reads or processes the data. On the other hand, in a relational model, Schema on Write is used: a rigid schema is defined before any record is entered into the database, and the database enforces it on every write.

While document-oriented systems offer more flexibility, the reality is that there is always a schema, in one form or another. The code processing the data must know what fields exist and how to interpret them. For example, if your application expects a "price" field to calculate a total, it cannot “guess” where that field is if it wasn’t predefined. Thus, the supposed freedom of schema doesn’t eliminate the need for a coherent design and careful evolution.
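The "price" example can be sketched as follows. The `calculateTotal` function and the `quantity` field are assumptions made for illustration; the point is that the expected shape lives in the reading code even when the database enforces nothing.

```javascript
// The implicit schema the application relies on: every item has a numeric price and quantity.
function calculateTotal(items) {
  let total = 0;
  for (const item of items) {
    if (typeof item.price !== 'number' || typeof item.quantity !== 'number') {
      throw new TypeError(`Item ${JSON.stringify(item)} does not match the expected shape`);
    }
    total += item.price * item.quantity;
  }
  return total;
}

const wellFormed = [{ price: 10, quantity: 2 }, { price: 5, quantity: 1 }];
const total = calculateTotal(wellFormed);

// A schemaless store would accept a document that uses a different field name,
// but the reading code still rejects it.
let rejected = false;
try {
  calculateTotal([{ cost: 10, quantity: 2 }]);
} catch (err) {
  rejected = err instanceof TypeError;
}

console.log(total, rejected); // 25 true
```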

Schema Modification Example

In a Relational Database

If we need to add a new column to store the last update date, we must alter the table:

ALTER TABLE products
ADD COLUMN last_update_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP;

Applications querying this table must account for the new column: queries that rely on SELECT * or on column positions may be affected, and any process that uses the last_update_date information will need its code updated to read it.

In a Document-Oriented Database

Let’s imagine a products collection in MongoDB, where the last update date wasn’t previously stored. From a specific date, we decide to add this field. The new “schema” is handled in the code:

// Starting from 2024-01-01, the application expects last_update_date on every product.
const REFERENCE_DATE = new Date('2024-01-01');

function processProduct(product) {
  // Only apply the new rule once the rollout date has passed.
  if (new Date() >= REFERENCE_DATE && !product.last_update_date) {
    // Older documents lack the field, so add it with a default value.
    product.last_update_date = new Date();
  }

  // Process the rest of the product
  // ...
  return product;
}

In this scenario, no formal structure has to be altered. But the code must handle the new field and, if it’s missing, apply a default behavior (e.g., creating it). In other words, Schema on Read is managed by the application, not by the database.
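One common way to keep this application-managed schema in a single place is a read-side normalizer. The sketch below is an alternative illustration, not the approach shown in the article: `normalizeProduct` and the sample documents are hypothetical, and it exposes a missing date as null rather than creating one.

```javascript
// Hypothetical read-side normalizer: maps any stored product document,
// old or new, to the single shape the rest of the code expects.
function normalizeProduct(doc) {
  return {
    ...doc,
    // Older documents have no last_update_date; expose null instead of inventing a timestamp.
    last_update_date: doc.last_update_date ? new Date(doc.last_update_date) : null,
  };
}

const oldDoc = { name: 'Keyboard', price: 30 };
const newDoc = { name: 'Mouse', price: 15, last_update_date: '2024-03-01T00:00:00Z' };

const normalizedOld = normalizeProduct(oldDoc);
const normalizedNew = normalizeProduct(newDoc);

console.log(normalizedOld.last_update_date);               // null
console.log(normalizedNew.last_update_date.toISOString()); // 2024-03-01T00:00:00.000Z
```

Because every consumer reads through the same function, the rule for old documents is written once instead of being repeated wherever products are processed.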

Conclusion

Choosing the right database model requires considering the CAP theorem and your application’s specific needs. If you need to scale across many nodes to handle large data volumes and can accept eventual consistency, a document-oriented or other NoSQL database may be the best option. On the other hand, if data accuracy and integrity are critical, relational databases with strong consistency often excel.

Don’t forget that other models, such as graph databases, are ideal for handling complex (many-to-many) relationships and exploring deep connections between entities. The type of relationship (one-to-many, many-to-many, etc.) and the required level of consistency should guide your decision. With this approach, you’ll build a reliable, scalable, and coherent system without falling into the misconception that schema flexibility is the only reason to use NoSQL.

Image from: https://www.commitstrip.com/en/2012/04/10/what-do-you-mean-its-oversized/?

DEV Community: Ernesto Campohermoso