
Franck Pachot

You need foreign keys and surrogate keys because you broke your relationships

Many developers believe they need a relational database because their data has relationships. The reasoning usually goes: "My entities are related, therefore I need foreign keys, therefore I need an RDBMS."

But the causality is actually reversed. Normalization creates the need for foreign keys—not the other way around.

This misunderstanding persists even when the builders are AI agents: LLMs operate on high-dimensional similarity (vectors), which captures correlation more readily than causality.

Normalization and relations

When you normalize your domain model into Normal Forms, you decompose aggregates—cohesive groups of entities that form a natural consistency boundary—into separate, independent tables. Each table represents a single relation in the mathematical sense: a set of tuples (rows) constrained by functional dependencies on a key. That is what "relational" means in "relational model"—it refers to these mathematical relations (tables), not to the relationships between entities.

Actually, this normalization is a decomposition that removes the structural associations between entities. Co-location, direct references, and pointers that once made related data physically and logically cohesive are eliminated. The connections between entities exist only through shared attribute values, and are reconstructed at query time via joins on those columns.

In other words, normalization doesn't strengthen relationships—it dissolves them from the schema and defers their reconstruction to the query, in order to have an application-agnostic representation of data.

An example: the Order aggregate

Consider a typical e-commerce domain. An Order is a natural aggregate: it contains line items, a shipping address, and payment details. In your domain model, these are not independent things—a line item has no meaning outside its order, a shipping address belongs to that specific purchase, and the payment is authorized for that exact total.

In a document database, this aggregate stays whole:

{
  "_id": ObjectId("order_123"),
  "customer": "Alice",
  "status": "confirmed",
  "shippingAddress": {
    "street": "123 Main St",
    "city": "Springfield",
    "zip": "62704"
  },
  "items": [
    { "sku": "WIDGET-A", "name": "Blue Widget", "qty": 2, "price": 9.99 },
    { "sku": "GADGET-B", "name": "Red Gadget", "qty": 1, "price": 24.99 }
  ],
  "payment": {
    "method": "credit_card",
    "last4": "4242",
    "authorized": 44.97
  }
}

The consistency boundary is the document. A single write replaces or updates the entire order atomically. No line item can be orphaned. No shipping address can drift out of sync. The structure is the integrity.

Now normalize this into Third Normal Form:

orders (order_id PK, customer, status)
order_items (item_id PK, order_id FK, sku, name, qty, price)
shipping_addresses (address_id PK, order_id FK, street, city, zip)
payments (payment_id PK, order_id FK, method, last4, authorized)

Notice what happened:

  1. The aggregate was shattered into four independent tables with no structural cohesion.
  2. Foreign keys had to be added (order_id FK) to re-express the parent-child relationship that was previously implicit through embedding.
  3. Surrogate keys had to be invented (address_id) because value objects don't have a natural identity—they were identified by their values.
  4. Joins are now required to reconstitute what was a single read—SELECT ... FROM orders JOIN order_items JOIN shipping_addresses JOIN payments.
  5. Referential integrity constraints must be declared to prevent orphan rows—a problem that could not exist when the data was embedded.
  6. Transaction scope must be explicitly managed because inserting an order now means inserting into four tables. You need BEGIN/COMMIT to ensure atomicity that was previously guaranteed by the single-document write.
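The transactional insert and the reconstituting join can be sketched with SQLite (a minimal stand-in for any RDBMS; the table and column names follow the schema above, and the values mirror the document example):

```python
import sqlite3

# Normalized Order schema from the article, expressed as SQLite DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, status TEXT);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders,
                              sku TEXT, name TEXT, qty INTEGER, price REAL);
    CREATE TABLE shipping_addresses (address_id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders,
                                     street TEXT, city TEXT, zip TEXT);
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders,
                           method TEXT, last4 TEXT, authorized REAL);
""")

# Inserting one order now spans four tables, so atomicity must be
# managed explicitly with a transaction.
with conn:  # BEGIN ... COMMIT (rolls back on exception)
    conn.execute("INSERT INTO orders VALUES (123, 'Alice', 'confirmed')")
    conn.executemany(
        "INSERT INTO order_items (order_id, sku, name, qty, price) VALUES (123, ?, ?, ?, ?)",
        [("WIDGET-A", "Blue Widget", 2, 9.99), ("GADGET-B", "Red Gadget", 1, 24.99)],
    )
    conn.execute("INSERT INTO shipping_addresses (order_id, street, city, zip) "
                 "VALUES (123, '123 Main St', 'Springfield', '62704')")
    conn.execute("INSERT INTO payments (order_id, method, last4, authorized) "
                 "VALUES (123, 'credit_card', '4242', 44.97)")

# Reading the order back requires joins to reconstitute the aggregate.
rows = conn.execute("""
    SELECT o.customer, i.sku, i.qty, a.city, p.authorized
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    JOIN shipping_addresses a ON a.order_id = o.order_id
    JOIN payments p ON p.order_id = o.order_id
    WHERE o.order_id = 123
    ORDER BY i.item_id
""").fetchall()
print(rows)
```

What was a single document write and a single document read is now a four-table transaction and a three-way join.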

Surrogate keys: identity invented for decomposition

This point deserves emphasis. In Domain-Driven Design, a Value Object has no identity—it is defined entirely by its attributes. A shipping address is a value object. A line item, depending on your domain, may be a dependent entity whose identity is meaningful only within its parent aggregate.

But a relational table requires every row to be uniquely identifiable by a primary key. When you extract a value object or dependent entity into its own table, you must assign it an artificial surrogate key—an auto-incremented integer or UUID that has no meaning in your domain. This surrogate exists solely because normalization demanded that the value object live independently.

The surrogate key is not modeling your domain. It is compensating for a storage decision.
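A minimal Python sketch of this distinction, using a frozen dataclass for the value object and a UUID as the invented surrogate (both illustrative choices, not tied to any particular database):

```python
import uuid
from dataclasses import dataclass

# A Value Object: defined entirely by its attributes, no identity of its own.
@dataclass(frozen=True)
class Address:
    street: str
    city: str
    zip: str

a1 = Address("123 Main St", "Springfield", "62704")
a2 = Address("123 Main St", "Springfield", "62704")
# Value semantics: equal by attributes, as the domain intends.
print(a1 == a2)

# Extracted into its own table, each row must carry a primary key, so a
# surrogate (here a UUID) is invented. The same value now has two distinct
# identities that mean nothing in the domain.
row1 = {"address_id": uuid.uuid4(), **vars(a1)}
row2 = {"address_id": uuid.uuid4(), **vars(a2)}
print(row1["address_id"] != row2["address_id"])
```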

This normalization has consequences for consistency and concurrency

In most RDBMS implementations, the physical model mirrors the logical one: the default unit of atomicity and locking is the row, not the original multi-entity aggregate from your domain model. Since normalization has scattered what was once a single aggregate across multiple tables, the database must provide additional mechanisms to restore consistency across those rows:

  • Referential integrity constraints (foreign keys) exist to re-enforce cross-row consistency that was inherent before decomposition, when related entities were embedded under a common root (the aggregate root).
  • Isolation levels and explicit locks exist because the natural consistency boundary (the aggregate, in Domain-Driven Design terms) no longer maps to a single lockable unit. The database must escalate from row-level locks to approximate the transactional boundary of the original aggregate root. This is implicit at the serializable isolation level (often implemented with range or predicate locks) or explicit at lower isolation levels (SELECT FOR UPDATE on the aggregate root).

What normalization costs you—and what it buys

To be fair, normalization is not without benefits. It eliminates data redundancy, which simplifies updates to shared reference data and reduces storage when the same entity participates in many aggregates (e.g., a product referenced by thousands of order lines). It provides a flexible query model where any column can become a join path, enabling ad-hoc analytics that weren't anticipated at design time. The normalized model is also the best when you don't know your use cases and data distribution, because it is an abstraction that doesn't depend on access patterns and cardinalities.

But these benefits come at a structural cost:

| What normalization removes | What must be added to compensate |
| --- | --- |
| Embedding (co-location) | Joins at query time |
| Structural association | Foreign key constraints |
| Aggregate-level atomicity | Explicit transactions across tables |
| Identity through containment | Surrogate keys |
| Single-unit locking | Isolation levels and lock escalation |

Many foreign keys in your schema are evidence of a relationship that was structurally present and had to be re-declared as a constraint after decomposition.

The relational model as a middle ground

This application-agnostic quality is not accidental—it is the relational model's primary design goal. And it positions the relational model as an intermediate representation between two extremes:

  • Document model (full business aggregates): Optimized for OLTP, where the unit of read and write aligns with the domain's consistency boundary. You read what you need in one operation, you write what belongs together atomically. The structure serves the application.
  • Columnar model (individual attributes): Optimized for analytics and ad-hoc queries, where any combination of columns across millions of rows can be scanned, filtered, and aggregated without regard to entity boundaries. The structure serves the query engine.

The relational model sits between these two. It doesn't fully optimize for either transactional aggregates or analytical scans, but it provides a general-purpose representation that can serve both reasonably well when the database engine adds physical optimizations. Rows are decomposed enough to avoid redundancy and enable flexible joins, but not so decomposed that individual attributes lose their tuple context. This is why the relational model became the default: when you don't know your access patterns, don't know your future use cases, and need a single schema to serve multiple applications, normalization into relations is the safest bet.

But "safest bet" is not the same as "best fit." When you do know your domain, your access patterns, and your consistency boundaries, that middle-ground positioning becomes over-engineering for your transactional workload and under-optimization for your analytical one.

Beyond the aggregates

Everything above describes referential integrity within a domain aggregate—the consistency boundary where entities and value objects are tightly coupled under a single root. But there are relationships between aggregates too, and this is where your software architecture determines whether the database should enforce them:

  • Domain-Driven Design decouples aggregates so that each team can operate independently within a bounded context. You may reference another aggregate by its root identity, but enforcement is asynchronous—through application events between services—always routing consistency through the aggregate root. In this architecture, cross-aggregate foreign keys in the database would create the tight coupling that DDD explicitly avoids.
  • Monolithic data models keep all entities tightly coupled. Foreign keys can be defined across the entire schema, and any application module can update any entity. In this world, in-database integrity constraints are the sole guarantor of business invariants, because no architectural boundary prevents inconsistent writes.

The choice is not purely technical—it reflects whether you trust your architecture or your database to enforce boundaries between aggregates.

When the application evolves

This architectural distinction becomes critical when new use cases emerge—and they always do.

In a relational model, evolution means adding columns, adding tables, and adding joins. The normalized schema was over-engineered for the first use case, but that over-engineering pays off as flexibility: a new query path is just a new join condition. Schema changes against existing data require migration scripts—ALTER TABLE ADD COLUMN, backfilling values, updating constraints—but the work is well-understood and the tooling is mature. New use cases accumulate within the same schema without requiring organizational change: no new services to deploy, no event streams to operate, no eventual consistency to reason about. A single team can absorb new requirements by extending the model, and each evolution is a small, incremental step.

An ORM can lower the initial barrier significantly. Tools like Hibernate, ActiveRecord, or SQLAlchemy let developers define entities in application code and generate the normalized schema automatically—making prototyping and MVPs feel almost as frictionless as a schema-flexible database. But the ORM is an abstraction over decomposition, not an elimination of it. The joins, the transactions, the N+1 query problems are still there—just hidden until load reveals them. An ORM that isn't tuned (lazy loading in loops, missing fetch strategies, implicit transaction boundaries) will generate SQL that defeats the relational model's strengths. The ease of starting must be followed by the discipline of tuning.
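The N+1 pattern can be made visible without any ORM, by counting the statements a lazy-loading loop sends versus a single join (a sketch using SQLite's trace callback; the data is synthetic):

```python
import sqlite3

# Minimal two-table setup: 100 orders, one line item each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
""")
for oid in range(1, 101):
    conn.execute("INSERT INTO orders VALUES (?, ?)", (oid, f"customer-{oid}"))
    conn.execute("INSERT INTO order_items (order_id, sku) VALUES (?, 'WIDGET-A')", (oid,))
conn.commit()

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement sent

# Lazy loading in a loop, as an untuned ORM would do:
# 1 query for the orders + 1 query per order = N+1 statements.
for (oid,) in conn.execute("SELECT order_id FROM orders").fetchall():
    conn.execute("SELECT sku FROM order_items WHERE order_id = ?", (oid,)).fetchall()
n_plus_one = len(statements)

# The tuned alternative: one join fetches the same data in a single statement.
statements.clear()
conn.execute("""
    SELECT o.order_id, i.sku FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
""").fetchall()
joined = len(statements)

print(n_plus_one, joined)
```

The loop version costs 101 round trips where the join costs one; an ORM hides this difference until load makes it visible.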

However, as the schema grows—more tables, more joins, more constraints, more cross-cutting migrations—the monolithic model becomes harder to reason about, harder to change safely, and harder to split later when organizational growth demands it. Every new use case widens the blast radius of a schema change.

In a document model, evolution follows the bounded context. When a new use case aligns with an existing aggregate, you add fields to the document—and because documents are schema-flexible, existing data doesn't require migration. Old documents simply lack the new field, which the application handles gracefully. But when a new use case crosses aggregate boundaries or introduces a different access pattern—say, analytics across all orders, or a recommendation engine that needs to correlate items across customers—you don't contort your existing domain model to accommodate it. Instead, you create a new bounded context with its own collection, its own schema optimized for its own access pattern, and you feed it through event streaming (change streams, CDC) from the source aggregates.
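A sketch of that graceful handling, using a hypothetical giftMessage field as the new addition (plain Python dicts stand in for documents):

```python
# An old document written before the field existed, and a new one.
old_order = {"_id": "order_123", "customer": "Alice", "status": "confirmed"}
new_order = {"_id": "order_456", "customer": "Bob", "status": "confirmed",
             "giftMessage": "Happy birthday!"}

def gift_message(order: dict) -> str:
    # No migration script: the absent field is handled with a default on read.
    return order.get("giftMessage", "")

print(gift_message(old_order), gift_message(new_order))
```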

This comes at a cost too: you need the infrastructure for event streaming, you must reason about eventual consistency between contexts, and you need organizational maturity to define clean boundaries. For a small team building an MVP, spinning up event pipelines or managing change stream consumers may be premature architecture.

This is a fundamental difference in how complexity grows:

  • The monolithic model absorbs complexity within a single schema. New use cases add joins and constraints. The schema becomes harder to reason about, harder to migrate, and harder to split later—but each evolution is a small step that doesn't require organizational change.
  • The domain model distributes complexity across bounded contexts. Each context remains simple—optimized for its specific use case—and the integration between them is through events. You need to recognize when a new use case doesn't fit the existing bounded context, and spawn a new collection with its own stream, rather than adding multi-document transactions or cross-collection lookups that would compromise the aggregate's integrity.

In practice, this means a document-oriented architecture scales with organizational complexity. Each team owns its bounded context, its schema, and its data. A new analytical use case doesn't require the order service to change—it subscribes to order events and materializes its own read-optimized view. The order aggregate stays clean, stays fast, and stays owned by a single team. Meanwhile, the relational model scales with functional complexity within a single team—letting a small group serve many use cases from one schema, provided they can manage its growing weight.
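A toy sketch of such a consumer: an in-memory list stands in for the event stream, and the read model is a per-SKU counter (the event shapes are illustrative, not a real change-stream or CDC payload):

```python
from collections import Counter

# Events emitted by the order service; the analytics context only subscribes.
order_events = [
    {"type": "OrderConfirmed", "orderId": "order_123",
     "items": [{"sku": "WIDGET-A", "qty": 2}, {"sku": "GADGET-B", "qty": 1}]},
    {"type": "OrderConfirmed", "orderId": "order_124",
     "items": [{"sku": "WIDGET-A", "qty": 3}]},
]

units_per_sku: Counter = Counter()

def on_event(event: dict) -> None:
    # Fold each event into the read-optimized view; in production this
    # would be a change-stream or CDC consumer, eventually consistent.
    if event["type"] == "OrderConfirmed":
        for item in event["items"]:
            units_per_sku[item["sku"]] += item["qty"]

for e in order_events:
    on_event(e)

print(dict(units_per_sku))
```

The order aggregate never changes to serve this use case; the new context materializes its own view from the stream.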

| Dimension | Relational model | Document model |
| --- | --- | --- |
| Getting started | More upfront schema work | Fast to prototype |
| Adding use cases | Extend the schema, add joins | Add fields or spawn new bounded contexts |
| Complexity growth | Accumulates within one schema | Distributes across contexts |
| Organizational impact | Larger team synchronization | Requires team boundaries |
| Risk at scale | Monolithic schema becomes rigid | Must commit to event-driven design |

The frequency of releases reflects the difference. In relational databases, schema changes are typically batched into migration windows, and even a monthly window is often seen as a burden. With schema-flexible documents, schema evolution happens at the pace of code deployment: weekly or even daily iterations, without migration scripts or downtime windows.

The takeaway

Relationships exist in your domain model regardless of your database choice. A document database preserves them—a document maps to an aggregate where entities and value objects are embedded, with transactions and locks scoped to that boundary. A relational database decomposes them through normalization and reconstructs them through joins, foreign keys, and transaction isolation.

The relational model earns its place as a general-purpose middle ground that scales with functional complexity—letting a single team serve many use cases from one schema without organizational change. The document model scales with organizational complexity—each team owns its bounded context, its aggregate stays clean, and new use cases spawn new contexts connected by events rather than widening a monolithic schema.

The question is not "Does my data have relationships?" (it always does). The question is: "Do I want my database to preserve my aggregate boundaries or decompose them? And when my application evolves, do I want complexity to accumulate in my schema, or to distribute across my architecture?"
