
Franck Pachot

You need foreign keys and surrogate keys because you broke your relationships

Many developers believe they need a relational database because their data has relationships. The reasoning usually goes: "My entities are related, therefore I need foreign keys, therefore I need an RDBMS."

But the causality is actually reversed. Normalization creates the need for foreign keys—not the other way around.

This misunderstanding persists even when the builders are AI agents: LLMs operate on high-dimensional similarity (vectors), which captures correlation more readily than causality.

Normalization and relations

When you normalize your domain model into Normal Forms, you decompose aggregates—cohesive groups of entities that form a natural consistency boundary—into separate, independent tables. Each table represents a single relation in the mathematical sense: a set of tuples (rows) constrained by functional dependencies on a key. That is what "relational" means in "relational model"—it refers to these mathematical relations (tables), not to the relationships between entities.

Actually, this normalization is a decomposition that removes the structural associations between entities. Co-location, direct references, and pointers that once made related data physically and logically cohesive are eliminated. The connections between entities exist only through shared attribute values, and are reconstructed at query time via joins on those columns.

In other words, normalization doesn't strengthen relationships—it dissolves them from the schema and defers their reconstruction to the query, in order to have an application-agnostic representation of data.

An example: the Order aggregate

Consider a typical e-commerce domain. An Order is a natural aggregate: it contains line items, a shipping address, and payment details. In your domain model, these are not independent things—a line item has no meaning outside its order, a shipping address belongs to that specific purchase, and the payment is authorized for that exact total.

In a document database, this aggregate stays whole:

{
  "_id": ObjectId("order_123"),
  "customer": "Alice",
  "status": "confirmed",
  "shippingAddress": {
    "street": "123 Main St",
    "city": "Springfield",
    "zip": "62704"
  },
  "items": [
    { "sku": "WIDGET-A", "name": "Blue Widget", "qty": 2, "price": 9.99 },
    { "sku": "GADGET-B", "name": "Red Gadget", "qty": 1, "price": 24.99 }
  ],
  "payment": {
    "method": "credit_card",
    "last4": "4242",
    "authorized": 44.97
  }
}

The consistency boundary is the document. A single write replaces or updates the entire order atomically. No line item can be orphaned. No shipping address can drift out of sync. The structure is the integrity.

Now normalize this into Third Normal Form:

orders (order_id PK, customer, status)
order_items (item_id PK, order_id FK, sku, name, qty, price)
shipping_addresses (address_id PK, order_id FK, street, city, zip)
payments (payment_id PK, order_id FK, method, last4, authorized)

Notice what happened:

  1. The aggregate was shattered into four independent tables with no structural cohesion.
  2. Foreign keys had to be added (order_id FK) to re-express the parent-child relationship that was previously implicit through embedding.
  3. Surrogate keys had to be invented (address_id) because value objects don't have a natural identity—they were identified by their values.
  4. Joins are now required to reconstitute what was a single read—SELECT ... FROM orders JOIN order_items JOIN shipping_addresses JOIN payments.
  5. Referential integrity constraints must be declared to prevent orphan rows—a problem that could not exist when the data was embedded.
  6. Transaction scope must be explicitly managed because inserting an order now means inserting into four tables. You need BEGIN/COMMIT to ensure atomicity that was previously guaranteed by the single-document write.
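The transactional insert and the reconstituting join can be sketched with SQLite (a minimal stand-in for any RDBMS; the table and column names follow the schema above, and the values mirror the document example):

```python
import sqlite3

# Normalized Order schema from the article, expressed as SQLite DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, status TEXT);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders,
                              sku TEXT, name TEXT, qty INTEGER, price REAL);
    CREATE TABLE shipping_addresses (address_id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders,
                                     street TEXT, city TEXT, zip TEXT);
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, order_id INTEGER REFERENCES orders,
                           method TEXT, last4 TEXT, authorized REAL);
""")

# Inserting one order now spans four tables, so atomicity must be
# managed explicitly with a transaction.
with conn:  # BEGIN ... COMMIT (rolls back on exception)
    conn.execute("INSERT INTO orders VALUES (123, 'Alice', 'confirmed')")
    conn.executemany(
        "INSERT INTO order_items (order_id, sku, name, qty, price) VALUES (123, ?, ?, ?, ?)",
        [("WIDGET-A", "Blue Widget", 2, 9.99), ("GADGET-B", "Red Gadget", 1, 24.99)],
    )
    conn.execute("INSERT INTO shipping_addresses (order_id, street, city, zip) "
                 "VALUES (123, '123 Main St', 'Springfield', '62704')")
    conn.execute("INSERT INTO payments (order_id, method, last4, authorized) "
                 "VALUES (123, 'credit_card', '4242', 44.97)")

# Reading the order back requires joins to reconstitute the aggregate.
rows = conn.execute("""
    SELECT o.customer, i.sku, i.qty, a.city, p.authorized
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    JOIN shipping_addresses a ON a.order_id = o.order_id
    JOIN payments p ON p.order_id = o.order_id
    WHERE o.order_id = 123
    ORDER BY i.item_id
""").fetchall()
print(rows)
```

What was a single document write and a single document read is now a four-table transaction and a three-way join.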

Surrogate keys: identity invented for decomposition

This point deserves emphasis. In Domain-Driven Design, a Value Object has no identity—it is defined entirely by its attributes. A shipping address is a value object. A line item, depending on your domain, may be a dependent entity whose identity is meaningful only within its parent aggregate.

But a relational table requires every row to be uniquely identifiable by a primary key. When you extract a value object or dependent entity into its own table, you must assign it an artificial surrogate key—an auto-incremented integer or UUID that has no meaning in your domain. This surrogate exists solely because normalization demanded that the value object live independently.

The surrogate key is not modeling your domain. It is compensating for a storage decision.
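A minimal Python sketch of this distinction, using a frozen dataclass for the value object and a UUID as the invented surrogate (both illustrative choices, not tied to any particular database):

```python
import uuid
from dataclasses import dataclass

# A Value Object: defined entirely by its attributes, no identity of its own.
@dataclass(frozen=True)
class Address:
    street: str
    city: str
    zip: str

a1 = Address("123 Main St", "Springfield", "62704")
a2 = Address("123 Main St", "Springfield", "62704")
# Value semantics: equal by attributes, as the domain intends.
print(a1 == a2)

# Extracted into its own table, each row must carry a primary key, so a
# surrogate (here a UUID) is invented. The same value now has two distinct
# identities that mean nothing in the domain.
row1 = {"address_id": uuid.uuid4(), **vars(a1)}
row2 = {"address_id": uuid.uuid4(), **vars(a2)}
print(row1["address_id"] != row2["address_id"])
```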

This normalization has consequences for consistency and concurrency

In most RDBMS implementations, the physical model mirrors the logical one: the default unit of atomicity and locking is the row, not the original multi-entity aggregate from your domain model. Since normalization has scattered what was once a single aggregate across multiple tables, the database must provide additional mechanisms to restore consistency across those rows:

  • Referential integrity constraints (foreign keys) exist to re-enforce cross-row consistency that was inherent before decomposition, when related entities were embedded under a common root (the aggregate root).
  • Isolation levels and explicit locks exist because the natural consistency boundary (the aggregate, in Domain-Driven Design terms) no longer maps to a single lockable unit. The database must escalate from row-level locks to approximate the transactional boundary of the original aggregate root. This is implicit at the serializable isolation level (often implemented with range or predicate locks) or explicit at lower isolation levels (SELECT FOR UPDATE on the aggregate root).

What normalization costs you—and what it buys

To be fair, normalization is not without benefits. It eliminates data redundancy, which simplifies updates to shared reference data and reduces storage when the same entity participates in many aggregates (e.g., a product referenced by thousands of order lines). It provides a flexible query model where any column can become a join path, enabling ad-hoc analytics that weren't anticipated at design time. The normalized model is also the best when you don't know your use cases and data distribution, because it is an abstraction that doesn't depend on access patterns and cardinalities.

But these benefits come at a structural cost:

| What normalization removes | What must be added to compensate |
| --- | --- |
| Embedding (co-location) | Joins at query time |
| Structural association | Foreign key constraints |
| Aggregate-level atomicity | Explicit transactions across tables |
| Identity through containment | Surrogate keys |
| Single-unit locking | Isolation levels and lock escalation |

Many foreign keys in your schema are evidence of a relationship that was structurally present and had to be re-declared as a constraint after decomposition.

The relational model as a middle ground

This application-agnostic quality is not accidental—it is the relational model's primary design goal. And it positions the relational model as an intermediate representation between two extremes:

  • Document model (full business aggregates): Optimized for OLTP, where the unit of read and write aligns with the domain's consistency boundary. You read what you need in one operation, you write what belongs together atomically. The structure serves the application.
  • Columnar model (individual attributes): Optimized for analytics and ad-hoc queries, where any combination of columns across millions of rows can be scanned, filtered, and aggregated without regard to entity boundaries. The structure serves the query engine.

The relational model sits between these two. It doesn't fully optimize for either transactional aggregates or analytical scans, but it provides a general-purpose representation that can serve both reasonably well when the database engine adds physical optimizations. Rows are decomposed enough to avoid redundancy and enable flexible joins, but not so decomposed that individual attributes lose their tuple context. This is why the relational model became the default: when you don't know your access patterns, don't know your future use cases, and need a single schema to serve multiple applications, normalization into relations is the safest bet.

But "safest bet" is not the same as "best fit." When you do know your domain, your access patterns, and your consistency boundaries, that middle-ground positioning becomes over-engineering for your transactional workload and under-optimization for your analytical one.

Beyond the aggregates

Everything above describes referential integrity within a domain aggregate—the consistency boundary where entities and value objects are tightly coupled under a single root. But there are relationships between aggregates too, and this is where your software architecture determines whether the database should enforce them:

  • Domain-Driven Design decouples aggregates so that each team can operate independently within a bounded context. You may reference another aggregate by its root identity, but enforcement is asynchronous—through application events between services—always routing consistency through the aggregate root. In this architecture, cross-aggregate foreign keys in the database would create the tight coupling that DDD explicitly avoids.
  • Monolithic data models keep all entities tightly coupled. Foreign keys can be defined across the entire schema, and any application module can update any entity. In this world, in-database integrity constraints are the sole guarantor of business invariants, because no architectural boundary prevents inconsistent writes.

The choice is not purely technical—it reflects whether you trust your architecture or your database to enforce boundaries between aggregates.

When the application evolves

This architectural distinction becomes critical when new use cases emerge—and they always do.

In a relational model, evolution means adding columns, adding tables, and adding joins. The normalized schema was over-engineered for the first use case, but that over-engineering pays off as flexibility: a new query path is just a new join condition. Schema changes against existing data require migration scripts—ALTER TABLE ADD COLUMN, backfilling values, updating constraints—but the work is well-understood and the tooling is mature. New use cases accumulate within the same schema without requiring organizational change: no new services to deploy, no event streams to operate, no eventual consistency to reason about. A single team can absorb new requirements by extending the model, and each evolution is a small, incremental step.

An ORM can lower the initial barrier significantly. Tools like Hibernate, ActiveRecord, or SQLAlchemy let developers define entities in application code and generate the normalized schema automatically—making prototyping and MVPs feel almost as frictionless as a schema-flexible database. But the ORM is an abstraction over decomposition, not an elimination of it. The joins, the transactions, the N+1 query problems are still there—just hidden until load reveals them. An ORM that isn't tuned (lazy loading in loops, missing fetch strategies, implicit transaction boundaries) will generate SQL that defeats the relational model's strengths. The ease of starting must be followed by the discipline of tuning.
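The N+1 pattern can be made visible without any ORM, by counting the statements a lazy-loading loop sends versus a single join (a sketch using SQLite's trace callback; the data is synthetic):

```python
import sqlite3

# Minimal two-table setup: 100 orders, one line item each.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE order_items (item_id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
""")
for oid in range(1, 101):
    conn.execute("INSERT INTO orders VALUES (?, ?)", (oid, f"customer-{oid}"))
    conn.execute("INSERT INTO order_items (order_id, sku) VALUES (?, 'WIDGET-A')", (oid,))
conn.commit()

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement sent

# Lazy loading in a loop, as an untuned ORM would do:
# 1 query for the orders + 1 query per order = N+1 statements.
for (oid,) in conn.execute("SELECT order_id FROM orders").fetchall():
    conn.execute("SELECT sku FROM order_items WHERE order_id = ?", (oid,)).fetchall()
n_plus_one = len(statements)

# The tuned alternative: one join fetches the same data in a single statement.
statements.clear()
conn.execute("""
    SELECT o.order_id, i.sku FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
""").fetchall()
joined = len(statements)

print(n_plus_one, joined)
```

The loop version costs 101 round trips where the join costs one; an ORM hides this difference until load makes it visible.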

However, as the schema grows—more tables, more joins, more constraints, more cross-cutting migrations—the monolithic model becomes harder to reason about, harder to change safely, and harder to split later when organizational growth demands it. Every new use case widens the blast radius of a schema change.

In a document model, evolution follows the bounded context. When a new use case aligns with an existing aggregate, you add fields to the document—and because documents are schema-flexible, existing data doesn't require migration. Old documents simply lack the new field, which the application handles gracefully. But when a new use case crosses aggregate boundaries or introduces a different access pattern—say, analytics across all orders, or a recommendation engine that needs to correlate items across customers—you don't contort your existing domain model to accommodate it. Instead, you create a new bounded context with its own collection, its own schema optimized for its own access pattern, and you feed it through event streaming (change streams, CDC) from the source aggregates.
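A sketch of that graceful handling, using a hypothetical giftMessage field as the new addition (plain Python dicts stand in for documents):

```python
# An old document written before the field existed, and a new one.
old_order = {"_id": "order_123", "customer": "Alice", "status": "confirmed"}
new_order = {"_id": "order_456", "customer": "Bob", "status": "confirmed",
             "giftMessage": "Happy birthday!"}

def gift_message(order: dict) -> str:
    # No migration script: the absent field is handled with a default on read.
    return order.get("giftMessage", "")

print(gift_message(old_order), gift_message(new_order))
```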

This comes at a cost too: you need the infrastructure for event streaming, you must reason about eventual consistency between contexts, and you need organizational maturity to define clean boundaries. For a small team building an MVP, spinning up event pipelines or managing change stream consumers may be premature architecture.

This is a fundamental difference in how complexity grows:

  • The monolithic model absorbs complexity within a single schema. New use cases add joins and constraints. The schema becomes harder to reason about, harder to migrate, and harder to split later—but each evolution is a small step that doesn't require organizational change.
  • The domain model distributes complexity across bounded contexts. Each context remains simple—optimized for its specific use case—and the integration between them is through events. You need to recognize when a new use case doesn't fit the existing bounded context, and spawn a new collection with its own stream, rather than adding multi-document transactions or cross-collection lookups that would compromise the aggregate's integrity.

In practice, this means a document-oriented architecture scales with organizational complexity. Each team owns its bounded context, its schema, and its data. A new analytical use case doesn't require the order service to change—it subscribes to order events and materializes its own read-optimized view. The order aggregate stays clean, stays fast, and stays owned by a single team. Meanwhile, the relational model scales with functional complexity within a single team—letting a small group serve many use cases from one schema, provided they can manage its growing weight.
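A toy sketch of such a consumer: an in-memory list stands in for the event stream, and the read model is a per-SKU counter (the event shapes are illustrative, not a real change-stream or CDC payload):

```python
from collections import Counter

# Events emitted by the order service; the analytics context only subscribes.
order_events = [
    {"type": "OrderConfirmed", "orderId": "order_123",
     "items": [{"sku": "WIDGET-A", "qty": 2}, {"sku": "GADGET-B", "qty": 1}]},
    {"type": "OrderConfirmed", "orderId": "order_124",
     "items": [{"sku": "WIDGET-A", "qty": 3}]},
]

units_per_sku: Counter = Counter()

def on_event(event: dict) -> None:
    # Fold each event into the read-optimized view; in production this
    # would be a change-stream or CDC consumer, eventually consistent.
    if event["type"] == "OrderConfirmed":
        for item in event["items"]:
            units_per_sku[item["sku"]] += item["qty"]

for e in order_events:
    on_event(e)

print(dict(units_per_sku))
```

The order aggregate never changes to serve this use case; the new context materializes its own view from the stream.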

| Dimension | Relational model | Document model |
| --- | --- | --- |
| Getting started | More upfront schema work | Fast to prototype |
| Adding use cases | Extend the schema, add joins | Add fields or spawn new bounded contexts |
| Complexity growth | Accumulates within one schema | Distributes across contexts |
| Organizational impact | Larger team synchronization | Requires team boundaries |
| Risk at scale | Monolithic schema becomes rigid | Must commit to event-driven design |

The frequency of releases reflects the difference. In relational databases, schema changes are typically batched into migration windows, and even a monthly window is often seen as a burden. With schema-flexible documents, schema evolution happens at the pace of code deployment: weekly or even daily iterations, without migration scripts or downtime windows.

The takeaway

Relationships exist in your domain model regardless of your database choice. A document database preserves them—a document maps to an aggregate where entities and value objects are embedded, with transactions and locks scoped to that boundary. A relational database decomposes them through normalization and reconstructs them through joins, foreign keys, and transaction isolation.

The relational model earns its place as a general-purpose middle ground that scales with functional complexity—letting a single team serve many use cases from one schema without organizational change. The document model scales with organizational complexity—each team owns its bounded context, its aggregate stays clean, and new use cases spawn new contexts connected by events rather than widening a monolithic schema.

The question is not "Does my data have relationships?" (it always does). The question is: "Do I want my database to preserve my aggregate boundaries or decompose them? And when my application evolves, do I want complexity to accumulate in my schema, or to distribute across my architecture?"
