In SQL databases, foreign keys act as immediate constraints that verify the correctness of relationships between tables before accepting a write. They were designed for scenarios in which end users can submit arbitrary queries directly to the database. As a result, the database is responsible for protecting the data model with normalization, integrity constraints, stored procedures, and triggers, rather than relying on validation performed before the application interacts with the database. When relational integrity is violated, the database raises an error, the change is rejected, and the application rolls back the transaction and surfaces an exception.
MongoDB's NoSQL approach differs from relational databases because it was designed for application developers. It relies on application code to enforce these rules: use cases are clearly defined, validation occurs at the application level, and business logic takes precedence over foreign key verification. Eliminating the additional serializable reads associated with foreign keys can significantly boost write performance and scalability.
Referential integrity can be verified asynchronously. Instead of raising an exception, an unexpected event the application might not be ready for, MongoDB allows the write to proceed and offers tools like the aggregation framework and change streams to detect and log violations. This approach enables error analysis, data correction, and application fixes without affecting the application's availability, and it incorporates the business logic.
Let's go through a traditional example of departments and employees, where all employees must belong to a department.
Two collections with references
Strong relationships, including one-to-many, don't necessarily require multiple collections with references, especially if the entities share the same lifecycle. Depending on the domain's context, we can embed a list of employees within each department document to ensure referential integrity and prevent orphans. Alternatively, we might embed department information into each employee's document, particularly when department updates are infrequent, such as a simple multi-document change to a department description, or when department changes are usually part of larger enterprise reorganizations. Both options are sketched below.
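For illustration, here is a minimal sketch of the two embedding options, using the field names of the example developed later in this post; the exact shape depends on your domain:

// Option A: embed employees inside each department,
// so the one-to-many relationship lives in a single document
db.departments.insertOne({
  deptno: 10, dname: "ACCOUNTING", loc: "NEW YORK",
  employees: [
    { empno: 7839, ename: "KING", job: "PRESIDENT" },
    { empno: 7782, ename: "CLARK", job: "MANAGER" }
  ]
});
// Option B: embed (duplicate) the department inside each employee,
// acceptable when department updates are rare
db.employees.insertOne({
  empno: 7839, ename: "KING", job: "PRESIDENT",
  department: { deptno: 10, dname: "ACCOUNTING", loc: "NEW YORK" }
});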
When the two entities are not always accessed together, have unbounded cardinality, or are updated independently, you can reference another document instead of embedding all details. For example, store a "deptno" for each employee and maintain a separate collection of departments, each with a unique "deptno". I insert this sample data:
// Reset
db.departments.drop();
db.employees.drop();
// Departments
db.departments.createIndex(
  { deptno: 1 },   // deptno will be used as the referenced key
  { unique: true } // it must be unique for many-to-one relationships
);
db.departments.insertMany([
  { deptno: 10, dname: "ACCOUNTING", loc: "NEW YORK" },
  { deptno: 20, dname: "RESEARCH", loc: "DALLAS" },
  { deptno: 30, dname: "SALES", loc: "CHICAGO" },
  { deptno: 40, dname: "OPERATIONS", loc: "BOSTON" }
]);
// Employees in departments 10, 20, and 30
db.employees.createIndex(
  { deptno: 1 } // reference to departments
);
db.employees.insertMany([
  { empno: 7839, ename: "KING", job: "PRESIDENT", deptno: 10 },
  { empno: 7698, ename: "BLAKE", job: "MANAGER", deptno: 30 },
  { empno: 7782, ename: "CLARK", job: "MANAGER", deptno: 10 },
  { empno: 7566, ename: "JONES", job: "MANAGER", deptno: 20 },
  { empno: 7788, ename: "SCOTT", job: "ANALYST", deptno: 20 },
  { empno: 7902, ename: "FORD", job: "ANALYST", deptno: 20 },
  { empno: 7844, ename: "TURNER", job: "SALESMAN", deptno: 30 },
  { empno: 7900, ename: "JAMES", job: "CLERK", deptno: 30 },
  { empno: 7654, ename: "MARTIN", job: "SALESMAN", deptno: 30 },
  { empno: 7499, ename: "ALLEN", job: "SALESMAN", deptno: 30 },
  { empno: 7521, ename: "WARD", job: "SALESMAN", deptno: 30 },
  { empno: 7934, ename: "MILLER", job: "CLERK", deptno: 10 },
  { empno: 7369, ename: "SMITH", job: "CLERK", deptno: 20 },
  { empno: 7876, ename: "ADAMS", job: "CLERK", deptno: 20 }
]);
I didn't declare the schema upfront, as it will come as-is from the application. For performance reasons, I declare indexes on both sides to enable fast navigation between employees and departments, and between departments and employees.
Query examples
This schema supports all cardinalities, including millions of employees per department—something you wouldn't embed—and is normalized to ensure that updates affect only a single document. It also allows for bidirectional querying.
Here's an example of a query that joins all department information to employees as if it were embedded, but evaluated at read time:
db.employees.aggregate([
  {
    $lookup: { // add all department info in an array
      from: "departments",
      localField: "deptno",
      foreignField: "deptno", // fast access by index on departments
      as: "department"
    }
  },
  {
    $set: { // get first (and only) match (guaranteed by unique index)
      department: { $arrayElemAt: ["$department", 0] }
    }
  }
]);
Here's an example of a query that joins all employee information to departments as if it were duplicated and embedded, but evaluated at read time:
db.departments.aggregate([
  {
    $lookup: { // add all employee info in an array
      from: "employees",
      localField: "deptno",
      foreignField: "deptno", // fast access by index on employees
      as: "employees"
    }
  }
]);
From a performance standpoint, performing a lookup is more costly than reading an embedded array from a single document. However, this overhead isn't significant when browsing tens or hundreds of documents. When you choose this model because a department might have a million employees, you don't retrieve all the data at once. Instead, a $match filters documents before the $lookup in the first query, or a filter is applied within the $lookup pipeline in the second query, as sketched below.
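For illustration, here's a sketch of both filtering patterns; the filter values are hypothetical:

// Filter employees with $match before the $lookup (first query direction)
db.employees.aggregate([
  { $match: { deptno: 10 } }, // index on employees.deptno applies here
  {
    $lookup: {
      from: "departments",
      localField: "deptno",
      foreignField: "deptno",
      as: "department"
    }
  }
]);
// Filter employees inside the $lookup pipeline (second query direction)
db.departments.aggregate([
  {
    $lookup: {
      from: "employees",
      let: { dept: "$deptno" },
      pipeline: [
        { $match: { $expr: { $eq: ["$deptno", "$$dept"] } } },
        { $limit: 100 } // cap how many employees are embedded per department
      ],
      as: "employees"
    }
  }
]);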
I have covered those variations in a previous post:
Many-to-One: Stronger Relationship Design with MongoDB
What about referential integrity for these queries? If an employee is inserted with a deptno that does not exist in departments, the lookup finds no match. The first query omits the department information, and the second query doesn't show the new employee because it lists only known departments. This is expected behaviour for an application that hasn't inserted the referenced department.
Relational database administrators often overstate how serious this is and even call it data corruption. Because SQL defaults to inner joins, such an employee would be missing from the result of the first query; with outer joins like MongoDB's $lookup, this does not happen. It's more like a NULL in SQL: the information is not yet known, so it isn't shown. You can add the department later, and the queries will reflect the information as it becomes available.
You may still want to detect when referenced items are not inserted after some time, for example, due to an application bug.
Foreign key definition as a $lookup stage
I define referential integrity using two stages: a lookup stage and a match stage that verify whether the referenced document exists:
const lookupStage = {
  $lookup: {
    from: "departments",
    localField: "deptno",
    foreignField: "deptno",
    as: "dept"
  }
};
const matchStage = { $match: { dept: { $size: 0 } } };
The definition is simple and similar to an SQL foreign key. In practice, it can be more complex and precise. Document databases are well suited to well-understood applications where business logic extends beyond what a foreign key can express. For example, some employees may temporarily have no department, such as new hires, or may belong to two departments during a transition. MongoDB's flexible schema supports these cases, and you define referential integrity rules accordingly; you aren't constrained to an application-unaware model as with SQL schemas. I'll keep it simple for this example.
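As an illustration only, a business-aware variant of the match stage could exempt employees that legitimately carry no department yet and flag only dangling references; this is a sketch of a hypothetical rule, not part of the example below:

// Hypothetical rule: new hires without a deptno are valid;
// only a deptno that points to a non-existent department is a violation
const matchStageWithExceptions = {
  $match: {
    deptno: { $exists: true, $ne: null }, // a reference is present...
    dept: { $size: 0 }                    // ...but no department matched it
  }
};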
One-time validation with an aggregation pipeline
I insert a new employee, Eliot, into dept 42, which doesn’t exist yet:
db.employees.insertOne({
  empno: 9002,
  ename: "Eliot",
  job: "CTO",
  deptno: 42 // Missing department
});
This doesn't raise any errors. In all queries, the employee is visible with only a department number and no other information about that department.
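For instance, running the first join query against Eliot's document returns the bare deptno with no department details, since $arrayElemAt on the empty lookup result leaves the field unset:

db.employees.aggregate([
  { $match: { empno: 9002 } }, // Eliot, referencing the missing dept 42
  {
    $lookup: {
      from: "departments",
      localField: "deptno",
      foreignField: "deptno",
      as: "department"
    }
  },
  { $set: { department: { $arrayElemAt: ["$department", 0] } } }
]);
// Returns Eliot's document with deptno: 42 and no "department" field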
If you decide that such a situation must not persist undetected, you can use an aggregation pipeline to list the violations, with the lookup and match stages defined earlier:
db.employees.aggregate([ lookupStage, matchStage ]);
This shows the employees referencing a department that doesn't exist:
[
  {
    _id: ObjectId('694d8b6cd0e5c67212d4b14f'),
    empno: 9002,
    ename: 'Eliot',
    job: 'CTO',
    deptno: 42,
    dept: []
  }
]
We’ve caught the violation asynchronously and can decide what to do. Maybe the "deptno" was wrong, maybe we failed to insert the department, or someone deleted it, or we missed a business scenario where employees can be assigned to a department number without more information.
In SQL databases, the rules are basic and not driven by use cases. They check only for anomalies arising from normalization, and any violation is treated as an error without further consideration. However, in MongoDB, where you build a database for a known application, the integrity is part of the business logic.
Whether you should run this validation depends on your database's size and the risk of integrity issues. After major data refactoring, run it as an extra check. To avoid production impact, run it on a read replica—an advantage of asynchronous verification. You don't need a high isolation level, as, at worst, concurrent transactions may trigger a false warning, which can be checked later. If you restore backups for disaster recovery testing, it’s wise to run the validation on that copy to verify both the restore process and data integrity of the primary database.
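For example, in mongosh you can route the check to a secondary with a read preference; a minimal sketch, assuming a replica set:

// Direct reads to a secondary so the check doesn't impact the primary
db.getMongo().setReadPref("secondaryPreferred");
db.employees.aggregate([ lookupStage, matchStage ]);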
Real-time watcher with change streams
You may also decide to perform validations in near real time, checking the changes shortly after they occur.
I start a change stream on employees, watching for inserts and updates, and apply the same $lookup + $match to just the changed document:
const cs = db.employees.watch([
  { $match: { operationType: { $in: ["insert", "update", "replace"] } } }
]);
print("👀 Watching employees for referential integrity violations...");
while (cs.hasNext()) {
  const change = cs.next(); // Get the next change event
  if (["insert", "update", "replace"].includes(change.operationType)) {
    const result = db.employees.aggregate([
      { $match: { _id: change.documentKey._id } }, // check the changed document
      lookupStage, // look up dept info by deptno
      matchStage   // keep only docs with NO matching dept
    ]).toArray();
    if (result.length > 0) {
      print("\n⚠ Real-time Referential Integrity Violation Detected:");
      printjson(result[0]);
    }
  }
}
In another session, I insert another employee, Dwight, in department 42, which is still missing:
db.employees.insertOne({
  empno: 9001,
  ename: "Dwight",
  job: "CEO",
  deptno: 42 // 🚨 Still missing
});
The loop on the change stream gets notified of the insert, applies the aggregation pipeline verification, and returns the following output:
⚠ Real-time Referential Integrity Violation Detected:
{
  _id: ObjectId('694da3aa8cd2fa3fe4d4b0c2'),
  empno: 9001,
  ename: 'Dwight',
  job: 'CEO',
  deptno: 42,
  dept: []
}
Rather than an error that blocks the application, this is handled like any other application alert. The application can either create a department automatically, as sketched below, or have a user analyze the situation.
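For instance, the watcher could auto-create a placeholder department with an upsert; a sketch, where the placeholder values are hypothetical:

// Inside the watcher, after a violation is detected:
const violation = result[0];
db.departments.updateOne(
  { deptno: violation.deptno },                           // the missing department
  { $setOnInsert: { dname: "UNKNOWN", loc: "UNKNOWN" } }, // hypothetical placeholder
  { upsert: true }                                        // create only if still absent
);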
Fixing the violations
I add the missing department:
db.departments.insertOne({
  deptno: 42,
  dname: "DEVELOPER EXPERIENCE",
  loc: "REMOTE"
});
I re-run the batch check, which now returns no violations:
db.employees.aggregate([ lookupStage, matchStage ]);
Conclusion
In SQL databases, foreign key constraints require reading the parent record before inserting or updating a child, applying locks if necessary. When deleting or updating referenced keys, the database checks for existing children and waits if a child is concurrently being inserted. If users manually run such operations on production, or if the database administrator doubts the development team, a SQL database with declared foreign keys is advisable.
However, these operations generally originate from an application that has already conducted the necessary checks: it reads the referenced table to get the key, verifies no children exist before deleting a parent, and often prefers logical over physical deletions. Additional validations can identify bugs, but they can run asynchronously and integrate with business logic and alert systems.
In MongoDB, data integrity is implemented by developers across the development lifecycle in a DevOps manner, rather than solely during write operations. This strategy relies on not altering the production database outside structured development practices like peer reviews and testing. However, if the database administrator (DBA) doesn't control who accesses the database or doesn't trust the developers, they may insist that all verifications be carried out within the database before each commit.