Franck Pachot

Posted on Feb 27 • Edited on Mar 25

From Relational Algebra to Document Semantics

#mongodb #document #relational #database

The relational model was designed not to mirror application structures, but to optimize reasoning about data dependencies independently of access patterns. By restricting data to relations in First Normal Form (1NF) and enforcing closure, relational algebra makes query behavior mathematically derivable, so correctness, equivalence, and optimization follow from the model itself rather than implementation-specific rules.

Document databases take the opposite approach: they optimize representation for applications, network, and storage. Domain concepts such as ownership, lifecycle, and cardinality are embedded directly in the data’s shape. Without assuming value atomicity, relational algebra no longer applies, so semantics must be defined explicitly. This is what MongoDB did when defining query behavior in its document database.

Relations are not relationships

Many people incorrectly assume that any data with relationships must be relational. In relational database theory, though, “relation” and “relationship” are distinct, and confusing them leads to misunderstanding.

A relation is a mathematical object: a set of tuples over a fixed set of attributes, where each attribute draws values from an atomic domain. Formally, it is a subset of the Cartesian product of its attribute domains: tuples are unordered, duplicates are disallowed, and attribute values are atomic. A relation represents facts, not structure or behavior.
A relationship is not a basic construct in the relational model. In Codd’s work, it denotes a semantic association between entities informally. Such associations, like all facts, are represented by relations. In SQL, this is typically enforced via foreign key constraints, which are predicates. In application models, relationships may be implemented by embedding, referencing, or inheritance, depending on their behavior.

Applications rarely start from relations

Most applications do not model data as relations. They model aggregates. It's obvious in Domain-Driven Design and Object-Oriented analysis, but it was true long before object‑oriented languages. Data structures in applications have always been hierarchical, with sub-structures and repeating groups:

Record-based models include nested tables, like in the COBOL data division, where relationships are expressed through containment and repetition:

01 EMPLOYEE.
   05 NAME        PIC X(20).
   05 SKILLS.
      10 SKILL    PIC X(10) OCCURS 5 TIMES.

Object‑oriented models are more flexible but follow the same pattern with embedded objects, arrays, or lists:

@Entity
class Employee {
    @Id Long id;
    String name;
    @ElementCollection
    List<String> skills;
}

These models express semantics that do not exist in relational algebra:

ownership: the skills belong to the employee
lifecycle: the root aggregate is deleted with all its elements
cardinality: this model supposes a short and bounded list of skills per employee, but a large and growing number of employees
locality: employee's skills are queried with the employee, and stored together on disk

Those models are optimized for application behavior rather than for algebraic reasoning. This is not new. It is how applications have always been written. Why do relational systems require data to be represented as flat relations at the logical level?

Why the relational model exists

When E. F. Codd published “Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks” in 1969, his objective was not to help developers write applications, but to establish a formal mathematical foundation for data management.

He based the relational model and relational algebra on mathematics:

Set theory for relations, with operations such as ∪ (union), ∩ (intersection), − (difference), and × (Cartesian product)
First‑order predicate logic for constraints and querying: selection (σ) corresponds to logical predicates, and joins correspond to conjunction with implicit existential quantification (∃) over the join attributes
A closed algebraic query language at the logical level, where every operation produces another relation

Within this framework, a relation is defined as:

a set of tuples, unordered, with no duplicates (SQL is more permissive with tables)
where all tuples share the same attributes (SQL is more permissive with NULLable columns)
and every attribute value is drawn from a simple, atomic, domain (SQL is more permissive with composite datatypes)

These properties are not modeling advice. They are the definition. They constitute the formal definition of a relation in the relational model. In practice, however, DBMS implementations were inspired by this model without strictly adhering to every aspect of it.

First normal form

First Normal Form (1NF) is often presented as a design guideline. In Codd’s original work, it is not. It is a mandatory constraint to apply the first‑order predicate logic.

Without atomic attribute values, relations cease to be relations as defined by Codd, and the standard relational algebra no longer applies. Comparisons become ambiguous, predicates are no longer Boolean, and algebraic equivalence rules break down.

Relational algebra is a closed system where inputs and outputs are relations. Its operators—selection (σ), projection (π), join (⨝)—all assume that:

attributes can be compared using equality
predicates evaluate to true or false
each comparison involves exactly one value

This is what enables equivalence rules, join reordering, and provable optimizations, and the maths is defined for atomic values only.

Let's see some examples in SQL, inspired by the relational model (but not itself a pure relational language). Note that while the relational model defines a closed algebra—where both inputs and outputs are relations—SQL does not. SQL expressions often take scalar values (bind variables or parameters) as inputs. Query results are multisets (bags) of rows, returned as a single two‑dimensional table per query and typically consumed via a cursor or result set, where duplicates are allowed, and ordering may be observable and significant.

1NF to apply mathematical operations to relations

Here is an example with a table of employees' skills:

CREATE TABLE employee_skill (
    employee_name TEXT,
    skill TEXT
);

INSERT INTO employee_skill VALUES
   ('Ann', 'SQL'),
   ('Ann', 'Java'),
   ('Ann', 'Python'),
   ('Bob', 'Java')
;

A simple selection involves a predicate comparing a column to a value:

SELECT DISTINCT *
   FROM employee_skill
   WHERE skill = 'Java'
;

It returns another relation with only the facts that verify the predicate:


 employee_name | skill
---------------+-------
 Ann           | Java
 Bob           | Java

This works because skill is atomic, equality has one meaning, and the result is still a relation.

This collapses without 1NF

Here is another model, simple, but it violates 1NF:

CREATE TABLE employee_array (
    name TEXT,
    skills TEXT[]
);

INSERT INTO employee_array VALUES
   ('Ann', ARRAY['SQL', 'Java', 'Python']),
   ('Bob', ARRAY['Java'])
;

The same predicate no longer applies:

SELECT DISTINCT *
   FROM employee_array
   WHERE skills = 'Java'
;


ERROR:  malformed array literal: "Java"
LINE 3: WHERE skills = 'Java';
                       ^
DETAIL:  Array value must start with "{" or dimension information.

PostgreSQL requires new operators:

SELECT DISTINCT *
   FROM employee_array
   WHERE 'Java' = ANY(skills)
;

 name |      skills
------+-------------------
 Ann  | {SQL,Java,Python}
 Bob  | {Java}

(2 rows)

SELECT DISTINCT *
   FROM employee_array
   WHERE skills @> ARRAY['Java']
;

 name |      skills
------+-------------------
 Ann  | {SQL,Java,Python}
 Bob  | {Java}

(2 rows)

These operators encode membership and containment. They are not part of relational algebra, and their exact semantics and syntax are vendor‑specific in SQL systems.

SQL/JSON does not restore relational semantics

JSON and JSONB datatypes do not change this:

CREATE TABLE employee_json (
    doc JSONB
);

INSERT INTO employee_json VALUES
  ('{"name": "Ann", "skills": ["SQL", "Java", "Python"]}'),
  ('{"name": "Bob", "skills": ["Java"]}')
;

Selections rely on path navigation and containment, with special operators specific to the datatype:

SELECT DISTINCT *
   FROM employee_json
   WHERE doc->'skills' ? 'Java'
;

                         doc
------------------------------------------------------
 {"name": "Ann", "skills": ["SQL", "Java", "Python"]}
 {"name": "Bob", "skills": ["Java"]}

(2 rows)

SELECT DISTINCT *
   FROM employee_json
   WHERE doc->'skills' @> '["Java"]'
;

                         doc
------------------------------------------------------
 {"name": "Ann", "skills": ["SQL", "Java", "Python"]}
 {"name": "Bob", "skills": ["Java"]}

(2 rows)

This involves different data types, operators, semantics, and also indexing. A different type of index is required to serve those predicates, a GIN index in PostgreSQL, with different syntax and different capabilities.

JSON has been added to the SQL standard as SQL/JSON, but this doesn't unify the semantics. For example, an SQL array starts at 1, and a JSON array starts at 0:

SELECT 
 skills[0] "0",
 skills[1] "1",
 skills[2] "2",
 skills[3] "3"
FROM employee_array
;

 0 |  1   |  2   |   3
---+------+------+--------
   | SQL  | Java | Python
   | Java |      |

(2 rows)

SELECT 
 doc->'skills'->0 "0",
 doc->'skills'->1 "1",
 doc->'skills'->2 "2",
 doc->'skills'->3 "3"
FROM employee_json
;

   0    |   1    |    2     | 3
--------+--------+----------+---
 "SQL"  | "Java" | "Python" |
 "Java" |        |          |

(2 rows)

JSON support in RDBMS extends SQL beyond relational algebra and introduces datatype‑specific semantics that are not algebraically closed. This is expected and was foreseen when enforcing the first normal form. Codd’s insight was that once attributes stop being atomic, mathematics no longer dictates behavior. Meaning must be defined explicitly.

MongoDB’s added semantics

MongoDB embraces the document model directly to match the data representation in the domain model and application structures:


db.employees.insertMany([
  { name: "Bob", skills: "Java" },
  { name: "Ann", skills: ["SQL", "Java", "Python"] }
]);

This is intentionally not 1NF because multiple entities and values may belong to the same aggregate. Relational operations cannot simply be defined in terms of the mathematical definition.

Selection resembles the relational operation, but when applied to a non‑1NF collection, MongoDB defines an explicit, extended semantics:


db.employees.find({ skills: "Java" })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

The same predicate applies to scalars and arrays. The document matches if the value or any array element satisfies the filtering condition. The same in SQL would require a union of a query using SQL selection and another using JSON containment, and casting the final result to the same datatype.

With MongoDB, indexes, comparisons, and sorting follow the same rule, as confirmed by execution plans. I create an index on skills that can index scalar values as well as array items, with one index type, and a generic syntax:


db.employees.createIndex({ skills: 1 })
;

It is easy to verify that the index is used to find the documents on a multi-key path (which means including an array in the path):

db.employees.find(
 { skills: "Java" }
).explain().queryPlanner.winningPlan

{
  isCached: false,
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { skills: 1 },
    indexName: 'skills_1',
    isMultiKey: true,
    multiKeyPaths: { skills: [ 'skills' ] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { skills: [ '["Java", "Java"]' ] }
  }
}

The multi-key index has one index entry per item when the key is an array. When a comparison operator is applied to an array field, MongoDB compares the query value to each array element individually.

I used an equality predicate, but the indexBounds in the execution plan show that the same can apply to a range. The same index is used for non-equality predicates:

db.employees.find({ skills: { $gt: "M"} })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

db.employees.find({ skills: { $lt: "M"} })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

Only Ann has a skill with a name after 'M' in the alphabet. Both Ann and Bob have a skill with a name before 'M' in the alphabet.

When sorting on an array field, MongoDB uses the minimum array element for ascending sort and the maximum array element for descending sort, according to BSON comparison order:

db.employees.find().sort({ skills: 1 })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  }
]

db.employees.find().sort({ skills: -1 })
;

[
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c3'),
    name: 'Ann',
    skills: [ 'SQL', 'Java', 'Python' ]
  },
  {
    _id: ObjectId('69a0ccaece22bf6640d4b0c2'),
    name: 'Bob',
    skills: 'Java'
  }
]

Here, 'Java' is first in alphabetical order, so both employees are at the same rank in an ascending sort, but 'SQL' is last in alphabetical order, so 'Ann' appears first in a descending sort.

Again, the index is used:

db.employees.find().sort({ skills: 1 }).explain().queryPlanner.winningPlan
;

{
  isCached: false,
  stage: 'FETCH',
  inputStage: {
    stage: 'IXSCAN',
    keyPattern: { skills: 1 },
    indexName: 'skills_1',
    isMultiKey: true,
    multiKeyPaths: { skills: [ 'skills' ] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: { skills: [ '[MinKey, MaxKey]' ] }
  }
}

MongoDB is optimized for scalable OLTP. Index access is a must for equality and range predicates as well as sorting for pagination. In SQL databases, inverted indexes such as GIN are typically specialized for containment and equality predicates and offer more limited support for range ordering and pagination than B‑tree indexes.

Not forcing First Normal Form allows storage and indexing to remain efficient:

compound index may include fields from multiple entities within one aggregate
storage involves a single disk I/O per aggregate

By deviating from 1NF, closure is not guaranteed—by design. An explicit $unwind operation in an aggregation pipeline can normalize the result to a relation, in its mathematical sense, if needed.

With MongoDB, we can list the skills of employees who have a 'Java' skill, with all their skill, as a relational result:

db.employees.aggregate([  
  { $match: { skills: "Java" } },  
  { $unwind: "$skills" },
  { $project: { _id: 0, name: "$name", skill: "$skills" } }
])
;

[
  { name: 'Bob', skill: 'Java' },
  { name: 'Ann', skill: 'SQL' },
  { name: 'Ann', skill: 'Java' },
  { name: 'Ann', skill: 'Python' }
]

Interestingly, returning the result as a relation, is easier in MongoDB with $unwind than in SQL. The simple MongoDB query would be much more complex in the PostgreSQL examples above:

-- Correlated semi-join (EXISTS) over a normalized relation
SELECT DISTINCT es.employee_name AS name, es.skill  
  FROM employee_skill es  
  WHERE EXISTS (  
    SELECT 1  
    FROM employee_skill j  
    WHERE j.employee_name = es.employee_name  
      AND j.skill = 'Java'
;  

-- Existential quantification (ANY) over a non-1NF attribute (ARRAY) with explicit normalization (UNNEST)
SELECT DISTINCT ea.name, s.skill  
  FROM employee_array ea  
  CROSS JOIN LATERAL unnest(ea.skills) AS s(skill)  
  WHERE 'Java' = ANY (ea.skills)
;  

-- JSON containment predicate (@>) with explicit normalization (jsonb_array_elements)
SELECT DISTINCT doc->>'name' AS name, skill  
  FROM employee_json  
  CROSS JOIN LATERAL jsonb_array_elements_text(doc->'skills') AS s(skill)  
  WHERE doc->'skills' @> '["Java"]'
;

Those queries must return all employees's skill, not only the one used for the filter, because they are part of the same aggregate. With those SQL queries, the object-relational mapping (ORM) must regroup those rows to build the aggregate.

In practice, the MongoDB query will not even $unwind to mimic a relation as it gets directly the aggregate:

db.employees.aggregate([  
  { $match: { skills: "Java" } },  
  { $project: { _id: 0, name: 1, skills: 1 } }
])
;

[
  { name: 'Bob', skills: 'Java' },
  { name: 'Ann', skills: [ 'SQL', 'Java', 'Python' ] }
]

With this query, MongoDB returns the binary BSON object directly to the application driver, instead of converting it into records or JSON text like most SQL databases.

Conclusion

We exposed the enhanced semantics for selection over a non-1NF collection, as an example. MongoDB does more than enhance selection. All relational operations are extended with a document semantics, and the API is simplified for flexible data types:

Selection works over scalars and arrays
Projection reshapes documents
Sort semantics are defined over arrays
Indexes apply uniformly to scalars and array elements
Joins exist ($lookup), and the semantics is defined even if the key is an array.

Here is a short table comparing how relational operations are available in SQL and MongoDB:

Operation (Codd)	SQL operator / construct	MongoDB equivalent
Selection (`σ`)	`WHERE`	`find()`, `$match`
Projection (`π`)	`SELECT`	`find()`, `$project`
Cartesian product (`×`)	`CROSS JOIN`	`$lookup` + empty pipeline
Union (`∪`)	`UNION` / `UNION ALL`	`$unionWith`
Set difference (`−`)	`EXCEPT` / `MINUS`	`$setDifference`
Intersection (`∩`)	`INTERSECT`	`$setIntersection`
Join (`⋈`)	`JOIN ... ON`	`$lookup` + `$unwind`
Division (`÷`)	double `NOT EXISTS`	`$setIsSubset`
Rename (`ρ`)	`AS`	`$project`, `$addFields`

Relational theory is independent of physical implementation, but most RDBMS map each relation to a single table, and an index can cover the columns from a single table. Relational databases stem from mathematics and use normalization to structure storage. In contrast, applications center on aggregates, and MongoDB preserves these aggregates down to the storage layer.

First Normal Form (1NF) is required by the relational model—and therefore by relational SQL databases—because relational algebra assumes atomic attributes. In contrast, document databases relax these algebraic constraints in favor of explicit, document-oriented semantics, enabling similar operations to be performed directly on documents. So, when you hear that “data is relational” or feel you must apply normal forms to your domain model, recognize that this perspective is tied to a specific relational database implementation, not the nature of your data. The same data can support the same application with a document model.

Both models are valid but optimize for different abstraction layers: relational algebra offers a logical model for reasoning about data independently of the domain, while document databases model data as applications already do.

DEV Community