Franck Pachot

Updating "denormalized" aggregates with "duplicates": MongoDB vs. PostgreSQL

TL;DR: In database-centric development, duplication is dangerous because, in shared, normalized relational databases, it leads to update anomalies. In contrast, in DDD, aggregates own their data, so duplication is intentional and updates are controlled by the application. MongoDB supports this approach with fine-grained updates on document arrays and fields, with updateMany, $set, and arrayFilters, while PostgreSQL JSONB often requires rewriting entire documents, which can make normalization more appealing. As a result, duplication itself is not the core problem—what matters is who owns the data and how updates are optimized.


In Domain-Driven Design (DDD), aggregates define consistency boundaries, leading to an intentionally denormalized model. In a relational database, this can feel counterintuitive due to concerns over data duplication and update anomalies. In a shared SQL database with direct/unrestricted access (accepting ad-hoc SQL queries across the entire schema), normalization is often preferred to guarantee a single source of truth for each piece of data and to rely on joins/dynamic lookups at query time.

DDD addresses these concerns through the concept of ownership: each aggregate owns its data, and any duplication is deliberate and controlled. Updates occur only through the aggregate root (or the coordinating application service), whose behavior is tested and reviewed before deployment to production. The database is never modified directly from outside this boundary. As a result, changes are propagated safely — often via domain events and eventual consistency — ensuring the value is updated in all necessary places without violating aggregate boundaries. In a bounded context, a single update command can modify many occurrences in a collection of aggregates, but the performance depends on the capability of the database to update within the aggregate.

Example: player's scores

Consider 100,000 game records, each containing 1,000 players and their scores. Each score stores the player’s name directly instead of referencing a separate player document. When a player is renamed—a rare event—the name must be updated in all affected documents with a single updateMany() operation. This is more work than updating a single record, but it removes an extra lookup for every read, which happens far more often, and better matches the aggregate’s read and consistency requirements.
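The aggregate shape can be sketched as a plain object (illustrative values only; the real dataset is generated below):

```javascript
// Sketch of one game aggregate: each score embeds the player's name directly
const game = {
  _id: 1,
  players: [
    { name: "player42", score: 0.73 },
    { name: "player7",  score: 0.12 }
  ]
};

// Reads need no join or extra lookup: the names are right in the document
const names = game.players.map(p => p.name);
console.log(names); // → [ 'player42', 'player7' ]
```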

Example in MongoDB

I built the following dataset:


db.scores.drop()

db.scores.createIndex({ "players.name": 1 })
db.scores.createIndex({ "players.score": 1 })

async function init(docs, players, totalplayers = 10000, batchsize = 5000) {
  for (let c = 0; c < docs; c += batchsize) {
    await db.scores.insertMany(
      Array.from({ length: batchsize }, () => ({
        players: Array.from({ length: players }, () => ({
          name: `player${Math.round(totalplayers * Math.random())}`,
          score: Math.random()
        }))
      }))
    );
    print(`${db.scores.countDocuments()} documents (with ${players} in array)`);
  }
}

init(1e5,1e3)
;

In a normalized relational model, this scenario would use a many-to-many schema: a games table with 100,000 rows, a players table with 10,000 rows, and an association table with 100,000,000 rows. That’s the trade-off for storing each player’s name only once.

Instead, with a single collection, I stored 100,000 documents, each embedding 1,000 scores together with the player’s names:


db.scores.countDocuments();

100000


I created two indexes on fields inside the array to show the impact of updates. Each of the 100,000 documents contributes 1,000 array entries, giving roughly 100,000,000 keys per index:


db.scores.validate({full: true}).keysPerIndex;

{
  _id_: 100000,
  'players.name_1': 95165332,
  'players.score_1': 100000000
}


There are fewer than 100,000,000 keys for "name" because the names were generated randomly and some repeat within a document (for example, one player can have multiple scores in the same game). A multikey index stores only one entry per distinct value per document, so those duplicates add no keys.
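As a back-of-the-envelope check (my own arithmetic, not from the article's output), the observed key count matches the expected number of distinct names per document, since `Math.round(10000 * Math.random())` draws from 10,001 possible values:

```javascript
// Expected number of distinct names among 1,000 uniform draws
// from 10,001 possible values (player0 .. player10000)
const values = 10001;
const draws = 1000;
const expectedDistinct = values * (1 - Math.pow(1 - 1 / values, draws));
console.log(expectedDistinct.toFixed(1)); // ≈ 951.7 distinct names per document

// Across 100,000 documents this gives ≈ 95.17M index keys,
// close to the 95,165,332 reported by validate()
console.log(Math.round(expectedDistinct * 100000));
```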

Below are the index sizes in MB; each index is a few GB:

db.scores.stats(1024 * 1024).indexSizes;

{
  _id_: 2.1640625,
  'players.name_1': 2216.81640625,
  'players.score_1': 3611.44140625
}

I created two indexes to demonstrate that, in MongoDB’s document storage, updating array elements by deleting and reinserting them with new values does not add overhead to other fields, even when those fields are indexed.

A single player, "player42," has a recorded score in 9.5% of the documents:

db.scores.countDocuments({"players.name":"player42"});

9548


I want to rename "player42" so the new name appears in all historical records, as if it had been normalized to a single value.

updateMany() in MongoDB

I can change the name "player42" to "Franck" using a single updateMany() operation on the collection:

db.scores.updateMany(  
  { "players.name": "player42" },  
  { $set: { "players.$[i].name": "Franck" } },  
  { arrayFilters: [{ "i.name": "player42" }] }  
);  

This means:

  1. Select all documents that contain at least one element in the players array with name equal to "player42".
  2. For each of these documents, consider every players element whose name is "player42" as i.
  3. For each such element i, update i.name to "Franck".
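In plain JavaScript terms, the three steps behave like this in-memory sketch (illustrative only; the server applies the update per matching document, not client-side):

```javascript
// In-memory sketch of what updateMany with arrayFilters does to one document
function renamePlayer(doc, oldName, newName) {
  // arrayFilters: [{ "i.name": oldName }] — touch only matching elements
  for (const p of doc.players) {
    if (p.name === oldName) p.name = newName; // $set on "players.$[i].name"
  }
  return doc;
}

const doc = { players: [
  { name: "player42", score: 0.9 },
  { name: "player7",  score: 0.5 },
  { name: "player42", score: 0.1 }  // every matching element is updated
]};
renamePlayer(doc, "player42", "Franck");
console.log(doc.players.map(p => p.name)); // → [ 'Franck', 'player7', 'Franck' ]
```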

In most cases, it is fine if this update is not atomic. If atomicity is required, execute it within a transaction. A transaction can also be faster, as long as the number of updates is bounded, because the changes are synced to disk only once, at commit.

I executed it in a transaction and gathered statistics both before and after the transaction:


function updatePlayer42() { 

  // gather some stats
  const bstat_name = db.scores.stats({ indexDetails: true }).indexDetails["players.name_1"].cursor;
  const bstat_score = db.scores.stats({ indexDetails: true }).indexDetails["players.score_1"].cursor;
  const btime = new Date()
  print(`Start: ${btime}\n Index size:${JSON.stringify(db.scores.stats(1024 * 1024).indexSizes)} MB`);

  // ---- transaction start ----  
  const session = db.getMongo().startSession();  
  const scores = session.getDatabase(db.getName()).scores;  

  try {  

    session.startTransaction();  

    const res = scores.updateMany(  
      { "players.name": "player42" },  
      { $set: { "players.$[m].name": "Franck" } },  
      { arrayFilters: [{ "m.name": "player42" }] }  
    );  

    printjson(res);  

    session.commitTransaction();  

  } catch (e) {  
    print("Transaction aborted:", e);  
    session.abortTransaction();  
  } finally {  
    session.endSession();  
  }  
  // ---- transaction end ----  

  // gather and print stats
  const etime = new Date()
  const estat_name = db.scores.stats({ indexDetails: true }).indexDetails["players.name_1"].cursor;
  const estat_score = db.scores.stats({ indexDetails: true }).indexDetails["players.score_1"].cursor;

  // calculate and print stats
  print(`End: ${etime} - duration: ${etime - btime} ms\n Index size:${JSON.stringify(db.scores.stats(1024 * 1024).indexSizes)} MB`);
  print(`Index on "players.name" cursor stats:`);
  Object.entries(estat_name).forEach(([k, v]) => {
    const delta = v - (bstat_name[k] ?? 0);
    if (delta !== 0) print(` ${k}: ${delta}`);
  });
  print(`Index on "players.score" cursor stats:`);
  Object.entries(estat_score).forEach(([k, v]) => {
    const delta = v - (bstat_score[k] ?? 0);
    if (delta !== 0) print(` ${k}: ${delta}`);
  });

}

updatePlayer42()


Here is the output:

Start: Mon Mar 02 2026 22:51:31 GMT+0000 (Coordinated Universal Time)
 Index size:{"_id_":2.7734375,"players.name_1":2216.81640625,"players.score_1":3611.44140625} MB
{
  acknowledged: true,
  insertedId: null,
  matchedCount: 9548,
  modifiedCount: 9548,
  upsertedCount: 0
}
End: Mon Mar 02 2026 22:51:43 GMT+0000 (Coordinated Universal Time) - duration: 12472 ms
 Index size:{"_id_":2.7734375, "players.name_1":2216.81640625, "players.score_1":3611.44140625} MB
Index on "players.name" cursor stats:
 cache cursors reuse count: 1
 close calls that result in cache: 2
 create calls: 1
 cursor bounds cleared from reset: 9549
 cursor bounds comparisons performed: 9549
 cursor bounds next called on an unpositioned cursor: 9549
 cursor bounds next early exit: 1
 insert calls: 9548
 insert key and value bytes: 114477
 next calls: 9549
 remove calls: 9548
 remove key bytes removed: 133573
 reset calls: 28648
Index on "players.score" cursor stats:



Only the counters that increased are shown, as the difference between the before and after snapshots.

In 12 seconds, this operation atomically updated the name of "player42" in 9,548 documents, out of a total of 100,000 large documents, and the indexes were maintained with strong consistency. The overall index size did not increase significantly. The index on "score" was unchanged. The index on "name" received 9,548 new entries (12 bytes on average) and removed 9,548 entries (14 bytes on average).
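The per-entry averages follow directly from the cursor statistics above:

```javascript
// Average key+value bytes per index entry, from the cursor stats above
const inserted = 114477 / 9548;  // bytes inserted per new "Franck" key
const removed  = 133573 / 9548;  // bytes removed per old "player42" key
console.log(inserted.toFixed(1), removed.toFixed(1)); // ≈ 12.0 and 14.0
```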

Although this is slower than performing a single update in a reference table, it is nowhere near the kind of update nightmare that SQL practitioners typically associate with denormalization. The crucial point is that only the index entries related to the updated array element and field must be maintained. To understand where the fear comes from, let's do the same in a SQL database.

Comparison with PostgreSQL JSONB

SQL databases were designed for normalization, and even though they accept JSON datatypes, they may lack the same optimizations, so such an update is much more expensive.

I created a similar dataset in PostgreSQL 18 with JSONB:


CREATE TABLE scores (  
  id     bigint PRIMARY KEY,  
  doc    jsonb  
)
;

INSERT INTO scores (id, doc)  
SELECT  
  g,  
  jsonb_build_object(  
    'players',  
    jsonb_agg(  
      jsonb_build_object(  
        'name',  'player' || ((g * 1000 + s) % 10000),  
        'score', random()  
      )  
    )  
  )  
FROM generate_series(1, 100000) AS g  
CROSS JOIN generate_series(1, 1000) AS s  
GROUP BY g;  

SELECT count(*) FROM scores 
 WHERE doc @? '$.players[*] ? (@.name == "player42")'
;

CREATE INDEX ON scores 
 USING gin ((doc->'players') jsonb_path_ops)
;

VACUUM ANALYZE scores
;


A single GIN index can cover equality filters (or rather containment) on the two fields.

I executed the same update, using @? in the WHERE clause to find the documents via the GIN index, and jsonb_set to modify the name. There is no in-place access to individual elements of the JSONB value stored in the table: the update must expand the JSON array with jsonb_array_elements and rebuild it with jsonb_agg:

EXPLAIN (ANALYZE, BUFFERS, WAL, COSTS OFF)
UPDATE scores     
SET doc = (    
  SELECT jsonb_agg(    
    CASE     
      WHEN p->>'name' = 'player42'     
      THEN jsonb_set(p, '{name}', '"Franck"')    
      ELSE p     
    END    
  )    
  FROM jsonb_array_elements(doc->'players') p    
)    
WHERE (doc->'players') @? '$[*] ? (@.name == "player42")'
;


Here is the execution plan:

                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Update on scores (actual time=13126.454..13126.459 rows=0.00 loops=1)
   Buffers: shared hit=641088 read=39252 dirtied=57413 written=57455
   WAL: records=310285 fpi=31129 bytes=406743374 buffers full=33593
   ->  Bitmap Heap Scan on scores (actual time=6.333..7907.504 rows=10000.00 loops=1)
         Recheck Cond: ((doc -> 'players'::text) @? '$[*]?(@."name" == "player42")'::jsonpath)
         Heap Blocks: exact=736
         Buffers: shared hit=88347 read=32397 written=20050
         ->  Bitmap Index Scan on scores_expr_idx (actual time=3.491..3.491 rows=10000.00 loops=1)
               Index Cond: ((doc -> 'players'::text) @? '$[*]?(@."name" == "player42")'::jsonpath)
               Index Searches: 1
               Buffers: shared hit=1 read=7
         SubPlan 1
           ->  Aggregate (actual time=0.519..0.519 rows=1.00 loops=10000)
                 Buffers: shared hit=60000
                 ->  Function Scan on jsonb_array_elements p (actual time=0.060..0.091 rows=1000.00 loops=10000)
                       Buffers: shared hit=60000
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.647 ms
 Execution Time: 13126.742 ms
(20 rows)

To update 10,000 documents, PostgreSQL generated 310,285 WAL records (about 31 per document, roughly 407 MB of WAL in total, or ~40 KB per document) and dirtied 57,413 blocks (~46 KB per document). This is less than rewriting the full raw JSON, thanks to the binary internal representation and TOAST compression, but still far larger than the logical change of a single field update. PostgreSQL JSONB makes denormalization workable but write-amplified when duplicated values must be updated: not because they are duplicated, but because they reside in large, immutable JSONB documents.
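As a quick sanity check (my own arithmetic, assuming the default 8 KB block size), the per-document write amplification follows from the counters in the plan above:

```javascript
// Per-document write amplification from the EXPLAIN (ANALYZE, WAL, BUFFERS) output
const docs = 10000;
const walBytes = 406743374;   // WAL: bytes
const dirtiedBlocks = 57413;  // Buffers: dirtied (8 KB each)
console.log(Math.round(walBytes / docs));              // ≈ 40674 WAL bytes per document
console.log(Math.round(dirtiedBlocks * 8192 / docs));  // ≈ 47033 bytes dirtied per document
```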

Conclusion

In traditional relational databases, avoiding duplication is essential because the database is the main authority for correctness: every application, report, and ad‑hoc SQL query must respect the same invariants. In that context, normalization and foreign keys are the safest way to prevent inconsistencies.

In a Domain‑Driven Design (DDD) architecture, that responsibility shifts. Aggregates define consistency boundaries, a single trusted service mediates database access, and the application enforces invariants. Duplication is therefore intentional and bounded, not accidental.

The experiments above show that this difference is both physical and conceptual. In MongoDB, updating a value embedded in an array changes only the affected elements and their index entries. Index maintenance scales with the number of modified elements, not with overall document size, so even highly denormalized aggregates can still be updated efficiently and atomically.

In PostgreSQL, JSONB supports denormalized structures but with very different update semantics. Any change requires rebuilding the entire JSON value and regenerating all related index entries, regardless of how small the logical update is. Indexes help find the rows, but cannot avoid the cost of rewriting the document.

As a result, the trade‑off is clear: PostgreSQL JSONB–based denormalization mainly optimizes reads while imposing a write cost proportional to document size, whereas MongoDB’s document model supports both read locality and fine‑grained, efficient updates within aggregates. The issue is not whether denormalization is “good” or “bad,” but whether the database’s storage and indexing model fits the aggregate’s read‑write patterns and ownership assumptions.
