Old jokes have contributed to persistent myths about MongoDB's durability. The authors of those myths ignore that MongoDB's storage engine is among the most robust in the industry, and it's easy to demonstrate. MongoDB uses WiredTiger (created by one of the authors of Berkeley DB), which provides block corruption protection stronger than that of many other databases. In this article, I show how to reproduce a simple lost write in a lab and how the database detects it to avoid returning corrupt data.
PostgreSQL
To show what happens when a database does not detect lost writes, I chose PostgreSQL for this demonstration. As of version 18, PostgreSQL enables data checksums by default. I'm testing it with the Release Candidate in a Docker lab:
docker run --rm -it --cap-add=SYS_PTRACE postgres:18rc1 bash
# Install some utilities
apt update -y && apt install -y strace
# Start PostgreSQL
POSTGRES_PASSWORD=x \
strace -fy -e trace=pread64,pwrite64 \
docker-entrypoint.sh postgres &
# Connect to PostgreSQL
psql -U postgres
I've started PostgreSQL, tracing the read and write calls with strace.
I check that data checksums are enabled:
postgres=# show data_checksums;
data_checksums
----------------
on
(1 row)
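Checksums can also be verified offline with the pg_checksums tool shipped with PostgreSQL. This is not part of the demo, just a minimal sketch, assuming the server has been shut down cleanly and that $PGDATA points to the data directory:
# Scan all data files and report any checksum mismatch (the server must be stopped)
pg_checksums --check -D "$PGDATA"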
I create a demo table and insert random data:
create table demo (k int primary key, v text);
copy demo from program $$
cat /dev/urandom |
base64 |
head -10000 |
awk '{print NR"\t"$0}'
$$ ;
vacuum demo;
checkpoint;
create extension pg_buffercache;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
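As a sanity check, not required for the demo, the pg_buffercache extension created above can confirm that no buffer of the demo table remains cached after the eviction. A small sketch, run from a shell in the container; it should return zero:
# Count the demo table's buffers still present in shared buffers
psql -U postgres -c "select count(*) from pg_buffercache where relfilenode = pg_relation_filenode('demo');"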
I triggered a checkpoint to write the dirty buffers to disk and evicted the shared buffers, so that I can see the read I/O of the next query:
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+------------------------------------------------------------------------------
999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)
This has read two pages from the index, and then the table page that contains the row I'm querying. This page is in base/5/16388, at offset 114688.
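To cross-check this location from SQL rather than from strace, the block number can be derived from the row's ctid and multiplied by the 8 kB page size. A quick sketch, run from a shell in the container, not part of the original trace:
# Map the row to its data file and byte offset (block 14 * 8192 should give 114688)
psql -U postgres <<'SQL'
select pg_relation_filepath('demo');
select (ctid::text::point)[0]::int        as block_no,
       (ctid::text::point)[0]::int * 8192 as byte_offset
from demo where k = 999;
SQL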
I use dd to save the content of this page, then update row 999, trigger a checkpoint, and evict the buffer cache:
\! dd if=/var/lib/postgresql/18/docker/base/5/16388 of=block1.tmp bs=1 skip=114688 count=8192
update demo set v='xxxxxx' where k=999;
checkpoint;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
I query the row again, and it shows the updated value:
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10m\261\1K~\0\08\1\200\1\0 \4 \2\3\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+--------
999 | xxxxxx
(1 row)
It reads the same index pages, and the same table page at offset 114688, which now holds the new value.
To simulate a lost write, I copy the previously saved block back to this location, overwriting the current version, and query again:
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=114688 conv=notrunc
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
k | v
-----+------------------------------------------------------------------------------
999 | MQIEZSsmBjk7MRtgIZLL/MqABsjhuMR6I4LtayWfR2764PdB+AcQt2saRtFXkgBUGCLKzM8SBmKX
(1 row)
There's no error because the restored block has a correct checksum: it is a valid page image, just an obsolete one. It also has the right structure, since it is a former version of the same page. However, it shows a row version that should no longer be there. This is exactly what can happen when the storage loses a write: the previous version of the block remains on disk and the database has no way to notice it.
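One way to see that the page is valid but stale is to look at its header with the pageinspect extension: the restored page carries the LSN of the pre-update version, and PostgreSQL does not compare this LSN with anything when reading the page. A sketch, assuming pageinspect is available in this image (block 14 corresponds to offset 114688):
# Show the page LSN and stored checksum of block 14 of the demo table
psql -U postgres <<'SQL'
create extension if not exists pageinspect;
select lsn, checksum from page_header(get_raw_page('demo', 14));
SQL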
Checksums are still useful when the corruption is not aligned with well-formed pages. For example, I overwrite the second half of the page with the first half of the saved block:
checkpoint;
\! dd of=/var/lib/postgresql/18/docker/base/5/16388 if=block1.tmp bs=1 seek=118784 count=4096 conv=notrunc
set enable_seqscan to false;
select distinct pg_buffercache_evict(bufferid) from pg_buffercache;
select * from demo where k=999;
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0@j\225\1R\232\0\0\210\08\36\360\37\4 \0\0\0\0\350\237\20\0\330\237 \0"..., 8192, 24576) = 8192
[pid 161] pread64(59</var/lib/postgresql/18/docker/base/5/16394>, "\0\0\0\0\320\243\200\1mj\0\0\324\5\0\t\360\37\4 \0\0\0\0\340\237 \0\320\237 \0"..., 8192, 32768) = 8192
[pid 161] pread64(58</var/lib/postgresql/18/docker/base/5/16388>, "\0\0\0\0\10\232\232\1\302\260\4\0000\1`\1\0 \4 \0\0\0\0\220\237\322\0 \237\322\0"..., 8192, 114688) = 8192
2025-09-08 17:58:41.876 UTC [161] LOG: page verification failed, calculated checksum 20176 but expected 45250
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] LOG: invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
2025-09-08 17:58:41.876 UTC [161] ERROR: invalid page in block 14 of relation "base/5/16388"
2025-09-08 17:58:41.876 UTC [161] STATEMENT: select * from demo where k=999;
ERROR: invalid page in block 14 of relation "base/5/16388"
Here the calculated checksum does not match the one stored in the page header, and an error is raised. PostgreSQL checksums can detect some block corruption, but a bug, a lost write, or a malicious user with access to the filesystem can still change the data without being detected.
Oracle Database
To detect lost writes like the one simulated above, Oracle Database can compare block versions with a standby database, since it is unlikely that the same corruption happens in both environments. I've demonstrated this in the past with a similar demo: 18c new Lost Write Protection
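For reference, this protection is controlled by the DB_LOST_WRITE_PROTECT parameter on the primary and the standby. A minimal sketch of enabling it, assuming a Data Guard configuration (not part of this lab):
# Run on the primary and on the standby instances
sqlplus / as sysdba <<'SQL'
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL SCOPE = BOTH;
SQL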
WiredTiger (MongoDB storage engine)
MongoDB uses the WiredTiger storage engine, which is designed to detect lost writes and disk failures that might return the wrong page. To achieve this, WiredTiger stores a checksum alongside each page address within the pointers between the BTree pages, in an address cookie:
Address cookie: an opaque set of bytes returned by the block manager to reference a block in a Btree file, it includes an offset, size, checksum, and object id.
In my lab, I first start a MongoDB container and compile wt, a command-line utility that allows direct interaction with WiredTiger files. This tool lets me examine the storage engine without relying on the MongoDB query layer, and I'll use it in this series of blog posts:
docker run --rm -it --cap-add=SYS_PTRACE mongo bash
# install required packages
apt-get update && apt-get install -y git xxd strace curl jq python3 python3-dev python3-pip python3-venv build-essential cmake gcc g++ libstdc++-12-dev libtool autoconf automake swig liblz4-dev zlib1g-dev libmemkind-dev libsnappy-dev libsodium-dev libzstd-dev
# get latest WiredTiger
curl -L $(curl -s https://api.github.com/repos/wiredtiger/wiredtiger/releases/latest | jq -r '.tarball_url') -o wiredtiger.tar.gz
# Compile
mkdir /wiredtiger && tar -xzf wiredtiger.tar.gz --strip-components=1 -C /wiredtiger ; cd /wiredtiger
mkdir build && cmake -S /wiredtiger -B /wiredtiger/build \
-DHAVE_BUILTIN_EXTENSION_SNAPPY=1 \
-DCMAKE_BUILD_TYPE=Release \
-DENABLE_WERROR=0 \
-DENABLE_QPL=0 \
-DCMAKE_C_FLAGS="-O0 -Wno-error -Wno-format-overflow -Wno-error=array-bounds -Wno-error=format-overflow -Wno-error=nonnull" \
-DPYTHON_EXECUTABLE=$(which python3)
cmake --build /wiredtiger/build
ln -s /wiredtiger/build/wt /usr/local/bin/wt
It takes some time, but the wt command-line utility makes the investigation easier. That's an advantage of MongoDB's pluggable storage: you can examine it layer by layer.
I create a demo table and insert ten thousand records:
root@7d6d105a1663:/tmp# wt create table:demo
root@7d6d105a1663:/tmp# ls -alrt
total 68
drwxr-xr-x. 1 root root 4096 Sep 8 19:21 ..
-rw-r--r--. 1 root root 21 Sep 8 19:35 WiredTiger.lock
-rw-r--r--. 1 root root 50 Sep 8 19:35 WiredTiger
-rw-r--r--. 1 root root 299 Sep 8 19:35 WiredTiger.basecfg
-rw-r--r--. 1 root root 4096 Sep 8 19:35 demo.wt
-rw-r--r--. 1 root root 4096 Sep 8 19:35 WiredTigerHS.wt
-rw-r--r--. 1 root root 1475 Sep 8 19:35 WiredTiger.turtle
-rw-r--r--. 1 root root 32768 Sep 8 19:35 WiredTiger.wt
drwxrwxrwt. 1 root root 4096 Sep 8 19:35 .
root@7d6d105a1663:/tmp# wt list
colgroup:demo
file:demo.wt
table:demo
cat /dev/urandom |
base64 |
head -10000 |
awk '{print "i",NR,$0}' |
wt dump -e table:demo
...
Inserted key '9997' and value 'ILZeUq/u/ErLB/i7LOUb4nwYP6D535trb8Mt3vcJXXRAqLeAiYIHn5bEWs1buflmiZMYd3rMMvhh'.
Inserted key '9998' and value 'y+b0eTV/4Ao12qRqtHhgP2xGUr+C9ZOfvOG3ZwbdDNXvpnbM1/laoJ9Yzyt6cbLJOR6jdQktpgFM'.
Inserted key '9999' and value 'cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c'.
Inserted key '10000' and value 'QYwyjaRxa9Q+5dvzwQtvv2QE/uS/vhRPCVsQ6p7re/L2yDrVRxyqkvSyMHeRCzMIsIovrCUJpPXI'.
I read record 9999 with wt and use strace to see the read calls:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 806912) = 28672
9999
cJY9uWtopqFOuZjggkZDWVZEEdygpMLyL7LscqehnKoVY7BrmTh4ZzyTLrZ1glROwLtZYbvLbu5c
...
This record is in a 28,672-byte block at offset 806912. I save this block with dd:
dd if=demo.wt of=block1.tmp bs=1 skip=806912 count=28672
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0680832 s, 421 kB/s
I update this record to "xxxxxx" and trace the write calls:
strace -yy -e trace=pwrite64 -xs 36 wt dump -e table:demo <<<"u 9999 xxxxxx"
...
Updated key '9999' to value 'xxxxxx'.
pwrite64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
pwrite64(6</tmp/demo.wt>, "\x00...\x4d\x19\x14\x4e"..., 4096, 876544) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x4f\x74\xdb\x1e"..., 4096, 880640) = 4096
pwrite64(6</tmp/demo.wt>, "\x00...\x21\xcc\x25\x06"..., 4096, 884736) = 4096
...
This writes a new 28,672-byte block (WiredTiger does not write in place, which helps avoid corruption) and updates the BTree branch pages.
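Because of copy-on-write, the old image at offset 806912 should still be intact at this point. A quick check, assuming the old extent has not been reused yet, compares the on-disk bytes with the block saved earlier:
# Compare the bytes at the old location with the saved copy in block1.tmp
cmp <(dd if=demo.wt bs=1 skip=806912 count=28672 2>/dev/null) block1.tmp \
  && echo "old block still unchanged at its original offset"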
I can read this new value:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x02\xb1\x1f\xf5"..., 28672, 847872) = 28672
9999
xxxxxx
...
To simulate disk corruption, I do the same as I did on PostgreSQL: replace the current block with the old one. I save the current block before overwriting it:
dd if=demo.wt of=block2.tmp bs=1 skip=847872 count=28672
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0688249 s, 417 kB/s
dd of=demo.wt if=block1.tmp bs=1 seek=847872 conv=notrunc
28672+0 records in
28672+0 records out
28672 bytes (29 kB, 28 KiB) copied, 0.0666375 s, 430 kB/s
If I try to read the record in this block, the corruption is detected:
strace -yy -e trace=pread64 -xs 36 wt dump -e table:demo <<<"s 9999"
...
pread64(6</tmp/demo.wt>, "\x00...\x1f\x63\x77\x41"..., 28672, 847872) = 28672
[1757361305:392519][8246:0x7fe9a087e740], wt, file:demo.wt, WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __wti_block_read_off, 279: demo.wt: potential hardware corruption,
read checksum error for 28672B block at offset 847872: block header
checksum of 0x4177631f doesn't match expected checksum of 0xf51fb102
[1757361305:392904][8246:0x7fe9a087e740], wt, file:demo.wt,
WT_SESSION.open_cursor: [WT_VERB_DEFAULT][ERROR]: __bm_corrupt_dump, 86: {0: 847872, 28672, 0xf51fb102}: (chunk 1 of 28): 00 00 00 00 00
00 00 00 3a 00 00 00 00 00 00 00 40 6e 00 00 a8 02 00 00 07 04 00 01 00 70 00 00 1f 63 77 41 01 00 00 00 11 39 36 39 33 80 8c 43 38 30 58 51 66 64
...
Even though the checksum is correct for the block itself, WiredTiger detects that the checksum in the block header, 0x4177631f (visible as 1f 63 77 41 in the hexadecimal dump, or \x1f\x63\x77\x41 in the read trace), differs from the expected checksum 0xf51fb102 stored in the address cookie. 0xf51fb102 was visible as \x02\xb1\x1f\xf5 in the write call of the update, and appears as 02 b1 1f f5 in the block that I saved before overwriting it:
root@7d6d105a1663:/tmp# xxd -l 36 block2.tmp
00000000: 0000 0000 0000 0000 5c00 0000 0000 0000 ........\.......
00000010: f96d 0000 a802 0000 0704 0001 0070 0000 .m...........p..
00000020: 02b1 1ff5
Even with access to the files, it would be extremely difficult to corrupt the data in an undetected way: any change must update the block checksum, and that checksum is also referenced by the address cookie of the block that points to it. Undetected corruption is highly unlikely because blocks are not updated in place, and a lost or misplaced write breaks the pointer chain.
WiredTiger is open source and you can check the WT_BLOCK_HEADER definition. In this structure, the block size (disk_size) field appears right before the checksum field: for example, 00 70 00 00 = 0x00007000 = 28,672 bytes, followed by the checksum 02 b1 1f f5 = 0xf51fb102. One advantage of WiredTiger is that BTree leaf blocks can have flexible sizes, which MongoDB uses to keep each document as one chunk on disk and improve data locality.
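These two fields can also be read directly from the saved block with standard tools. A small sketch, assuming the block header starts at byte 28, right after the page header, as the xxd output above suggests:
# disk_size: 4-byte little-endian integer at offset 28 (expected: 28672)
od -An -tu4 -j 28 -N 4 block2.tmp
# checksum: 4-byte little-endian integer at offset 32 (expected: 0xf51fb102)
xxd -e -s 32 -l 4 block2.tmp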
Checksum verification is implemented in block_read.c and performs two validations:
- It checks that the checksum stored in the block header matches the expected checksum from the address cookie (the BTree pointer created when the block was written).
- It zeroes out the checksum field in the header and recomputes the checksum over the block content, verifying it also matches the expected checksum. This ensures both the block’s integrity and its identity.
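These checks can also be run on demand across a whole file with the wt utility. After restoring the stale block as above, a verify pass should report the same kind of mismatch (sketch only, output not shown):
# Verify the structure and checksums of all blocks of the table
wt verify table:demo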
Conclusion
PostgreSQL relies on page checksums, enabled by default since version 18, to detect data corruption: it raises an error when a page's content does not match its checksum. However, if a write is lost and a previous, valid version of the page remains on disk, the checksum is still correct and PostgreSQL cannot identify the issue. As a result, some disk failures may escape detection and return wrong results.
Oracle Database stores checksums in its blocks and can verify them on read. With a Data Guard standby, and at the cost of some overhead, it can also compare block versions between the primary and the standby to detect lost writes.
MongoDB's WiredTiger storage engine enables checksums by default and can detect wrong blocks without contacting replicas. It embeds the expected checksum inside the BTree address cookie, so every internal BTree pointer includes the checksum of the referenced page. If an obsolete or different page is swapped in, the mismatch is detected because the pointer's checksum no longer matches. WiredTiger uses copy-on-write rather than in-place overwrites, further reducing the risk of corruption.