DEV Community: Hugo Vantighem

Invariant-Driven Architecture: 20M transactions on a €80/mo Cloud VM.

Hugo Vantighem — Mon, 25 May 2026 06:27:00 +0000

📎 This is Part 2. Part 1 — Postgres-grade serializable at 20k ops/s on a laptop (don't try this at home) presented 20,000+ durable, invariant-validated transactions per second — on a MacBook Air M3, 8 cores, fan barely audible.

A laptop number is only half the story. The natural next test is whether the same architecture holds on a small, cheap public VM with network-attached storage and strict fsync(2) durability: three constraints, each of which on its own tends to move the bottleneck by an order of magnitude on most stacks. So that's what I ran.

The Setup

VM: Scaleway POP2-2C-8G — 2 vCPUs AMD EPYC 7543 @ 2.8 GHz, 8 GiB RAM. Yes — two vCPUs.
Disk: SBS Block storage, 100 GB, 15,000 provisioned IOPS. Network-attached, not local NVMe.
OS: Ubuntu 22.04, kernel 5.15. Linux strict fsync(2) — every commit hits the SSD for real, no Apple-style controller-cache shortcut.
Software: the exact same codebase that ran on the M3, with NATS + Mongo + Mongo Express as the side stack in Docker.
Price: roughly €80/month, made up of ~€54 of compute (POP2-2C-8G at €0.0735/h) plus ~€20–25 of SBS volume with provisioned IOPS. Hourly, that's about €0.11.

Two names, one stack

Two terms surface across this series. They sit at different layers, so it's worth pinning them now.

Invariant-Driven Architecture (IDA) — the design philosophy. The system, end to end (ingress, sequencer, system, storage), is engineered around a single obsession: validating and enforcing business invariants on every commit, with no compromise on throughput. It's a DDD-based philosophy.
Atomic State Platform — the concrete implementation. The software we're benchmarking right here — that just put down 33,091 sustained items per second on a €80/mo Scaleway VM.

IDA is how. Atomic State is what. The €80/mo VM is where. The rest of this post is how much.

1. The Visual Punch — 10 minutes non-stop, 20 million items 📈

The first test is the one that closes the "but does it sustain?" question forever.

BENCH_DURATION=10m BATCH_SIZE=1000 make cloud-bench-broker

Result, over 600 consecutive seconds on the POP2:

Metric	Value
Throughput sustained	33,091 items/s
Total items committed	19,855,000
WAL bytes written	7.6 GB
p99 round-trip	71 ms
Integrity audit	✅ INTEGRITY_OK (1,000 aggregates)
Durability	FSYNC-ON (Linux strict)

Yes, that's higher than the laptop — and on network-attached storage, not local NVMe. SBS is block storage over the data-center fabric; every fsync round-trips to the SBS backend before it returns. Measured fsync latency on this volume: 2.0 ms (vs 130 µs on the laptop's local NVMe — 15× slower per call). The 33,091 items/s holds despite the network disk, not because of a fast one. Pebble's group commit amortises that 2 ms across the whole batch — roughly 2 µs of effective fsync cost per item. That ratio is the architectural lever.
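The amortisation is plain division. As a minimal sketch (the function name is illustrative, not part of Atomic State Platform), here is the per-item fsync cost for the two latencies measured above, at batch=1000:

```python
def effective_fsync_cost_us(fsync_latency_us: float, batch_size: int) -> float:
    """Per-item share of one fsync when a whole batch is committed with a single fsync."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return fsync_latency_us / batch_size


cloud = effective_fsync_cost_us(2000.0, 1000)   # SBS volume, ~2.0 ms per fsync
laptop = effective_fsync_cost_us(130.0, 1000)   # local NVMe, ~130 µs per fsync

print(f"cloud:  {cloud:.2f} µs per item")
print(f"laptop: {laptop:.2f} µs per item")
```

At batch=1000 both figures are a few microseconds or less per item, which is why the 15× gap in raw fsync latency barely shows up in throughput.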

More on why the cloud beats the laptop on this same scenario in §4. For now, the headline isn't the number. The headline is the steadiness.

I sampled the engine's /metrics and the host's resource counters every 15 s through the entire run. Three things every senior engineer should care about:

pebble_health.l0_files stays in [0, 6] the whole 10 minutes. Never higher. Compaction kicks in the moment L0 hits 6, completes before the next batch needs the slot.
compactions_in_progress oscillates 0 ↔ 1 — Pebble is keeping up in real time, never queueing.
estimated_debt_bytes never crosses 100 MB, roughly 1.3% of the 7.6 GB of WAL written.

At this scale — 34 MB/s sustained ingestion, ~20 GB of raw data committed — a mis-tuned LSM-tree triggers a Write Stall: the engine has to pause writes while compaction catches up, and the curve falls off a cliff. We see none of that. The puts counter climbs linearly from 0 to 19.85 M across the whole 10 minutes.
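One way to turn those three observations into an automated check is to assert the bands over the sampled metrics. The sketch below is an assumption about how such a check could look: the field names mirror the metric names quoted above, the thresholds are the bands reported for this run, and the sample values are made up for illustration, not taken from the bench.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    l0_files: int
    compactions_in_progress: int
    estimated_debt_bytes: int


def within_bands(samples, max_l0=6, max_compactions=1, max_debt_bytes=100 * 10**6):
    """True if every sample stays inside the bands observed during the 10-minute run."""
    return all(
        s.l0_files <= max_l0
        and s.compactions_in_progress <= max_compactions
        and s.estimated_debt_bytes <= max_debt_bytes
        for s in samples
    )


# Illustrative values only (hypothetical, not measurements).
healthy = [Sample(0, 0, 0), Sample(4, 1, 60 * 10**6), Sample(6, 1, 95 * 10**6)]
stalling = healthy + [Sample(12, 3, 400 * 10**6)]

print(within_bands(healthy))
print(within_bands(stalling))
```

A run that starts building compaction debt faster than it can be paid down would fail this check well before a write stall appears in the throughput curve.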

CPU%, L0 files and debt_MB never drift outside the bands shown above during the active 10-minute window.

[preflight] measured: lat_avg=2 051 µs iops=487
[preflight] Mac ref:  lat_avg=131 µs   iops=7 568 (M-series NVMe, same fio command)
[preflight] gate:     lat_avg <= 5 000 µs (override via SCW_FSYNC_MAX_LAT_US)
✅ PASS - instance is fsync-fast enough for a meaningful bench.
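The gate in that log boils down to one comparison with an environment override. A minimal sketch, assuming the override is read as a plain number of microseconds (the function name and the parsing are illustrative; only the variable name SCW_FSYNC_MAX_LAT_US and the 5,000 µs default come from the log):

```python
import os

DEFAULT_MAX_LAT_US = 5000.0  # gate value shown in the preflight log


def fsync_gate(lat_avg_us: float, env=os.environ) -> bool:
    """Return True when the measured average fsync latency is under the gate."""
    max_lat_us = float(env.get("SCW_FSYNC_MAX_LAT_US", DEFAULT_MAX_LAT_US))
    return lat_avg_us <= max_lat_us


# 2,051 µs is the value measured on this VM; the env dicts stand in for the real environment.
print(fsync_gate(2051.0, env={}))
print(fsync_gate(2051.0, env={"SCW_FSYNC_MAX_LAT_US": "1000"}))
```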

That's what "industrial-grade" looks like on a VM that costs €0.11/h.

2. The Ceiling Demo — batch=1000 vs batch=2000

Now the test that tells us where the wall is. Same VM, same 1-minute window, double the batch:

+2.8% throughput for +59% tail latency. Past 1,000 items per batch, more batching is just queueing. We've squeezed the disk dry.

Let's do the math on what the bottleneck is. The cloud-up preflight measured 2.0 ms average fsync latency on this SBS volume (fio --rw=randwrite --bs=4k --direct=1 --sync=1, the same command everywhere). At 33k items/s and batch=1000, that's about 33 fsyncs per second, one per batch, or roughly 66 ms of fsync wall-time per second: under 7% of the budget. Pebble's group commit already amortises fsync across the whole batch.

The remaining 93% or so of wall-time is CPU:

HTTP/JSON serialization on the producer
1,000 invariant evaluations per chunk
Pebble's batch building + commit

The network-attached disk is no longer the wall. Two 2.8 GHz AMD EPYC vCPUs at sustained 70% CPU are.

The fact that doubling the batch buys 2.8% is the experimental proof: there's no more disk to amortise, only CPU to share.

The full numbers, side by side

Full side-by-side numbers for batch=1000 vs batch=2000.

Read the bold rows together: we halved the number of fsyncs (33.3 → 17.1 per second) and throughput barely moved (+2.8%). If the disk were the wall, cutting fsyncs in half would have bought a lot more. It didn't — because CPU holds a flat ~70% plateau across both runs. That's the ceiling, in two numbers.
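The two fsync rates quoted above are enough to redo the arithmetic. A small sketch (function names are illustrative; the inputs are the 33.3 and 17.1 fsyncs per second and the 2.0 ms preflight latency from the text):

```python
def items_per_second(fsyncs_per_second: float, batch_size: int) -> float:
    """One fsync per batch, so throughput is fsync rate times batch size."""
    return fsyncs_per_second * batch_size


def fsync_share_of_wall_time(fsyncs_per_second: float, fsync_latency_ms: float) -> float:
    """Fraction of each second spent waiting on fsync."""
    return fsyncs_per_second * fsync_latency_ms / 1000.0


small = items_per_second(33.3, 1000)
large = items_per_second(17.1, 2000)
gain = (large - small) / small

print(f"batch=1000: {small:,.0f} items/s, fsync share {fsync_share_of_wall_time(33.3, 2.0):.1%}")
print(f"batch=2000: {large:,.0f} items/s, fsync share {fsync_share_of_wall_time(17.1, 2.0):.1%}")
print(f"throughput gain: {gain:+.1%}")
```

With these rounded inputs the gain comes out near +2.7%, within rounding of the +2.8% measured, while fsync wall-time drops from roughly 6.7% to 3.4% of each second. Halving an already small slice leaves almost nothing to win.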

Engine + system samples, every 5 s (abridged — start / mid / end). Same CPU band, same L0 transient peak (10), and acks (= fsyncs) running at exactly half the rate on bs=2000 for the same puts — one fsync per batch, batches twice as big, no throughput dividend.

3. The Proof by Absurdity — 64 Workers on 2 Cores

This is the test every junior engineer expects to help throughput, and every senior engineer expects to kill it. The reflex: "if the engine is slow, scale out the producer."

make cloud-bench-perf-dense   # 64 workers, batch=100

Same VM. Same disk. Same system. The only change is the producer pattern.

Read that twice. CPU usage drops from 70% to 30% — and throughput collapses 5.5×.

That asymmetry is the signature of context-switching hell. The Linux scheduler spends its cycles swapping 64 producer goroutines + 64 HTTP request handlers + the system + 3 Docker containers across 2 vCPUs. The real work-per-cycle drops; the cores look idle because they spend their time saving and restoring register state. The engine's single-writer queue becomes the rendezvous point: workers pile up and p99 explodes to 1.7 seconds.

This is the design principle of the engine stated as a measurement:

The engine is single-writer by design — one writer to Pebble, no lock contention.
The ingress must batch upstream of the engine to amortise fsync.
More producer threads = less throughput on a CPU-constrained host. Mathematically, not ideologically.
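To make the first two principles concrete, here is a minimal sketch of a single writer fed by a queue, with one fsync per batch. It is not Atomic State Platform's code and does not use Pebble: a plain append-only file stands in for the storage engine, and every name in it is hypothetical. The queue is filled before the writer starts so that the batch count is deterministic.

```python
import os
import queue
import tempfile
import threading


def single_writer(q, path, batch_size, stats):
    """Sole owner of the log file: drain up to batch_size items, then fsync once."""
    with open(path, "ab") as log:
        done = False
        while not done:
            batch = [q.get()]                     # block until there is work
            while len(batch) < batch_size and batch[-1] is not None:
                try:
                    batch.append(q.get_nowait())  # fill the batch without waiting
                except queue.Empty:
                    break
            if batch[-1] is None:                 # sentinel: no more input
                batch.pop()
                done = True
            if batch:
                log.write(b"".join(batch))
                log.flush()
                os.fsync(log.fileno())            # one durable sync per batch
                stats["items"] += len(batch)
                stats["fsyncs"] += 1


q = queue.Queue()
for i in range(5000):
    q.put(b"item %d\n" % i)
q.put(None)                                       # sentinel marks the end of input

stats = {"items": 0, "fsyncs": 0}
with tempfile.TemporaryDirectory() as tmp:
    writer = threading.Thread(
        target=single_writer,
        args=(q, os.path.join(tmp, "wal.log"), 1000, stats),
    )
    writer.start()
    writer.join()

print(stats)
```

Because only one thread ever touches the file, no lock is needed around the write path, and the number of fsyncs scales with the number of batches rather than the number of items.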

4. The Mac Parallel — More Cores Buy More Headroom

The same three scenarios, on the MacBook Air M3 from Part 1 (8 cores, 16 GB):

Three readings:

(a) On a single-worker pattern, the cloud Linux wins. Mac fsync(2) is ~15× faster than the cloud SBS per call (~130 µs vs ~2 ms), but at batch=1000 the per-batch fsync is amortised over 1,000 items — and the rest of the pipeline (HTTP serialization, internal sequencing, scheduler) now dominates.

(b) When the workload has CPU work to spare and concurrency is contained, the extra cores cash in. Going from batch=1000 to batch=2000 adds compute per batch but releases parallelism inside the engine (more items concurrently invariant-checked by the system). The Mac has 6 extra cores to spend on it, so its throughput climbs +54% (23,755 → 36,549). The cloud, pinned at 2 vCPUs, gains only +2.8% on the identical change — it has no spare core to convert the extra parallelism into work.

(c) The Mac does not flinch at 64 workers — it has the cores to absorb them. This is the exact scenario where the 2-vCPU cloud VM collapsed to 5,992 (§3). The 8-core Mac runs the identical 64-worker, batch=100, payload=0 B workload at 43,392 items/s — 7.2× the cloud, and above its own single-worker broker run (23,755). Context-switching only becomes hell when threads vastly outnumber cores; with 8 cores the scheduler keeps up and the engine's single writer stays fed. The perf-dense collapse was never about the workload — it was about core count.

The hierarchy of constraints is universal, regardless of OS, disk brand, or vendor SKU:

batch size > producer concurrency > raw fsync speed

Match those three to your hardware and your real ingress pattern, and the throughput follows. Get them wrong and a top-end laptop loses to a €80/month VM — or wins against one — depending on the day.

The Numbers, at a Glance

€80/month is the order of magnitude — ~€54 of compute, ~€20–25 of provisioned-IOPS SBS volume. Hourly: €0.11. The whole bench session that produced the cloud rows of this table cost ~€0.05 of cloud time — at this scale, validating an architecture decision on a representative VM is essentially free.

How the runs were captured

Each row above came from the same harness: a fresh VM with a clean Pebble, system + Mongo projection started as systemd units, NATS + Mongo + Mongo Express brought up in Docker, and the engine's /metrics endpoint sampled every 5–15 seconds during the run. An fio pre-flight (--rw=randwrite --bs=4k --direct=1 --sync=1, 5 s) gates the run on a configurable fsync latency threshold; on this VM it measured 2,051 µs average, well under the 5 ms gate. Every number in the tables above is either a direct read from /metrics, a count from the bench JSON output, or a delta between consecutive samples — nothing synthetic, no extrapolation.
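The "delta between consecutive samples" step can be sketched as follows. The helper name is illustrative; the only data used is the start and end of the puts counter from the 10-minute run (0 to 19,855,000 over 600 s), since the intermediate samples are not reproduced here.

```python
def rates_from_samples(samples):
    """samples: list of (t_seconds, counter) pairs; returns items/s for each interval."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates


endpoints = [(0, 0), (600, 19_855_000)]
print(rates_from_samples(endpoints))
```

Applied to the two endpoints, this gives about 33,091.7 items/s, in line with the sustained throughput reported for the run.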

Conclusion

20 million durable, invariant-validated transactions in 10 minutes, on a public-cloud VM that costs less per month than a SaaS subscription. Every run ends with INTEGRITY_OK.

What Atomic State Platform does on the M3 in Part 1, it does unchanged on a €80/mo Linux VM:

The single-writer engine means the disk stops being the wall as soon as you batch upstream.
Smaller machines with slower fsync reach the same throughput envelope on cheap cloud as a powerful laptop, provided the producer pattern is cooperative.
Bigger machines buy headroom for concurrency, not raw throughput.

You don't need a 64-core server. You don't need an NVMe array. You don't need a datacenter rack. You need the right pattern, applied to the right SKU.

Part 3 unpacks the deeper claim: how Invariant-Driven Architecture lets Atomic State Platform sidestep the classical database stack outright — no Postgres, no Redis, no event-sourcing scaffolding. Just a system with a single fsync per batch, doing what nothing else does on a €80/mo box.

Postgres-grade Serializable at 20k+ ops/s — on a laptop. Don’t try this at home.

Hugo Vantighem — Sat, 23 May 2026 17:14:52 +0000

They didn't know it was impossible, so they did it. (attributed to Mark Twain)

In the software industry, we've been raised with a dogma: you must choose between Massive Performance (NoSQL, eventual consistency) and Domain Rigor (SQL, strong consistency, serializable).

We are told that locks, latencies, and ACID properties are the natural enemies of speed. That if you want to scale, you have to let go of your business invariants.

I decided to test another hypothesis. And I broke the myth.

The Result: 20,000+ Validated Transactions per Second

This isn't a "fire and forget" ingestion log.

This isn't a volatile cache experiment.

What you see here is Business Transaction Durability:

Invariants validated — every business rule is checked before commit.
State persisted — every change is durably written to disk.
Strong Consistency — Serializable-level isolation.

At 20,000+ ops/s, we are not just talking about speed. We are talking about the ability to maintain absolute domain integrity under massive load.

And the kicker: this is running on a MacBook Air M3 — 8 cores, 16 GB of RAM, the same machine I write the code on. No 64-core server. No NVMe array. No datacenter rack. One laptop, fan barely audible, doing the work of a small cluster.

Why General-Purpose Databases Hit a Ceiling

Most databases are built for general cases. They treat every row the same way because they don't know your business.

This "Domain Ignorance" leads to generic row locks, MVCC bookkeeping, cross-table coordination, and massive overhead — costs you pay on every single transaction, whether your domain needs them or not.

Not Magic — Discipline

For the skeptics: this isn't sorcery. It's discipline applied to the right layer — designing the system so the hardware does exactly what it's good at, and nothing else.

I'm not reinventing the storage wheel. The foundation is Pebble, the same proven LSM-tree engine that powers CockroachDB. But the engine is just the floor. The real lever is the orchestration of the domain logic on top of it — and that's what Part 2 puts a name on.

A Note on the Benchmark Scope

I know what you're thinking. "20k+ ops/s? That must be an internal memory trick."

It isn't. To ensure these numbers reflect real-world usage, the benchmark covers the entire lifecycle of a business transaction:

Client-side serialization — the payload starts from the app.
Local communication — end-to-end roundtrip.
Server-side deserialization & parsing.
Business Invariants validation.
Disk persistence with full durability guarantees — fsync on every commit.
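The five steps above can be sketched end to end in a few lines. This is a hedged illustration, not the benchmarked system: the account/delta command shape and the "no overdraft" invariant are hypothetical, JSON stands in for whatever wire format is actually used, and a temporary file stands in for the storage engine.

```python
import json
import os
import tempfile


def handle_batch(raw: bytes, balances: dict, log) -> int:
    """Deserialize a batch, validate each command, persist accepted ones with one fsync."""
    commands = json.loads(raw)                    # server-side deserialization
    staged = {}                                   # tentative state, not yet visible
    accepted = []
    for cmd in commands:
        account = cmd["account"]
        current = staged.get(account, balances.get(account, 0))
        new_balance = current + cmd["delta"]
        if new_balance < 0:                       # hypothetical invariant: no overdraft
            continue                              # rejected, state untouched
        staged[account] = new_balance
        accepted.append(cmd)
    if accepted:
        log.write(json.dumps(accepted).encode() + b"\n")
        log.flush()
        os.fsync(log.fileno())                    # durability point
        balances.update(staged)                   # publish only after the fsync
    return len(accepted)


batch = json.dumps([
    {"account": "a", "delta": 100},
    {"account": "a", "delta": -30},
    {"account": "b", "delta": -5},                # would overdraw "b": rejected
]).encode()                                       # client-side serialization

balances = {}
with tempfile.TemporaryFile() as log:
    accepted = handle_batch(batch, balances, log)

print(accepted, balances)
```

The in-memory state is updated only after the fsync returns, so a crash before that point leaves neither the log nor the visible state with a half-applied batch.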

The workload: batch=1000, payload=1KB, single-node, single laptop. Here's the run, with the system-level disk stats captured live during the bench:

[23755.87 items/s] | items=1424000 | batch=1000 | payload=1KB | durability=FSYNC-ON

Live capture during the bench (batch=1000, 1KB, fsync ON). Disk on fire, CPU bored.

Two things jump out of that stats panel — and together they're the whole point:

The disk is screaming. Sustained 100–200 MB/s with the ⚡ markers firing almost every second. This is real fsync'd traffic hitting the SSD, not a memory cache pretending to be durable. If you pulled the power cord mid-run, every committed transaction would still be there on reboot.
The CPU is bored (~18% on an 8-core M3). The compute is idle while the disk pegs out — that asymmetry is the whole story.

And this isn't the ceiling. With bigger batches the same laptop pushes further; even at batch=1, it doesn't fall off a cliff. The full envelope is Part 2.

What's Next?

This is just Part 1. In a few days, Part 2 finishes the picture and lands the real punchline: business rules aren't a tax on performance — they're the contract that lets the machine fly. And the whole thing runs on hardware your team could expense, not a cloud bill that needs board approval.

Stay tuned. The era of the "Impossible Trade-off" is over.