DEV Community

Mr. 0x1

Posted on
I Built a Stream Processor That Only Recomputes What Changed

I spent weeks studying how incremental computation works in production trading systems. Not the papers. The actual implementations. How self-adjusting computation engines track dependencies, propagate changes, and avoid redundant work.

One thing kept bothering me: the model is incredibly powerful, but it's locked inside single-process libraries. If you want surgical recomputation — where changing one input only touches the nodes that actually depend on it — you have to give up distribution. If you want distribution, you're back to recomputing entire windows on every tick.

That gap is where Ripple came from.

The experiment that started it

I built a prototype. A simple incremental graph: 10,000 leaf nodes (one per stock symbol), each feeding through a map node into a fold that aggregates them all. The question was simple: when one leaf changes, how many nodes actually need to recompute?

The answer should be 3. The leaf, its map, and the fold. Not 10,000. Not 40,000. Three.

The first implementation used a linear scan to find dirty nodes. It worked, but stabilization took 27 microseconds at 10,000 symbols. That sounds fast until you multiply it by the event rate. At 100K events per second, you're spending 2.7 seconds per second just on stabilization. The math doesn't work.

So I replaced the linear scan with a min-heap ordered by topological height. Nodes get processed parents-before-children, and only dirty nodes enter the heap. The same stabilization dropped to 250 nanoseconds. That's a 100x improvement from one data structure change.
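The height-ordered idea can be sketched in a few lines. This is a hypothetical simplification, not Ripple's actual engine: a sorted list stands in for the real min-heap, but it processes nodes in the same parents-before-children order, and only dirty nodes are visited.

```ocaml
(* Hypothetical sketch of height-ordered stabilization.  A real engine
   would use a min-heap; sorting the dirty set shows the same order. *)
type node = {
  id : int;
  height : int;              (* topological height: parents < children *)
  mutable dirty : bool;
  recompute : unit -> unit;  (* pulls fresh values from parents *)
}

let stabilize (dirty_nodes : node list) =
  (* Parents run before children, so each node recomputes at most once. *)
  dirty_nodes
  |> List.filter (fun n -> n.dirty)
  |> List.sort (fun a b -> compare a.height b.height)
  |> List.iter (fun n -> n.recompute (); n.dirty <- false)
```

Because only dirty nodes enter the structure, the cost scales with the number of changed nodes, not the size of the graph.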

But the heap alone wasn't enough. The fold node was still O(N) — it re-summed all 10,000 parents on every stabilization, even though only one parent changed. The fix was an incremental fold: track which parents changed during dirty propagation, then subtract the old value and add the new. O(1) per changed parent, regardless of how many parents exist.
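The subtract-old, add-new trick looks like this in miniature. A hedged sketch, assuming the engine reports each changed parent's old and new value during dirty propagation (names are illustrative, not Ripple's API):

```ocaml
(* Incremental sum-fold: maintain a running total instead of re-summing
   every parent on each stabilization. *)
type fold = { mutable total : float }

(* O(1) per changed parent: remove the stale contribution, add the new. *)
let update fold ~old_value ~new_value =
  fold.total <- fold.total -. old_value +. new_value
```

The same pattern works for any invertible aggregate (sum, count, mean); non-invertible ones (min, max) need extra bookkeeping.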

That combination — heap-based propagation plus incremental fold — is what makes the whole thing work.

The delta algebra rabbit hole

Once the graph engine was fast, I needed to figure out how to send changes between distributed nodes. Not full values. Deltas.

This turned into a deeper problem than I expected. If you're sending deltas over a network, and the network can duplicate or reorder messages, your deltas need to be idempotent. Applying the same update twice has to produce the same result as applying it once.

That rules out relative patches like "increment price by 5." You need absolute patches: "set price to 150." It feels wasteful, but it's the only way to get effectively-once semantics without distributed transactions.
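A minimal, hypothetical delta type makes the distinction concrete; the `Incr` variant is shown only to demonstrate the failure mode:

```ocaml
(* Absolute patches survive duplicated delivery; relative patches do not. *)
type delta =
  | Set of float   (* absolute: "set price to v" *)
  | Incr of float  (* relative: "increment by v" -- not idempotent *)

let apply d v = match d with
  | Set x -> x          (* applying twice = applying once *)
  | Incr x -> v +. x    (* a duplicated message double-counts *)
```

A network that redelivers `Set 150.` leaves the price at 150 either way; redelivering `Incr 5.` silently adds 10.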

I ended up with a small algebra:

```
apply(Set(v), _)              = Ok v              -- replacement
apply(d, apply(d, v))         = apply(d, v)       -- idempotence
apply(diff(old, new), old)    = Ok new            -- roundtrip
compose(d, Remove)            = Remove            -- annihilation
compose(d, Set(v))            = Set(v)            -- absorption: Set overwrites
apply(compose(d1,d2), v)      = apply(d2, apply(d1, v))  -- compose/apply compatibility
```

Six laws. Every one of them is verified by property-based tests across thousands of random inputs. If any law breaks, the commit is blocked.

The roundtrip property — apply(diff(old, new), old) = Ok new — is the one that matters most. It means you can always reconstruct the new value from the old value and the delta. This is the foundation of checkpoint and replay.
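A property check for the roundtrip law can be hand-rolled in a few lines. This stands in for a real property-based testing library, and `diff` here is the trivial absolute diff (an assumption), which makes the law hold by construction:

```ocaml
(* Check apply(diff(old, new), old) = new over random inputs. *)
type delta = Set of float
let apply (Set x) _ = x
let diff _old_value new_value = Set new_value

let roundtrip_holds () =
  Random.self_init ();
  let ok = ref true in
  for _ = 1 to 1_000 do
    let old_v = Random.float 1000. and new_v = Random.float 1000. in
    if apply (diff old_v new_v) old_v <> new_v then ok := false
  done;
  !ok
```

A dedicated library adds shrinking and reproducible seeds, but the shape of the test is the same: generate, apply the law, assert.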

The checkpoint/restore discovery

I had a hypothesis: if the graph is deterministic (same inputs always produce same outputs), and deltas are idempotent (retries are safe), then checkpoint/restore should be straightforward. Snapshot the leaf values, save them, and on recovery, restore the leaves and re-stabilize. The compute nodes don't need checkpointing — they'll recompute from their dependencies.
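The leaf-only checkpoint idea can be shown on a toy graph. A hedged sketch under the stated assumption — derived state is a pure function of the leaves — with illustrative names, not Ripple's store API:

```ocaml
(* Derived values recompute from leaves, so only leaves need snapshotting. *)
let leaves = Array.make 4 0.0
let total () = Array.fold_left ( +. ) 0.0 leaves  (* the "compute" layer *)

let checkpoint () = Array.copy leaves             (* snapshot leaf values only *)

let restore snap =
  Array.blit snap 0 leaves 0 (Array.length snap)
  (* "re-stabilize": derived values like total () now recompute
     from the restored leaves on the next read *)
```

The checkpoint is as small as the leaf set, no matter how deep the compute graph above it is.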

I wrote a chaos test to verify. Process 100 events. Crash at a random point. Restore from checkpoint. Continue processing. Compare the final output against an uninterrupted run.

I ran it at 100 different random crash points. All 100 produced the correct output.

That was the moment I knew the architecture was sound. Not because I proved it on paper, but because I tried to break it 100 times and couldn't.
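A miniature version of that chaos test fits in one file. This is a toy model under loud assumptions: three symbols, events are absolute sets (so replay is safe), and a checkpoint is taken after every event — the real test checkpoints less often and crashes a live process:

```ocaml
(* Crash at an arbitrary point, restore the last checkpoint, replay the
   tail, and compare against an uninterrupted run. *)
let run_uninterrupted events =
  let state = Array.make 3 0.0 in
  List.iter (fun (i, v) -> state.(i) <- v) events;
  Array.fold_left ( +. ) 0.0 state

let run_with_crash events ~crash_at =
  let state = Array.make 3 0.0 in
  let snap = ref (Array.copy state) in
  let snap_pos = ref 0 in
  List.iteri (fun pos (i, v) ->
      if pos < crash_at then begin
        state.(i) <- v;
        snap := Array.copy state;   (* checkpoint after every event *)
        snap_pos := pos + 1
      end) events;
  (* crash: discard live state, restore, replay the tail (idempotent sets) *)
  let state = Array.copy !snap in
  List.iteri (fun pos (i, v) -> if pos >= !snap_pos then state.(i) <- v) events;
  Array.fold_left ( +. ) 0.0 state
```

Sweeping `crash_at` over every position and asserting equal outputs is exactly the shape of the 100-crash-point experiment, shrunk to fit on a page.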

The effect injection pattern

One of the less obvious decisions: every source of non-determinism goes through an injectable interface. Time, randomness, I/O — none of it is called directly. There's a module type:

```ocaml
module type EFFECT = sig
  val now : unit -> Time_ns.t
  val random_int : int -> int
end
```

Production uses the live clock. Tests use a deterministic clock that only advances when you tell it to. This means replay is truly deterministic — given the same inputs and the same effect implementation, you get the same outputs. Every time.
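A deterministic implementation might look like the following sketch. Float seconds stand in for `Time_ns.t` to stay stdlib-only, and the linear-congruential generator is an illustrative choice — both are assumptions, not Ripple's code:

```ocaml
(* A test clock that only advances when told to, plus deterministic
   randomness, satisfying an EFFECT-style signature. *)
module type EFFECT = sig
  val now : unit -> float
  val random_int : int -> int
end

module Test_effect = struct
  let clock = ref 0.0
  let now () = !clock
  let advance seconds = clock := !clock +. seconds  (* the test drives time *)

  let seed = ref 42
  let random_int bound =                            (* tiny deterministic LCG *)
    seed := (!seed * 1103515245 + 12345) land 0x3FFFFFFF;
    !seed mod bound
end

let _ = (module Test_effect : EFFECT)  (* confirms the signature is satisfied *)
```

Because `advance` is outside the `EFFECT` signature, production code can never move the clock; only the test harness can.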

This pattern isn't original. Jane Street uses it extensively. But applying it to a distributed system — where you need deterministic replay across multiple nodes after a crash — makes it load-bearing infrastructure, not just a testing convenience.

What actually got built

The final system is 6,200 lines of OCaml across 16 libraries:

  • Graph engine — heap-based stabilization, incremental fold, cutoff optimization
  • Schema layer — type-safe schemas derived from OCaml types, backward/forward compatibility checking
  • Wire protocol — bin_prot serialization with CRC-32C integrity on every message
  • Delta transport — sequence-ordered delivery with gap detection and retransmission
  • Checkpointing — snapshot/restore with pluggable stores (memory, disk, S3)
  • Windowing — tumbling, sliding, session windows with watermark tracking
  • Observability — Prometheus metrics, W3C distributed tracing, graph introspection
  • Coordinator — consistent hashing, partition assignment, failure detection
  • Worker — lifecycle state machine with health endpoints

Three binaries: a VWAP demo pipeline, a worker process, and a CLI.

The numbers, measured not projected:

| What | Measured |
| --- | --- |
| Stabilization at 10K symbols | 250 ns |
| Serde roundtrip | 82 ns |
| VWAP throughput | 2.16M events/sec |
| 6M event replay recovery | 2.1 seconds |
| Heap growth over 1M events | 0.1% |

What I learned building it

Data structures matter more than algorithms. The 100x improvement from linear scan to min-heap wasn't a clever algorithm. It was picking the right data structure for the access pattern. The heap gives you O(R log R) where R is the number of dirty nodes. The linear scan gives you O(N) where N is the total graph. When R is 3 and N is 40,000, that's the whole game.

Algebraic properties are testable contracts. The six delta laws aren't documentation. They're property-based tests that run on every commit. When I accidentally introduced a non-idempotent patch variant (list insertion by index), the tests caught it immediately. The law apply(d, apply(d, v)) = apply(d, v) failed. I removed the variant. The algebra stays clean because the tests enforce it.

Chaos testing builds confidence that proofs can't. I could reason about why checkpoint/restore should work. I could trace through the logic. But running 100 random crash points and seeing 100 correct recoveries — that's a different kind of confidence. It's the difference between believing your parachute works and having jumped with it.

The pre-commit hook is the best decision I made. Every commit runs: build, all 117 tests, and a benchmark regression gate. If stabilization time exceeds 3 microseconds, the commit is blocked. Not a CI notification. Not a Slack alert. The commit literally does not happen. This means the benchmarks in the README are always true. They're not aspirational numbers from a good run six months ago. They're what the code does right now.

The experimentation process

The honest version: this didn't come out clean. The first graph engine was too slow. The first delta type had non-idempotent variants that I had to remove. The first fold was O(N) and I didn't realize it until the benchmark showed 42 microseconds instead of the expected 600 nanoseconds.

Each of those failures taught me something specific:

The slow engine taught me that O(N) scanning is the enemy, even when N feels small. 40,000 nodes at 50 nanoseconds per check is 2 milliseconds. That's invisible in a unit test and fatal at production event rates.

The non-idempotent delta taught me that algebraic properties aren't academic. They're the contract that makes distributed recovery work. If apply(d, apply(d, v)) != apply(d, v), your effectively-once guarantee is a lie.

The O(N) fold taught me to benchmark before trusting projections. I projected 600 nanoseconds. I measured 42,000. The projection was based on heap overhead per node. The measurement included the fold re-scanning every parent. The number you measure is the number that matters.

The beautiful part of this process is that each failure narrowed the design space. By the time I had the heap, the incremental fold, and the idempotent deltas, the architecture was almost inevitable. Not because I designed it top-down, but because the experiments eliminated everything else.

Try it

The whole thing is open source under MIT.

There's a live simulation on the landing page where you can watch the graph work — 50 symbols, trades arriving, only the affected path lighting up while everything else stays dark.

Landing page: https://copyleftdev.github.io/ripple/

Source: https://github.com/copyleftdev/ripple

```shell
git clone https://github.com/copyleftdev/ripple.git
cd ripple
make build
make demo    # 2M+ events/sec VWAP pipeline
make test    # 117 inline + property + load + chaos tests
```

If you work on trading systems, real-time analytics, or any pipeline where you're recomputing more than you should — take a look. The core insight is simple: track dependencies, propagate only what changed, make deltas idempotent. The rest is engineering.
