I'm currently building a real-time collaborative whiteboard. Think of it as Figma's infinite canvas, but focused on stylus input and handwriting. Multiple users draw simultaneously, offline sessions sync on reconnect, and every stroke appears on everyone's screen without conflicts.
Sounds simple. It isn't.
After evaluating existing CRDT libraries, none of them modeled the domain correctly — they're built for text editors, not vector graphics. So I built vectis-crdt: a Rust library that compiles to both native and WebAssembly, with the server also consuming it in Rust.
This is the story of the reasoning behind each design decision.
The problem: three requirements that pull in opposite directions
A real-time collaborative whiteboard has three fundamental constraints:
- Immediate local responsiveness: every stylus touch must appear on screen before any server round-trip. 80ms of latency between pen-down and pixel-rendered breaks the experience.
- Eventual convergence: two clients that have applied the same set of operations — in any order — must end up with identical visible state.
- No conflicts: a whiteboard has no "conflicts" to resolve. Two users drawing simultaneously both draw. You never show a conflict dialog to someone holding a stylus.
The classic solution is CRDTs (Conflict-free Replicated Data Types). But which flavor?
Why RGA + YATA, not OT or simple counters
I evaluated three approaches:
Operational Transformation (OT) — Used by Google Docs. Requires a central server to sequence all operations before distributing them. That kills offline support and adds latency on every stroke.
State-based CRDTs (sets, counters) — Simple but wrong for this domain. A whiteboard needs ordered strokes (z-order). "Stroke B is drawn on top of stroke A" is fundamental semantics. A set has no order; a counter has no identity.
RGA (Replicated Growable Array) + YATA — This is what Yjs uses internally for text. It maintains a sequence with a total deterministic order for concurrent insertions. I adapted it: instead of characters, each slot holds a stroke reference.
The key insight: the array is small. A whiteboard has hundreds of strokes, not millions of characters. This allows simpler data structures — a Vec with a HashMap index rather than a tree — without any performance penalty.
Before diving into the code, let's map out the mental model. At its core, this engine relies on three pillars: a way to uniquely identify every single action across space and time (OpId and Vector Clocks), a strict set of rules to resolve order when users draw simultaneously (the YATA algorithm), and a mechanism to clean up deleted data without corrupting the document's history (Garbage Collection). Let's start from the bottom up.
The base layer: OpId and Vector Clocks
Every operation in the system gets a globally unique identifier:
pub struct OpId {
pub lamport: LamportTs, // monotonically increasing logical clock
pub actor: ActorId, // u64 — compact wire representation of a peer
}
ActorId is a u64 assigned by the server on first connection, not a UUID. This saves at least 8 bytes per reference on the wire (more once LEB128 varint encoding kicks in) — significant when every stroke carries three of them (id, origin_left, origin_right).
The ordering on OpId is total and deterministic:
impl Ord for OpId {
fn cmp(&self, other: &Self) -> Ordering {
self.lamport.0.cmp(&other.lamport.0)
.then_with(|| self.actor.0.cmp(&other.actor.0))
}
}
Higher Lamport wins; on tie (concurrent operations), higher ActorId wins. This is the tiebreaker that makes the CRDT converge when two users draw at the exact same logical moment.
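For the skeptical reader, here's the ordering as a self-contained runnable snippet, with `LamportTs` and `ActorId` inlined as plain `u64`s for brevity:

```rust
use std::cmp::Ordering;

#[derive(PartialEq, Eq, Debug)]
struct OpId {
    lamport: u64, // LamportTs inlined as a plain u64
    actor: u64,   // ActorId inlined as a plain u64
}

impl Ord for OpId {
    fn cmp(&self, other: &Self) -> Ordering {
        self.lamport
            .cmp(&other.lamport)
            .then_with(|| self.actor.cmp(&other.actor))
    }
}

impl PartialOrd for OpId {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let alice = OpId { lamport: 5, actor: 1 };
    let bob = OpId { lamport: 5, actor: 2 }; // concurrent: same lamport
    let later = OpId { lamport: 6, actor: 1 };
    assert!(alice < bob); // tie on lamport: higher actor wins
    assert!(bob < later); // higher lamport always wins
}
```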
For causality tracking, each peer maintains a Vector Clock:
pub struct VectorClock {
clocks: BTreeMap<ActorId, u64>, // actor → max lamport seen from that actor
}
The vector clock drives three features: causal delivery, delta synchronization, and garbage collection. It's the single most important data structure in the system.
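A minimal sketch of the vector clock operations the rest of the post leans on. The method names (`get`, `observe`, `dominates`) are assumptions here, not necessarily the library's exact API:

```rust
use std::collections::BTreeMap;

#[derive(Default, Debug)]
struct VectorClock {
    clocks: BTreeMap<u64, u64>, // actor → max lamport seen from that actor
}

impl VectorClock {
    fn get(&self, actor: u64) -> u64 {
        self.clocks.get(&actor).copied().unwrap_or(0)
    }
    // Record that we've seen this actor's ops up to `lamport`.
    fn observe(&mut self, actor: u64, lamport: u64) {
        let seen = self.clocks.entry(actor).or_insert(0);
        *seen = (*seen).max(lamport);
    }
    // True if `self` has seen everything `other` has seen.
    fn dominates(&self, other: &VectorClock) -> bool {
        other.clocks.iter().all(|(&actor, &ts)| self.get(actor) >= ts)
    }
}

fn main() {
    let mut alice = VectorClock::default();
    let mut bob = VectorClock::default();
    alice.observe(1, 3);
    bob.observe(1, 3);
    bob.observe(2, 5);
    assert!(bob.dominates(&alice));  // Bob has seen everything Alice has
    assert!(!alice.dominates(&bob)); // ...but not vice versa
}
```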
Operation lifecycle
Before diving into each component, it's worth seeing how they fit together:
insert_stroke()
│
├─ simplify(epsilon) ← RDP reduces 500 pts → ~40 pts
│
├─ tick LamportTs ← generates new OpId
│
├─ RgaArray::integrate() ← places the stroke in z-order
│
├─ StrokeStore::insert() ← stores points + LWW properties
│
└─ pending_ops.push() ← queues for sending to server
│
▼
encode_update() → wire (LEB128 + f32 LE)
│
▼
peer receives → CausalBuffer → apply_remote()
Each stage is independent and testable. The separation between RgaArray (only 16-byte references) and StrokeStore (actual point data) is deliberate: it keeps the working set for conflict integration in L1/L2 cache.
The core CRDT: the YATA integration algorithm
The RGA array maintains the z-ordering of strokes. Each item stores its insertion context:
pub struct RgaItem {
pub id: OpId,
pub origin_left: OpId, // item to the left at insert time
pub origin_right: OpId, // item to the right at insert time
pub content: StrokeId,
pub state: ItemState, // Active or Tombstone { deleted_at: OpId }
}
The genius of YATA is how it resolves concurrent insertions. Imagine Alice and Bob both insert a stroke at the same position simultaneously:
- Alice inserts `A` with `origin_left = X`
- Bob inserts `B` with `origin_left = X`
Without a rule, the order would depend on which operation arrives first — breaking convergence. The YATA rule:
The YATA Rule: Within items sharing the same `origin_left`, items whose own `origin_left` is to the right of ours belong to the "right subtree" and are skipped. Among the remaining items in the same zone, higher OpId goes further left.
The implementation:
pub fn integrate(&mut self, item: RgaItem) {
// Idempotent: if already present, do nothing.
if self.index.contains_key(&item.id) { return; }
let scan_start = /* position after origin_left */;
let scan_end = /* position of origin_right */;
let origin_left_pos = /* position of our origin_left */;
let mut insert_pos = scan_start;
for i in scan_start..scan_end {
let existing = &self.items[i];
let existing_ol_pos = /* position of existing.origin_left */;
if existing_ol_pos < origin_left_pos {
break; // passed our zone
} else if existing_ol_pos > origin_left_pos {
insert_pos = i + 1; // skip right-subtree item
} else {
// Same zone: higher OpId → further left
if existing.id > item.id {
insert_pos = i + 1;
} else {
break;
}
}
}
self.items.insert(insert_pos, item);
self.rebuild_index_from(insert_pos);
}
This is O(k) where k is the number of concurrent conflicting operations at that position — typically 1 or 2 in practice, O(n) worst case.
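To watch the same-zone tiebreaker converge, here is a deliberately stripped-down model. Every item shares the same `origin_left` (the root), so the rule collapses to keeping same-zone siblings in descending OpId order; this is a sketch of the convergence property, not the full integration algorithm:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct OpId {
    lamport: u64,
    actor: u64,
}

// Simplified integrate: all items are same-zone siblings of the root,
// so "higher OpId goes further left" means keeping descending order.
fn integrate(items: &mut Vec<OpId>, item: OpId) {
    if items.contains(&item) {
        return; // idempotent: re-delivery is a no-op
    }
    let pos = items.iter().position(|e| *e < item).unwrap_or(items.len());
    items.insert(pos, item);
}

fn main() {
    let a = OpId { lamport: 1, actor: 1 }; // Alice's concurrent insert
    let b = OpId { lamport: 1, actor: 2 }; // Bob's concurrent insert

    // Peer 1 receives Alice's op first, peer 2 receives Bob's first.
    let mut peer1 = Vec::new();
    integrate(&mut peer1, a);
    integrate(&mut peer1, b);

    let mut peer2 = Vec::new();
    integrate(&mut peer2, b);
    integrate(&mut peer2, a);

    // Both peers end with the same order regardless of arrival order.
    assert_eq!(peer1, peer2);
}
```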
Deletions: tombstones and why you can't just remove items
In a distributed system, you can't immediately remove a deleted item from the array. The problematic scenario:
- Alice has `[A, B, C]`. She deletes B.
- Bob, offline, inserts D after B. Bob has `[A, B, D, C]`.
- Bob reconnects and his insert arrives.
- If we'd already erased B from Alice's array, D's `origin_left = B.id` would be unresolvable.
The solution: tombstones. Deleted items remain in the array with state Tombstone { deleted_at: OpId }, invisible to the application but still present for conflict resolution.
This means the array can grow unbounded over time — which leads to garbage collection.
Causal delivery: the CausalBuffer
Operations can arrive out of order over WebSocket. If InsertStroke(B, origin_left=A.id) arrives before InsertStroke(A), applying B immediately would place it at the wrong z-order position.
The CausalBuffer holds not-yet-ready operations and retries them every time a new operation is successfully applied. The non-obvious part is that this can trigger a cascade: applying A unblocks B, which in turn unblocks C and D that were waiting on B. A single operation can free an entire chain.
fn apply_remote_buffered(&mut self, op: Operation) {
    self.pending.push(op);
    // Loop until no more ops can be unblocked
    loop {
        let before = self.pending.len();
        // Split ready ops out first, then apply them. (Calling
        // `self.doc.apply_remote` inside a `retain` closure would
        // borrow `self` mutably twice and not compile.)
        let (ready, waiting): (Vec<_>, Vec<_>) = std::mem::take(&mut self.pending)
            .into_iter()
            .partition(|op| is_causally_ready(op, &self.doc));
        self.pending = waiting;
        for op in ready {
            self.doc.apply_remote(op);
        }
        if self.pending.len() == before { break; } // nothing new unblocked
    }
}
fn is_causally_ready(op: &Operation, doc: &Document) -> bool {
match op {
Operation::InsertStroke { origin_left, .. } =>
origin_left.is_zero() || doc.stroke_order.index.contains_key(origin_left),
Operation::DeleteStroke { target, .. } =>
doc.stroke_order.index.contains_key(target),
Operation::UpdateProperty { target, .. } =>
doc.stroke_store.contains(target),
Operation::UpdateMetadata { .. } => true,
}
}
The buffer has a hard limit of 10,000 operations. If exceeded, the client requests a full snapshot from the server rather than trying to recover — a safer failure mode than OOM.
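The cascade is easier to see in a toy model: a set of applied ids stands in for the document, and each op has at most one dependency. This isn't the real `Operation` type, just the shape of the retry loop:

```rust
use std::collections::HashSet;

#[derive(Clone, Debug, PartialEq)]
struct Op {
    id: u32,
    depends_on: Option<u32>, // stand-in for origin_left / target
}

fn apply_buffered(applied: &mut HashSet<u32>, pending: &mut Vec<Op>, op: Op) {
    pending.push(op);
    // Retry the whole buffer until a pass unblocks nothing new.
    loop {
        let before = pending.len();
        let mut still_waiting = Vec::new();
        for op in pending.drain(..) {
            let ready = op.depends_on.map_or(true, |d| applied.contains(&d));
            if ready {
                applied.insert(op.id); // "apply" the op
            } else {
                still_waiting.push(op);
            }
        }
        *pending = still_waiting;
        if pending.len() == before { break; }
    }
}

fn main() {
    let mut applied = HashSet::new();
    let mut pending = Vec::new();
    // C and B arrive before their dependencies: both sit in the buffer.
    apply_buffered(&mut applied, &mut pending, Op { id: 3, depends_on: Some(2) });
    apply_buffered(&mut applied, &mut pending, Op { id: 2, depends_on: Some(1) });
    assert_eq!(applied.len(), 0);
    // A arrives with no dependency: the whole chain drains in one call.
    apply_buffered(&mut applied, &mut pending, Op { id: 1, depends_on: None });
    assert!(applied.contains(&1) && applied.contains(&2) && applied.contains(&3));
    assert!(pending.is_empty());
}
```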
Incremental Garbage Collection with re-parenting
Tombstones are safe to remove only when all known peers have seen the deletion — a condition called "causal stability". The system tracks this via the Minimum Version Vector (MVV): the server periodically broadcasts the pointwise minimum of all known vector clocks.
A tombstone T with deleted_at = op is GC-eligible when:
mvv.get(op.actor) >= op.lamport
Incremental GC runs in bounded cycles (max_gc_per_cycle = 5,000 items) to avoid long pauses.
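A sketch of the stability check, with plain `BTreeMap<u64, u64>` standing in for vector clocks and `(actor, lamport)` tuples for `OpId`; the shapes are assumptions, the pointwise minimum is the point:

```rust
use std::collections::BTreeMap;

// The MVV is the pointwise minimum over all peers' vector clocks;
// an actor absent from any one clock has minimum 0.
fn minimum_version_vector(clocks: &[BTreeMap<u64, u64>]) -> BTreeMap<u64, u64> {
    let mut mvv: BTreeMap<u64, u64> = BTreeMap::new();
    // Collect every actor seen anywhere...
    for clock in clocks {
        for &actor in clock.keys() {
            mvv.entry(actor).or_insert(u64::MAX);
        }
    }
    // ...then take that actor's minimum across all clocks.
    for (actor, min_ts) in mvv.iter_mut() {
        for clock in clocks {
            *min_ts = (*min_ts).min(clock.get(actor).copied().unwrap_or(0));
        }
    }
    mvv
}

// A tombstone deleted at (actor, lamport) is GC-eligible once every
// known peer has seen that deletion.
fn is_gc_eligible(mvv: &BTreeMap<u64, u64>, deleted_at: (u64, u64)) -> bool {
    let (actor, lamport) = deleted_at;
    mvv.get(&actor).copied().unwrap_or(0) >= lamport
}

fn main() {
    let peers = vec![
        BTreeMap::from([(1u64, 5u64), (2, 3)]), // Alice's clock
        BTreeMap::from([(1, 4), (2, 7)]),       // Bob's clock
    ];
    let mvv = minimum_version_vector(&peers);
    assert_eq!(mvv, BTreeMap::from([(1, 4), (2, 3)]));
    assert!(is_gc_eligible(&mvv, (1, 4)));  // everyone has seen op (1, 4)
    assert!(!is_gc_eligible(&mvv, (2, 4))); // Alice hasn't seen (2, 4) yet
}
```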
The critical bug without re-parenting
Here's the problem I almost missed. Consider this state:
Array: [A] → [B, active] → [C, active]
C.origin_left = B.id
B gets deleted (tombstone). Eventually, GC removes it when causally stable. Now C.origin_left points to an ID that no longer exists in the array.
When the snapshot is serialized and reconstructed on another peer, the op sequence is: InsertStroke(A), InsertStroke(C, origin_left=B.id). But B is not in the snapshot because it was GC'd. C has nowhere to anchor → it gets inserted at the end. Z-order is corrupted.
The fix: re-parenting before retain. Before physically erasing tombstones, the GC walks surviving items whose origin_left points to an ID in the remove_set, and finds the nearest surviving ancestor:
fn find_kept_ancestor(mut origin: OpId, remove_set: &HashSet<OpId>, ...) -> OpId {
for _ in 0..MAX_DEPTH {
if !remove_set.contains(&origin) { return origin; }
origin = items[index[&origin]].origin_left;
}
OpId::ZERO // attach to root if the entire chain was GC'd
}
The multi-hop case: A → B(deleted) → C(deleted) → D(alive). GC of B and C re-parents D directly to A. This is deterministic: two peers with the same MVV produce exactly the same re-parented state. Convergence is preserved.
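The walk itself fits in a few lines. This sketch uses plain `u32` ids and a map of `origin_left` pointers in place of the real index:

```rust
use std::collections::{HashMap, HashSet};

const MAX_DEPTH: usize = 1_000;

// Walk origin_left pointers until we find an id that survives GC.
// Id 0 stands in for the root (OpId::ZERO).
fn find_kept_ancestor(
    mut origin: u32,
    remove_set: &HashSet<u32>,
    origin_left_of: &HashMap<u32, u32>,
) -> u32 {
    for _ in 0..MAX_DEPTH {
        if !remove_set.contains(&origin) {
            return origin;
        }
        origin = *origin_left_of.get(&origin).unwrap_or(&0);
    }
    0 // attach to root if the entire chain was GC'd
}

fn main() {
    // Chain A(1) ← B(2) ← C(3) ← D(4); B and C are GC'd tombstones.
    let origin_left_of: HashMap<u32, u32> = [(2u32, 1u32), (3, 2), (4, 3)].into();
    let remove_set: HashSet<u32> = [2u32, 3].into();
    // D's origin_left was C; after re-parenting it anchors directly to A.
    assert_eq!(find_kept_ancestor(3, &remove_set, &origin_left_of), 1);
}
```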
Mutable properties: LWW-Registers per field
Strokes have mutable properties: color, stroke width, opacity, transform. If Alice changes the color while Bob changes the opacity, both changes must survive — not conflict.
The solution: each property is an independent Last-Write-Wins Register:
pub struct StrokeProperties {
pub color: LwwRegister<u32>,
pub stroke_width: LwwRegister<f32>,
pub opacity: LwwRegister<f32>,
pub transform: LwwRegister<Transform2D>,
}
impl<T: Clone> LwwRegister<T> {
pub fn apply(&mut self, value: T, timestamp: OpId) -> bool {
if timestamp > self.timestamp {
self.value = value;
self.timestamp = timestamp;
true
} else { false }
}
}
Color change and opacity change use different registers → both survive. Concurrent color changes by two users → higher timestamp wins. On equal timestamps (same Lamport), the higher ActorId wins — the same deterministic tiebreaker the RGA uses.
An honest note on LWW: the winner is the one with the higher OpId, not the most recent by wall-clock time. If Alice writes color=red at t=5 and Bob writes color=blue also at t=5 but with a higher ActorId, Alice's red is lost even if she was "the last one" from her perspective. For aesthetic stroke properties, this semantics is acceptable.
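Here's that tiebreak as a runnable snippet, with `OpId` inlined so it compiles standalone:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct OpId {
    lamport: u64,
    actor: u64,
}

struct LwwRegister<T> {
    value: T,
    timestamp: OpId,
}

impl<T: Clone> LwwRegister<T> {
    // Strictly-greater comparison: re-applying the current winner is a no-op,
    // which keeps the register idempotent under redelivery.
    fn apply(&mut self, value: T, timestamp: OpId) -> bool {
        if timestamp > self.timestamp {
            self.value = value;
            self.timestamp = timestamp;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut color = LwwRegister {
        value: "black",
        timestamp: OpId { lamport: 0, actor: 0 },
    };
    // Alice (actor 1) and Bob (actor 2) write concurrently at lamport 5.
    color.apply("red", OpId { lamport: 5, actor: 1 });
    color.apply("blue", OpId { lamport: 5, actor: 2 }); // higher actor wins the tie
    assert_eq!(color.value, "blue");
    // A replayed older write changes nothing.
    assert!(!color.apply("red", OpId { lamport: 5, actor: 1 }));
}
```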
Delta synchronization
When a client reconnects after being offline, you don't want to send the entire document history — just the operations the client hasn't seen yet.
The Vector Clock diff method computes exactly this:
pub fn diff(&self, other: &VectorClock) -> Vec<(ActorId, u64, u64)> {
    // Returns (actor, from_ts, to_ts) ranges
    // where `self` has seen more than `other`
    self.clocks.iter()
        .filter(|&(&actor, &my_ts)| my_ts > other.get(actor))
        .map(|(&actor, &my_ts)| (actor, other.get(actor) + 1, my_ts))
        .collect()
}
Client sends its vector clock → server computes diff → server sends only the missing operations. O(actors) to compute, not O(operations). For a document with 50,000 operations of history and a client that missed 200, those 200 ops are transmitted — not 50,000.
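The same diff, runnable, with a plain `BTreeMap<u64, u64>` standing in for `VectorClock`:

```rust
use std::collections::BTreeMap;

fn get(clock: &BTreeMap<u64, u64>, actor: u64) -> u64 {
    clock.get(&actor).copied().unwrap_or(0)
}

// (actor, from_ts, to_ts) ranges where `mine` has seen more than `other`.
fn diff(mine: &BTreeMap<u64, u64>, other: &BTreeMap<u64, u64>) -> Vec<(u64, u64, u64)> {
    mine.iter()
        .filter(|&(&actor, &my_ts)| my_ts > get(other, actor))
        .map(|(&actor, &my_ts)| (actor, get(other, actor) + 1, my_ts))
        .collect()
}

fn main() {
    // Server has seen ops 1..=120 from actor 1 and 1..=40 from actor 2.
    let server: BTreeMap<u64, u64> = [(1, 120), (2, 40)].into();
    // Reconnecting client last saw 100 from actor 1, all of actor 2.
    let client: BTreeMap<u64, u64> = [(1, 100), (2, 40)].into();
    // Only the 20 missing ops from actor 1 are identified.
    assert_eq!(diff(&server, &client), vec![(1, 101, 120)]);
}
```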
A production detail that matters: a peer disconnected for longer than the gc_grace_period may hold references to already-GC'd tombstones. On reconnect, it must receive a full snapshot instead of a delta — the server needs to detect this condition by comparing the client's vector clock against the current MVV.
Stroke simplification: Ramer-Douglas-Peucker at insert time
A stylus at 240Hz produces one point every ~4ms. A 3-second stroke = ~720 raw points. Storing and transmitting all of them is wasteful — the human eye can't perceive the difference at normal zoom levels.
I implemented the Ramer-Douglas-Peucker algorithm, applied automatically at insert time:
pub fn insert_stroke(&mut self, mut data: StrokeData, props: StrokeProperties) -> StrokeId {
if self.simplify_epsilon > 0.0 && data.points.len() > 2 {
data.simplify(self.simplify_epsilon); // in-place, before storing
}
// ...
}
The RDP implementation uses an iterative stack rather than recursion — no stack overflow risk for 50k-point strokes:
fn rdp_indices(points: &[StrokePoint], epsilon: f32) -> Vec<usize> {
    let n = points.len();
    let mut stack: Vec<(usize, usize)> = Vec::with_capacity(64);
    stack.push((0, n - 1));
    while let Some((start, end)) = stack.pop() {
        // find the point with max perpendicular distance in (start, end);
        // if it exceeds epsilon: keep that point and push both halves
    }
    // collect the kept indices
}
Typical reduction at epsilon = 0.5:
| Stroke type | Original pts | With epsilon=0.5 | Reduction |
|---|---|---|---|
| Straight line | 500 | 2 | 99.6% |
| Smooth curve | 500 | 25–60 | ~90% |
| Calligraphic signature | 500 | 80–150 | ~75% |
Simplification happens before storing in the store and before emitting the network op — the remote peer receives the already-simplified stroke.
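For completeness, here's a full iterative RDP you can actually run, simplified to `(f32, f32)` points without pressure or tilt data:

```rust
// Perpendicular distance from p to the chord a→b.
fn perp_dist(p: (f32, f32), a: (f32, f32), b: (f32, f32)) -> f32 {
    let (dx, dy) = (b.0 - a.0, b.1 - a.1);
    let len = (dx * dx + dy * dy).sqrt();
    if len == 0.0 {
        // Degenerate chord: fall back to point-to-point distance.
        return ((p.0 - a.0).powi(2) + (p.1 - a.1).powi(2)).sqrt();
    }
    (dy * (p.0 - a.0) - dx * (p.1 - a.1)).abs() / len
}

// Iterative RDP over an explicit stack; assumes at least 2 points.
fn rdp_indices(points: &[(f32, f32)], epsilon: f32) -> Vec<usize> {
    let n = points.len();
    let mut keep = vec![false; n];
    keep[0] = true;
    keep[n - 1] = true;
    let mut stack = vec![(0usize, n - 1)];
    while let Some((start, end)) = stack.pop() {
        // Find the point farthest from the chord start..end.
        let mut max_d = 0.0;
        let mut max_i = start;
        for i in start + 1..end {
            let d = perp_dist(points[i], points[start], points[end]);
            if d > max_d {
                max_d = d;
                max_i = i;
            }
        }
        if max_d > epsilon {
            keep[max_i] = true;
            stack.push((start, max_i));
            stack.push((max_i, end));
        }
    }
    (0..n).filter(|&i| keep[i]).collect()
}

fn main() {
    // 500 points on a straight line collapse to just the two endpoints.
    let line: Vec<(f32, f32)> = (0..500).map(|i| (i as f32, 2.0 * i as f32)).collect();
    assert_eq!(rdp_indices(&line, 0.5), vec![0, 499]);
}
```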
Zero-copy Wasm rendering: bypassing the JS boundary
The rendering hot path must be fast. Every animation frame, the canvas engine needs all visible stroke data. Crossing the JS↔Wasm boundary with individual function calls is expensive (~100–200ns each).
The solution: pack all visible strokes into a single contiguous buffer in Wasm linear memory, then hand JS a raw pointer:
#[wasm_bindgen]
pub fn build_render_data_viewport(
&mut self, vx0: f32, vy0: f32, vx1: f32, vy1: f32, stroke_expand: f32
) -> *const u8 {
let viewport = Aabb { min_x: vx0, min_y: vy0, max_x: vx1, max_y: vy1 };
self.render_buf.clear(); // reuse buffer — no alloc
for id in self.inner.visible_stroke_ids() {
if let Some((data, props)) = self.inner.get_stroke(&id) {
let bounds = if props.transform.value.is_identity() {
data.bounds.expanded(stroke_expand)
} else {
data.bounds.transform(&props.transform.value).expanded(stroke_expand)
};
if !bounds.intersects(&viewport) { continue; } // AABB culling
write_stroke_to_buf(&mut self.render_buf, &id, data, props);
}
}
self.render_buf.as_ptr()
}
In JavaScript:
const ptr = doc.build_render_data_viewport(vx0, vy0, vx1, vy1, expand);
const len = doc.get_render_data_len();
const view = new DataView(wasmInstance.memory.buffer, ptr, len);
// Direct read from Wasm memory — zero copies, zero allocations on the JS side
render_buf is reused across frames with clear() instead of alloc. The pointer is valid only until the next operation that mutates the buffer — the JS client must read all data in the same frame before any mutation.
AABB culling skips strokes outside the viewport without iterating their points. For a whiteboard with 5,000 strokes but only 200 visible in the current viewport, the difference is substantial.
Wire format: LEB128 and encoding decisions
The binary protocol uses unsigned LEB128 for all integers and IEEE 754 little-endian for floats. Some non-obvious decisions:
Why u64 instead of UUID for ActorId: in LEB128, an ActorId < 2^14 takes 2 bytes. A UUID always takes 16 bytes. Each operation carries three OpIds (id, origin_left, origin_right): 6 bytes vs 48 bytes per stroke just in identifiers.
Why fixed LE for cursors instead of LEB128: awareness cursors are sent in bulk (N × 28 bytes) and decoded via chunks_exact(28) without parsing. The predictability of the fixed format compensates for the potential inefficiency for small ActorIds — cursors are not persisted and decode speed on the hot path matters more than size.
Optional LZ4 for updates > 200 bytes: stroke points compress well because they are sequential coordinates with small deltas. An InsertStroke of 100 points goes from ~1,200 bytes to ~900 bytes with LZ4. The 200-byte threshold avoids compressing small ops where the LZ4 header overhead would exceed the savings.
A typical InsertStroke of 40 simplified points takes ~560 bytes on the wire uncompressed.
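The varint claim is easy to verify with a minimal unsigned LEB128 encoder (7 payload bits per byte, high bit as a continuation flag):

```rust
// Unsigned LEB128: emit 7 bits at a time, low bits first;
// the high bit of each byte signals "more bytes follow".
fn leb128_encode(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn main() {
    let mut buf = Vec::new();
    leb128_encode(300, &mut buf); // an ActorId < 2^14 fits in 2 bytes
    assert_eq!(buf, vec![0xAC, 0x02]);
    buf.clear();
    leb128_encode(100, &mut buf); // ids < 2^7 fit in a single byte
    assert_eq!(buf, vec![100]);
}
```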
Local undo
Undo on a collaborative whiteboard is subtle. The naive approach — "revert to previous state" — is wrong because you can't undo what other users did. The correct semantic: undo generates a delete operation that is broadcast to all peers.
pub fn undo_last_stroke(&mut self) -> Option<StrokeId> {
while let Some(id) = self.undo_stack.pop() {
if self.delete_stroke(id) { // generates a DeleteStroke in pending_ops
return Some(id);
}
// Stroke was already deleted by a remote peer → skip, try the previous one
}
None
}
The undo stack only tracks the local actor's strokes. The stack is session-only (not persisted in snapshots). The cap is 200 entries — enough for any reasonable undo sequence.
Defensive limits and error philosophy
The library enforces hard limits to prevent resource exhaustion from malformed or malicious payloads:
pub const MAX_POINTS_PER_STROKE: usize = 50_000; // 240Hz × ~3.5min without simplification
pub const MAX_STROKES: usize = 100_000; // ~8 MB RGA memory
pub const MAX_ACTORS: usize = 10_000; // bounds the VectorClock BTreeMap
The philosophy is dual, and intentional:
- `decode_*` paths (untrusted external data): return `Error`/`None` when limits are exceeded. The error propagates to the caller — malformed data is rejected before any allocation.
- `apply_remote` (already-parsed ops from remote peers): silent drop. The CRDT is designed to be tolerant; dropping a remote op is preferable to OOM. If the document already has `MAX_STROKES`, the new one simply isn't inserted.
Local limits are more permissive: insert_stroke (trusted local operation) has no point count limit — auto-simplification with RDP epsilon=0.5 reduces 50k points to ~500 before storing.
Formal guarantees
| Property | Guarantee |
|---|---|
| Strong Eventual Consistency | Two replicas with the same operation set have identical `visible_stroke_ids()`, regardless of application order. |
| Idempotency | `apply_remote(op)` called twice is equivalent to once. Safe with WebSocket redelivery. |
| Commutativity | Application order of concurrent ops doesn't change final state. |
| GC Safety | Only causally stable tombstones are removed. No operation that any known peer might still need is GC'd prematurely. |
| Snapshot-replay equivalence | `decode_snapshot(encode_snapshot(doc))` produces the same visible state as `doc`. Z-order and properties are identical. |
What vectis-crdt does NOT guarantee
Equally important:
| Limitation | Description |
|---|---|
| GC without MVV | GC requires the server to compute and broadcast the MVV. Without a server, GC cannot run safely in a pure P2P scenario. |
| LWW wall-clock consistency | If two peers modify the same property offline, the winner is the one with the higher OpId — not the "most recent by system clock". |
| Liveness under indefinite partition | If two peers are disconnected indefinitely, they don't converge until they reconnect. This is inherent to CRDTs without coordination. |
Build output
cargo build --release --target wasm32-unknown-unknown
wasm-opt -O3 -o vectis_crdt_bg.wasm vectis_crdt_bg.wasm
gzip -9 vectis_crdt_bg.wasm
Result: ~85 KB gzipped. Fits comfortably in a single HTTP/2 push alongside the app bundle.
Conclusion
Building vectis-crdt showed me that domain-specific CRDTs are worth the investment. A general-purpose CRDT library would have forced the whiteboard domain to adapt to the library's model. Instead, the model adapts to the domain:
- Strokes, not characters, are the unit of the array.
- Z-ordering is a first-class concept, not an afterthought.
- Simplification, viewport culling, and awareness are built in, not bolted on.
The YATA algorithm gives convergence. Vector clocks give causal consistency. The MVV gives safe GC. The Wasm bridge gives zero-copy rendering. Each piece is independently verifiable.
References and further reading
The algorithms and data structures in vectis-crdt have solid academic foundations. If you want to go deeper into the theory behind each decision:
RGA — Replicated Growable Array
Roh, H. G., Jeon, M., Kim, J. S., & Lee, J. (2011). Replicated abstract data types: Building blocks for collaborative applications. Journal of Parallel and Distributed Computing, 71(3), 354–368.
The original paper defining the replicated array model with origin_left and the integration algorithm.
YATA — Yet Another Transformation Approach
Nicolaescu, P., Jahns, K., Derntl, M., & Klamma, R. (2016). Near Real-Time Peer-to-Peer Shared Editing on Extensible Data Types. ECSCW 2016.
Introduces origin_right and the right-subtree correction that fixes the interleaving cases that plain RGA doesn't handle. The basis of Yjs's integration algorithm.
Lamport Clocks
Lamport, L. (1978). Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7), 558–565.
The foundational paper on logical clocks. Defines the "happens-before" relation and the monotonic clock construction used in LamportTs.
Vector Clocks
Fidge, C. J. (1988). Timestamps in message-passing systems that preserve the partial ordering. Proceedings of the 11th Australian Computer Science Conference.
Mattern, F. (1989). Virtual time and global states of distributed systems. Parallel and Distributed Algorithms.
Two independent and simultaneous papers that formalize vector clocks. Foundation of VectorClock::dominates and delta synchronization.
CRDTs — theoretical framework
Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). Conflict-free Replicated Data Types. INRIA Research Report RR-7687.
Shapiro, M., et al. (2011). A comprehensive study of Convergent and Commutative Replicated Data Types. SSS 2011.
The two reference papers that formalize CRDTs, distinguish state-based from op-based, and establish the mathematical conditions for convergence.
Ramer-Douglas-Peucker
Ramer, U. (1972). An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1(3), 244–256.
Douglas, D. H., & Peucker, T. K. (1973). Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Cartographica, 10(2), 112–122.
The two original papers (published independently) of the polyline simplification algorithm used in StrokeData::simplify.
Reference implementations
- Yjs — mature YATA implementation for collaborative editing in JS, by Kevin Jahns (co-author of the YATA paper)
- Automerge — general-purpose CRDT with a Rust backend
- diamond-types — high-performance reference implementation of RGA in Rust, by Joseph Gentle
vectis-crdt:
GitHub · crates.io · npm
Tags: rust, crdt, distributed-systems, webassembly, collaborative
