DEV Community

Rizwan Saleem
Rizwan Saleem

Posted on

Building a resilient data synchronization layer with CRDTs and a practical operational surface

Building a resilient data synchronization layer with CRDTs and a practical operational surface

Building a resilient data synchronization layer with CRDTs and a practical operational surface

Data synchronization is a perennial challenge in distributed apps: how to keep copies in different services consistent, handle conflicting updates gracefully, and do it without sacrificing availability. In this tutorial, you’ll build a practical, end-to-end data synchronization layer using Conflict-free Replicated Data Types (CRDTs) paired with a simple operational surface for monitoring, debugging, and observability. The goal is to give you a concrete pattern you can adapt to real-world apps-web, mobile, or edge-without requiring exotic infrastructure.

Key ideas you'll take away

  • How CRDTs enable eventual consistency without coordination bottlenecks.
  • A pragmatic data model and API surface that supports multi-region sync, offline edits, and conflict resolution.
  • An implementation pattern with a local store, a sync protocol, and a central event log.
  • Methods for observability, testing, and rollback in a CRDT-based system.

Prerequisites

  • Familiarity with JavaScript/TypeScript or a similar language.
  • Basic understanding of distributed systems concepts (latency, availability, partition tolerance).
  • Knowledge of data structures (maps, sets) and a simple REST or WebSocket-based API.

Overview of the approach

  • Use a CRDT to represent domain data locally. We’ll implement a simple OR-Set (Observed-Removed Set) CRDT for a collaborative notes-like app, extended to a more general key-value model.
  • Each device keeps a local store and a causal history vector (logical clocks) to apply updates in a deterministic, conflict-free way.
  • Synchronization happens via a gossip-like protocol or a central event log. Updates are versioned and merged deterministically.
  • An operational surface includes: a change feed API, health checks, tape-based rollback, and dashboards for visibility.

System design sketch

  • Data model
    • Entity: Item { id: string, value: string, tags: string[], ts: number }
    • CRDT core: OR-Set for collections of items, extended with per-item tombstones.
    • Local store: a Map plus a tombstone set to know deleted items.
  • Synchronization protocol
    • Each device maintains a Vector Clock (VC) or causal metadata per item.
    • When devices connect, they exchange deltas: added items, updated items, and tombstones.
    • Merges are performed deterministically using CRDT rules; conflicts resolve via last-writer-wins with a sane tie-breaker or via content-based comparison.
  • Operational surface
    • REST/WebSocket endpoints to query local state and push deltas.
    • Sync status endpoint showing last sync time, delta size, and conflict count.
    • Rollback ability by recording a local checkpoint (tape) before applying a batch of updates.
    • Observability: metrics on replication lag, number of items, conflict rate, and partition events.
  • Reliability considerations
    • Idempotent deltas, retry logic, and exactly-once semantics for delta application where possible.
    • Safe rollback using deterministic replays from a recorded checkpoint.
    • Testing with property-based tests to validate CRDT invariants.

Step 1: Define the CRDT core (OR-Set) and data model

  • An OR-Set uses add and remove events tracked by unique tags per item addition to resolve conflicts.
  • We’ll implement a simple wrapper around a Map that stores, for each item id, a set of addition tags and a tombstone set of removed tags.

Code (TypeScript-like pseudocode)

type Item = {
  id: string;
  value: string;
  tags: string[]; // metadata if needed
  ts: number; // logical timestamp at creation
};

// A per-item addition namespace
type Dot = string; // e.g., "<node-id>:<counter>"

class ORSet {
  // adds: itemId -> Set<Dot>
  private adds: Map<string, Set<Dot>> = new Map();
  // removes: itemId -> Set<Dot> (dots that have been removed)
  private removes: Map<string, Set<Dot>> = new Map();

  // add an item with a unique dot
  add(itemId: string, dot: Dot): void {
    if (!this.adds.has(itemId)) this.adds.set(itemId, new Set());
    this.adds.get(itemId)!.add(dot);
  }

  // remove an item by removing a previously added dot
  remove(itemId: string, dot: Dot): void {
    if (!this.removes.has(itemId)) this.removes.set(itemId, new Set());
    this.removes.get(itemId)!.add(dot);
  }

  // merge another ORSet into this one
  merge(other: ORSet): void {
    for (const [id, dots] of other.adds.entries()) {
      if (!this.adds.has(id)) this.adds.set(id, new Set());
      for (const d of dots) this.adds.get(id)!.add(d);
    }
    for (const [id, dots] of other.removes.entries()) {
      if (!this.removes.has(id)) this.removes.set(id, new Set());
      for (const d of dots) this.removes.get(id)!.add(d);
    }
  }

  // materialize the current view: items present = adds - removes
  present(): string[] {
    const result: string[] = [];
    for (const [id, addDots] of this.adds.entries()) {
      const removeDots = this.removes.get(id) ?? new Set<string>();
      // item exists if there is any add dot not retracted by a corresponding remove dot
      // This simplified view assumes a one-to-one mapping; in a real system you track dots per item value.
      const isPresent = Array.from(addDots).some(dot => !removeDots.has(dot));
      if (isPresent) result.push(id);
    }
    return result;
  }

  // For a real app, you'd store full item payloads per id, not just ids.
}
Enter fullscreen mode Exit fullscreen mode

Notes:

  • This is a simplified OR-Set sketch to illustrate the concept. In a real implementation, you’ll keep the payload (Item) per add dot, so you can reconstruct the exact content after merges, and you’ll manage per-item tags or timestamps to resolve dovetailing updates.
  • Each add-dot should be globally unique (e.g., nodeId:counter) so that different devices don’t collide.

Step 2: Local store with synchronization hooks

  • The local store houses items and their per-item CRDT state.
  • Each device keeps:
    • items: Map
    • dotCounter per node to generate unique add-dots
    • a delta queue to push to remote peers
  • When local edits occur, we create a new item, assign a new dot, and update the OR-Set.

Code (high-level)

class CRDTStore {
  private items: Map<string, Item> = new Map();
  private orset: ORSet = new ORSet();
  private nodeId: string;
  private counter: number = 0;
  private deltaLog: Delta[] = []; // for outbound syncs

  constructor(nodeId: string) {
    this.nodeId = nodeId;
  }

  private nextDot(): Dot {
    this.counter += 1;
    return `${this.nodeId}:${this.counter}`;
  }

  addItem(id: string, value: string, ts: number): void {
    const dot = this.nextDot();
    const item: Item = { id, value, ts, tags: [] };
    this.items.set(id, item);
    this.orset.add(id, dot);
    this.deltaLog.push({ type: 'add', id, value, dot, ts, node: this.nodeId });
  }

  removeItem(id: string, dot: Dot): void {
    this.orset.remove(id, dot);
    this.deltaLog.push({ type: 'remove', id, dot, node: this.nodeId });
  }

  applyDelta(delta: Delta): void {
    // Merge delta into OR-Set, reconstruct items as needed
    // For simplicity, assume delta carries enough payload to update items
    // In a full implementation you'd map adds/removes to items payloads.
  }

  getPendingDeltas(): Delta[] {
    const out = this.deltaLog.slice();
    this.deltaLog = [];
    return out;
  }

  // expose a snapshot of current visible items
  snapshot(): Item[] {
    // materialize from OR-Set present plus stored payloads
    return Array.from(this.items.values());
  }
}

type Delta =
  | { type: 'add'; id: string; value: string; dot: Dot; ts: number; node: string }
  | { type: 'remove'; id: string; dot: Dot; node: string };
Enter fullscreen mode Exit fullscreen mode

Step 3: Synchronization protocol

  • Use a central broker (e.g., a lightweight REST endpoint or a WebSocket channel) or a peer-to-peer gossip mechanism to exchange deltas.
  • Each sync session exchanges a batch of deltas. The receiver applies deltas in a deterministic order (e.g., by dot lexicographic order or by delta sequence timestamp).

Example sync flow (simplified)

  • Node A and Node B connect.
  • Each sends its pendingDelta list (deltas not yet seen by the other).
  • Each applies the received deltas via applyDelta.
  • After applying, each sends back an acknowledgment or a fresh delta log containing new changes.

Code sketch for delta exchange (conceptual)

async function syncWithPeer(local: CRDTStore, peerApi: PeerApi) {
  const localDeltas = local.getPendingDeltas();
  const remoteDeltas = await peerApi.fetchDeltas(localDeltas.length ? localDeltas.ts : 0);
  // The peer would return deltas since a given timestamp
  for (const d of remoteDeltas) {
    local.applyDelta(d);
  }
  // After applying, push our deltas
  await peerApi.pushDeltas(localDeltas);
}
Enter fullscreen mode Exit fullscreen mode

Step 4: Operational surface and tooling

  • Expose an API to inspect state, trigger a manual sync, and rollback if needed.
  • Implement tape-based checkpoints: before applying a batch of deltas, record a checkpoint (a deterministic replay point). If something goes wrong, you can revert to the checkpoint and re-run deltas.

API surface ideas

  • GET /state - returns current items with their values and metadata.
  • POST /sync - initiate a sync with a remote peer.
  • POST /checkpoint - create a tape checkpoint with a human-friendly label.
  • POST /rollback - rollback to the last checkpoint and replay deltas up to now.
  • GET /health - basic liveness and sync status metrics.

Considerations for reliability

  • Idempotent delta application: ensure applying the same delta twice has no effect.
  • Ordering: apply deltas deterministically. If you receive concurrent adds for the same id, use a tie-breaker (e.g., dot-based ordering).
  • Conflict resolution: CRDTs avoid conflicts, but you might want to surface conflicts to users when they arise (e.g., two edits to the same item). You can implement a lightweight “conflict log” to show divergent values and allow user resolution.

Step 5: Testing strategy

  • Unit tests for CRDT merge semantics:
    • Adding/removing the same item from two nodes should converge to the same final state.
    • Deltas applied multiple times should not double-apply.
  • Property-based tests:
    • Random sequences of adds/removes across simulated nodes should converge to the same final snapshot.
    • After a network partition and subsequent reconciliation, all nodes have identical visible data or acceptable divergent-but-resolvable states.
  • End-to-end tests:
    • Simulate multiple fake peers with a central hub and verify convergence under delays and out-of-order delivery.

Example test scenario (pseudo)

  • Create three CRDTStore instances: A, B, C with distinct nodeIds.
  • Each adds several items independently.
  • Simulate network: exchange deltas between A-B, B-C, A-C in various orders with delays.
  • After reconciliation, verify that A.snapshot() equals B.snapshot() equals C.snapshot().

Step 6: A small, runnable example you can adapt

  • Implement a minimal local in-browser CRDT toy with a simple sync over WebRTC or WebSocket via a central server.
  • Start a small demo with two browser tabs:
    • Tab 1 adds an item, Tab 2 adds another item.
    • Tabs exchange deltas and converge to the same view.
    • Then Tab 1 updates item, Tab 2 deletes the item; reconcile to a consistent result.

Quick-start checklist

  • Pick a domain: notes, tasks, or contact records. For this tutorial, start with a notes collection where each note is an Item.
  • Implement the OR-Set skeleton and local store as shown above, focusing on correctness of add/remove/dot semantics.
  • Build a tiny sync layer that uses a central REST endpoint to fetch/push deltas.
  • Add a checkpoints/rollback feature to revert a batch of changes safely.
  • Add basic observability: counters for items, deltas exchanged, and sync latency.

Illustrative example: running scenario

  • Node A and Node B both run the app with separate local stores.
  • A creates Note #1 with dot A1, B creates Note #2 with dot B1.
  • A and B sync. Each applies the other's deltas.
  • A edits Note #1; creates a new dot A2; B edits Note #2 and creates dot B2.
  • They sync again. The final state reflects both edits without overwriting one another, due to CRDT semantics.
  • A detects a long-running sync lag; operator triggers a checkpoint before applying a batch; if something goes wrong, rollback to the checkpoint and re-run.

What to customize for your project

  • Data model complexity: if you need per-field synchronization, consider a CRDT that supports add-writes on fields or a LWW element set with metadata.
  • Conflict visibility: expose conflict metadata to users or operators, with a lightweight UI to resolve or annotate.
  • Security: sign deltas, verify authenticity, and encrypt payloads if sensitive data is synchronized across untrusted networks.

Next steps

  • If you’d like, I can tailor this into a concrete, runnable codebase in your preferred stack (Node.js with TypeScript, Python, or Go), including a minimal server, client library, and a small demo app. I can also sketch a small Piper-like sample project with tests and a runnable Docker setup to simulate multi-node synchronization.

Would you like me to convert this into a ready-to-run starter repo in your preferred language and framework? If yes, tell me:

  • Your language/framework choice (TypeScript/Node, Python, Go, etc.)
  • Whether you prefer a central server or a peer-to-peer sync style
  • The domain you want to model (notes, tasks, contacts, etc.)
  • Any constraints (offline-first, latency targets, deployment environment)

-

Rizwan Saleem | https://rizwansaleem.co

Top comments (0)