Building a resilient data synchronization layer with CRDTs and a practical operational surface
Data synchronization is a perennial challenge in distributed apps: how to keep copies in different services consistent, handle conflicting updates gracefully, and do it without sacrificing availability. In this tutorial, you’ll build a practical, end-to-end data synchronization layer using Conflict-free Replicated Data Types (CRDTs) paired with a simple operational surface for monitoring, debugging, and observability. The goal is to give you a concrete pattern you can adapt to real-world apps (web, mobile, or edge) without requiring exotic infrastructure.
Key ideas you'll take away
- How CRDTs enable eventual consistency without coordination bottlenecks.
- A pragmatic data model and API surface that supports multi-region sync, offline edits, and conflict resolution.
- An implementation pattern with a local store, a sync protocol, and a central event log.
- Methods for observability, testing, and rollback in a CRDT-based system.
Prerequisites
- Familiarity with JavaScript/TypeScript or a similar language.
- Basic understanding of distributed systems concepts (latency, availability, partition tolerance).
- Knowledge of data structures (maps, sets) and a simple REST or WebSocket-based API.
Overview of the approach
- Use a CRDT to represent domain data locally. We’ll implement a simple OR-Set (Observed-Removed Set) CRDT for a collaborative notes-like app, extended to a more general key-value model.
- Each device keeps a local store and a causal history vector (logical clocks) to apply updates in a deterministic, conflict-free way.
- Synchronization happens via a gossip-like protocol or a central event log. Updates are versioned and merged deterministically.
- An operational surface includes: a change feed API, health checks, tape-based rollback, and dashboards for visibility.
System design sketch
- Data model
- Entity: Item { id: string, value: string, tags: string[], ts: number }
- CRDT core: OR-Set for collections of items, extended with per-item tombstones.
- Local store: a Map plus a tombstone set to know deleted items.
- Synchronization protocol
- Each device maintains a Vector Clock (VC) or causal metadata per item.
- When devices connect, they exchange deltas: added items, updated items, and tombstones.
- Merges are performed deterministically using CRDT rules; conflicts resolve via last-writer-wins with a sane tie-breaker or via content-based comparison.
- Operational surface
- REST/WebSocket endpoints to query local state and push deltas.
- Sync status endpoint showing last sync time, delta size, and conflict count.
- Rollback ability by recording a local checkpoint (tape) before applying a batch of updates.
- Observability: metrics on replication lag, number of items, conflict rate, and partition events.
- Reliability considerations
- Idempotent deltas and retry logic, so that at-least-once delivery has the same effect as applying each delta exactly once.
- Safe rollback using deterministic replays from a recorded checkpoint.
- Testing with property-based tests to validate CRDT invariants.
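The vector clocks mentioned under the synchronization protocol can be sketched as follows. This is a minimal illustration, not part of the tutorial's code: the names `VectorClock`, `compareClocks`, and `mergeClocks` are assumptions, and a clock is modeled as a plain object mapping node id to the number of events seen from that node.

```typescript
// Illustrative sketch (hypothetical names): node id -> events seen from that node.
type VectorClock = Record<string, number>;
type Ordering = 'before' | 'after' | 'equal' | 'concurrent';

// Compare two clocks entry by entry; a missing entry counts as 0.
function compareClocks(a: VectorClock, b: VectorClock): Ordering {
  let aAhead = false;
  let bAhead = false;
  const nodeIds = Object.keys(a).concat(Object.keys(b));
  for (const nodeId of nodeIds) {
    const av = a[nodeId] ?? 0;
    const bv = b[nodeId] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent'; // neither update saw the other
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}

// Merging takes the per-node maximum, so the result dominates both inputs.
function mergeClocks(a: VectorClock, b: VectorClock): VectorClock {
  const merged: VectorClock = { ...a };
  for (const nodeId of Object.keys(b)) {
    merged[nodeId] = Math.max(merged[nodeId] ?? 0, b[nodeId] ?? 0);
  }
  return merged;
}
```

A result of 'concurrent' is the case where a tie-breaker or a conflict log becomes relevant; 'before' and 'after' can be resolved by simply keeping the later update.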
Step 1: Define the CRDT core (OR-Set) and data model
- An OR-Set tags every add with a unique identifier (a dot). A remove retracts only the dots it has observed, so a concurrent add that carries a new dot survives the merge.
- We’ll implement a simple wrapper around a Map that stores, for each item id, a set of addition tags and a tombstone set of removed tags.
Code (TypeScript-like pseudocode)
type Item = {
  id: string;
  value: string;
  tags: string[]; // metadata if needed
  ts: number;     // logical timestamp at creation
};

// A per-item addition namespace
type Dot = string; // e.g., "<node-id>:<counter>"

class ORSet {
  // adds: itemId -> Set<Dot>
  private adds: Map<string, Set<Dot>> = new Map();
  // removes: itemId -> Set<Dot> (dots that have been removed)
  private removes: Map<string, Set<Dot>> = new Map();

  // add an item with a unique dot
  add(itemId: string, dot: Dot): void {
    if (!this.adds.has(itemId)) this.adds.set(itemId, new Set());
    this.adds.get(itemId)!.add(dot);
  }

  // remove an item by removing a previously added dot
  remove(itemId: string, dot: Dot): void {
    if (!this.removes.has(itemId)) this.removes.set(itemId, new Set());
    this.removes.get(itemId)!.add(dot);
  }

  // merge another ORSet into this one
  merge(other: ORSet): void {
    for (const [id, dots] of other.adds.entries()) {
      if (!this.adds.has(id)) this.adds.set(id, new Set());
      for (const d of dots) this.adds.get(id)!.add(d);
    }
    for (const [id, dots] of other.removes.entries()) {
      if (!this.removes.has(id)) this.removes.set(id, new Set());
      for (const d of dots) this.removes.get(id)!.add(d);
    }
  }

  // materialize the current view: items present = adds - removes
  present(): string[] {
    const result: string[] = [];
    for (const [id, addDots] of this.adds.entries()) {
      const removeDots = this.removes.get(id) ?? new Set<string>();
      // item exists if there is any add dot not retracted by a corresponding remove dot
      // This simplified view assumes a one-to-one mapping; in a real system you track dots per item value.
      const isPresent = Array.from(addDots).some(dot => !removeDots.has(dot));
      if (isPresent) result.push(id);
    }
    return result;
  }

  // For a real app, you'd store full item payloads per id, not just ids.
}
Notes:
- This is a simplified OR-Set sketch to illustrate the concept. In a real implementation, you’ll keep the payload (Item) per add dot, so you can reconstruct the exact content after merges, and you’ll manage per-item tags or timestamps to resolve concurrent updates.
- Each add-dot should be globally unique (e.g., nodeId:counter) so that different devices don’t collide.
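As a rough illustration of the first note, the sketch below keeps a payload per add dot and tombstones only the dots a replica has actually observed. `PayloadORSet` and its method names are hypothetical, and the structure is one possible shape under those assumptions rather than a definitive implementation.

```typescript
// Illustrative sketch (hypothetical names): an OR-Set that stores a payload per add dot.
type Dot = string; // "<node-id>:<counter>", assumed globally unique

class PayloadORSet<T> {
  private adds = new Map<Dot, { id: string; payload: T }>();
  private removed = new Set<Dot>();

  add(id: string, payload: T, dot: Dot): void {
    this.adds.set(dot, { id, payload });
  }

  // Observed-remove: tombstone only the dots this replica has seen for `id`.
  // Returns the tombstoned dots so they can be shipped to peers.
  remove(id: string): Dot[] {
    const observed: Dot[] = [];
    this.adds.forEach((entry, dot) => {
      if (entry.id === id && !this.removed.has(dot)) observed.push(dot);
    });
    observed.forEach(dot => this.removed.add(dot));
    return observed;
  }

  // Merge is a union of both sets, so it is commutative and idempotent.
  merge(other: PayloadORSet<T>): void {
    other.adds.forEach((entry, dot) => this.adds.set(dot, entry));
    other.removed.forEach(dot => this.removed.add(dot));
  }

  // Live payloads for an id. More than one entry means concurrent adds survived.
  values(id: string): T[] {
    const live: T[] = [];
    this.adds.forEach((entry, dot) => {
      if (entry.id === id && !this.removed.has(dot)) live.push(entry.payload);
    });
    return live;
  }
}
```

With this shape, if replica B removes a note after seeing dot A:1 while replica A concurrently re-adds it under dot A:2, both replicas end up with only the A:2 payload after merging in either direction.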
Step 2: Local store with synchronization hooks
- The local store houses items and their per-item CRDT state.
- Each device keeps:
- items: Map<string, Item>
- dotCounter per node to generate unique add-dots
- a delta queue to push to remote peers
- When local edits occur, we create a new item, assign a new dot, and update the OR-Set.
Code (high-level)
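class CRDTStore {
  private items: Map<string, Item> = new Map();
  private orset: ORSet = new ORSet();
  private nodeId: string;
  private counter: number = 0;
  private deltaLog: Delta[] = []; // outbound deltas not yet handed to a peer

  constructor(nodeId: string) {
    this.nodeId = nodeId;
  }

  // Dots are "<nodeId>:<counter>", unique as long as node ids are unique.
  private nextDot(): Dot {
    this.counter += 1;
    return `${this.nodeId}:${this.counter}`;
  }

  // Returns the new dot so the caller can pass it to removeItem later.
  addItem(id: string, value: string, ts: number): Dot {
    const delta: Delta = { type: 'add', id, value, dot: this.nextDot(), ts, node: this.nodeId };
    this.applyDelta(delta); // local edits follow the same rules as remote ones
    this.deltaLog.push(delta);
    return delta.dot;
  }

  removeItem(id: string, dot: Dot): void {
    const delta: Delta = { type: 'remove', id, dot, node: this.nodeId };
    this.applyDelta(delta);
    this.deltaLog.push(delta);
  }

  // Safe to call repeatedly with the same delta: the OR-Set inserts are
  // idempotent and the payload rule below depends only on (ts, value).
  applyDelta(delta: Delta): void {
    if (delta.type === 'remove') {
      this.orset.remove(delta.id, delta.dot);
      return;
    }
    this.orset.add(delta.id, delta.dot);
    const current = this.items.get(delta.id);
    // Simplification: keep one payload per id, chosen last-writer-wins by ts,
    // with the value as a deterministic tie-breaker.
    // In a full implementation you'd keep a payload per add dot.
    const wins = !current || delta.ts > current.ts ||
      (delta.ts === current.ts && delta.value > current.value);
    if (wins) {
      this.items.set(delta.id, { id: delta.id, value: delta.value, ts: delta.ts, tags: [] });
    }
  }

  // Drains the outbound queue. If the push to the peer fails, the caller
  // must re-queue or retry these deltas, otherwise they are lost.
  getPendingDeltas(): Delta[] {
    const out = this.deltaLog.slice();
    this.deltaLog = [];
    return out;
  }

  // expose a snapshot of current visible items:
  // ids the OR-Set reports as present, joined with the stored payloads
  snapshot(): Item[] {
    return this.orset.present()
      .map(id => this.items.get(id))
      .filter((item): item is Item => item !== undefined);
  }
}

type Delta =
  | { type: 'add'; id: string; value: string; dot: Dot; ts: number; node: string }
  | { type: 'remove'; id: string; dot: Dot; node: string };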
Step 3: Synchronization protocol
- Use a central broker (e.g., a lightweight REST endpoint or a WebSocket channel) or a peer-to-peer gossip mechanism to exchange deltas.
- Each sync session exchanges a batch of deltas. The receiver applies deltas in a deterministic order (e.g., by dot lexicographic order or by delta sequence timestamp).
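One caveat with ordering by dot: plain string comparison places "A:10" before "A:2". The sketch below parses the "<node-id>:<counter>" format used in this tutorial and compares the counter numerically. The helper names `parseDot`, `compareDots`, and `orderBatch` are hypothetical.

```typescript
// Illustrative sketch (hypothetical helpers): deterministic ordering of a delta batch by dot.
type ParsedDot = { node: string; counter: number };

function parseDot(dot: string): ParsedDot {
  const sep = dot.lastIndexOf(':');
  if (sep < 0 || sep === dot.length - 1) throw new Error(`malformed dot: ${dot}`);
  const counter = Number(dot.slice(sep + 1));
  if (!Number.isInteger(counter)) throw new Error(`malformed dot counter: ${dot}`);
  return { node: dot.slice(0, sep), counter };
}

// Order by node id first, then by numeric counter within a node.
function compareDots(a: string, b: string): number {
  const pa = parseDot(a);
  const pb = parseDot(b);
  if (pa.node !== pb.node) return pa.node < pb.node ? -1 : 1;
  return pa.counter - pb.counter;
}

// Returns a sorted copy; the input batch is left untouched.
function orderBatch<T extends { dot: string }>(batch: T[]): T[] {
  return batch.slice().sort((x, y) => compareDots(x.dot, y.dot));
}
```

Because every receiver sorts the same batch the same way, the order in which deltas arrived over the network no longer influences the order in which they are applied.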
Example sync flow (simplified)
- Node A and Node B connect.
- Each sends its pendingDelta list (deltas not yet seen by the other).
- Each applies the received deltas via applyDelta.
- After applying, each sends back an acknowledgment or a fresh delta log containing new changes.
Code sketch for delta exchange (conceptual)
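// Assumed peer interface; adapt it to your transport (REST, WebSocket, ...).
interface PeerApi {
  fetchDeltas(since: number): Promise<Delta[]>;
  pushDeltas(deltas: Delta[]): Promise<void>;
}

// `since` is a cursor the peer understands (for example, the timestamp of the
// last successful sync). The caller is expected to track it between sessions.
async function syncWithPeer(local: CRDTStore, peerApi: PeerApi, since: number = 0) {
  const localDeltas = local.getPendingDeltas();
  // The peer returns deltas recorded after the given cursor
  const remoteDeltas = await peerApi.fetchDeltas(since);
  for (const d of remoteDeltas) {
    local.applyDelta(d);
  }
  // After applying, push our deltas
  await peerApi.pushDeltas(localDeltas);
}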
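For the central event log variant, the peer side of the exchange can be as small as an append-only list with a sequence cursor. The sketch below is an in-memory stand-in under that assumption; `EventLog`, `append`, and `readSince` are hypothetical names, and the `key` field is assumed to identify a delta uniquely (for example, the delta type combined with its dot).

```typescript
// Illustrative sketch (hypothetical names): an append-only log with a sequence cursor.
type LoggedDelta = { seq: number; key: string; body: unknown };

class EventLog {
  private entries: LoggedDelta[] = [];
  private keys = new Set<string>();

  // Append deltas, skipping any key already logged so that retried pushes are harmless.
  // Returns the highest sequence number in the log.
  append(deltas: { key: string; body: unknown }[]): number {
    for (const d of deltas) {
      if (this.keys.has(d.key)) continue;
      this.keys.add(d.key);
      this.entries.push({ seq: this.entries.length + 1, key: d.key, body: d.body });
    }
    return this.entries.length;
  }

  // Entries with seq greater than `since`, in log order.
  readSince(since: number): LoggedDelta[] {
    return this.entries.filter(e => e.seq > since);
  }
}
```

A client that remembers the last sequence number it has applied can pass it as the cursor on the next fetch, which avoids relying on wall-clock timestamps.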
Step 4: Operational surface and tooling
- Expose an API to inspect state, trigger a manual sync, and rollback if needed.
- Implement tape-based checkpoints: before applying a batch of deltas, record a checkpoint (a deterministic replay point). If something goes wrong, you can revert to the checkpoint and re-run deltas.
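A checkpoint tape can be sketched as a saved copy of the state plus the list of deltas applied since. Everything here is an assumption for illustration: the `Tape` class, its method names, and the use of a JSON round trip for copying (which requires the state to be JSON-serializable).

```typescript
// Illustrative sketch (hypothetical names): checkpoint, record, and replay.
type Checkpoint<S> = { label: string; state: S };

class Tape<S, D> {
  private saved: Checkpoint<S> | null = null;
  private recorded: D[] = [];
  private readonly applyFn: (state: S, delta: D) => S;

  constructor(applyFn: (state: S, delta: D) => S) {
    this.applyFn = applyFn;
  }

  // Store a deep copy of the state and start a fresh recording.
  checkpoint(label: string, state: S): void {
    this.saved = { label, state: JSON.parse(JSON.stringify(state)) as S };
    this.recorded = [];
  }

  // Call once for every delta applied after the checkpoint.
  record(delta: D): void {
    this.recorded.push(delta);
  }

  // Rebuild state from the checkpoint, replaying only the deltas that pass `keep`.
  rollback(keep: (delta: D) => boolean = () => true): S {
    if (!this.saved) throw new Error('no checkpoint recorded');
    let state = JSON.parse(JSON.stringify(this.saved.state)) as S;
    for (const delta of this.recorded) {
      if (keep(delta)) state = this.applyFn(state, delta);
    }
    return state;
  }
}
```

The replay is deterministic as long as `applyFn` is a pure function of the state and the delta, which is the property the rollback idea depends on.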
API surface ideas
- GET /state - returns current items with their values and metadata.
- POST /sync - initiate a sync with a remote peer.
- POST /checkpoint - create a tape checkpoint with a human-friendly label.
- POST /rollback - rollback to the last checkpoint and replay deltas up to now.
- GET /health - basic liveness and sync status metrics.
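The body returned by a health or sync status endpoint could be assembled by a small tracker like the one below. This is a sketch under assumptions: `SyncMonitor`, the field names, and the staleness threshold are all hypothetical, and wiring it to an actual HTTP route is left to your framework.

```typescript
// Illustrative sketch (hypothetical names): data behind a health / sync status endpoint.
type SyncStatus = {
  lastSyncAt: number | null; // epoch millis of the last completed sync
  lastDeltaCount: number;    // deltas exchanged in that sync
  conflictCount: number;     // running total of conflicts seen
  healthy: boolean;          // false if no sync yet or the last one is too old
};

class SyncMonitor {
  private lastSyncAt: number | null = null;
  private lastDeltaCount = 0;
  private conflictCount = 0;
  private readonly maxStalenessMs: number;

  constructor(maxStalenessMs: number) {
    this.maxStalenessMs = maxStalenessMs;
  }

  // Call after each completed sync session.
  recordSync(at: number, deltaCount: number, conflicts: number): void {
    this.lastSyncAt = at;
    this.lastDeltaCount = deltaCount;
    this.conflictCount += conflicts;
  }

  // `now` is passed in so the report is easy to test.
  report(now: number): SyncStatus {
    const healthy = this.lastSyncAt !== null && now - this.lastSyncAt <= this.maxStalenessMs;
    return {
      lastSyncAt: this.lastSyncAt,
      lastDeltaCount: this.lastDeltaCount,
      conflictCount: this.conflictCount,
      healthy,
    };
  }
}
```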
Considerations for reliability
- Idempotent delta application: ensure applying the same delta twice has no effect.
- Ordering: apply deltas deterministically. If you receive concurrent adds for the same id, use a tie-breaker (e.g., dot-based ordering).
- Conflict resolution: CRDTs avoid conflicts, but you might want to surface conflicts to users when they arise (e.g., two edits to the same item). You can implement a lightweight “conflict log” to show divergent values and allow user resolution.
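The three considerations above can be combined in one small applier: a seen-set for idempotency, a (ts, dot) comparison as the tie-breaker, and a conflict log. The sketch assumes `ts` is a Lamport-style logical timestamp, so equal timestamps from different dots are treated as concurrent edits. `EditApplier` and its fields are hypothetical names.

```typescript
// Illustrative sketch (hypothetical names): idempotent apply, tie-breaking, conflict log.
type Edit = { id: string; value: string; ts: number; dot: string };
type Conflict = { id: string; kept: Edit; discarded: Edit };

class EditApplier {
  private seen = new Set<string>();          // dots already applied
  private current = new Map<string, Edit>(); // winning edit per item id
  readonly conflicts: Conflict[] = [];

  // Returns false when the edit was a duplicate and had no effect.
  apply(edit: Edit): boolean {
    if (this.seen.has(edit.dot)) return false;
    this.seen.add(edit.dot);
    const existing = this.current.get(edit.id);
    if (!existing) {
      this.current.set(edit.id, edit);
      return true;
    }
    // Higher ts wins; on a tie the larger dot string wins (arbitrary but deterministic).
    const incomingWins =
      edit.ts > existing.ts || (edit.ts === existing.ts && edit.dot > existing.dot);
    const kept = incomingWins ? edit : existing;
    const discarded = incomingWins ? existing : edit;
    // Assumption: a timestamp tie with different values indicates concurrent edits.
    if (edit.ts === existing.ts && kept.value !== discarded.value) {
      this.conflicts.push({ id: edit.id, kept, discarded });
    }
    this.current.set(edit.id, kept);
    return true;
  }

  read(id: string): string | undefined {
    return this.current.get(id)?.value;
  }
}
```

Two replicas that receive the same pair of tied edits in opposite orders keep the same value, and each records one conflict entry that a UI could surface for manual resolution.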
Step 5: Testing strategy
- Unit tests for CRDT merge semantics:
- Adding/removing the same item from two nodes should converge to the same final state.
- Deltas applied multiple times should not double-apply.
- Property-based tests:
- Random sequences of adds/removes across simulated nodes should converge to the same final snapshot.
- After a network partition and subsequent reconciliation, all nodes have identical visible data or acceptable divergent-but-resolvable states.
- End-to-end tests:
- Simulate multiple fake peers with a central hub and verify convergence under delays and out-of-order delivery.
Example test scenario (pseudo)
- Create three CRDTStore instances: A, B, C with distinct nodeIds.
- Each adds several items independently.
- Simulate network: exchange deltas between A-B, B-C, A-C in various orders with delays.
- After reconciliation, verify that A.snapshot() equals B.snapshot() equals C.snapshot().
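The scenario above can be turned into a small randomized check. The sketch below is self-contained and uses a deliberately tiny replica model (add dots plus tombstones) instead of the tutorial's CRDTStore; `Replica`, `makeRng`, `shuffled`, `generateOps`, and `converges` are hypothetical names. A seeded generator keeps failures reproducible.

```typescript
// Illustrative sketch (hypothetical names): randomized convergence check for three replicas.
type Op = { kind: 'add' | 'remove'; id: string; dot: string };

class Replica {
  private adds = new Map<string, string>(); // dot -> item id
  private removed = new Set<string>();      // tombstoned dots
  apply(op: Op): void {
    if (op.kind === 'add') this.adds.set(op.dot, op.id);
    else this.removed.add(op.dot);
  }
  visible(): string[] {
    const ids = new Set<string>();
    this.adds.forEach((id, dot) => { if (!this.removed.has(dot)) ids.add(id); });
    return Array.from(ids).sort();
  }
}

// Small linear congruential generator so runs are repeatable per seed.
function makeRng(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Fisher-Yates shuffle on a copy.
function shuffled<T>(items: T[], rng: () => number): T[] {
  const copy = items.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(rng() * (i + 1));
    const tmp = copy[i] as T;
    copy[i] = copy[j] as T;
    copy[j] = tmp;
  }
  return copy;
}

function generateOps(rng: () => number, count: number): Op[] {
  const ops: Op[] = [];
  const addOps: Op[] = [];
  const nodes = ['A', 'B', 'C'];
  const counters: Record<string, number> = { A: 0, B: 0, C: 0 };
  for (let i = 0; i < count; i++) {
    const node = nodes[Math.floor(rng() * nodes.length)] as string;
    if (addOps.length === 0 || rng() < 0.7) {
      counters[node] = (counters[node] ?? 0) + 1;
      const op: Op = { kind: 'add', id: `note-${Math.floor(rng() * 5)}`, dot: `${node}:${counters[node]}` };
      ops.push(op);
      addOps.push(op);
    } else {
      const target = addOps[Math.floor(rng() * addOps.length)] as Op;
      ops.push({ kind: 'remove', id: target.id, dot: target.dot });
    }
  }
  return ops;
}

// Each replica receives every op, in its own order, with some duplicates mixed in.
function converges(seed: number): boolean {
  const rng = makeRng(seed);
  const ops = generateOps(rng, 40);
  const views = [0, 1, 2].map(() => {
    const replica = new Replica();
    shuffled(ops.concat(ops.slice(0, 10)), rng).forEach(op => replica.apply(op));
    return JSON.stringify(replica.visible());
  });
  return views[0] === views[1] && views[1] === views[2];
}
```

Running `converges` over a range of seeds exercises out-of-order and duplicate delivery; a property-based testing library could replace the hand-rolled generator if you already use one.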
Step 6: A small, runnable example you can adapt
- Implement a minimal local in-browser CRDT toy with a simple sync over WebRTC or WebSocket via a central server.
- Start a small demo with two browser tabs:
- Tab 1 adds an item, Tab 2 adds another item.
- Tabs exchange deltas and converge to the same view.
- Then Tab 1 updates item, Tab 2 deletes the item; reconcile to a consistent result.
Quick-start checklist
- Pick a domain: notes, tasks, or contact records. For this tutorial, start with a notes collection where each note is an Item.
- Implement the OR-Set skeleton and local store as shown above, focusing on correctness of add/remove/dot semantics.
- Build a tiny sync layer that uses a central REST endpoint to fetch/push deltas.
- Add a checkpoints/rollback feature to revert a batch of changes safely.
- Add basic observability: counters for items, deltas exchanged, and sync latency.
Illustrative example: running scenario
- Node A and Node B both run the app with separate local stores.
- A creates Note #1 with dot A1, B creates Note #2 with dot B1.
- A and B sync. Each applies the other's deltas.
- A edits Note #1; creates a new dot A2; B edits Note #2 and creates dot B2.
- They sync again. The final state reflects both edits without overwriting one another, due to CRDT semantics.
- A detects a long-running sync lag; operator triggers a checkpoint before applying a batch; if something goes wrong, rollback to the checkpoint and re-run.
What to customize for your project
- Data model complexity: if you need per-field synchronization, consider a CRDT that merges each field independently (for example, a map of last-writer-wins registers) or an LWW element set with metadata.
- Conflict visibility: expose conflict metadata to users or operators, with a lightweight UI to resolve or annotate.
- Security: sign deltas, verify authenticity, and encrypt payloads if sensitive data is synchronized across untrusted networks.
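For the per-field option, one possible shape is a record whose fields are independent last-writer-wins registers. The sketch assumes each write carries a logical timestamp and the writing node's id, and that a (ts, node) pair identifies a single write. `FieldValue`, `LwwRecord`, `newer`, and `mergeRecords` are hypothetical names.

```typescript
// Illustrative sketch (hypothetical names): per-field last-writer-wins merge.
type FieldValue = { value: string; ts: number; node: string };
type LwwRecord = Record<string, FieldValue>;

// Higher ts wins; on a tie the larger node id wins, so both sides pick the same write.
function newer(a: FieldValue, b: FieldValue): FieldValue {
  if (a.ts !== b.ts) return a.ts > b.ts ? a : b;
  return a.node >= b.node ? a : b;
}

// Merge field by field, so edits to different fields of one item never overwrite each other.
function mergeRecords(a: LwwRecord, b: LwwRecord): LwwRecord {
  const merged: LwwRecord = { ...a };
  for (const field of Object.keys(b)) {
    const theirs = b[field] as FieldValue;
    const ours = merged[field];
    merged[field] = ours ? newer(ours, theirs) : theirs;
  }
  return merged;
}
```

If one device changes a note's title while another changes its body, the merged record keeps both changes, which a single whole-item LWW rule would not.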
Next steps
- If you’d like, I can tailor this into a concrete, runnable codebase in your preferred stack (Node.js with TypeScript, Python, or Go), including a minimal server, client library, and a small demo app. I can also sketch a small Piper-like sample project with tests and a runnable Docker setup to simulate multi-node synchronization.
Would you like me to convert this into a ready-to-run starter repo in your preferred language and framework? If yes, tell me:
- Your language/framework choice (TypeScript/Node, Python, Go, etc.)
- Whether you prefer a central server or a peer-to-peer sync style
- The domain you want to model (notes, tasks, contacts, etc.)
- Any constraints (offline-first, latency targets, deployment environment)
Rizwan Saleem | https://rizwansaleem.co