Joud Awad

Posted on Jun 19

44/60 Days System Design Questions

#abotwrotethis #frontend #backend #database

Your team just shipped "offline mode" for your field-service app.
Technicians work in basements, tunnels, plant floors — connectivity is unreliable.

The demo looked great. With navigator.onLine reporting false, the app kept working and the sync button pulsed green. You shipped to 400 users.

Then the incident reports started.

Technician in Munich finishes a repair job offline, syncs when she gets signal. Her updates are gone — overwritten by a colleague who edited the same record online 12 minutes earlier. Last-write-wins. Her write lost.

Technician in São Paulo opens the app, goes offline, edits three assets. Comes back online. App throws an unhandled promise rejection and crashes. IndexedDB schema was on version 2. The update shipped version 3. The migration never ran because he'd never opened the app while online.

You're now asked to actually fix offline-first — not demo it.

Here's the setup:

• 400 field technicians, avg offline window of 40 minutes
• Write conflicts happen ~3x per day, always on the same 8 "hot" assets
• Sync currently runs on reconnect via a single bulk POST
• IndexedDB is being used but schema migrations are undocumented
• You need conflict resolution, migration safety, and sync reliability

What's your architecture?

A) Move everything to localStorage + a manual JSON diff on sync. Simpler API, deterministic schema versioning — no IndexedDB migration headaches.

B) Keep IndexedDB, add a vector-clock field to every record. On sync, compare clocks — if diverged, surface a merge UI. Let the technician decide. Never throw away a write.

C) Wrap IndexedDB in a versioned schema migration layer (like Dexie.js). Add a per-record updated_at + device_id composite key. Last-write-wins, but last-write is now deterministic and auditable.

D) Replace client-side storage with a CRDT-based sync engine (Automerge or Yjs). Operations are commutative and associative — merge is always valid, conflicts are structurally impossible.

Pick one (A, B, C, or D) and tell me why. Full breakdown in the comments, including the one that sounds like overengineering but is closest to what Linear, Figma, and Notion actually run.

Drop your answer

Top comments (4)

Joud Awad • Jun 19

Why D wins (CRDT engine — Automerge / Yjs):

CRDTs (Conflict-free Replicated Data Types) are the only approach here where merge is mathematically guaranteed to produce the same result regardless of the order operations arrive, or how many times they are delivered. Their merge operations are commutative, associative, and idempotent.
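To make those three properties concrete, here is a minimal sketch of one of the simplest state-based CRDTs, a grow-only counter. It is not the Automerge or Yjs API, and the names (GCounter, mergeCounters, the device ids) are made up for illustration.

```typescript
// Hypothetical illustration: a grow-only counter (G-Counter).
// Not the Automerge or Yjs API.
type GCounter = Record<string, number>; // deviceId -> increments made on that device

function incrementCounter(counter: GCounter, deviceId: string): GCounter {
  return { ...counter, [deviceId]: (counter[deviceId] ?? 0) + 1 };
}

// Merge takes the per-device maximum. max() is commutative, associative,
// and idempotent, so the merge is too.
function mergeCounters(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const [deviceId, count] of Object.entries(b)) {
    merged[deviceId] = Math.max(merged[deviceId] ?? 0, count);
  }
  return merged;
}

function counterValue(counter: GCounter): number {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}

// Two devices increment independently while offline.
const munichCounter = incrementCounter(incrementCounter({}, "munich"), "munich");
const saoPauloCounter = incrementCounter({}, "sao-paulo");

// Sync order does not matter, and re-delivering a state changes nothing.
const mergedAB = mergeCounters(munichCounter, saoPauloCounter);
const mergedBA = mergeCounters(saoPauloCounter, munichCounter);
const mergedTwice = mergeCounters(mergedAB, saoPauloCounter);
```

Real asset records need richer types (registers, maps, lists), which is the part libraries like Automerge and Yjs handle for you.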

Two technicians can edit the same asset record offline for 6 hours, sync in any order — and the final state is always deterministic. No merge UI. No "who wins" logic. No data loss.

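Figma describes its multiplayer engine as CRDT-inspired, with a central server deciding the final order of changes. Linear and Notion have each built their own sync layers rather than adopting Automerge directly, as far as their public write-ups show. This isn't overengineering. It's where teams end up after trying everything else.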

Tradeoff: CRDT payloads are larger (they carry operation history). Yjs is lighter than Automerge for most cases. Server-side merge logic needs to understand CRDT ops, not just overwrite rows.

Joud Awad • Jun 19

Why B is the correct second choice (Vector Clocks + Merge UI):

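Vector clocks tell you whether two versions of a record are causally ordered (one device had already seen the other's write) or concurrent (neither had). Concurrent means a real conflict. When the clocks have diverged, you surface a merge UI and let the technician decide.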
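The comparison itself is small. Here is a minimal sketch, assuming each record carries a map of device id to write count; the type names and device ids are hypothetical.

```typescript
// Hypothetical shapes for illustration.
type VectorClock = Record<string, number>; // deviceId -> writes seen from that device
type ClockOrder = "equal" | "before" | "after" | "concurrent";

function compareClocks(a: VectorClock, b: VectorClock): ClockOrder {
  let aAhead = false;
  let bAhead = false;
  // Duplicated ids in this list are harmless: they are just compared twice.
  const deviceIds = Object.keys(a).concat(Object.keys(b));
  for (const id of deviceIds) {
    const av = a[id] ?? 0; // a missing entry means "no writes seen"
    const bv = b[id] ?? 0;
    if (av > bv) aAhead = true;
    if (bv > av) bAhead = true;
  }
  if (aAhead && bAhead) return "concurrent"; // diverged: show the merge UI
  if (aAhead) return "after";
  if (bAhead) return "before";
  return "equal";
}

// Both edits start from the same server version (server: 4).
const baseClock: VectorClock = { server: 4 };
const munichEdit: VectorClock = { server: 4, munich: 1 }; // edited offline
const colleagueEdit: VectorClock = { server: 5 };          // edited online

const conflictCheck = compareClocks(munichEdit, colleagueEdit); // "concurrent"
const cleanUpdate = compareClocks(baseClock, colleagueEdit);    // "before"
```

Only the "concurrent" case needs a human. "before" and "after" are ordinary updates and can be applied without asking anyone.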

Right answer when humans must stay in the loop. A technician marking an asset "inspected" vs another marking it "needs repair" — that's not a conflict an algorithm should auto-resolve.

Tradeoff: every conflict requires user action. At 3/day it's manageable. At 300/day it's a support nightmare.

Joud Awad • Jun 19

Why C is good but not enough (Versioned IndexedDB + LWW):

Dexie.js makes IndexedDB migrations genuinely manageable. LWW with a deterministic composite key (updated_at + device_id) is production-grade.
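The core idea behind a versioned migration layer is stepwise upgrades: take whatever schema version the device has and apply every step up to the current one, in order. Here is a minimal sketch of that idea in plain TypeScript. It is not the Dexie.js API, and the record fields and migration steps are invented for illustration. With raw IndexedDB, logic like this would run inside the onupgradeneeded handler, where event.oldVersion tells you which version the device is on.

```typescript
// Hypothetical record shape and migration steps, for illustration only.
type AssetRecord = Record<string, unknown>;
type Migration = (record: AssetRecord) => AssetRecord;

// Key N upgrades a record from schema version N-1 to N.
const assetMigrations: Record<number, Migration> = {
  2: (r) => ({ ...r, deviceId: r["deviceId"] ?? "unknown" }), // v1 -> v2
  3: (r) => ({ ...r, updatedAt: r["updatedAt"] ?? 0 }),       // v2 -> v3
};

const CURRENT_SCHEMA_VERSION = 3;

// Apply every missing step in order, so a device that skipped a release
// still ends up on the current schema.
function migrateRecord(record: AssetRecord, fromVersion: number): AssetRecord {
  let migrated = record;
  for (let v = fromVersion + 1; v <= CURRENT_SCHEMA_VERSION; v++) {
    const step = assetMigrations[v];
    if (!step) throw new Error(`Missing migration to schema version ${v}`);
    migrated = step(migrated);
  }
  return migrated;
}

// A device still on v1 and a device on v2 both land on the v3 shape.
const upgradedFromV1 = migrateRecord({ id: "asset-7" }, 1);
const upgradedFromV2 = migrateRecord({ id: "asset-7", deviceId: "tablet-12" }, 2);
```

Failing loudly on a missing step is deliberate: a crash at upgrade time is easier to diagnose than records silently left on an old shape.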

But LWW throws away real writes. Whichever write carries the older timestamp is discarded whole, so the Munich technician loses to anyone whose edit is stamped later, even if the two of them touched different fields. In a field-service app where "inspected" vs "not inspected" has regulatory consequences, that's data loss you can't accept.

LWW works when the cost of losing a write is low (a UI preference). It breaks when every write has operational meaning.
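Here is a minimal sketch of deterministic LWW with the composite key from option C, to show both why it is auditable and what it costs. The write shape, device ids, and field values are hypothetical.

```typescript
// Hypothetical write shape; field names follow option C (updated_at + device_id).
interface AssetWrite {
  updated_at: number; // ms since epoch, from the writing device's clock
  device_id: string;
  fields: Record<string, string>;
}

// Newer timestamp wins; device_id breaks exact ties, so every replica
// picks the same winner no matter which write it sees first.
function lwwWinner(a: AssetWrite, b: AssetWrite): AssetWrite {
  if (a.updated_at !== b.updated_at) return a.updated_at > b.updated_at ? a : b;
  return a.device_id > b.device_id ? a : b;
}

const offlineWrite: AssetWrite = {
  updated_at: 1000,
  device_id: "tablet-munich",
  fields: { status: "inspected" },
};
const onlineWrite: AssetWrite = {
  updated_at: 2000,
  device_id: "desktop-office",
  fields: { notes: "replaced filter" },
};

// The online write wins whole. The "inspected" status is not in the
// result, even though the two writes touched different fields.
const lwwResult = lwwWinner(offlineWrite, onlineWrite);
const statusSurvived = "status" in lwwResult.fields; // false
```

The outcome is deterministic and easy to audit, and the offline technician's inspection is still gone.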

Joud Awad • Jun 19

Why A fails (localStorage + JSON diff):

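localStorage is capped at roughly 5MB per origin in most browsers, stores only strings, and is synchronous, so every read and write blocks the main thread. At 200 records with attachment metadata, you can hit the cap, and unless you catch the QuotaExceededError the write simply fails. No indexing. No range queries. In private/incognito mode, everything is wiped when the private session ends.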

JSON diff breaks on nested arrays, partial field updates, deletions. You'd end up building a mini-CRDT by hand without the guarantees.
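The nested-array problem is easy to reproduce. Here is a minimal sketch of the kind of index-by-index diff a hand-rolled sync tends to start with; the helper name and part names are hypothetical.

```typescript
// Naive positional diff: compares arrays index by index.
// Hypothetical helper for illustration.
function positionalDiff(before: string[], after: string[]): string[] {
  const changes: string[] = [];
  const length = Math.max(before.length, after.length);
  for (let i = 0; i < length; i++) {
    if (before[i] !== after[i]) {
      changes.push(`[${i}]: ${before[i] ?? "(none)"} -> ${after[i] ?? "(none)"}`);
    }
  }
  return changes;
}

const partsBefore = ["filter", "gasket", "valve"];
const partsAfter = ["belt", "filter", "gasket", "valve"]; // one part added at the front

// One logical edit shows up as a change at every index.
const reportedChanges = positionalDiff(partsBefore, partsAfter);
```

Four reported changes for one inserted item means that two technicians adding a part to the same list will look like they rewrote each other's entire array.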

Works in a weekend prototype. Corrupts silently in production.