Audience: this article is for frontend engineers with React experience who have already implemented a "save locally and sync later" approach, but have never had to deal with the failures that surface in production — growing payloads, crashes during transmission, and devices that come back online after hours disconnected.
When you implement offline-first for the first time, the solution seems obvious:
- Save to IndexedDB
- When online, send to the server
- If it fails, try again
The problem is step 3.
"Try again" is a state machine disguised as a conditional. And when you don't model the states explicitly, the machine exists anyway — it's just not under your control.
The thesis of this article is specific: failure in offline synchronization is not a binary state. It is at least six states with precise operational semantics and deterministic transitions. Treating those states as binary guarantees infinite retry loops, silently lost data, and bugs that only appear when the user is in the field without internet, 300 km away from you.
The concrete problem: the item that never leaves the queue
Consider an offline POS (point of sale) system. The cashier records orders with item photos embedded as base64 in the payload. Under normal conditions, each order is ~20 KB. But a special promotion requires a higher-resolution photo — the payload grows to 400 KB.
The server has a 256 KB limit per request. The behavior with binary synchronization:
- Item enters the queue with status "pending"
- Device comes online
- Runner tries to send — server returns 413
- Runner marks as "error" and schedules retry
- Retry tries again — 413
- Infinite loop
The item never leaves the queue. It never will. And there is no visible error log, because from the runner's point of view it is still "trying". The POS operator has no idea that the sale will never reach the server.
This scenario is not hypothetical. It is documented in the code of offline-first-sync-queue — and the solution reveals why explicit states are the only defense:
if (chunk.length === 1 && size > maxBatchPayloadSize) {
  await markDeadLocalTooLarge(db, chunk[0], size, maxBatchPayloadSize);
  sent += 1;
  dead += 1;
  continue;
}
The item does not get a retry. It gets a different state — DEAD_LETTER — with the exact error preserved for diagnosis:
lastError: `payload_too_large_local:${computedBytes}>${maxBytes}`,
Now the item is visible, diagnosable, and the loop has been avoided. But this is only possible because a DEAD_LETTER state exists separately from RETRYABLE_ERROR.
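The body of markDeadLocalTooLarge is not shown here. As a sketch, assuming the item fields used elsewhere in this article (status, lastError), the update it writes might look like the following; the deadAt field is a hypothetical addition for diagnostics:

```typescript
// Hypothetical sketch of the update markDeadLocalTooLarge persists.
// Field names mirror those used elsewhere in the article; deadAt is an assumption.
function deadLetterTooLargeUpdate(itemBytes: number, maxBytes: number) {
  return {
    status: 'DEAD_LETTER' as const,
    // Preserve the exact failure for later diagnosis, as in the article's snippet.
    lastError: `payload_too_large_local:${itemBytes}>${maxBytes}`,
    deadAt: Date.now(), // assumed timestamp for "when did we give up"
  };
}
```

The key property is that the failure reason is data, not a log line: it travels with the item and survives restarts.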
The state machine you are already implementing (without knowing it)
If your queue has retry logic, you have already made implicit decisions about state transitions. The question is whether those decisions are encoded or scattered across conditionals.
The project's sync-contract defines the states explicitly via Zod:
export const SyncStatusSchema = z.enum([
  'PENDING',         // initial state, ready to be sent
  'IN_FLIGHT',       // sent to the server, acknowledgment not yet received
  'SYNCED',          // successfully acknowledged by the server
  'RETRYABLE_ERROR', // temporary failure, should retry
  'FATAL_ERROR',     // permanent failure (e.g., validation), should not retry
  'DEAD_LETTER',     // terminal state after max retries or fatal error
]);
A reasonable engineer could argue: "three states are enough — pending, synced, error". The rest of this article proves why IN_FLIGHT, RETRYABLE_ERROR, and DEAD_LETTER need to be separate states — not subcategories of "error".
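Before that, it helps to make "deterministic transitions" concrete. The table below is my sketch, not the repo's code; it encodes which moves between the six states are legal, and makes SYNCED and DEAD_LETTER terminal by construction:

```typescript
// Sketch of a transition table for the six states (an assumption, not repo code).
type SyncStatus =
  | 'PENDING' | 'IN_FLIGHT' | 'SYNCED'
  | 'RETRYABLE_ERROR' | 'FATAL_ERROR' | 'DEAD_LETTER';

const TRANSITIONS: Record<SyncStatus, SyncStatus[]> = {
  PENDING: ['IN_FLIGHT'],
  IN_FLIGHT: ['SYNCED', 'RETRYABLE_ERROR', 'FATAL_ERROR', 'DEAD_LETTER'],
  RETRYABLE_ERROR: ['IN_FLIGHT', 'DEAD_LETTER'],
  FATAL_ERROR: ['DEAD_LETTER'],
  SYNCED: [],      // terminal: nothing more to do
  DEAD_LETTER: [], // terminal: requires human intervention, not code
};

function canTransition(from: SyncStatus, to: SyncStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Guarding every status write with a check like this turns "scattered conditionals" into a single enforceable invariant.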
Why IN_FLIGHT needs to exist
Without IN_FLIGHT, you have no idea what happened to an item when the device crashes during transmission.
The scenario: the runner picks the item from the queue, sends it to the server, and the device loses power before receiving confirmation. On the next startup, the item is still PENDING. The runner tries to send it again. The server may have processed it the first time — now you have a duplicate.
The solution is to mark the item as IN_FLIGHT before sending, and recover items stuck in that state based on a timestamp:
async function requeueStaleInFlight(
  db: Awaited<ReturnType<typeof getDB>>,
  staleAfterMs: number,
) {
  const now = Date.now();
  const tx = db.transaction('syncQueue', 'readwrite');
  const store = tx.objectStore('syncQueue');
  const byStatus = store.index('by-status');
  let cursor = await byStatus.openCursor('IN_FLIGHT');
  while (cursor) {
    const item = cursor.value as SyncQueueItem;
    const inFlightAt = item.inFlightAt ?? 0;
    if (inFlightAt > 0 && inFlightAt + staleAfterMs < now) {
      await cursor.update({
        ...item,
        status: 'RETRYABLE_ERROR',
        nextAttemptAt: now,
        inFlightAt: undefined,
        lastError: 'stale_in_flight',
      });
    }
    cursor = await cursor.continue();
  }
  await tx.done;
}
Three visible decisions in this snippet:
- The by-status index avoids scanning the entire queue on every recovery run
- inFlightAt is a timestamp for calculating staleness, not a boolean
- The transition goes to RETRYABLE_ERROR with nextAttemptAt: now — immediate retry, not backoff, because the stall was local, not a server failure
Why RETRYABLE_ERROR and FATAL_ERROR need to be separate states
Not every failure deserves a retry. The decision logic lives in shouldRetry:
export function shouldRetry(status?: number) {
  if (status === undefined) return true; // no response = network failure, retry
  if (status === 408) return true;       // request timeout, retry
  // invalid payload, no authorization, or oversized: do NOT auto-retry
  if (status === 400 || status === 401 || status === 403 || status === 413) return false;
  if (status === 429) return true;       // rate limit, retry with backoff
  if (status >= 500) return true;        // server error, retry
  return false;
}
The non-obvious point: undefined is not a vague "no response" — it is the specific absence of an HTTP status, which indicates a network failure (not a server failure). 413 is listed alongside 400/401/403 because an oversized payload with automatic retry would be a design bug, not a resilience strategy. You would be retrying something that will never pass — the POS scenario described above.
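For the retryable branch, the article does not show how the next attempt is scheduled. A common approach, sketched here as an assumption (the repo's actual backoff policy may differ), is exponential backoff with full jitter, so that a fleet of devices reconnecting at once does not stampede the server:

```typescript
// Sketch: exponential backoff with full jitter (an assumption, not repo code).
// attempt is zero-based; baseMs and capMs are illustrative defaults.
function nextAttemptDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick a random point in [0, exp) so retries spread out.
  return Math.floor(Math.random() * exp);
}

// Usage: after a retryable failure,
// item.nextAttemptAt = Date.now() + nextAttemptDelayMs(item.attempts);
```

Note how this differs from the stale-IN_FLIGHT recovery above, which deliberately sets nextAttemptAt: now because the stall was local.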
Why DEAD_LETTER needs to be a terminal state
DEAD_LETTER exists for items that should no longer be processed, but that should also not be silently deleted. An item in DEAD_LETTER is evidence of a problem that requires human intervention or different code — not more retries.
Without this state, the alternatives are worse:
- Delete the item → you lost data
- Keep it as FATAL_ERROR → you cannot distinguish "failed with 400" from "failed 10 times and we gave up"
- Infinite loop → the POS scenario
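The decision of when an item crosses into DEAD_LETTER can live in one pure function. This is a sketch under assumed names (nextStatusAfterFailure, maxAttempts, and the item shape are mine, not the repo's), combining the shouldRetry verdict with an attempt budget:

```typescript
// Sketch: deciding the next status after a failed send attempt.
// The item shape and maxAttempts default are assumptions for illustration.
interface QueueItem {
  status: string;
  attempts: number;   // attempts made so far, before this failure
  lastError?: string;
}

function nextStatusAfterFailure(
  item: QueueItem,
  retryable: boolean,   // e.g., the result of shouldRetry(status)
  maxAttempts = 10,
): 'RETRYABLE_ERROR' | 'DEAD_LETTER' {
  if (!retryable) return 'DEAD_LETTER';                        // fatal: never retry
  if (item.attempts + 1 >= maxAttempts) return 'DEAD_LETTER';  // budget exhausted
  return 'RETRYABLE_ERROR';                                    // schedule another try
}
```

Keeping this in one function means the "failed with 400" vs "failed 10 times" distinction is encoded once, not rediscovered in every call site.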
Trade-offs you need to know before adopting this architecture
The lock is single-tab — by design
The lock mechanism that prevents the sync runner from running in parallel has an explicit limitation in the JSDoc:
/**
* Simple IndexedDB lock to prevent running the runner in parallel.
* Returns `null` when the lock is already held.
*
* Note: the lock is "best effort" (single-tab). For multi-tab, prefer:
* - BroadcastChannel + leader election, or
* - Web Locks API (when available), or
* - a lock with "ownerId" (tabId) + heartbeats.
*/
If your use case involves multiple tabs open simultaneously (common in management dashboards), this lock is not sufficient. The Web Locks API solves this, but has limited support in some browsers. BroadcastChannel with leader election is more portable but adds coordination complexity.
The simplicity trade-off (TTL + IndexedDB, zero dependencies) was made consciously for the mobile POS scenario, where multiple tabs are unlikely. For your scenario, it may not be the right choice.
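If you do need multi-tab safety, the Web Locks variant suggested in the JSDoc can be sketched like this. The function name, lock name, and fallback behavior are my assumptions; the fallback simply preserves the single-tab behavior when the API is missing:

```typescript
// Sketch: guarding the sync runner with the Web Locks API, with a fallback.
// runWithLeaderLock and the 'sync-runner' lock name are illustrative choices.
async function runWithLeaderLock(runSyncOnce: () => Promise<void>): Promise<boolean> {
  const locks = (globalThis as any).navigator?.locks;
  if (!locks) {
    // No Web Locks API (older browser / non-browser): run unguarded,
    // relying on the single-tab assumption the article describes.
    await runSyncOnce();
    return true;
  }
  // ifAvailable: do not queue behind another tab; skip this cycle instead.
  const ran = await locks.request('sync-runner', { ifAvailable: true }, async (lock: unknown) => {
    if (!lock) return false; // another tab holds the lock
    await runSyncOnce();
    return true;
  });
  return ran === true;
}
```

The lock is released automatically when the callback's promise settles, which removes the TTL bookkeeping the IndexedDB lock needs.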
Grouping by URL, method, and entityType — not by "general queue"
The runner does not send all pending items to a single endpoint. It groups by URL, HTTP method, and entity type:
/**
* Do NOT assume the entire batch uses the same endpoint.
* In real systems, "orders", "inventory", and "payments" almost always have different URLs.
*/
const byUrl = groupBy(pending, (x) => x.url);
for (const [url, urlItems] of byUrl.entries()) {
  const byMethod = groupBy(urlItems, (x) => x.method);
  for (const [method, methodItems] of byMethod.entries()) {
    const byEntity = groupBy(methodItems, (x) => x.entityType);
    for (const [entityType, entityItems] of byEntity.entries()) {
      const chunks = chunkByMaxPayloadSize({ entityType, deviceId, items: entityItems, maxBytes: maxBatchPayloadSize });
      // sends each chunk
    }
  }
}
The cost: more requests per sync cycle. The benefit: separate API contracts per entity are preserved, and an error in payments does not block orders.
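chunkByMaxPayloadSize is referenced but not shown. Stripped down to the size logic only (the real signature also takes entityType and deviceId), a sketch might look like this. Note that an oversized single item still ends up alone in its own chunk, which is exactly the chunk.length === 1 case the runner dead-letters:

```typescript
// Sketch of the size-based chunking logic only (an assumption, not repo code).
// Measures each item's serialized size and starts a new chunk when the cap
// would be exceeded; never splits a single item.
function chunkByMaxBytes<T>(items: T[], maxBytes: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;
  for (const item of items) {
    const itemBytes = new TextEncoder().encode(JSON.stringify(item)).length;
    if (current.length > 0 && currentBytes + itemBytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    // An item larger than maxBytes still lands alone in its own chunk,
    // so the runner can detect it and dead-letter instead of retrying.
    current.push(item);
    currentBytes += itemBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Measuring encoded bytes (not string length) matters: multi-byte characters in payloads would otherwise slip past the server's limit.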
JSON.stringify is not sufficient for deduplication
The non-obvious claim: JSON.stringify serializes keys in property insertion order, not in any canonical order. If you use the result as a hash to detect duplicates, two semantically identical objects whose keys were inserted in different orders produce different hashes.
The proof lies in the fact that the same stableStringify function was implemented independently on the client and on the server — same logic, no shared code:
// client (apps/web/src/lib/sync/enqueue.ts)
function stableStringify(value: unknown): string {
  const keys = Object.keys(value as object).sort();
  // ...
}

// server (apps/api/src/pos-sync/orders.repo.ts)
function stableStringify(value: unknown): string {
  const keys = Object.keys(value).sort();
  // ...
}
The duplication is the evidence: if JSON.stringify were sufficient, neither of them would need to exist. You can verify this by replacing stableStringify with JSON.stringify in the deduplication tests and watching them fail.
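For completeness, here is one way the elided bodies could be filled in recursively. This is an assumption about what the "..." hides, not the repo's exact code:

```typescript
// Sketch of a recursive stableStringify (an assumed implementation).
// Produces JSON with object keys in sorted order at every nesting level.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  const keys = Object.keys(value as object).sort();
  const body = keys
    .map((k) => `${JSON.stringify(k)}:${stableStringify((value as Record<string, unknown>)[k])}`)
    .join(',');
  return `{${body}}`;
}
```

With this version, hashing the result gives the same digest regardless of how the object was constructed, which is the property the deduplication logic depends on.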
What this article does not cover
This article covers the local queue and its state machine. It does not cover:
- Multi-device synchronization (how to resolve conflicts when two devices edit the same record offline)
- Conflict resolution (CRDT, last-write-wins, or custom merge)
- Bidirectional synchronization (server to client)
- Service Workers as an alternative offline persistence mechanism
For bidirectional sync with conflict resolution, the starting point is Electric SQL or PowerSync. For Service Workers as the primary path, the Workbox documentation covers the Background Sync API.
Conclusion: idempotency is not optional when two devices compete
You may have understood the title ("failure is not binary") and thought the solution was simply adding more states. But there is a problem that only surfaces after you understand both the queue and server-side idempotency: two offline devices can send the same externalId simultaneously when they come back online.
The correct handling is not "the second request fails with 409". It is:
try {
  await this.prisma.order.create({ data: { externalId, syncStatus: 'IN_FLIGHT', ... } });
  return { status: 'created' };
} catch (error) {
  // Race condition: another process created the record between findUnique and create
  const raced = await this.prisma.order.findUnique({ where: { externalId } });
  if (!raced) throw error; // real error, not a race condition
  return this.updateExisting(...); // treat as idempotent update
}
This article's introduction could not have described this snippet. The race condition only emerges once you already understand that IN_FLIGHT on the server (during create) can coexist with IN_FLIGHT on the client (during transmission) — and that these are two different devices trying to create the same record.
The next step: if you already have a working offline sync implementation, add an IN_FLIGHT state with a timestamp and a recovery job for stuck items. That is the smallest delta that prevents the most common class of silently lost data. The recovery code is in this repository — the requeueStaleInFlight function can be adapted to any IndexedDB stack.
References
- Repository: IndexGrid/offline-first-sync-queue
- SyncStatusSchema: packages/sync-contract/src/index.ts
- requeueStaleInFlight: apps/web/src/lib/sync/runner.ts
- shouldRetry: apps/web/src/lib/sync/retry.ts
- upsertByExternalId: apps/api/src/pos-sync/orders.repo.ts
- Web Locks API: MDN / caniuse
- Background Sync API: Chrome Developers