Audience: this article is for frontend engineers with React experience who have already implemented a "save locally and sync later" approach, but have never had to deal with the failures that surface in production — growing payloads, crashes during transmission, and devices that come back online after hours disconnected.
When you implement offline-first for the first time, the solution seems obvious:
- Save to IndexedDB
- When online, send to the server
- If it fails, try again
The problem is step 3.
"Try again" is a state machine disguised as a conditional. And when you don't model the states explicitly, the machine exists anyway — it's just not under your control.
The thesis of this article is specific: failure in offline synchronization is not a binary state. It is at least six states with precise operational semantics and deterministic transitions. Treating those states as binary guarantees infinite retry loops, silently lost data, and bugs that only appear when the user is in the field without internet, 300 km away from you.
The concrete problem: the item that never leaves the queue
Consider an offline POS (point of sale) system. The cashier records orders with item photos embedded as base64 in the payload. Under normal conditions, each order is ~20 KB. But a special promotion requires a higher-resolution photo — the payload grows to 400 KB.
The server has a 256 KB limit per request. The behavior with binary synchronization:
- Item enters the queue with status "pending"
- Device comes online
- Runner tries to send — server returns 413
- Runner marks as "error" and schedules retry
- Retry tries again — 413
- Infinite loop
The item never leaves the queue. It never will. And there is no visible error log, because from the runner's point of view it is still "trying". The POS operator has no idea that the sale will never reach the server.
This scenario is not hypothetical. It is documented in the code of offline-first-sync-queue — and the solution reveals why explicit states are the only defense:
if (chunk.length === 1 && size > maxBatchPayloadSize) {
  await markDeadLocalTooLarge(db, chunk[0], size, maxBatchPayloadSize);
  sent += 1;
  dead += 1;
  continue;
}
The item does not get a retry. It gets a different state — DEAD_LETTER — with the exact error preserved for diagnosis:
lastError: `payload_too_large_local:${computedBytes}>${maxBytes}`,
Now the item is visible, diagnosable, and the loop has been avoided. But this is only possible because a DEAD_LETTER state exists separately from RETRYABLE_ERROR.
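The body of markDeadLocalTooLarge is not shown here. As a sketch, assuming the item fields used elsewhere in this article (status, lastError), the update it writes might look like the following; the deadAt field is a hypothetical addition for diagnostics:

```typescript
// Hypothetical sketch of the update markDeadLocalTooLarge persists.
// Field names mirror those used elsewhere in the article; deadAt is an assumption.
function deadLetterTooLargeUpdate(itemBytes: number, maxBytes: number) {
  return {
    status: 'DEAD_LETTER' as const,
    // Preserve the exact failure for later diagnosis, as in the article's snippet.
    lastError: `payload_too_large_local:${itemBytes}>${maxBytes}`,
    deadAt: Date.now(), // assumed timestamp for "when did we give up"
  };
}
```

The key property is that the failure reason is data, not a log line: it travels with the item and survives restarts.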
The state machine you are already implementing (without knowing it)
If your queue has retry logic, you have already made implicit decisions about state transitions. The question is whether those decisions are encoded or scattered across conditionals.
The project's sync-contract defines the states explicitly via Zod:
export const SyncStatusSchema = z.enum([
  'PENDING',         // initial state, ready to be sent
  'IN_FLIGHT',       // sent to the server, acknowledgment not yet received
  'SYNCED',          // successfully acknowledged by the server
  'RETRYABLE_ERROR', // temporary failure, should retry
  'FATAL_ERROR',     // permanent failure (e.g., validation), should not retry
  'DEAD_LETTER',     // terminal state after max retries or fatal error
]);
A reasonable engineer could argue: "three states are enough — pending, synced, error". The rest of this article proves why IN_FLIGHT, RETRYABLE_ERROR, and DEAD_LETTER need to be separate states — not subcategories of "error".
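Before that, it helps to make "deterministic transitions" concrete. The table below is my sketch, not the repo's code; it encodes which moves between the six states are legal, and makes SYNCED and DEAD_LETTER terminal by construction:

```typescript
// Sketch of a transition table for the six states (an assumption, not repo code).
type SyncStatus =
  | 'PENDING' | 'IN_FLIGHT' | 'SYNCED'
  | 'RETRYABLE_ERROR' | 'FATAL_ERROR' | 'DEAD_LETTER';

const TRANSITIONS: Record<SyncStatus, SyncStatus[]> = {
  PENDING: ['IN_FLIGHT'],
  IN_FLIGHT: ['SYNCED', 'RETRYABLE_ERROR', 'FATAL_ERROR', 'DEAD_LETTER'],
  RETRYABLE_ERROR: ['IN_FLIGHT', 'DEAD_LETTER'],
  FATAL_ERROR: ['DEAD_LETTER'],
  SYNCED: [],      // terminal: nothing more to do
  DEAD_LETTER: [], // terminal: requires human intervention, not code
};

function canTransition(from: SyncStatus, to: SyncStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Guarding every status write with a check like this turns "scattered conditionals" into a single enforceable invariant.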
Why IN_FLIGHT needs to exist
Without IN_FLIGHT, you have no idea what happened to an item when the device crashes during transmission.
The scenario: the runner picks the item from the queue, sends it to the server, and the device loses power before receiving confirmation. On the next startup, the item is still PENDING. The runner tries to send it again. The server may have processed it the first time — now you have a duplicate.
The solution is to mark the item as IN_FLIGHT before sending, and recover items stuck in that state based on a timestamp:
async function requeueStaleInFlight(
  db: Awaited<ReturnType<typeof getDB>>,
  staleAfterMs: number,
) {
  const now = Date.now();
  const tx = db.transaction('syncQueue', 'readwrite');
  const store = tx.objectStore('syncQueue');
  const byStatus = store.index('by-status');
  let cursor = await byStatus.openCursor('IN_FLIGHT');
  while (cursor) {
    const item = cursor.value as SyncQueueItem;
    const inFlightAt = item.inFlightAt ?? 0;
    if (inFlightAt > 0 && inFlightAt + staleAfterMs < now) {
      await cursor.update({
        ...item,
        status: 'RETRYABLE_ERROR',
        nextAttemptAt: now,
        inFlightAt: undefined,
        lastError: 'stale_in_flight',
      });
    }
    cursor = await cursor.continue();
  }
  await tx.done;
}
Three visible decisions in this snippet:
- The by-status index avoids scanning the entire queue on every recovery run
- inFlightAt is a timestamp for calculating staleness, not a boolean
- The transition goes to RETRYABLE_ERROR with nextAttemptAt: now — immediate retry, not backoff, because the stall was local, not a server failure
Why RETRYABLE_ERROR and FATAL_ERROR need to be separate states
Not every failure deserves a retry. The decision logic lives in shouldRetry:
export function shouldRetry(status?: number) {
  if (status === undefined) return true; // no response = network failure, retry
  if (status === 408) return true;       // request timeout, retry
  // invalid payload, no authorization, or oversized: do NOT auto-retry
  if (status === 400 || status === 401 || status === 403 || status === 413) return false;
  if (status === 429) return true;       // rate limit, retry with backoff
  if (status >= 500) return true;        // server error, retry
  return false;
}
The non-obvious point: undefined is not a vague "no response" — it is the specific absence of an HTTP status, which indicates a network failure (not a server failure). 413 is listed alongside 400/401/403 because an oversized payload with automatic retry would be a design bug, not a resilience strategy. You would be retrying something that will never pass — the POS scenario described above.
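For the retryable branch, the article does not show how the next attempt is scheduled. A common approach, sketched here as an assumption (the repo's actual backoff policy may differ), is exponential backoff with full jitter, so that a fleet of devices reconnecting at once does not stampede the server:

```typescript
// Sketch: exponential backoff with full jitter (an assumption, not repo code).
// attempt is zero-based; baseMs and capMs are illustrative defaults.
function nextAttemptDelayMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  // Full jitter: pick a random point in [0, exp) so retries spread out.
  return Math.floor(Math.random() * exp);
}

// Usage: after a retryable failure,
// item.nextAttemptAt = Date.now() + nextAttemptDelayMs(item.attempts);
```

Note how this differs from the stale-IN_FLIGHT recovery above, which deliberately sets nextAttemptAt: now because the stall was local.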
Why DEAD_LETTER needs to be a terminal state
DEAD_LETTER exists for items that should no longer be processed, but that should also not be silently deleted. An item in DEAD_LETTER is evidence of a problem that requires human intervention or different code — not more retries.
Without this state, the alternatives are worse:
- Delete the item → you lost data
- Keep it as FATAL_ERROR → you cannot distinguish "failed with 400" from "failed 10 times and we gave up"
- Infinite loop → the POS scenario
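The decision of when an item crosses into DEAD_LETTER can live in one pure function. This is a sketch under assumed names (nextStatusAfterFailure, maxAttempts, and the item shape are mine, not the repo's), combining the shouldRetry verdict with an attempt budget:

```typescript
// Sketch: deciding the next status after a failed send attempt.
// The item shape and maxAttempts default are assumptions for illustration.
interface QueueItem {
  status: string;
  attempts: number;   // attempts made so far, before this failure
  lastError?: string;
}

function nextStatusAfterFailure(
  item: QueueItem,
  retryable: boolean,   // e.g., the result of shouldRetry(status)
  maxAttempts = 10,
): 'RETRYABLE_ERROR' | 'DEAD_LETTER' {
  if (!retryable) return 'DEAD_LETTER';                        // fatal: never retry
  if (item.attempts + 1 >= maxAttempts) return 'DEAD_LETTER';  // budget exhausted
  return 'RETRYABLE_ERROR';                                    // schedule another try
}
```

Keeping this in one function means the "failed with 400" vs "failed 10 times" distinction is encoded once, not rediscovered in every call site.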
Trade-offs you need to know before adopting this architecture
The lock is single-tab — by design
The lock mechanism that prevents the sync runner from running in parallel has an explicit limitation in the JSDoc:
/**
* Simple IndexedDB lock to prevent running the runner in parallel.
* Returns `null` when the lock is already held.
*
* Note: the lock is "best effort" (single-tab). For multi-tab, prefer:
* - BroadcastChannel + leader election, or
* - Web Locks API (when available), or
* - a lock with "ownerId" (tabId) + heartbeats.
*/
If your use case involves multiple tabs open simultaneously (common in management dashboards), this lock is not sufficient. The Web Locks API solves this, but has limited support in some browsers. BroadcastChannel with leader election is more portable but adds coordination complexity.
The simplicity trade-off (TTL + IndexedDB, zero dependencies) was made consciously for the mobile POS scenario, where multiple tabs are unlikely. For your scenario, it may not be the right choice.
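If you do need multi-tab safety, the Web Locks variant suggested in the JSDoc can be sketched like this. The function name, lock name, and fallback behavior are my assumptions; the fallback simply preserves the single-tab behavior when the API is missing:

```typescript
// Sketch: guarding the sync runner with the Web Locks API, with a fallback.
// runWithLeaderLock and the 'sync-runner' lock name are illustrative choices.
async function runWithLeaderLock(runSyncOnce: () => Promise<void>): Promise<boolean> {
  const locks = (globalThis as any).navigator?.locks;
  if (!locks) {
    // No Web Locks API (older browser / non-browser): run unguarded,
    // relying on the single-tab assumption the article describes.
    await runSyncOnce();
    return true;
  }
  // ifAvailable: do not queue behind another tab; skip this cycle instead.
  const ran = await locks.request('sync-runner', { ifAvailable: true }, async (lock: unknown) => {
    if (!lock) return false; // another tab holds the lock
    await runSyncOnce();
    return true;
  });
  return ran === true;
}
```

The lock is released automatically when the callback's promise settles, which removes the TTL bookkeeping the IndexedDB lock needs.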
Grouping by URL, method, and entityType — not by "general queue"
The runner does not send all pending items to a single endpoint. It groups by URL, HTTP method, and entity type:
/**
* Do NOT assume the entire batch uses the same endpoint.
* In real systems, "orders", "inventory", and "payments" almost always have different URLs.
*/
const byUrl = groupBy(pending, (x) => x.url);
for (const [url, urlItems] of byUrl.entries()) {
  const byMethod = groupBy(urlItems, (x) => x.method);
  for (const [method, methodItems] of byMethod.entries()) {
    const byEntity = groupBy(methodItems, (x) => x.entityType);
    for (const [entityType, entityItems] of byEntity.entries()) {
      const chunks = chunkByMaxPayloadSize({ entityType, deviceId, items: entityItems, maxBytes: maxBatchPayloadSize });
      // sends each chunk
    }
  }
}
The cost: more requests per sync cycle. The benefit: separate API contracts per entity are preserved, and an error in payments does not block orders.
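chunkByMaxPayloadSize is referenced but not shown. Stripped down to the size logic only (the real signature also takes entityType and deviceId), a sketch might look like this. Note that an oversized single item still ends up alone in its own chunk, which is exactly the chunk.length === 1 case the runner dead-letters:

```typescript
// Sketch of the size-based chunking logic only (an assumption, not repo code).
// Measures each item's serialized size and starts a new chunk when the cap
// would be exceeded; never splits a single item.
function chunkByMaxBytes<T>(items: T[], maxBytes: number): T[][] {
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 0;
  for (const item of items) {
    const itemBytes = new TextEncoder().encode(JSON.stringify(item)).length;
    if (current.length > 0 && currentBytes + itemBytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 0;
    }
    // An item larger than maxBytes still lands alone in its own chunk,
    // so the runner can detect it and dead-letter instead of retrying.
    current.push(item);
    currentBytes += itemBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Measuring encoded bytes (not string length) matters: multi-byte characters in payloads would otherwise slip past the server's limit.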
JSON.stringify is not sufficient for deduplication
The non-obvious claim: JSON.stringify serializes keys in property insertion order, not in any canonical order. If you use the result as a hash to detect duplicates, two semantically identical objects whose keys were inserted in different orders produce different hashes.
The proof lies in the fact that the same stableStringify function was implemented independently on the client and on the server — same logic, no shared code:
// client (apps/web/src/lib/sync/enqueue.ts)
function stableStringify(value: unknown): string {
  const keys = Object.keys(value as object).sort();
  // ...
}

// server (apps/api/src/pos-sync/orders.repo.ts)
function stableStringify(value: unknown): string {
  const keys = Object.keys(value).sort();
  // ...
}
The duplication is the evidence: if JSON.stringify were sufficient, neither of them would need to exist. You can verify this by replacing stableStringify with JSON.stringify in the deduplication tests and watching them fail.
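For completeness, here is one way the elided bodies could be filled in recursively. This is an assumption about what the "..." hides, not the repo's exact code:

```typescript
// Sketch of a recursive stableStringify (an assumed implementation).
// Produces JSON with object keys in sorted order at every nesting level.
function stableStringify(value: unknown): string {
  if (value === null || typeof value !== 'object') return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(',')}]`;
  const keys = Object.keys(value as object).sort();
  const body = keys
    .map((k) => `${JSON.stringify(k)}:${stableStringify((value as Record<string, unknown>)[k])}`)
    .join(',');
  return `{${body}}`;
}
```

With this version, hashing the result gives the same digest regardless of how the object was constructed, which is the property the deduplication logic depends on.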
What this article does not cover
This article covers the local queue and its state machine. It does not cover:
- Multi-device synchronization (how to resolve conflicts when two devices edit the same record offline)
- Conflict resolution (CRDT, last-write-wins, or custom merge)
- Bidirectional synchronization (server to client)
- Service Workers as an alternative offline persistence mechanism
For bidirectional sync with conflict resolution, the starting point is Electric SQL or PowerSync. For Service Workers as the primary path, the Workbox documentation covers the Background Sync API.
Conclusion: idempotency is not optional when two devices compete
You may have understood the title ("failure is not binary") and thought the solution was simply adding more states. But there is a problem that only surfaces after you understand both the queue and server-side idempotency: two offline devices can send the same externalId simultaneously when they come back online.
The correct handling is not "the second request fails with 409". It is:
try {
  await this.prisma.order.create({ data: { externalId, syncStatus: 'IN_FLIGHT', ... } });
  return { status: 'created' };
} catch (error) {
  // Race condition: another process created the record between findUnique and create
  const raced = await this.prisma.order.findUnique({ where: { externalId } });
  if (!raced) throw error; // real error, not a race condition
  return this.updateExisting(...); // treat as idempotent update
}
This article's introduction could not have described this snippet. The race condition only emerges once you already understand that IN_FLIGHT on the server (during create) can coexist with IN_FLIGHT on the client (during transmission) — and that these are two different devices trying to create the same record.
The next step: if you already have a working offline sync implementation, add an IN_FLIGHT state with a timestamp and a recovery job for stuck items. That is the smallest delta that prevents the most common class of silently lost data. The recovery code is in this repository — the requeueStaleInFlight function can be adapted to any IndexedDB stack.
References
- Repository: IndexGrid/offline-first-sync-queue
- SyncStatusSchema: packages/sync-contract/src/index.ts
- requeueStaleInFlight: apps/web/src/lib/sync/runner.ts
- shouldRetry: apps/web/src/lib/sync/retry.ts
- upsertByExternalId: apps/api/src/pos-sync/orders.repo.ts
- Web Locks API: MDN / caniuse
- Background Sync API: Chrome Developers