DEV Community

Siddhant Jain

Posted on • Originally published at keelstack.me

The Silent Job Loss: Why Your Node.js SaaS Needs a Persistent Task Queue

567 tests. 93.13% coverage. Here's what they protect.


A user pays. Your server receives the Stripe webhook. You fire off an async task to generate their report. Thirty seconds later you deploy a hotfix.

The report is never generated. The user is charged. Nobody gets an error. You find out three days later in a support ticket.

This is not a theoretical failure mode. It is the default behavior of every Node.js backend that queues work in memory.


Part 1: Memory Is Volatile

The most common pattern for async work in Node.js looks like this:

// User pays → webhook fires → kick off async work
app.post('/webhook', (req, res) => {
  const event = req.body;
  // Fire and forget — the job lives only in this process's memory
  generateReport(event.userId, event.reportId);
  return res.status(200).json({ received: true });
});

async function generateReport(userId: string, reportId: string) {
  // This lives entirely in process memory
  const data = await fetchUserData(userId);
  const report = await callLLM(data);
  await saveReport(reportId, report);
}

This works perfectly in development. It fails silently in production for three reasons:

Deployments. Every deploy kills the running process. Any in-flight generateReport call dies mid-execution. No error is thrown anywhere visible. The job is gone.

Crashes. An unhandled exception or OOM kill takes every in-flight job with it. Same silent outcome.

Scaling. The moment you run two processes (two dynos, two containers), there is no coordination. A job kicked off in process A can only run in process A. Process B has no knowledge it exists.

The fix is not complicated in concept: write the job to durable storage before you start executing it. That way, if the process dies, the job survives. On restart, you find it and finish it.
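The durable pattern can be sketched like this. The `jobStore` object and its `enqueue` method are illustrative names, not a specific library API — the point is only the ordering: persist first, acknowledge second. A Map stands in for whatever storage actually survives a restart (a Postgres table, a Redis key).

```typescript
interface JobInput {
  id: string;
  name: string;
  payload: Record<string, unknown>;
}

// Minimal durable-store stand-in. In production, enqueue() would be an
// INSERT or Redis SET that must succeed before the webhook is acknowledged.
const jobStore = {
  persisted: new Map<string, JobInput>(),
  async enqueue(job: JobInput): Promise<void> {
    this.persisted.set(job.id, job);
  },
};

// Hypothetical webhook handler shape: persist BEFORE any async work starts.
async function webhookHandler(event: { userId: string; reportId: string }) {
  // 1. If this write fails, return a 5xx so Stripe retries the webhook —
  //    either way, the job is never silently lost.
  await jobStore.enqueue({
    id: event.reportId,
    name: 'generate-report',
    payload: { userId: event.userId },
  });
  // 2. Only now acknowledge. A deploy mid-report no longer loses the job.
  return { status: 200, body: { received: true } };
}
```

A worker (possibly a different process entirely) later picks the job up from storage and executes it.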

The hard part is doing this correctly — specifically, making the claim step atomic so two workers cannot grab the same job at the same time.


Part 2: The Atomic Claim Problem

The naive approach to claiming a job looks like this:

-- Worker 1 and Worker 2 both run this simultaneously
UPDATE jobs
SET status = 'processing', claimed_at = NOW()
WHERE id = (
  SELECT id FROM jobs WHERE status = 'pending' LIMIT 1
)

At low load, this works. Under concurrency it doesn't. Two workers can both execute the subquery and get the same row before either has written processing. You get double-processing: the same report generated twice, the same email sent twice, the same billing event fired twice.

The standard fix in Postgres is FOR UPDATE SKIP LOCKED:

UPDATE jobs
SET status = 'processing', claimed_at = NOW()
WHERE id = (
  SELECT id FROM jobs
  WHERE status = 'pending'
  ORDER BY created_at ASC
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
RETURNING *

FOR UPDATE takes a row-level lock. SKIP LOCKED tells any other worker that hits a locked row to skip it rather than wait. The result: each worker atomically claims a different job. No deadlocks, no double-processing, regardless of how many workers are running.
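A worker loop built around that query might look like the sketch below. `QueryFn` is a hypothetical stand-in for your Postgres client's query method (e.g. `pg`'s `pool.query`), so the shape is visible without a live database; the loop and names are illustrative, not KeelStack code.

```typescript
// The atomic claim query from above, as a constant.
const CLAIM_SQL = `
  UPDATE jobs
  SET status = 'processing', claimed_at = NOW()
  WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at ASC
    FOR UPDATE SKIP LOCKED
    LIMIT 1
  )
  RETURNING *
`;

type QueryFn = (sql: string) => Promise<{ rows: Record<string, unknown>[] }>;

// Returns the claimed row, or null when every pending job is locked or done.
async function claimNext(query: QueryFn) {
  const { rows } = await query(CLAIM_SQL);
  return rows[0] ?? null;
}

// A polling worker: claim → execute → mark done, sleep when the queue is empty.
async function workerLoop(
  query: QueryFn,
  handle: (job: Record<string, unknown>) => Promise<void>,
) {
  for (;;) {
    const job = await claimNext(query);
    if (!job) {
      await new Promise((r) => setTimeout(r, 1000)); // idle backoff
      continue;
    }
    await handle(job); // then: UPDATE jobs SET status = 'completed' WHERE id = ...
  }
}
```

Because `SKIP LOCKED` makes the claim safe under concurrency, you can run as many copies of this loop as you want.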

KeelStack does not use Postgres for the job store (it runs without a database in zero-config mode). It uses Redis. But the same guarantee is needed, and Redis provides it through Lua scripts.

Here is the actual claim() implementation in RedisPersistentJobStore:

async claim(jobId: string): Promise<PersistedJob | null> {
  await this.connect();
  const luaScript = `
    local data = redis.call('GET', KEYS[1])
    if not data then return nil end
    local job = cjson.decode(data)
    if job.state ~= 'pending' then return nil end
    job.state = 'processing'
    job.claimedAt = ARGV[1]
    redis.call('SET', KEYS[1], cjson.encode(job), 'EX', ARGV[2])
    return cjson.encode(job)
  `;
  const result = await this.client.eval(
    luaScript, 1, this.key(jobId),
    new Date().toISOString(), String(JOB_TTL),
  ) as string | null;

  if (!result) return null;
  return JSON.parse(result) as PersistedJob;
}

The Lua script runs atomically inside Redis's single-threaded executor. Between the GET and the SET, nothing else can run. No other worker can see the job as pending and claim it. Exactly one caller gets the job back. Everyone else gets null.

The in-memory implementation (used in development and tests) gets the same guarantee for free because JavaScript's event loop is single-threaded:

async claim(jobId: string): Promise<PersistedJob | null> {
  const job = this.jobs.get(jobId);
  if (!job || job.state !== 'pending') return null;
  // Single-threaded JS: this read-modify-write is atomic within one process
  job.state = 'processing';
  job.claimedAt = new Date().toISOString();
  return { ...job };
}

The test that verifies this contract fires 20 concurrent claim attempts and asserts exactly one wins:

it('claim() concurrency — only one of N concurrent callers wins', async () => {
  await store.enqueue(makeJobInput());
  const results = await Promise.all(
    Array.from({ length: 20 }, () => store.claim('job-001')),
  );
  const winners = results.filter(Boolean);
  expect(winners).toHaveLength(1);
});

Part 3: Exponential Backoff and the Dead-Letter Log

Once a job is claimed, it runs. Sometimes the handler fails. The question is: what do you do next?

The worst answer is: retry immediately. If an LLM provider is rate-limiting you, hammering it again in the same second makes the situation worse for everyone. If your database just had a connection timeout, you want to give it time to recover. Retrying immediately into a recovering system causes the Thundering Herd problem: every waiting job piles in at once, overloading whatever just came back up.

The RetryableJobRunner uses exponential backoff with jitter:

function exponentialDelay(attempt: number, baseMs: number, maxMs: number): number {
  // Jitter: randomize ±20% to spread retries across instances
  const jitter = 1 + (Math.random() * 0.4 - 0.2);
  const delay = baseMs * Math.pow(2, attempt) * jitter;
  return Math.min(delay, maxMs);
}

With baseDelayMs: 250 and maxDelayMs: 30_000, the delays look like this:

Attempt   Base delay   With jitter (approx)
1         500ms        400–600ms
2         1,000ms      800ms–1.2s
3         2,000ms      1.6–2.4s
4         4,000ms      3.2–4.8s
5         8,000ms      6.4–9.6s
Cap       30s          —

The jitter is important. Without it, every worker that got rate-limited at the same moment retries at exactly the same time. With jitter, they spread out across a window, smoothing the load on whatever they are calling.
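A quick sanity check of that window. The block redefines `exponentialDelay` from above so it is self-contained; for attempt 3 with `baseMs: 250`, the undithered delay is 250 × 2³ = 2,000ms, so every sample must land in roughly 1,600–2,400ms, and no two workers fire at exactly the same instant.

```typescript
function exponentialDelay(attempt: number, baseMs: number, maxMs: number): number {
  // Jitter: randomize ±20% to spread retries across instances
  const jitter = 1 + (Math.random() * 0.4 - 0.2);
  const delay = baseMs * Math.pow(2, attempt) * jitter;
  return Math.min(delay, maxMs);
}

// 20 workers retrying attempt 3 spread across the 1,600–2,400ms window.
const samples = Array.from({ length: 20 }, () => exponentialDelay(3, 250, 30_000));

// For late attempts the cap always wins: 250 * 2^10 * 0.8 > 30,000.
const capped = exponentialDelay(10, 250, 30_000); // always exactly 30000
```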

The non-retryable escape hatch. Not all errors deserve retries. If a user submits malformed data and your handler throws a validation error, retrying five times wastes four attempts and delays the dead-letter signal by minutes. The NonRetryableError class handles this:

export class NonRetryableError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'NonRetryableError';
  }
}

// In your handler:
if (!isValidPayload(job.payload)) {
  throw new NonRetryableError('Malformed report payload — check input schema');
}

When the runner catches a NonRetryableError, it skips the remaining attempts and goes straight to the dead-letter log:

if (error instanceof NonRetryableError || error.name === 'NonRetryableError') {
  this.logDeadLetter(job, attempt, error, 'non_retryable');
  throw error;
}

When maxAttempts is exhausted through normal retries, the same dead-letter path fires:

// All attempts exhausted — emit dead-letter signal
this.logDeadLetter(job, this.options.maxAttempts, lastError!, 'max_attempts_exceeded');

The dead-letter log output is structured JSON:

{
  "level": "error",
  "event": "job.dead_letter",
  "jobId": "report-abc-123",
  "jobName": "generate-report",
  "attempt": 5,
  "reason": "max_attempts_exceeded",
  "error": "LLM provider timeout after 30000ms"
}

Filter on event = 'job.dead_letter' in Datadog, CloudWatch, or any structured log sink to get immediate alerts when jobs exhaust their retries. This is how you find out about silent failures before users report them.
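The same filter can be sketched locally over a JSON-lines log stream. The `LogLine` shape below mirrors the dead-letter record above; `deadLetters` is a hypothetical helper for illustration, not a KeelStack API.

```typescript
// Pull dead-letter events out of a JSON-lines log dump.
interface LogLine {
  level: string;
  event: string;
  jobId?: string;
  reason?: string;
  error?: string;
}

function deadLetters(logText: string): LogLine[] {
  return logText
    .split('\n')
    .filter((line) => line.trim().length > 0)       // skip blank lines
    .map((line) => JSON.parse(line) as LogLine)
    .filter((entry) => entry.event === 'job.dead_letter');
}
```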


Part 4: The Crash Test

The full lifecycle claim → execute → crash → recover is tested in worker.crash.test.ts. Here is the core scenario:

it('Orphaned job (stuck in processing) — recoverOrphans should reset to pending', async () => {
  const jobId = `crash_job_${Math.random().toString(36).substring(7)}`;

  // 1. Enqueue the job
  await store.enqueue({
    id: jobId,
    name: 'billing-sync',
    event: 'billing.subscription.created',
    payload: { tenantId: 'tenant_crash_1' },
    maxAttempts: 3,
  });

  // 2. Worker claims it — job is now in 'processing' state
  const claimedJob = await store.claim(jobId);
  expect(claimedJob!.state).toBe('processing');

  // 3. Simulate the crash: worker never calls complete() or fail()
  //    Backdate claimedAt to make it look like the crash happened 61 seconds ago
  const internalStore = store as any;
  const j = internalStore.jobs.get(jobId);
  j.claimedAt = new Date(Date.now() - 61_000).toISOString();

  // 4. Recovery scan runs (as it would on the next server boot)
  const recovered = await store.recoverOrphans(60_000);

  // 5. Job is back in 'pending' — available to be claimed and finished
  expect(recovered.length).toBe(1);
  expect(recovered[0].state).toBe('pending');

  // 6. A new worker can now claim and complete it
  const reClaimed = await store.claim(jobId);
  expect(reClaimed).not.toBeNull();
});

The recovery mechanism is straightforward: on server startup (and optionally on a periodic tick), recoverOrphans(timeoutMs) scans for jobs that have been in processing state longer than timeoutMs. Any job older than that threshold is assumed to belong to a dead worker and is reset to pending, preserving the attempt count.
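The scan can be sketched for the in-memory store as follows. This is a minimal sketch of the behavior just described, not KeelStack's implementation; the `attempts` field name is an assumption (the post only shows `maxAttempts`).

```typescript
interface PersistedJob {
  id: string;
  state: 'pending' | 'processing' | 'completed' | 'failed';
  claimedAt?: string;
  attempts: number;     // assumed field name for the preserved attempt count
  maxAttempts: number;
}

function recoverOrphans(jobs: Map<string, PersistedJob>, timeoutMs: number): PersistedJob[] {
  const cutoff = Date.now() - timeoutMs;
  const recovered: PersistedJob[] = [];
  for (const job of jobs.values()) {
    if (job.state !== 'processing' || !job.claimedAt) continue;
    if (new Date(job.claimedAt).getTime() > cutoff) continue; // worker may still be alive
    if (job.attempts >= job.maxAttempts) {
      job.state = 'failed';   // never re-queue a job that can no longer succeed
      continue;
    }
    job.state = 'pending';    // attempt count is preserved
    delete job.claimedAt;
    recovered.push(job);
  }
  return recovered;
}
```

The `claimedAt` timestamp written at claim time is what makes this possible: it is the only evidence that a worker ever existed.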

A separate test covers the edge case where a crashed job has already exhausted its retries. This one is important — without it, you could end up endlessly re-queuing jobs that will never succeed:

it('Orphaned job at max attempts — must go to failed (not pending) after recovery', async () => {
  // (setup elided: enqueue jobId with maxAttempts: 3, then grab the
  //  internal job object `j`, as in the previous test)

  // Simulate 3 failed attempts
  await store.fail(jobId, 'Attempt 1 failed');
  await store.fail(jobId, 'Attempt 2 failed');
  await store.fail(jobId, 'Attempt 3 failed — maxAttempts reached');

  // Even if it ends up orphaned in 'processing' state...
  j.state = 'processing';
  j.claimedAt = new Date(Date.now() - 61_000).toISOString();

  const recovered = await store.recoverOrphans(60_000);

  // ...recovery must NOT re-queue it. It should stay 'failed'.
  // (finalState is re-read from the store after the recovery scan)
  expect(recovered.length).toBe(0);
  expect(finalState?.state).toBe('failed');
});

The full runner is also tested at the integration level with a simulated crash mid-execution:

it('RetryableJobRunner: crash on attempt 1, recover on attempt 2', async () => {
  let attempts = 0;

  const handler = vi.fn(async () => {
    attempts++;
    if (attempts <= 1) {
      throw new Error(`Simulated worker crash on attempt ${attempts}`);
    }
    return { ok: true as const };
  });

  const runner = new RetryableJobRunner(handler, {
    maxAttempts: 3,
    baseDelayMs: 1,
    maxDelayMs: 10,
  });

  await expect(runner.run(job)).resolves.toBeUndefined();
  expect(attempts).toBe(2); // 1 crash + 1 success
});

What This Protects In Practice

The silent job loss scenario from the top of this post is exactly what these components prevent:

  1. User pays → webhook fires → generateReport is enqueued to PersistentJobStore before any async work starts
  2. Job is persisted in Redis (or in-memory in development) with state pending
  3. Worker claims it atomically — Lua script ensures only one worker gets it
  4. Deploy happens mid-execution → process dies → job stays in processing
  5. New process starts → recoverOrphans runs → job is reset to pending with attempt count intact
  6. Worker claims it again → report is generated → job moves to completed

The user gets their report. You never know there was a crash. That is the point.


KeelStack Engine ships RetryableJobRunner, PersistentJobStore, and the crash recovery mechanism as part of the Layer 06 background job system. Zero configuration required — it runs with in-memory fallbacks locally and switches to Redis automatically when REDIS_URL is set.

Get KeelStack Engine →
