
Rajat Thakur


Why p-retry isn't enough for production and what to do instead

You're sending an email. The SMTP server hiccups. Your code throws. The email never sends. Nobody knows.

That's the silent failure problem. Most retry libraries solve half of it. They retry. But they don't answer the harder question:

What happens to the jobs that never succeed?

This article walks through three production failure patterns and how to fix all of them with job-retry, a Node.js package I built that handles backoff, timeouts, and dead letter queues in one place.


The three ways basic retry fails you

Problem 1: The Thundering Herd

Without jitter:

t = 0s  ┌─────────────────────────────────┐
        │  500 jobs all fail at once       │
        └─────────────────────────────────┘
                          │
t = 1s  ┌─────────────────────────────────┐
        │  500 jobs all retry at once  💥  │  <- server dies again
        └─────────────────────────────────┘
                          │
t = 2s  ┌─────────────────────────────────┐
        │  500 jobs all retry at once  💥  │  <- and again
        └─────────────────────────────────┘

Your service comes back up and immediately gets hammered by every single job retrying at the exact same millisecond. It goes down again.


Problem 2: The Hanging Promise

await fetch('https://slow-api.com/data')
        
          ... 10 seconds pass
          ... 30 seconds pass
          ... 2 minutes pass
        
        
      nothing.   <- no error, no timeout, just hangs forever

Your API call never settles, so the retry loop never sees a failure to retry. Stuck calls pile up, memory grows, and eventually the server crashes.


Problem 3: Silent Loss

p-retry:

  attempt 1 -> fail
  attempt 2 -> fail
  attempt 3 -> fail
  attempt 4 -> fail
  attempt 5 -> fail
       │
       ▼
    throws <- you catch it, log it, move on
       │
       ▼
    job is gone forever 🗑️
    no record of what failed
    no way to replay it
    customer never got their email

After p-retry gives up, it rethrows the last error. Unless you persist the job yourself, it disappears: no durable record of what failed, when, or why, and no way to replay it after the issue is fixed.


The solution: job-retry

job-retry:

  attempt 1 -> fail -> wait (backoff + jitter)
  attempt 2 -> fail -> wait (backoff + jitter)
  attempt 3 -> fail -> wait (backoff + jitter)
  attempt 4 -> fail -> wait (backoff + jitter)
  attempt 5 -> fail
       │
       ▼
  ┌─────────────────────────────┐
  │       Dead Letter Queue     │  <- job saved with full context
  │  name, error, timestamp,    │
  │  attempts, original payload │
  └─────────────────────────────┘
       │
       ▼  (after you fix the underlying issue)
  dlq.retry(job.id, runner) -> job runs again ✅

Install

npm install job-retry

The core has zero dependencies. You only need ioredis if you use the Redis DLQ backend.


Basic usage

import { JobRetry } from 'job-retry';

const runner = new JobRetry({
  attempts: 5,
  backoff: 'exponential',
  baseDelay: 1000,
  timeout: 5000,   // kill hanging attempts after 5s
  jitter: true,    // spread retries to avoid thundering herd
});

const result = await runner.run('sendEmail', () => sendEmail(user));

That's the core API. Wrap any async function with runner.run(), and retries, backoff, and timeouts are handled for you.


How exponential backoff and jitter work

baseDelay: 1000ms

Without jitter:          With jitter:

attempt 1 -> wait 1000ms  attempt 1 -> wait 1200ms  (random between 1000-2000)
attempt 2 -> wait 2000ms  attempt 2 -> wait 2750ms  (random between 2000-4000)
attempt 3 -> wait 4000ms  attempt 3 -> wait 5100ms  (random between 4000-8000)
attempt 4 -> wait 8000ms  attempt 4 -> wait 9800ms  (random between 8000-16000)

All 500 jobs fire         Jobs spread out across a
at exactly the same       time window, service
moment -> 💥              recovers gracefully -> ✅

The three backoff strategies:

Strategy      Formula                        Example (baseDelay: 1s)
fixed         baseDelay                      1s · 1s · 1s · 1s
linear        baseDelay × attempt            1s · 2s · 3s · 4s
exponential   baseDelay × 2^(attempt-1)      1s · 2s · 4s · 8s
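
To make the formulas concrete, here is a minimal TypeScript sketch of how those delays could be computed. The names (Strategy, baseDelayFor, withJitter) are made up for illustration, and the jitter model (a random delay between d and 2d) is assumed from the ranges in the comparison above. It is not job-retry's actual source.

```typescript
// Hypothetical helpers for illustration only; not job-retry's implementation.
type Strategy = 'fixed' | 'linear' | 'exponential';

// Delay before the next try after failed attempt number `attempt` (1-based), without jitter.
function baseDelayFor(strategy: Strategy, baseDelay: number, attempt: number): number {
  switch (strategy) {
    case 'fixed':
      return baseDelay;
    case 'linear':
      return baseDelay * attempt;
    case 'exponential':
      return baseDelay * 2 ** (attempt - 1);
  }
}

// Assumed jitter model: a random delay in [delay, 2 * delay).
// `random` is injectable so the behaviour can be tested deterministically.
function withJitter(delay: number, random: () => number = Math.random): number {
  return Math.floor(delay * (1 + random()));
}

for (let attempt = 1; attempt <= 4; attempt++) {
  const plain = baseDelayFor('exponential', 1000, attempt);
  console.log(`attempt ${attempt}: ${plain}ms plain, ${withJitter(plain)}ms with jitter`);
}
```

Because every job draws its own random number, 500 jobs that failed together end up spread across the whole window instead of landing on the same millisecond.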

Per-attempt timeout

const runner = new JobRetry({
  attempts: 3,
  timeout: 5000,  // each attempt gets maximum 5 seconds
});

With attempts set to 3, every attempt is cut off at 5 seconds, and the third timeout sends the job to the DLQ:

attempt 1:  ──────────────────────── timeout! (5s) -> retry
attempt 2:  ──────────────────────── timeout! (5s) -> retry
attempt 3:  ──────────────────────── timeout! (5s) -> DLQ

No more hanging promises. Every attempt has a hard deadline.
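
If you're curious how a per-attempt deadline can work, one common approach is to race the attempt against a timer. The sketch below assumes that approach. withTimeout and AttemptTimeoutError are hypothetical names, and job-retry's internals may differ.

```typescript
// Hypothetical helper for illustration only; not job-retry's implementation.
class AttemptTimeoutError extends Error {
  constructor(ms: number) {
    super(`Attempt timed out after ${ms}ms`);
    this.name = 'AttemptTimeoutError';
  }
}

// Rejects with AttemptTimeoutError if `fn` has not settled within `ms` milliseconds.
async function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new AttemptTimeoutError(ms)), ms);
  });
  try {
    // Whichever settles first wins: the attempt or the deadline.
    return await Promise.race([fn(), deadline]);
  } finally {
    // Always clear the timer so it cannot fire late or keep the process alive.
    clearTimeout(timer);
  }
}

// Demo: a call that never settles is cut off after 50ms.
const neverSettles = () => new Promise<string>(() => {});
withTimeout(neverSettles, 50).catch((err: Error) => console.log(err.message));
```

One caveat with this technique: losing the race does not cancel the underlying work. For fetch, passing an AbortSignal (for example AbortSignal.timeout(5000)) actually aborts the request.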


Dead letter queue

When all attempts fail, the job is saved instead of dropped:

const runner = new JobRetry({
  attempts: 3,
  baseDelay: 1000,
  dlq: 'memory',   // swap for 'file' or 'redis' in production
  onFailure: (job) => console.error('Saved to DLQ:', job.name),
});

try {
  await runner.run('processOrder', () => callPaymentService(order));
} catch {
  // all attempts failed, but the job is safe in the DLQ
}

// Inspect what failed
const failed = await runner.dlq.getAll();
// [{
//   id:        'abc-123',
//   name:      'processOrder',
//   error:     'Payment service timeout',
//   timestamp: 1718380800000,
//   attempts:  3
// }]

// Once the payment service is fixed, replay it
await runner.dlq.retry(failed[0].id, runner);
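
Conceptually, this is a retry loop with one extra step at the end. Here is a stripped-down sketch of that flow, assuming a plain in-memory array as the queue. runWithDlq, FailedJob, and deadLetters are illustrative names, not part of job-retry.

```typescript
// Hypothetical names throughout; a sketch of the flow, not job-retry's source.
interface FailedJob {
  id: string;
  name: string;
  error: string;
  timestamp: number;
  attempts: number;
}

const deadLetters: FailedJob[] = [];
let nextId = 1;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function runWithDlq<T>(
  name: string,
  fn: () => Promise<T>,
  attempts = 3,
  baseDelay = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      // Exponential backoff between attempts; no wait after the final one.
      if (attempt < attempts) await sleep(baseDelay * 2 ** (attempt - 1));
    }
  }
  // Every attempt failed: record the job instead of dropping it, then rethrow.
  deadLetters.push({
    id: String(nextId++),
    name,
    error: lastError instanceof Error ? lastError.message : String(lastError),
    timestamp: Date.now(),
    attempts,
  });
  throw lastError;
}
```

The caller still gets the error, so existing try/catch code keeps working. The difference is that the failure now leaves a record behind.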

Three DLQ backends

┌─────────────────────────────────────────────────────────┐
│                   DLQBackend interface                   │
│  push · getAll · get · retry · remove · clear · size    │
└───────────────────────┬─────────────────────────────────┘
                        │  implemented by
          ┌─────────────┼──────────────┐
          ▼             ▼              ▼
    MemoryDLQ       FileDLQ       RedisDLQ
    (dev/test)   (single server) (production)
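
For a feel of what a backend involves, here is a rough sketch of an in-memory store. The method names follow the diagram above, but the signatures, the DeadLetter shape, and the names DeadLetterStore and MapStore are assumptions. Check the package's DLQBackend type for the real contract. retry is left out because it depends on the runner.

```typescript
// Assumed shapes for illustration; the real DLQBackend contract may differ.
interface DeadLetter {
  id: string;
  name: string;
  error: string;
  timestamp: number;
  attempts: number;
}

interface DeadLetterStore {
  push(entry: DeadLetter): Promise<void>;
  getAll(): Promise<DeadLetter[]>;
  get(id: string): Promise<DeadLetter | undefined>;
  remove(id: string): Promise<boolean>;
  clear(): Promise<void>;
  size(): Promise<number>;
}

// In-process store keyed by job id. Everything is lost on restart.
class MapStore implements DeadLetterStore {
  private readonly entries = new Map<string, DeadLetter>();

  async push(entry: DeadLetter): Promise<void> {
    this.entries.set(entry.id, entry);
  }

  async getAll(): Promise<DeadLetter[]> {
    return Array.from(this.entries.values());
  }

  async get(id: string): Promise<DeadLetter | undefined> {
    return this.entries.get(id);
  }

  // Resolves to true if an entry with that id existed and was removed.
  async remove(id: string): Promise<boolean> {
    return this.entries.delete(id);
  }

  async clear(): Promise<void> {
    this.entries.clear();
  }

  async size(): Promise<number> {
    return this.entries.size;
  }
}
```

Keeping every method async, even for the in-memory case, means a file or Redis store can sit behind the same interface without changing the calling code.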

Memory: for development

new JobRetry({ dlq: 'memory' })
// Stored in-process. Lost on restart. Perfect for dev and tests.

File: for single-server apps

new JobRetry({
  dlq: 'file',
  dlqFilePath: './failed-jobs.json',
})
// Persisted to disk as JSON. Survives restarts.

Redis: for production

import Redis from 'ioredis';

new JobRetry({
  dlq: 'redis',
  dlqRedisClient: new Redis(),
})
// Shared across all servers. Survives restarts. Production-ready.

Hooks for full observability

// `logger` and `metrics` stand in for your app's own logging and metrics clients
const runner = new JobRetry({
  attempts: 5,
  baseDelay: 1000,

  onRetry: (error, attempt) => {
    // fired after each failed attempt except the last
    logger.warn('Job retry', { attempt, error: error.message });
    metrics.increment('jobs.retry');
  },

  onFailure: (job) => {
    // fired when job is moved to the DLQ
    logger.error('Job permanently failed', { name: job.name, id: job.id });
    metrics.increment('jobs.failed');
  },

  onSuccess: (result, attempts) => {
    // fired on success when at least one retry was needed
    logger.info('Job recovered', { attempts });
    metrics.increment('jobs.recovered');
  },
});

Real-world example: email queue with file DLQ

import { JobRetry } from 'job-retry';
// `smtp` and the `User` type come from your own application code

const mailer = new JobRetry({
  attempts: 5,
  backoff: 'exponential',
  baseDelay: 2000,
  timeout: 10_000,
  jitter: true,
  dlq: 'file',
  dlqFilePath: './failed-emails.json',
  onRetry:   (err, n) => console.warn(`Email retry ${n}: ${err.message}`),
  onFailure: (job)    => console.error(`Saved to DLQ: ${job.name}`),
});

export async function sendWelcomeEmail(user: User) {
  return mailer.run(`welcome:${user.id}`, () =>
    smtp.send({ to: user.email, subject: 'Welcome!', body: `Hi ${user.name}` })
  );
}

// Run this script after fixing your SMTP server
async function replayFailedEmails() {
  const failed = await mailer.dlq.getAll();
  console.log(`Replaying ${failed.length} failed email(s)...`);

  for (const job of failed) {
    try {
      await mailer.dlq.retry(job.id, mailer);
      console.log(`✅ Replayed: ${job.name}`);
    } catch {
      console.error(`❌ Still failing: ${job.name}`);
    }
  }
}

Error handling

import { MaxAttemptsExceededError } from 'job-retry';

try {
  await runner.run('myJob', fn);
} catch (err) {
  if (err instanceof MaxAttemptsExceededError) {
    console.log(`Failed after ${err.attempts} attempts`);
    console.log('Root cause:', err.cause);
  }
}

All options

Option          Type                                 Default                  Description
attempts        number                               3                        Max attempts before DLQ
backoff         'fixed' | 'linear' | 'exponential'
baseDelay       number                               1000                     Base delay in ms
timeout         number                               none                     Per-attempt timeout in ms
jitter          boolean                              false                    Randomise delay
dlq             'memory' | 'file' | 'redis' | …
dlqFilePath     string                               './job-retry-dlq.json'   File backend path
dlqRedisClient  Redis                                none                     ioredis client
onRetry         (error, attempt) => void             none                     Non-final failure hook
onFailure       (job) => void                        none                     DLQ hook
onSuccess       (result, attempts) => void           none                     Recovery hook

TypeScript

Full types ship with the package, so no @types/job-retry is needed.

import type { RetryOptions, DLQEntry, DLQBackend, BackoffStrategy } from 'job-retry';

Links

If this helped you, a star on GitHub means a lot. Drop a comment if you have questions or ideas for new backends.
