You're sending an email. The SMTP server hiccups. Your code throws. The email never sends. Nobody knows.
That's the silent failure problem. Most retry libraries solve half of it. They retry. But they don't answer the harder question:
What happens to the jobs that never succeed?
This article walks through three production failure patterns and how to fix all of them with job-retry, a Node.js package I built that handles backoff, timeouts, and dead letter queues in one place.
The three ways basic retry fails you
Problem 1: The Thundering Herd
Without jitter:
t = 0s  ┌─────────────────────────────────┐
        │ 500 jobs all fail at once       │
        └─────────────────────────────────┘
                         │
t = 1s  ┌─────────────────────────────────┐
        │ 500 jobs all retry at once 💥   │  <- server dies again
        └─────────────────────────────────┘
                         │
t = 2s  ┌─────────────────────────────────┐
        │ 500 jobs all retry at once 💥   │  <- and again
        └─────────────────────────────────┘
Your service comes back up and immediately gets hammered by every single job retrying at the exact same millisecond. It goes down again.
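To make the effect concrete, here is a rough simulation of the first retry wave. It is only an illustration: `peakLoad` is a hypothetical helper, the 100ms bucket size is arbitrary, and the jitter model (a random extra of up to one full base delay) is an assumption rather than anything taken from a specific library.

```typescript
// Hypothetical helper: how many jobs land in the busiest 100ms window
// when every job schedules its first retry at the same starting moment.
function peakLoad(
  jobs: number,
  baseDelay: number,
  jitter: boolean,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const buckets = new Map<number, number>();
  for (let i = 0; i < jobs; i++) {
    // Assumed jitter model: wait between baseDelay and 2 x baseDelay.
    const delay = jitter ? baseDelay + random() * baseDelay : baseDelay;
    const bucket = Math.floor(delay / 100);
    buckets.set(bucket, (buckets.get(bucket) ?? 0) + 1);
  }
  return Math.max(...Array.from(buckets.values()));
}

// Without jitter every job lands in the same window.
const herd = peakLoad(500, 1000, false);
// With jitter the same jobs are spread over roughly ten windows.
const spread = peakLoad(500, 1000, true);
```

Without jitter, `herd` is always the full job count. With jitter, `spread` varies from run to run but is typically a small fraction of it.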
Problem 2: The Hanging Promise
await fetch('https://slow-api.com/data')
│
│ ... 10 seconds pass
│ ... 30 seconds pass
│ ... 2 minutes pass
│
▼
nothing. <- no error, no timeout, just hangs forever
Your API call never resolves, so the retry loop waits forever. Stuck jobs pile up, memory grows, and eventually the server crashes.
Problem 3: Silent Loss
p-retry:
attempt 1 -> fail
attempt 2 -> fail
attempt 3 -> fail
attempt 4 -> fail
attempt 5 -> fail
│
▼
throws <- you catch it, log it, move on
│
▼
job is gone forever 🗑️
no record of what failed
no way to replay it
customer never got their email
After p-retry gives up, the job disappears. No record of what failed, when, or why. No way to replay it after the issue is fixed.
The solution: job-retry
job-retry:
attempt 1 -> fail -> wait (backoff + jitter)
attempt 2 -> fail -> wait (backoff + jitter)
attempt 3 -> fail -> wait (backoff + jitter)
attempt 4 -> fail -> wait (backoff + jitter)
attempt 5 -> fail
│
▼
┌─────────────────────────────┐
│ Dead Letter Queue │ <- job saved with full context
│ name, error, timestamp, │
│ attempts, original payload │
└─────────────────────────────┘
│
▼ (after you fix the underlying issue)
dlq.retry(job.id, runner) -> job runs again ✅
Install
npm install job-retry
The core has zero dependencies. ioredis is only needed if you use the Redis DLQ backend.
Basic usage
import { JobRetry } from 'job-retry';
const runner = new JobRetry({
  attempts: 5,
  backoff: 'exponential',
  baseDelay: 1000,
  timeout: 5000, // kill hanging attempts after 5s
  jitter: true,  // spread retries to avoid thundering herd
});
const result = await runner.run('sendEmail', () => sendEmail(user));
That's the core API. Wrap any async function with runner.run(). Backoff, jitter, and timeouts are handled automatically.
How exponential backoff and jitter work
baseDelay: 1000ms
Without jitter:
attempt 1 -> wait 1000ms
attempt 2 -> wait 2000ms
attempt 3 -> wait 4000ms
attempt 4 -> wait 8000ms
All 500 jobs fire at exactly the same moment -> 💥
With jitter:
attempt 1 -> wait 1200ms (random between 1000-2000)
attempt 2 -> wait 2750ms (random between 2000-4000)
attempt 3 -> wait 5100ms (random between 4000-8000)
attempt 4 -> wait 9800ms (random between 8000-16000)
Jobs spread out across a time window, service recovers gracefully -> ✅
| Strategy | Formula | Example (baseDelay: 1s) |
|---|---|---|
| fixed | baseDelay | 1s · 1s · 1s · 1s |
| linear | baseDelay × attempt | 1s · 2s · 3s · 4s |
| exponential | baseDelay × 2^(attempt-1) | 1s · 2s · 4s · 8s |
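The three formulas and the jitter ranges shown above fit in one small function. This is a sketch for illustration, not job-retry's actual source: `computeDelay` and its injectable `random` parameter are hypothetical, and the jitter range (between the computed delay and twice that delay) is inferred from the example ranges in this section.

```typescript
type Strategy = 'fixed' | 'linear' | 'exponential';

// One formula per strategy; `attempt` is 1-based.
const formulas: Record<Strategy, (base: number, attempt: number) => number> = {
  fixed: (base) => base,
  linear: (base, attempt) => base * attempt,
  exponential: (base, attempt) => base * 2 ** (attempt - 1),
};

// Hypothetical helper: milliseconds to wait after the given failed attempt.
function computeDelay(
  strategy: Strategy,
  baseDelay: number,
  attempt: number,
  jitter = false,
  random: () => number = Math.random, // injectable so results can be tested
): number {
  const delay = formulas[strategy](baseDelay, attempt);
  // Assumed jitter model: add a random extra of up to 100% of the delay,
  // so an 8000ms delay becomes something between 8000ms and 16000ms.
  return jitter ? Math.round(delay + random() * delay) : delay;
}

// Delays for four attempts with exponential backoff and no jitter.
const schedule = [1, 2, 3, 4].map((n) => computeDelay('exponential', 1000, n));
```

Passing `random` as a parameter is a testing convenience: production code can rely on the `Math.random` default while tests pin the value.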
Per-attempt timeout
const runner = new JobRetry({
  attempts: 3,
  timeout: 5000, // each attempt gets maximum 5 seconds
});
attempt 1: ──────────────────────── timeout! (5s) -> retry
attempt 2: ──────────────────────── timeout! (5s) -> retry
attempt 3: ──────────────────────── timeout! (5s) -> DLQ
No more hanging promises. Every attempt has a hard deadline.
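A per-attempt deadline like this is commonly built by racing the attempt against a timer. The sketch below shows that general technique. `withTimeout` and `TimeoutError` are hypothetical names, and nothing here is claimed to be how job-retry implements its `timeout` option.

```typescript
// Hypothetical error type used to signal that the deadline was hit.
class TimeoutError extends Error {
  constructor(ms: number) {
    super(`Attempt timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Run `fn`, but reject with TimeoutError if it has not settled within `ms`.
async function withTimeout<T>(fn: () => Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(ms)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([fn(), deadline]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

One caveat with this pattern: `Promise.race` only stops waiting. It does not cancel the underlying work, so a request that should actually be aborted needs its own cancellation mechanism, such as an `AbortSignal` passed to `fetch`.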
Dead letter queue
When all attempts fail, the job is saved instead of dropped:
const runner = new JobRetry({
  attempts: 3,
  baseDelay: 1000,
  dlq: 'memory', // swap for 'file' or 'redis' in production
  onFailure: (job) => console.error('Saved to DLQ:', job.name),
});
try {
  await runner.run('processOrder', () => callPaymentService(order));
} catch {
  // all attempts failed, but the job is safe in the DLQ
}
// Inspect what failed
const failed = await runner.dlq.getAll();
// [{
//   id: 'abc-123',
//   name: 'processOrder',
//   error: 'Payment service timeout',
//   timestamp: 1718380800000,
//   attempts: 3
// }]
// Once the payment service is fixed, replay it
await runner.dlq.retry(failed[0].id, runner);
Three DLQ backends
┌─────────────────────────────────────────────────────────┐
│                  DLQBackend interface                   │
│   push · getAll · get · retry · remove · clear · size   │
└───────────────────────┬─────────────────────────────────┘
                        │ implemented by
          ┌─────────────┼──────────────┐
          ▼             ▼              ▼
      MemoryDLQ      FileDLQ       RedisDLQ
     (dev/test)  (single server) (production)
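To show what a backend with those seven methods might look like, here is a minimal in-memory sketch. Only the method names come from the diagram. The signatures, the `DeadLetterEntry` shape, and the idea of keeping the original function on the entry so it can be replayed are all assumptions made for illustration, and the types are deliberately named differently from the package's own exports.

```typescript
// Assumed entry shape, based on the fields shown in the DLQ example output.
interface DeadLetterEntry {
  id: string;
  name: string;
  error: string;
  timestamp: number;
  attempts: number;
  fn: () => Promise<unknown>; // assumption: kept in memory for replay
}

// Anything that can run a named job, e.g. a retry runner.
interface JobRunner {
  run<T>(name: string, fn: () => Promise<T>): Promise<T>;
}

// Hypothetical signatures for the seven methods named in the diagram.
interface DeadLetterBackend {
  push(entry: DeadLetterEntry): Promise<void>;
  getAll(): Promise<DeadLetterEntry[]>;
  get(id: string): Promise<DeadLetterEntry | undefined>;
  retry(id: string, runner: JobRunner): Promise<unknown>;
  remove(id: string): Promise<boolean>;
  clear(): Promise<void>;
  size(): Promise<number>;
}

class InMemoryBackend implements DeadLetterBackend {
  private entries = new Map<string, DeadLetterEntry>();

  async push(entry: DeadLetterEntry): Promise<void> {
    this.entries.set(entry.id, entry);
  }
  async getAll(): Promise<DeadLetterEntry[]> {
    return Array.from(this.entries.values());
  }
  async get(id: string): Promise<DeadLetterEntry | undefined> {
    return this.entries.get(id);
  }
  async retry(id: string, runner: JobRunner): Promise<unknown> {
    const entry = this.entries.get(id);
    if (!entry) throw new Error(`No dead-lettered job with id ${id}`);
    const result = await runner.run(entry.name, entry.fn);
    this.entries.delete(id); // removed only after the replay succeeds
    return result;
  }
  async remove(id: string): Promise<boolean> {
    return this.entries.delete(id);
  }
  async clear(): Promise<void> {
    this.entries.clear();
  }
  async size(): Promise<number> {
    return this.entries.size;
  }
}
```

Making every method async, even for the in-memory case, is a design choice that lets file-backed or Redis-backed implementations share the same interface.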
Memory: for development
new JobRetry({ dlq: 'memory' })
// Stored in-process. Lost on restart. Perfect for dev and tests.
File: for single-server apps
new JobRetry({
  dlq: 'file',
  dlqFilePath: './failed-jobs.json',
})
// Persisted to disk as JSON. Survives restarts.
Redis: for production
import Redis from 'ioredis';
new JobRetry({
  dlq: 'redis',
  dlqRedisClient: new Redis(),
})
// Shared across all servers. Survives restarts. Production-ready.
Hooks for full observability
const runner = new JobRetry({
  attempts: 5,
  baseDelay: 1000,
  onRetry: (error, attempt) => {
    // fired after each failed attempt except the last
    logger.warn('Job retry', { attempt, error: error.message });
    metrics.increment('jobs.retry');
  },
  onFailure: (job) => {
    // fired when job is moved to the DLQ
    logger.error('Job permanently failed', { name: job.name, id: job.id });
    metrics.increment('jobs.failed');
  },
  onSuccess: (result, attempts) => {
    // fired on success when at least one retry was needed
    logger.info('Job recovered', { attempts });
    metrics.increment('jobs.recovered');
  },
});
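The firing rules in those comments are easier to see in a stripped-down retry loop. The sketch below follows the comments as written: `onRetry` on every failure except the last, `onFailure` once when the job is given up on, and `onSuccess` only when a retry was needed. `runWithHooks` and the `Hooks` type are hypothetical, and backoff delays are left out to keep the control flow visible.

```typescript
interface Hooks<T> {
  onRetry?: (error: Error, attempt: number) => void;
  onFailure?: (job: { name: string; error: string; attempts: number }) => void;
  onSuccess?: (result: T, attempts: number) => void;
}

// Hypothetical runner: tries `fn` up to `attempts` times and fires hooks.
async function runWithHooks<T>(
  name: string,
  fn: () => Promise<T>,
  attempts: number,
  hooks: Hooks<T> = {},
): Promise<T> {
  let lastError = new Error('no attempts were made');
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const result = await fn();
      // Report a recovery only if at least one retry was needed.
      if (attempt > 1) hooks.onSuccess?.(result, attempt);
      return result;
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      // Non-final failure: the job will be tried again.
      if (attempt < attempts) hooks.onRetry?.(lastError, attempt);
      // A real runner would wait for the backoff delay here.
    }
  }
  // Final failure: this is the point where a job would go to the DLQ.
  hooks.onFailure?.({ name, error: lastError.message, attempts });
  throw lastError;
}
```

With `attempts: 5` and a job that fails twice before succeeding, this fires `onRetry` for attempts 1 and 2, then `onSuccess` with `attempts` equal to 3.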
Real-world example: email queue with file DLQ
import { JobRetry } from 'job-retry';
const mailer = new JobRetry({
  attempts: 5,
  backoff: 'exponential',
  baseDelay: 2000,
  timeout: 10_000,
  jitter: true,
  dlq: 'file',
  dlqFilePath: './failed-emails.json',
  onRetry: (err, n) => console.warn(`Email retry ${n}: ${err.message}`),
  onFailure: (job) => console.error(`Saved to DLQ: ${job.name}`),
});
export async function sendWelcomeEmail(user: User) {
  return mailer.run(`welcome:${user.id}`, () =>
    smtp.send({ to: user.email, subject: 'Welcome!', body: `Hi ${user.name}` })
  );
}
// Run this script after fixing your SMTP server
async function replayFailedEmails() {
  const failed = await mailer.dlq.getAll();
  console.log(`Replaying ${failed.length} failed email(s)...`);
  for (const job of failed) {
    try {
      await mailer.dlq.retry(job.id, mailer);
      console.log(`✅ Replayed: ${job.name}`);
    } catch {
      console.error(`❌ Still failing: ${job.name}`);
    }
  }
}
Error handling
import { MaxAttemptsExceededError } from 'job-retry';
try {
  await runner.run('myJob', fn);
} catch (err) {
  if (err instanceof MaxAttemptsExceededError) {
    console.log(`Failed after ${err.attempts} attempts`);
    console.log('Root cause:', err.cause);
  }
}
All options
| Option | Type | Default | Description |
|---|---|---|---|
| attempts | number | 3 | Max attempts before DLQ |
| backoff | `'fixed' \| 'linear' \| 'exponential'` | | |
| baseDelay | number | 1000 | Base delay in ms |
| timeout | number | none | Per-attempt timeout in ms |
| jitter | boolean | false | Randomise delay |
| dlq | `'memory' \| 'file' \| 'redis' \| …` | | |
| dlqFilePath | string | './job-retry-dlq.json' | File backend path |
| dlqRedisClient | Redis | none | ioredis client |
| onRetry | (error, attempt) => void | none | Non-final failure hook |
| onFailure | (job) => void | none | DLQ hook |
| onSuccess | (result, attempts) => void | none | Recovery hook |
TypeScript
Full types ship with the package, no @types/job-retry needed.
import type { RetryOptions, DLQEntry, DLQBackend, BackoffStrategy } from 'job-retry';
Links
If this helped you, a star on GitHub means a lot. Drop a comment if you have questions or ideas for new backends.