You are on call. The payments provider returns a brief 503, the kind of routine ten-second hiccup that fills any large provider's status page. Every node in your service hits it on the same wall-clock second. Every node has the same retry policy: wait one second, retry, then two, then four. A second later every node retries. Two seconds after that every node retries again. The provider, which was nearly recovered, is now hammered by a synchronised wave of clients who all chose the same backoff. Their on-call sees a second spike worse than the first.
This is the thundering-herd failure mode, and a naive setTimeout(retry, base * 2 ** n) loop is what causes it. The fix is jitter. The AWS Architecture blog post Exponential Backoff And Jitter by Marc Brooker is the canonical reference; the formulas there are now baked into every AWS SDK.
Why the naive retry collapses
The standard formula without jitter is sleep = base * 2 ** attempt. Every client computes the same number. If your fleet has 200 instances and the upstream blips for a second, all 200 retry at exactly base ms after the failure, then at 2*base, then at 4*base.
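The lockstep is visible just by printing the schedule; every instance in the fleet computes exactly these numbers:
const base = 1_000; // 1 second, illustrative
for (let attempt = 0; attempt < 4; attempt++) {
  console.log(`attempt ${attempt}: retry after ${base * 2 ** attempt} ms`);
}
// attempt 0: retry after 1000 ms
// attempt 1: retry after 2000 ms   <- all 200 instances, on the same tick
// attempt 2: retry after 4000 ms
// attempt 3: retry after 8000 ms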
The retry traffic looks like a square wave on top of normal load. A service that was struggling to recover is now trying to recover and answer 200 simultaneous retries every few seconds. The cure is to spread retries across the backoff window. If every client picks a random delay inside the window, the retry traffic flattens into something the upstream can serve. The graphs in the AWS post show this directly: un-jittered is a sawtooth that overloads the server on every boundary; jittered is a smooth curve that tracks the recovery.
The three jitter strategies
The AWS post compares three formulas. All three start from the same exponential ceiling: cap (the maximum delay you will ever sleep) and base * 2 ** attempt (the un-jittered exponential value).
Full jitter picks a uniform random value across the entire range from zero to the capped exponential.
const expo = Math.min(cap, base * 2 ** attempt);
const sleep = Math.random() * expo;
Equal jitter keeps half the exponential value as a fixed floor and randomises the other half.
const expo = Math.min(cap, base * 2 ** attempt);
const sleep = expo / 2 + Math.random() * (expo / 2);
Decorrelated jitter does not use the attempt number at all. It feeds the previous sleep into the next one, growing the upper bound as a function of what just happened.
let sleep = base;
// each attempt:
sleep = Math.min(cap, base + Math.random() * (sleep * 3 - base));
Brooker's measurements (in the AWS post): full jitter does the least total work and gives the lowest server load, at the cost of slightly longer total time. Equal jitter is a small step worse on both axes. Decorrelated jitter completes slightly faster than full jitter while doing more total work — it is the trade-off pick when you care more about completion time than about load on the upstream.
For most services, full jitter is the right default. It is the simplest to reason about and gives the best server-side behaviour. Decorrelated jitter is worth picking when you care more about latency variance than peak server load. The wrapper below uses full jitter; decorrelated is a one-line swap.
The retry loop, modern-runtime style
The minimum viable retry loop is a for over attempt numbers with a try/catch and a sleep at the bottom. Three things turn it into something you would ship.
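For reference, that minimum viable loop with full jitter baked in. It is a sketch: it cannot be cancelled and has no overall deadline, which is exactly what the three things below fix.
async function retryNaive<T>(op: () => Promise<T>): Promise<T> {
  const baseMs = 100, capMs = 30_000, maxAttempts = 5; // illustrative
  let lastErr: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastErr = err;
      if (attempt === maxAttempts) break;
      const expo = Math.min(capMs, baseMs * 2 ** (attempt - 1));
      await new Promise((r) => setTimeout(r, Math.random() * expo)); // full jitter
    }
  }
  throw lastErr;
}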
A real abort signal. AbortSignal.timeout(ms) returns a signal that aborts after the given milliseconds. It is implemented in Node.js, Deno, and Bun and is now baseline across the WinterCG runtimes. Pair it with AbortSignal.any([userSignal, timeoutSignal]) to honour external cancellation while enforcing a deadline. Older Node versions had reported issues with long-lived signal chains in AbortSignal.any, so check your minimum supported runtime's release notes before you rely on it.
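A minimal sketch of that pairing; the five-second budget and userSignal are illustrative:
async function fetchWithDeadline(url: string, userSignal?: AbortSignal): Promise<Response> {
  const deadline = AbortSignal.timeout(5_000); // hard five-second budget
  const signal = userSignal
    ? AbortSignal.any([userSignal, deadline])  // whichever aborts first wins
    : deadline;
  return fetch(url, { signal });
}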
A cancellable sleep. A setTimeout-based sleep that does not honour the abort signal will keep your retry loop alive past the deadline. Wrap setTimeout in a promise that listens for abort and rejects early.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    // The signal may already have fired before we were called.
    if (signal?.aborted) return reject(signal.reason);
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort); // do not leak the listener
      resolve();
    }, ms);
    const onAbort = () => {
      clearTimeout(timer); // cancel the pending wake-up
      reject(signal!.reason);
    };
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
A give-up budget. Two ceilings. Max attempts caps the count. The deadline caps total wall-clock time across all attempts and sleeps. Most production retry policies want the deadline as the primary gate, with max attempts as a safety net. Six attempts at 30 seconds each will burn three minutes you may not have.
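In the option names the wrapper below uses, with illustrative values:
const opts = {
  maxAttempts: 6,    // ceiling one: caps the count (safety net)
  deadlineMs: 9_000, // ceiling two: total wall-clock budget (primary gate)
};
// Without deadlineMs, six attempts that each time out at 30 seconds can hold
// the caller for roughly three minutes before the loop gives up.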
Idempotency keys: the part most retry libraries forget
Retrying a GET is safe. Retrying a POST is safe only if the server can dedupe. The pattern: generate a UUID before the first attempt and send the same UUID on every retry, in a header the server reads as "I already processed this; return the original result."
Stripe popularised this with its Idempotency-Key header. Every major payment provider, AWS, and most well-engineered internal services support a variant. The contract is: same key plus same body returns the same response. Same key plus different body returns an error.
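What the server side of that contract can look like, as a toy in-memory sketch; a real implementation persists keys with a TTL:
// Same key + same body: replay the stored response.
// Same key + different body: reject.
const processed = new Map<string, { bodyHash: string; response: unknown }>();

function handleIdempotent(key: string, bodyHash: string, run: () => unknown): unknown {
  const prior = processed.get(key);
  if (prior) {
    if (prior.bodyHash !== bodyHash) {
      throw new Error("idempotency key reused with a different body");
    }
    return prior.response; // replay the original result, do not re-run
  }
  const response = run();
  processed.set(key, { bodyHash, response });
  return response;
}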
The retry wrapper has to thread the key through every attempt, not generate a new one per attempt:
const idempotencyKey = crypto.randomUUID();
await withRetry(
(attempt) => fetch(url, {
method: "POST",
headers: {
"Idempotency-Key": idempotencyKey,
"X-Retry-Attempt": String(attempt),
},
body: JSON.stringify(payload),
}),
{ maxAttempts: 5, baseMs: 100, capMs: 30_000 },
);
The wrapper does not generate the key. The caller does, because the caller knows the unit of work. The wrapper passes the attempt number into the operation so the caller can add it as a tracing header, and stays out of the way.
When NOT to retry
The most expensive bug in retry code is retrying an error that will never succeed. Validation failures retry forever and amplify load on a service that was correctly rejecting. Auth failures replay a credential the server has already rejected; no number of retries will make it valid. The shape of the answer is a shouldRetry predicate: any error the predicate rejects fails the call immediately.
Defaults that travel well:
- Retry: network errors (ECONNRESET, ECONNREFUSED, ETIMEDOUT, EAI_AGAIN), HTTP 408, 429, 500, 502, 503, 504.
- Honour the server's hint: HTTP 429 and 503 may include a Retry-After header. Treat it as a floor on the next sleep.
- Do not retry: 400, 401, 403, 404, 409, 410, 422. The request is wrong; retrying makes it more wrong.
- It depends: 401 with a refreshable token is a pre-retry hook (refresh, then retry once; see the sketch after this list). 409 on an idempotent submission means the server already has your request: read the response, do not re-submit.
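The refresh-then-retry-once hook from the list, as a sketch. getToken and refreshToken are hypothetical stand-ins for your auth layer:
declare function getToken(): Promise<string>;     // hypothetical auth helpers
declare function refreshToken(): Promise<string>;

async function callWithRefresh<T>(call: (token: string) => Promise<T>): Promise<T> {
  let token = await getToken();
  try {
    return await call(token);
  } catch (err) {
    if ((err as { status?: number })?.status !== 401) throw err;
    token = await refreshToken(); // refresh once...
    return await call(token);     // ...then retry exactly once, outside the backoff loop
  }
}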
p-retry and Cockatiel both let the operation opt out of retry from inside, but through different mechanisms: p-retry honours a thrown AbortError, Cockatiel composes a retry policy with handler predicates that decide whether a given error is retryable. The wrapper here follows the p-retry shape: throw a NonRetryableError and the loop exits.
The 100-line implementation
The full wrapper. It fits in 100 lines, covers full jitter, max-attempts plus deadline budget, abort-signal threading, Retry-After honouring, and the non-retryable opt-out.
// retry.ts — full-jitter exponential backoff with deadline + abort
export class NonRetryableError extends Error {
constructor(public cause: unknown) {
super("non-retryable");
}
}
export type RetryOptions = {
maxAttempts?: number; // default 5
baseMs?: number; // default 100
capMs?: number; // default 30_000
deadlineMs?: number; // total wall-clock budget
signal?: AbortSignal; // external cancellation
shouldRetry?: (err: unknown) => boolean;
onRetry?: (info: {
attempt: number;
delayMs: number;
error: unknown;
}) => void;
};
export type Attempt = {
attempt: number; // 1-indexed
signal: AbortSignal; // honour this in your fetch
};
const defaultShouldRetry = (err: unknown): boolean => {
if (err instanceof NonRetryableError) return false;
const code = (err as { code?: string })?.code;
if (code && ["ECONNRESET", "ECONNREFUSED", "ETIMEDOUT", "EAI_AGAIN"]
.includes(code)) return true;
const status = (err as { status?: number })?.status;
if (status && [408, 429, 500, 502, 503, 504].includes(status)) return true;
return false;
};
const sleep = (ms: number, signal: AbortSignal): Promise<void> =>
new Promise((resolve, reject) => {
if (signal.aborted) return reject(signal.reason);
const t = setTimeout(() => {
signal.removeEventListener("abort", onAbort);
resolve();
}, ms);
const onAbort = () => {
clearTimeout(t);
reject(signal.reason);
};
signal.addEventListener("abort", onAbort, { once: true });
});
const retryAfterMs = (err: unknown): number | null => {
const header = (err as { retryAfter?: string })?.retryAfter;
if (!header) return null;
const seconds = Number(header);
if (Number.isFinite(seconds)) return seconds * 1000;
const date = Date.parse(header);
return Number.isFinite(date) ? Math.max(0, date - Date.now()) : null;
};
That covers types, the default predicate, the cancellable sleep, and the Retry-After parser. The wrapper itself is the next 35 lines: it composes signals from the user's signal and the deadline, runs the operation in a 1-indexed loop, and applies full jitter with the Retry-After floor on every failure that the predicate accepts.
export async function withRetry<T>(
op: (a: Attempt) => Promise<T>,
opts: RetryOptions = {},
): Promise<T> {
const maxAttempts = opts.maxAttempts ?? 5;
const baseMs = opts.baseMs ?? 100;
const capMs = opts.capMs ?? 30_000;
const shouldRetry = opts.shouldRetry ?? defaultShouldRetry;
const deadline = opts.deadlineMs
? AbortSignal.timeout(opts.deadlineMs)
: null;
const signal: AbortSignal = deadline && opts.signal
? AbortSignal.any([opts.signal, deadline])
: (deadline ?? opts.signal ?? new AbortController().signal);
let lastError: unknown;
for (let attempt = 1; attempt <= maxAttempts; attempt++) {
if (signal.aborted) throw signal.reason;
try {
return await op({ attempt, signal });
} catch (err) {
lastError = err;
if (attempt === maxAttempts) break;
if (!shouldRetry(err)) throw err;
const expo = Math.min(capMs, baseMs * 2 ** (attempt - 1));
const jittered = Math.random() * expo;
const hint = retryAfterMs(err);
const delayMs = hint !== null ? Math.max(jittered, hint) : jittered;
opts.onRetry?.({ attempt, delayMs, error: err });
await sleep(delayMs, signal);
}
}
throw lastError;
}
The wrapper body is roughly 35 lines; the rest is types, the default predicate, and the Retry-After parser. The signature is the part that matters: you pass an operation that takes an Attempt (so you can stamp the attempt number on the request and honour the abort signal) and a small options object.
To swap full jitter for decorrelated, replace the two lines that compute expo and jittered with the decorrelated formula and keep a prevSleep variable across attempts. To swap for equal jitter, change one line. The wrapper is small so the policy is the obvious thing to edit.
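The decorrelated swap as a self-contained delay generator, a sketch that keeps the previous sleep in a closure:
function decorrelatedJitter(baseMs: number, capMs: number): () => number {
  let prevSleep = baseMs; // carries state across attempts
  return () => {
    prevSleep = Math.min(capMs, baseMs + Math.random() * (prevSleep * 3 - baseMs));
    return prevSleep;
  };
}
// In withRetry: create `const nextDelay = decorrelatedJitter(baseMs, capMs)`
// before the loop, then replace the expo/jittered lines with
// `const jittered = nextDelay();`.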
Production knobs that matter
baseMs. The first retry's expected wait is baseMs / 2 with full jitter. 50–200 ms for HTTP retries to a fast service, 500–1000 ms for a service whose median latency is in the hundreds of milliseconds. Going lower than 50 ms turns retries into a self-inflicted load test.
capMs. The largest single sleep. With baseMs=100 and no cap, attempt 10 wants 51.2 seconds; attempt 15 wants 27 minutes. The cap stops a runaway retry from waiting longer than the user will. 30 seconds for backend services; 5–10 seconds for user-facing paths.
maxAttempts. Five is fine. More than seven and you are almost certainly retrying something that should not be retried.
deadlineMs. The knob most retry libraries omit and the one you should always set. The right value is the upstream timeout of the call you are inside, minus a small margin. If your handler has 10 seconds to respond, your retry budget is 9 seconds, leaving 1 second for the response itself. That budget covers time spent on each attempt, not only the sleeps.
shouldRetry. Where you encode your service's idea of "transient." The default is a starting point; tune for each upstream. A payment provider with structured error codes distinguishing "card declined" from "service unavailable" deserves a predicate that reads the body, not just the status.
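Put together, a plausible tuning for a user-facing handler with a ten-second budget calling a fast upstream. Every number is illustrative, and callUpstream is a hypothetical client call:
declare function callUpstream(signal: AbortSignal): Promise<unknown>;

await withRetry((a) => callUpstream(a.signal), {
  baseMs: 100,       // expected first wait ~50 ms with full jitter
  capMs: 5_000,      // user-facing path: keep single sleeps short
  maxAttempts: 5,    // safety net behind the deadline
  deadlineMs: 9_000, // handler budget (10 s) minus response margin
});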
What we did not build
The 100-line wrapper does retries. Three things sit next to retries and deserve their own primitives:
- Circuit breaker. When a downstream is dead, retrying every call wastes time and hides the failure. Opens after N consecutive failures, fails fast for a cool-down window, then probes.
- Bulkhead. Limit concurrent calls to one downstream so one slow dependency cannot consume your whole event loop. A semaphore is enough; see the sketch after this list.
- Hedging. Send a second copy of the request after a delay if the first has not responded, take whichever returns first. Useful for tail-latency-sensitive paths.
All three compose with the retry wrapper. Outside-in: bulkhead, circuit breaker, retry, hedging, the actual call.
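Of the three, the bulkhead is small enough to sketch inline. A minimal promise-based semaphore, with an illustrative limit of 20:
class Semaphore {
  private waiters: Array<() => void> = [];
  constructor(private permits: number) {}
  async acquire(): Promise<void> {
    if (this.permits > 0) {
      this.permits--;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }
  release(): void {
    const next = this.waiters.shift();
    if (next) next();   // hand the permit straight to the next waiter
    else this.permits++;
  }
}

const downstream = new Semaphore(20); // at most 20 in-flight calls

async function withBulkhead<T>(op: () => Promise<T>): Promise<T> {
  await downstream.acquire();
  try {
    return await op();
  } finally {
    downstream.release();
  }
}
With just these two primitives, a call composed outside-in reads withBulkhead(() => withRetry(op)).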
Where to go next
Drop the 100 lines into lib/retry.ts. Pick one HTTP client call that today has either no retry or a naive one and wrap it. Set deadlineMs to your handler's deadline minus 200 ms. Set baseMs to a quarter of the upstream's median latency. Generate the idempotency key in the caller, thread it through, and watch the upstream's dashboards for a flatter retry curve the next time it blips.
For more, pin a copy of Marc Brooker's Exponential Backoff And Jitter, read Cockatiel's docs for the policy-composition idea, and read p-retry's source for a smaller reference implementation. The patterns are stable. What has shifted under your feet is the runtime primitives: AbortSignal.timeout, AbortSignal.any, and the WinterCG-aligned fetch.
The retry policy you ship today is the retry policy that will fire the next time something upstream blips. Make it one that does not make the blip worse.
If this was useful
A retry policy is a small library, but it is one where the difference between "works on a good day" and "works on a bad day" is exactly the production-engineering layer: abort signals, deadline budgets, idempotency keys, the discipline to not retry the unretryable. TypeScript in Production is the book for that layer: the build, packaging, and shipping work that makes a TypeScript codebase a TypeScript service.
The full TypeScript Library, five books:
- TypeScript Essentials — daily-driver TS across Node, Bun, Deno, and the browser: Amazon
- The TypeScript Type System — generics, mapped/conditional types, template literals, branded types: Amazon
- Kotlin and Java to TypeScript — bridge for JVM developers: Amazon
- PHP to TypeScript — bridge for modern PHP 8+ developers: Amazon
- TypeScript in Production — tsconfig, build, monorepos, library authoring: Amazon
Books 1 and 2 are the core path. Books 3 and 4 substitute for readers coming from JVM or PHP. Book 5 is for shipping TS at work.
All five books ship in ebook, paperback, and hardcover.
