
Webhook Retry Strategies (2026) — Idempotency, Backoff, Dead Letters

Stripe will retry your webhook up to 17 times over 3 days. GitHub up to 50 times over 8 hours. Square up to 70 times over 72 hours.

If your code charges a credit card or sends an email when those retries hit, you have a problem.

This is the field guide I wish someone had handed me on day one for building webhook receivers that survive retries. Four pillars: idempotency by event ID, 2xx-fast async processing, understanding each sender's retry policy, and dead-letter handling for the requests that never succeed. Code examples are Node.js + Postgres, but the patterns are language-agnostic.

Quick recipe: dedupe by event ID before doing real work, return 2xx fast, treat duplicates as no-ops, and set up a dead-letter queue for events that fail too many times. The rest is sender-specific tuning.

Why retries are unavoidable

Webhook senders (Stripe, GitHub, Shopify, etc.) decide an event was "delivered" based on whether your endpoint returned a 2xx HTTP status. Anything else — 4xx, 5xx, timeout, TCP reset, your laptop closed mid-deploy — is a "failure" and the sender will try again, often aggressively.

This means your handler is going to see the same event multiple times. Sometimes 2-3 times during a routine outage; up to 17 times for Stripe over 3 days; up to 50 times for GitHub. If your code charges a credit card or sends an email, naïve handling = duplicate charges, duplicate emails, angry customers.

The good news: the fix is mostly mechanical. Once you have idempotency-by-event-ID, retries become benign.

Pillar 1: Idempotency by event ID

Every webhook payload from a serious provider includes a unique event ID:

Provider   Event ID field                          Format
--------   --------------                          ------
Stripe     id (top-level)                          evt_1ABC...
GitHub     X-GitHub-Delivery (header)              UUID v4
Shopify    X-Shopify-Webhook-Id (header)           UUID v4
Slack      X-Slack-Request-Timestamp + body hash   composite
Square     event_id (in body)                      UUID v4
HubSpot    eventId (per event in array)            numeric
SendGrid   sg_event_id (per event)                 base64
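
Slack is the odd one out: there's no single ID field, so you derive a dedupe key yourself. A minimal sketch (the sha256 choice is mine, not something Slack prescribes):

// Sketch: derive a dedupe key for Slack from the timestamp header + a body hash.
import { createHash } from "node:crypto";

function slackDedupeKey(timestampHeader: string, rawBody: Buffer): string {
  const bodyHash = createHash("sha256").update(rawBody).digest("hex");
  return `slack:${timestampHeader}:${bodyHash}`;
}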

The pattern: persist the event ID before doing real work, in a unique-indexed table. If the insert fails because the ID already exists, you've seen this event before — return 200 OK and do nothing.

CREATE TABLE processed_webhook_events (
  event_id  TEXT PRIMARY KEY,
  source    TEXT NOT NULL,           -- 'stripe', 'github', etc.
  received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
// Express + node-postgres
import express from "express";
import { Pool } from "pg";

const pool = new Pool();
const app = express();

app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    // 1. Verify signature first (always — never skip).
    const event = verifyStripeSignature(req); // throws if invalid

    // 2. Try to record the event ID. Unique constraint = dedupe.
    try {
      await pool.query(
        "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
        [event.id, "stripe"],
      );
    } catch (err) {
      if ((err as { code?: string }).code === "23505") {
        // Duplicate — Stripe is retrying. We already handled this event.
        return res.json({ received: true, duplicate: true });
      }
      throw err;
    }

    // 3. Now do real work. If this throws, the row stays in the table
    //    but the event is unprocessed. See "transactional handlers" below.
    await handleStripeEvent(event);

    res.json({ received: true });
  },
);

This is the most important pattern in the entire guide. Get this right and retries become free.

Transactional handlers (the next subtle bug)

The simple version above has a failure window: if handleStripeEvent throws after we've recorded the event ID, every retry sees "duplicate" and skips the event — but the work never happened. Two fixes:

Option A — Mark events as pending then processed. Use a status column instead of pure existence:

ALTER TABLE processed_webhook_events
  ADD COLUMN status TEXT NOT NULL DEFAULT 'pending',
  ADD COLUMN processed_at TIMESTAMPTZ;

On retry, if a row exists with status='pending', you know the previous attempt died mid-flight. Pick up the work and re-run it. If status='processed', return 200 immediately.
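
A minimal sketch of Option A, inside the same Express handler as Pillar 1 (how aggressively you reclaim stale 'pending' rows is a policy choice I'm leaving open):

// Option A sketch: claim the event, then inspect status on conflict.
const claim = await pool.query(
  `INSERT INTO processed_webhook_events (event_id, source)
   VALUES ($1, $2)
   ON CONFLICT (event_id) DO NOTHING`,
  [event.id, "stripe"],
);

if (claim.rowCount === 0) {
  // Row already exists: did the previous attempt finish, or die mid-flight?
  const { rows } = await pool.query(
    "SELECT status FROM processed_webhook_events WHERE event_id = $1",
    [event.id],
  );
  if (rows[0]?.status === "processed") {
    return res.json({ received: true, duplicate: true });
  }
  // status = 'pending': the previous attempt died; fall through and re-run.
}

await handleStripeEvent(event); // must tolerate being re-run
await pool.query(
  `UPDATE processed_webhook_events
   SET status = 'processed', processed_at = now()
   WHERE event_id = $1`,
  [event.id],
);
res.json({ received: true });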

Option B — Wrap the insert + work in a single DB transaction. If the work throws, the insert rolls back, and the next retry sees no row. This is the cleanest pattern when your business logic is also DB-bound:

await pool.query("BEGIN");
try {
  await pool.query(
    "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
    [event.id, "stripe"],
  );
  await handleStripeEvent(event); // does its own DB writes inside the txn
  await pool.query("COMMIT");
} catch (err) {
  await pool.query("ROLLBACK");
  if ((err as { code?: string }).code === "23505") {
    return res.json({ received: true, duplicate: true });
  }
  throw err; // sender will retry; rollback means we're idempotent
}

Option B is the right answer when handlers stay inside one database. Option A is necessary when your handler does external API calls (sending an email, calling Slack, etc.) that can't be rolled back.

Pillar 2: Return 2xx fast — defer work to a queue

Most providers time out at 5-30 seconds. If you do all your processing inline, every slow handler == retry storm. The solution: acknowledge fast, work asynchronously.

app.post("/webhooks/stripe", async (req, res) => {
  const event = verifyStripeSignature(req);

  // Sync: dedupe + enqueue + return.
  await enqueueForProcessing(event); // this is BullMQ / SQS / DB-backed jobs
  res.status(200).json({ received: true });
});

// Worker process — handles the actual logic, retries on its own schedule.
worker.process("stripe-events", async (job) => {
  await handleStripeEvent(job.data);
});

Trade-off: now you have two retry layers (the sender's, and your worker's). Make sure your worker also dedupes by event ID before doing real work — same pattern as Pillar 1.
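
Here's what that looks like on the worker side, as a sketch; I'm assuming BullMQ since that's what I reach for, and the Redis connection details are placeholders:

// Worker-side dedupe: the same 23505 dance as Pillar 1, before real work.
import { Worker } from "bullmq";

const stripeWorker = new Worker(
  "stripe-events",
  async (job) => {
    const event = job.data;
    try {
      await pool.query(
        "INSERT INTO processed_webhook_events (event_id, source) VALUES ($1, $2)",
        [event.id, "stripe"],
      );
    } catch (err) {
      if ((err as { code?: string }).code === "23505") return; // already handled, no-op
      throw err; // let the queue retry on its own schedule
    }
    await handleStripeEvent(event);
  },
  { connection: { host: "localhost", port: 6379 } },
);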

Pillar 3: Understand each sender's retry policy

You can't tune your dead-letter strategy without knowing the upstream retry budget. The current (2026) policies:

Provider   Retry attempts             Backoff schedule                      Total window
--------   --------------             ----------------                      ------------
Stripe     up to 17                   exponential, ~immediate → ~3 days     3 days
GitHub     up to 50                   exponential                           ~8 hours
Shopify    up to 19                   exponential, ~hours apart             48 hours
Slack      up to 3                    1 min, 5 min, 30 min                  ~36 min
Twilio     configurable (3 default)   exponential, per-webhook config       varies
Square     up to 70 (!)               exponential, ~immediate → ~72 hours   72 hours
HubSpot    up to 10                   exponential                           ~8 hours
SendGrid   retries for ~24 hours      exponential                           24 hours

Two things this table tells you:

  1. The retry windows are LONG. If Stripe gives you 3 days and Square gives you 72 hours, your handler stability matters over days, not seconds. A "blip" outage that lasts 30 minutes will resolve itself before any of these senders give up.
  2. Slack is the outlier. ~36 minutes of retries means a longer outage drops Slack events on the floor. If Slack signals are critical to your app, you need defensive replay tooling.

Source: each provider's published retry docs as of 2026-04. Re-verify before quoting in production.

Designing for the sender, not against it

A common anti-pattern: returning 4xx for "expected" failures (like a duplicate event you don't care to process). Stripe and most others stop retrying on 4xx; they treat it as "your endpoint rejected the event, that's terminal."

The right responses:

  • 2xx: "I have this. Please don't retry." Use even when you're skipping a duplicate or ignoring an event type.
  • 5xx or timeout: "I'm broken. Please retry." Use for transient infra problems.
  • 4xx: "Don't ever try again." Reserve for malformed requests or signature failures — explicit "stop retrying" intent.

If your endpoint returns 4xx during a partial outage, you'll silently lose events you actually wanted.
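
Put together, the routing is a few lines (a sketch reusing verifyStripeSignature and enqueueForProcessing from earlier):

// Sketch: map failure classes onto the three response intents above.
app.post(
  "/webhooks/stripe",
  express.raw({ type: "application/json" }),
  async (req, res) => {
    let event;
    try {
      event = verifyStripeSignature(req); // throws if invalid
    } catch {
      return res.status(400).send("bad signature"); // terminal: never retry
    }
    try {
      await enqueueForProcessing(event); // duplicates are a no-op inside
      res.status(200).json({ received: true }); // "I have this"
    } catch {
      res.status(500).send("transient failure"); // "I'm broken, please retry"
    }
  },
);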

Pillar 4: Dead-letter handling

Eventually some events fail every retry. Maybe a customer was deleted between the event firing and your retry, maybe a downstream API changed schemas. You need:

  1. A dead-letter table that captures fully-failed events.
  2. An alert when something lands there.
  3. A manual replay path to reprocess after fixing the bug.

CREATE TABLE webhook_dead_letters (
  id BIGSERIAL PRIMARY KEY,
  source        TEXT NOT NULL,
  event_id      TEXT NOT NULL,
  raw_headers   JSONB NOT NULL,
  raw_body      BYTEA NOT NULL,         -- preserve EXACTLY what arrived
  last_error    TEXT NOT NULL,
  attempts      INT  NOT NULL,
  received_at   TIMESTAMPTZ NOT NULL,
  resolved_at   TIMESTAMPTZ
);

Critical: store the raw bytes of the request, not the parsed JSON. When you fix the handler and want to replay, you need the exact bytes the signature was computed over — otherwise verification fails and you can't retry it cleanly.
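
Populating the table from the worker might look like this; a sketch assuming the BullMQ worker from Pillar 2 and a job payload I'm inventing here that carries the raw body base64-encoded (queue payloads are JSON, so raw bytes need encoding):

// Sketch: dead-letter a job once its own retries are exhausted.
// job.data = { source, eventId, headers, rawBodyB64 } is our shape, not BullMQ's.
stripeWorker.on("failed", async (job, err) => {
  if (!job || job.attemptsMade < (job.opts.attempts ?? 1)) return; // retries remain
  await pool.query(
    `INSERT INTO webhook_dead_letters
       (source, event_id, raw_headers, raw_body, last_error, attempts, received_at)
     VALUES ($1, $2, $3, $4, $5, $6, $7)`,
    [
      job.data.source,
      job.data.eventId,
      JSON.stringify(job.data.headers),
      Buffer.from(job.data.rawBodyB64, "base64"), // the exact bytes that arrived
      String(err),
      job.attemptsMade,
      new Date(job.timestamp),
    ],
  );
});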

Once a row lands here, alert (Slack, PagerDuty, email — whatever you use). Manual replay is then:

async function replayDeadLetter(id: number) {
  const { rows } = await pool.query(
    "SELECT * FROM webhook_dead_letters WHERE id = $1",
    [id],
  );
  const row = rows[0];
  if (!row) throw new Error("dead letter not found");

  // Replay through the same handler — your idempotency table
  // ensures we don't double-process if it ALSO hit the live path.
  await processWebhook(row.source, row.raw_headers, row.raw_body);
  await pool.query(
    "UPDATE webhook_dead_letters SET resolved_at = now() WHERE id = $1",
    [id],
  );
}

Test your retry handling without waiting for production

The hardest part of this whole architecture is testing. Real Stripe retries happen on Stripe's schedule, days apart. You can't reliably write an integration test against "what happens on the 4th retry."

Two patterns that work:

Use HookRay to capture and replay

I built HookRay specifically because this loop drove me crazy:

  1. Get a free HookRay URL (no signup required for the first 100 captures).
  2. Point Stripe / GitHub / Shopify at the URL in their dashboard.
  3. Trigger a real event (test mode is fine).
  4. From HookRay's UI, click "Replay" — re-send the captured webhook to your local handler (localhost:3000 via tunnel, or HookRay Pro forwards directly).
  5. Click Replay 5 times in a row to verify your idempotency table catches duplicates.

This is the fastest "did my retry handling actually work?" loop I've found.

Provider CLIs

Stripe and GitHub both ship CLI tools that forward real events to localhost:

# Stripe — replay a specific event with: stripe events resend evt_xxx
stripe listen --forward-to localhost:3000/webhooks/stripe

# GitHub
gh webhook forward --repo owner/repo --url localhost:3000/webhooks/github

These are great for the happy path but they don't simulate the sender's retry storm — for that you need replay (HookRay or roll your own).
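
Rolling your own is about a dozen lines. A sketch, assuming you've saved a captured event's exact raw body to a file and its Stripe-Signature header to an env var (both names are made up):

// Sketch: crude retry-storm test. Fire the identical request 5 times;
// after the first, every response should come back with duplicate: true.
import { readFileSync } from "node:fs";

const rawBody = readFileSync("captured-event.json"); // hypothetical capture file

for (let i = 0; i < 5; i++) {
  const res = await fetch("http://localhost:3000/webhooks/stripe", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "stripe-signature": process.env.CAPTURED_SIG ?? "", // saved alongside the body
    },
    body: rawBody,
  });
  console.log(i, res.status, await res.text());
}

One caveat: Stripe's signature scheme embeds a timestamp, so a verifier with a strict tolerance will reject an old capture; relax the tolerance for this test.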

Common retry bugs (and how to spot them)

Bug 1: Returning 200 too early. You return 200 then crash before persisting the event. Sender thinks it's done; you lost the data. Fix: persist (or enqueue) before returning 200.

Bug 2: Idempotency on the wrong key. Using a synthetic key (like ${customer_id}_${event_type}) instead of the provider's event ID. Two distinct events with the same composite key collide; legitimate events get dropped as "duplicates." Fix: always dedupe on the provider's event ID.

Bug 3: Returning 4xx for expected duplicates. This stops retries, which sounds good — until you realize that all transient errors during the duplicate path also become 4xx. You silently break legitimate retry. Fix: return 200 OK with {duplicate: true} body for known duplicates; reserve 4xx for truly malformed requests.

Bug 4: Inline external API calls. Your handler calls Stripe's API to fetch related data, the call hangs for 30 seconds, the webhook times out, the sender retries, your handler hangs again, your queue fills up. Fix: enqueue + ack fast (Pillar 2).

Bug 5: Lost events during deploys. Your handler is mid-processing when the container is replaced. The event gets a 5xx (or worse, a half-completed write) and the sender retries. Without graceful shutdown handling, your retry table doesn't capture the in-flight ID. Fix: drain the in-flight queue before deploying, OR use the pending status pattern from Option A above.
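
For the deploy case specifically, graceful shutdown is short (a sketch assuming the BullMQ worker and pool from earlier):

// Drain in-flight work before the container dies. worker.close() stops
// picking up new jobs and waits for active ones to finish.
process.on("SIGTERM", async () => {
  await stripeWorker.close(); // drain in-flight jobs
  await pool.end();           // then release DB connections
  process.exit(0);
});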

Summary checklist

Before declaring your webhook handler "production-ready," verify:

  • [ ] Every handler dedupes by the provider's event ID
  • [ ] The dedupe table has a UNIQUE constraint on event_id
  • [ ] Either Option A (pending → processed) or Option B (transactional) is in use
  • [ ] Handlers return 2xx on duplicates, NOT 4xx
  • [ ] Long-running work is enqueued, not done inline
  • [ ] You have a dead-letter table that stores raw headers + raw body
  • [ ] You alert on dead-letter inserts (Slack/PagerDuty/email)
  • [ ] You have a tested manual-replay path
  • [ ] Your retry handling has been verified with HookRay replay or a similar tool

Pairs naturally with the security half of a robust receiver — see Webhook Signature Verification (HMAC-SHA256) in Node, Python, Ruby — 2026 Guide.

If you want a free webhook URL to capture real Stripe / GitHub / Shopify events and replay them at your local handler, grab one from HookRay — no signup, captures the raw payload + headers exactly as sent.


Drop a 🔖 if this saved you from a duplicate-charge incident, and tell me in the comments which sender's retry behavior has surprised you the most. I'm collecting horror stories for the v2 of this guide.
