Muhammad Masad Ashraf

Posted on May 22 • Originally published at kolachitech.com

Stop Losing Shopify Webhooks: A Retry Strategy That Survives Real Outages

#shopify #webhooks #webdev #architecture

Here is a scenario every Shopify developer eventually lives through.

It is the middle of a deploy that has gone sideways. Your endpoint is down for maybe five hours. Nothing dramatic. But while you were shipping, three orders came in, an inventory sync fired, and a customer updated their address.

Shopify tried to deliver those webhooks. It retried. Then it gave up.

Those events are gone. Not delayed. Gone.

I have watched this exact thing happen, and the worst part is how quiet it is. No error in your logs. No alert. Just a slow drift between what Shopify knows and what your system knows. You find out days later when a customer asks where their order went.

So let us talk about how to actually fix this, properly, with a retry strategy that survives more than a quick hiccup.

First, Know What Shopify Actually Does

You cannot build a good retry layer until you know what the platform gives you for free.

When Shopify sends a webhook, it waits 5 seconds for your endpoint to respond. Any 2xx status code is a success. Anything else, including a timeout, is a failure.

After a failure, Shopify retries. And here is the part people get wrong, because the internet is full of outdated info:

As of the September 2024 policy update, Shopify retries a failed webhook up to 8 times over a 4-hour window with exponential backoff.

The old "19 retries over 48 hours" number you will find in older blog posts is dead. If your reliability code was written before late 2024, your timing assumptions are probably wrong.

A quick reference:

Behavior	Detail
Timeout	5 seconds per attempt
Success	Any 2xx status
Retries	Up to 8 attempts
Window	4 hours total
Backoff	Exponential
Payload	Original payload from trigger time
The scary part	Persistent failures delete the subscription

That last row is the one that ruins weekends. If your endpoint fails enough, Shopify does not just drop events. It removes the webhook subscription. New events stop firing entirely until you re-register. Silent and total.

The Problem With 4 Hours

Exponential backoff over 8 attempts in 4 hours front-loads everything. Most of your retries happen in the first 30 minutes. By hour 2, roughly 5 of your 8 attempts are spent.

So Shopify's retry system is genuinely good at one thing: surviving short, transient blips. A momentary network drop. A one-off timeout.

It is bad at the thing that actually hurts you: real outages.

A deploy that runs long
A downstream API that rate-limits you for hours
A traffic spike during a sale that pushes responses past 5 seconds
A botched HMAC secret rotation that 401s every webhook

In all of those, the 4-hour window runs out before you recover. Shopify did its job. Your data is still lost.

The conclusion is simple: you need your own retry layer. Shopify's is the floor, not the ceiling.

Rule Zero: Acknowledge Fast, Process Later

Before any retry logic, fix the most common mistake in Shopify webhook code: running business logic inside the webhook endpoint.

If your handler hits a database, calls an external API, or does anything heavy, you are gambling against that 5-second timeout. Cross it once and Shopify marks a perfectly good delivery as failed.

Your endpoint should do almost nothing:

// The entire job of your webhook endpoint
app.post("/webhooks/orders-create", async (req, res) => {
  // 1. Verify the HMAC signature (req.rawBody must be the unparsed body,
  //    captured by your body-parsing middleware)
  if (!verifyHmac(req)) return res.sendStatus(401);

  try {
    // 2. Push the raw payload onto a queue
    await queue.add("orders-create", {
      body: req.rawBody,
      eventId: req.headers["x-shopify-event-id"],
      triggeredAt: req.headers["x-shopify-triggered-at"],
    });
  } catch (err) {
    // Could not enqueue: return a non-2xx so Shopify retries the delivery
    return res.sendStatus(500);
  }

  // 3. Respond immediately
  res.sendStatus(200);
});

That is it. A background worker pulls from the queue and does the real work. This one change eliminates the majority of timeout failures, and it turns retries into a calm internal concern instead of a race against a clock.
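
The handler calls verifyHmac without showing it. Here is a minimal sketch of what it could look like. Shopify signs each webhook with HMAC-SHA256 over the raw request body and sends the base64 digest in the X-Shopify-Hmac-Sha256 header. The assumptions here are mine: req.rawBody holds the unparsed body (Express does not set this by default), and the secret sits in an environment variable I have named SHOPIFY_API_SECRET for illustration.

```javascript
const crypto = require("node:crypto");

// Assumption: the app's client secret lives in this (hypothetical) env var.
const SECRET = process.env.SHOPIFY_API_SECRET || "";

// Returns true when the HMAC-SHA256 of the raw body matches the header value.
function verifyHmac(req, secret = SECRET) {
  const header = req.headers["x-shopify-hmac-sha256"];
  if (typeof header !== "string" || !req.rawBody) return false;

  const expected = crypto
    .createHmac("sha256", secret)
    .update(req.rawBody)
    .digest(); // raw bytes, compared against the decoded header
  const received = Buffer.from(header, "base64");

  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return (
    received.length === expected.length &&
    crypto.timingSafeEqual(received, expected)
  );
}
```

The comparison uses timingSafeEqual rather than === so the check does not leak how many leading bytes matched. Hashing a re-serialized JSON body instead of the raw bytes is the usual reason valid webhooks fail this check.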

Building the Retry Layer

Your retry logic lives in the worker, not the endpoint. When processing fails, the worker decides: retry, or give up?

Step 1: Classify the error

Not every failure deserves a retry. Retrying a permanent error just burns resources.

Error type	Examples	Action
Transient	Timeout, 503, deadlock, rate limit	Retry with backoff
Permanent	Invalid payload, missing field, validation error	Straight to the dead letter queue

Retrying a malformed payload 8 times will not magically make it valid. Categorize first, act second.
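
One way to sketch that classification in the worker is below. The error shape is an assumption, not something Shopify defines: I am assuming your errors carry an HTTP-style status or a Node-style code, and PermanentError is a hypothetical class your own validation code would throw. Database deadlocks have driver-specific codes, so add yours to the list.

```javascript
// Hypothetical convention: errors carry an HTTP-style `status` and/or a
// Node-style `code`. Adjust both lists to whatever your stack throws.
const TRANSIENT_STATUS = new Set([408, 429, 500, 502, 503, 504]);
const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED"]);

// Thrown by your own validation code for payloads that can never succeed.
class PermanentError extends Error {}

function classifyError(err) {
  if (err instanceof PermanentError) return "permanent";
  if (TRANSIENT_STATUS.has(err.status)) return "transient";
  if (TRANSIENT_CODES.has(err.code)) return "transient";
  // Remaining client-side 4xx errors will not fix themselves on retry.
  if (err.status >= 400 && err.status < 500) return "permanent";
  // Unknown failures: treat as transient and let the retry cap limit the cost.
  return "transient";
}
```

Defaulting unknown errors to transient is a judgment call. It costs a few wasted retries on a real bug, but it avoids dead-lettering events over a failure mode you have not seen before.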

Step 2: Exponential backoff with jitter

For transient errors, retry with increasing delays. Each retry waits longer than the last.

Attempt	Delay
1	30 seconds
2	2 minutes
3	8 minutes
4	30 minutes
5	2 hours
6	6 hours
7	24 hours

Notice the window adds up to roughly 33 hours, not Shopify's 4. That is the whole point of building your own layer.

But pure exponential backoff has a trap. If 500 webhooks fail at the same moment, they all retry at the same moment, and your recovering service gets hammered flat again. This is the thundering herd.

Fix it with jitter: add a small random offset to each delay. Instead of retrying at exactly 8 minutes, retry somewhere between 7 and 9. It spreads the load and smooths recovery.

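// Delays in seconds, one per attempt, matching the table above
const SCHEDULE = [30, 120, 480, 1800, 7200, 21600, 86400];

// Returns the delay in seconds before retry number `attempt` (1-based)
function nextDelay(attempt) {
  const index = Math.min(Math.max(attempt, 1), SCHEDULE.length) - 1;
  const base = SCHEDULE[index]; // capped at 24h
  const jitter = base * (Math.random() * 0.25 - 0.125); // +/- 12.5%, so 8 min lands between 7 and 9
  return base + jitter;
}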

Step 3: Cap retries, then use a dead letter queue

Retries cannot run forever. Cap them, usually somewhere between 5 and 10 attempts. When a webhook exhausts its retries, it does not vanish. It moves to a dead letter queue.

The DLQ stores the failed webhook with full context: original payload, error message, attempt count, timestamp. Now you can:

Inspect it and find the root cause
Reprocess it once the bug is fixed
Alert when the DLQ grows past a normal threshold

That alert is your early warning system. A spike in DLQ volume tells you something is wrong before any customer notices.
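
The retry-or-dead-letter decision can be sketched as one small function. Everything here is illustrative: the job shape, the in-memory array standing in for a real DLQ (normally a database table or a dedicated queue), and both constants are assumptions to tune for your own system.

```javascript
const MAX_ATTEMPTS = 7; // hypothetical cap, one per step of a 7-step schedule
const DLQ_ALERT_THRESHOLD = 50; // hypothetical "normal" ceiling for DLQ size

// In-memory stand-in for a real DLQ (a database table or a dedicated queue).
const deadLetters = [];

// Decide what happens to a job whose processing just threw `err`.
// `isPermanent` is the result of your own error classification.
function handleFailure(job, err, isPermanent) {
  const attempts = job.attempts + 1;
  if (!isPermanent && attempts < MAX_ATTEMPTS) {
    return { action: "retry", attempts };
  }
  // Keep full context so the event can be inspected and replayed later.
  deadLetters.push({
    payload: job.body,
    eventId: job.eventId,
    error: err.message,
    attempts,
    failedAt: new Date().toISOString(),
  });
  return {
    action: "dead-letter",
    attempts,
    alert: deadLetters.length > DLQ_ALERT_THRESHOLD,
  };
}
```

Permanent errors skip the retry branch entirely, and the alert flag flips as soon as the DLQ grows past the threshold, which is the early warning described above.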

The Duplicate Problem Nobody Mentions

Retries create a brand new problem: the same webhook arriving twice.

Shopify delivers a webhook, your slow response times out, Shopify retries, but your worker already processed the first one. Now you have a duplicate. Without protection, that means double inventory adjustments, duplicate emails, double-charged orders.

The fix is idempotency. Every Shopify webhook carries an X-Shopify-Event-Id header, and that value stays identical across every retry of the same event.

Use it as a dedup key:

async function processWebhook(job) {
  const { eventId } = job.data;

  if (await alreadyProcessed(eventId)) {
    return; // safe no-op
  }

  await doTheRealWork(job.data);
  await markProcessed(eventId);
}

A retry strategy without deduplication is incomplete. Build this in from day one.
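
One caveat with a check-then-process flow: if two workers pick up the same event at nearly the same time, both can pass the check before either marks it processed. A hedged sketch of one way around that is to claim the event ID first, atomically, and release the claim if the work fails. The Map below is only an in-memory stand-in; in a real system the claim would be something like a unique index insert or a Redis SET with the NX option. The function names are hypothetical.

```javascript
// In-memory stand-in for a store with an atomic "insert if absent".
const claimed = new Map(); // eventId -> "processing" | "done"

// Returns true only for the first caller to claim a given event ID.
function tryClaim(eventId) {
  if (claimed.has(eventId)) return false;
  claimed.set(eventId, "processing");
  return true;
}

// Runs `doWork` at most once per event ID, even under concurrent delivery.
async function processOnce(job, doWork) {
  if (!tryClaim(job.eventId)) return "skipped";
  try {
    await doWork(job);
    claimed.set(job.eventId, "done");
    return "processed";
  } catch (err) {
    claimed.delete(job.eventId); // release so a later retry can claim it again
    throw err;
  }
}
```

Releasing the claim on failure matters. Without it, a transient error on the first attempt would cause every later retry to be skipped as a duplicate.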

Watch Out for Stale Payloads

One more subtlety. When Shopify retries, it sends the original payload from when the event was triggered, not the current state.

If an order changed three times during the retry window, a late retry still carries the first version. Apply it blindly and you overwrite newer data with older data.

Always check the X-Shopify-Triggered-At header against your own records. If your data is already newer, skip the stale update or fetch fresh data from the Admin API instead.
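
A minimal sketch of that check, assuming the header value parses as an ISO 8601 timestamp and that your system stores a per-resource timestamp for the last update it applied (lastAppliedAt is a hypothetical name):

```javascript
// Returns true when the incoming webhook is no newer than what we already hold.
// `triggeredAt` is the X-Shopify-Triggered-At header value;
// `lastAppliedAt` is a hypothetical per-resource timestamp your system stores.
function isStale(triggeredAt, lastAppliedAt) {
  if (!lastAppliedAt) return false; // nothing stored yet, so apply it
  const incoming = Date.parse(triggeredAt);
  if (Number.isNaN(incoming)) return true; // unreadable timestamp: do not trust it
  return incoming <= Date.parse(lastAppliedAt);
}
```

When isStale returns true, the worker would skip the update or fetch the current state from the Admin API. Treating an unreadable timestamp as stale is a conservative choice on my part, since it pushes the worker toward fetching fresh data rather than applying a payload of unknown age.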

The Layer Even Retries Cannot Save You From

Here is the uncomfortable truth: even a perfect retry layer cannot recover an event Shopify never sent, or one it dropped after its own retries failed.

For that, you need reconciliation. Periodically poll the Shopify Admin API and compare its data against yours. Hourly for high-value topics like orders/*. Daily for lower-stakes data.

Reconciliation is the final safety net. It catches events lost to extended outages, dropped events, and gaps from a removed subscription.
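
The core of a reconciliation job is just a diff. Here is a rough sketch with the data access left out on purpose: fetchShopifyOrderIds, fetchLocalOrderIds, and backfill are hypothetical placeholders you would implement against the Admin API and your own database.

```javascript
// Compare IDs Shopify reports for a time window against IDs stored locally.
// IDs are normalized to strings so 123 and "123" count as the same record.
function findMissing(shopifyIds, localIds) {
  const local = new Set(localIds.map(String));
  return shopifyIds.map(String).filter((id) => !local.has(id));
}

// Hypothetical wiring: the three callbacks are placeholders for your own
// Admin API query, database query, and re-import logic.
async function reconcile({ since, fetchShopifyOrderIds, fetchLocalOrderIds, backfill }) {
  const [remote, local] = await Promise.all([
    fetchShopifyOrderIds(since),
    fetchLocalOrderIds(since),
  ]);
  const missing = findMissing(remote, local);
  for (const id of missing) {
    await backfill(id); // sequential, to stay gentle on rate limits
  }
  return missing;
}
```

This version only catches records that are missing entirely. Catching records that exist but are out of date would need a second comparison, for example on an updated-at timestamp.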

Combine the three layers and you lose almost nothing:

Layer	Catches
Shopify retries	Short transient blips
Your retry layer + DLQ	Longer outages, downstream failures
Reconciliation	Everything else, including dropped events

Do Not Forget Monitoring

A retry strategy fails silently without eyes on it. Track these:

Webhook failure rate
DLQ size and growth
Retry volume
Subscription status (run a daily check, alert on any missing topic)
Processing latency creeping toward 5 seconds
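
The daily subscription check boils down to comparing the topics you expect against the topics that are actually registered. A small sketch, assuming topics are written in the "orders/create" style and that you fetch the registered list yourself. The expected list below is only an example.

```javascript
// Topics this app depends on (example values; use your own list).
const EXPECTED_TOPICS = ["orders/create", "orders/updated", "inventory_levels/update"];

// `registeredTopics` is whatever your own lookup of current subscriptions
// returns, as an array of topic strings. Comparison ignores case.
function missingTopics(registeredTopics, expected = EXPECTED_TOPICS) {
  const registered = new Set(registeredTopics.map((t) => t.toLowerCase()));
  return expected.filter((t) => !registered.has(t.toLowerCase()));
}
```

A non-empty result is the signal to alert and re-register, since a removed subscription means no new events arrive for that topic at all.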

Shopify's Dev Dashboard has a delivery metrics report with response codes and retry counts per topic. Use it alongside your own monitoring, not instead of it.

TL;DR

Shopify gives you 8 retries over 4 hours. Real outages last longer than that. So:

Acknowledge fast. Return 200 in under 5 seconds, queue the payload.
Classify errors. Transient gets retried, permanent goes to the DLQ.
Exponential backoff with jitter. Retry over days, not hours.
Idempotency. Use X-Shopify-Event-Id to kill duplicates.
Dead letter queue. Nothing gets lost, everything is replayable.
Reconcile. Poll the Admin API to catch what slipped through.
Monitor. Surface problems before customers do.

Build these layers once and your integration stops losing data, even on its worst day.

If you want the full guide, it is on kolachitech.com, where this post was originally published.

How does your team handle webhook reliability? I would genuinely like to hear what has worked for you in the comments.
