Stop Losing Shopify Webhooks: A Retry Strategy That Survives Real Outages
Here is a scenario every Shopify developer eventually lives through.
It is the middle of a deploy. Your endpoint is down for maybe 90 minutes. Nothing dramatic. But while you were shipping, three orders came in, an inventory sync fired, and a customer updated their address.
Shopify tried to deliver those webhooks. It retried. Then it gave up.
Those events are gone. Not delayed. Gone.
I have watched this exact thing happen, and the worst part is how quiet it is. No error in your logs. No alert. Just a slow drift between what Shopify knows and what your system knows. You find out days later when a customer asks where their order went.
So let us talk about how to actually fix this, properly, with a retry strategy that survives more than a quick hiccup.
First, Know What Shopify Actually Does
You cannot build a good retry layer until you know what the platform gives you for free.
When Shopify sends a webhook, it waits 5 seconds for your endpoint to respond. Any 2xx status code is a success. Anything else, including a timeout, is a failure.
After a failure, Shopify retries. And here is the part people get wrong, because the internet is full of outdated info:
As of the September 2024 policy update, Shopify retries a failed webhook up to 8 times over a 4-hour window with exponential backoff.
The old "19 retries over 48 hours" number you will find in older blog posts is dead. If your reliability code was written before late 2024, your timing assumptions are probably wrong.
A quick reference:
| Behavior | Detail |
|---|---|
| Timeout | 5 seconds per attempt |
| Success | Any 2xx status |
| Retries | Up to 8 attempts |
| Window | 4 hours total |
| Backoff | Exponential |
| Payload | Original payload from trigger time |
| The scary part | Persistent failures delete the subscription |
That last row is the one that ruins weekends. If your endpoint fails enough, Shopify does not just drop events. It removes the webhook subscription. New events stop firing entirely until you re-register. Silent and total.
The Problem With 4 Hours
Exponential backoff over 8 attempts in 4 hours front-loads everything. Most of your retries happen in the first 30 minutes. By hour 2, roughly 5 of your 8 attempts are spent.
So Shopify's retry system is genuinely good at one thing: surviving short, transient blips. A momentary network drop. A one-off timeout.
It is bad at the thing that actually hurts you: real outages.
- A deploy that runs long
- A downstream API that rate-limits you for hours
- A traffic spike during a sale that pushes responses past 5 seconds
- A botched HMAC secret rotation that 401s every webhook
In all of those, the 4-hour window runs out before you recover. Shopify did its job. Your data is still lost.
The conclusion is simple: you need your own retry layer. Shopify's is the floor, not the ceiling.
Rule Zero: Acknowledge Fast, Process Later
Before any retry logic, fix the most common mistake in Shopify webhook code: running business logic inside the webhook endpoint.
If your handler hits a database, calls an external API, or does anything heavy, you are gambling against that 5-second timeout. Cross it once and Shopify marks a perfectly good delivery as failed.
Your endpoint should do almost nothing:
```javascript
// The entire job of your webhook endpoint
app.post("/webhooks/orders-create", async (req, res) => {
  // 1. Verify the HMAC signature
  if (!verifyHmac(req)) return res.sendStatus(401);

  try {
    // 2. Push the raw payload onto a queue
    //    (req.rawBody assumes body-parsing middleware that keeps the raw bytes)
    await queue.add("orders-create", {
      body: req.rawBody,
      eventId: req.headers["x-shopify-event-id"],
      triggeredAt: req.headers["x-shopify-triggered-at"],
    });
  } catch (err) {
    // Could not enqueue: answer non-2xx so Shopify retries the delivery
    return res.sendStatus(500);
  }

  // 3. Respond immediately
  res.sendStatus(200);
});
```
That is it. A background worker pulls from the queue and does the real work. This one change eliminates the majority of timeout failures, and it turns retries into a calm internal concern instead of a race against a clock.
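The handler above calls `verifyHmac` without defining it. Here is a minimal sketch of what that check could look like, assuming you have the raw request bytes and your app's secret on hand. Shopify signs the raw body with HMAC-SHA256 and sends the base64 digest in the `X-Shopify-Hmac-Sha256` header. The three-argument signature is my own choice for illustration, not the one used in the handler.

```javascript
const { createHmac, timingSafeEqual } = require("node:crypto");

// rawBody: Buffer or string with the exact bytes Shopify sent
// hmacHeader: value of the X-Shopify-Hmac-Sha256 header
// secret: your app's secret (how you load it is up to you)
function verifyHmac(rawBody, hmacHeader, secret) {
  if (typeof hmacHeader !== "string") return false;

  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(hmacHeader, "base64");

  // timingSafeEqual throws on length mismatch, so check length first
  return received.length === expected.length && timingSafeEqual(received, expected);
}
```

The comparison has to run against the raw body. If a JSON parser has already re-serialized the payload, the bytes can differ and every signature check will fail.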
Building the Retry Layer
Your retry logic lives in the worker, not the endpoint. When processing fails, the worker decides: retry, or give up?
Step 1: Classify the error
Not every failure deserves a retry. Retrying a permanent error just burns resources.
| Error type | Examples | Action |
|---|---|---|
| Transient | Timeout, 503, deadlock, rate limit | Retry with backoff |
| Permanent | Invalid payload, missing field, validation error | Straight to the dead letter queue |
Retrying a malformed payload 8 times will not magically make it valid. Categorize first, act second.
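One way to sketch that classification step. `ValidationError` and the `status` property are assumptions about how your own code reports failures, so adapt them to whatever your worker actually throws.

```javascript
// Hypothetical error type that your own payload validation would throw
class ValidationError extends Error {}

// Downstream HTTP statuses that will not improve on retry (an assumed list)
const PERMANENT_STATUS = new Set([400, 422]);

function classifyError(err) {
  if (err instanceof ValidationError) return "permanent";
  if (PERMANENT_STATUS.has(err?.status)) return "permanent";
  // Timeouts, 503s, deadlocks, rate limits, and anything unrecognized
  return "transient";
}
```

Defaulting unknown errors to transient is a judgment call. It costs a few wasted retries, but an unfamiliar failure still gets a second chance before it is parked.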
Step 2: Exponential backoff with jitter
For transient errors, retry with increasing delays. Each retry waits longer than the last.
| Attempt | Delay |
|---|---|
| 1 | 30 seconds |
| 2 | 2 minutes |
| 3 | 8 minutes |
| 4 | 30 minutes |
| 5 | 2 hours |
| 6 | 6 hours |
| 7 | 24 hours |
Notice the window stretches across days, not Shopify's 4 hours. That is the whole point of building your own layer.
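To put a number on it, here is the table summed up. The array is just the delays above converted to seconds.

```javascript
// Delays from the table, in seconds: 30s, 2m, 8m, 30m, 2h, 6h, 24h
const SCHEDULE_SECONDS = [30, 120, 480, 1800, 7200, 21600, 86400];

const totalSeconds = SCHEDULE_SECONDS.reduce((sum, s) => sum + s, 0);
const totalHours = totalSeconds / 3600;

console.log(totalSeconds, totalHours.toFixed(1)); // prints: 117630 32.7
```

That is roughly 33 hours of coverage before jitter, about eight times Shopify's window.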
But pure exponential backoff has a trap. If 500 webhooks fail at the same moment, they all retry at the same moment, and your recovering service gets hammered flat again. This is the thundering herd.
Fix it with jitter: add a small random offset to each delay. Instead of retrying at exactly 8 minutes, retry somewhere between 7 and 9. It spreads the load and smooths recovery.
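```javascript
// Delay schedule in seconds, matching the table above: 30s, 2m, 8m, 30m, 2h, 6h, 24h
const DELAYS = [30, 120, 480, 1800, 7200, 21600, 86400];

function nextDelay(attempt) {
  // attempt is 1-based; anything past the end of the schedule stays at the 24h cap
  const base = DELAYS[Math.min(attempt, DELAYS.length) - 1];
  const jitter = base * (Math.random() * 0.25 - 0.125); // +/-12.5%, so 8 minutes lands between 7 and 9
  return base + jitter; // seconds
}
```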
Step 3: Cap retries, then use a dead letter queue
Retries cannot run forever. Cap them, usually somewhere between 5 and 10 attempts. When a webhook exhausts its retries, it does not vanish. It moves to a dead letter queue.
The DLQ stores the failed webhook with full context: original payload, error message, attempt count, timestamp. Now you can:
- Inspect it and find the root cause
- Reprocess it once the bug is fixed
- Alert when the DLQ grows past a normal threshold
That alert is your early warning system. A spike in DLQ volume tells you something is wrong before any customer notices.
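Here is a rough sketch of the decision the worker makes on each failure. The job shape (`payload`, `retries`) and the injected `isTransient` and `delayFor` helpers are hypothetical, and the cap of 7 is an assumption that mirrors the delay table. Most queue libraries have their own way to express this, so treat it as the logic rather than the wiring.

```javascript
const MAX_RETRIES = 7; // assumption: one retry per row of the delay table

// job: { payload, retries } where retries counts retries already performed
// deps.isTransient(err) -> boolean, deps.delayFor(attempt) -> seconds
function onFailure(job, err, { isTransient, delayFor }) {
  if (isTransient(err) && job.retries < MAX_RETRIES) {
    const attempt = job.retries + 1;
    return { action: "retry", attempt, delaySeconds: delayFor(attempt) };
  }

  // Permanent error or retries exhausted: keep full context for later replay
  return {
    action: "dead-letter",
    record: {
      payload: job.payload,
      error: err.message,
      attempts: job.retries + 1, // the first try plus every retry
      failedAt: new Date().toISOString(),
    },
  };
}
```

Returning a decision instead of touching the queue directly keeps the function easy to test. The caller is the one that re-enqueues with the delay or writes the record to the DLQ.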
The Duplicate Problem Nobody Mentions
Retries create a brand new problem: the same webhook arriving twice.
Shopify delivers a webhook, your slow response times out, Shopify retries, but your worker already processed the first one. Now you have a duplicate. Without protection, that means double inventory adjustments, duplicate emails, double-charged orders.
The fix is idempotency. Every Shopify webhook carries an X-Shopify-Event-Id header, and that value stays identical across every retry of the same event.
Use it as a dedup key:
```javascript
async function processWebhook(job) {
  const { eventId } = job.data;

  if (await alreadyProcessed(eventId)) {
    return; // safe no-op
  }

  await doTheRealWork(job.data);
  await markProcessed(eventId);
}
```
A retry strategy without deduplication is incomplete. Build this in from day one.
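One caveat with check-then-mark: two workers that pick up the same event at the same moment can both pass the check. A common way around that is to claim the event ID first and release it if the work fails. The sketch below uses an in-memory `Set` purely as a stand-in. Across multiple processes you would need a shared store with an atomic insert-if-absent, such as a unique constraint on the event ID column. `ProcessedStore` and `processOnce` are names I made up for illustration.

```javascript
// In-memory stand-in for a shared store with atomic insert-if-absent
class ProcessedStore {
  constructor() {
    this.seen = new Set();
  }

  // Returns true only for the first caller that claims this event ID
  claim(eventId) {
    if (this.seen.has(eventId)) return false;
    this.seen.add(eventId);
    return true;
  }

  // Undo a claim so a later retry can attempt the work again
  release(eventId) {
    this.seen.delete(eventId);
  }
}

async function processOnce(store, job, doWork) {
  if (!store.claim(job.eventId)) return "duplicate";
  try {
    await doWork(job);
    return "processed";
  } catch (err) {
    store.release(job.eventId);
    throw err;
  }
}
```

Releasing on failure matters. Without it, a job that fails once would be treated as a duplicate on every retry and never actually run.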
Watch Out for Stale Payloads
One more subtlety. When Shopify retries, it sends the original payload from when the event was triggered, not the current state.
If an order changed three times during the retry window, a late retry still carries the first version. Apply it blindly and you overwrite newer data with older data.
Always check the X-Shopify-Triggered-At header against your own records. If your data is already newer, skip the stale update or fetch fresh data from the Admin API instead.
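A small sketch of that comparison, assuming you store a timestamp for the last update you applied to each record. `lastAppliedAt` is a hypothetical field on your side, and both values are assumed to be ISO 8601 strings.

```javascript
// triggeredAtHeader: value of X-Shopify-Triggered-At
// lastAppliedAt: timestamp of the newest update already applied locally (hypothetical field)
function isStale(triggeredAtHeader, lastAppliedAt) {
  const incoming = Date.parse(triggeredAtHeader);
  const current = Date.parse(lastAppliedAt);

  // If either side is missing or unparseable, we cannot prove staleness
  if (Number.isNaN(incoming) || Number.isNaN(current)) return false;

  return incoming < current;
}
```

When this returns `true`, skip the write or refetch the resource. When it cannot decide, it errs on the side of applying the update, which you may want to flip for data where an overwrite is expensive.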
The Layer Even Retries Cannot Save You From
Here is the uncomfortable truth: even a perfect retry layer cannot recover an event Shopify never sent, or one it dropped after its own retries failed.
For that, you need reconciliation. Periodically poll the Shopify Admin API and compare its data against yours. Hourly for high-value topics like orders/*. Daily for lower-stakes data.
Reconciliation is the final safety net. It catches events lost to extended outages, dropped events, and gaps from a removed subscription.
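The comparison at the heart of reconciliation is a set difference. This sketch leaves out the Admin API fetch entirely and assumes you already have two lists of IDs in hand, one from Shopify and one from your database. `findMissingIds` is an illustrative name.

```javascript
// shopifyIds: IDs returned by your Admin API poll for some time window
// localIds: IDs your system has recorded for the same window
function findMissingIds(shopifyIds, localIds) {
  const known = new Set(localIds);
  return shopifyIds.filter((id) => !known.has(id));
}
```

Anything this returns is an event you never processed. Fetch those records and run them through the same worker path as a normal webhook so dedup and error handling still apply.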
Combine the three layers and you lose almost nothing:
| Layer | Catches |
|---|---|
| Shopify retries | Short transient blips |
| Your retry layer + DLQ | Longer outages, downstream failures |
| Reconciliation | Everything else, including dropped events |
Do Not Forget Monitoring
A retry strategy fails silently without eyes on it. Track these:
- Webhook failure rate
- DLQ size and growth
- Retry volume
- Subscription status (run a daily check, alert on any missing topic)
- Processing latency creeping toward 5 seconds
Shopify's Dev Dashboard has a delivery metrics report with response codes and retry counts per topic. Use it alongside your own monitoring, not instead of it.
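As one example of your own monitoring, here is a sketch of a latency check against the 5-second timeout. The 50% warning threshold and the nearest-rank p95 are assumptions, and `latencyAlert` is a made-up name. Feed it response times from whatever metrics you already collect.

```javascript
const TIMEOUT_MS = 5000; // Shopify's per-attempt timeout

// Nearest-rank percentile over a list of numbers
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Returns an alert object when p95 latency reaches warnRatio of the timeout, else null
function latencyAlert(samplesMs, warnRatio = 0.5) {
  if (samplesMs.length === 0) return null;
  const p95 = percentile(samplesMs, 95);
  return p95 >= TIMEOUT_MS * warnRatio ? { p95, budgetMs: TIMEOUT_MS } : null;
}
```

Alerting well below the timeout gives you room to react while deliveries are still succeeding.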
TL;DR
Shopify gives you 8 retries over 4 hours. Real outages last longer than that. So:
- Acknowledge fast. Return 200 in under 5 seconds, queue the payload.
- Classify errors. Transient gets retried, permanent goes to the DLQ.
- Exponential backoff with jitter. Retry over days, not hours.
- Idempotency. Use X-Shopify-Event-Id to kill duplicates.
- Dead letter queue. Nothing gets lost, everything is replayable.
- Reconcile. Poll the Admin API to catch what slipped through.
- Monitor. Surface problems before customers do.
Build these layers once and your integration stops losing data, even on its worst day.
How does your team handle webhook reliability? I would genuinely like to hear what has worked for you in the comments.