Your Shopify integration fires perfectly in staging. Then production hits. Your server goes down for two minutes during a deploy. Shopify sends 40 webhook events. Your handler catches none of them.
That is real data loss. Orders not fulfilled. Inventory not updated. Customers not synced.
A webhook replay system is the fix. This guide walks you through building one that stores every incoming event, retries failures intelligently, and lets you replay any batch of events on demand.
Why Webhooks Fail in the First Place
Shopify retries failed webhooks up to 19 times over 48 hours. After that, you are on your own. Here is what causes failures:
| Failure Cause | Description |
|---|---|
| Server downtime | Endpoint offline when Shopify fires |
| Timeout | Handler takes longer than 5 seconds |
| Code errors | Bug in your handler throws an exception |
| Rate limiting | System overloaded, rejects the request |
| DB unavailability | Database is down when handler tries to write |
| Deployment gaps | Deploy restarts server at the wrong moment |
After 48 hours and 19 retries, Shopify flags your endpoint as failing. Eventually it stops sending events entirely.
A replay system gets you out of that hole.
The Core Architecture
A production webhook replay system has four stages:
Shopify
|
v
[Ingestion Endpoint] <-- ACK immediately, store raw payload
|
v
[Event Store (DB)] <-- durable record of every event
|
v
[Processing Queue] <-- background worker picks up events
|
v
[Handler] <-- idempotent processing logic
|
+-- Success --> mark 'succeeded'
+-- Failure --> retry with backoff --> Dead Letter Queue
The key rule: never process inside the ingestion endpoint. Shopify waits only 5 seconds. Acknowledge fast, process separately.
Stage 1: The Ingestion Endpoint
Your ingestion endpoint does exactly two things:
- Verify the HMAC signature
- Write the raw payload to the event store and return
200 OK
app.post('/webhooks', express.raw({ type: 'application/json' }), async (req, res) => {
const hmac = req.headers['x-shopify-hmac-sha256'];
const webhookId = req.headers['x-shopify-webhook-id'];
const topic = req.headers['x-shopify-topic'];
// Verify HMAC first
if (!verifyHmac(req.body, hmac)) {
return res.status(401).send('Unauthorized');
}
// Store immediately, respond fast
await db.query(`
INSERT INTO webhook_events (shopify_id, topic, payload, hmac, status)
VALUES ($1, $2, $3, $4, 'pending')
ON CONFLICT (shopify_id) DO NOTHING
`, [webhookId, topic, req.body.toString(), hmac]);
res.status(200).send('OK');
});
The ON CONFLICT DO NOTHING handles Shopify's own retries. Same event, same ID, no duplicate row.
Stage 2: The Event Store Schema
CREATE TABLE webhook_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
shopify_id VARCHAR(255) UNIQUE,
topic VARCHAR(100) NOT NULL,
payload JSONB NOT NULL,
hmac VARCHAR(255),
status VARCHAR(20) DEFAULT 'pending',
retry_count INT DEFAULT 0,
last_error TEXT,
created_at TIMESTAMPTZ DEFAULT NOW(),
processed_at TIMESTAMPTZ
);
CREATE INDEX idx_webhook_status ON webhook_events(status);
CREATE INDEX idx_webhook_topic ON webhook_events(topic);
CREATE INDEX idx_webhook_created ON webhook_events(created_at);
Index on status, topic, and created_at. These are the three axes you will query for replay.
Storage options by scale:
| Option | Best For |
|---|---|
| PostgreSQL | Most Shopify apps. Full SQL, JSONB indexing. |
| Redis Streams | High-throughput ingestion, consumer groups. |
| AWS SQS + DynamoDB | Managed infra, no ops overhead. |
| Kafka | Enterprise-scale event sourcing. |
Stage 3: The Processing Worker
async function processWebhookEvents() {
const events = await db.query(`
SELECT * FROM webhook_events
WHERE status = 'pending'
ORDER BY created_at ASC
LIMIT 50
FOR UPDATE SKIP LOCKED
`);
for (const event of events.rows) {
try {
await handleEvent(event.topic, JSON.parse(event.payload));
await db.query(`
UPDATE webhook_events
SET status = 'succeeded', processed_at = NOW()
WHERE id = $1
`, [event.id]);
} catch (err) {
const newRetryCount = event.retry_count + 1;
const newStatus = newRetryCount >= 5 ? 'dead' : 'pending';
await db.query(`
UPDATE webhook_events
SET status = $1, retry_count = $2, last_error = $3
WHERE id = $4
`, [newStatus, newRetryCount, err.message, event.id]);
}
}
}
FOR UPDATE SKIP LOCKED prevents multiple workers from grabbing the same event. Safe for concurrent processing.
Stage 4: Exponential Backoff
Do not retry immediately. Back off with jitter:
function getRetryDelay(retryCount) {
const baseDelays = [30, 120, 600, 3600, 21600]; // seconds
const base = (baseDelays[retryCount] || 21600) * 1000;
const jitter = Math.random() * 1000;
return base + jitter;
}
| Retry | Delay |
|---|---|
| 1 | 30 seconds |
| 2 | 2 minutes |
| 3 | 10 minutes |
| 4 | 1 hour |
| 5+ | 6 hours |
Jitter prevents retry storms when many events fail simultaneously.
Stage 5: The Replay Trigger
This is the part that earns the "replay" in "replay system." When you fix a bug and need to reprocess failed events:
async function replayEvents({ topic, status = 'failed', from, to, batchSize = 100 }) {
const events = await db.query(`
SELECT * FROM webhook_events
WHERE status = $1
AND ($2::text IS NULL OR topic = $2)
AND ($3::timestamptz IS NULL OR created_at >= $3)
AND ($4::timestamptz IS NULL OR created_at <= $4)
ORDER BY created_at ASC
LIMIT $5
`, [status, topic, from, to, batchSize]);
for (const event of events.rows) {
// Reset to pending so the worker picks it up
await db.query(`
UPDATE webhook_events
SET status = 'pending', retry_count = 0, last_error = NULL
WHERE id = $1
`, [event.id]);
}
return events.rowCount;
}
// Usage: replay all failed orders/paid from the last 24 hours
await replayEvents({
topic: 'orders/paid',
status: 'failed',
from: new Date(Date.now() - 86400000)
});
Idempotency: The Non-Negotiable Requirement
When you replay, your handler runs twice for the same payload. Without idempotency, you get duplicate orders, double charges, corrupted inventory.
Use the shopify_id (the X-Shopify-Webhook-Id) as your deduplication key:
async function handleOrderCreate(payload) {
const shopifyOrderId = payload.id;
// Check if already processed
const existing = await db.query(
'SELECT id FROM orders WHERE shopify_order_id = $1',
[shopifyOrderId]
);
if (existing.rows.length > 0) {
console.log(`Order ${shopifyOrderId} already processed. Skipping.`);
return;
}
// Safe to process
await db.query(
'INSERT INTO orders (shopify_order_id, ...) VALUES ($1, ...)',
[shopifyOrderId, ...]
);
}
No idempotency = no safe replay. Build this first.
Dead Letter Queue
Events that hit max retries go to DLQ status. Monitor these aggressively:
// Query DLQ events
const dlqEvents = await db.query(`
SELECT topic, COUNT(*) as count, MIN(created_at) as oldest
FROM webhook_events
WHERE status = 'dead'
GROUP BY topic
ORDER BY count DESC
`);
DLQ growth is not a metric to check weekly. Alert on it in real time. Every row is an event that never reached its destination.
When to Replay What
| Scenario | Action |
|---|---|
| Server down for 2 hours | Automatic retry handles it |
| Bug deployed and fixed | Manual replay by time range |
| New integration added | Manual replay to backfill history |
| Downstream API was down | Automatic retry with backoff |
| Data migration | Manual replay by topic and date |
Mistakes That Kill Replay Systems
Not storing the raw payload. Store the exact bytes Shopify sent. If you transform first, you cannot replay the original.
Processing in the ingestion endpoint. 5-second window. Any business logic risks timeout and a missed event.
No idempotency. Replay becomes dangerous instead of safe.
Unlimited retries. Cap at 5-7. Move the rest to DLQ.
Ignoring DLQ depth. Silent DLQ growth means silent data loss.
Tools to Accelerate This
| Tool | Role |
|---|---|
| BullMQ (Node.js) | Queue + retry + backoff out of the box |
| Sidekiq (Ruby) | Background jobs with retry logic |
| AWS SQS + Lambda | Fully managed, serverless |
| Temporal | Durable workflows with built-in replay |
| pg-boss | PostgreSQL-backed job queue |
Wrapping Up
The pattern is simple:
- Ingest fast, store everything
- Process asynchronously
- Retry with backoff
- Route terminal failures to DLQ
- Replay on demand after fixes Every layer compounds reliability. Start with the event store. Add the worker. Add retry logic. Then add the replay trigger.
Your Shopify store generates events constantly. A replay system makes sure none of them disappear.
Originally published at kolachitech.com
Top comments (0)