Building a Dead Letter Queue for Shopify Webhooks
Your Shopify webhook just failed. Network timeout. Database deadlock. Third-party API down.
The webhook payload? Gone forever.
Unless you have a dead letter queue.
This guide walks through building a production-ready DLQ system for Shopify webhooks. I'll show you the architecture, code examples, and mistakes to avoid based on real production systems.
Why Shopify's Built-in Retry Isn't Enough
Shopify retries failed webhooks up to 19 times over 48 hours. After that, they stop delivering to your endpoint.
Attempt 1: Immediate
Attempt 2: 1 second later
Attempt 3: 2 seconds later
...
Attempt 19: 48 hours later
Then: 💀 Webhook deleted forever
For production systems handling payments, orders, and inventory, this isn't acceptable. You need a safety net.
Architecture Overview
Here's the high-level flow:
Shopify → Your Webhook Endpoint → Primary Queue → Worker → Success ✓
↓ ↓
200 OK (always) Failure ✗
↓
Dead Letter Queue
↓
Retry Logic
Critical principle: Always return 200 OK to Shopify immediately. Process webhooks asynchronously.
The Stack
I'll use this stack for examples (adapt to your needs):
- Queue: Redis with Bull (Node.js) or AWS SQS
- Database: PostgreSQL for DLQ storage
- Runtime: Node.js with Express
- Monitoring: Custom dashboard + alerts
Implementation: Step by Step
1. Webhook Reception (Never Block Here)
// webhook-receiver.js
const express = require('express');
const crypto = require('crypto');
const { primaryQueue } = require('./queues');
const app = express();
app.post('/webhooks/shopify/:topic', express.raw({ type: 'application/json' }), async (req, res) => {
const topic = req.params.topic;
const hmac = req.get('X-Shopify-Hmac-Sha256');
const body = req.body;
// Verify webhook signature
if (!verifyWebhook(body, hmac)) {
return res.status(401).send('Invalid signature');
}
// Respond to Shopify IMMEDIATELY
res.status(200).send('OK');
// Queue for async processing (never blocks)
try {
await primaryQueue.add(topic, {
topic,
shopDomain: req.get('X-Shopify-Shop-Domain'),
webhookId: req.get('X-Shopify-Webhook-Id'),
payload: JSON.parse(body.toString()),
receivedAt: new Date().toISOString()
}, {
attempts: 6,
backoff: {
type: 'exponential',
delay: 10000 // Start with 10 seconds
}
});
} catch (error) {
// If queuing fails, write directly to DLQ
await saveToDLQ({
topic,
payload: JSON.parse(body.toString()),
error: 'Failed to queue webhook',
failureReason: error.message
});
}
});
function verifyWebhook(body, hmac) {
const hash = crypto
.createHmac('sha256', process.env.SHOPIFY_WEBHOOK_SECRET)
.update(body)
.digest('base64');
return hash === hmac;
}
2. Worker with DLQ Fallback
// webhook-worker.js
const { primaryQueue, dlqQueue } = require('./queues');
const { processWebhook } = require('./processors');
const { saveToDLQ, incrementRetryCount } = require('./dlq-storage');
primaryQueue.process(async (job) => {
const { topic, payload, webhookId, shopDomain } = job.data;
try {
// Your business logic here
await processWebhook(topic, payload, shopDomain);
// If we got here, processing succeeded
return { success: true, webhookId };
} catch (error) {
// Categorize the error
if (isTransientError(error)) {
// Let Bull retry with exponential backoff
throw error;
} else {
// Permanent error - move to DLQ immediately
await saveToDLQ({
webhookId,
topic,
shopDomain,
payload,
error: error.message,
stackTrace: error.stack,
retryCount: job.attemptsMade,
firstFailedAt: job.timestamp,
lastAttemptedAt: new Date().toISOString()
});
// Don't throw - we handled it
return { success: false, movedToDLQ: true };
}
}
});
// When all retries exhausted, Bull calls this
primaryQueue.on('failed', async (job, error) => {
if (job.attemptsMade >= job.opts.attempts) {
await saveToDLQ({
webhookId: job.data.webhookId,
topic: job.data.topic,
shopDomain: job.data.shopDomain,
payload: job.data.payload,
error: 'Max retries exceeded',
stackTrace: error.stack,
retryCount: job.attemptsMade,
firstFailedAt: job.timestamp,
lastAttemptedAt: new Date().toISOString()
});
}
});
function isTransientError(error) {
// Network issues, rate limits, timeouts
return error.code === 'ECONNREFUSED' ||
error.code === 'ETIMEDOUT' ||
error.statusCode === 429 ||
error.statusCode === 503 ||
error.message.includes('deadlock');
}
3. DLQ Storage Schema
-- dlq_webhooks table
CREATE TABLE dlq_webhooks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
webhook_id VARCHAR(255) UNIQUE,
topic VARCHAR(100) NOT NULL,
shop_domain VARCHAR(255) NOT NULL,
payload JSONB NOT NULL,
error_message TEXT,
stack_trace TEXT,
retry_count INTEGER DEFAULT 0,
first_failed_at TIMESTAMP NOT NULL,
last_attempted_at TIMESTAMP NOT NULL,
status VARCHAR(50) DEFAULT 'pending', -- pending, processing, resolved, abandoned
resolved_at TIMESTAMP,
notes TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
-- Indexes for querying
CREATE INDEX idx_dlq_status ON dlq_webhooks(status);
CREATE INDEX idx_dlq_topic ON dlq_webhooks(topic);
CREATE INDEX idx_dlq_shop ON dlq_webhooks(shop_domain);
CREATE INDEX idx_dlq_first_failed ON dlq_webhooks(first_failed_at);
4. DLQ Storage Functions
// dlq-storage.js
const { Pool } = require('pg');
const pool = new Pool();
async function saveToDLQ(webhookData) {
const {
webhookId,
topic,
shopDomain,
payload,
error,
stackTrace,
retryCount = 0,
firstFailedAt,
lastAttemptedAt
} = webhookData;
const query = `
INSERT INTO dlq_webhooks (
webhook_id, topic, shop_domain, payload,
error_message, stack_trace, retry_count,
first_failed_at, last_attempted_at
) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
ON CONFLICT (webhook_id)
DO UPDATE SET
retry_count = dlq_webhooks.retry_count + 1,
last_attempted_at = $9,
error_message = $5,
stack_trace = $6
RETURNING id
`;
try {
const result = await pool.query(query, [
webhookId,
topic,
shopDomain,
JSON.stringify(payload),
error,
stackTrace,
retryCount,
firstFailedAt || new Date(),
lastAttemptedAt || new Date()
]);
// Alert if this is a new DLQ entry
if (result.rowCount > 0) {
await alertDLQEntry(webhookId, topic, shopDomain);
}
return result.rows[0];
} catch (err) {
console.error('Failed to save to DLQ:', err);
// This is serious - DLQ itself failed
// Log to external service (Sentry, etc.)
}
}
async function getDLQWebhooks(status = 'pending', limit = 100) {
const query = `
SELECT * FROM dlq_webhooks
WHERE status = $1
ORDER BY first_failed_at ASC
LIMIT $2
`;
const result = await pool.query(query, [status, limit]);
return result.rows;
}
async function markAsResolved(webhookId, notes = null) {
const query = `
UPDATE dlq_webhooks
SET status = 'resolved',
resolved_at = NOW(),
notes = $2
WHERE webhook_id = $1
`;
await pool.query(query, [webhookId, notes]);
}
module.exports = {
saveToDLQ,
getDLQWebhooks,
markAsResolved
};
5. Automated DLQ Retry Worker
// dlq-retry-worker.js
const { getDLQWebhooks, markAsResolved, saveToDLQ } = require('./dlq-storage');
const { processWebhook } = require('./processors');
async function processDLQWebhooks() {
const webhooks = await getDLQWebhooks('pending', 50);
for (const webhook of webhooks) {
try {
// Update status to prevent concurrent processing
await updateDLQStatus(webhook.webhook_id, 'processing');
// Try to process again
await processWebhook(
webhook.topic,
webhook.payload,
webhook.shop_domain
);
// Success! Mark as resolved
await markAsResolved(webhook.webhook_id, 'Auto-resolved by retry worker');
} catch (error) {
// Still failing
const newRetryCount = webhook.retry_count + 1;
if (newRetryCount >= 10) {
// Give up after 10 DLQ retries
await updateDLQStatus(webhook.webhook_id, 'abandoned');
await alertPermanentFailure(webhook);
} else {
// Save back to DLQ with updated count
await saveToDLQ({
...webhook,
retryCount: newRetryCount,
error: error.message,
stackTrace: error.stack,
lastAttemptedAt: new Date().toISOString()
});
}
}
}
}
// Run every 5 minutes
setInterval(processDLQWebhooks, 5 * 60 * 1000);
Monitoring Dashboard
// dlq-dashboard.js - Express route
app.get('/api/dlq/stats', async (req, res) => {
const stats = await pool.query(`
SELECT
status,
COUNT(*) as count,
MIN(first_failed_at) as oldest,
MAX(first_failed_at) as newest
FROM dlq_webhooks
GROUP BY status
`);
const byTopic = await pool.query(`
SELECT
topic,
COUNT(*) as count
FROM dlq_webhooks
WHERE status = 'pending'
GROUP BY topic
ORDER BY count DESC
`);
res.json({
stats: stats.rows,
byTopic: byTopic.rows
});
});
Alerting Setup
// alerting.js
const { sendSlackAlert } = require('./slack');
async function alertDLQEntry(webhookId, topic, shopDomain) {
// Check DLQ depth
const depth = await getDLQDepth();
if (depth > 100) {
await sendSlackAlert({
channel: '#alerts-critical',
text: `🚨 DLQ depth exceeds 100: ${depth} webhooks`,
priority: 'critical'
});
}
}
async function alertPermanentFailure(webhook) {
await sendSlackAlert({
channel: '#alerts-high',
text: `⚠️ Webhook permanently failed after 10 retries\nTopic: ${webhook.topic}\nShop: ${webhook.shop_domain}\nWebhook ID: ${webhook.webhook_id}`,
priority: 'high'
});
}
Testing Your DLQ
// test-dlq.js
describe('Dead Letter Queue', () => {
it('should save webhook to DLQ on processing failure', async () => {
const testWebhook = {
webhookId: 'test-123',
topic: 'orders/create',
shopDomain: 'test-shop.myshopify.com',
payload: { id: 123, total: 50.00 },
error: 'Database connection failed',
stackTrace: 'Error: ...',
retryCount: 3
};
await saveToDLQ(testWebhook);
const dlqWebhooks = await getDLQWebhooks('pending');
expect(dlqWebhooks).toHaveLength(1);
expect(dlqWebhooks[0].webhook_id).toBe('test-123');
});
it('should retry webhook from DLQ', async () => {
// Test retry logic
});
it('should mark webhook as abandoned after max retries', async () => {
// Test abandonment logic
});
});
Production Considerations
Idempotency
Store processed webhook IDs to prevent duplicate processing:
async function processWebhook(topic, payload, shopDomain) {
const webhookId = payload.id || payload.admin_graphql_api_id;
// Check if already processed
const alreadyProcessed = await redis.get(`processed:${webhookId}`);
if (alreadyProcessed) {
console.log(`Webhook ${webhookId} already processed, skipping`);
return;
}
// Process webhook
await yourBusinessLogic(payload);
// Mark as processed (expire after 7 days)
await redis.setex(`processed:${webhookId}`, 7 * 24 * 60 * 60, '1');
}
Exponential Backoff Configuration
const RETRY_CONFIG = {
attempts: 6,
backoff: {
type: 'exponential',
delay: 10000, // 10 seconds
options: {
maxDelay: 1800000, // 30 minutes max
}
}
};
// Results in retry schedule:
// 1. 10 seconds
// 2. 1 minute
// 3. 5 minutes
// 4. 15 minutes
// 5. 30 minutes
// 6. 30 minutes
Memory Management
For high-volume stores:
// Process DLQ in batches
async function processDLQBatch(batchSize = 50) {
const webhooks = await getDLQWebhooks('pending', batchSize);
// Process in parallel with concurrency limit
await Promise.all(
webhooks.map(webhook =>
limiter.schedule(() => retryWebhook(webhook))
)
);
}
// Use p-limit for concurrency control
const pLimit = require('p-limit');
const limiter = pLimit(10); // Max 10 concurrent
Common Mistakes to Avoid
❌ Processing webhooks synchronously in the HTTP handler
// DON'T DO THIS
app.post('/webhook', async (req, res) => {
await processWebhook(req.body); // Blocks!
res.status(200).send('OK'); // Too late, Shopify timed out
});
❌ No maximum retry limit
// DON'T DO THIS
while (!success) {
await retry(); // Infinite loop
}
❌ Not categorizing errors
// DON'T DO THIS
catch (error) {
throw error; // Always retry, even for permanent failures
}
Monitoring Checklist
Monitor these metrics:
- ✅ DLQ depth (alert > 100)
- ✅ Age of oldest message (alert > 4 hours)
- ✅ Retry success rate
- ✅ Error patterns (group by error type)
- ✅ Processing latency (p50, p95, p99)
- ✅ Webhook topics in DLQ (which events fail most)
Resources
Full Implementation
The complete working implementation with all the code above is available on our blog:
👉 Dead Letter Queue Shopify Webhooks: Complete Guide
What's your approach to webhook reliability? Drop your setup in the comments below.
Building Shopify integrations? We help e-commerce brands build fault-tolerant systems. Check out Kolachi Tech for more technical deep-dives.
Top comments (0)