DEV Community

Cover image for Building a Dead Letter Queue for Shopify Webhooks (Production-Ready Guide)
Muhammad Masad Ashraf
Muhammad Masad Ashraf

Posted on • Originally published at kolachitech.com

Building a Dead Letter Queue for Shopify Webhooks (Production-Ready Guide)

Building a Dead Letter Queue for Shopify Webhooks

Your Shopify webhook just failed. Network timeout. Database deadlock. Third-party API down.

The webhook payload? Gone forever.

Unless you have a dead letter queue.

This guide walks through building a production-ready DLQ system for Shopify webhooks. I'll show you the architecture, code examples, and mistakes to avoid based on real production systems.

Why Shopify's Built-in Retry Isn't Enough

Shopify retries failed webhooks up to 19 times over 48 hours. After that, they stop delivering to your endpoint.

Attempt 1: Immediate
Attempt 2: 1 second later
Attempt 3: 2 seconds later
...
Attempt 19: 48 hours later
Then: 💀 Webhook deleted forever
Enter fullscreen mode Exit fullscreen mode

For production systems handling payments, orders, and inventory, this isn't acceptable. You need a safety net.

Architecture Overview

Here's the high-level flow:

Shopify → Your Webhook Endpoint → Primary Queue → Worker → Success ✓
                    ↓                                    ↓
              200 OK (always)                        Failure ✗
                                                          ↓
                                                   Dead Letter Queue
                                                          ↓
                                                   Retry Logic
Enter fullscreen mode Exit fullscreen mode

Critical principle: Always return 200 OK to Shopify immediately. Process webhooks asynchronously.

The Stack

I'll use this stack for examples (adapt to your needs):

  • Queue: Redis with Bull (Node.js) or AWS SQS
  • Database: PostgreSQL for DLQ storage
  • Runtime: Node.js with Express
  • Monitoring: Custom dashboard + alerts

Implementation: Step by Step

1. Webhook Reception (Never Block Here)

// webhook-receiver.js
const express = require('express');
const crypto = require('crypto');
const { primaryQueue } = require('./queues');

const app = express();

app.post('/webhooks/shopify/:topic', express.raw({ type: 'application/json' }), async (req, res) => {
  const topic = req.params.topic;
  const hmac = req.get('X-Shopify-Hmac-Sha256');
  const body = req.body;

  // Verify webhook signature
  if (!verifyWebhook(body, hmac)) {
    return res.status(401).send('Invalid signature');
  }

  // Respond to Shopify IMMEDIATELY
  res.status(200).send('OK');

  // Queue for async processing (never blocks)
  try {
    await primaryQueue.add(topic, {
      topic,
      shopDomain: req.get('X-Shopify-Shop-Domain'),
      webhookId: req.get('X-Shopify-Webhook-Id'),
      payload: JSON.parse(body.toString()),
      receivedAt: new Date().toISOString()
    }, {
      attempts: 6,
      backoff: {
        type: 'exponential',
        delay: 10000 // Start with 10 seconds
      }
    });
  } catch (error) {
    // If queuing fails, write directly to DLQ
    await saveToDLQ({
      topic,
      payload: JSON.parse(body.toString()),
      error: 'Failed to queue webhook',
      failureReason: error.message
    });
  }
});

function verifyWebhook(body, hmac) {
  const hash = crypto
    .createHmac('sha256', process.env.SHOPIFY_WEBHOOK_SECRET)
    .update(body)
    .digest('base64');
  return hash === hmac;
}
Enter fullscreen mode Exit fullscreen mode

2. Worker with DLQ Fallback

// webhook-worker.js
const { primaryQueue, dlqQueue } = require('./queues');
const { processWebhook } = require('./processors');
const { saveToDLQ, incrementRetryCount } = require('./dlq-storage');

primaryQueue.process(async (job) => {
  const { topic, payload, webhookId, shopDomain } = job.data;

  try {
    // Your business logic here
    await processWebhook(topic, payload, shopDomain);

    // If we got here, processing succeeded
    return { success: true, webhookId };

  } catch (error) {
    // Categorize the error
    if (isTransientError(error)) {
      // Let Bull retry with exponential backoff
      throw error;
    } else {
      // Permanent error - move to DLQ immediately
      await saveToDLQ({
        webhookId,
        topic,
        shopDomain,
        payload,
        error: error.message,
        stackTrace: error.stack,
        retryCount: job.attemptsMade,
        firstFailedAt: job.timestamp,
        lastAttemptedAt: new Date().toISOString()
      });

      // Don't throw - we handled it
      return { success: false, movedToDLQ: true };
    }
  }
});

// When all retries exhausted, Bull calls this
primaryQueue.on('failed', async (job, error) => {
  if (job.attemptsMade >= job.opts.attempts) {
    await saveToDLQ({
      webhookId: job.data.webhookId,
      topic: job.data.topic,
      shopDomain: job.data.shopDomain,
      payload: job.data.payload,
      error: 'Max retries exceeded',
      stackTrace: error.stack,
      retryCount: job.attemptsMade,
      firstFailedAt: job.timestamp,
      lastAttemptedAt: new Date().toISOString()
    });
  }
});

function isTransientError(error) {
  // Network issues, rate limits, timeouts
  return error.code === 'ECONNREFUSED' ||
         error.code === 'ETIMEDOUT' ||
         error.statusCode === 429 ||
         error.statusCode === 503 ||
         error.message.includes('deadlock');
}
Enter fullscreen mode Exit fullscreen mode

3. DLQ Storage Schema

-- dlq_webhooks table
CREATE TABLE dlq_webhooks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  webhook_id VARCHAR(255) UNIQUE,
  topic VARCHAR(100) NOT NULL,
  shop_domain VARCHAR(255) NOT NULL,
  payload JSONB NOT NULL,
  error_message TEXT,
  stack_trace TEXT,
  retry_count INTEGER DEFAULT 0,
  first_failed_at TIMESTAMP NOT NULL,
  last_attempted_at TIMESTAMP NOT NULL,
  status VARCHAR(50) DEFAULT 'pending', -- pending, processing, resolved, abandoned
  resolved_at TIMESTAMP,
  notes TEXT,
  created_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for querying
CREATE INDEX idx_dlq_status ON dlq_webhooks(status);
CREATE INDEX idx_dlq_topic ON dlq_webhooks(topic);
CREATE INDEX idx_dlq_shop ON dlq_webhooks(shop_domain);
CREATE INDEX idx_dlq_first_failed ON dlq_webhooks(first_failed_at);
Enter fullscreen mode Exit fullscreen mode

4. DLQ Storage Functions

// dlq-storage.js
const { Pool } = require('pg');
const pool = new Pool();

async function saveToDLQ(webhookData) {
  const {
    webhookId,
    topic,
    shopDomain,
    payload,
    error,
    stackTrace,
    retryCount = 0,
    firstFailedAt,
    lastAttemptedAt
  } = webhookData;

  const query = `
    INSERT INTO dlq_webhooks (
      webhook_id, topic, shop_domain, payload,
      error_message, stack_trace, retry_count,
      first_failed_at, last_attempted_at
    ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
    ON CONFLICT (webhook_id) 
    DO UPDATE SET
      retry_count = dlq_webhooks.retry_count + 1,
      last_attempted_at = $9,
      error_message = $5,
      stack_trace = $6
    RETURNING id
  `;

  try {
    const result = await pool.query(query, [
      webhookId,
      topic,
      shopDomain,
      JSON.stringify(payload),
      error,
      stackTrace,
      retryCount,
      firstFailedAt || new Date(),
      lastAttemptedAt || new Date()
    ]);

    // Alert if this is a new DLQ entry
    if (result.rowCount > 0) {
      await alertDLQEntry(webhookId, topic, shopDomain);
    }

    return result.rows[0];
  } catch (err) {
    console.error('Failed to save to DLQ:', err);
    // This is serious - DLQ itself failed
    // Log to external service (Sentry, etc.)
  }
}

async function getDLQWebhooks(status = 'pending', limit = 100) {
  const query = `
    SELECT * FROM dlq_webhooks
    WHERE status = $1
    ORDER BY first_failed_at ASC
    LIMIT $2
  `;

  const result = await pool.query(query, [status, limit]);
  return result.rows;
}

async function markAsResolved(webhookId, notes = null) {
  const query = `
    UPDATE dlq_webhooks
    SET status = 'resolved',
        resolved_at = NOW(),
        notes = $2
    WHERE webhook_id = $1
  `;

  await pool.query(query, [webhookId, notes]);
}

module.exports = {
  saveToDLQ,
  getDLQWebhooks,
  markAsResolved
};
Enter fullscreen mode Exit fullscreen mode

5. Automated DLQ Retry Worker

// dlq-retry-worker.js
const { getDLQWebhooks, markAsResolved, saveToDLQ } = require('./dlq-storage');
const { processWebhook } = require('./processors');

async function processDLQWebhooks() {
  const webhooks = await getDLQWebhooks('pending', 50);

  for (const webhook of webhooks) {
    try {
      // Update status to prevent concurrent processing
      await updateDLQStatus(webhook.webhook_id, 'processing');

      // Try to process again
      await processWebhook(
        webhook.topic,
        webhook.payload,
        webhook.shop_domain
      );

      // Success! Mark as resolved
      await markAsResolved(webhook.webhook_id, 'Auto-resolved by retry worker');

    } catch (error) {
      // Still failing
      const newRetryCount = webhook.retry_count + 1;

      if (newRetryCount >= 10) {
        // Give up after 10 DLQ retries
        await updateDLQStatus(webhook.webhook_id, 'abandoned');
        await alertPermanentFailure(webhook);
      } else {
        // Save back to DLQ with updated count
        await saveToDLQ({
          ...webhook,
          retryCount: newRetryCount,
          error: error.message,
          stackTrace: error.stack,
          lastAttemptedAt: new Date().toISOString()
        });
      }
    }
  }
}

// Run every 5 minutes
setInterval(processDLQWebhooks, 5 * 60 * 1000);
Enter fullscreen mode Exit fullscreen mode

Monitoring Dashboard

// dlq-dashboard.js - Express route
app.get('/api/dlq/stats', async (req, res) => {
  const stats = await pool.query(`
    SELECT
      status,
      COUNT(*) as count,
      MIN(first_failed_at) as oldest,
      MAX(first_failed_at) as newest
    FROM dlq_webhooks
    GROUP BY status
  `);

  const byTopic = await pool.query(`
    SELECT
      topic,
      COUNT(*) as count
    FROM dlq_webhooks
    WHERE status = 'pending'
    GROUP BY topic
    ORDER BY count DESC
  `);

  res.json({
    stats: stats.rows,
    byTopic: byTopic.rows
  });
});
Enter fullscreen mode Exit fullscreen mode

Alerting Setup

// alerting.js
const { sendSlackAlert } = require('./slack');

async function alertDLQEntry(webhookId, topic, shopDomain) {
  // Check DLQ depth
  const depth = await getDLQDepth();

  if (depth > 100) {
    await sendSlackAlert({
      channel: '#alerts-critical',
      text: `🚨 DLQ depth exceeds 100: ${depth} webhooks`,
      priority: 'critical'
    });
  }
}

async function alertPermanentFailure(webhook) {
  await sendSlackAlert({
    channel: '#alerts-high',
    text: `⚠️ Webhook permanently failed after 10 retries\nTopic: ${webhook.topic}\nShop: ${webhook.shop_domain}\nWebhook ID: ${webhook.webhook_id}`,
    priority: 'high'
  });
}
Enter fullscreen mode Exit fullscreen mode

Testing Your DLQ

// test-dlq.js
describe('Dead Letter Queue', () => {
  it('should save webhook to DLQ on processing failure', async () => {
    const testWebhook = {
      webhookId: 'test-123',
      topic: 'orders/create',
      shopDomain: 'test-shop.myshopify.com',
      payload: { id: 123, total: 50.00 },
      error: 'Database connection failed',
      stackTrace: 'Error: ...',
      retryCount: 3
    };

    await saveToDLQ(testWebhook);

    const dlqWebhooks = await getDLQWebhooks('pending');
    expect(dlqWebhooks).toHaveLength(1);
    expect(dlqWebhooks[0].webhook_id).toBe('test-123');
  });

  it('should retry webhook from DLQ', async () => {
    // Test retry logic
  });

  it('should mark webhook as abandoned after max retries', async () => {
    // Test abandonment logic
  });
});
Enter fullscreen mode Exit fullscreen mode

Production Considerations

Idempotency

Store processed webhook IDs to prevent duplicate processing:

async function processWebhook(topic, payload, shopDomain) {
  const webhookId = payload.id || payload.admin_graphql_api_id;

  // Check if already processed
  const alreadyProcessed = await redis.get(`processed:${webhookId}`);
  if (alreadyProcessed) {
    console.log(`Webhook ${webhookId} already processed, skipping`);
    return;
  }

  // Process webhook
  await yourBusinessLogic(payload);

  // Mark as processed (expire after 7 days)
  await redis.setex(`processed:${webhookId}`, 7 * 24 * 60 * 60, '1');
}
Enter fullscreen mode Exit fullscreen mode

Exponential Backoff Configuration

const RETRY_CONFIG = {
  attempts: 6,
  backoff: {
    type: 'exponential',
    delay: 10000, // 10 seconds
    options: {
      maxDelay: 1800000, // 30 minutes max
    }
  }
};

// Results in retry schedule:
// 1. 10 seconds
// 2. 1 minute
// 3. 5 minutes
// 4. 15 minutes
// 5. 30 minutes
// 6. 30 minutes
Enter fullscreen mode Exit fullscreen mode

Memory Management

For high-volume stores:

// Process DLQ in batches
async function processDLQBatch(batchSize = 50) {
  const webhooks = await getDLQWebhooks('pending', batchSize);

  // Process in parallel with concurrency limit
  await Promise.all(
    webhooks.map(webhook => 
      limiter.schedule(() => retryWebhook(webhook))
    )
  );
}

// Use p-limit for concurrency control
const pLimit = require('p-limit');
const limiter = pLimit(10); // Max 10 concurrent
Enter fullscreen mode Exit fullscreen mode

Common Mistakes to Avoid

Processing webhooks synchronously in the HTTP handler

// DON'T DO THIS
app.post('/webhook', async (req, res) => {
  await processWebhook(req.body); // Blocks!
  res.status(200).send('OK'); // Too late, Shopify timed out
});
Enter fullscreen mode Exit fullscreen mode

No maximum retry limit

// DON'T DO THIS
while (!success) {
  await retry(); // Infinite loop
}
Enter fullscreen mode Exit fullscreen mode

Not categorizing errors

// DON'T DO THIS
catch (error) {
  throw error; // Always retry, even for permanent failures
}
Enter fullscreen mode Exit fullscreen mode

Monitoring Checklist

Monitor these metrics:

  • ✅ DLQ depth (alert > 100)
  • ✅ Age of oldest message (alert > 4 hours)
  • ✅ Retry success rate
  • ✅ Error patterns (group by error type)
  • ✅ Processing latency (p50, p95, p99)
  • ✅ Webhook topics in DLQ (which events fail most)

Resources

Full Implementation

The complete working implementation with all the code above is available on our blog:
👉 Dead Letter Queue Shopify Webhooks: Complete Guide


What's your approach to webhook reliability? Drop your setup in the comments below.

Building Shopify integrations? We help e-commerce brands build fault-tolerant systems. Check out Kolachi Tech for more technical deep-dives.

Top comments (0)