Muhammad Masad Ashraf

Posted on May 22 • Originally published at kolachitech.com

How to Build a Shopify Webhook Replay System That Never Loses Events

#shopify #webhooks #javascript #backend

Your Shopify integration fires perfectly in staging. Then production hits. Your server goes down for two minutes during a deploy. Shopify sends 40 webhook events. Your handler catches none of them.

That is real data loss. Orders not fulfilled. Inventory not updated. Customers not synced.

A webhook replay system is the fix. This guide walks you through building one that stores every incoming event, retries failures intelligently, and lets you replay any batch of events on demand.

Why Webhooks Fail in the First Place

Shopify retries failed webhooks up to 19 times over 48 hours. After that, you are on your own. Here is what causes failures:

Failure Cause	Description
Server downtime	Endpoint offline when Shopify fires
Timeout	Handler takes longer than 5 seconds
Code errors	Bug in your handler throws an exception
Rate limiting	System overloaded, rejects the request
DB unavailability	Database is down when handler tries to write
Deployment gaps	Deploy restarts server at the wrong moment

After 48 hours and 19 retries, Shopify flags your endpoint as failing. Eventually it stops sending events entirely.

A replay system gets you out of that hole.

The Core Architecture

A production webhook replay system has four stages:

Shopify
   |
   v
[Ingestion Endpoint]   <-- ACK immediately, store raw payload
   |
   v
[Event Store (DB)]     <-- durable record of every event
   |
   v
[Processing Queue]     <-- background worker picks up events
   |
   v
[Handler]              <-- idempotent processing logic
   |
   +-- Success --> mark 'succeeded'
   +-- Failure --> retry with backoff --> Dead Letter Queue

The key rule: never process inside the ingestion endpoint. Shopify waits only 5 seconds. Acknowledge fast, process separately.

Stage 1: The Ingestion Endpoint

Your ingestion endpoint does exactly two things:

Verify the HMAC signature
Write the raw payload to the event store and return 200 OK

app.post('/webhooks', express.raw({ type: 'application/json' }), async (req, res) => {
  const hmac = req.headers['x-shopify-hmac-sha256'];
  const webhookId = req.headers['x-shopify-webhook-id'];
  const topic = req.headers['x-shopify-topic'];

  // Verify HMAC first, against the raw bytes (express.raw makes req.body a Buffer)
  if (!verifyHmac(req.body, hmac)) {
    return res.status(401).send('Unauthorized');
  }

  try {
    // Store immediately, respond fast
    await db.query(`
      INSERT INTO webhook_events (shopify_id, topic, payload, hmac, status)
      VALUES ($1, $2, $3, $4, 'pending')
      ON CONFLICT (shopify_id) DO NOTHING
    `, [webhookId, topic, req.body.toString(), hmac]);
  } catch (err) {
    // A non-2xx response tells Shopify to retry this delivery later
    console.error('Failed to store webhook', err);
    return res.status(500).send('Storage error');
  }

  res.status(200).send('OK');
});

The ON CONFLICT DO NOTHING handles Shopify's own retries. Same event, same ID, no duplicate row.
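
The endpoint above calls verifyHmac without defining it. A minimal sketch of that helper follows, assuming the app's client secret is available in an environment variable (the name SHOPIFY_WEBHOOK_SECRET is an assumption, not something the article specifies). Shopify signs the raw request body with HMAC-SHA256 and sends the base64 digest in the X-Shopify-Hmac-Sha256 header, so the check has to run on the unparsed bytes.

```javascript
const crypto = require('node:crypto');

// Assumption: the client secret is supplied through this environment variable.
const SHOPIFY_WEBHOOK_SECRET = process.env.SHOPIFY_WEBHOOK_SECRET || '';

function verifyHmac(rawBody, hmacHeader, secret = SHOPIFY_WEBHOOK_SECRET) {
  if (!hmacHeader || !secret) return false;

  // Recompute the digest over the raw body and decode the header value.
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(hmacHeader, 'base64');

  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}
```

The constant-time comparison avoids leaking how many leading bytes of a forged signature were correct.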

Stage 2: The Event Store Schema

CREATE TABLE webhook_events (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  shopify_id      VARCHAR(255) UNIQUE,
  topic           VARCHAR(100) NOT NULL,
  payload         JSONB NOT NULL,
  hmac            VARCHAR(255),
  status          VARCHAR(20) DEFAULT 'pending',
  retry_count     INT DEFAULT 0,
  last_error      TEXT,
  created_at      TIMESTAMPTZ DEFAULT NOW(),
  processed_at    TIMESTAMPTZ
);

CREATE INDEX idx_webhook_status ON webhook_events(status);
CREATE INDEX idx_webhook_topic  ON webhook_events(topic);
CREATE INDEX idx_webhook_created ON webhook_events(created_at);

Index on status, topic, and created_at. These are the three axes you will query for replay.

Storage options by scale:

Option	Best For
PostgreSQL	Most Shopify apps. Full SQL, JSONB indexing.
Redis Streams	High-throughput ingestion, consumer groups.
AWS SQS + DynamoDB	Managed infra, no ops overhead.
Kafka	Enterprise-scale event sourcing.

Stage 3: The Processing Worker

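async function processWebhookEvents() {
  // Row locks last only as long as their transaction, so the SELECT and the
  // status updates share one connection. Assumes `db` is a node-postgres Pool.
  const client = await db.connect();
  try {
    await client.query('BEGIN');

    const events = await client.query(`
      SELECT * FROM webhook_events
      WHERE status = 'pending'
      ORDER BY created_at ASC
      LIMIT 50
      FOR UPDATE SKIP LOCKED
    `);

    for (const event of events.rows) {
      try {
        // node-postgres returns JSONB columns already parsed, so no JSON.parse
        await handleEvent(event.topic, event.payload);

        await client.query(`
          UPDATE webhook_events
          SET status = 'succeeded', processed_at = NOW()
          WHERE id = $1
        `, [event.id]);

      } catch (err) {
        const newRetryCount = event.retry_count + 1;
        const newStatus = newRetryCount >= 5 ? 'dead' : 'pending';

        await client.query(`
          UPDATE webhook_events
          SET status = $1, retry_count = $2, last_error = $3
          WHERE id = $4
        `, [newStatus, newRetryCount, err.message, event.id]);
      }
    }

    await client.query('COMMIT');
  } catch (err) {
    // Release the locks so another worker can pick the batch up
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}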

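FOR UPDATE SKIP LOCKED prevents multiple workers from grabbing the same event, but only while the transaction that took the locks stays open. Run the SELECT and the status updates inside one transaction on one connection, and concurrent workers become safe.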

Stage 4: Exponential Backoff

Do not retry immediately. Back off with jitter:

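function getRetryDelay(retryCount) {
  // retryCount is 1-based: 1 for the first retry, matching the table below
  const baseDelays = [30, 120, 600, 3600, 21600]; // seconds
  const index = Math.min(Math.max(retryCount, 1), baseDelays.length) - 1;
  const base = baseDelays[index] * 1000;
  const jitter = Math.random() * 1000; // up to 1 second of jitter
  return base + jitter;
}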

Retry	Delay
1	30 seconds
2	2 minutes
3	10 minutes
4	1 hour
5+	6 hours

Jitter prevents retry storms when many events fail simultaneously.
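
The worker in Stage 3 puts a failed event straight back to 'pending', so the delay has to be recorded somewhere for the backoff to take effect. One way to wire it in, sketched under assumptions: add a next_attempt_at TIMESTAMPTZ column (hypothetical, not part of the schema above), set it on failure, and have the worker's SELECT also require next_attempt_at IS NULL OR next_attempt_at <= NOW(). The helper names retryDelayMs and buildRetryUpdate are illustrative.

```javascript
// Delay schedule from the table above, in seconds; retryCount is 1-based.
const BASE_DELAYS_SECONDS = [30, 120, 600, 3600, 21600];

// `random` is injectable so the delay can be tested deterministically.
function retryDelayMs(retryCount, random = Math.random) {
  const index = Math.min(Math.max(retryCount, 1), BASE_DELAYS_SECONDS.length) - 1;
  return BASE_DELAYS_SECONDS[index] * 1000 + random() * 1000;
}

// Builds the failure update for one event as a { text, values } query object.
// `next_attempt_at` is a hypothetical column added for this sketch.
function buildRetryUpdate(event, err, now = new Date(), maxRetries = 5) {
  const retryCount = event.retry_count + 1;
  const dead = retryCount >= maxRetries;
  const nextAttemptAt = dead
    ? null
    : new Date(now.getTime() + retryDelayMs(retryCount));

  return {
    text: `
      UPDATE webhook_events
      SET status = $1, retry_count = $2, last_error = $3, next_attempt_at = $4
      WHERE id = $5`,
    values: [dead ? 'dead' : 'pending', retryCount, err.message, nextAttemptAt, event.id],
  };
}
```

The returned object can be passed directly to a node-postgres query call in the worker's catch block, for example client.query(buildRetryUpdate(event, err)).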

Stage 5: The Replay Trigger

This is the part that earns the "replay" in "replay system." When you fix a bug and need to reprocess failed events:

// The worker marks exhausted events 'dead'; nothing in this pipeline sets
// 'failed', so 'dead' is the default status to replay.
async function replayEvents({ topic, status = 'dead', from, to, batchSize = 100 }) {
  const events = await db.query(`
    SELECT * FROM webhook_events
    WHERE status = $1
      AND ($2::text IS NULL OR topic = $2)
      AND ($3::timestamptz IS NULL OR created_at >= $3)
      AND ($4::timestamptz IS NULL OR created_at <= $4)
    ORDER BY created_at ASC
    LIMIT $5
  `, [status, topic, from, to, batchSize]);

  for (const event of events.rows) {
    // Reset to pending so the worker picks it up
    await db.query(`
      UPDATE webhook_events
      SET status = 'pending', retry_count = 0, last_error = NULL
      WHERE id = $1
    `, [event.id]);
  }

  return events.rowCount;
}

// Usage: replay all dead-lettered orders/paid events from the last 24 hours
await replayEvents({
  topic: 'orders/paid',
  status: 'dead',
  from: new Date(Date.now() - 86400000)
});
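
Each replayEvents call resets at most batchSize rows, so a large backlog needs several calls. A minimal sketch of a driver loop follows; replayAll and its maxBatches guard are illustrative names, and the loop assumes the filter's status is not 'pending', since reset rows must drop out of the filter for the count to reach zero.

```javascript
// `replayBatch` is any function shaped like replayEvents: it takes a filter
// and resolves to the number of rows it reset.
async function replayAll(replayBatch, filter, maxBatches = 1000) {
  let total = 0;

  // maxBatches is a safety cap against a filter that never drains.
  for (let i = 0; i < maxBatches; i++) {
    const count = await replayBatch(filter);
    if (count === 0) break;
    total += count;
  }

  return total;
}
```

A call such as replayAll(replayEvents, { topic: 'orders/paid' }) would keep resetting batches until none match.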

Idempotency: The Non-Negotiable Requirement

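When you replay, your handler runs again for a payload it may already have processed. Without idempotency, you get duplicate orders, double charges, corrupted inventory.

The shopify_id (the X-Shopify-Webhook-Id) already deduplicates deliveries at ingestion. Inside the handler, deduplicate on the business key instead, here the Shopify order ID: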

async function handleOrderCreate(payload) {
  const shopifyOrderId = payload.id;

  // Check if already processed
  const existing = await db.query(
    'SELECT id FROM orders WHERE shopify_order_id = $1',
    [shopifyOrderId]
  );

  if (existing.rows.length > 0) {
    console.log(`Order ${shopifyOrderId} already processed. Skipping.`);
    return;
  }

  // Safe to process
  await db.query(
    'INSERT INTO orders (shopify_order_id, ...) VALUES ($1, ...)',
    [shopifyOrderId, ...]
  );
}

No idempotency = no safe replay. Build this first.
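
The check-then-insert pattern above leaves a small window in which two concurrent workers can both pass the SELECT before either inserts. A sketch of an atomic variant follows, assuming a UNIQUE constraint on orders.shopify_order_id (the orders table and its columns are illustrative, as in the example above) and a node-postgres style db object; recordOrderOnce is a hypothetical helper name.

```javascript
// Assumptions: `db` exposes a node-postgres style query(), and the orders
// table has a UNIQUE constraint on shopify_order_id.
async function recordOrderOnce(db, payload) {
  const result = await db.query(
    `INSERT INTO orders (shopify_order_id)
     VALUES ($1)
     ON CONFLICT (shopify_order_id) DO NOTHING`,
    [payload.id]
  );

  // rowCount is 1 when this call inserted the row, 0 when it already existed.
  return result.rowCount === 1;
}
```

A handler can return early when recordOrderOnce resolves to false. If later side effects can fail, run them in the same transaction as the insert, or a retry will skip work that never completed.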

Dead Letter Queue

Events that hit the retry cap are marked 'dead'. That status is the dead letter queue (DLQ). Monitor it aggressively:

// Query DLQ events
const dlqEvents = await db.query(`
  SELECT topic, COUNT(*) as count, MIN(created_at) as oldest
  FROM webhook_events
  WHERE status = 'dead'
  GROUP BY topic
  ORDER BY count DESC
`);

DLQ growth is not a metric to check weekly. Alert on it in real time. Every row is an event that never reached its destination.
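
One way to turn the DLQ query into an alert is sketched below. It takes the rows returned by that query and produces one message per topic over a threshold. The findDlqAlerts name and the maxDeadPerTopic option are assumptions, and delivery (Slack, PagerDuty, email) is left to the caller.

```javascript
// Turns DLQ summary rows ({ topic, count, oldest }) into alert messages.
// The default threshold of 0 alerts on any dead event.
function findDlqAlerts(rows, { maxDeadPerTopic = 0 } = {}, now = new Date()) {
  return rows
    .map((row) => ({
      topic: row.topic,
      // node-postgres returns COUNT(*) (a bigint) as a string by default.
      count: Number(row.count),
      ageMinutes: Math.round((now.getTime() - new Date(row.oldest).getTime()) / 60000),
    }))
    .filter((row) => row.count > maxDeadPerTopic)
    .map((row) => `${row.topic}: ${row.count} dead event(s), oldest ${row.ageMinutes} min old`);
}
```

Running this on a short interval, for example once a minute, and paging whenever it returns a non-empty array approximates real-time alerting.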

When to Replay What

Scenario	Action
Server down for 2 hours	Automatic retry handles it
Bug deployed and fixed	Manual replay by time range
New integration added	Manual replay to backfill history
Downstream API was down	Automatic retry with backoff
Data migration	Manual replay by topic and date

Mistakes That Kill Replay Systems

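Not storing the raw payload. Store the exact bytes Shopify sent. If you transform first, you cannot replay the original. A JSONB column preserves the data but not the bytes, because it normalizes whitespace and key order, so use a TEXT or BYTEA column if you ever need to re-verify the HMAC.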

Processing in the ingestion endpoint. 5-second window. Any business logic risks timeout and a missed event.

No idempotency. Replay becomes dangerous instead of safe.

Unlimited retries. Cap at 5-7. Move the rest to DLQ.

Ignoring DLQ depth. Silent DLQ growth means silent data loss.

Tools to Accelerate This

Tool	Role
BullMQ (Node.js)	Queue + retry + backoff out of the box
Sidekiq (Ruby)	Background jobs with retry logic
AWS SQS + Lambda	Fully managed, serverless
Temporal	Durable workflows with built-in replay
pg-boss	PostgreSQL-backed job queue

Wrapping Up

The pattern is simple:

Ingest fast, store everything
Process asynchronously
Retry with backoff
Route terminal failures to DLQ
Replay on demand after fixes

Every layer compounds reliability. Start with the event store. Add the worker. Add retry logic. Then add the replay trigger.

Your Shopify store generates events constantly. A replay system makes sure none of them disappear.
