Muhammad Masad Ashraf

Posted on • Originally published at kolachitech.com

Shopify Webhook Observability: Complete Implementation Guide

Webhooks are the nervous system of your Shopify integration. When they fail silently, your data drifts out of sync and nothing tells you why.

This guide covers production-ready webhook observability: structured logging, health metrics, error recovery, and debugging strategies.

The Problem

Webhooks are asynchronous. Your app receives events, processes them, and moves on. When something fails:

  • Shopify doesn't know your app crashed
  • You don't know a webhook failed
  • Data stays out of sync
  • Problems compound until customers complain

Without observability, you're flying blind.

What You'll Learn

  • Structured logging patterns for webhooks
  • Health metrics and dashboards
  • Error categorization and retry logic
  • Dead letter queue implementation
  • Debugging production incidents

Let's build visibility.


1. Structured Logging Foundation

All observability starts with logs.

Bad logs (unstructured text):

2024-06-04 14:23:45 Webhook received
2024-06-04 14:23:46 Processing
2024-06-04 14:23:47 Done

Unstructured logs are difficult to search, filter, and aggregate.

Good logs (JSON structured format):

{
  "timestamp": "2024-06-04T14:23:45Z",
  "webhook_id": "gid://shopify/WebhookEventBridge/12345",
  "shop_id": "gid://shopify/Shop/67890",
  "event_type": "orders/create",
  "status": "success",
  "duration_ms": 245,
  "attempt": 1,
  "signature_valid": true
}

Structured logs in JSON format enable filtering, aggregation, and analysis.
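
As a minimal sketch of what that enables, the snippet below parses newline-delimited JSON log entries and computes the success count and average duration per event type. The sample lines and the `summarize` helper are illustrative assumptions, not part of any logging platform's API.

```javascript
// Hypothetical sample: one JSON log entry per line, as a JSON-formatted logger would write them
const sampleLines = [
  '{"event_type":"orders/create","status":"success","duration_ms":245}',
  '{"event_type":"orders/create","status":"error","duration_ms":5010}',
  '{"event_type":"products/update","status":"success","duration_ms":120}',
  '{"event_type":"orders/create","status":"success","duration_ms":155}'
];

// Group successful entries by event type and average their duration
function summarize(lines) {
  const summary = {};
  for (const line of lines) {
    const entry = JSON.parse(line);
    if (entry.status !== 'success') continue;
    const bucket = summary[entry.event_type] || { count: 0, totalMs: 0 };
    bucket.count += 1;
    bucket.totalMs += entry.duration_ms;
    summary[entry.event_type] = bucket;
  }
  for (const type of Object.keys(summary)) {
    summary[type].avgMs = summary[type].totalMs / summary[type].count;
  }
  return summary;
}

console.log(summarize(sampleLines));
```

The same filter and group-by would be a one-line query in most log platforms; with free-text logs it would need fragile string matching.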

Entry Point Logging

Log as soon as a webhook arrives. Here's a complete Express.js implementation:

const express = require('express');
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  transports: [
    new winston.transports.File({ filename: 'webhooks.log' })
  ]
});

const app = express();

// Keep the raw bytes: Shopify signs the exact request body, so the HMAC
// check needs req.rawBody rather than the parsed object.
app.use(express.json({
  verify: (req, res, buf) => { req.rawBody = buf; }
}));

app.post('/webhooks/shopify', (req, res) => {
  const webhookId = req.headers['x-shopify-webhook-id'];
  const shop = req.headers['x-shopify-shop-domain'];
  const hmac = req.headers['x-shopify-hmac-sha256'];
  const topic = req.headers['x-shopify-topic'];

  logger.info('webhook_received', {
    webhook_id: webhookId,
    shop_id: shop,
    topic,
    timestamp: new Date().toISOString(),
    body_size: req.rawBody ? req.rawBody.length : 0
  });

  res.status(202).send('Accepted');

  // Process async; catch so a failure never becomes an unhandled rejection
  processWebhookAsync(req.rawBody, webhookId, hmac).catch((error) => {
    logger.error('webhook_async_failed', {
      webhook_id: webhookId,
      error: error.message
    });
  });
});

Why this approach: Return 202 immediately. Process asynchronously. Shopify's timeout is 5 seconds; don't waste it.

Signature Verification Logging

Verify Shopify's HMAC signature before processing any webhook. Compute the digest over the raw request body; a re-serialized JSON.stringify(body) is not guaranteed to match the bytes Shopify signed:

const crypto = require('crypto');

function verifySignature(rawBody, signature, secret) {
  if (!rawBody || !signature) return false;
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const provided = Buffer.from(signature, 'base64');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  return expected.length === provided.length &&
    crypto.timingSafeEqual(expected, provided);
}

async function processWebhookAsync(rawBody, webhookId, signature) {
  const apiSecret = process.env.SHOPIFY_API_SECRET;

  if (!verifySignature(rawBody, signature, apiSecret)) {
    logger.error('webhook_signature_invalid', {
      webhook_id: webhookId,
      provided_signature: signature ? signature.substring(0, 16) + '...' : null
    });
    return;
  }

  logger.debug('webhook_signature_verified', {
    webhook_id: webhookId
  });

  const body = JSON.parse(rawBody.toString('utf8'));
  // Continue processing
}

Never process unverified webhooks. Signature verification prevents unauthorized access.
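
One detail that is easy to get wrong: the HMAC covers the exact bytes of the request body. The sketch below, using a made-up secret and payload, shows a signature that verifies against the raw body but not against the same JSON after a parse and re-serialize round trip. The `sign` and `isValid` helpers are hypothetical names for this illustration.

```javascript
const crypto = require('crypto');

// Hypothetical secret and payload, for illustration only
const secret = 'example-secret';
const rawBody = '{"id": 1, "total_price": "10.00"}'; // bytes exactly as sent

function sign(body, key) {
  return crypto.createHmac('sha256', key).update(body, 'utf8').digest('base64');
}

function isValid(body, signature, key) {
  const expected = Buffer.from(sign(body, key), 'base64');
  const provided = Buffer.from(signature, 'base64');
  // Constant-time comparison; lengths are checked first because
  // timingSafeEqual throws when they differ
  return expected.length === provided.length &&
    crypto.timingSafeEqual(expected, provided);
}

const signature = sign(rawBody, secret); // what the sender would put in the HMAC header
const reserialized = JSON.stringify(JSON.parse(rawBody)); // whitespace is lost here

console.log(isValid(rawBody, signature, secret));      // true
console.log(isValid(reserialized, signature, secret)); // false
```

This is why the handler should keep the unparsed body around for verification instead of rebuilding it from the parsed object.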

Processing Step Logging

Log each stage of webhook processing with timing information. This helps identify where delays occur:

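async function processWebhookAsync(body, webhookId, signature) {
  const startTime = Date.now();
  const apiSecret = process.env.SHOPIFY_API_SECRET;

  try {
    // Verify signature. `body` must be the raw request bytes, because the
    // HMAC covers exactly what Shopify sent, not a re-serialized object.
    if (!verifySignature(body, signature, apiSecret)) {
      logger.error('webhook_signature_invalid', { webhook_id: webhookId });
      return;
    }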

    // Parse webhook
    const webhook = parseWebhook(body);
    logger.debug('webhook_parsed', {
      webhook_id: webhookId,
      resource_id: webhook.id,
      resource_type: webhook.type
    });

    // Check duplicate
    const isDuplicate = await checkDuplicate(webhookId);
    if (isDuplicate) {
      logger.info('webhook_duplicate_detected', {
        webhook_id: webhookId,
        resource_id: webhook.id
      });
      return;
    }

    // Store to database
    await storeWebhook(webhook);
    logger.info('webhook_stored', {
      webhook_id: webhookId,
      resource_id: webhook.id
    });

    // Sync with systems
    await syncDownstream(webhook);
    logger.info('webhook_processed', {
      webhook_id: webhookId,
      duration_ms: Date.now() - startTime,
      resource_id: webhook.id
    });

  } catch (error) {
    logger.error('webhook_processing_failed', {
      webhook_id: webhookId,
      error: error.message,
      stack: error.stack,
      duration_ms: Date.now() - startTime
    });

    // Retry or dead letter queue
    await handleError(error, body, webhookId);
  }
}

External Service Logging

Track interactions with third-party APIs and downstream services. Log both successful calls and failures with timing information:

async function syncDownstream(webhook) {
  const endpoints = [
    { name: 'inventory-sync', url: 'https://inventory.internal/sync' },
    { name: 'analytics-pipeline', url: 'https://analytics.internal/events' }
  ];

  for (const endpoint of endpoints) {
    const startTime = Date.now();
    try {
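      const response = await fetch(endpoint.url, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(webhook),
        // fetch has no `timeout` option; abort through a signal instead
        signal: AbortSignal.timeout(5000)
      });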

      const duration = Date.now() - startTime;

      if (!response.ok) {
        logger.warn('external_service_error', {
          webhook_id: webhook.id,
          service: endpoint.name,
          status: response.status,
          duration_ms: duration
        });
      } else {
        logger.debug('external_service_success', {
          webhook_id: webhook.id,
          service: endpoint.name,
          duration_ms: duration
        });
      }
    } catch (error) {
      logger.error('external_service_timeout', {
        webhook_id: webhook.id,
        service: endpoint.name,
        error: error.message,
        duration_ms: Date.now() - startTime
      });

      // Store for retry
      await storeFailedSync(webhook.id, endpoint.name);
    }
  }
}

2. Health Metrics and Dashboards

Track three key metrics:

Delivery Rate

Delivery Rate = (Webhooks Successfully Processed / Total Webhooks) x 100

Target: 99.9% or higher.
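
A worked example of the formula, as a small sketch. The counts are hypothetical, and the `deliveryRate` helper is an illustrative name rather than part of any library.

```javascript
// Delivery rate as a percentage; returns null when there is nothing to measure
function deliveryRate(processed, total) {
  if (total === 0) return null;
  return (processed * 100) / total;
}

const TARGET = 99.9; // target from the text above

const rate = deliveryRate(99874, 100000); // hypothetical daily counts
console.log(rate.toFixed(3) + '%', rate >= TARGET ? 'on target' : 'below target');
```

With these numbers, 126 failed webhooks out of 100,000 is already enough to miss the target, which is why small failure counts deserve attention at volume.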

Processing Latency

async function calculateLatencyMetrics() {
  const logs = await queryLogs({
    status: 'success',
    timeRange: '24h'
  });

  const durations = logs
    .map(log => log.duration_ms)
    .sort((a, b) => a - b);

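  return {
    median: percentile(durations, 50),
    p95: percentile(durations, 95),
    p99: percentile(durations, 99),
    // Already sorted, so the last element is the max; this also avoids
    // spreading a very large array into Math.max
    max: durations[durations.length - 1]
  };
}

// Nearest-rank percentile over an ascending array; undefined when empty
function percentile(arr, p) {
  const index = Math.ceil((p / 100) * arr.length) - 1;
  return arr[Math.max(0, index)];
}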

Target: Median under 100ms, p95 under 500ms, p99 under 2s.
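
To make the targets concrete, here is a minimal sketch that checks a batch of durations against them using the same nearest-rank percentile. The sample values and the `latencyBreaches` helper are assumptions for illustration.

```javascript
// Nearest-rank percentile over an ascending array
function percentile(sorted, p) {
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, index)];
}

// Targets from the text above, in milliseconds; a value must stay under its limit
const TARGETS_MS = { p50: 100, p95: 500, p99: 2000 };

// Returns the names of the percentiles that miss their target
function latencyBreaches(durations) {
  const sorted = [...durations].sort((a, b) => a - b);
  return Object.entries(TARGETS_MS)
    .filter(([name, limit]) => percentile(sorted, Number(name.slice(1))) >= limit)
    .map(([name]) => name);
}

// Hypothetical sample: mostly fast, with two slow outliers
const sample = [120, 80, 95, 2400, 60, 110, 450, 70, 90, 100];
console.log(latencyBreaches(sample)); // [ 'p95', 'p99' ]
```

The median here is 95ms and looks healthy; only the tail percentiles reveal the 2.4s outlier, which is the reason to track p95 and p99 alongside the median.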

Error Rate by Type

async function getErrorMetrics() {
  const errorLogs = await queryLogs({
    status: 'error',
    timeRange: '24h'
  });

  const errorsByType = {};

  for (const log of errorLogs) {
    const type = log.error_type || 'unknown';
    errorsByType[type] = (errorsByType[type] || 0) + 1;
  }

  const total = errorLogs.length;

  return Object.entries(errorsByType).reduce((acc, [type, count]) => {
    acc[type] = {
      count,
      percentage: ((count / total) * 100).toFixed(2) + '%'
    };
    return acc;
  }, {});
}

Track:

  • Timeout errors (slow external services)
  • Database errors (connection pool exhausted)
  • Validation errors (bad payload)
  • Rate limit errors (too many requests)
  • Service unavailable (external APIs down)

Building a Dashboard

Use your logging platform's dashboard feature (Datadog, Splunk, CloudWatch, etc.). Here are example queries to calculate key metrics:

Query 1: Calculate delivery rate percentage over the last 24 hours

SELECT count(*) as total,
       sum(case when status = 'success' then 1 else 0 end) as successful,
       (sum(case when status = 'success' then 1 else 0 end) * 100.0 / count(*)) as delivery_rate
FROM webhooks
WHERE timestamp > now() - interval '24 hours'

Query 2: Calculate latency percentiles (50th, 95th, 99th) for successful webhooks

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY duration_ms) as p50,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY duration_ms) as p95,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY duration_ms) as p99
FROM webhooks
WHERE timestamp > now() - interval '24 hours'
  AND status = 'success'

Query 3: Calculate error rate by error type

SELECT error_type,
       count(*) as count,
       (count(*) * 100.0 / sum(count(*)) OVER ()) as percentage
FROM webhooks
WHERE timestamp > now() - interval '24 hours'
  AND status = 'error'
GROUP BY error_type
ORDER BY count DESC

3. Error Handling and Retry Logic

Categorize errors. Different errors need different responses.

Error Classification

Different error types require different handling strategies. This function classifies errors so you can respond appropriately:

const ErrorType = {
  TRANSIENT_TIMEOUT: 'transient_timeout',
  TRANSIENT_NETWORK: 'transient_network',
  PERMANENT_INVALID_DATA: 'permanent_invalid_data',
  PERMANENT_VALIDATION: 'permanent_validation',
  SYSTEMIC_SERVICE_DOWN: 'systemic_service_down',
  RATE_LIMITED: 'rate_limited',
  DATABASE_ERROR: 'database_error'
};

function classifyError(error, statusCode) {
  if (error.code === 'ECONNREFUSED' || error.code === 'ENOTFOUND') {
    return ErrorType.TRANSIENT_NETWORK;
  }

  if (error.code === 'ETIMEDOUT') {
    return ErrorType.TRANSIENT_TIMEOUT;
  }

  if (statusCode === 429) {
    return ErrorType.RATE_LIMITED;
  }

  if (statusCode === 503 || statusCode === 502) {
    return ErrorType.SYSTEMIC_SERVICE_DOWN;
  }

  if (statusCode === 400 || statusCode === 422) {
    return ErrorType.PERMANENT_INVALID_DATA;
  }

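  if (error.name === 'ValidationError' || error.type === 'ValidationError') {
    return ErrorType.PERMANENT_VALIDATION;
  }

  // Guard against errors that carry no message
  if ((error.message || '').toLowerCase().includes('database')) {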
    return ErrorType.DATABASE_ERROR;
  }

  return 'unknown';
}
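
Once errors are classified, each category can map to a handling decision. The sketch below is one possible mapping, not a prescribed policy: the keys mirror the error type values above, while the action names and the `actionFor` helper are assumptions for illustration.

```javascript
// Keys mirror the error type values above; the actions are an illustrative assumption
const ACTION_BY_ERROR_TYPE = new Map([
  ['transient_timeout', 'retry'],
  ['transient_network', 'retry'],
  ['database_error', 'retry'],
  ['rate_limited', 'retry_later'],          // back off longer before trying again
  ['systemic_service_down', 'retry_later'],
  ['permanent_invalid_data', 'dead_letter'], // retrying cannot fix a bad payload
  ['permanent_validation', 'dead_letter']
]);

function actionFor(errorType) {
  // Unknown errors are retried here so that a new failure mode is not dropped silently
  return ACTION_BY_ERROR_TYPE.get(errorType) || 'retry';
}

console.log(actionFor('rate_limited'));         // retry_later
console.log(actionFor('permanent_validation')); // dead_letter
console.log(actionFor('unknown'));              // retry
```

Keeping the mapping in data rather than in branching code makes it easy to review and to adjust when a category turns out to need different treatment.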

Exponential Backoff Retry

Implement retry logic with exponential backoff and jitter to handle transient failures gracefully:

async function processWithRetry(webhook, maxAttempts = 3) {
  const delays = [1000, 2000, 4000]; // 1s, 2s, 4s

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await processWebhook(webhook);

      logger.info('webhook_success', {
        webhook_id: webhook.id,
        attempt
      });
      return;

    } catch (error) {
      const errorType = classifyError(error, error.statusCode);

      // Don't retry permanent errors
      if (errorType.startsWith('permanent')) {
        logger.error('webhook_permanent_failure', {
          webhook_id: webhook.id,
          error_type: errorType,
          error: error.message,
          attempt
        });
        throw error;
      }

      // Last attempt?
      if (attempt === maxAttempts) {
        logger.error('webhook_max_retries_exceeded', {
          webhook_id: webhook.id,
          error_type: errorType,
          attempts: maxAttempts
        });
        throw error;
      }

      // Retry with backoff + jitter
      const baseDelay = delays[attempt - 1] || delays[delays.length - 1];
      const jitter = Math.random() * 1000;
      const totalDelay = baseDelay + jitter;

      logger.warn('webhook_retry', {
        webhook_id: webhook.id,
        error_type: errorType,
        attempt,
        next_retry_ms: Math.round(totalDelay)
      });

      await new Promise(resolve => setTimeout(resolve, totalDelay));
    }
  }
}
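
The fixed delay table above can also be computed. The following is a minimal sketch of a capped exponential schedule with additive jitter; the parameter names, the 30-second cap, and the injectable `random` function are assumptions chosen for illustration.

```javascript
// Delay before the next retry: base * 2^(attempt - 1), capped, plus random jitter.
// `random` is injectable so the schedule can be tested deterministically.
function backoffDelay(attempt, options = {}) {
  const { baseMs = 1000, capMs = 30000, jitterMs = 1000, random = Math.random } = options;
  const exponential = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  return exponential + Math.floor(random() * jitterMs);
}

// With jitter disabled the schedule is 1000, 2000, 4000, 8000, 16000
for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(attempt, backoffDelay(attempt, { random: () => 0 }));
}
```

The cap keeps late attempts from waiting unreasonably long, and the jitter spreads out retries so that many webhooks failing at the same moment do not all hit a recovering service at once.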

Dead Letter Queue Pattern

Implement a dead letter queue to capture webhooks that fail after all retry attempts. This prevents data loss and enables async reprocessing:

// PostgreSQL schema
const schema = `
CREATE TABLE IF NOT EXISTS dead_letter_queue (
  id SERIAL PRIMARY KEY,
  webhook_id VARCHAR(255) UNIQUE NOT NULL,
  payload JSONB NOT NULL,
  error_type VARCHAR(255),
  error_message TEXT,
  attempts INT DEFAULT 1,
  first_failure_at TIMESTAMP DEFAULT NOW(),
  last_failure_at TIMESTAMP DEFAULT NOW(),
  status VARCHAR(50) DEFAULT 'pending',
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_dlq_status ON dead_letter_queue(status);
CREATE INDEX idx_dlq_first_failure ON dead_letter_queue(first_failure_at);
`;

// Store failed webhook
async function storeInDeadLetterQueue(webhook, error) {
  const errorType = classifyError(error, error.statusCode);

  try {
    await db.query(
      `INSERT INTO dead_letter_queue (webhook_id, payload, error_type, error_message)
       VALUES ($1, $2, $3, $4)
       ON CONFLICT (webhook_id) DO UPDATE SET
         attempts = dead_letter_queue.attempts + 1,
         last_failure_at = NOW(),
         error_type = EXCLUDED.error_type,
         error_message = EXCLUDED.error_message`,
      [webhook.id, JSON.stringify(webhook), errorType, error.message]
    );

    logger.warn('webhook_moved_to_dlq', {
      webhook_id: webhook.id,
      error_type: errorType
    });

  } catch (dbError) {
    logger.error('dlq_storage_failed', {
      webhook_id: webhook.id,
      error: dbError.message
    });
  }
}

// Reprocess dead letter queue
async function reprocessDeadLetterQueue() {
  const pendingWebhooks = await db.query(
    `SELECT * FROM dead_letter_queue
     WHERE status = 'pending'
     AND last_failure_at < NOW() - INTERVAL '1 hour'
     ORDER BY attempts ASC
     LIMIT 100`
  );

  for (const row of pendingWebhooks.rows) {
    try {
      await processWithRetry(row.payload);

      await db.query(
        `UPDATE dead_letter_queue SET status = 'processed' WHERE id = $1`,
        [row.id]
      );

      logger.info('dlq_webhook_reprocessed', {
        webhook_id: row.webhook_id,
        attempts: row.attempts
      });

    } catch (error) {
      logger.warn('dlq_reprocess_failed', {
        webhook_id: row.webhook_id,
        error: error.message
      });

      // Record the failure; the row stays pending and is retried on a later run
      await db.query(
        `UPDATE dead_letter_queue
         SET last_failure_at = NOW(),
             attempts = attempts + 1
         WHERE id = $1`,
        [row.id]
      );
    }
  }
}

// Run periodically (every 5 minutes)
setInterval(reprocessDeadLetterQueue, 5 * 60 * 1000);

4. Detecting Duplicates

Shopify retries a webhook when it doesn't receive a 2xx response within 5 seconds, so the same event can arrive more than once. You must detect and handle duplicate deliveries:

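// Redis store for duplicate detection (node-redis v4+ API)
const redis = require('redis');
const client = redis.createClient();
client.on('error', (err) => logger.error('redis_error', { error: err.message }));
client.connect().catch((err) => logger.error('redis_connect_failed', { error: err.message }));

async function checkAndMarkProcessed(webhookId) {
  const key = `webhook:${webhookId}`;

  // Set with 48-hour expiration (Shopify retry window)
  // NX: only set if the key doesn't exist; resolves to 'OK' when set, null otherwise
  const wasSet = await client.set(key, String(Date.now()), {
    EX: 48 * 60 * 60,
    NX: true
  });

  // null means the key already existed, so this delivery is a duplicate
  return wasSet === null;
}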

// In your webhook handler
async function processWebhookAsync(body, webhookId, signature) {
  const isDuplicate = await checkAndMarkProcessed(webhookId);

  if (isDuplicate) {
    logger.info('webhook_duplicate_ignored', {
      webhook_id: webhookId,
      topic: body.topic
    });
    return;
  }

  // Process new webhook
  await processWithRetry(body);
}

Alternative (database-based duplicate detection using PostgreSQL unique constraint):

async function checkAndMarkProcessed(webhookId) {
  try {
    await db.query(
      `INSERT INTO processed_webhooks (webhook_id, processed_at)
       VALUES ($1, NOW())`,
      [webhookId]
    );
    return false; // First time
  } catch (error) {
    if (error.code === '23505') { // Unique constraint violation
      return true; // Already processed
    }
    throw error;
  }
}

5. Production Alerts

Set up automated alerts for critical metrics. These thresholds should trigger notifications to your team:

const alertThresholds = {
  deliveryRate: 0.99, // Alert if below 99%
  errorRate: 0.01, // Alert if above 1%
  p95Latency: 2000, // Alert if above 2s (ms)
  dlqSize: 100, // Alert if more than 100 pending
  noWebhooksIn: 3600 // Alert if no webhooks in 1 hour (seconds)
};

async function checkAlerts() {
  const metrics = await getMetrics();

  if (metrics.deliveryRate < alertThresholds.deliveryRate) {
    await sendAlert({
      severity: 'critical',
      title: 'Low webhook delivery rate',
      message: `Delivery rate: ${(metrics.deliveryRate * 100).toFixed(2)}%`,
      details: metrics
    });
  }

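  if (metrics.errorRate > alertThresholds.errorRate) {
    await sendAlert({
      severity: 'warning',
      title: 'High webhook error rate',
      message: `Error rate: ${(metrics.errorRate * 100).toFixed(2)}%`,
      details: metrics
    });
  }

  if (metrics.p95Latency > alertThresholds.p95Latency) {
    await sendAlert({
      severity: 'warning',
      title: 'High webhook latency',
      message: `P95 latency: ${metrics.p95Latency}ms`,
      details: metrics
    });
  }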

  if (metrics.dlqSize > alertThresholds.dlqSize) {
    await sendAlert({
      severity: 'warning',
      title: 'Dead letter queue growing',
      message: `${metrics.dlqSize} pending webhooks in DLQ`,
      details: metrics
    });
  }

  if (metrics.lastWebhookAge > alertThresholds.noWebhooksIn) {
    await sendAlert({
      severity: 'critical',
      title: 'No webhooks received',
      message: `Last webhook ${metrics.lastWebhookAge}s ago`,
      details: metrics
    });
  }
}

// Run every 5 minutes
setInterval(checkAlerts, 5 * 60 * 1000);
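
Threshold logic is easier to unit test when it is separated from the code that sends notifications. A minimal sketch of that split follows; `evaluateAlerts` and the metrics snapshot are hypothetical, and the threshold values are the ones listed above.

```javascript
// Pure check: returns the names of the breached thresholds without sending anything
function evaluateAlerts(metrics, thresholds) {
  const breaches = [];
  if (metrics.deliveryRate < thresholds.deliveryRate) breaches.push('deliveryRate');
  if (metrics.errorRate > thresholds.errorRate) breaches.push('errorRate');
  if (metrics.p95Latency > thresholds.p95Latency) breaches.push('p95Latency');
  if (metrics.dlqSize > thresholds.dlqSize) breaches.push('dlqSize');
  if (metrics.lastWebhookAge > thresholds.noWebhooksIn) breaches.push('noWebhooksIn');
  return breaches;
}

const thresholds = {
  deliveryRate: 0.99,
  errorRate: 0.01,
  p95Latency: 2000,
  dlqSize: 100,
  noWebhooksIn: 3600
};

// Hypothetical metrics snapshot
const snapshot = {
  deliveryRate: 0.985,
  errorRate: 0.015,
  p95Latency: 800,
  dlqSize: 12,
  lastWebhookAge: 90
};

console.log(evaluateAlerts(snapshot, thresholds)); // [ 'deliveryRate', 'errorRate' ]
```

A scheduler can then call this function and pass each breach to whatever notification channel the team uses.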

6. Testing Your Setup

Implement health check endpoints and test webhook functionality to verify your observability infrastructure is working:

// Health check endpoint
app.get('/health/webhooks', async (req, res) => {
  const metrics = await getMetrics();

  const isHealthy = 
    metrics.deliveryRate >= 0.99 &&
    metrics.errorRate <= 0.01 &&
    metrics.p95Latency <= 2000 &&
    metrics.dlqSize <= 100;

  res.status(isHealthy ? 200 : 503).json({
    status: isHealthy ? 'healthy' : 'degraded',
    metrics,
    timestamp: new Date().toISOString()
  });
});

// Test webhook endpoint
app.post('/webhooks/test', (req, res) => {
  const testWebhook = {
    id: 'test-' + Date.now(),
    topic: 'test/event',
    shop_id: 'test-shop',
    created_at: new Date().toISOString(),
    data: { test: true }
  };

  logger.info('test_webhook_received', testWebhook);

  res.status(200).json({ success: true });
});
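
To exercise the real handler, including signature verification, a test delivery needs a valid HMAC. The sketch below builds the body and headers for one. The `buildSignedTestRequest` helper, the shop domain, the secret, and the ID format are placeholders for illustration; the secret must match the one your handler verifies with.

```javascript
const crypto = require('crypto');

// Builds the body and headers of a signed test delivery.
// The shop domain, default topic, and ID format are placeholders.
function buildSignedTestRequest(payload, secret, topic = 'orders/create') {
  const body = JSON.stringify(payload);
  const hmac = crypto.createHmac('sha256', secret).update(body, 'utf8').digest('base64');
  return {
    body,
    headers: {
      'Content-Type': 'application/json',
      'X-Shopify-Topic': topic,
      'X-Shopify-Shop-Domain': 'example-shop.myshopify.com',
      'X-Shopify-Webhook-Id': 'test-' + Date.now(),
      'X-Shopify-Hmac-Sha256': hmac
    }
  };
}

const request = buildSignedTestRequest({ id: 1, test: true }, 'example-secret');
console.log(request.headers);
```

Send `request.body` unchanged as the POST body together with `request.headers`; re-encoding the payload after signing would invalidate the signature.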

Key Takeaways

  1. Log structured data - JSON with context, not raw payloads
  2. Track three metrics - Delivery rate, latency, error rate
  3. Categorize errors - Different errors need different handling
  4. Implement retries - Exponential backoff with jitter
  5. Use dead letter queues - Prevent data loss
  6. Detect duplicates - Shopify retries; you must handle it
  7. Alert on thresholds - Know when things break
  8. Test thoroughly - Load test at production scale

Ready to implement? Start with structured logging, then add metrics, then implement retries and DLQ. Each piece builds on the previous.

What webhook patterns have you encountered? Share in the comments!
