
Asad Abdullah Zafar

Posted on • Originally published at kolachitech.com

Queue Infrastructure for Shopify Apps: The Complete Developer Guide

Shopify gives your webhook endpoint 5 seconds to respond.

Miss it and the delivery is marked failed. Under load, that window disappears fast. I have seen well-built Shopify apps completely fall apart during a product launch — not because the logic was wrong, but because everything was running synchronously inside the handler.

This post breaks down how to fix it properly.


The Core Problem

A single orders/create event can require:

  • Inventory sync
  • Fulfillment creation
  • External ERP update
  • Customer notification

That is four operations, each with its own latency and failure modes, all competing inside a 5-second window that Shopify is actively timing.

During a flash sale, hundreds of these events arrive per minute. Synchronous handling does not survive that.


The Fix: Async Queue Infrastructure

Every production Shopify app queue follows the same three-step contract:

Incoming Webhook
|
v
[ Validate HMAC ] --> Return 200 OK immediately
|
v
[ Enqueue Job ] --> Minimum payload only (IDs, not full objects)
|
v
[ Worker Process ] --> Business logic, retries, DLQ

Rule: Your HTTP layer never touches business logic. Your worker layer never touches HTTP.

The webhook handler does one thing — validate and enqueue. Everything else belongs to the worker.


Choosing the Right Queue

| Queue           | Best Fit                        | Delivery      | Ops Overhead |
|-----------------|---------------------------------|---------------|--------------|
| BullMQ (Redis)  | Node.js apps                    | At-least-once | Low          |
| Amazon SQS FIFO | AWS-native apps                 | Exactly-once  | Very low     |
| RabbitMQ        | Complex routing, multi-consumer | At-least-once | Medium       |
| Sidekiq         | Ruby / Rails apps               | At-least-once | Low          |

For most Node.js Shopify apps, BullMQ is the right default. Named queues, priority support, delayed jobs, exponential backoff, and a built-in dashboard (Bull Board) — all from a single Redis instance.


Job Design: What Goes in the Queue

Store the minimum. Reference everything else from your database.

const { Queue } = require('bullmq');

// Connection details assumed; point this at your Redis instance.
const orderQueue = new Queue('orders', { connection: { host: '127.0.0.1', port: 6379 } });

await orderQueue.add(
  'process-order',
  {
    // References only -- the worker fetches the full order from Shopify
    shop:       'your-store.myshopify.com',
    orderId:    payload.id,
    topic:      'orders/create',
    receivedAt: Date.now(),
  },
  {
    attempts:         5,                                    // retry up to 5 times
    backoff:          { type: 'exponential', delay: 2000 }, // 2s, 4s, 8s, ...
    removeOnComplete: 100,  // keep only the last 100 completed jobs in Redis
    removeOnFail:     500,  // keep the last 500 failures for debugging
  }
);

Never push the full webhook payload into Redis. Store the ID, fetch the full object inside the worker. Large payloads in Redis memory cause silent bloat that degrades queue performance over time.
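A matching worker sketch, assuming BullMQ plus hypothetical `fetchOrder` and `processOrder` helpers standing in for your Shopify Admin API call and business logic:

```javascript
// Build the minimal job payload: IDs and metadata only, never the full object.
function minimalJobData(shop, topic, payload) {
  return { shop, orderId: payload.id, topic, receivedAt: Date.now() };
}

// Processor: re-fetch the order by ID so the worker always acts on fresh data
// from Shopify, not on a stale copy sitting in Redis.
async function processOrderJob(job) {
  const { shop, orderId } = job.data;
  const order = await fetchOrder(shop, orderId); // hypothetical Admin API call
  await processOrder(order);                     // your business logic
}

// Wiring sketch (connection assumed):
// const { Worker } = require('bullmq');
// new Worker('orders', processOrderJob, { connection, concurrency: 5 });
```

Re-fetching also means a webhook that arrives out of order still processes against the current state of the resource.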


Handling Shopify API 429s Inside Workers

Shopify's GraphQL Admin API uses a cost-based bucket: 1,000 points, refilling at 50 points per second on standard plans.

Workers that ignore rate limits hammer the same endpoint and retry endlessly. Read the Retry-After header and use it:

const worker = new Worker('orders', async (job) => {
  try {
    return await processOrder(job.data);
  } catch (err) {
    if (err.statusCode === 429 && err.headers?.['retry-after']) {
      // Pause the worker for Shopify's stated window (seconds), then retry
      await worker.rateLimit(parseInt(err.headers['retry-after'], 10) * 1000);
      throw Worker.RateLimitError();
    }
    throw err;
  }
});

This respects Shopify's own backoff window rather than guessing at a delay.


Queue Segmentation: Do Not Mix Job Priorities

One shared queue creates priority inversion. A backlog of low-priority notification jobs will block high-priority order jobs from processing.

Run at least three separate queues:

| Priority | Job Types                       |
|----------|---------------------------------|
| High     | Orders, payments, fulfillments  |
| Standard | Inventory updates, product sync |
| Low      | Notifications, analytics events |

Each queue gets its own concurrency setting and can be scaled independently.
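A sketch of the split — queue names and concurrency numbers are illustrative, not recommendations:

```javascript
// One queue per priority tier; each worker pool scales on its own.
const queueTiers = {
  'orders-high':       { concurrency: 10 }, // orders, payments, fulfillments
  'sync-standard':     { concurrency: 5 },  // inventory updates, product sync
  'notifications-low': { concurrency: 2 },  // notifications, analytics events
};

// Wiring sketch with BullMQ (processors and connection assumed):
// for (const [name, { concurrency }] of Object.entries(queueTiers)) {
//   new Worker(name, processors[name], { connection, concurrency });
// }
```

With this layout, a flood of notification jobs can saturate its own two workers without touching order throughput.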


Dead Letter Queues: Never Discard a Failed Job

Failed jobs fall into two categories:

Transient — network timeouts, rate limits, temporary API errors. Handle with exponential backoff.
Permanent — malformed data, logic errors, resource not found. Route to the DLQ after max retries.

Never silently discard a failed job. The DLQ is your audit trail. Every job that lands there represents a Shopify event that did not process — and potentially an order, inventory change, or fulfillment that needs manual recovery.
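One way to sketch that routing — `isTransient` is an illustrative classifier and `deadLetterQueue` an assumed BullMQ queue:

```javascript
// Transient failures are worth retrying; everything else belongs in the DLQ.
function isTransient(err) {
  return err.statusCode === 429 ||   // rate limited
         err.statusCode >= 500 ||    // temporary API error
         err.code === 'ETIMEDOUT';   // network timeout
}

// Sketch of a BullMQ 'failed' listener (worker and deadLetterQueue assumed):
// worker.on('failed', async (job, err) => {
//   if (!isTransient(err) || job.attemptsMade >= job.opts.attempts) {
//     await deadLetterQueue.add('dead', { ...job.data, error: err.message });
//   }
// });
```

Permanent failures skip straight to the DLQ instead of burning retries that can never succeed.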


Production Monitoring: 5 Metrics That Matter

| Metric             | Healthy Threshold     | Action If Breached                |
|--------------------|-----------------------|-----------------------------------|
| Queue depth        | Under 500 pending     | Scale workers horizontally        |
| Job failure rate   | Under 1%              | Inspect DLQ, audit API errors     |
| Worker concurrency | Under 80% utilisation | Pre-scale before peak events      |
| Job latency (p99)  | Under 10 seconds      | Optimise job logic or add workers |
| DLQ depth          | 0 new jobs            | Investigate immediately           |

Export BullMQ metrics to Datadog or Prometheus and alert on queue depth before flash sale events — not during them.

Set alerts, not dashboards. Dashboards require someone to look at them. Alerts fire when something actually breaks.
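A minimal health check against those thresholds, assuming job counts come from BullMQ's `queue.getJobCounts()` (the function and threshold names here are illustrative):

```javascript
// Thresholds from the table above.
const THRESHOLDS = { maxDepth: 500, maxFailureRate: 0.01 };

// counts: { waiting, delayed, completed, failed } -- the shape returned by
// queue.getJobCounts(). dlqNewJobs: new DLQ arrivals since the last check.
function queueAlerts(counts, dlqNewJobs) {
  const alerts = [];
  if (counts.waiting + counts.delayed > THRESHOLDS.maxDepth) alerts.push('queue depth');
  const total = counts.completed + counts.failed;
  if (total > 0 && counts.failed / total > THRESHOLDS.maxFailureRate) alerts.push('failure rate');
  if (dlqNewJobs > 0) alerts.push('dlq');
  return alerts;
}

// Poll sketch:
// const counts = await queue.getJobCounts('waiting', 'delayed', 'completed', 'failed');
// for (const a of queueAlerts(counts, dlqNew)) pageSomeone(a);
```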


Wrapping Up

Reliable Shopify queue infrastructure is not one decision. It is five deliberate ones made at every layer of your app — queue selection, job design, retry logic, segmentation, and observability.

Get any one of these wrong and a flash sale exposes it fast.

Full guide with component breakdowns, queue comparisons, and GraphQL worker optimisation here:
👉 https://kolachitech.com/shopify-queue-infrastructure/

Drop a comment if you want to go deeper on any of these patterns. Always happy to talk Shopify infrastructure.
