Shopify gives your webhook endpoint 5 seconds to respond.
Miss it and the delivery is marked failed. Under load, that window disappears fast. I have seen well-built Shopify apps completely fall apart during a product launch — not because the logic was wrong, but because everything was running synchronously inside the handler.
This post breaks down how to fix it properly.
## The Core Problem
A single orders/create event can require:
- Inventory sync
- Fulfillment creation
- External ERP update
- Customer notification
That is four operations, each with its own latency and failure modes, all competing inside a 5-second window Shopify is actively timing.
During a flash sale, hundreds of these events can arrive every minute, many of them concurrently. Synchronous handling does not survive that.
## The Fix: Async Queue Infrastructure
Every production-grade Shopify webhook pipeline follows the same three-step contract:
```
Incoming Webhook
        |
        v
[ Validate HMAC ]  --> Return 200 OK immediately
        |
        v
[ Enqueue Job ]    --> Minimum payload only (IDs, not full objects)
        |
        v
[ Worker Process ] --> Business logic, retries, DLQ
```
Rule: Your HTTP layer never touches business logic. Your worker layer never touches HTTP.
The webhook handler does one thing — validate and enqueue. Everything else belongs to the worker.
## Choosing the Right Queue
| Queue | Best Fit | Delivery | Ops Overhead |
|---|---|---|---|
| BullMQ (Redis) | Node.js apps | At-least-once | Low |
| Amazon SQS FIFO | AWS-native apps | Exactly-once | Very low |
| RabbitMQ | Complex routing, multi-consumer | At-least-once | Medium |
| Sidekiq | Ruby / Rails apps | At-least-once | Low |
For most Node.js Shopify apps, BullMQ is the right default. Named queues, priority support, delayed jobs, exponential backoff, and a built-in dashboard (Bull Board) — all from a single Redis instance.
## Job Design: What Goes in the Queue
Store the minimum. Reference everything else from your database.
```javascript
await orderQueue.add(
  'process-order',
  {
    shop: 'your-store.myshopify.com',
    orderId: payload.id,
    topic: 'orders/create',
    receivedAt: Date.now(),
  },
  {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: 100,
    removeOnFail: 500,
  }
);
```
Never push the full webhook payload into Redis. Store the ID, fetch the full object inside the worker. Large payloads in Redis memory cause silent bloat that degrades queue performance over time.
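One way to keep that discipline is to give the processor its fetcher as a dependency. `fetchOrder` here is a stand-in for your Shopify Admin API client, not a real library call:

```javascript
// Processor sketch: the job carries only IDs; the worker re-fetches the
// full, current object. fetchOrder is a placeholder for your API client.
async function processOrderJob(data, fetchOrder) {
  const order = await fetchOrder(data.shop, data.orderId);
  if (!order) {
    // Permanent failure: the order no longer exists; let it reach the DLQ.
    throw new Error(`order ${data.orderId} not found on ${data.shop}`);
  }
  // ...inventory sync, fulfillment, ERP update, notification go here...
  return { orderId: order.id, processed: true };
}

// In BullMQ this plugs into a Worker, e.g.:
// new Worker('orders', (job) => processOrderJob(job.data, fetchOrder));
```

Fetching fresh data also sidesteps stale payloads: by the time a retried job runs, the webhook snapshot may be minutes old.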
## Handling Shopify API 429s Inside Workers
Shopify's GraphQL Admin API uses a cost-based bucket: 1,000 points, refilling at 50 points per second on standard plans.
Workers that ignore rate limits hammer the same endpoint and retry endlessly. Read the Retry-After header and use it:
```javascript
// Catch the 429 inside the processor; a 'failed' listener fires after the
// job's lock is released and cannot reliably re-delay the job.
const worker = new Worker('orders', async (job, token) => {
  try {
    await processOrder(job.data);
  } catch (err) {
    if (err.statusCode === 429 && err.headers?.['retry-after']) {
      const delay = parseInt(err.headers['retry-after'], 10) * 1000;
      await job.moveToDelayed(Date.now() + delay, token);
      throw new DelayedError(); // from 'bullmq': signals a delay, not a failure
    }
    throw err;
  }
});
```
This respects Shopify's own backoff window rather than guessing at a delay.
## Queue Segmentation: Do Not Mix Job Priorities
One shared queue creates priority inversion. A backlog of low-priority notification jobs will block high-priority order jobs from processing.
Run at least three separate queues:
| Priority | Job Types |
|---|---|
| High | Orders, payments, fulfillments |
| Standard | Inventory updates, product sync |
| Low | Notifications, analytics events |
Each queue gets its own concurrency setting and can be scaled independently.
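A simple way to implement the split is a topic-to-tier routing table. The queue names, concurrency numbers, and topic lists below are illustrative defaults I am assuming, not values from Shopify:

```javascript
// Illustrative tiers: each backs a separate queue with its own Worker
// concurrency, so a notification backlog cannot starve order jobs.
const TIERS = [
  { queue: 'webhooks-high', concurrency: 20,
    topics: ['orders/create', 'orders/paid', 'fulfillments/create'] },
  { queue: 'webhooks-standard', concurrency: 10,
    topics: ['inventory_levels/update', 'products/update'] },
  { queue: 'webhooks-low', concurrency: 5,
    topics: ['customers/update'] },
];

// Route a webhook topic to its queue; unmapped topics go to standard.
function queueForTopic(topic) {
  const tier = TIERS.find((t) => t.topics.includes(topic));
  return tier ? tier.queue : 'webhooks-standard';
}
```

The webhook handler calls `queueForTopic` at enqueue time, so adding a new topic is a one-line table change rather than a new code path.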
## Dead Letter Queues: Never Discard a Failed Job
Failed jobs fall into two categories:
Transient — network timeouts, rate limits, temporary API errors. Handle with exponential backoff.
Permanent — malformed data, logic errors, resource not found. Route to the DLQ after max retries.
Never silently discard a failed job. The DLQ is your audit trail. Every job that lands there represents a Shopify event that did not process — and potentially an order, inventory change, or fulfillment that needs manual recovery.
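A sketch of that transient-vs-permanent split, assuming errors carry an HTTP `statusCode` or a Node network error `code` (the exact shape depends on your API client):

```javascript
// Classify a worker error as transient (retry with backoff) or
// permanent (route to the DLQ). Error shapes are assumptions.
function classifyFailure(err) {
  const networkCodes = ['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED'];
  if (err.statusCode === 429 || err.statusCode >= 500 ||
      networkCodes.includes(err.code)) {
    return 'transient';
  }
  if (err.statusCode === 400 || err.statusCode === 404 ||
      err.statusCode === 422) {
    return 'permanent';
  }
  return 'transient'; // when unsure, let backoff retry before the DLQ
}

// In a BullMQ 'failed' listener you might then do (illustrative):
// if (classifyFailure(err) === 'permanent' || job.attemptsMade >= job.opts.attempts) {
//   await deadLetterQueue.add('dead', { ...job.data, reason: err.message });
// }
```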
## Production Monitoring: 5 Metrics That Matter
| Metric | Healthy Threshold | Action If Breached |
|---|---|---|
| Queue depth | Under 500 pending | Scale workers horizontally |
| Job failure rate | Under 1% | Inspect DLQ, audit API errors |
| Worker concurrency | Under 80% utilisation | Pre-scale before peak events |
| Job latency (p99) | Under 10 seconds | Optimise job logic or add workers |
| DLQ depth | 0 new jobs | Investigate immediately |
Export BullMQ metrics to Datadog or Prometheus and alert on queue depth before flash sale events — not during them.
Set alerts, not dashboards. Dashboards require someone to look at them. Alerts fire when something actually breaks.
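Those thresholds translate directly into an alert check. The numbers below mirror the table above; treat them as starting points to tune per app:

```javascript
// Thresholds from the table above: breach any of them and an alert fires.
const THRESHOLDS = {
  queueDepth: 500,        // pending jobs
  failureRate: 0.01,      // 1%
  workerUtilisation: 0.8, // 80%
  p99LatencyMs: 10000,    // 10 seconds
  dlqNewJobs: 0,          // any new DLQ job is an incident
};

// Compare a metrics snapshot against the thresholds and return the
// names of breached metrics (empty array means healthy).
function checkQueueHealth(metrics) {
  return Object.keys(THRESHOLDS).filter(
    (key) => metrics[key] > THRESHOLDS[key]
  );
}
```

A scheduled job can run this against your exported metrics and page on a non-empty result, which is exactly the alerts-over-dashboards posture.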
## Wrapping Up
Reliable Shopify queue infrastructure is not one decision. It is five deliberate ones made at every layer of your app — queue selection, job design, retry logic, segmentation, and observability.
Get any one of these wrong and a flash sale exposes it fast.
Full guide with component breakdowns, queue comparisons, and GraphQL worker optimisation here:
👉 https://kolachitech.com/shopify-queue-infrastructure/
Drop a comment if you want to go deeper on any of these patterns. Always happy to talk Shopify infrastructure.