Shopify gives your webhook endpoint 5 seconds to respond.
Miss it and the delivery is marked failed. Under load, that window disappears fast. I have seen well-built Shopify apps completely fall apart during a product launch — not because the logic was wrong, but because everything was running synchronously inside the handler.
This post breaks down how to fix it properly.
## The Core Problem
A single orders/create event can require:
- Inventory sync
- Fulfillment creation
- External ERP update
- Customer notification
That is four operations, each with its own latency and failure modes, all competing inside a 5-second window Shopify is actively timing.
During a flash sale, hundreds of these events can arrive every minute, many of them concurrently. Synchronous handling does not survive that.
## The Fix: Async Queue Infrastructure
Every production-grade Shopify webhook pipeline follows the same three-step contract:
```
Incoming Webhook
        |
        v
[ Validate HMAC ]  --> Return 200 OK immediately
        |
        v
[ Enqueue Job ]    --> Minimum payload only (IDs, not full objects)
        |
        v
[ Worker Process ] --> Business logic, retries, DLQ
```
Rule: Your HTTP layer never touches business logic. Your worker layer never touches HTTP.
The webhook handler does one thing — validate and enqueue. Everything else belongs to the worker.
## Choosing the Right Queue
| Queue | Best Fit | Delivery | Ops Overhead |
|---|---|---|---|
| BullMQ (Redis) | Node.js apps | At-least-once | Low |
| Amazon SQS FIFO | AWS-native apps | Exactly-once | Very low |
| RabbitMQ | Complex routing, multi-consumer | At-least-once | Medium |
| Sidekiq | Ruby / Rails apps | At-least-once | Low |
For most Node.js Shopify apps, BullMQ is the right default. Named queues, priority support, delayed jobs, exponential backoff, and a built-in dashboard (Bull Board) — all from a single Redis instance.
## Job Design: What Goes in the Queue
Store the minimum. Reference everything else from your database.
```javascript
await orderQueue.add(
  'process-order',
  {
    shop: 'your-store.myshopify.com',
    orderId: payload.id,
    topic: 'orders/create',
    receivedAt: Date.now(),
  },
  {
    attempts: 5,
    backoff: { type: 'exponential', delay: 2000 },
    removeOnComplete: 100,
    removeOnFail: 500,
  }
);
```
Never push the full webhook payload into Redis. Store the ID, fetch the full object inside the worker. Large payloads in Redis memory cause silent bloat that degrades queue performance over time.
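One way to keep that discipline is to give the processor its fetcher as a dependency. `fetchOrder` here is a stand-in for your Shopify Admin API client, not a real library call:

```javascript
// Processor sketch: the job carries only IDs; the worker re-fetches the
// full, current object. fetchOrder is a placeholder for your API client.
async function processOrderJob(data, fetchOrder) {
  const order = await fetchOrder(data.shop, data.orderId);
  if (!order) {
    // Permanent failure: the order no longer exists; let it reach the DLQ.
    throw new Error(`order ${data.orderId} not found on ${data.shop}`);
  }
  // ...inventory sync, fulfillment, ERP update, notification go here...
  return { orderId: order.id, processed: true };
}

// In BullMQ this plugs into a Worker, e.g.:
// new Worker('orders', (job) => processOrderJob(job.data, fetchOrder));
```

Fetching fresh data also sidesteps stale payloads: by the time a retried job runs, the webhook snapshot may be minutes old.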
## Handling Shopify API 429s Inside Workers
Shopify's GraphQL Admin API uses a cost-based bucket: 1,000 points, refilling at 50 points per second on standard plans.
Workers that ignore rate limits hammer the same endpoint and retry endlessly. Read the Retry-After header and use it:
```javascript
// Catch the 429 inside the processor; a 'failed' listener fires after the
// job's lock is released and cannot reliably re-delay the job.
const worker = new Worker('orders', async (job, token) => {
  try {
    await processOrder(job.data);
  } catch (err) {
    if (err.statusCode === 429 && err.headers?.['retry-after']) {
      const delay = parseInt(err.headers['retry-after'], 10) * 1000;
      await job.moveToDelayed(Date.now() + delay, token);
      throw new DelayedError(); // from 'bullmq': signals a delay, not a failure
    }
    throw err;
  }
});
```
This respects Shopify's own backoff window rather than guessing at a delay.
## Queue Segmentation: Do Not Mix Job Priorities
One shared queue creates priority inversion. A backlog of low-priority notification jobs will block high-priority order jobs from processing.
Run at least three separate queues:
| Priority | Job Types |
|---|---|
| High | Orders, payments, fulfillments |
| Standard | Inventory updates, product sync |
| Low | Notifications, analytics events |
Each queue gets its own concurrency setting and can be scaled independently.
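A simple way to implement the split is a topic-to-tier routing table. The queue names, concurrency numbers, and topic lists below are illustrative defaults I am assuming, not values from Shopify:

```javascript
// Illustrative tiers: each backs a separate queue with its own Worker
// concurrency, so a notification backlog cannot starve order jobs.
const TIERS = [
  { queue: 'webhooks-high', concurrency: 20,
    topics: ['orders/create', 'orders/paid', 'fulfillments/create'] },
  { queue: 'webhooks-standard', concurrency: 10,
    topics: ['inventory_levels/update', 'products/update'] },
  { queue: 'webhooks-low', concurrency: 5,
    topics: ['customers/update'] },
];

// Route a webhook topic to its queue; unmapped topics go to standard.
function queueForTopic(topic) {
  const tier = TIERS.find((t) => t.topics.includes(topic));
  return tier ? tier.queue : 'webhooks-standard';
}
```

The webhook handler calls `queueForTopic` at enqueue time, so adding a new topic is a one-line table change rather than a new code path.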
## Dead Letter Queues: Never Discard a Failed Job
Failed jobs fall into two categories:
Transient — network timeouts, rate limits, temporary API errors. Handle with exponential backoff.
Permanent — malformed data, logic errors, resource not found. Route to the DLQ after max retries.
Never silently discard a failed job. The DLQ is your audit trail. Every job that lands there represents a Shopify event that did not process — and potentially an order, inventory change, or fulfillment that needs manual recovery.
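A sketch of that transient-vs-permanent split, assuming errors carry an HTTP `statusCode` or a Node network error `code` (the exact shape depends on your API client):

```javascript
// Classify a worker error as transient (retry with backoff) or
// permanent (route to the DLQ). Error shapes are assumptions.
function classifyFailure(err) {
  const networkCodes = ['ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED'];
  if (err.statusCode === 429 || err.statusCode >= 500 ||
      networkCodes.includes(err.code)) {
    return 'transient';
  }
  if (err.statusCode === 400 || err.statusCode === 404 ||
      err.statusCode === 422) {
    return 'permanent';
  }
  return 'transient'; // when unsure, let backoff retry before the DLQ
}

// In a BullMQ 'failed' listener you might then do (illustrative):
// if (classifyFailure(err) === 'permanent' || job.attemptsMade >= job.opts.attempts) {
//   await deadLetterQueue.add('dead', { ...job.data, reason: err.message });
// }
```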
## Production Monitoring: 5 Metrics That Matter
| Metric | Healthy Threshold | Action If Breached |
|---|---|---|
| Queue depth | Under 500 pending | Scale workers horizontally |
| Job failure rate | Under 1% | Inspect DLQ, audit API errors |
| Worker concurrency | Under 80% utilisation | Pre-scale before peak events |
| Job latency (p99) | Under 10 seconds | Optimise job logic or add workers |
| DLQ depth | 0 new jobs | Investigate immediately |
Export BullMQ metrics to Datadog or Prometheus and alert on queue depth before flash sale events — not during them.
Set alerts, not dashboards. Dashboards require someone to look at them. Alerts fire when something actually breaks.
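Those thresholds translate directly into an alert check. The numbers below mirror the table above; treat them as starting points to tune per app:

```javascript
// Thresholds from the table above: breach any of them and an alert fires.
const THRESHOLDS = {
  queueDepth: 500,        // pending jobs
  failureRate: 0.01,      // 1%
  workerUtilisation: 0.8, // 80%
  p99LatencyMs: 10000,    // 10 seconds
  dlqNewJobs: 0,          // any new DLQ job is an incident
};

// Compare a metrics snapshot against the thresholds and return the
// names of breached metrics (empty array means healthy).
function checkQueueHealth(metrics) {
  return Object.keys(THRESHOLDS).filter(
    (key) => metrics[key] > THRESHOLDS[key]
  );
}
```

A scheduled job can run this against your exported metrics and page on a non-empty result, which is exactly the alerts-over-dashboards posture.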
## Wrapping Up
Reliable Shopify queue infrastructure is not one decision. It is five deliberate ones made at every layer of your app — queue selection, job design, retry logic, segmentation, and observability.
Get any one of these wrong and a flash sale exposes it fast.
Full guide with component breakdowns, queue comparisons, and GraphQL worker optimisation here:
👉 https://kolachitech.com/shopify-queue-infrastructure/
Drop a comment if you want to go deeper on any of these patterns. Always happy to talk Shopify infrastructure.