Most Shopify middleware works fine in staging. It breaks in production, during a sale, at 11pm on a Friday.
Across dozens of integration audits, the failures are almost always the same. Not exotic bugs. Simple architecture decisions made when the system was small and never revisited.
The Pattern That Breaks Everything
Webhook received
→ Direct API call to ERP
→ ERP times out (3s)
→ Shopify retries
→ Duplicate event processed
→ Inventory corrupted
→ Weekend on-call incident
The fix is not a try-catch block. It is replacing the pattern entirely.
- Rule 1: Acknowledge the webhook in under 200ms. Process it asynchronously.
- Rule 2: Write state and outbox entries in one transaction. Let a worker handle external calls.
- Rule 3: Tag every event with an idempotency key before it touches anything.
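A minimal sketch of Rules 1 and 3 in Node.js. It assumes the raw request body is available as a string and that `queue.publish` is your broker client (a hypothetical interface, not a real library call). Shopify signs webhook bodies with HMAC-SHA256 and sends the base64 digest in `X-Shopify-Hmac-Sha256`. Using `X-Shopify-Webhook-Id` as the idempotency key is one reasonable choice, not the only one.

```javascript
const crypto = require('node:crypto');

// Verify the signature against the RAW body, before any parsing.
function verifyShopifyHmac(rawBody, secret, hmacHeader) {
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest();
  const received = Buffer.from(hmacHeader || '', 'base64');
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return received.length === expected.length &&
    crypto.timingSafeEqual(received, expected);
}

// Hypothetical handler shape: returns the HTTP status to send.
// No parsing, no transformation, no outbound API calls.
async function handleWebhook({ headers, rawBody }, { secret, queue }) {
  if (!verifyShopifyHmac(rawBody, secret, headers['x-shopify-hmac-sha256'])) {
    return { status: 401 };
  }
  await queue.publish({
    idempotencyKey: headers['x-shopify-webhook-id'], // Rule 3: tag first
    topic: headers['x-shopify-topic'],
    rawBody, // left untouched for the transformer
  });
  return { status: 200 }; // Rule 1: ACK now, process later
}
```

The only awaited work before the ACK is a durable publish. If the broker itself is slow, that is the one dependency worth alerting on here.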
Reference Architecture
 [Shopify]    [ERP]     [WMS]  [Marketplaces]
     |          |         |           |
     v          v         v           v
 ----- API Gateway / Webhook Receiver -----
                      |
              [Message Broker]
                      |
    +-----------------+-----------------+
    |                 |                 |
[Transformer]     [Router]         [Audit Log]
    |                 |
    v                 v
[Domain Services: Orders, Inventory, Customers]
                      |
                      v
             [Outbound Workers]
                      |
                      v
   [Shopify Admin API]   [3rd Party APIs]
Every layer has one job.
- Webhook receivers do not transform.
- Transformers do not call APIs.
- Workers do not own state.
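To make "one job per layer" concrete, here is a sketch of a transformer as a pure function. It assumes an envelope with `idempotencyKey` and `rawBody` fields (hypothetical names) and an inventory level payload carrying `inventory_item_id`, `location_id`, `available`, and `updated_at`. The canonical field names on the output are illustrative, not a standard.

```javascript
// Pure function: same input, same output, no I/O, no API calls.
function toCanonicalInventoryEvent(envelope) {
  const body = JSON.parse(envelope.rawBody);
  if (body.inventory_item_id == null || body.location_id == null) {
    // Let the caller route this to a dead letter queue.
    throw new TypeError('inventory payload is missing required ids');
  }
  return {
    event_id: envelope.idempotencyKey,
    type: 'inventory.level_set',
    inventory_item_id: String(body.inventory_item_id),
    location_id: String(body.location_id),
    absolute: body.available, // absolute quantity, not a delta
    occurred_at: body.updated_at,
  };
}
```

Resolving `inventory_item_id` to a SKU needs a lookup, so it belongs in a domain service, not here.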
Core Principles Before You Write Any Code
| Principle | What It Means | Why It Matters |
|---|---|---|
| Loose coupling | Services talk via events, not direct calls | One slow service does not block others |
| Idempotency | Same event applied twice = same result | Safe to retry without double charges |
| Backpressure | Upstream slows when consumers lag | No queue explosions or OOM crashes |
| Observability | Every event is traceable end-to-end | You can actually debug incidents |
| Graceful degradation | Non-critical features fail soft | Core checkout never goes down |
Building Blocks and Their Resilience Role
| Component | Role | Resilience Pattern |
|---|---|---|
| API Gateway | Receive and validate inbound traffic | Rate limiting, schema validation |
| Webhook Receiver | Accept Shopify events | Fast 200 ACK, async processing |
| Message Broker | Decouple producers from consumers | Durable queues, partitioning |
| Transformer | Map between data formats | Pure functions, no side effects |
| Domain Services | Hold canonical state | Postgres with row-level locking |
| Outbound Workers | Push updates to external systems | Retries, circuit breakers |
| Audit Log | Record every event | Append-only, 30-90 day retention |
Failure Modes You Must Design For
| Failure Mode | Trigger | Mitigation |
|---|---|---|
| Shopify rate limits | Burst traffic, bulk updates | Token bucket, respect X-Shopify-Shop-Api-Call-Limit |
| Webhook retry storms | Shopify re-delivers for 48h | Idempotency keys + dedup table |
| Network timeouts | Flaky third-party API | Circuit breaker + exponential backoff |
| Bad payloads | Schema changes upstream | Strict validation + dead letter queue |
| Slow consumers | Heavy DB writes | Backpressure + queue partitioning |
| Region outage | Vendor failure | Multi-region failover + replay logs |
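For the rate limit row, a small sketch of reading the REST Admin API header `X-Shopify-Shop-Api-Call-Limit`, which arrives as `used/limit` (for example `32/40`). The 80% threshold and 1s pause in `suggestedPauseMs` are assumptions to tune, not Shopify recommendations.

```javascript
// Parses "32/40" into numbers. Returns null if the header is absent or malformed.
function parseCallLimit(headerValue) {
  const match = /^(\d+)\/(\d+)$/.exec(headerValue || '');
  if (!match) return null;
  const used = Number(match[1]);
  const limit = Number(match[2]);
  return { used, limit, remaining: limit - used };
}

// Hypothetical policy: pause the worker once the bucket is mostly full.
function suggestedPauseMs(headerValue, { threshold = 0.8, pauseMs = 1000 } = {}) {
  const parsed = parseCallLimit(headerValue);
  if (!parsed || parsed.limit === 0) return 0;
  return parsed.used / parsed.limit >= threshold ? pauseMs : 0;
}
```

A worker would call `suggestedPauseMs(response.headers.get('X-Shopify-Shop-Api-Call-Limit'))` after each REST response and sleep for the returned duration before the next call.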
The 5 Resilience Patterns That Actually Work
1. Circuit Breaker
Watches error rates on outbound calls. Stops calling a failing service when errors spike above threshold. Gives the dependency time to recover.
CLOSED → (error rate > 50% over 30s) → OPEN
OPEN → (wait 60s) → HALF-OPEN
HALF-OPEN → (3 successful probes) → CLOSED
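The state machine above can be sketched in a few dozen lines. This version uses the thresholds from the diagram as defaults. `minSamples` is an added assumption so that one failed call on a quiet service does not open the circuit, and the `now` parameter exists only to make the clock injectable in tests.

```javascript
class CircuitBreaker {
  constructor({ windowMs = 30000, errorThreshold = 0.5, openMs = 60000,
                probesToClose = 3, minSamples = 10, now = Date.now } = {}) {
    Object.assign(this, { windowMs, errorThreshold, openMs, probesToClose, minSamples, now });
    this.state = 'CLOSED';
    this.samples = []; // { at, ok } entries inside the rolling window
    this.openedAt = 0;
    this.probeSuccesses = 0;
  }

  // True if a call may go out right now. Moves OPEN to HALF_OPEN after openMs.
  allow() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.openMs) {
      this.state = 'HALF_OPEN';
      this.probeSuccesses = 0;
    }
    return this.state !== 'OPEN';
  }

  // Report the outcome of a call that allow() let through.
  record(ok) {
    if (this.state === 'OPEN') return; // late result from before the trip
    if (this.state === 'HALF_OPEN') {
      if (!ok) return this.trip(); // a failed probe re-opens the circuit
      if (++this.probeSuccesses >= this.probesToClose) {
        this.state = 'CLOSED';
        this.samples = [];
      }
      return;
    }
    const t = this.now();
    this.samples.push({ at: t, ok });
    this.samples = this.samples.filter((s) => t - s.at <= this.windowMs);
    const failures = this.samples.filter((s) => !s.ok).length;
    if (this.samples.length >= this.minSamples &&
        failures / this.samples.length > this.errorThreshold) {
      this.trip();
    }
  }

  trip() {
    this.state = 'OPEN';
    this.openedAt = this.now();
    this.samples = [];
  }
}
```

Callers check `breaker.allow()` before each outbound call and report the result with `breaker.record(true)` or `breaker.record(false)`. This sketch does not limit how many probes run at once while half-open; a production breaker usually would.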
2. Exponential Backoff with Jitter
Naive retries create thundering herds. Add jitter to spread the load.
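function getBackoffDelay(attempt) {
  // Exponential base: 200ms, doubling per attempt, capped at 60s.
  const base = Math.min(200 * Math.pow(2, attempt), 60000);
  // Add 0-20% random jitter so retries from many workers spread out.
  const jitter = base * 0.2 * Math.random();
  return base + jitter;
}
// attempt 0 → 200-240ms
// attempt 1 → 400-480ms
// attempt 2 → 800-960ms
// attempt 5 → 6400-7680ms
// attempt 9+ → 60-72s (base capped at 60s)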
3. Bulkheads
Inventory updates run on a separate worker pool from order pushes. One clogged queue cannot block the other.
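A minimal in-process sketch of a bulkhead: each workload gets its own concurrency budget, so a backlog in one cannot consume the slots of the other. The pool sizes below are placeholders. In a real deployment the same idea often means separate queues and separate worker processes.

```javascript
// A tiny concurrency limiter. Tasks beyond maxConcurrent wait in line.
function createBulkhead(maxConcurrent) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => { active--; next(); });
  };
  return {
    run(task) {
      return new Promise((resolve, reject) => {
        waiting.push({ task, resolve, reject });
        next();
      });
    },
    stats: () => ({ active, queued: waiting.length }),
  };
}

// Separate budgets: a stalled inventory push cannot starve order pushes.
const inventoryPool = createBulkhead(4);
const orderPool = createBulkhead(8);
```

Usage looks like `inventoryPool.run(() => pushInventory(event))`, where `pushInventory` is whatever outbound call that workload makes.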
4. Timeouts on Everything
// Every outbound call needs a timeout. No exceptions.
const response = await fetch(url, {
  signal: AbortSignal.timeout(5000) // 5s for sync calls
});
5. Dead Letter Queue
After N failed retries, move the event to a DLQ.
- Alert on growth
- Replay after fixing root cause
- Never let a bad event block the main pipeline
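One way to wire retries and a DLQ together, as a sketch. `handler` and `dlq` are hypothetical interfaces: `handler(event)` does the work and `dlq.push(entry)` stores the failed event for later replay. The default `delayFor` mirrors the 200ms doubling curve without jitter; pass your own backoff function in practice.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function processWithDlq(event, handler, dlq, {
  maxAttempts = 5,
  delayFor = (attempt) => Math.min(200 * 2 ** attempt, 60000),
  wait = sleep,
} = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return { status: 'ok', result: await handler(event) };
    } catch (err) {
      lastError = err;
      // No point waiting after the final attempt.
      if (attempt < maxAttempts - 1) await wait(delayFor(attempt));
    }
  }
  // Out of attempts: park the event so the main pipeline keeps moving.
  await dlq.push({
    event,
    error: lastError instanceof Error ? lastError.message : String(lastError),
    attempts: maxAttempts,
    failedAt: new Date().toISOString(),
  });
  return { status: 'dead-lettered' };
}
```

Keeping the error message and attempt count on the DLQ entry makes replay decisions easier once the root cause is fixed.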
Idempotency: The Most Important Property
There is no exactly-once delivery.
There is only:
- at-least-once delivery
- plus idempotent consumers
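async function processInventoryEvent(event) {
  const key = `idem:${event.event_id}`;
  // Claim the key atomically (SET ... NX, ioredis-style arguments assumed).
  // A separate get-then-set lets two concurrent deliveries both pass the check.
  // The 60s lease expires on its own if this process crashes mid-flight.
  const claimed = await redis.set(key, 'processing', 'EX', 60, 'NX');
  if (claimed !== 'OK') return; // duplicate or already in flight, skip
  try {
    await db.transaction(async (trx) => {
      // Absolute quantity, not a delta: replaying this write is harmless.
      await trx('inventory')
        .where({ sku: event.sku, location_id: event.location_id })
        .update({ quantity: event.absolute });
      await trx('outbox').insert({ payload: event, status: 'pending' });
    });
  } catch (err) {
    await redis.del(key); // release the claim so a retry can reprocess
    throw err;
  }
  await redis.set(key, 'done', 'EX', 604800); // 7-day TTL
}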
The Outbox Pattern: One Rule That Eliminates Dual-Write Bugs
When you change internal state and need to notify an external system, never do both in the same step.
WITHOUT OUTBOX (dangerous)
1. UPDATE inventory SET quantity = 42
2. POST /shopify/inventory_levels
(call fails? local state changed but Shopify never hears about it. call succeeds but the response is lost? a retry sends a duplicate)
WITH OUTBOX (safe)
1. BEGIN TRANSACTION
UPDATE inventory SET quantity = 42
INSERT INTO outbox (payload, status) VALUES (...)
2. COMMIT
3. Separate worker polls outbox
→ calls Shopify
→ marks row done
If Shopify is slow, the row waits.
If the worker crashes, it restarts and retries.
Internal state is always consistent.
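Step 3 can be sketched as a single polling pass. The array here is an in-memory stand-in for the outbox table, and `send` is a hypothetical function wrapping the outbound API call. With a real database, the worker would also need to claim rows so two workers do not pick up the same one.

```javascript
// One polling pass: deliver pending rows, mark successes, leave failures pending.
async function drainOutbox(outbox, send, { batchSize = 10 } = {}) {
  const batch = outbox
    .filter((row) => row.status === 'pending')
    .slice(0, batchSize);
  for (const row of batch) {
    try {
      await send(row.payload); // external call, outside the state transaction
      row.status = 'done';
    } catch (err) {
      // Row stays pending and is picked up again on the next pass.
      row.attempts = (row.attempts || 0) + 1;
      row.last_error = err instanceof Error ? err.message : String(err);
    }
  }
  return batch.length;
}
```

Delivery is at-least-once: if the worker crashes after `send` succeeds but before the row is marked done, the call repeats. That is why the outbound call needs to be idempotent too.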
Distributed Shopify Inventory Sync as a Real Example
| Step | What Middleware Does |
|---|---|
| 1 | Receives inventory webhook from Shopify |
| 2 | Returns 200 immediately, publishes to broker |
| 3 | Transformer converts to canonical event schema |
| 4 | Inventory service applies the update with optimistic concurrency |
| 5 | Outbox records pending downstream push |
| 6 | Worker pushes to ERP, WMS, and marketplace channels |
| 7 | Reconciler validates full state every 5 minutes |
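The reconciler in step 7 boils down to a diff between two snapshots. A sketch, assuming both sides can be loaded as a `Map` from a key such as `sku|location` to a quantity (the key format is an assumption):

```javascript
// Returns every key where the two snapshots disagree or one side is missing.
function diffInventory(local, remote) {
  const drift = [];
  for (const [key, quantity] of local) {
    if (!remote.has(key)) {
      drift.push({ key, local: quantity, remote: null });
    } else if (remote.get(key) !== quantity) {
      drift.push({ key, local: quantity, remote: remote.get(key) });
    }
  }
  for (const [key, quantity] of remote) {
    if (!local.has(key)) drift.push({ key, local: null, remote: quantity });
  }
  return drift;
}
```

What to do with the drift is a policy decision. Alerting on it is a safe start; auto-correcting needs a clear rule about which side is the source of truth.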
Observability Baseline
You cannot operate what you cannot see.
Four pillars, from day one:
| Pillar | Tool Options | Key Question Answered |
|---|---|---|
| Logs | Datadog, ELK, Loki | What happened to event X? |
| Metrics | Prometheus, CloudWatch | Is queue lag growing? |
| Traces | OpenTelemetry, Jaeger | Where did request Y spend time? |
| Alerts | PagerDuty, Opsgenie | What needs attention right now? |
Metrics that catch the most incidents
- Queue consumer lag (alert at > 30s)
- Dead letter queue depth (alert at > 100 unprocessed)
- Outbound API error rate (alert at > 2%)
- Webhook 5xx rate (alert at > 0.5%)
- DB connection pool saturation (alert at > 80%)
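The thresholds above fit in a small config object that an alert check can evaluate. The metric names are hypothetical; map them to whatever your metrics system exposes.

```javascript
// Rates and saturation are fractions (0.02 = 2%).
const thresholds = {
  consumerLagSeconds: 30,
  dlqDepth: 100,
  outboundErrorRate: 0.02,
  webhook5xxRate: 0.005,
  dbPoolSaturation: 0.8,
};

// Returns the names of metrics that are strictly above their threshold.
// Metrics that are missing from the sample are ignored.
function breachedAlerts(metrics, limits = thresholds) {
  return Object.keys(limits).filter(
    (name) => typeof metrics[name] === 'number' && metrics[name] > limits[name]
  );
}
```

In practice these rules usually live in the alerting tool itself. Keeping a copy in code is mainly useful for tests and dashboards.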
Race Condition Handling
Two webhooks for the same SKU arrive in the same millisecond.
Without protection, one update silently overwrites the other.
Option 1: Partition by key (preferred)
Hash SKU to a stream partition.
All events for one SKU process in order on one consumer.
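A sketch of the key-to-partition mapping, using a 32-bit FNV-1a hash so the example needs no dependencies. Brokers such as Kafka do the equivalent for you when a message key is set, so treat this as an illustration of the idea, not something you would normally hand-roll.

```javascript
// Deterministic: the same key always lands on the same partition.
function partitionFor(key, partitionCount) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (const byte of Buffer.from(String(key), 'utf8')) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // FNV prime, kept as unsigned 32-bit
  }
  return hash % partitionCount;
}
```

Changing `partitionCount` remaps keys, so in-flight events for one SKU can briefly span two partitions during a resize.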
Option 2: Optimistic concurrency
UPDATE inventory
SET quantity = 40,
version = version + 1
WHERE sku = 'TSHIRT-RED-M'
AND location_id = 'wh_LA'
AND version = 7;
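The UPDATE only helps if the caller checks the affected row count and retries on zero. A sketch of that loop, with an in-memory `Map` standing in for the table and `compareAndSet` standing in for the versioned UPDATE (both are illustrative helpers, not a database API):

```javascript
// Stand-in for: UPDATE ... SET quantity = ?, version = version + 1 WHERE version = ?
// Returns the number of rows affected (0 or 1).
function compareAndSet(rows, key, expectedVersion, quantity) {
  const row = rows.get(key);
  if (!row || row.version !== expectedVersion) return 0;
  rows.set(key, { quantity, version: row.version + 1 });
  return 1;
}

// Read, compute, write. If another writer got in first, re-read and try again.
function updateWithRetry(rows, key, computeQuantity, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const current = rows.get(key);
    if (!current) throw new Error(`unknown key: ${key}`);
    const next = computeQuantity(current.quantity);
    if (compareAndSet(rows, key, current.version, next) === 1) {
      return rows.get(key);
    }
  }
  throw new Error(`version conflict on ${key} after ${maxAttempts} attempts`);
}
```

Zero affected rows is not an error. It means the row changed since it was read, and the new value has to be recomputed from fresh state.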
Option 3: Distributed locks (last resort)
Serialize via Redis or ZooKeeper.
Works, but adds latency and contention.
Common Mistakes That Cost Merchants Real Money
- Calling Shopify or ERP APIs directly from webhook handlers
- Retries with no backoff (thundering herd on every incident)
- Infinite retries (burns rate limit quota silently)
- Logs without trace IDs (cannot debug incidents after the fact)
- No dead letter queue (one bad event blocks the whole pipeline)
- Ignoring Shopify schema changes in API responses
Build vs Buy
| Scenario | Recommendation |
|---|---|
| Simple ERP sync, low volume | Use a connector (Celigo, Pipe17) |
| Custom flows, mid volume | Hybrid: connector + custom services |
| Complex multichannel, high volume | Build a custom middleware platform |
| Regulated industry | Build with audit trail baked in |
The Full Architecture Guide
This post covers the core patterns.
The full guide on our blog goes deeper with:
- Complete component-by-component breakdown
- Database schema design notes
- Serverless vs container trade-offs
- REST vs GraphQL decision framework
- Caching layer design
- Phased implementation roadmap
👉 Read it here:
Designing Resilient Shopify Middleware
Final Thought
Resilience is not a feature you bolt on later.
It is an architectural property.
The systems that survive Black Friday are usually the systems that were designed to fail safely from day one.