If you have shipped anything against the Shopify GraphQL API, you have probably been burned by this: the request returns 200 OK, your logging looks clean, and yet the order never got created.
That is the part of GraphQL that catches people off guard. The transport layer and the application layer fail separately. A clean HTTP status tells you almost nothing.
Below are the eight patterns I lean on to build clients that hold up under real traffic. Think of it less as theory and more as the checklist I run through before calling an integration "done."
First, stop trusting the status code
REST trained us to read the HTTP status and move on. 404 = missing, 500 = server died, body is an afterthought.
GraphQL flips that. A response can succeed at the HTTP level while quietly failing in the payload:
{
  "data": { "product": null },
  "errors": [
    {
      "message": "Throttled",
      "extensions": { "code": "THROTTLED" }
    }
  ]
}
Two takeaways:
- data can hold partial results.
- errors explains what broke.
Your client has to read both, every time.
Classify before you react
Different failures want different responses. Retrying a validation error is just burning rate limit for nothing.
| Error type | Where it shows up | Retryable? | What to do |
|---|---|---|---|
| Network failure | Transport layer | Yes | Retry with backoff |
| Throttling (THROTTLED / 429) | Top-level errors | Yes | Wait, then retry |
| Server error (5xx) | HTTP status | Yes | Retry with backoff |
| User errors | userErrors field | No | Fix input, surface to user |
| Validation errors | Top-level errors | No | Fix the query |
| Auth errors | HTTP 401 / 403 | No | Refresh token / re-auth |
Rule of thumb: transient errors fix themselves with time, everything else needs a code or input change.
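The table above can be turned into a small classifier. This is a sketch, not a definitive implementation: the input shape ({ networkError, status, body }) and the category names are my own assumptions about what your HTTP layer hands back, and treating every remaining top-level error as "validation" is a simplification.

```javascript
// Hypothetical input shape (an assumption, not a Shopify type):
//   { networkError: Error|null, status: number|null, body: object|null }
const RETRYABLE_KINDS = new Set(["network", "throttled", "server"]);

function classify({ networkError = null, status = null, body = null }) {
  // Transport-level failures come first: there is no body to inspect.
  if (networkError) return "network";
  if (status === 401 || status === 403) return "auth";
  if (status === 429) return "throttled";
  if (status >= 500) return "server";

  // HTTP succeeded, so the payload decides.
  const errors = body?.errors ?? [];
  if (errors.some((e) => e.extensions?.code === "THROTTLED")) return "throttled";
  if (errors.length > 0) return "validation"; // simplification: all other top-level errors

  // Mutation payloads carry business-logic failures in userErrors.
  const payloads = Object.values(body?.data ?? {});
  if (payloads.some((p) => p?.userErrors?.length > 0)) return "user";

  return "ok";
}

function isRetryableKind(kind) {
  return RETRYABLE_KINDS.has(kind);
}
```

Only the three transient kinds are retryable. Everything else goes straight to the caller.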
Pattern 1: Treat userErrors as a real failure
This is the one that bites everyone. Mutations report through two channels: top-level errors for system stuff, and userErrors inside the payload for business-logic stuff.
{
  "data": {
    "productUpdate": {
      "product": null,
      "userErrors": [
        { "field": ["title"], "message": "Title can't be blank" }
      ]
    }
  }
}
HTTP 200. Empty top-level errors. Still failed.
function handleMutation(response) {
  const payload = response.data?.productUpdate;
  const userErrors = payload?.userErrors ?? [];
  if (userErrors.length > 0) {
    // Business logic failure: do NOT retry.
    return { success: false, errors: userErrors };
  }
  if (!payload) {
    // No payload at all: the failure is in top-level errors instead.
    return { success: false, errors: response.errors ?? [] };
  }
  return { success: true, data: payload.product };
}
Skip this check and you get silent data loss. Always treat a non-empty userErrors as a failure.
Pattern 2: Exponential backoff (with jitter)
Transient errors deserve a retry. Blind retries make everything worse — hammering a throttled API just deepens the throttle.
// isRetryable(err) and sleep(ms) are helpers you supply.
async function withRetry(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-transient errors or once attempts are exhausted.
      if (!isRetryable(err) || attempt === maxRetries - 1) throw err;
      const delay = Math.min(1000 * 2 ** attempt, 30000); // cap at 30 s
      await sleep(delay + Math.random() * 200); // jitter
    }
  }
}
The Math.random() is not decoration. Without jitter, every client retries at the same instant and you get a thundering herd that re-crashes the API. The Math.min caps the delay so waits stay sane.
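withRetry leans on two helpers, isRetryable and sleep. Here is one way they might look. Treat it as a sketch: the error shape (a Node-style code, an HTTP status, or a GraphQL extensions.code attached to the thrown error) is my assumption about how your request function reports failures, and backoffDelay is a hypothetical name for the delay formula pulled out so it can be inspected.

```javascript
// Promise-based sleep built on setTimeout.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Node.js error codes that usually indicate a transient network problem.
const TRANSIENT_NETWORK_CODES = new Set([
  "ECONNRESET",
  "ETIMEDOUT",
  "ECONNREFUSED",
  "EAI_AGAIN",
]);

// Assumption: thrown errors carry `code`, `status`, or `extensions.code`.
function isRetryable(err) {
  if (TRANSIENT_NETWORK_CODES.has(err?.code)) return true;
  if (err?.status === 429 || err?.status >= 500) return true;
  return err?.extensions?.code === "THROTTLED";
}

// The delay schedule from withRetry, without the jitter term.
function backoffDelay(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(baseMs * 2 ** attempt, capMs);
}
```

With the defaults, the delays run 1000, 2000, 4000, 8000, 16000 ms, and every attempt from the sixth on is capped at 30000 ms.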
Pattern 3: Throttle before Shopify throttles you
Reactive retries are fine. Proactive throttling is better. Shopify hands you a query-cost budget on every response — read it:
{
  "extensions": {
    "cost": {
      "requestedQueryCost": 101,
      "throttleStatus": {
        "maximumAvailable": 1000,
        "currentlyAvailable": 899,
        "restoreRate": 50
      }
    }
  }
}
Track currentlyAvailable. When it dips low, slow yourself down instead of waiting for a THROTTLED slap.
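A minimal sketch of what "slow yourself down" could look like. The CostBudget class, its reserve margin, and the waitFor method are all hypothetical names of mine. The only things taken from the response are the three throttleStatus fields shown above, and I am assuming restoreRate is expressed in points per second.

```javascript
// Remembers the last throttleStatus seen and estimates how long to wait
// before a query of a given cost fits in the budget.
class CostBudget {
  constructor(reserve = 100) {
    this.reserve = reserve;    // hypothetical safety margin, in cost points
    this.available = Infinity; // unknown until the first response arrives
    this.maximum = Infinity;
    this.restoreRate = 0;      // assumed to be points per second
    this.updatedAt = 0;
  }

  // Feed every response's extensions.cost.throttleStatus into this.
  update(throttleStatus, now = Date.now()) {
    this.available = throttleStatus.currentlyAvailable;
    this.maximum = throttleStatus.maximumAvailable;
    this.restoreRate = throttleStatus.restoreRate;
    this.updatedAt = now;
  }

  // Milliseconds to wait until `cost` points are available above the reserve.
  waitFor(cost, now = Date.now()) {
    if (this.available === Infinity) return 0; // nothing observed yet
    // Estimate how much has refilled since the last response.
    const restored = ((now - this.updatedAt) * this.restoreRate) / 1000;
    const estimate = Math.min(this.maximum, this.available + restored);
    const deficit = cost + this.reserve - estimate;
    if (deficit <= 0) return 0;
    return Math.ceil((deficit * 1000) / this.restoreRate);
  }
}
```

Call update() after every response and sleep for waitFor(expectedCost) before the next request. With the sample response above (899 available), a 101-point query needs no wait at all. The pause only kicks in once the bucket runs close to the reserve.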
Pattern 4: Circuit breakers for cascading failures
Sometimes the problem is upstream. Retrying a struggling Shopify just piles on load. A circuit breaker fails fast instead.
| State | Behavior | Transition |
|---|---|---|
| Closed | Requests flow | Opens after failure threshold |
| Open | Fail fast, no calls | Half-open after cooldown |
| Half-open | One test request | Closes on success, reopens on failure |
class CircuitBreaker {
  constructor(threshold = 5, cooldown = 30000) {
    this.failures = 0;
    this.threshold = threshold; // failures before the circuit opens
    this.cooldown = cooldown;   // ms to stay open before a test request
    this.state = "closed";
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === "open") {
      if (Date.now() - this.openedAt > this.cooldown) {
        // Cooldown elapsed: let a test request through.
        this.state = "half-open";
      } else {
        throw new Error("Circuit open");
      }
    }
    try {
      const result = await fn();
      this.reset();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  recordFailure() {
    this.failures++;
    // In half-open the count is still at the threshold, so one failure reopens.
    if (this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = Date.now();
    }
  }

  reset() {
    this.failures = 0;
    this.state = "closed";
  }
}
Pattern 5: Use partial data, don't discard it
GraphQL can return some good fields alongside some failed ones. Don't throw the whole response away — log the bad parts, use the good ones.
function processResponse(response) {
  const { data, errors } = response;
  if (errors?.length && data) {
    logErrors(errors);
    return { partial: true, data };
  }
  if (errors?.length) return { partial: false, errors };
  return { partial: false, data };
}
This matters most on bulk reads. One missing product should never nuke an entire catalog sync.
Pattern 6: Idempotency makes retries safe
Here is the trap: a mutation creates an order, the network times out before you get the response, your retry logic fires, and now you have two orders.
Idempotency keys fix this. Attach a unique key per logical operation, reuse that same key on every retry of it, and a server that honors the key returns the original result instead of running the mutation again. Without this, every retry path you build is a duplicate-data risk.
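The part that is easy to get wrong is where the key gets generated. A rough sketch, assuming a hypothetical send(key) function that performs the mutation and forwards the key however your target API expects it (a header, an argument, or similar). That transport detail is deliberately left out because it varies by endpoint. Backoff and the retryability check are also omitted here to keep the focus on key reuse.

```javascript
const { randomUUID } = require("crypto");

// Runs `send(key)` up to `maxAttempts` times, reusing ONE key for all attempts.
// `send` is hypothetical: it performs the mutation and passes the key along.
async function sendIdempotent(send, maxAttempts = 3) {
  const key = randomUUID(); // generated once, outside the loop
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await send(key);
    } catch (err) {
      lastError = err; // simplified: a real client would back off and classify
    }
  }
  throw lastError;
}
```

If the key were generated inside the loop, each retry would look like a brand-new operation to the server and the duplicate order would come right back.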
Pattern 7: Dead letter queues for the ones that still fail
Some operations exhaust every retry. Don't drop them silently:
- Operation fails after all retries.
- Push the payload + error context to a dead letter queue.
- Alert the team.
- Fix the cause, then replay from the queue.
This is your safety net during outages: the difference between "we lost 40 minutes of orders" and "we replayed them once the API recovered."
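The push-and-replay steps can be sketched with an in-memory stand-in. In production this would be a durable queue or a database table. The DeadLetterQueue class and its method names are hypothetical, and the alerting step is left out.

```javascript
// In-memory stand-in for a durable dead letter queue.
class DeadLetterQueue {
  constructor() {
    this.entries = [];
  }

  // Store the payload together with enough context to debug it later.
  push(payload, error) {
    this.entries.push({
      payload,
      error: error.message,
      failedAt: new Date().toISOString(),
    });
  }

  // Re-runs every entry through `handler`. Entries that fail again are
  // re-queued. Returns the number of entries that succeeded.
  async replay(handler) {
    const pending = this.entries;
    this.entries = [];
    let replayed = 0;
    for (const entry of pending) {
      try {
        await handler(entry.payload);
        replayed++;
      } catch (err) {
        this.push(entry.payload, err);
      }
    }
    return replayed;
  }
}
```

Replay only stays safe if the handler is idempotent. Otherwise an entry that actually succeeded before its response was lost gets applied a second time.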
Pattern 8: You can't fix what you can't see
Recovery without observability is guessing. Track these per operation:
| Metric | What it tells you |
|---|---|
| Error rate by type | Which failures dominate |
| Retry count | How hard the client is working |
| Throttle frequency | Whether you're over budget |
| Circuit breaker trips | When things are degrading |
| Query cost trends | Where your budget goes |
Log full context: query, variables, response, timing.
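One way to capture that context is a thin wrapper around each call. This is a sketch under assumptions: execute (sends the request and resolves to the parsed response body) and log are hypothetical hooks you would supply, and the record logs a trimmed summary of the response rather than the full body.

```javascript
// Wraps one GraphQL call and emits a single structured log record for it.
// `execute` and `log` are hypothetical hooks, not part of any library.
async function instrumented(name, query, variables, execute, log = console.log) {
  const startedAt = Date.now();
  const record = { operation: name, query, variables };
  try {
    const response = await execute(query, variables);
    const errors = response.errors ?? [];
    record.outcome = errors.length > 0 ? "graphql_error" : "ok";
    record.errorCodes = errors.map((e) => e.extensions?.code ?? "UNKNOWN");
    record.queryCost = response.extensions?.cost?.requestedQueryCost ?? null;
    return response;
  } catch (err) {
    record.outcome = "transport_error";
    record.error = err.message;
    throw err;
  } finally {
    // Runs on both paths, so every call produces exactly one record.
    record.durationMs = Date.now() - startedAt;
    log(JSON.stringify(record));
  }
}
```

Counting records by outcome and errorCodes gives you the error-rate and throttle-frequency rows of the table. queryCost over time gives you the cost trend.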
Putting it together
The patterns stack in a specific order:
- Classify the error.
- Apply idempotency keys before any retry.
- Retry transient errors with backoff + jitter.
- Throttle proactively using cost data.
- Break the circuit when failures cascade.
- Queue the unrecoverable ones.
- Monitor all of it.
Build this in from day one. Retrofitting recovery into a live system costs far more than designing it up front — usually right when you can least afford it (hi, Black Friday).
I originally wrote a longer version of this with extra implementation notes on the Kolachi Tech blog: Error Recovery Patterns in Shopify GraphQL. What does your retry + idempotency setup look like? Curious how other people handle the dead-letter replay step.