Muhammad Masad Ashraf

Posted on Jun 29 • Originally published at kolachitech.com

8 Error Recovery Patterns That Keep Shopify GraphQL Clients Alive

#shopify #graphql #javascript #webdev

If you have shipped anything against the Shopify GraphQL API, you have probably been burned by this: the request returns 200 OK, your logging looks clean, and yet the order never got created.

That is the part of GraphQL that catches people off guard. The transport layer and the application layer fail separately. A clean HTTP status tells you almost nothing.

Below are the eight patterns I lean on to build clients that hold up under real traffic. Think of it less as theory and more as the checklist I run through before calling an integration "done."

First, stop trusting the status code

REST trained us to read the HTTP status and move on. 404 = missing, 500 = server died, body is an afterthought.

GraphQL flips that. A response can succeed at the HTTP level while quietly failing in the payload:

{
  "data": { "product": null },
  "errors": [
    {
      "message": "Throttled",
      "extensions": { "code": "THROTTLED" }
    }
  ]
}

Two takeaways:

data can hold partial results.
errors explains what broke.

Your client has to read both, every time.

Classify before you react

Different failures want different responses. Retrying a validation error is just burning rate limit for nothing.

Error type	Where it shows up	Retryable?	What to do
Network failure	Transport layer	Yes	Retry with backoff
Throttling (`THROTTLED` / 429)	Top-level `errors` or HTTP status	Yes	Wait, then retry
Server error (5xx)	HTTP status	Yes	Retry with backoff
User errors	`userErrors` field	No	Fix input, surface to user
Validation errors	Top-level `errors`	No	Fix the query
Auth errors	HTTP 401 / 403	No	Refresh token / re-auth

Rule of thumb: transient errors fix themselves with time, everything else needs a code or input change.
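The table above can be sketched as a single classifier. This is a minimal sketch, not a definitive implementation: the input shape (`networkError`, `status`, `body`) and the `type` labels are assumptions made for illustration, and any top-level error that is not `THROTTLED` is lumped into the validation bucket to mirror the table.

```javascript
// Hypothetical input shape: { networkError, status, body } collected by your HTTP layer.
function classifyFailure({ networkError = null, status = 200, body = null }) {
  if (networkError) return { type: "network", retryable: true };
  if (status === 401 || status === 403) return { type: "auth", retryable: false };
  if (status === 429) return { type: "throttled", retryable: true };
  if (status >= 500) return { type: "server", retryable: true };

  // HTTP looked fine, so the payload decides.
  const errors = body?.errors ?? [];
  if (errors.some((e) => e.extensions?.code === "THROTTLED")) {
    return { type: "throttled", retryable: true };
  }
  if (errors.length > 0) return { type: "validation", retryable: false };

  // userErrors live inside each mutation payload under `data`.
  const payloads = Object.values(body?.data ?? {});
  if (payloads.some((p) => p?.userErrors?.length > 0)) {
    return { type: "userErrors", retryable: false };
  }
  return { type: "none", retryable: false };
}
```

The HTTP checks run first because a transport-level failure means the body may not be trustworthy at all.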

Pattern 1: Treat `userErrors` as a real failure

This is the one that bites everyone. Mutations report through two channels: top-level errors for system-level problems (throttling, malformed queries), and userErrors inside the mutation payload for business-logic problems (invalid input).

{
  "data": {
    "productUpdate": {
      "product": null,
      "userErrors": [
        { "field": ["title"], "message": "Title can't be blank" }
      ]
    }
  }
}

HTTP 200. Empty top-level errors. Still failed.

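function handleMutation(response) {
  const payload = response.data?.productUpdate;
  const userErrors = payload?.userErrors ?? [];
  if (userErrors.length > 0) {
    // Business logic failure: do NOT retry.
    return { success: false, errors: userErrors };
  }
  if (!payload) {
    // No payload at all: the failure is in top-level errors, not userErrors.
    return { success: false, errors: response.errors ?? [] };
  }
  return { success: true, data: payload.product };
}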

Skip this check and you get silent data loss. Always treat a non-empty userErrors as a failure.
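If a request carries more than one mutation, the same check has to run against every payload. Here is a minimal sketch, assuming each mutation payload sits one level under `data` and exposes `userErrors` the way the example above does. The `collectUserErrors` name is made up for illustration.

```javascript
// Gather userErrors from every mutation payload in a response's `data` object.
function collectUserErrors(data) {
  const found = [];
  for (const [mutation, payload] of Object.entries(data ?? {})) {
    for (const error of payload?.userErrors ?? []) {
      found.push({ mutation, field: error.field, message: error.message });
    }
  }
  return found;
}
```

An empty array means no payload reported a business-logic failure. Top-level errors still need their own check.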

Pattern 2: Exponential backoff (with jitter)

Transient errors deserve a retry. Blind retries make everything worse — hammering a throttled API just deepens the throttle.

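// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Assumed error shape: thrown errors carry a boolean `retryable` flag set by your classifier.
const isRetryable = (err) => err.retryable === true;

async function withRetry(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on permanent errors, or once the last attempt has failed.
      if (!isRetryable(err) || attempt === maxRetries - 1) throw err;
      const delay = Math.min(1000 * 2 ** attempt, 30000);
      await sleep(delay + Math.random() * 200); // jitter
    }
  }
}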

The Math.random() is not decoration. Without jitter, every client retries at the same instant and you get a thundering herd that re-crashes the API. The Math.min caps the delay so waits stay sane.
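One thing to weigh: the snippet above adds at most 200 ms of jitter, which is a narrow spread once the base delay reaches several seconds. A commonly used alternative, often called "full jitter", picks the whole delay at random between zero and the capped exponential value. This sketch is a swapped-in variant, not the approach shown above, and the `backoffDelay` name and option names are made up for illustration.

```javascript
// Full-jitter backoff: a random delay in [0, min(capMs, baseMs * 2^attempt)).
// `random` is injectable so the function can be tested deterministically.
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(baseMs * 2 ** attempt, capMs);
  return Math.floor(random() * ceiling);
}
```

The trade-off is that some retries fire almost immediately, in exchange for spreading clients across the whole window instead of a 200 ms slice of it.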

Pattern 3: Throttle before Shopify throttles you

Reactive retries are fine. Proactive throttling is better. Shopify hands you a query-cost budget on every response — read it:

{
  "extensions": {
    "cost": {
      "requestedQueryCost": 101,
      "throttleStatus": {
        "maximumAvailable": 1000,
        "currentlyAvailable": 899,
        "restoreRate": 50
      }
    }
  }
}

Track currentlyAvailable. When it drops below the cost of your next query (plus whatever safety margin you choose), slow yourself down instead of waiting for a THROTTLED slap.
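The wait can be computed straight from the throttleStatus object. This is a minimal sketch, assuming `restoreRate` is expressed in cost points per second and that you can estimate the cost of the next query (for example from the `requestedQueryCost` of a previous run). The `msUntilAffordable` name is made up for illustration.

```javascript
// How long to wait before a query of `nextQueryCost` fits in the bucket.
function msUntilAffordable(throttleStatus, nextQueryCost) {
  const { maximumAvailable, currentlyAvailable, restoreRate } = throttleStatus;
  // A query that costs more than the full bucket can never run as written.
  if (nextQueryCost > maximumAvailable) return Infinity;
  const deficit = nextQueryCost - currentlyAvailable;
  if (deficit <= 0) return 0;
  return Math.ceil((deficit * 1000) / restoreRate);
}
```

With the sample response above (899 available, cost 101) the result is 0, so the request can go out immediately. An `Infinity` result is a signal to split the query rather than wait.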

Pattern 4: Circuit breakers for cascading failures

Sometimes the problem is upstream. Retrying a struggling Shopify just piles on load. A circuit breaker fails fast instead.

State	Behavior	Transition
Closed	Requests flow	Opens after failure threshold
Open	Fail fast, no calls	Half-open after cooldown
Half-open	One test request	Closes on success, reopens on failure

class CircuitBreaker {
  constructor(threshold = 5, cooldown = 30000) {
    this.failures = 0;
    this.threshold = threshold;
    this.cooldown = cooldown;
    this.state = "closed";
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === "open") {
      if (Date.now() - this.openedAt > this.cooldown) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit open");
      }
    }
    try {
      const result = await fn();
      this.reset();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  recordFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = Date.now();
    }
  }

  reset() {
    this.failures = 0;
    this.state = "closed";
  }
}

Pattern 5: Use partial data, don't discard it

GraphQL can return some good fields alongside some failed ones. Don't throw the whole response away — log the bad parts, use the good ones.

function processResponse(response) {
  const { data, errors } = response;
  if (errors?.length && data) {
    logErrors(errors);
    return { partial: true, data };
  }
  if (errors?.length) return { partial: false, errors };
  return { partial: false, data };
}

This matters most on bulk reads. One missing product should never nuke an entire catalog sync.
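To know which parts failed, the GraphQL response format lets each error carry a `path` pointing at the field that produced it. A minimal sketch follows, assuming the errors you receive include `path` (not every error does, so entries without one are skipped). The `failedPaths` name is made up for illustration.

```javascript
// Turn error paths like ["nodes", 2, "title"] into "nodes.2.title" for logging.
function failedPaths(errors) {
  return (errors ?? [])
    .filter((e) => Array.isArray(e.path))
    .map((e) => e.path.join("."));
}
```

During a catalog sync, these paths tell you which items to re-fetch later while the rest of the batch is processed as usual.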

Pattern 6: Idempotency makes retries safe

Here is the trap: a mutation creates an order, the network times out before you get the response, your retry logic fires, and now you have two orders.

Idempotency keys fix this. Generate one unique key per logical operation, send that same key on every retry, and a server that supports idempotency keys returns the original result instead of running the mutation again. Without this, every retry path you build is a duplicate-data risk.
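To make the contract concrete, here is a toy in-memory model of the server side. It is only an illustration of the behavior described above: `makeIdempotentHandler` is a made-up name, and how a given Shopify mutation accepts a key is not covered here, so check the documentation for the mutation you call.

```javascript
// Toy model of a server that honors idempotency keys.
function makeIdempotentHandler(handler) {
  const results = new Map(); // key -> result of the first run
  return function handle(key, payload) {
    // Same key seen before: return the stored result, run nothing.
    if (results.has(key)) return results.get(key);
    const result = handler(payload);
    results.set(key, result);
    return result;
  };
}
```

The client-side half of the deal is to create the key once, outside the retry loop. A key generated per attempt defeats the whole mechanism.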

Pattern 7: Dead letter queues for the ones that still fail

Some operations exhaust every retry. Don't drop them silently:

Operation fails after all retries.
Push the payload + error context to a dead letter queue.
Alert the team.
Fix the cause, then replay from the queue.

This is your safety net during outages. The difference between "we lost 40 minutes of orders" and "we replayed them once the API recovered."
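The four steps above can be sketched with an in-memory queue. This is a minimal sketch for illustration only: a production dead letter queue needs durable storage (a database table or a message broker) so entries survive a restart, and the `DeadLetterQueue` class and its method names are made up here. Alerting is left out.

```javascript
class DeadLetterQueue {
  constructor() {
    this.entries = [];
  }

  // Store what is needed to replay and debug the operation later.
  push(payload, error) {
    this.entries.push({
      payload,
      error: String(error?.message ?? error),
      failedAt: Date.now(),
    });
  }

  // Try each queued entry once; entries that fail again go back on the queue.
  async replay(handler) {
    const pending = this.entries;
    this.entries = [];
    let replayed = 0;
    for (const entry of pending) {
      try {
        await handler(entry.payload);
        replayed++;
      } catch (err) {
        this.push(entry.payload, err);
      }
    }
    return { replayed, remaining: this.entries.length };
  }
}
```

Replay only stays safe if the handler is idempotent, since an entry may already have partially succeeded before it was queued.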

Pattern 8: You can't fix what you can't see

Recovery without observability is guessing. Track these per operation:

Metric	What it tells you
Error rate by type	Which failures dominate
Retry count	How hard the client is working
Throttle frequency	Whether you're over budget
Circuit breaker trips	When things are degrading
Query cost trends	Where your budget goes

Log full context: query, variables, response, timing.
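Per-operation tracking can start as a plain counter map before you reach for a metrics library. This is a minimal sketch, with the `ClientMetrics` class and the metric names invented for illustration. In practice you would forward these counts to whatever monitoring system you already run.

```javascript
// Count events per operation, e.g. ("productUpdate", "retry").
class ClientMetrics {
  constructor() {
    this.counters = new Map();
  }

  increment(operation, metric, by = 1) {
    const key = `${operation}:${metric}`;
    this.counters.set(key, (this.counters.get(key) ?? 0) + by);
  }

  get(operation, metric) {
    return this.counters.get(`${operation}:${metric}`) ?? 0;
  }
}
```

Calling `increment` from the retry loop, the throttle check, and the circuit breaker gives you the first four rows of the table with very little code.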

Putting it together

The patterns stack in a specific order:

Classify the error.
Apply idempotency keys before any retry.
Retry transient errors with backoff + jitter.
Throttle proactively using cost data.
Break the circuit when failures cascade.
Queue the unrecoverable ones.
Monitor all of it.

Build this in from day one. Retrofitting recovery into a live system costs far more than designing it up front — usually right when you can least afford it (hi, Black Friday).

I originally wrote a longer version of this with extra implementation notes on the Kolachi Tech blog: Error Recovery Patterns in Shopify GraphQL. What does your retry + idempotency setup look like? Curious how other people handle the dead-letter replay step.
