Muhammad Masad Ashraf

Posted on • Originally published at kolachitech.com

8 Error Recovery Patterns That Keep Shopify GraphQL Clients Alive

If you have shipped anything against the Shopify GraphQL API, you have probably been burned by this: the request returns 200 OK, your logging looks clean, and yet the order never got created.

That is the part of GraphQL that catches people off guard. The transport layer and the application layer fail separately. A clean HTTP status tells you almost nothing.

Below are the eight patterns I lean on to build clients that hold up under real traffic. Think of it less as theory and more as the checklist I run through before calling an integration "done."

First, stop trusting the status code

REST trained us to read the HTTP status and move on. 404 = missing, 500 = server died, body is an afterthought.

GraphQL flips that. A response can succeed at the HTTP level while quietly failing in the payload:

{
  "data": { "product": null },
  "errors": [
    {
      "message": "Throttled",
      "extensions": { "code": "THROTTLED" }
    }
  ]
}

Two takeaways:

  • data can hold partial results.
  • errors explains what broke.

Your client has to read both, every time.

Classify before you react

Different failures want different responses. Retrying a validation error is just burning rate limit for nothing.

| Error type | Where it shows up | Retryable? | What to do |
| --- | --- | --- | --- |
| Network failure | Transport layer | Yes | Retry with backoff |
| Throttling (THROTTLED / 429) | Top-level errors | Yes | Wait, then retry |
| Server error (5xx) | HTTP status | Yes | Retry with backoff |
| User errors | userErrors field | No | Fix input, surface to user |
| Validation errors | Top-level errors | No | Fix the query |
| Auth errors | HTTP 401 / 403 | No | Refresh token / re-auth |

Rule of thumb: transient errors fix themselves with time; everything else needs a change on your side (code, input, or credentials).
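
One way to encode the table is a small classifier that runs before any retry decision. This is a rough sketch, not Shopify's own taxonomy: the `classifyFailure` name and the `{ networkError, status, body }` input shape are assumptions made for illustration, and treating every non-throttle top-level error as non-retryable is a simplification.

```javascript
// Hypothetical helper: maps one request outcome to a row of the table.
// `outcome` is an assumed shape: { networkError?, status?, body? }.
function classifyFailure(outcome, mutationName) {
  const { networkError, status, body } = outcome;

  // Transport and HTTP-level failures come first.
  if (networkError) return { type: "network", retryable: true };
  if (status === 401 || status === 403) return { type: "auth", retryable: false };
  if (status === 429) return { type: "throttled", retryable: true };
  if (status >= 500) return { type: "server", retryable: true };

  // HTTP looked fine, so inspect the GraphQL payload.
  const errors = body?.errors ?? [];
  if (errors.some((e) => e.extensions?.code === "THROTTLED")) {
    return { type: "throttled", retryable: true };
  }
  // Simplification: any other top-level error is treated as a query problem.
  if (errors.length > 0) return { type: "validation", retryable: false };

  // Mutations can still fail through userErrors inside the payload.
  const userErrors = body?.data?.[mutationName]?.userErrors ?? [];
  if (userErrors.length > 0) return { type: "userErrors", retryable: false };

  return { type: "ok", retryable: false };
}
```

The `retryable` flag is the only thing a retry loop needs; the `type` is there for logging and metrics.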

Pattern 1: Treat userErrors as a real failure

This is the one that bites everyone. Mutations report through two channels: top-level errors for system stuff, and userErrors inside the payload for business-logic stuff.

{
  "data": {
    "productUpdate": {
      "product": null,
      "userErrors": [
        { "field": ["title"], "message": "Title can't be blank" }
      ]
    }
  }
}

HTTP 200. Empty top-level errors. Still failed.

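function handleMutation(response) {
  const payload = response.data?.productUpdate;
  const userErrors = payload?.userErrors ?? [];
  if (userErrors.length > 0) {
    // Business-logic failure: do NOT retry, the same input will fail again.
    return { success: false, errors: userErrors };
  }
  if (!payload) {
    // No payload at all means a top-level error occurred; report that instead.
    return { success: false, errors: response.errors ?? [] };
  }
  return { success: true, data: payload.product };
}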

Skip this check and you get silent data loss. Always treat a non-empty userErrors as a failure.
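
The handler above is tied to productUpdate. If one document sends several mutations (for example through aliases), a generic check can scan every payload under data. A minimal sketch, assuming each mutation payload exposes a `userErrors` array as in the example response; the `collectUserErrors` name is made up for illustration.

```javascript
// Hypothetical helper: gathers userErrors from every mutation payload in `data`.
function collectUserErrors(data) {
  const found = [];
  for (const [mutation, payload] of Object.entries(data ?? {})) {
    // A payload can be null or lack userErrors; both count as "no user errors".
    for (const err of payload?.userErrors ?? []) {
      found.push({ mutation, field: err.field, message: err.message });
    }
  }
  return found;
}
```

A non-empty result means at least one mutation in the request failed on business rules and should not be retried as-is.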

Pattern 2: Exponential backoff (with jitter)

Transient errors deserve a retry. Blind retries make everything worse — hammering a throttled API just deepens the throttle.

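// Resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls `fn` up to `maxRetries` times in total. `isRetryable(err)` is assumed
// to be defined by you and to return true only for transient failures.
async function withRetry(fn, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on permanent errors and on the final attempt.
      if (!isRetryable(err) || attempt === maxRetries - 1) throw err;
      const delay = Math.min(1000 * 2 ** attempt, 30000); // 1s, 2s, 4s, ... capped at 30s
      await sleep(delay + Math.random() * 200); // jitter
    }
  }
}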

The Math.random() is not decoration. Without jitter, every client retries at the same instant and you get a thundering herd that re-crashes the API. The Math.min caps the delay so waits stay sane.
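
The retry loop adds at most 200 ms of jitter, which spreads clients only thinly once the base delay reaches several seconds. A common alternative, often called "full jitter", draws the whole delay at random below the exponential ceiling. The sketch below is one possible variant, not part of the original client; `backoffDelay`, its defaults, and the injectable `random` option are assumptions.

```javascript
// Hypothetical variant: delay is uniformly random in [0, ceiling), where the
// ceiling grows exponentially with the attempt number and is capped at capMs.
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(baseMs * 2 ** attempt, capMs);
  return Math.floor(random() * ceiling);
}
```

Passing `random` explicitly makes the schedule deterministic in unit tests; in production the default `Math.random` is enough.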

Pattern 3: Throttle before Shopify throttles you

Reactive retries are fine. Proactive throttling is better. Shopify hands you a query-cost budget on every response — read it:

{
  "extensions": {
    "cost": {
      "requestedQueryCost": 101,
      "throttleStatus": {
        "maximumAvailable": 1000,
        "currentlyAvailable": 899,
        "restoreRate": 50
      }
    }
  }
}

Track currentlyAvailable. When it dips low, slow yourself down instead of waiting for a THROTTLED slap.
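
As a rough sketch of what "slow yourself down" can mean, the function below computes how long to wait until the next query fits in the budget. It assumes `restoreRate` is expressed in cost points per second; `msUntilAffordable` and the `reserve` parameter are hypothetical names, not part of any Shopify library.

```javascript
// Hypothetical helper: milliseconds to wait before a query costing `nextCost`
// points fits in the bucket described by extensions.cost.throttleStatus.
function msUntilAffordable(throttleStatus, nextCost, reserve = 0) {
  const { maximumAvailable, currentlyAvailable, restoreRate } = throttleStatus;
  // The bucket never holds more than maximumAvailable, so clamp the target.
  const needed = Math.min(nextCost + reserve, maximumAvailable);
  const deficit = needed - currentlyAvailable;
  if (deficit <= 0) return 0;
  // restoreRate is assumed to be points restored per second.
  return Math.ceil((deficit * 1000) / restoreRate);
}
```

A reasonable estimate for `nextCost` is the `requestedQueryCost` reported the last time the same query ran. The `reserve` leaves headroom for other workers sharing the same budget.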

Pattern 4: Circuit breakers for cascading failures

Sometimes the problem is upstream. Retrying a struggling Shopify just piles on load. A circuit breaker fails fast instead.

| State | Behavior | Transition |
| --- | --- | --- |
| Closed | Requests flow | Opens after failure threshold |
| Open | Fail fast, no calls | Half-open after cooldown |
| Half-open | One test request | Closes on success, reopens on failure |

class CircuitBreaker {
  constructor(threshold = 5, cooldown = 30000) {
    this.failures = 0;
    this.threshold = threshold;
    this.cooldown = cooldown;
    this.state = "closed";
    this.openedAt = null;
  }

  async call(fn) {
    if (this.state === "open") {
      if (Date.now() - this.openedAt > this.cooldown) {
        this.state = "half-open";
      } else {
        throw new Error("Circuit open");
      }
    }
    try {
      const result = await fn();
      this.reset();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }

  recordFailure() {
    this.failures++;
    if (this.failures >= this.threshold) {
      this.state = "open";
      this.openedAt = Date.now();
    }
  }

  reset() {
    this.failures = 0;
    this.state = "closed";
  }
}

Pattern 5: Use partial data, don't discard it

GraphQL can return some good fields alongside some failed ones. Don't throw the whole response away — log the bad parts, use the good ones.

function processResponse(response) {
  const { data, errors } = response;
  if (errors?.length && data) {
    logErrors(errors);
    return { partial: true, data };
  }
  if (errors?.length) return { partial: false, errors };
  return { partial: false, data };
}

This matters most on bulk reads. One missing product should never nuke an entire catalog sync.
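
To find out which parts failed, the `path` entry that the GraphQL specification defines for errors can be matched against the top-level keys of data. A sketch under the assumption that each failed field's error carries a `path`; errors without one are simply not attributed to a field. `splitPartial` is a hypothetical name.

```javascript
// Hypothetical helper: separates usable top-level fields from failed ones,
// using the first segment of each error's `path`.
function splitPartial(response) {
  const failed = new Set();
  for (const err of response.errors ?? []) {
    if (Array.isArray(err.path) && err.path.length > 0) failed.add(err.path[0]);
  }
  const usable = {};
  for (const [key, value] of Object.entries(response.data ?? {})) {
    // Keep a field only if no error points at it and it actually has a value.
    if (!failed.has(key) && value !== null) usable[key] = value;
  }
  return { usable, failed: [...failed] };
}
```

In a catalog sync that fetches products under aliases, `failed` becomes the list to log or re-fetch while everything in `usable` is processed normally.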

Pattern 6: Idempotency makes retries safe

Here is the trap: a mutation creates an order, the network times out before you get the response, your retry logic fires, and now you have two orders.

Idempotency keys fix this. Attach one unique key per logical operation, reuse that same key on every retry of it, and the server returns the original result instead of running the operation again. Without this, every retry path you build is a duplicate-data risk.
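
The client-side half of this can be sketched as a small guard that runs each key at most once while an attempt is in flight or has succeeded, and hands the key to the operation so it can be forwarded to the server. This is an illustration only: `createIdempotentRunner` is a hypothetical name, the store is an unbounded in-memory Map, and whether a given mutation accepts an idempotency key is something to confirm in the API reference. It does not replace server-side deduplication after a timeout.

```javascript
// Hypothetical client-side guard: one promise per idempotency key.
function createIdempotentRunner() {
  const attempts = new Map(); // key -> promise of the current or successful attempt

  return function runOnce(key, operation) {
    if (attempts.has(key)) return attempts.get(key);
    // The key is passed through so the operation can send it with the request.
    const promise = Promise.resolve(operation(key));
    attempts.set(key, promise);
    // A failed attempt is forgotten so a retry can run, reusing the same key.
    promise.catch(() => attempts.delete(key));
    return promise;
  };
}
```

Deriving the key from business identity (for example a cart or checkout ID) rather than generating a fresh random value per attempt is what keeps retries pointed at the same operation.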

Pattern 7: Dead letter queues for the ones that still fail

Some operations exhaust every retry. Don't drop them silently:

  1. Operation fails after all retries.
  2. Push the payload + error context to a dead letter queue.
  3. Alert the team.
  4. Fix the cause, then replay from the queue.

This is your safety net during outages. It is the difference between "we lost 40 minutes of orders" and "we replayed them once the API recovered."
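
The four steps can be sketched with an in-memory queue. This is deliberately minimal and assumes nothing about your infrastructure: a real dead letter queue would live in durable storage (a database table or a message broker), and the `DeadLetterQueue` class and its method names are hypothetical.

```javascript
// Hypothetical in-memory dead letter queue; swap the array for durable storage.
class DeadLetterQueue {
  constructor() {
    this.entries = [];
  }

  // Step 2: keep the payload together with enough context to debug it later.
  push(payload, error) {
    this.entries.push({
      payload,
      error: String(error?.message ?? error),
      failedAt: new Date().toISOString(),
    });
  }

  // Step 4: replay every entry through `handler`; entries that fail again
  // go back on the queue. Returns the number of successful replays.
  async replay(handler) {
    const pending = this.entries;
    this.entries = [];
    let replayed = 0;
    for (const entry of pending) {
      try {
        await handler(entry.payload);
        replayed++;
      } catch (err) {
        this.push(entry.payload, err);
      }
    }
    return replayed;
  }
}
```

Replay is only safe when the replayed operations are idempotent, which is why the queue stores the full original payload rather than a summary.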

Pattern 8: You can't fix what you can't see

Recovery without observability is guessing. Track these per operation:

| Metric | What it tells you |
| --- | --- |
| Error rate by type | Which failures dominate |
| Retry count | How hard the client is working |
| Throttle frequency | Whether you're over budget |
| Circuit breaker trips | When things are degrading |
| Query cost trends | Where your budget goes |

Log full context: query, variables, response, timing.
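
A minimal sketch of per-operation counters, assuming you forward the snapshot to whatever metrics backend you already run. `createMetrics` and the metric names in the usage lines are hypothetical.

```javascript
// Hypothetical in-process counters keyed by "operation:metric".
function createMetrics() {
  const counts = new Map();
  return {
    increment(operation, metric, by = 1) {
      const key = `${operation}:${metric}`;
      counts.set(key, (counts.get(key) ?? 0) + by);
    },
    // Plain object copy, ready to log or ship to a metrics backend.
    snapshot() {
      return Object.fromEntries(counts);
    },
  };
}

const metrics = createMetrics();
metrics.increment("productUpdate", "retry");
metrics.increment("productUpdate", "retry");
metrics.increment("productUpdate", "throttled");
```

Keying by operation keeps one noisy mutation from hiding inside a healthy global average.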

Putting it together

The patterns stack in a specific order:

  1. Classify the error.
  2. Apply idempotency keys before any retry.
  3. Retry transient errors with backoff + jitter.
  4. Throttle proactively using cost data.
  5. Break the circuit when failures cascade.
  6. Queue the unrecoverable ones.
  7. Monitor all of it.

Build this in from day one. Retrofitting recovery into a live system costs far more than designing it up front — usually right when you can least afford it (hi, Black Friday).


I originally wrote a longer version of this with extra implementation notes on the Kolachi Tech blog: Error Recovery Patterns in Shopify GraphQL. What does your retry + idempotency setup look like? Curious how other people handle the dead-letter replay step.
