Olamide Olaniyan

How I Designed Retry Logic for APIs That Fail in 6 Different Ways

I used to think retry logic meant this:

try {
  return await fetchData();
} catch {
  return await fetchData();
}

That version works right up until production reminds you that APIs fail in more than one way.

Sometimes you get rate limited.

Sometimes the server returns a 500.

Sometimes the TCP connection dies halfway through the response.

Sometimes the request technically succeeds, but the payload is missing the field you needed.

Sometimes the upstream is healthy and your process is the one timing out too aggressively.

And the worst one: sometimes retrying is exactly the wrong thing to do.

If you're building anything on third-party APIs, retry logic is not a tiny helper function. It is reliability policy.

So this is the retry model I use now: the six failures I treat differently, the JavaScript and Python implementations, and where a data layer like SociaVault makes this much easier because I can focus on application retries instead of scraping chaos.

The Six Failures That Matter

Here is the model I keep in my head.

1. Rate limit failures

HTTP 429.

These want backoff. Fast retries just make you look rude.

2. Temporary server failures

HTTP 500, 502, 503, 504.

Usually retryable. But not forever.

3. Network failures

Socket hangups, DNS hiccups, connection resets, TLS failures.

Also usually retryable if the request is idempotent.

4. Timeout failures

Could be upstream slowness, oversized responses, or just a timeout that is too strict.

These need careful thresholds, not blind retries.
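To make "careful thresholds" concrete, here is a minimal Python sketch of one approach (the multipliers are illustrative assumptions, not rules): keep the connect timeout short and fixed, and derive the read timeout from how long a healthy response actually takes. The returned tuple matches the (connect, read) timeout shape that requests accepts.

```python
def pick_timeouts(healthy_response_seconds, connect_budget=3.0, slack=1.5):
    # Connect timeout: short and fixed, since opening a connection should
    # not depend on payload size. Read timeout: scaled from how long a
    # healthy response really takes, plus slack, so an overly strict
    # timeout stops masquerading as an upstream failure.
    read_timeout = max(healthy_response_seconds * slack, connect_budget)
    return (connect_budget, read_timeout)
```

Feed the result straight into something like `requests.get(url, timeout=pick_timeouts(10))`.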

5. Validation failures

The request succeeded, but the response shape is wrong or incomplete.

Sometimes retryable. Often a data contract problem.

6. Client mistakes

HTTP 400, 401, 403, 404 caused by bad params, expired auth, or bad assumptions.

Do not retry these by default. Fix the request.

That classification alone made my systems much more stable.

Before that, I was treating every error as a generic exception and wondering why nothing improved.

The Rule That Changed Everything

This is the rule I build around now:

Retry only when a second attempt has a real chance of succeeding without changing the request meaningfully.

That sounds obvious, but it filters out a lot of bad behavior.

If you send a malformed request five times in a row, that is not resilience. That is noise.

JavaScript Version: A Retry Wrapper With Error Policy

This version is the one I reach for most often in Node services.

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

function isRetryable(error) {
  const status = error?.response?.status;
  const code = error?.code;

  if ([429, 500, 502, 503, 504].includes(status)) {
    return true;
  }

  if ([
    'ECONNRESET',
    'ETIMEDOUT',
    'ECONNREFUSED',
    'EAI_AGAIN',
    'UND_ERR_CONNECT_TIMEOUT',
  ].includes(code)) {
    return true;
  }

  return false;
}

function getRetryDelay(attempt, error) {
  const retryAfter = error?.response?.headers?.['retry-after'];
  const retryAfterSeconds = Number(retryAfter);
  // Retry-After can also be an HTTP date; only trust it when it parses
  // as a number of seconds, otherwise fall back to exponential backoff.
  if (retryAfter && Number.isFinite(retryAfterSeconds)) {
    return retryAfterSeconds * 1000;
  }

  const base = Math.min(1000 * 2 ** (attempt - 1), 15000);
  const jitter = Math.floor(Math.random() * 250);
  return base + jitter;
}

async function requestWithPolicy(requestFn, options = {}) {
  const {
    maxRetries = 4,
    validate = null,
    onRetry = null,
  } = options;

  let lastError;

  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      const result = await requestFn();

      if (validate && !validate(result)) {
        const validationError = new Error('Response validation failed');
        validationError.code = 'INVALID_RESPONSE';
        throw validationError;
      }

      return result;
    } catch (error) {
      lastError = error;
      const retryable = isRetryable(error) || error.code === 'INVALID_RESPONSE';

      if (!retryable || attempt > maxRetries) {
        throw lastError;
      }

      const delay = getRetryDelay(attempt, error);

      if (onRetry) {
        await onRetry({ attempt, delay, error });
      }

      await sleep(delay);
    }
  }

  throw lastError;
}

async function fetchProfile(handle) {
  return requestWithPolicy(
    async () => {
      const response = await fetch(
        `https://api.sociavault.com/v1/scrape/tiktok/profile?handle=${encodeURIComponent(handle)}`,
        {
          headers: {
            'X-API-Key': process.env.SOCIAVAULT_API_KEY,
          },
        }
      );

      if (!response.ok) {
        const error = new Error(`Request failed with ${response.status}`);
        error.response = {
          status: response.status,
          headers: Object.fromEntries(response.headers.entries()),
        };
        throw error;
      }

      return response.json();
    },
    {
      maxRetries: 3,
      validate: json => Boolean(json?.data),
      onRetry: ({ attempt, delay, error }) => {
        console.log(`Retry ${attempt} in ${delay}ms because: ${error.message}`);
      },
    }
  );
}

fetchProfile('creator_handle')
  .then(result => console.log(result.data))
  .catch(error => console.error('Final failure:', error.message));

The part I care about most is the policy split.

Retries are not hard. Correct retries are hard.

Python Version: Same Policy, Same Outcome

In Python, the same pattern works well with requests.

import random
import time

import requests


def is_retryable(error):
    # Validation errors raised below are plain ValueErrors with no
    # .response attribute, so reach for it defensively.
    response = getattr(error, 'response', None)
    status = getattr(response, 'status_code', None)
    if status in {429, 500, 502, 503, 504}:
        return True

    message = str(error).lower()
    network_terms = ['timed out', 'connection aborted', 'connection reset', 'temporary failure']
    return any(term in message for term in network_terms)


def get_retry_delay(attempt, error):
    headers = getattr(getattr(error, 'response', None), 'headers', None)
    if headers:
        retry_after = headers.get('Retry-After')
        if retry_after and retry_after.isdigit():
            return int(retry_after)

    base = min(2 ** (attempt - 1), 15)
    jitter = random.uniform(0, 0.25)
    return base + jitter


def request_with_policy(request_fn, max_retries=4, validate=None):
    last_error = None

    for attempt in range(1, max_retries + 2):
        try:
            result = request_fn()

            if validate and not validate(result):
                raise ValueError('Response validation failed')

            return result
        except Exception as error:
            last_error = error
            retryable = is_retryable(error) or isinstance(error, ValueError)

            if not retryable or attempt > max_retries:
                raise last_error

            delay = get_retry_delay(attempt, error)
            print(f'Retry {attempt} in {delay:.2f}s because: {error}')
            time.sleep(delay)

    raise last_error


def fetch_profile(handle):
    def do_request():
        response = requests.get(
            'https://api.sociavault.com/v1/scrape/tiktok/profile',
            params={'handle': handle},
            headers={'X-API-Key': 'YOUR_API_KEY'},
            timeout=20,
        )
        response.raise_for_status()
        return response.json()

    return request_with_policy(
        do_request,
        max_retries=3,
        validate=lambda json_data: bool(json_data.get('data')),
    )


print(fetch_profile('creator_handle').get('data'))

The Biggest Retry Mistake I See

It is not missing backoff.

It is retrying without observability.

If you do not log:

  • attempt count
  • delay used
  • error type
  • final failure reason

then your retry logic becomes a black box that hides system problems until they get expensive.

Retries should reduce noise, not bury it.
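A sketch of what that can look like in Python, as structured log records (the field names here are my own convention, not a standard):

```python
import json
import time


def log_retry(request_name, attempt, delay, error):
    # One JSON line per retry: greppable, aggregatable, hard to miss.
    record = {
        'event': 'retry',
        'request': request_name,
        'attempt': attempt,
        'delay_seconds': round(delay, 2),
        'error_type': type(error).__name__,
        'error_message': str(error),
        'timestamp': time.time(),
    }
    print(json.dumps(record))
    return record
```

Wire it into the wrapper's retry path (the JS version's onRetry hook is the same idea), and log the final failure with the full attempt count attached.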

Another Mistake: No Idempotency Boundary

If your retry wrapper is used for mutating operations, be careful.

GET requests are usually easy to retry.

POST/PUT/PATCH operations need more thought.

If the first request succeeded upstream but your client timed out before seeing the response, retrying a mutation can create duplicate work or duplicate billing.

That is why I separate:

  • read retries
  • write retries
  • idempotent write retries with request IDs

Do not treat those the same.
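For the idempotent-write case, the usual trick is a client-generated request ID that stays stable across retries of one logical operation. A Python sketch (the Idempotency-Key header name is a common convention; whether a given API honors it is something you have to verify):

```python
import uuid


def build_write_headers(api_key, request_id=None):
    # Reuse the same request_id for every retry of one logical write so
    # the server can deduplicate; generate a fresh one per new operation.
    return {
        'X-API-Key': api_key,
        'Idempotency-Key': request_id or str(uuid.uuid4()),
    }
```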

Honest Alternatives

There are a few legitimate alternatives depending on the stack.

Library-managed retries

Good if you want faster setup.

Less good if you need custom validation or better logs.

Queue-based retries

Great for background jobs and batch pipelines.

Usually better than inline retries for large workloads.
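The core of the queue-based approach is just a delay-ordered queue: failed jobs are requeued with a ready-at time instead of blocking a request thread. A toy Python sketch of that idea:

```python
import heapq
import time


class RetryQueue:
    """Jobs become visible again only after their backoff delay elapses."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal-time jobs never compare directly

    def push(self, job, delay_seconds):
        ready_at = time.monotonic() + delay_seconds
        heapq.heappush(self._heap, (ready_at, self._seq, job))
        self._seq += 1

    def pop_ready(self):
        # Drain every job whose backoff has elapsed, in ready order.
        ready = []
        now = time.monotonic()
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

A real system would use a broker (SQS, RabbitMQ, a Redis sorted set) rather than an in-process heap, but the ready-at ordering is the same.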

Circuit breakers

Great when you need to stop hammering a degraded service entirely.

If your API dependency gets flaky often, add one.
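A minimal sketch of the idea in Python (the thresholds are illustrative): track consecutive failures, stop sending after a limit, and let a probe through once a cooldown passes.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a probe request through.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Check allow_request() before each attempt inside the retry wrapper: the breaker handles "the whole dependency is down" while per-request retries handle one-off blips.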

For most app-level fetches, I still like an explicit retry policy wrapper plus good logging.

Where SociaVault Fits

This is where I like using SociaVault: as the public social data layer in front of my app logic.

That way my retry logic is focused on application reliability, not on fighting broken browser sessions, proxy rotation, and scraping volatility from scratch.

That is a much better engineering trade.

If your real product is monitoring, analytics, or workflow automation, that separation matters a lot.

Final Take

Retry logic is not about "try again".

It is about deciding which failures deserve another attempt and which ones deserve a fix instead.

Once I started treating retries as policy instead of boilerplate, my systems got calmer fast.

Fewer noisy failures. Fewer pointless retries. Better logs. Better recovery.

And if you want to spend more of that effort on product reliability instead of collection plumbing, SociaVault is a good layer to build on.

#webdev #api #javascript #python #backend
