DEV Community

Doru Gradinaru

How I Stopped Retry Storms from Destroying My Scraping Budget

Last month I watched a scraping job quietly burn $240 overnight.

The target site started returning 403s around 2 AM. The agent retried. Got another 403. Retried again. By morning it had made 600+ identical requests to a URL that was never going to respond — and I paid for every single one of them in compute units.

The frustrating part: I had already set maxRetries: 3. It didn't matter. The URL kept getting requeued across multiple runs, and each new run reset the counter. Three retries × 200 actor runs = 600 requests.

The Real Problem

Most retry-limiting advice assumes retries happen in a single session. They don't. In a distributed scraping setup, the same blocked URL can bounce between runs, queues, and workers indefinitely. Your maxRetries setting only sees a slice of the full picture.

What you actually need is something that tracks patterns across your entire workspace — not just within a single run.

The pattern behind most of the bill spikes I've seen:

Run 1: URL → 403 → retry → 403 → retry → 403 → fail → requeue
Run 2: URL → 403 → retry → 403 → retry → 403 → fail → requeue
Run 3: ...repeat 200 times...

Each run thinks it only retried 3 times. The bill says otherwise.

What Actually Helps

After debugging this for a while, here's what works:

1. Hash the request, not just the URL

Group requests by action + URL + query params. Requests to /product?id=123 and /product?id=456 are different patterns; /blocked-page requested 50 times is a storm.
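A minimal sketch of that grouping, assuming a Node.js environment. The helper name `patternHash` is mine, not part of any SDK; the idea is just to normalize the URL (sorted query params, no fragment) so equivalent requests collapse into one pattern:

```javascript
import { createHash } from 'node:crypto';

// Hypothetical helper: hash action + path + sorted query params so that
// param order doesn't split one pattern into many.
function patternHash(action, rawUrl) {
  const url = new URL(rawUrl);
  const params = [...url.searchParams.entries()]
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([k, v]) => `${k}=${v}`)
    .join('&');
  return createHash('sha256')
    .update(`${action}|${url.origin}${url.pathname}?${params}`)
    .digest('hex');
}
```

With this, `/product?id=123&x=1` and `/product?x=1&id=123` hash identically, while `/product?id=456` gets its own bucket.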

2. Track patterns cross-run, not just in-session

You need a persistent store that survives across actor runs. A simple Redis counter works: increment on each request, expire after 60 seconds. If the same URL hash hits 10+ times in a minute — it's a storm.

3. Block upstream, not downstream

The mistake I made: trying to fix this inside the scraper. By the time the scraper knows it's in a storm, the compute is already running. The block needs to happen before the actor starts — at the queue level.
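One way to sketch the queue-level guard: filter candidate URLs against the storm check before anything is enqueued, so no actor run is ever started for a storming pattern. `isStorm` here is a placeholder for whatever cross-run check you use (a Redis lookup, a gate call); none of these names are a real API:

```javascript
// Hypothetical pre-enqueue guard: partition candidate URLs into those safe
// to enqueue and those already in a storm. Blocking happens here, upstream,
// before any compute is spent on an actor run.
function filterQueue(urls, isStorm) {
  const allowed = [];
  const blocked = [];
  for (const url of urls) {
    (isStorm(url) ? blocked : allowed).push(url);
  }
  return { allowed, blocked };
}
```

The key design point is placement, not the code: the same check run inside the scraper would fire only after the run has already been billed.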

4. Alert in real-time, not post-mortem

Google Sheets cost monitoring is useful for weekly reviews. But by the time Sheets catches a spike, you've already paid for it.

What I Built

After hitting this problem one too many times, I built ProceedGate — a lightweight gate that sits outside the agent loop and blocks retry storms before the bill accumulates.

It works like this:

Agent action → ProceedGate → ✅ allowed (proceed_token issued)
                           → ⚠️ friction (retry #4–10, warning)
                           → 🚫 blocked (storm, >10/min)

The gate tracks request pattern hashes across your entire workspace. It doesn't care which run or which actor triggered the request — if the same pattern hits 10+ times in 60 seconds, it hard-blocks.

Integration with Node.js/Crawlee takes about 10 lines:

import { createProceedGateClient, requireGateStepOk } from '@proceedgate/node';

const client = createProceedGateClient({
  apiKey: process.env.PROCEEDGATE_API_KEY,
  actor: { id: 'my-scraping-agent' },
});

// Before each request:
await requireGateStepOk(client, {
  policyId: 'retry_friction_v1',
  action: 'web_scrape',
  context: {
    attempt_in_window: retryCount,
    task_hash: urlHash,
    cost_estimate: 0.01,
  },
});

If it's a storm, requireGateStepOk throws and the actor stops — before the compute accumulates.

The Result

The same scraping setup that burned $240 in one night now hard-stops within 60 seconds of hitting a storm pattern. The bill for that scenario: $0.30 in compute.


Free tier available at proceedgate.dev (5,000 checks/month, no card required).

Open source Node SDK: github.com/loquit-doru/proceedgate-node-sdk

Full docs: proceedgate.dev/docs

If you've dealt with this problem differently, I'm curious to hear your approach in the comments.
