Polling Async Web Data Jobs Without Burning Your API Quota

Your RAG pipeline looks fine in staging, then someone asks about a product price, policy update, or news item that changed three hours ago. The model answers from yesterday's vector index because the crawler runs nightly. Moving to live web retrieval fixes the freshness problem, but it introduces another one: slow, failure-prone network work now sits in the path of your AI system.

Treat web retrieval as a job, not a request

A synchronous scrape is tempting because it is easy to reason about:

request comes in -> fetch page -> parse page -> answer user

That falls apart when one target site takes 20 seconds, rate limits you, or returns a CAPTCHA page. Your worker sits there waiting, and your upstream timeout decides the result.

The more reliable pattern is:

submit job -> receive job_id -> poll status -> read result

This decouples the caller from the work. The orchestrator can submit several jobs, persist their IDs, and resume later if a process restarts. Most scraping and extraction APIs that do non-trivial work use some version of this shape.

For structured extraction tasks that return job IDs, Wire follows the same submit, poll, retrieve pattern, so the polling loop you write here applies beyond plain HTML scraping.

A typical submission response should look boring:

{
  "job_id": "job_123",
  "status": "pending"
}

The important part is that the API returns quickly. The expensive work happens elsewhere.
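Since the submit call is the only synchronous step, it can be worth validating its response before you persist anything. Here is a minimal sketch, assuming the response carries exactly the `job_id` and `status` fields shown above. `SubmittedJob` and `parseSubmitResponse` are illustrative names, not part of any provider's SDK:

```typescript
// Shape of the submission response shown above. Field names follow the
// example JSON; a real provider may return more fields or different names.
type SubmittedJob = {
  job_id: string;
  status: 'pending' | 'processing';
};

// Validate an untyped JSON body before trusting it. Throwing here makes a bad
// response fail at submit time instead of surfacing later as a polling error.
function parseSubmitResponse(body: unknown): SubmittedJob {
  if (typeof body !== 'object' || body === null) {
    throw new Error('Submit response is not a JSON object');
  }
  const { job_id, status } = body as Record<string, unknown>;
  if (typeof job_id !== 'string' || job_id.length === 0) {
    throw new Error('Submit response is missing job_id');
  }
  if (status !== 'pending' && status !== 'processing') {
    throw new Error(`Unexpected initial status: ${String(status)}`);
  }
  return { job_id, status: status as SubmittedJob['status'] };
}

const job = parseSubmitResponse(JSON.parse('{"job_id":"job_123","status":"pending"}'));
console.log(job.job_id); // job_123
```

Treating any other initial status as an error is a choice of this sketch. If your provider can finish small jobs inline, you would accept `completed` here too.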

Poll slowly enough to matter

The easiest mistake is polling every 500ms because it feels responsive. If most jobs finish in 3 to 15 seconds, those early polls only return this:

{
  "job_id": "job_123",
  "status": "processing"
}

You gain nothing, but you spend rate-limit budget and sometimes billable API calls.
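To put rough numbers on that waste, the sketch below counts status calls for a job that finishes after 10 seconds. It compares a fixed 500ms interval with a 5 second first poll that doubles afterwards. `countPolls` is a hypothetical helper, the 10 second duration is an assumed example rather than a measurement, and request latency is ignored:

```typescript
// Count how many status calls are made until one lands at or after the moment
// the job finishes. `nextDelay` returns the wait before each poll and is
// assumed to always be positive.
function countPolls(jobDurationMs: number, nextDelay: (attempt: number) => number): number {
  let elapsed = 0;
  let polls = 0;
  while (true) {
    elapsed += nextDelay(polls);
    polls += 1;
    if (elapsed >= jobDurationMs) {
      return polls; // this poll is the first one that sees "completed"
    }
  }
}

const jobDurationMs = 10000; // assumed example duration

// Fixed 500ms interval: polls at 0.5s, 1.0s, ... 10.0s.
const fixed = countPolls(jobDurationMs, () => 500);

// 5s first poll, doubling afterwards: polls at 5s and 15s.
const backoff = countPolls(jobDurationMs, attempt => 5000 * 2 ** attempt);

console.log({ fixed, backoff }); // { fixed: 20, backoff: 2 }
```

In this example the backoff version learns the result up to 5 seconds later. That is the trade: 18 fewer calls in exchange for some extra latency.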

A practical default is:

Wait 3 to 5 seconds before the first poll for normal HTTP scraping.
Wait around 10 seconds before the first poll for browser-rendered pages.
Use exponential backoff after that.
Cap the interval so a long job does not disappear for too long.

Here is a TypeScript implementation I would be comfortable putting behind a queue worker:

class JobTimeoutError extends Error {}
class JobFailedError extends Error {}

type JobStatus<T> =
  | { status: 'pending' | 'processing' }
  | { status: 'completed'; result: T }
  | { status: 'failed'; error?: string };

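// Resolve after `ms` milliseconds.
function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}

type PollOptions = {
  initialDelayMs?: number; // wait before the first poll
  maxDelayMs?: number;     // cap on the interval between polls
  maxAttempts?: number;    // status calls before giving up
};

async function pollJob<T>(
  jobId: string,
  getStatus: (jobId: string) => Promise<JobStatus<T>>,
  // Destructure with per-field defaults so callers can override one value
  // without having to restate the other two.
  { initialDelayMs = 5000, maxDelayMs = 60000, maxAttempts = 10 }: PollOptions = {}
): Promise<T> {
  let delay = initialDelayMs;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Sleep first: a poll right after submit almost never finds a finished job.
    await sleep(delay);

    const job = await getStatus(jobId);

    if (job.status === 'completed') {
      return job.result;
    }

    if (job.status === 'failed') {
      throw new JobFailedError(job.error ?? `Job ${jobId} failed`);
    }

    // Still pending or processing: double the wait, up to the cap.
    delay = Math.min(delay * 2, maxDelayMs);
  }

  throw new JobTimeoutError(`Job ${jobId} did not finish after ${maxAttempts} polls`);
}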

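With a 5 second initial delay and a 60 second cap, the loop waits 5s, 10s, 20s, 40s, and then 60s between polls. With the default maxAttempts of 10, that adds up to about 435 seconds (a little over 7 minutes) of waiting before JobTimeoutError is thrown. That is usually enough for web extraction jobs without hammering the status endpoint.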
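You can sanity-check the schedule without running any jobs. The helper below is a sketch that mirrors the doubling-and-cap rule from pollJob. `backoffSchedule` is an illustrative name, and it ignores the time each status request itself takes:

```typescript
// List the wait before each poll, using the same rule as pollJob:
// start at initialDelayMs, double after every poll, never exceed maxDelayMs.
function backoffSchedule(initialDelayMs: number, maxDelayMs: number, maxAttempts: number): number[] {
  const delays: number[] = [];
  let delay = initialDelayMs;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(delay);
    delay = Math.min(delay * 2, maxDelayMs);
  }
  return delays;
}

const delays = backoffSchedule(5000, 60000, 10);
const totalMs = delays.reduce((sum, d) => sum + d, 0);

console.log(delays.map(d => d / 1000).join(', ')); // 5, 10, 20, 40, 60, 60, 60, 60, 60, 60
console.log(totalMs / 1000);                       // 435
```

If jobs should be abandoned sooner than that, lowering maxAttempts is one way to shrink the budget while keeping the early polls unchanged.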

Separate retryable failures from permanent ones

Retries are not a moral good. They help when the failure is transient. They waste money and hide bugs when the request is wrong.

I usually classify failures like this:

Symptom	Likely cause	Action
`502 Bad Gateway`, `503 Service Unavailable`	Upstream or provider issue	Retry with backoff
Network timeout	Temporary network failure	Retry with backoff
`429 Too Many Requests`	Rate limit	Wait for reset, then retry
`400 Bad Request`	Malformed payload	Abort and fix the caller
`401 Unauthorized`	Bad or expired credentials	Abort and alert
`403 Forbidden`	Blocked, missing auth, or wrong region	Usually abort, sometimes retry with different routing

The failure mode matters. If your payload is invalid JSON, five retries produce five invalid requests. If credentials expired, retries just delay the alert that someone needs to refresh the session.
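The table translates fairly directly into code. This is a sketch under a few assumptions: `classifyFailure` and `retryAfterMs` are hypothetical names, unknown statuses are treated as permanent, and `403` is simplified to abort even though the table allows an occasional retry with different routing. `Retry-After` is a standard HTTP header, but whether your provider sends it on `429` is something to verify:

```typescript
type FailureAction = 'retry' | 'wait_then_retry' | 'abort';

// Map an HTTP status (or a network timeout) to an action, following the table
// above. Anything unlisted aborts so unknown failures surface instead of being
// retried silently; that default is a choice of this sketch.
function classifyFailure(failure: number | 'timeout'): FailureAction {
  if (failure === 'timeout') return 'retry';
  switch (failure) {
    case 502:
    case 503:
      return 'retry';
    case 429:
      return 'wait_then_retry';
    case 400:
    case 401:
    case 403: // simplified: the table allows an occasional reroute-and-retry
      return 'abort';
    default:
      return 'abort';
  }
}

// Retry-After carries either a number of seconds or an HTTP date. Fall back
// to a fixed wait (an assumed 30 seconds) when it is absent or unparseable.
function retryAfterMs(header: string | null, now: number, fallbackMs = 30000): number {
  if (header === null) return fallbackMs;
  const seconds = Number(header);
  if (header.trim() !== '' && isFinite(seconds) && seconds >= 0) {
    return seconds * 1000;
  }
  const date = Date.parse(header);
  return isNaN(date) ? fallbackMs : Math.max(0, date - now);
}

console.log(classifyFailure(503));             // retry
console.log(classifyFailure(400));             // abort
console.log(retryAfterMs('120', Date.now()));  // 120000
```

Keeping the classification in one function means the retry policy can be reviewed and tested in one place instead of being scattered across catch blocks.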

Add a circuit breaker per target

Even good retry logic can hurt you when a whole domain changes behavior. Maybe the site moved prices behind client-side rendering. Maybe it started returning login pages. Maybe your parser now throws `Unexpected token < in JSON at position 0` because it expected an API response and got HTML.

A small circuit breaker keeps that from taking down the rest of the pipeline:

type BreakerState = {
  failures: number;
  openedUntil?: number;
};

const breakers = new Map<string, BreakerState>();

function canFetch(domain: string, now = Date.now()) {
  const state = breakers.get(domain);
  return !state?.openedUntil || state.openedUntil <= now;
}

function recordFailure(domain: string, now = Date.now()) {
  const state = breakers.get(domain) ?? { failures: 0 };
  state.failures += 1;

  if (state.failures >= 3) {
    state.openedUntil = now + 5 * 60 * 1000;
  }

  breakers.set(domain, state);
}

function recordSuccess(domain: string) {
  breakers.delete(domain);
}

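Start with 3 consecutive failures and a 5 minute cool-off, then tune it from production data. Note that the failure count is not reset when the cool-off expires, so the first request afterwards acts as a probe: a success clears the state, and one more failure reopens the circuit for another 5 minutes. The goal is not to avoid every failed request. The goal is to stop one broken target from consuming your whole worker pool.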
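One way to wire these functions around a fetch is sketched below. It is written against a small `Breaker` interface so the block stands alone; the three functions above already fit that shape. `guardedFetch` and `CircuitOpenError` are hypothetical names, and counting every thrown error as a failure is a simplification. In practice you would probably count only the failures that say something about the target:

```typescript
class CircuitOpenError extends Error {}

// The three breaker functions from this section, expressed as an interface so
// the wrapper does not care where the state lives.
interface Breaker {
  canFetch(domain: string): boolean;
  recordFailure(domain: string): void;
  recordSuccess(domain: string): void;
}

// Run `work` for a domain only while its circuit is closed, and report the
// outcome back to the breaker.
async function guardedFetch<T>(
  domain: string,
  breaker: Breaker,
  work: () => Promise<T>
): Promise<T> {
  if (!breaker.canFetch(domain)) {
    // Fail fast without touching the network or tying up a worker.
    throw new CircuitOpenError(`Circuit open for ${domain}`);
  }
  try {
    const result = await work();
    breaker.recordSuccess(domain);
    return result;
  } catch (err) {
    breaker.recordFailure(domain);
    throw err; // the caller still decides whether to retry
  }
}
```

With the functions defined earlier, a call would look like `guardedFetch(domain, { canFetch, recordFailure, recordSuccess }, () => fetchPage(url))`, where `fetchPage` stands in for whatever submits and polls your job.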

Store job IDs like production data

If a job matters, persist its ID with the original request, target URL, tenant, attempt count, and deadline. Logs are not enough. When a downstream workflow times out, you want to answer basic questions:

Did we submit the job?
Is it still processing?
Did it fail permanently?
Which tenant or source caused the retry storm?

A simple table is often enough:

create table web_fetch_jobs (
  id text primary key,
  tenant_id text not null,
  target_url text not null,
  status text not null,
  attempts integer not null default 0,
  deadline_at timestamptz not null,
  created_at timestamptz not null default now(),
  updated_at timestamptz not null default now()
);

Once you have this, you can run polling from a queue, recover after deploys, and inspect stuck jobs without guessing.
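As a rough illustration of recovery after a deploy, a worker can load the unfinished rows and decide what to do with each one. The sketch below assumes the `status` column holds the same values as the API's job status, and `WebFetchJob` and `nextAction` are illustrative names. Loading and updating rows is left to whatever database client you already use:

```typescript
// In-memory mirror of a web_fetch_jobs row. Column names follow the table
// above; the status values are assumed to match the API's job statuses.
type WebFetchJob = {
  id: string;
  tenant_id: string;
  target_url: string;
  status: 'pending' | 'processing' | 'completed' | 'failed';
  attempts: number;
  deadline_at: Date;
};

type NextAction = 'poll' | 'expire' | 'none';

// Decide what a recovering worker should do with a row it finds after a
// restart: finished jobs need nothing, overdue jobs are expired, and the rest
// go back into the polling queue.
function nextAction(job: WebFetchJob, now: Date): NextAction {
  if (job.status === 'completed' || job.status === 'failed') return 'none';
  if (now.getTime() >= job.deadline_at.getTime()) return 'expire';
  return 'poll';
}

const row: WebFetchJob = {
  id: 'job_123',
  tenant_id: 'tenant_a',
  target_url: 'https://example.com/pricing',
  status: 'processing',
  attempts: 2,
  deadline_at: new Date('2024-01-01T00:10:00Z'),
};

console.log(nextAction(row, new Date('2024-01-01T00:05:00Z'))); // poll
console.log(nextAction(row, new Date('2024-01-01T00:15:00Z'))); // expire
```

Because the decision depends only on the row and the current time, any worker can pick up any job, which is what makes restarts safe.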

The practical next step is to audit any live retrieval path that still blocks on a single HTTP request. Replace it with job submission, persisted IDs, bounded polling, and explicit failure classification.
