Async Scraping Jobs Are Usually a Better Fit for RAG Ingestion Than Blocking Requests

#ai #rag #webscraping #architecture

A RAG system that depends on web data usually fails in a boring way: the page changed, your index did not, and the model answers confidently from stale context. The first fix people reach for is often a cron job plus a scraper. That works until page loads take 12 seconds, a target site returns 429s, and your app starts holding open worker threads for data that should have been fetched in the background.

Why synchronous scraping breaks down

A synchronous scraper looks simple:

const html = await fetchPage(url);
const data = extractProduct(html);
await vectorStore.upsert(data);

That code hides a few production problems.

If fetchPage(url) needs a browser, the request might spend several seconds waiting for JavaScript, network calls, cookie banners, or anti-bot checks. If you run this inside an API request handler, the user waits. If you run enough of them in parallel, your workers sit around holding memory and open sockets.

The failure modes are also awkward:

TimeoutError: Navigation timeout of 30000 ms exceeded
HTTPError: Response code 429 (Too Many Requests)
Error: selector ".price" did not match any elements
ProtocolError: Target closed

Each error needs a different response. A timeout might need a retry. A 429 might need backoff or a different proxy. A missing selector might mean the site redesigned its markup, or it might mean the product is unavailable. Treating all of those as "try again later" creates bad data and noisy queues.
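As a rough sketch of that idea, the function below maps the error messages listed above to a coarse action. The name classifyScrapeError, the action labels, and the optional statusCode property are assumptions for illustration; matching on message text is brittle, and treating "Target closed" as transient is a judgment call rather than a rule.

```javascript
// Map a scrape failure to the action the pipeline should take.
// Action names ("retry", "backoff", "inspect") are illustrative only.
function classifyScrapeError(err) {
  const message = String((err && err.message) || err);
  const status = err && err.statusCode; // hypothetical property; not every client sets it

  // Rate limiting: slow down or switch proxy rather than retry immediately.
  if (status === 429 || /\b429\b/.test(message)) {
    return { action: "backoff", reason: "rate_limited" };
  }

  // Navigation timeouts and crashed browser targets are assumed to be transient.
  if (/timeout/i.test(message) || /Target closed/.test(message)) {
    return { action: "retry", reason: "transient" };
  }

  // A selector miss may mean a redesign or an unavailable product.
  // Retrying will not help, so surface it for a human or a validation step.
  if (/did not match any elements/.test(message)) {
    return { action: "inspect", reason: "selector_miss" };
  }

  // Anything unrecognized is surfaced rather than retried blindly.
  return { action: "inspect", reason: "unknown" };
}
```

The point is only that the retry path and the "someone needs to look at this" path are decided explicitly, instead of every failure landing back in the same queue.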

For AI pipelines, the bigger issue is freshness. If your ingestion job runs hourly, the model can be wrong for up to an hour even when every component is technically healthy. For inventory, shipment status, event schedules, job listings, or pricing, that gap matters.

The async job pattern

Async scraping APIs use a submit, poll, retrieve flow. You submit work, get a job ID, and fetch the result later. What matters is less the API shape than the consequence: page retrieval no longer blocks your application flow.

A minimal version looks like this:

# Submit work
curl -X POST https://scraper.example.com/jobs \
  -H "Authorization: Bearer $SCRAPER_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/products/123","render":true}'

# Response
# { "jobId": "job_8f31", "status": "queued" }

Then poll by ID:

curl https://scraper.example.com/jobs/job_8f31 \
  -H "Authorization: Bearer $SCRAPER_TOKEN"

# Possible response
# { "jobId": "job_8f31", "status": "running" }

# Completed response
# {
#   "jobId": "job_8f31",
#   "status": "completed",
#   "markdown": "# Product 123\n\nPrice: $42",
#   "html": "<html>...</html>",
#   "durationMs": 8420
# }

In application code, keep polling separate from request handling:

// Poll a scrape job until it completes, fails, or the overall timeout elapses.
async function waitForScrape(jobId, { timeoutMs = 60000 } = {}) {
  const started = Date.now();
  let delay = 2000; // initial poll interval in ms

  while (Date.now() - started < timeoutMs) {
    const res = await fetch(`https://scraper.example.com/jobs/${jobId}`, {
      headers: { Authorization: `Bearer ${process.env.SCRAPER_TOKEN}` }
    });

    // A non-2xx response from the status endpoint itself is treated as fatal here.
    if (!res.ok) {
      throw new Error(`status poll failed: ${res.status}`);
    }

    const job = await res.json();

    if (job.status === "completed") return job;

    if (job.status === "failed") {
      throw new Error(`scrape failed: ${job.reason || "unknown reason"}`);
    }

    // Still queued or running: wait, then grow the interval up to a 10 s cap.
    await new Promise(resolve => setTimeout(resolve, delay));
    delay = Math.min(delay * 1.5, 10000);
  }

  throw new Error(`scrape job ${jobId} timed out after ${timeoutMs}ms`);
}

This gives you a clean boundary. Your ingestion worker can submit 500 URLs, store job IDs, and update records as jobs finish. Your user-facing app does not need to keep browsers open or know how proxy retries work.
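The submit side of that boundary could look roughly like the sketch below. It assumes the hypothetical scraper.example.com endpoint and the { jobId, status } response shape from the curl example above; mapWithLimit and submitScrape are illustrative names, not part of any real client library.

```javascript
// Run `task` over `items` with at most `limit` calls in flight at once.
// Failures are captured per item so one bad URL does not abort the batch.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  async function worker() {
    while (next < items.length) {
      const index = next++; // claim the next unprocessed item
      try {
        results[index] = { ok: true, value: await task(items[index]) };
      } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        results[index] = { ok: false, error: message };
      }
    }
  }

  const workerCount = Math.min(Math.max(1, limit), items.length);
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

// Hypothetical submit call against the example endpoint used in this post.
async function submitScrape(url) {
  const res = await fetch("https://scraper.example.com/jobs", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SCRAPER_TOKEN}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ url, render: true })
  });
  if (!res.ok) throw new Error(`submit failed: ${res.status}`);
  const { jobId } = await res.json();
  return jobId;
}
```

An ingestion worker might call mapWithLimit(urls, 10, submitScrape), persist each returned job ID, and leave polling to a separate loop. The limit of 10 is arbitrary; the right number depends on the scraping provider's rate limits.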

For this specific async ingestion pattern, Wire exposes web extraction as jobs that return IDs for polling, which matches the way background RAG updates usually need to run.

What you still need to build

Async jobs do not remove all pipeline work. They move the browser and network mess behind an API, but you still need to decide how your system treats results.

At minimum, store job state in your own database:

create table scrape_jobs (
  id text primary key,
  source_url text not null,
  status text not null,
  attempts integer not null default 0,
  submitted_at timestamptz not null default now(),
  completed_at timestamptz,
  error text,
  content_hash text
);

The content_hash matters. If the page content did not change, do not re-embed it or write duplicate vectors. A simple hash over normalized Markdown is often enough:

import crypto from "node:crypto";

function hashContent(markdown) {
  return crypto
    .createHash("sha256")
    .update(markdown.replace(/\s+/g, " ").trim())
    .digest("hex");
}
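To show where that hash gets used, here is a minimal sketch of the skip-or-embed decision. planIngestion is a hypothetical helper, and hashFn stands in for the hashContent function above; it is passed in as a parameter so the decision logic stays independent of the hashing choice.

```javascript
// Decide what to do with a finished scrape, given the hash stored from the
// previous run (null or undefined if the page has never been ingested).
function planIngestion(storedHash, markdown, hashFn) {
  const contentHash = hashFn(markdown);

  // Same normalized content as last time: skip embedding and the vector write.
  if (storedHash === contentHash) {
    return { action: "skip", contentHash };
  }

  // New or changed content: embed it, then persist contentHash on the job row.
  return { action: "embed", contentHash };
}
```

In this sketch the caller would store the returned contentHash in scrape_jobs.content_hash only after the embedding write succeeds, so a failed write does not mark the content as already ingested.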

You also need dead-letter handling. If a job fails three times, stop retrying it in the hot path and send it somewhere visible. Otherwise one broken source can consume your queue forever.

if (job.attempts >= 3) {
  await db.deadLetters.insert({
    sourceUrl: job.sourceUrl,
    reason: job.error,
    failedAt: new Date()
  });
  return;
}
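The retry side of that decision can be sketched as a small function that either schedules another attempt with backoff or hands the job to the dead-letter path. The name nextStep, the 2 second base, the 60 second cap, and the jitter scheme are all assumptions chosen for illustration; only the three-attempt threshold comes from the snippet above.

```javascript
const MAX_ATTEMPTS = 3; // same threshold as the dead-letter check above

// Given how many attempts have already failed, decide whether to retry
// (and after how long) or to dead-letter the job.
function nextStep(attempts, { baseMs = 2000, maxMs = 60000, random = Math.random } = {}) {
  if (attempts >= MAX_ATTEMPTS) {
    return { action: "dead_letter" };
  }

  // Exponential ceiling: baseMs, 2x, 4x, ... capped at maxMs.
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempts);

  // "Equal jitter": wait at least half the ceiling, plus a random share of the
  // other half, so many failing jobs do not all retry at the same instant.
  const half = ceiling / 2;
  return { action: "retry", delayMs: half + Math.floor(random() * half) };
}
```

Passing random as a parameter is only there to make the function deterministic under test; production code would use the default.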

Selector drift is a separate problem

Async scraping helps with latency and reliability, but it does not automatically fix extraction logic. If your parser depends on this:

const price = $(".product-price .amount").text();

then a redesign can silently produce an empty string. That is worse than a hard failure because bad data enters the index.

Prefer extraction contracts that validate output before ingestion:

import { z } from "zod";

const ProductSchema = z.object({
  name: z.string().min(1),
  price: z.string().min(1),
  availability: z.enum(["in_stock", "out_of_stock", "unknown"])
});

const parsed = ProductSchema.safeParse(extracted);

if (!parsed.success) {
  throw new Error(parsed.error.message);
}

Whether the extraction comes from selectors, readability output, or an AI extraction step, validate it before it reaches your vector store.

Where async jobs fit

Async scraping jobs fit background ingestion, scheduled refreshes, change detection, and batch updates. They do not fit cases where a chat response must include a page fetched milliseconds ago. A typical scrape job may take several seconds, especially with browser rendering.

If the user can wait, fine. If not, use cached content and update it behind the scenes.
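A minimal sketch of that "serve cached, refresh in the background" path is below. getContextForChat, the cache entry shape ({ content, fetchedAt }), the 15 minute default, and the enqueueRefresh callback are all hypothetical; in a real system enqueueRefresh would submit an async scrape job rather than fetch anything inline.

```javascript
// Return whatever is cached for `url` right away, and queue a background
// refresh when the entry is missing or older than maxAgeMs.
function getContextForChat(url, cache, options = {}) {
  const {
    now = Date.now(),
    maxAgeMs = 15 * 60 * 1000,
    enqueueRefresh = () => {}
  } = options;

  const entry = cache.get(url);

  // Nothing cached yet: the caller answers without this page for now.
  if (!entry) {
    enqueueRefresh(url);
    return { content: null, stale: true };
  }

  const ageMs = now - entry.fetchedAt;
  const stale = ageMs > maxAgeMs;

  // Stale content is still returned; the refresh happens behind the scenes.
  if (stale) enqueueRefresh(url);

  return { content: entry.content, stale, ageMs };
}
```

Returning the stale flag and age lets the chat layer decide whether to mention that the data may be out of date, which is often better than silently answering from old context.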

The practical pattern is simple: keep scraping out of request handlers, treat each scrape as a job with state, validate the extracted shape, hash content before embedding, and dead-letter repeated failures instead of retrying forever.
