When a scraping platform is too much for an LLM workflow

#webscraping #llm #api #architecture

You start with a simple requirement: give the model fresh data from a web page. Then the implementation grows into browser automation, job queues, dataset exports, retry handling, selector maintenance, and a parser that exists only to turn someone else's output into the JSON your LLM actually needs.

That mismatch is common. Scraping platforms solve a broad problem. LLM workflows usually need a narrower thing: take a URL or search query, extract a few fields, and pass structured data into the next model call.

The shape of the problem

A lot of scraping tooling is built around orchestration:

pick or write a scraper
configure inputs
run it as a job
wait for completion
fetch results from storage
normalize whatever format the scraper returned
pass that into your app

That model makes sense for scheduled crawls, large data pipelines, and teams maintaining many scrapers. If you run nightly competitor monitoring across thousands of pages, orchestration is not overhead. It is the product.

For an LLM feature, it can be too much.

Most agent or RAG flows want something closer to this:

user asks question
→ fetch relevant web data
→ extract structured fields
→ validate shape
→ call the model with clean context
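As a rough sketch, that flow fits in one small function. Every name here (WebContext, WebDataPipeline, answerWithWebData) is hypothetical and chosen for illustration; the three steps are injected so the function does not assume any particular provider or model client.

```typescript
// Hypothetical shape of the context the model should receive.
type WebContext = { title: string; body: string };

// The three steps of the flow, injected so they can be swapped or faked.
type WebDataPipeline = {
  fetchData: (question: string) => Promise<unknown>; // fetch + extract fields
  validate: (raw: unknown) => WebContext;            // throws on a bad shape
  callModel: (question: string, context: WebContext) => Promise<string>;
};

async function answerWithWebData(
  question: string,
  pipeline: WebDataPipeline
): Promise<string> {
  const raw = await pipeline.fetchData(question);
  // If validation throws, the model is never called with bad context.
  const context = pipeline.validate(raw);
  return pipeline.callModel(question, context);
}
```

The point of the injection is that nothing in this function knows how the data was fetched.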

The painful part is usually not the HTTP request. It is everything around it: client-side rendered pages returning empty HTML, CSS selectors drifting, different scrapers returning different schemas, and async jobs failing somewhere three layers away from the code that needs the result.

A failure often looks boring but expensive:

{
  "title": "",
  "price": null,
  "description": "Sign in to continue"
}

Your pipeline does not crash. It just gives the model bad context. Then the model confidently answers from garbage.
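One way to make that kind of result fail loudly is a cheap sanity check before anything reaches the model. This is a sketch under assumptions: ProductExtraction mirrors the example object above, and the phrase list and the softFailureReasons helper are illustrative, not taken from any provider.

```typescript
type ProductExtraction = {
  title: string;
  price: number | null;
  description: string;
};

// Illustrative phrases only; tune the list for the sites you actually fetch.
const WALL_PHRASES = ["sign in to continue", "log in to continue", "enable javascript"];

// Returns a list of reasons the result looks like a silent failure.
// An empty list means no heuristic fired, not that the data is correct.
function softFailureReasons(result: ProductExtraction): string[] {
  const reasons: string[] = [];
  if (result.title.trim() === "") reasons.push("empty title");
  if (result.price === null) reasons.push("missing price");

  const text = result.description.toLowerCase();
  for (const phrase of WALL_PHRASES) {
    if (text.includes(phrase)) reasons.push(`wall phrase: "${phrase}"`);
  }
  return reasons;
}
```

For the object above, all three checks fire: the title is empty, the price is null, and the description contains a login-wall phrase.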

Prefer a narrow extraction contract

For LLM workflows, the useful abstraction is not "run a scraper". It is "return this kind of structured object or fail clearly".

That means your integration should have a contract like this:

type ExtractArticleResult = {
  url: string;
  title: string;
  author?: string;
  published_at?: string;
  markdown: string;
};

Once you define the shape, the rest of the pipeline becomes easier to reason about. You can validate it, cache it, embed it, summarize it, or pass it as tool output.

Wire by Anakin (https://anakin.io/wire) is one example. It exposes extraction as a submit-and-poll flow over REST and returns structured JSON for LLM and agent workflows, with no SDK or actor lifecycle to manage.

The specific provider matters less than the pattern: keep extraction behind a small interface and make the rest of your app depend on typed data, not scraper internals.

A concrete submit-and-poll wrapper

Many extraction APIs are asynchronous because pages may require rendering, retries, or remote execution. Do not let that async lifecycle leak throughout your app. Wrap it once.

Here is a minimal TypeScript example:

type JobResponse = {
  status: "processing";
  job_id: string;
  poll_url: string;
};

type CompletedJob<T> = {
  status: "completed";
  data: T;
  execution_ms?: number;
};

type FailedJob = {
  status: "failed";
  error_code: string;
  message: string;
};

async function submitExtraction(params: {
  apiKey: string;
  actionId: string;
  payload: Record<string, unknown>;
}): Promise<JobResponse> {
  const res = await fetch("https://api.anakin.io/v1/holocron/task", {
    method: "POST",
    headers: {
      "X-API-Key": params.apiKey,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      action_id: params.actionId,
      params: params.payload
    })
  });

  if (res.status !== 202) {
    throw new Error(`Extraction submit failed: ${res.status} ${await res.text()}`);
  }

  return res.json();
}

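// Polls until the job completes or fails, giving up after 30 seconds.
async function pollJob<T>(apiKey: string, pollUrl: string): Promise<T> {
  const deadline = Date.now() + 30_000;

  while (Date.now() < deadline) {
    // poll_url is treated here as a path relative to the API host.
    const res = await fetch(`https://api.anakin.io${pollUrl}`, {
      headers: { "X-API-Key": apiKey }
    });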

    if (!res.ok) {
      throw new Error(`Polling failed: ${res.status} ${await res.text()}`);
    }

    const body = (await res.json()) as CompletedJob<T> | FailedJob | { status: "processing" };

    if (body.status === "completed") return body.data;

    if (body.status === "failed") {
      throw new Error(`${body.error_code}: ${body.message}`);
    }

    await new Promise(resolve => setTimeout(resolve, 1000));
  }

  throw new Error("Extraction timed out after 30s");
}

The important part is not the URL. It is the boundary.

Your agent code should call something like extractArticle(url) and receive an ExtractArticleResult. It should not know about job IDs, polling URLs, retries, browser sessions, or dataset storage.
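Here is a minimal sketch of that boundary. The names ExtractionTransport, createArticleExtractor, and toArticle are hypothetical, and sending `{ url }` as the payload is an assumption about what the extraction action expects. The submit and poll steps are injected, so the agent only ever sees extractArticle.

```typescript
type ExtractArticleResult = {
  url: string;
  title: string;
  author?: string;
  published_at?: string;
  markdown: string;
};

// Provider-facing pieces live behind this interface (hypothetical shape).
type ExtractionTransport = {
  submit: (payload: Record<string, unknown>) => Promise<{ poll_url: string }>;
  poll: (pollUrl: string) => Promise<unknown>;
};

// Maps the provider's raw output to the domain type, or fails clearly.
function toArticle(url: string, raw: unknown): ExtractArticleResult {
  if (typeof raw !== "object" || raw === null) {
    throw new Error(`Extraction for ${url} returned no object`);
  }
  const r = raw as Record<string, unknown>;
  if (typeof r.title !== "string" || typeof r.markdown !== "string") {
    throw new Error(`Extraction for ${url} is missing title or markdown`);
  }
  const article: ExtractArticleResult = { url, title: r.title, markdown: r.markdown };
  if (typeof r.author === "string") article.author = r.author;
  if (typeof r.published_at === "string") article.published_at = r.published_at;
  return article;
}

function createArticleExtractor(transport: ExtractionTransport) {
  return async function extractArticle(url: string): Promise<ExtractArticleResult> {
    // Assumption: the extraction action takes the page URL as its only input.
    const job = await transport.submit({ url });
    const raw = await transport.poll(job.poll_url);
    return toArticle(url, raw);
  };
}
```

In production, the transport would be backed by submitExtraction and pollJob. In tests, it can be two in-memory functions.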

Validate before the model sees it

Do not pass extracted data straight into a prompt. Validate it first.

import { z } from "zod";

const ArticleSchema = z.object({
  url: z.string().url(),
  title: z.string().min(1),
  author: z.string().optional(),
  published_at: z.string().optional(),
  markdown: z.string().min(200)
});

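// `job` is the JobResponse returned by submitExtraction; `apiKey` is the key used to submit.
const raw = await pollJob<unknown>(apiKey, job.poll_url);
// parse() throws a ZodError if the shape is wrong.
const article = ArticleSchema.parse(raw);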

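This catches many of the common silent failures: login walls, empty rendered output, blocked requests, and pages where the extractor picked up a cookie banner or navigation text instead of article content. The 200-character minimum on markdown is a heuristic, so tune it to the shortest article you expect to accept.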

A schema error is much cheaper than a bad model answer.
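If you would rather hand the agent an explicit failure than let an exception bubble up, one option is a small result type around validation. This is a sketch with hypothetical names (ToolResult, toToolResult). The `validate` argument stands for any function that throws on a bad shape, for example `raw => ArticleSchema.parse(raw)`.

```typescript
// Either validated data or a clear, machine-readable failure.
type ToolResult<T> =
  | { ok: true; data: T }
  | { ok: false; reason: string };

function toToolResult<T>(
  raw: unknown,
  validate: (raw: unknown) => T // throws when the shape is wrong
): ToolResult<T> {
  try {
    return { ok: true, data: validate(raw) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    // The prefix is an illustrative convention, not a provider error code.
    return { ok: false, reason: `extraction_invalid: ${message}` };
  }
}
```

The agent can then branch on `ok` and tell the user the page could not be read, instead of answering from a login wall. Zod's own `safeParse` returns a similar success-or-error object if you prefer to stay inside the library.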

When a bigger platform is still the right choice

A direct extraction API is not always better. It is better for a specific class of problem.

Use a scraping orchestration platform when you need:

scheduled batch jobs
shared scraper reuse across teams
dataset storage and historical exports
proxy management at large scale
multi-step scraping pipelines
webhooks and job chaining

Tools like Apify fit this world well. Actor marketplaces, queues, storage, and proxy controls are useful when scraping is a platform concern inside your company.

Use the narrower API pattern when web data is just one tool inside an LLM system. In that case, every extra concept becomes something your agent code has to handle or hide.

Keep the scraper out of your agent logic

The practical rule I use is simple: the model should never depend on a scraping provider's native output format.

Put a small adapter in front of it:

provider response
→ adapter
→ validated domain object
→ LLM prompt or tool result

That lets you swap providers, add fallbacks, cache successful extractions, and test your agent without hitting the network.
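A sketch of what that can look like, assuming each provider is already wrapped as a function from URL to validated object. Extractor, withFallback, and withCache are hypothetical helpers, and the in-memory Map is a stand-in for whatever cache you actually use.

```typescript
// Any provider, once adapted, is just a function from URL to domain object.
type Extractor<T> = (url: string) => Promise<T>;

// Tries providers in order and returns the first successful result.
function withFallback<T>(...providers: Extractor<T>[]): Extractor<T> {
  return async (url) => {
    const errors: string[] = [];
    for (const provider of providers) {
      try {
        return await provider(url);
      } catch (err) {
        errors.push(err instanceof Error ? err.message : String(err));
      }
    }
    throw new Error(`All providers failed for ${url}: ${errors.join("; ")}`);
  };
}

// Caches successful extractions only; failures are retried on the next call.
function withCache<T>(
  extract: Extractor<T>,
  cache: Map<string, T> = new Map<string, T>()
): Extractor<T> {
  return async (url) => {
    const hit = cache.get(url);
    if (hit !== undefined) return hit;
    const value = await extract(url);
    cache.set(url, value);
    return value;
  };
}
```

Because an Extractor is a plain function, a test can pass in a fake that returns a fixed object and never touches the network.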

A good next step is to pick one existing LLM workflow that fetches web data and write down the exact JSON shape it needs. Then add validation at that boundary. You will usually find that the scraping code can get smaller once the contract is explicit.