When an Actor Platform Is Too Much for an LLM Scraping Task

#webscraping #llm #api #architecture

You start with a simple feature: give an LLM a URL, extract the useful data, and pass structured fields into the next prompt or tool call. Then the scraping layer grows its own lifecycle. You have runs, datasets, queues, retries, webhooks, SDK objects, and output formats that differ by target site. None of that is wrong, but it may be the wrong shape for the job.

The real decision is lifecycle shape

There are two common models for web extraction tools.

The first is the orchestration platform model. You configure a scraper, run it as a job, store output in a dataset, and maybe chain it with another job. Apify is a good example of this model, with Actors, scheduled runs, dataset storage, proxies, and a large marketplace.

The second is the direct extraction API model. You send a request, get a job ID if it is async, poll or wait for the result, and receive JSON you can pass downstream.

For LLM workflows, this distinction matters more than the brand of the tool. An agent usually wants a narrow contract:

{
  "url": "https://example.com/product/123",
  "fields": ["title", "price", "availability", "description"]
}

And it wants a predictable response:

{
  "title": "Mechanical Keyboard",
  "price": "$129.00",
  "availability": "in_stock",
  "description": "Hot-swappable keyboard with...",
  "source_url": "https://example.com/product/123"
}

If your integration has to understand scraper-specific datasets, pagination, run logs, and output schemas before it can hand data to the model, you have added a pipeline where you may only need a tool call.
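To make the "tool call" shape concrete, here is a minimal sketch of what the agent-facing side could look like. The tool name `extract_page_fields`, the `ExtractArgs` type, and `parseExtractArgs` are all hypothetical, and the wrapper keys a given LLM API expects around a tool definition may differ from the generic JSON Schema style used here.

```typescript
// Hypothetical tool contract; none of these names come from a real provider.
type ExtractArgs = {
  url: string;
  fields: string[];
};

// Tool description in a generic JSON Schema style. The exact wrapper keys
// required by a specific LLM API are not shown here.
const extractTool = {
  name: "extract_page_fields",
  description: "Extract named fields from a single web page as JSON.",
  parameters: {
    type: "object",
    properties: {
      url: { type: "string" },
      fields: { type: "array", items: { type: "string" } }
    },
    required: ["url", "fields"]
  }
} as const;

// Validate model-generated arguments before any network call is made.
function parseExtractArgs(raw: unknown): ExtractArgs {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("Tool arguments must be an object");
  }
  const { url, fields } = raw as Record<string, unknown>;

  if (typeof url !== "string" || !/^https?:\/\//.test(url)) {
    throw new Error("url must be an http(s) URL");
  }
  if (!Array.isArray(fields) || fields.length === 0) {
    throw new Error("fields must be a non-empty array");
  }
  if (!fields.every((f): f is string => typeof f === "string" && f.length > 0)) {
    throw new Error("fields must contain only non-empty strings");
  }
  return { url, fields };
}
```

Rejecting malformed arguments at this boundary means the scraping layer only ever sees a URL and a list of field names, which is the whole contract.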

Wire is one example of the direct extraction API shape for LLM workflows: submit a task over REST, poll by job ID, and receive structured JSON back.

Where orchestration helps

Orchestration platforms make sense when the scraping system is a product in its own right.

Use that model when you need:

scheduled nightly or weekly crawls
reusable scrapers shared across teams
dataset storage and historical comparisons
proxy configuration across many geographies
chained jobs such as scrape, enrich, dedupe, export
operational dashboards for non-agent workflows

In that world, a queue is not overhead. It is the system. You want retries, logs, concurrency controls, and storage because your application depends on repeatable batch execution.

The tradeoff is integration surface area. A small LLM feature can end up with code like this:

type ScrapeRun = {
  runId: string;
  datasetId: string;
};

type DatasetItem = Record<string, unknown>;

async function getFirstDatasetItem(run: ScrapeRun): Promise<DatasetItem> {
  const dataset = await fetchDataset(run.datasetId);
  const items = await dataset.listItems();

  if (!items.length) {
    throw new Error(`Scrape finished but dataset ${run.datasetId} was empty`);
  }

  return items[0];
}

The failure mode is also awkward. The job can succeed while the dataset contains no useful item. Without a guard like the one above, your application sees undefined, or worse, passes an empty object into the LLM and gets a confident answer based on missing data.

That is not a tool failure in the strict sense. It is a contract failure.
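One way to make that contract explicit is to return a tagged result instead of a bare item. The sketch below is an assumption about how you might model it: `ItemResult`, `hasValue`, `firstUsableItem`, and the reason codes `EMPTY_DATASET` and `EMPTY_ITEM` are illustrative names, not part of any scraping SDK.

```typescript
type DatasetItem = Record<string, unknown>;

// Either a usable item or a named reason why there is none.
type ItemResult =
  | { ok: true; item: DatasetItem }
  | { ok: false; reason: "EMPTY_DATASET" | "EMPTY_ITEM" };

// Treat null, undefined, blank strings, and empty arrays as "no value".
function hasValue(value: unknown): boolean {
  if (value === null || value === undefined) return false;
  if (typeof value === "string") return value.trim().length > 0;
  if (Array.isArray(value)) return value.length > 0;
  return true;
}

function firstUsableItem(items: DatasetItem[]): ItemResult {
  const item = items[0];
  if (item === undefined) return { ok: false, reason: "EMPTY_DATASET" };

  // An item whose fields are all blank is as useless to the model as no item.
  const anyValue = Object.keys(item).some(key => hasValue(item[key]));
  if (!anyValue) return { ok: false, reason: "EMPTY_ITEM" };

  return { ok: true, item };
}
```

Because the caller has to check `ok` before it can touch `item`, an empty dataset cannot slide into a prompt unnoticed.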

A better contract for agent tools

For an LLM tool, design around the data contract first and the scraping provider second.

A simple interface might look like this:

type ExtractionRequest = {
  targetUrl: string;
  schema: Record<string, "string" | "number" | "boolean" | "string[]">;
};

type ExtractionResult = {
  status: "completed" | "failed";
  data?: Record<string, unknown>;
  error?: {
    code: string;
    message: string;
  };
  metadata: {
    sourceUrl: string;
    executionMs?: number;
  };
};

Then hide the provider behind an adapter:

async function extractForAgent(req: ExtractionRequest): Promise<ExtractionResult> {
  const response = await fetch(process.env.EXTRACTION_API_URL!, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-API-Key": process.env.EXTRACTION_API_KEY!
    },
    body: JSON.stringify(req)
  });

  if (response.status === 401) {
    return {
      status: "failed",
      error: { code: "AUTH_REQUIRED", message: "Invalid or missing API key" },
      metadata: { sourceUrl: req.targetUrl }
    };
  }

  if (!response.ok) {
    return {
      status: "failed",
      error: { code: "EXTRACTION_FAILED", message: await response.text() },
      metadata: { sourceUrl: req.targetUrl }
    };
  }

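  const data = await response.json();

  // An empty payload is a failed extraction, not a successful one.
  // EMPTY_RESULT is an illustrative code, not a provider-defined one.
  if (!data || typeof data !== "object" || Object.keys(data).length === 0) {
    return {
      status: "failed",
      error: { code: "EMPTY_RESULT", message: "Provider returned no data" },
      metadata: { sourceUrl: req.targetUrl }
    };
  }

  return {
    status: "completed",
    data,
    metadata: { sourceUrl: req.targetUrl }
  };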
}

The important part is not the exact endpoint. The important part is that the agent sees one stable contract. It does not know about Actors, datasets, run IDs, browser sessions, or selector logic.
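A stable contract also implies checking that what the provider returned matches what was asked for. The following is a rough sketch of a runtime check against the `schema` map from `ExtractionRequest`. The helper names `matchesType` and `validateAgainstSchema` are made up for this example, and treating `null` as missing is an assumption you may want to change.

```typescript
type FieldType = "string" | "number" | "boolean" | "string[]";
type Schema = Record<string, FieldType>;

// Check a single value against one of the four supported field types.
function matchesType(value: unknown, expected: FieldType): boolean {
  if (expected === "string[]") {
    return Array.isArray(value) && value.every(v => typeof v === "string");
  }
  if (expected === "number") {
    return typeof value === "number" && isFinite(value);
  }
  return typeof value === expected;
}

// Returns one message per problem. An empty array means the data fits the schema.
function validateAgainstSchema(schema: Schema, data: unknown): string[] {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["Provider response is not a JSON object"];
  }
  const record = data as Record<string, unknown>;
  const problems: string[] = [];

  for (const field of Object.keys(schema)) {
    const expected = schema[field] as FieldType;
    const value = record[field];

    if (value === undefined || value === null) {
      problems.push(`Missing field: ${field}`);
    } else if (!matchesType(value, expected)) {
      problems.push(`Field ${field} is not of type ${expected}`);
    }
  }
  return problems;
}
```

An adapter could call this before returning "completed" and downgrade the result to "failed" when the list is non-empty, so a type mismatch surfaces as an error rather than as odd model output.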

If the provider requires async polling, keep that inside the adapter too:

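async function pollJob(jobUrl: string, apiKey: string) {
  const maxAttempts = 20;
  const delayMs = 1000;

  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(jobUrl, { headers: { "X-API-Key": apiKey } });

    // Fail fast on HTTP errors instead of parsing an error page as JSON.
    if (!res.ok) throw new Error(`Polling failed with HTTP ${res.status}`);

    const body = await res.json();

    if (body.status === "completed") return body.data;
    if (body.status === "failed") throw new Error(body.error?.message ?? "Job failed");

    // Still pending: wait before the next attempt.
    await new Promise(resolve => setTimeout(resolve, delayMs));
  }

  throw new Error(`Extraction did not finish after ${maxAttempts} polling attempts`);
}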

This prevents a common bug in agent systems: letting provider-specific states leak into prompt logic. Your LLM should not need to reason about states such as RUNNING, READY, TIMED-OUT, or a finished run with an empty dataset unless those states mean something to the user.
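A small mapping function is usually enough to keep those states on the adapter side. In this sketch, RUNNING, READY, and TIMED-OUT are the states named above, while SUCCEEDED, FAILED, and ABORTED are assumed for illustration. Check your provider's documentation for its real state list. The `pending` outcome is meant to stay inside the adapter's polling loop and never reach the agent.

```typescript
// Provider-side states. SUCCEEDED, FAILED, and ABORTED are assumptions.
type ProviderStatus =
  | "READY"
  | "RUNNING"
  | "SUCCEEDED"
  | "FAILED"
  | "TIMED-OUT"
  | "ABORTED";

// What the rest of the application is allowed to see.
type AgentOutcome =
  | { status: "pending" }
  | { status: "completed" }
  | { status: "failed"; code: string };

function normalizeRunState(status: ProviderStatus, itemCount: number): AgentOutcome {
  switch (status) {
    case "READY":
    case "RUNNING":
      return { status: "pending" };
    case "SUCCEEDED":
      // A run that "succeeded" with no items is a failure for the agent.
      return itemCount > 0
        ? { status: "completed" }
        : { status: "failed", code: "EMPTY_RESULT" };
    case "TIMED-OUT":
      return { status: "failed", code: "TIMEOUT" };
    case "FAILED":
    case "ABORTED":
      return { status: "failed", code: "EXTRACTION_FAILED" };
  }
}
```

The prompt layer then only ever deals with two terminal outcomes, completed or failed, plus an error code it can act on.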

Structured output beats raw HTML for most LLM paths

Raw HTML is useful if you are building a crawler or debugging selectors. It is less useful when the next step is a function call, embedding job, or RAG chunk.

A selector-based scraper can fail quietly when a site changes markup:

const price = document.querySelector(".product-price")?.textContent;

// Later:
if (!price) {
  // Was the product free, unavailable, blocked, or did the selector break?
}

That ambiguity matters. If the extraction layer cannot distinguish “field not present” from “page blocked” from “selector stale”, your LLM workflow has to guess.

A better failure response looks like this:

{
  "status": "failed",
  "error": {
    "code": "EXTRACTION_FAILED",
    "message": "Could not locate price field on rendered page"
  }
}

That gives your application something deterministic to do: retry, ask for a different URL, fall back to another provider, or tell the user what happened.
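As a sketch of what "deterministic" can mean in practice, the function below maps an error code to one of those actions. `AUTH_REQUIRED` and `EXTRACTION_FAILED` are the codes used earlier in this post. `PAGE_BLOCKED` and `INSUFFICIENT_CREDITS` are hypothetical codes, and the retry limit and backoff timings are arbitrary choices.

```typescript
type ExtractionError = { code: string; message: string };

type NextAction =
  | { kind: "retry"; delayMs: number }
  | { kind: "fallback_provider" }
  | { kind: "tell_user"; message: string };

// `attempt` is 1-based: pass 1 after the first failed call.
function decideNextAction(
  error: ExtractionError,
  attempt: number,
  maxAttempts = 3
): NextAction {
  switch (error.code) {
    case "AUTH_REQUIRED":
    case "INSUFFICIENT_CREDITS": // hypothetical code
      // Retrying cannot fix configuration or billing problems.
      return { kind: "tell_user", message: `Extraction is unavailable: ${error.message}` };
    case "PAGE_BLOCKED": // hypothetical code
      return { kind: "fallback_provider" };
    case "EXTRACTION_FAILED":
      if (attempt < maxAttempts) {
        // Exponential backoff: 1s after the first failure, 2s after the second.
        return { kind: "retry", delayMs: 1000 * 2 ** (attempt - 1) };
      }
      return { kind: "tell_user", message: `Could not extract the page: ${error.message}` };
    default:
      return { kind: "tell_user", message: `Unexpected extraction error (${error.code})` };
  }
}
```

None of this logic needs the LLM. The model only gets involved once the application has either real data or a clear message for the user.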

Wire exposes structured error states for authentication, credit, and execution failures, which fits this adapter pattern better than parsing multi-stage job logs in the agent layer.

How to choose without overfitting to one tool

Pick an orchestration platform if your main problem is operating scraping workflows at scale. Pick a direct extraction API if your main problem is feeding structured web data into an LLM with as little glue code as possible.

The quick test is this: if you removed the LLM from your application, would you still need queues, scheduled runs, stored datasets, and shared scraper assets? If yes, orchestration probably belongs in your architecture. If no, keep the extraction layer small and make the contract boring.

A practical next step: write the ExtractionRequest and ExtractionResult types for your app before choosing a provider, then build one adapter around them and make sure failed extraction never reaches your prompt as empty data.
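For that last point, a small guard between the adapter and the prompt is one option. This is a sketch, not a prescribed format: `toToolMessage` is a made-up helper, the `EMPTY_RESULT` code is illustrative, and the wording of the failure note is something you would tune for your own prompts.

```typescript
type ExtractionResult = {
  status: "completed" | "failed";
  data?: Record<string, unknown>;
  error?: { code: string; message: string };
  metadata: { sourceUrl: string; executionMs?: number };
};

// Build the text handed to the model. Failed or empty results become an
// explicit failure note instead of an empty JSON object.
function toToolMessage(result: ExtractionResult): string {
  const source = result.metadata.sourceUrl;

  if (result.status === "failed") {
    const code = result.error?.code ?? "UNKNOWN";
    const message = result.error?.message ?? "no details provided";
    return `EXTRACTION_FAILED for ${source} [${code}]: ${message}`;
  }

  const data = result.data ?? {};
  if (Object.keys(data).length === 0) {
    // "completed" with nothing in it is still a failure for the prompt.
    return `EXTRACTION_FAILED for ${source} [EMPTY_RESULT]: provider returned no fields`;
  }

  return JSON.stringify({ source_url: source, ...data });
}
```

With this in place the model receives either populated JSON or a sentence that says extraction failed and why, which gives it no empty object to invent an answer from.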