Anakin

Posted on Jun 1

When AI Agents Should Stop Using Browsers for Web Data

#ai #automation #webscraping #api

You start with Playwright because it works. The agent needs data from a page, so you give it a browser, navigate to the URL, wait for selectors, extract text, and move on. Then the workflow grows from 3 pages to 300. Suddenly you are debugging "Navigation timeout of 30000 ms exceeded" and "Target closed" errors, rate limits, memory pressure, and a queue full of half-dead browser sessions.

The problem is not that browser automation is bad. It is that a browser is often the wrong abstraction for getting structured data into an AI system.

Browsers are useful, but expensive

A headless browser gives you high fidelity. It runs JavaScript, stores cookies, clicks buttons, submits forms, follows client-side routing, and sees the page roughly as a user would.

That matters for:

multi-step login flows
pages that render all useful data client-side
visual testing
checkout or booking flows
workflows that need clicks, hovers, uploads, or form submission

But if your agent only needs titles, prices, comments, abstracts, availability, or review text, a full browser can become unnecessary overhead.

Each browser session consumes CPU and memory. Parallel extraction means parallel browser contexts or instances. You also have to manage lifecycle issues: browser startup, page crashes, timeouts, proxy assignment, retries, and cleanup.

A typical browser-based extraction loop looks like this:

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();

try {
  await page.goto("https://example.com/product/123", {
    waitUntil: "networkidle",
    timeout: 30_000,
  });

  const title = await page.locator("h1").innerText();
  const price = await page.locator("[data-testid='price']").innerText();

  console.log({ title, price });
} finally {
  await browser.close();
}

This is fine for a small number of pages. At scale, the failure modes stack up. networkidle may never happen because analytics requests keep running. A selector change breaks extraction. A page crash loses the whole session. If the agent launches many of these calls at once, infrastructure becomes part of the reasoning loop whether you wanted it or not.
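
If part of the workload has to stay on browsers, capping concurrency is one way to keep parallel sessions from exhausting memory. Here is a minimal sketch of that idea in Python. `fetch_page` is a hypothetical stand-in for launching a context, navigating, and extracting, and the default limit of 4 is an assumption to tune for your own workers:

```python
import asyncio


async def fetch_page(url: str) -> str:
    # Hypothetical stand-in for launching a context, navigating, and extracting.
    await asyncio.sleep(0.01)
    return f"content of {url}"


class SessionPool:
    """Caps how many (simulated) browser sessions run at the same time."""

    def __init__(self, limit: int) -> None:
        self._semaphore = asyncio.Semaphore(limit)
        self.active = 0
        self.peak = 0  # highest concurrency observed, handy for capacity checks

    async def run(self, url: str) -> str:
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await fetch_page(url)
            finally:
                self.active -= 1


async def fetch_all(urls: list[str], limit: int = 4) -> tuple[list[str], int]:
    # The pool is created inside the running event loop.
    pool = SessionPool(limit)
    results = await asyncio.gather(*(pool.run(u) for u in urls))
    return list(results), pool.peak
```

Calling `asyncio.run(fetch_all(urls, limit=3))` returns the results in input order along with the peak number of simultaneous sessions, which should never exceed the limit.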

Structured extraction is a better default when the shape is known

For many agent workflows, the goal is not to interact with the site. The goal is to turn web content into typed data.

Instead of giving the agent a browser, give it an extraction API that returns JSON:

{
  "title": "Example product",
  "price": 42.99,
  "currency": "USD",
  "availability": "in_stock"
}

That changes the agent’s job. It no longer has to reason about selectors, loading states, cookie banners, or whether a button is visible. It receives data and decides what to do with it.
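
To make "typed data" concrete, here is a small sketch that validates a payload like the one above before it reaches the agent. The `Product` class, the `parse_product` helper, and the set of allowed availability values are assumptions for illustration, not part of any specific API:

```python
import json
from dataclasses import dataclass

# Assumed vocabulary; a real extractor would document its own values.
ALLOWED_AVAILABILITY = {"in_stock", "out_of_stock", "preorder"}
REQUIRED_FIELDS = {"title", "price", "currency", "availability"}


@dataclass(frozen=True)
class Product:
    title: str
    price: float
    currency: str
    availability: str


def parse_product(raw: str) -> Product:
    """Turn a JSON extraction payload into a typed record, or raise ValueError."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["availability"] not in ALLOWED_AVAILABILITY:
        raise ValueError(f"unexpected availability: {data['availability']!r}")
    return Product(
        title=str(data["title"]),
        price=float(data["price"]),
        currency=str(data["currency"]),
        availability=str(data["availability"]),
    )
```

Rejecting malformed payloads at this boundary means the model only ever sees records with a known shape.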

Wire is Anakin’s API layer for web actions, including catalog-based extractors that return structured data without making the agent manage browser sessions directly.

The broader pattern is what matters: move brittle web interaction out of the prompt and into deterministic infrastructure. Let code handle retries, parsing, authentication, and rate limits. Let the model handle ranking, summarizing, planning, or answering the user.

This works best when:

the data shape is predictable
you need many pages per workflow
visual fidelity does not matter
the agent can run jobs asynchronously
failed extractions can be retried or skipped

It works poorly when the site requires complex interaction, CAPTCHA solving, dynamic user-specific flows, or visual verification. In those cases, use a browser. Do not pretend JSON extraction replaces the entire browser automation stack.

Use async jobs instead of long HTTP requests

Long-running extraction should usually be asynchronous. A synchronous request ties your agent to an HTTP timeout. If the page is slow, the proxy retries, or the extraction spans multiple pages, you either block the agent or fail the request.

A better pattern is:

Submit a job.
Get a job_id.
Continue other work.
Poll until the job reaches completed or failed.
Feed the result back into the agent.

The API shape usually looks like this:

curl -X POST https://api.example.com/v1/tasks \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "action": "extract.product",
    "params": {
      "url": "https://example.com/product/123"
    }
  }'

# 202 Accepted
# { "job_id": "job_abc123", "status": "processing" }

Then poll:

curl https://api.example.com/v1/jobs/job_abc123 \
  -H "Authorization: Bearer $API_KEY"

# 200 OK
# {
#   "status": "completed",
#   "data": {
#     "title": "Example product",
#     "price": 42.99
#   }
# }

Wire uses this same async job style for heavier web actions: submit work, receive a job id, and poll for structured results while the agent continues other tasks.

The important part is not the vendor-specific endpoint. It is the control flow. Agents often run several web lookups in parallel, and async jobs make that manageable.
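
The control flow for several parallel lookups can be sketched as follows. `FakeJobService` is an in-memory stand-in for a hypothetical job API, so the example runs without a network; the method names, the status values, and the `max_polls` limit are assumptions that mirror the request shapes above:

```python
import asyncio
import itertools


class FakeJobService:
    """In-memory stand-in for a hypothetical async extraction API."""

    def __init__(self, polls_until_done: int = 2) -> None:
        self._ids = itertools.count(1)
        self._jobs: dict[str, dict] = {}
        self._polls_until_done = polls_until_done

    async def submit(self, url: str) -> str:
        job_id = f"job_{next(self._ids)}"
        self._jobs[job_id] = {"url": url, "polls": 0}
        return job_id

    async def status(self, job_id: str) -> dict:
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] < self._polls_until_done:
            return {"status": "processing"}
        return {"status": "completed", "data": {"url": job["url"]}}


async def run_job(service: FakeJobService, url: str,
                  interval: float = 0.01, max_polls: int = 50) -> dict:
    job_id = await service.submit(url)
    for _ in range(max_polls):
        payload = await service.status(job_id)
        if payload["status"] == "completed":
            return payload["data"]
        if payload["status"] == "failed":
            raise RuntimeError(f"{job_id} failed")
        await asyncio.sleep(interval)  # fixed interval here; real code would back off
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


async def lookup_all(service: FakeJobService, urls: list[str]) -> list:
    # return_exceptions=True hands back failures alongside results
    # instead of raising the first one.
    return await asyncio.gather(
        *(run_job(service, u) for u in urls), return_exceptions=True
    )
```

Each lookup submits and polls independently, so one slow page does not hold up the others, and the agent gets back a list it can inspect item by item.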

Polling needs backoff and terminal states

Do not poll every 100ms. Also do not poll forever.

Your polling code should handle:

processing as non-terminal
completed as success
failed as terminal failure
429 with Retry-After
transient 5xx errors
a maximum attempt count or deadline

Example:

import time
import httpx

TERMINAL = {"completed", "failed"}


def poll_job(job_id: str, api_key: str, deadline_seconds: int = 120):
    url = f"https://api.example.com/v1/jobs/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    started = time.monotonic()
    attempt = 0

    while time.monotonic() - started < deadline_seconds:
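        try:
            resp = httpx.get(url, headers=headers, timeout=10)
        except httpx.TransportError:
            # Timeouts and connection errors are transient; back off and retry.
            time.sleep(min(2 ** attempt, 30))
            attempt += 1
            continue

        if resp.status_code == 429:
            # Retry-After may be an HTTP-date instead of seconds; fall back to 1s.
            raw = resp.headers.get("Retry-After", "1")
            retry_after = int(raw) if raw.isdigit() else 1
            time.sleep(retry_after)
            continue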

        if 500 <= resp.status_code < 600:
            time.sleep(min(2 ** attempt, 30))
            attempt += 1
            continue

        resp.raise_for_status()
        payload = resp.json()
        status = payload.get("status")

        if status == "completed":
            return payload["data"]

        if status == "failed":
            raise RuntimeError(payload.get("error", "extraction failed"))

        time.sleep(min(2 ** attempt, 30))
        attempt += 1

    raise TimeoutError(f"job {job_id} did not finish within {deadline_seconds}s")

This code is boring on purpose. The agent should not improvise retry policy in natural language. Put that behavior in normal application code where you can test it.
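
One way to make that testable is to inject the status fetcher and the sleep function, so a test can run the retry policy without a network or real delays. This is a sketch under assumed names (`backoff_delays`, `poll_until_terminal`), not a drop-in replacement for the function above:

```python
import time
from typing import Callable, Iterator


def backoff_delays(base: float = 1.0, cap: float = 30.0) -> Iterator[float]:
    """Yield base, 2*base, 4*base, ... seconds, never more than `cap`."""
    attempt = 0
    while True:
        yield min(base * 2 ** attempt, cap)
        attempt += 1


def poll_until_terminal(
    fetch_status: Callable[[], dict],
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until it reports a terminal state or attempts run out."""
    delays = backoff_delays()
    for _ in range(max_attempts):
        payload = fetch_status()
        if payload.get("status") in {"completed", "failed"}:
            return payload
        sleep(next(delays))
    raise TimeoutError(f"no terminal status after {max_attempts} attempts")
```

A test can pass a canned sequence of responses as `fetch_status` and a list's `append` method as `sleep`, then assert on both the returned payload and the recorded delays.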

Pick the lowest-fidelity tool that works

A useful rule: use the lowest-fidelity web access method that gives correct data.

Start with direct APIs if the site provides them. Use structured extraction when the data is public or session-backed but predictable. Use browser automation when interaction or rendering fidelity is required.

Mixing all three is normal, but keep the boundaries clear. If every extraction path eventually falls back to a browser, you still need browser infrastructure. If most tasks return JSON and only a few need Playwright, your system gets simpler and cheaper to operate.
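
Keeping the boundaries clear can be as simple as an ordered list of tiers that records which one produced the data. The sketch below assumes each tier is a plain function taking a URL and returning a dict or `None`; the `extract_with_fallback` name and the tier labels are hypothetical:

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(
    url: str, tiers: list[tuple[str, Extractor]]
) -> tuple[str, dict]:
    """Try each tier in order and report which one produced the data."""
    errors: list[str] = []
    for name, extractor in tiers:
        try:
            data = extractor(url)
        except Exception as exc:  # broad on purpose in a sketch; narrow it in real code
            errors.append(f"{name}: {exc}")
            continue
        if data:
            return name, data
        errors.append(f"{name}: no data")
    raise LookupError(f"all tiers failed for {url}: {errors}")
```

Passing tiers in the order direct API, structured extraction, browser means the browser only runs when the cheaper paths return nothing. Logging the returned tier name shows how often each path is actually needed.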

A practical next step: take one existing browser-based extraction in your agent stack and log what it actually uses from the page. If it only reads a handful of fields, replace that path with a JSON-producing function and keep the browser version as a fallback.
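
One low-effort way to run that audit is to wrap the extracted page data in a read-only mapping that counts field reads. The `FieldUsageLog` class below is a hypothetical helper, sketched on the assumption that your extraction code hands the agent a dict:

```python
from collections import Counter
from collections.abc import Mapping


class FieldUsageLog(Mapping):
    """Read-only view over extracted page data that counts field reads."""

    def __init__(self, data: dict) -> None:
        self._data = data
        self.reads = Counter()

    def __getitem__(self, key):
        value = self._data[key]  # raises KeyError before anything is counted
        self.reads[key] += 1
        return value

    def __iter__(self):
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def unused(self) -> set:
        """Fields that were extracted but never read."""
        return set(self._data) - set(self.reads)
```

After a few real runs, `unused()` shows which fields nobody reads and `reads` shows which ones carry the workflow. If the second group is small and stable, that path is a good candidate for a JSON-producing function.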
