AI agents don't always need browsers to read the web

#ai #webscraping #architecture #python

You build an agent that needs data from a website. The first version uses Playwright because it works locally: open the page, wait for the selector, read the DOM. Then you run it in production and the failures change. Jobs hang because a page never reaches networkidle. Containers run out of memory when several browsers start at once. A single slow site blocks the whole agent loop.

The problem usually is not the model. It is the shape of the web access layer.

Browser automation is a high-fidelity tool, not a default

Playwright and Puppeteer are good when you need a browser. They execute client-side JavaScript, keep cookies, click buttons, fill forms, handle navigation, and give you screenshots when something goes wrong.

That fidelity costs you CPU, memory, and time.

A common failure looks like this:

TimeoutError: page.waitForSelector: Timeout 30000ms exceeded.

Or, under load:

Error: browserType.launch: Target page, context or browser has been closed

Those are not unusual edge cases. They happen because a browser session is a stateful resource. If an agent needs to fetch 200 pages, you either run those pages serially and wait a long time, or you run browsers in parallel and pay for the resources. You also need queues, retries, proxy handling, session cleanup, and observability around every job.
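To make the serial-versus-parallel tradeoff concrete, here is a minimal sketch of capping concurrency with asyncio.Semaphore. The fetch_page coroutine is a hypothetical stand-in for a real browser session (it only sleeps), and the limit of 4 is an arbitrary assumption, not a recommendation.

```python
import asyncio


async def fetch_page(url: str) -> str:
    # Hypothetical stand-in for a browser session. A real one would hold a
    # browser process and its memory for the duration of this call.
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def fetch_all(urls: list[str], max_browsers: int = 4) -> list[str]:
    # The semaphore caps how many "browsers" are alive at the same time.
    slots = asyncio.Semaphore(max_browsers)

    async def bounded(url: str) -> str:
        async with slots:
            return await fetch_page(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

Calling asyncio.run(fetch_all(urls)) processes every URL while never holding more than max_browsers sessions open. The cap protects memory, but it also means total time grows with the number of pages, which is the cost the paragraph above describes.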

Use browser automation when the task really needs browser behavior: interactive forms, visual validation, authenticated flows that depend on client-side state, or data that only appears after complex JavaScript execution.

If the agent just needs structured text, metadata, prices, reviews, or search results, a browser is often too much machinery.

Prefer an async extraction boundary

A better pattern for many agent workflows is: submit an extraction job, get a job ID, keep doing other work, then poll for a typed result.

That boundary matters because web requests have unpredictable duration. Your agent should not block its reasoning loop on a single HTTP request that might take 45 seconds or fail because of a rate limit.

The generic flow looks like this:

POST /tasks
-> 202 Accepted { "job_id": "abc", "status": "processing" }

GET /jobs/abc
-> { "status": "completed", "data": { ... } }

Wire uses this kind of async job model for web extraction actions, returning a job_id first and structured results later when the extraction completes.
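The submit half of that flow can be sketched in a few lines. This is an illustration under assumptions: the endpoint URL, the task payload, and the job_id and status field names mirror the generic flow above and are not taken from any real API. The client argument is anything with an httpx-style post(url, json=...) method, such as an httpx.Client.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class JobHandle:
    job_id: str
    status: str


def submit_job(client: Any, tasks_url: str, task: dict) -> JobHandle:
    """POST a task and return its handle without waiting for the result."""
    response = client.post(tasks_url, json=task)

    # This sketch treats 202 Accepted as the only success response.
    if response.status_code != 202:
        raise RuntimeError(f"expected 202 Accepted, got {response.status_code}")

    body = response.json()
    # Hypothetical field names, matching the generic flow shown earlier.
    return JobHandle(job_id=body["job_id"], status=body["status"])
```

The function returns as soon as the job is accepted, so the agent can store the handle and move on to other work before it starts polling.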

Here is a small poller that works for this style of API. The important parts are terminal states, Retry-After, and a cap on attempts, which bounds the total wait so your agent does not loop forever.

import time

import httpx


class JobFailed(Exception):
    """Raised when the job reaches the terminal "failed" state."""


def poll_job(job_url: str, headers: dict, max_attempts: int = 12) -> dict:
    for attempt in range(max_attempts):
        # Exponential backoff, capped at 30 seconds.
        backoff = min(2 ** attempt, 30)
        response = httpx.get(job_url, headers=headers, timeout=10)

        if response.status_code == 429:
            # Retry-After can be delay-seconds or an HTTP-date. Only the
            # integer form is parsed here; anything else uses the backoff.
            try:
                delay = int(response.headers["Retry-After"])
            except (KeyError, ValueError):
                delay = backoff
            time.sleep(max(delay, 0))
            continue

        response.raise_for_status()
        payload = response.json()
        status = payload.get("status")

        # Terminal states: return or raise immediately.
        if status == "completed":
            return payload["data"]

        if status == "failed":
            raise JobFailed(payload.get("error", "job failed"))

        # Anything other than "processing" breaks the contract.
        if status != "processing":
            raise RuntimeError(f"unknown job status: {status}")

        time.sleep(backoff)

    raise TimeoutError("job did not complete before max_attempts")

This pattern also makes retries safer. Retrying a synchronous scrape can accidentally duplicate side effects if the target workflow submits a form or changes state. Polling a job ID is a read, so it is idempotent in practice: you can call GET /jobs/{id} as many times as needed without starting the work again. Only the initial POST creates work, so that is the one call to retry with care.
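The split between a one-time submit and a repeatable read can be sketched as follows. The names run_once_poll_many and TransientError are hypothetical, the submit and poll callables stand in for real HTTP calls, and backoff is left out to keep the retry boundary visible.

```python
from typing import Callable, Optional


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as timeouts."""


def run_once_poll_many(
    submit: Callable[[], str],
    poll: Callable[[str], dict],
    max_polls: int = 5,
) -> dict:
    # The call that starts work happens exactly once, outside the retry loop.
    job_id = submit()

    last_error: Optional[Exception] = None
    for _ in range(max_polls):
        try:
            # Safe to repeat: reading job state does not restart the work.
            return poll(job_id)
        except TransientError as exc:
            last_error = exc

    raise TimeoutError(f"job {job_id} unreadable after {max_polls} polls") from last_error
```

If a poll fails with a transient error, only the read is repeated. The submit call is never re-run, so a flaky network cannot start the same extraction twice.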

Return data the agent can use directly

Agents are bad places to hide parsing logic.

If your tool returns raw HTML, the model has to infer structure every time. That increases token usage and makes failures harder to diagnose. A better extraction layer returns typed fields:

{
  "title": "Example thread title",
  "author": "user123",
  "score": 2847,
  "comments": [
    {
      "author": "another_user",
      "body": "I hit the same issue with polling.",
      "created_at": "2026-01-12T10:15:00Z"
    }
  ]
}

This gives your agent a stable contract. The LLM can reason over score, comments, and created_at instead of guessing where those values live in the DOM.

The tradeoff is that typed extraction depends on schemas. When a site changes markup or behavior, someone still has to update the extractor. The benefit is that the fix happens in one place instead of inside every agent prompt or parser.
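One way to enforce that contract at the tool boundary is to parse the payload into typed objects before the model ever sees it. The sketch below uses standard-library dataclasses; the Thread and Comment classes and the parse_thread function are illustrative names based on the example JSON above, not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Comment:
    author: str
    body: str
    created_at: datetime


@dataclass(frozen=True)
class Thread:
    title: str
    author: str
    score: int
    comments: tuple[Comment, ...]


def parse_thread(data: dict) -> Thread:
    # Fail loudly here instead of passing an odd shape to the model.
    if not isinstance(data["score"], int):
        raise TypeError("score must be an integer")

    comments = tuple(
        Comment(
            author=c["author"],
            body=c["body"],
            # fromisoformat rejects a trailing "Z" before Python 3.11,
            # so normalize it to an explicit UTC offset first.
            created_at=datetime.fromisoformat(c["created_at"].replace("Z", "+00:00")),
        )
        for c in data.get("comments", [])
    )
    return Thread(data["title"], data["author"], data["score"], comments)
```

A missing field raises KeyError and a wrongly typed score raises TypeError, so a change in the upstream site shows up as one clear failure in the extractor rather than as a confused model response.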

Wire exposes a catalog of site-specific actions with declared auth requirements and credit costs, which is one way to move platform-specific parsing and session handling out of the agent code.

Even if you build this yourself, the same design applies: define actions with input parameters, output schemas, retry behavior, and authentication mode. Treat web access like an API contract, not like an improvised browser script.
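As a rough illustration of what such an action definition could look like, here is a small sketch. ActionSpec, AuthMode, and the forum.get_thread action are all hypothetical names invented for this example; a real catalog would likely carry richer schemas.

```python
from dataclasses import dataclass
from enum import Enum


class AuthMode(Enum):
    NONE = "none"
    API_KEY = "api_key"
    USER_SESSION = "user_session"


@dataclass(frozen=True)
class ActionSpec:
    name: str
    input_params: dict[str, type]   # parameter name -> expected Python type
    output_fields: dict[str, type]  # field name -> expected Python type
    auth: AuthMode = AuthMode.NONE
    max_retries: int = 3

    def validate_input(self, params: dict) -> None:
        # Reject the call before any network request is made.
        missing = sorted(set(self.input_params) - set(params))
        if missing:
            raise ValueError(f"missing parameters: {missing}")
        for key, expected in self.input_params.items():
            if not isinstance(params[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")


# Hypothetical action, shaped after the thread example earlier in the post.
GET_THREAD = ActionSpec(
    name="forum.get_thread",
    input_params={"thread_url": str, "max_comments": int},
    output_fields={"title": str, "author": str, "score": int, "comments": list},
)
```

Declaring inputs, outputs, auth mode, and retry budget in one place lets the agent framework validate calls and generate tool descriptions without reading scraper code.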

Be explicit about when to fall back to a browser

API-first extraction is not a replacement for all browser automation.

Use an extraction API when:

The data is mostly text or structured records
You need many pages per workflow
Latency and resource usage matter
The agent can continue working while jobs run
The output schema is predictable

Use a browser when:

The workflow requires clicks, forms, or navigation state
You need screenshots or visual assertions
The page depends on complex client-side rendering
CAPTCHA or interactive auth is part of the flow
You are testing the user experience itself

Mixing both is fine, but do it deliberately. For example, use API extraction for 95% of product pages and reserve Playwright for the few sites where data only appears after an interactive flow. What you want to avoid is sending every request through a browser because the first prototype did.
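Making the routing decision explicit in code is one way to keep it deliberate. The sketch below is an assumption about how you might encode the two lists above; the Task fields and the choose_backend function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    url: str
    needs_interaction: bool = False       # clicks, forms, navigation state
    needs_screenshot: bool = False        # screenshots or visual assertions
    needs_client_rendering: bool = False  # complex client-side rendering
    needs_interactive_auth: bool = False  # CAPTCHA or interactive login


def choose_backend(task: Task) -> str:
    """Return "browser" only when a browser-only requirement is present."""
    browser_only = (
        task.needs_interaction
        or task.needs_screenshot
        or task.needs_client_rendering
        or task.needs_interactive_auth
    )
    return "browser" if browser_only else "api"
```

With this in place, the API path is the default and every browser job has a recorded reason, which also gives you a simple metric for how often the expensive path is really needed.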

The architecture is the useful part

For production agents, the useful pattern is not “scrape without a browser” as a slogan. It is a concrete boundary:

Submit work asynchronously
Poll with backoff
Respect Retry-After
Return typed JSON
Track terminal failure states
Keep browser automation for tasks that need browser fidelity

A practical next step: take one existing Playwright scraper, measure how often it only reads static or semi-structured data, and replace that path with an async job interface that returns a typed JSON schema. Keep the browser path only for the cases that actually need it.