Anakin

Posted on Jun 23

Network-layer scraping is fast, but the maintenance cost is the real decision

#webscraping #api #automation #backend

If you have ever shipped a Puppeteer or Playwright scraper, you know the usual failure mode: it works locally, then production gets slower, a selector changes, a cookie modal appears, or a page loads differently in headless mode. A lot of teams eventually notice that the browser is not where the useful data comes from. The frontend usually calls a JSON endpoint in the background, and that endpoint is what you actually wanted.

The useful trick: replay the request, not the UI

Open DevTools, go to the Network tab, filter by Fetch/XHR, then use the site normally. On many modern sites, the page is a thin client around internal HTTP, GraphQL, or RPC calls.

For example, a product search page might call something like this:

curl 'https://example.com/api/search?q=headphones&page=1' \
  -H 'accept: application/json' \
  -H 'user-agent: Mozilla/5.0' \
  -H 'x-client-version: web-2026.06.12'

That request might return structured data directly:

{
  "items": [
    { "id": "p_123", "name": "Wireless headphones", "price": 7999, "in_stock": true }
  ],
  "next_page": 2
}

Calling that endpoint from your backend is usually much faster than rendering a page. You avoid DOM waits, image loading, layout work, and most selector problems.

A minimal implementation might look like this:

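// Calls the site's internal search endpoint directly and returns normalized items.
// The header values mirror what the browser sent in the captured request.
async function searchProducts(query, page = 1) {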
  const res = await fetch(
    `https://example.com/api/search?q=${encodeURIComponent(query)}&page=${page}`,
    {
      headers: {
        accept: 'application/json',
        'user-agent': 'Mozilla/5.0',
        'x-client-version': 'web-2026.06.12'
      }
    }
  );

  if (!res.ok) {
    // Truncate the body so an HTML error page does not flood the logs.
    throw new Error(`Search failed: ${res.status} ${(await res.text()).slice(0, 300)}`);
  }

  const body = await res.json();

  if (!Array.isArray(body.items)) {
    throw new Error(`Unexpected schema: ${JSON.stringify(body).slice(0, 300)}`);
  }

  return body.items.map(item => ({
    id: item.id,
    name: item.name,
    priceCents: item.price,
    inStock: item.in_stock
  }));
}

That pattern is simple, cheap, and often good enough. It is also where the real maintenance problem starts.
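The sample response above includes a next_page field, which suggests the endpoint paginates. Here is a minimal sketch of walking those pages, assuming next_page is a number while more results exist and is null or missing on the last page. That behavior is inferred from the sample, not documented anywhere. The fetchPage parameter and the maxPages cap are hypothetical names introduced for this example.

```javascript
// Sketch: collect items across pages by following `next_page`.
// Assumption: `next_page` is a number while more pages exist and is
// null or missing on the last page (inferred from the sample response).
// `fetchPage(query, page)` is a hypothetical function that resolves to the
// parsed JSON body for one page.
async function collectAllItems(fetchPage, query, maxPages = 20) {
  const items = [];
  let page = 1;

  for (let fetched = 0; fetched < maxPages; fetched++) {
    const body = await fetchPage(query, page);

    if (!body || !Array.isArray(body.items)) {
      throw new Error(`Unexpected schema on page ${page}`);
    }
    items.push(...body.items);

    // Stop when there is no next page, or when the cursor fails to advance,
    // so a changed pagination scheme cannot cause an endless loop.
    if (typeof body.next_page !== 'number' || body.next_page <= page) {
      return items;
    }
    page = body.next_page;
  }

  // Fail loudly instead of returning a silently truncated result.
  throw new Error(`Stopped after ${maxPages} pages; raise maxPages if that is expected`);
}
```

Passing the page fetcher in as a parameter keeps the loop testable with canned responses, and the page cap turns a pagination change into a visible error.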

The endpoint is not a public API

Internal endpoints change without notice. The site owner does not consider your caller a customer of that API.

Common breakages look like this:

401 Unauthorized
{"error":"missing session"}

or:

403 Forbidden
{"message":"invalid client fingerprint"}

or worse, the request still returns 200 OK, but the response shape changes:

{
  "products": [
    { "sku": "p_123", "displayName": "Wireless headphones" }
  ]
}

Your code does not crash at the HTTP layer. It crashes later because body.items is undefined, or it silently writes incomplete data because price moved somewhere else.

This is the tradeoff with network-layer scraping. You remove browser flakiness, but you take ownership of an undocumented API contract.
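One way to keep those breakages apart is to classify every response before storing anything. The sketch below is a rough illustration, not a complete validator: the class names (blocked, transport, contract, empty, ok) and the function name are labels chosen for this example, and the field checks assume the sample search response shown earlier.

```javascript
// Sketch: sort one response into a failure class before any data is stored.
// The class names and this function are hypothetical; the field checks
// assume the sample response shape ({ items: [{ id, name, price, ... }] }).
function classifySearchResponse(status, body) {
  if (status === 401 || status === 403) {
    return { kind: 'blocked', detail: `HTTP ${status}` };
  }
  if (status < 200 || status >= 300) {
    return { kind: 'transport', detail: `HTTP ${status}` };
  }
  if (!body || !Array.isArray(body.items)) {
    const keys = body && typeof body === 'object' ? Object.keys(body) : [];
    return { kind: 'contract', detail: `items missing; saw keys: ${keys.join(', ')}` };
  }
  // `some` is used so that a null entry in the array also counts as bad.
  const hasBadItem = body.items.some(
    item => !item || typeof item.id !== 'string' || typeof item.price !== 'number'
  );
  if (hasBadItem) {
    return { kind: 'contract', detail: 'an item is missing id or price' };
  }
  if (body.items.length === 0) {
    return { kind: 'empty', detail: 'valid shape, zero items' };
  }
  return { kind: 'ok', detail: `${body.items.length} items` };
}
```

With this in place, the 200 OK response that switched to a products array is reported as a contract failure instead of surfacing later as an undefined access.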

Generated endpoints are good for exploration

Tools that generate an API from a URL are useful when you need quick coverage. Give the tool a page and a description, wait a short time, and you get an endpoint that extracts fields from the underlying calls.

That is a good fit for:

one-off research
prototypes
read-only extraction
long-tail sites you may never call again
validating whether the target site exposes useful JSON at all

The downside is operational. If the generated endpoint captured /api/v2/search and the site moves to /api/v3/query, someone has to regenerate or repair it. If the site adds a required header, rotates a token format, or changes pagination, your pipeline owns that breakage unless the provider maintains it for you.

For production systems, maintenance matters more than initial generation speed. Wire is one example of the maintained-catalog approach, where the caller keeps using the same action while the provider handles changes in the target site's internal calls.

That model is less flexible for obscure sites, but it fits recurring workflows better.

Read actions and write actions are not the same problem

Reading from an internal endpoint is relatively safe to reason about. You send query params, receive JSON, validate the schema, and store the result.

Writing is harder. A POST request often carries CSRF tokens, session-bound IDs, idempotency keys, hidden form fields, or state derived from previous requests.

A job application flow might look like this:

GET  /jobs/12345
GET  /api/applications/bootstrap?job_id=12345
POST /api/applications/draft
POST /api/applications/draft/resume
POST /api/applications/submit

If you replay only the final POST, you may get:

409 Conflict
{"error":"draft_not_initialized"}

If you reuse an old token, you may get:

422 Unprocessable Entity
{"error":"csrf_token_expired"}

If you retry a non-idempotent action after a timeout, you may submit twice.
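To make the ordering concrete, here is a sketch of running the whole flow with a fresh token instead of replaying the last request. The paths come from the example flow above. Everything else is an assumption for illustration: the csrf_token and draft_id field names, the x-csrf-token header, and the http helper are hypothetical, and a real site will use its own names.

```javascript
// Sketch: run the whole flow in order with a fresh token instead of
// replaying only the final POST.
// Assumptions (hypothetical, not taken from a real site): bootstrap returns
// `csrf_token`, the draft call returns `draft_id`, and the token travels in
// an `x-csrf-token` header. `http(method, path, options)` is a hypothetical
// helper that resolves to { status, body } and keeps cookies between calls.
async function submitApplication(http, jobId, resume) {
  const id = encodeURIComponent(jobId);

  // Load the job page first; the site may set session cookies here.
  expectOk(await http('GET', `/jobs/${id}`), 'job page');

  const boot = await http('GET', `/api/applications/bootstrap?job_id=${id}`);
  expectOk(boot, 'bootstrap');
  const headers = { 'x-csrf-token': boot.body.csrf_token };

  const draft = await http('POST', '/api/applications/draft', {
    headers,
    body: { job_id: jobId }
  });
  expectOk(draft, 'draft');
  const draftId = draft.body.draft_id;

  const upload = await http('POST', '/api/applications/draft/resume', {
    headers,
    body: { draft_id: draftId, resume }
  });
  expectOk(upload, 'resume upload');

  // The submit call is sent once. If it fails or times out, the error goes
  // to the caller; nothing here retries it.
  const submit = await http('POST', '/api/applications/submit', {
    headers,
    body: { draft_id: draftId }
  });
  expectOk(submit, 'submit');
  return submit.body;
}

// Throws with the step name so a 409 or 422 points at the step that broke.
function expectOk(res, step) {
  if (res.status < 200 || res.status >= 300) {
    throw new Error(`${step} failed: ${res.status} ${JSON.stringify(res.body)}`);
  }
}
```

Naming the failing step in the error makes a draft_not_initialized or csrf_token_expired response much easier to trace back to the request that caused it.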

This is why read-only extraction and web action automation should not share the same risk model. Reading a price wrong is bad. Sending the same customer message twice is a different class of bug. For workflows that submit, post, message, bid, or apply, Wire treats read and write actions as cataloged operations rather than leaving each caller to reconstruct the browser flow.

Even if you build this yourself, make write actions explicit in your codebase. Do not hide them behind a generic fetchPageData() helper.
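A rough sketch of that separation: one helper for reads that is allowed to retry, and one for writes that never retries on its own. The names readJson, performWrite, and send are hypothetical. send stands in for whatever wraps fetch in your codebase and is assumed to resolve to an object with status and body.

```javascript
// Sketch: keep reads and writes in separate helpers with different retry rules.
// `send(url, options)` is a hypothetical function that resolves to
// { status, body }; in real code it would wrap fetch.
async function readJson(send, url, { retries = 2 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    let res;
    try {
      res = await send(url, { method: 'GET' });
    } catch (err) {
      lastError = err; // network error or timeout: safe to retry a read
      continue;
    }
    if (res.status >= 200 && res.status < 300) return res.body;
    lastError = new Error(`GET ${url} failed: ${res.status}`);
    if (res.status < 500) break; // a 4xx will not improve on retry
  }
  throw lastError;
}

async function performWrite(send, url, payload) {
  // No retry loop: after a timeout we cannot know whether the site applied
  // the action, so the caller has to decide what happens next.
  const res = await send(url, { method: 'POST', body: payload });
  if (res.status < 200 || res.status >= 300) {
    throw new Error(`POST ${url} failed: ${res.status}`);
  }
  return res.body;
}
```

The point of the split is that a reviewer can see from the call site whether a line of code can change state on the target site.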

What I would test before depending on network replay

For any target site, I would answer these questions before calling it production-ready:

Does the endpoint require an authenticated session?
How long do session cookies or tokens last?
Does the response schema stay stable across pages, accounts, and regions?
What status code appears when the site blocks or challenges the request?
Can the request be retried safely?
Do you have a fixture test that fails when the response shape changes?

A simple schema check catches a lot:

// Throws as soon as the response stops matching the shape the scraper depends on.
function assertSearchResponse(body) {
  if (!body || !Array.isArray(body.items)) {
    throw new Error('Expected body.items to be an array');
  }

  for (const item of body.items) {
    if (typeof item.id !== 'string') throw new Error('Missing item.id');
    if (typeof item.name !== 'string') throw new Error('Missing item.name');
    if (typeof item.price !== 'number') throw new Error('Missing item.price');
  }
}

Run that against real responses on a schedule. If the test fails before your customer-facing job fails, you bought yourself time.
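For the fixture question in the checklist, one possible approach is to compare the shape of a saved response with the shape of a live one while ignoring the values. The sketch below assumes arrays are homogeneous and describes them by their first element only. shapeOf and shapeDrift are hypothetical names for this example.

```javascript
// Sketch: reduce a JSON value to a description of its shape, so a saved
// fixture and a live response can be compared without comparing values.
function shapeOf(value) {
  if (value === null) return 'null';
  if (Array.isArray(value)) {
    // Simplification: describe an array by its first element only.
    return value.length === 0 ? [] : [shapeOf(value[0])];
  }
  if (typeof value === 'object') {
    const shape = {};
    // Sort keys so key order in the response does not count as drift.
    for (const key of Object.keys(value).sort()) {
      shape[key] = shapeOf(value[key]);
    }
    return shape;
  }
  return typeof value;
}

// Returns null when the shapes match, otherwise both shapes for the alert.
function shapeDrift(fixture, live) {
  const expected = JSON.stringify(shapeOf(fixture));
  const actual = JSON.stringify(shapeOf(live));
  return expected === actual ? null : { expected, actual };
}
```

Note that an empty items array in the live response also reports as drift with this approach, so a scheduled check is best run with a query that is known to return results.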

The practical decision

Use browser automation when you need to interact with a UI that has no useful network calls, or when visual state matters.

Use direct network replay when the site exposes structured background requests and you can tolerate maintaining an undocumented contract.

Use generated endpoints when speed matters more than long-term stability.

Use maintained integrations when the same action runs repeatedly, failure has business impact, or the workflow writes back to the target site.

A good next step is to pick one flaky browser scraper, inspect its XHR calls, replay the smallest useful request with curl, then add a schema test around the response. That will tell you quickly whether network replay reduces the problem or just moves the complexity somewhere else.

Top comments (3)

Roberto Kerber • Jun 28

Agree the maintenance cost is the real tradeoff, not the speed. Hitting the JSON API directly is my default when one exists - way less brittle than DOM selectors. The catch I hit: the same brand can run different stacks per region. OLX is Next.js __NEXT_DATA__ in Brazil but a clean REST API in Europe, so "network-layer" isn't one decision, it's per-target. How do you catch it when an upstream endpoint silently changes shape - monitoring on the response schema, or just wait for the run to break?

Anakin • Jul 8

The OLX point is a good one. Per-region stack variance means it's one decision per target per region, not one decision per brand. Same brand, separate contracts.

On silent schema changes: don't wait for runs to break. Schema validation at the boundary is the minimum. Assert the shape you observed, and classify failures by type (transport vs. contract violation vs. empty result). A field moving from items[].price to items[].pricing.current still returns 200 the whole time.

The harder part is who owns the repair. Monitoring catches the drift, but someone still has to fix it. For DIY integrations that's always you - which is the actual maintenance cost. It's why we built Wire the way we did: when an upstream endpoint changes headers, schema, or auth flow, Wire repairs the contract and the caller's integration keeps working.

Roberto Kerber • Jul 10

Classifying failures by type is the part I underrated. The bucket that still catches me is "empty result", because it isn't one thing. On targets with anti-bot, a block doesn't always announce itself - you get a 200, valid schema, zero items. Transport is fine, the contract holds, the run reports success and returns nothing. Indistinguishable from a genuinely empty search unless you assert on something outside the payload, like a canary query you know should return rows.

The per-region variance compounds that: a schema assertion written against OLX BR passes happily against the EU endpoint, because it validates the shape the BR stack emits. Two contracts wearing one brand name. I pin assertions to target plus region now, never to the brand.

And agreed that ownership of the repair is the real cost - it doesn't disappear, it just moves. Where do you draw that line in Wire: does it detect the drift and hand it back, or does it attempt the repair itself?