DEV Community

Anakin

Network-layer scraping is fast, but the maintenance cost is the real decision

If you have ever shipped a Puppeteer or Playwright scraper, you know the usual failure mode: it works locally, then production gets slower, a selector changes, a cookie modal appears, or a page loads differently in headless mode. A lot of teams eventually notice that the browser is not where the useful data comes from. The frontend usually calls a JSON endpoint in the background, and that endpoint is what you actually wanted.

The useful trick: replay the request, not the UI

Open DevTools, go to the Network tab, filter by fetch or xhr, then use the site normally. On many modern sites, the page is just a client around internal HTTP, GraphQL, or RPC calls.

For example, a product search page might call something like this:

curl 'https://example.com/api/search?q=headphones&page=1' \
  -H 'accept: application/json' \
  -H 'user-agent: Mozilla/5.0' \
  -H 'x-client-version: web-2026.06.12'

That request might return structured data directly:

{
  "items": [
    { "id": "p_123", "name": "Wireless headphones", "price": 7999, "in_stock": true }
  ],
  "next_page": 2
}

Calling that endpoint from your backend is usually much faster than rendering a page. You avoid DOM waits, image loading, layout work, and most selector problems.

A minimal implementation might look like this:

async function searchProducts(query, page = 1) {
  // Header values are copied from the request captured in DevTools.
  const res = await fetch(
    `https://example.com/api/search?q=${encodeURIComponent(query)}&page=${page}`,
    {
      headers: {
        accept: 'application/json',
        'user-agent': 'Mozilla/5.0',
        'x-client-version': 'web-2026.06.12'
      }
    }
  );

  if (!res.ok) {
    // Truncate the body so an HTML error page does not flood the logs.
    const detail = (await res.text()).slice(0, 300);
    throw new Error(`Search failed: ${res.status} ${detail}`);
  }

  const body = await res.json();

  // Fail loudly if the envelope changed instead of mapping undefined fields.
  if (!body || !Array.isArray(body.items)) {
    throw new Error(`Unexpected schema: ${JSON.stringify(body).slice(0, 300)}`);
  }

  // Normalize to internal field names.
  return body.items.map(item => ({
    id: item.id,
    name: item.name,
    priceCents: item.price,
    inStock: item.in_stock
  }));
}
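
The sample response also carries a next_page field, so a caller usually needs a paging loop around the single-page request. Here is a minimal sketch, assuming a missing or non-advancing next_page means the last page was reached (the site may signal this differently). collectAllItems and its fetchPage argument are hypothetical names, not part of any real API:

```javascript
// Hypothetical helper: walks pages until next_page stops advancing or a cap is hit.
// fetchPage(page) is assumed to resolve to the parsed JSON body for that page.
async function collectAllItems(fetchPage, maxPages = 20) {
  const items = [];
  let page = 1;

  for (let i = 0; i < maxPages && page !== null; i++) {
    const body = await fetchPage(page);
    if (!body || !Array.isArray(body.items)) {
      throw new Error(`Unexpected schema on page ${page}`);
    }
    items.push(...body.items);

    // Stop when the cursor is absent or does not move forward,
    // so a changed pagination scheme cannot cause an endless loop.
    const next = body.next_page;
    page = Number.isInteger(next) && next > page ? next : null;
  }

  return items;
}
```

The page cap is deliberate: with an undocumented endpoint, a bounded loop that returns partial data is easier to debug than one that never finishes.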

That pattern is simple, cheap, and often good enough. It is also where the real maintenance problem starts.

The endpoint is not a public API

Internal endpoints change without notice. The site owner does not consider your caller a customer of that API.

Common breakages look like this:

401 Unauthorized
{"error":"missing session"}

or:

403 Forbidden
{"message":"invalid client fingerprint"}

or worse, the request still returns 200 OK, but the response shape changes:

{
  "products": [
    { "sku": "p_123", "displayName": "Wireless headphones" }
  ]
}

Your code does not fail at the HTTP layer. Without a shape check, it crashes later because body.items is undefined, or it silently writes incomplete data because price moved somewhere else.

This is the tradeoff with network-layer scraping. You remove browser flakiness, but you take ownership of an undocumented API contract.
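
One way to make that ownership visible is to label each failure mode separately, so an expired session, a block, and a schema change do not all surface as the same generic error. The following is a rough sketch; classifyResponse and its labels are hypothetical, and the status-to-cause mapping is an assumption that a real site may not follow:

```javascript
// Hypothetical classifier for the failure modes shown above.
// Returns a label so alerting can tell "log in again" apart from "schema drift".
function classifyResponse(status, body) {
  if (status === 401) return 'auth_expired';
  if (status === 403) return 'blocked';
  if (status < 200 || status >= 300) return 'http_error';

  // A 200 response with the wrong envelope is still a failure.
  if (!body || !Array.isArray(body.items)) return 'schema_drift';

  // The envelope can be intact while individual fields have moved.
  const incomplete = body.items.some(
    item => !item || typeof item.id !== 'string' || typeof item.price !== 'number'
  );
  return incomplete ? 'schema_drift' : 'ok';
}
```

Calling classifyResponse(200, { products: [] }) returns 'schema_drift', which is the case that would otherwise pass as a successful request.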

Generated endpoints are good for exploration

Tools that generate an API from a URL are useful when you need quick coverage. Give the tool a page and a description, wait a short time, and you get an endpoint that extracts fields from the underlying calls.

That is a good fit for:

  • one-off research
  • prototypes
  • read-only extraction
  • long-tail sites you may never call again
  • validating whether the target site exposes useful JSON at all

The downside is operational. If the generated endpoint captured /api/v2/search and the site moves to /api/v3/query, someone has to regenerate or repair it. If the site adds a required header, rotates a token format, or changes pagination, your pipeline owns that breakage unless the provider maintains it for you.

For production systems, maintenance matters more than initial generation speed. Wire is one example of the maintained-catalog approach, where the caller keeps using the same action while the provider handles changes in the target site's internal calls.

That model is less flexible for obscure sites, but it fits recurring workflows better.

Read actions and write actions are not the same problem

Reading from an internal endpoint is relatively easy to reason about. You send query params, receive JSON, validate the schema, and store the result.

Writing is harder. A POST request often carries CSRF tokens, session-bound IDs, idempotency keys, hidden form fields, or state derived from previous requests.

A job application flow might look like this:

GET  /jobs/12345
GET  /api/applications/bootstrap?job_id=12345
POST /api/applications/draft
POST /api/applications/draft/resume
POST /api/applications/submit

If you replay only the final POST, you may get:

409 Conflict
{"error":"draft_not_initialized"}

If you reuse an old token, you may get:

422 Unprocessable Entity
{"error":"csrf_token_expired"}

If you retry a non-idempotent action after a timeout, you may submit twice.
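
The first two failures come from skipping steps or reusing stale state, so a replay has to walk the sequence in order and fetch a fresh token on every run. The sketch below shows that shape under heavy assumptions: send(method, path, payload) is an assumed transport that resolves to parsed JSON, csrfToken and draftId are hypothetical field names, and the initial GET /jobs/12345 page load is omitted on the assumption that the bootstrap call alone supplies the token:

```javascript
// Sketch only: the payload shapes and field names here are illustrative.
async function submitApplication(send, jobId, resume) {
  // 1. Bootstrap on every run so the token is fresh, never cached.
  const boot = await send(
    'GET',
    `/api/applications/bootstrap?job_id=${encodeURIComponent(jobId)}`
  );
  if (!boot || typeof boot.csrfToken !== 'string') {
    throw new Error('Bootstrap did not return a CSRF token');
  }
  const csrfToken = boot.csrfToken;

  // 2. Create the draft before any later step refers to it.
  const draft = await send('POST', '/api/applications/draft', { csrfToken, jobId });
  if (!draft || typeof draft.draftId !== 'string') {
    throw new Error('Draft was not initialized');
  }
  const draftId = draft.draftId;

  // 3. Attach the resume to that draft.
  await send('POST', '/api/applications/draft/resume', { csrfToken, draftId, resume });

  // 4. Submit once. There is no retry here, because a timeout
  //    does not prove that the submit failed on the server.
  return send('POST', '/api/applications/submit', { csrfToken, draftId });
}
```

Passing send in as a parameter keeps the sequence testable with a fake transport, which matters because you cannot safely exercise a real submit endpoint in CI.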

This is why read-only extraction and web action automation should not share the same risk model. Reading a price wrong is bad. Sending the same customer message twice is a different class of bug. For workflows that submit, post, message, bid, or apply, Wire treats read and write actions as cataloged operations rather than leaving each caller to reconstruct the browser flow.

Even if you build this yourself, make write actions explicit in your codebase. Do not hide them behind a generic fetchPageData() helper.
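
A small way to enforce that split is to give reads and writes different wrappers with different retry rules. Treat this as a sketch of the idea; readAction and writeAction are hypothetical names, and the retry count is an arbitrary assumption:

```javascript
// Hypothetical wrappers: reads may retry, writes run at most once.
async function readAction(fn, attempts = 3) {
  let lastError;
  for (let i = 0; i < Math.max(1, attempts); i++) {
    try {
      return await fn();
    } catch (err) {
      // Retrying is acceptable here because a read does not change remote state.
      lastError = err;
    }
  }
  throw lastError;
}

async function writeAction(name, fn) {
  try {
    return await fn();
  } catch (err) {
    // Do not retry: the request may have reached the server before failing.
    throw new Error(
      `Write action "${name}" failed or timed out; verify remote state before retrying: ${err.message}`
    );
  }
}
```

With this in place, a call such as writeAction('submit_application', () => submit()) shows up in code review and in logs as a write, and nothing retries it by accident.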

What I would test before depending on network replay

For any target site, I would answer these questions before calling it production-ready:

  • Does the endpoint require an authenticated session?
  • How long do session cookies or tokens last?
  • Does the response schema stay stable across pages, accounts, and regions?
  • What status code appears when the site blocks or challenges the request?
  • Can the request be retried safely?
  • Do you have a fixture test that fails when the response shape changes?

A simple schema check catches a lot:

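function assertSearchResponse(body) {
  if (!body || !Array.isArray(body.items)) {
    throw new Error('Expected body.items to be an array');
  }

  // Check the type of every field the pipeline reads, not only its presence.
  for (const item of body.items) {
    if (!item || typeof item.id !== 'string') throw new Error('Expected item.id to be a string');
    if (typeof item.name !== 'string') throw new Error('Expected item.name to be a string');
    if (typeof item.price !== 'number') throw new Error('Expected item.price to be a number');
    if (typeof item.in_stock !== 'boolean') throw new Error('Expected item.in_stock to be a boolean');
  }
}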

Run that against real responses on a schedule. If the test fails before your customer-facing job fails, you bought yourself time.
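
A scheduled check can be a thin wrapper that fetches one real response, runs the validator, and reports the result instead of throwing. Here is a minimal sketch, assuming fetchBody() resolves to a parsed response body and validate(body) throws on a bad shape (for example assertSearchResponse). runContractCheck is a hypothetical name:

```javascript
// Hypothetical runner: turns a throwing validator into a result object
// that a scheduler or alerting hook can inspect.
async function runContractCheck(fetchBody, validate) {
  const checkedAt = new Date().toISOString();
  try {
    const body = await fetchBody();
    validate(body);
    return { ok: true, checkedAt };
  } catch (err) {
    // Network errors and schema errors both count as a broken contract.
    return { ok: false, checkedAt, reason: err.message };
  }
}
```

Run it from cron or any job scheduler you already use, and alert whenever ok is false. The reason field carries the validator's message, so the alert says which field moved.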

The practical decision

Use browser automation when you need to interact with a UI that has no useful network calls, or when visual state matters.

Use direct network replay when the site exposes structured background requests and you can tolerate maintaining an undocumented contract.

Use generated endpoints when speed matters more than long-term stability.

Use maintained integrations when the same action runs repeatedly, failure has business impact, or the workflow writes back to the target site.

A good next step is to pick one flaky browser scraper, inspect its XHR calls, replay the smallest useful request with curl, then add a schema test around the response. That will tell you quickly whether network replay reduces the problem or just moves the complexity somewhere else.
