DEV Community

Anakin

Before you automate a browser, check the network tab

You need data from a website, so the first instinct is often Playwright, Puppeteer, Selenium, or an AI browser agent. That works, but it is easy to miss something important: the browser probably did not create the data you want. It fetched it from an internal API and rendered it.

Open DevTools on a product page, job board, travel site, or finance dashboard. Filter the Network tab to Fetch/XHR and reload the page. Much of the data you are after is often already there as JSON.

The browser is often the expensive part

A browser automation script has real costs:

  • Chromium has to start or be kept warm
  • The page has to load assets you may not care about
  • Selectors break when the UI changes
  • Timeouts happen when ads, modals, or lazy loading behave differently
  • If an LLM agent drives the browser, every step may involve screenshots, DOM snapshots, and model calls

For example, this Playwright script can extract product data from rendered HTML:

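import { chromium } from "playwright";

const browser = await chromium.launch();
try {
  const page = await browser.newPage();

  // Wait until the page has gone quiet on the network before reading the DOM.
  await page.goto("https://example.com/products/123", {
    waitUntil: "networkidle",
    timeout: 30000,
  });

  // Both values are scraped from rendered HTML, so they depend on the markup.
  const price = await page.locator("[data-testid='price']").innerText();
  const title = await page.locator("h1").innerText();

  console.log({ title, price });
} finally {
  // Close the browser even when a locator times out.
  await browser.close();
}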

This is fine until the site renames the data-testid attribute, renders the price after a delayed client-side request, or shows a cookie modal that covers the page. The failure usually looks boring:

locator.innerText: Timeout 30000ms exceeded.
Call log:
  - waiting for locator('[data-testid="price"]')

That error tells you the selector failed, not whether the data was unavailable. The JSON endpoint may still have returned the price correctly.
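One way to check that, before rewriting anything, is to have the same Playwright session report the JSON responses the page received. The sketch below is a minimal example under assumptions: `watchJsonResponses`, `isJsonContentType`, and the `onJson` callback are hypothetical names, not part of Playwright, and it assumes the endpoint sends a JSON content type.

```javascript
// Returns true when a response's headers indicate a JSON body.
// Playwright's response.headers() returns lower-cased header names.
function isJsonContentType(headers) {
  const type = headers["content-type"] ?? "";
  return type.includes("json");
}

// Registers a listener that reports every JSON response the page receives.
// `page` is a Playwright Page; `onJson` is a hypothetical callback you supply.
function watchJsonResponses(page, onJson) {
  page.on("response", async (response) => {
    if (!isJsonContentType(response.headers())) return;

    let body;
    try {
      body = await response.json();
    } catch {
      // Body was empty, not parsable as JSON, or no longer available; skip it.
      return;
    }

    onJson({ url: response.url(), status: response.status(), body });
  });
}
```

Call `watchJsonResponses(page, console.log)` right after `browser.newPage()` and before `page.goto(...)`, so the listener is attached before the first request fires. If the price shows up in that output, the selector is the problem, not the data.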

Start by finding the data source

Before writing browser automation, capture the request that populated the page.

In Chrome DevTools:

  1. Open the Network tab
  2. Filter by Fetch/XHR
  3. Reload the page
  4. Click the requests that look likely to carry the data
  5. Check Response and Preview for structured JSON
  6. Right-click the request and choose Copy, then Copy as cURL

You might end up with something like this:

curl 'https://www.example.com/api/products/123?currency=USD' \
  -H 'accept: application/json' \
  -H 'user-agent: Mozilla/5.0' \
  -H 'x-client-version: web-2026.06.1'

If that works outside the browser, you can usually replace a browser job with a normal HTTP client:

const res = await fetch("https://www.example.com/api/products/123?currency=USD", {
  headers: {
    accept: "application/json",
    "user-agent": "Mozilla/5.0",
    "x-client-version": "web-2026.06.1",
  },
});

if (!res.ok) {
  throw new Error(`HTTP ${res.status}: ${await res.text()}`);
}

const product = await res.json();
console.log(product.price, product.availability);

That is the clean path: less infrastructure, lower latency, fewer moving parts, and easier retries.
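To make the "easier retries" point concrete, here is a rough sketch of a retry wrapper around `fetch`. The function names, the choice of retryable statuses (429 and 5xx), and the backoff numbers are all assumptions for illustration; tune them to the site you are calling.

```javascript
// Retry only on statuses that are usually transient (assumption: 429 and 5xx).
function shouldRetry(status) {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff: 500 ms, 1000 ms, 2000 ms, ... capped at 8000 ms.
function backoffDelayMs(attempt, baseMs = 500, capMs = 8000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Fetches JSON, retrying transient HTTP failures up to `maxAttempts` times.
// Network-level errors (fetch rejecting) are not retried in this sketch.
async function fetchJsonWithRetry(url, options = {}, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url, options);
    if (res.ok) return res.json();

    const isLastAttempt = attempt === maxAttempts - 1;
    if (!shouldRetry(res.status) || isLastAttempt) {
      throw new Error(`HTTP ${res.status}: ${await res.text()}`);
    }

    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
  }
}
```

A retry here costs one more HTTP request. The equivalent in a browser job usually means another page load, and sometimes another browser launch.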

For teams that want maintained access to these private website APIs instead of managing request signatures themselves, Wire exposes cataloged site actions as normal REST calls with structured JSON responses.

Where direct HTTP gets messy

The copied cURL request is often not enough in production.

Many sites include values that change per session or per request:

  • CSRF tokens
  • signed query parameters
  • device or browser fingerprints
  • short-lived session cookies
  • nonces that prevent replay
  • GraphQL operation hashes

The failure mode is usually a 401, 403, or a JSON response that looks valid but contains an application-level denial:

{
  "error": "invalid_signature",
  "message": "request signature expired"
}

This is where a direct network approach stops being just “copy as cURL”. You need to understand how the frontend generates those headers and tokens. Sometimes that is straightforward. Sometimes the signing code changes weekly.
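Because a denial can arrive with a 2xx status, it helps to check the body as well as the status code. The following is a sketch only: `classifyResponse` is a hypothetical helper, and the `error` field is taken from the sample payload above, so other sites will likely use a different shape.

```javascript
// Classifies a response as "ok", "denied", or "http_error".
// Assumes denials appear as 401/403 or as a string `error` field in the JSON body.
function classifyResponse(status, body) {
  if (status === 401 || status === 403) return "denied";
  if (status < 200 || status >= 300) return "http_error";

  // A 2xx response can still carry a denial inside the JSON body.
  if (body !== null && typeof body === "object" && typeof body.error === "string") {
    return "denied";
  }
  return "ok";
}
```

Counting `"denied"` results separately from other errors gives an early signal that a token or signature has stopped working, rather than that the site is down.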

For supported sites, Wire maintains stable action identifiers while handling endpoint changes, signatures, and authenticated session details behind the API boundary.

If you build this yourself, treat the integration like any other external dependency. Add contract tests against the response shape. Alert on 403 spikes. Store HAR files for debugging. Do not assume a private endpoint is stable just because it returned JSON today.
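A contract test does not need a framework to be useful. The sketch below checks a payload against a small table of expected field types. The `price` and `availability` fields come from the earlier fetch example, but their types here are guesses, and `PRODUCT_CONTRACT` and `contractViolations` are hypothetical names.

```javascript
// Expected fields and their JavaScript types. Hypothetical; adjust per endpoint.
const PRODUCT_CONTRACT = { price: "number", availability: "string" };

// Returns a list of human-readable problems; an empty list means the shape matches.
function contractViolations(payload, contract = PRODUCT_CONTRACT) {
  if (payload === null || typeof payload !== "object") {
    return ["payload is not an object"];
  }

  const problems = [];
  for (const [field, expectedType] of Object.entries(contract)) {
    const actualType = typeof payload[field];
    if (actualType !== expectedType) {
      problems.push(`${field}: expected ${expectedType}, got ${actualType}`);
    }
  }
  return problems;
}
```

Running a check like this on a schedule against a known product means a silent change, such as the price becoming a string, surfaces as a failed test instead of bad data downstream.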

When a browser is still the right tool

Direct HTTP is not always better. It is better when the data already exists as a request you can reproduce reliably.

Use browser automation when the task depends on the visual state of the page:

  • multi-page forms with conditional fields
  • portals that require interactive navigation
  • flows where validation messages affect the next action
  • sites that render data only after complex client-side state changes
  • tasks where a human would say “click the second matching result”

This is where tools like Skyvern make sense. Skyvern runs Chromium, observes screenshots and DOM, asks an LLM what to do, and executes actions through Playwright. Its cached runs can replay generated Playwright scripts, which removes the LLM from successful repeat paths, but it still runs a browser.

That distinction matters. A cached browser workflow may avoid model latency, but it still pays for page load, browser memory, navigation, and UI fragility. A live AI browser agent can adapt to UI changes, but it may take many seconds per step and can fail in ways that are harder to reproduce than a bad HTTP response.

A practical decision rule

Use this order:

  1. Check the Network tab first
  2. If the data is available as JSON, try to reproduce the request with curl
  3. If the request needs tokens or signatures, decide whether maintaining that logic is worth it
  4. If the workflow is visual, stateful, or form-heavy, use browser automation
  5. If volume is high, avoid browsers unless there is no reliable network-layer option

The next time you start a scraping or automation task, spend 10 minutes capturing a HAR file before opening Playwright. If you can turn the target action into one authenticated HTTP request with a testable JSON schema, do that first. If you cannot, then reach for the browser with a clearer reason.
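Once a HAR file is saved from the Network tab, a few lines of code can list the JSON responses in it. This is a rough sketch that assumes the standard HAR 1.2 layout (`log.entries[]`); `listJsonRequests` is a hypothetical helper name.

```javascript
// Lists the JSON responses recorded in a parsed HAR object (HAR 1.2 layout).
function listJsonRequests(har) {
  return har.log.entries
    .filter((entry) => (entry.response.content.mimeType ?? "").includes("json"))
    .map((entry) => ({
      method: entry.request.method,
      url: entry.request.url,
      status: entry.response.status,
      bytes: entry.response.content.size,
    }));
}
```

Pass it the parsed file contents, for example the result of `JSON.parse` on the text of a capture saved as `capture.har` (a placeholder filename). The largest JSON responses are often the ones that populate the page.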
