Anakin

Posted on Jun 17

Choosing Between Scraping APIs, Browser Sessions, and Self-Managed Playwright

#playwright #webscraping #ai #architecture

Most teams do not start by designing scraping infrastructure. They start with a script. Then the script becomes a cron job, the cron job becomes a queue, and eventually someone is debugging why Chromium processes are eating all the memory on a worker node. At that point, the question is not “how do we scrape this page?” It is “which parts of this system do we actually want to own?”

The three common options

For continuously updated AI systems, most web data pipelines land in one of three patterns.

Pattern	Best fit	Main cost	Common failure
Async scraping API	Public pages, batch ingestion, RAG refreshes	Per job or per request	Vendor limits, polling complexity
Hosted browser sessions	Authenticated flows, multi-page navigation	Session time	Paying for idle browser minutes
Self-managed Playwright or Puppeteer	High volume, custom browser behavior	Engineering time and infrastructure	Proxy, retry, and browser maintenance

None of these is universally better. The right choice depends on volume, authentication, latency, and how much operational work your team can absorb.

Async APIs: good default for background ingestion

Async APIs work well when you can submit work and collect the result later. That maps cleanly to RAG ingestion, where you usually do not want a user request waiting on a live browser.

The shape is simple:

async function submitUrl(url) {
  const res = await fetch("https://scraper.example.com/jobs", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.SCRAPER_TOKEN}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({ url, render: true })
  });

  if (!res.ok) throw new Error(`submit failed: ${res.status}`);
  return res.json();
}

You store the returned job ID, poll until completion, validate the result, then update your database or vector store.
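The polling step could look roughly like the sketch below. The job shape (`status`, `result`, `error`), the status values `"done"` and `"failed"`, and the `GET /jobs/:id` endpoint are assumptions for illustration, not a specific vendor's API. The status fetcher is passed in as an argument so the loop can be exercised without a network call.

```javascript
// Sketch only: the job shape and status values are assumptions.
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function pollJob(jobId, fetchJob, opts = {}) {
  const { intervalMs = 1000, maxIntervalMs = 10000, timeoutMs = 60000 } = opts;
  const deadline = Date.now() + timeoutMs;
  let delay = intervalMs;

  while (Date.now() < deadline) {
    const job = await fetchJob(jobId);
    if (job.status === "done") return job.result;
    if (job.status === "failed") {
      throw new Error(`job ${jobId} failed: ${job.error ?? "unknown error"}`);
    }
    // Still pending: wait, then back off so long jobs do not hammer the API.
    await sleep(delay);
    delay = Math.min(delay * 2, maxIntervalMs);
  }
  throw new Error(`job ${jobId} timed out after ${timeoutMs} ms`);
}

// Hypothetical status fetcher for the same placeholder service as submitUrl.
async function fetchJobStatus(jobId) {
  const res = await fetch(`https://scraper.example.com/jobs/${jobId}`, {
    headers: { Authorization: `Bearer ${process.env.SCRAPER_TOKEN}` }
  });
  if (!res.ok) throw new Error(`status check failed: ${res.status}`);
  return res.json();
}
```

A call would then be `await pollJob(id, fetchJobStatus)`, where `id` is whatever identifier your provider returns from the submit call.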

The tradeoff is latency. If a job takes 3 to 15 seconds, that is acceptable for background refresh. It is not acceptable inside a chat completion path unless you stream an intermediate answer or tell the user you are fetching fresh data.

Async APIs also introduce rate limits. If the submit endpoint allows 60 requests per minute, firing every request at once will start failing as soon as the batch is larger than the limit:

await Promise.all(urls.map(url => submitUrl(url)));

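Use a concurrency limiter instead. Note that p-limit caps requests in flight, not requests per minute, so pick a concurrency that stays under the limit given how quickly the endpoint responds: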

import pLimit from "p-limit";

const limit = pLimit(5);

await Promise.all(
  urls.map(url => limit(() => submitUrl(url)))
);
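If the limit is expressed per minute, spacing out request start times tracks it more directly than a concurrency cap. Here is a minimal sketch, assuming a fixed-gap schedule is acceptable. `createRateLimiter` is a hypothetical helper written for this example and is not part of p-limit.

```javascript
// Sketch: space out call *start* times so that roughly `perMinute` calls
// begin per window. Timer jitter means this is approximate, so leave headroom.
function createRateLimiter(perMinute, windowMs = 60000) {
  const minGapMs = windowMs / perMinute;
  let nextSlot = 0; // earliest time the next call may start

  return function schedule(task) {
    const now = Date.now();
    const startAt = Math.max(now, nextSlot);
    nextSlot = startAt + minGapMs;

    return new Promise((resolve, reject) => {
      setTimeout(() => {
        // Run the task and forward its result or error to the caller.
        Promise.resolve().then(task).then(resolve, reject);
      }, startAt - now);
    });
  };
}

// Usage with submitUrl from earlier, assuming a 60 requests/minute limit:
// const schedule = createRateLimiter(60);
// await Promise.all(urls.map(url => schedule(() => submitUrl(url))));
```

The two approaches combine well: a rate limiter keeps you under the vendor's quota, while a concurrency cap protects your own memory and sockets.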

For teams choosing this path, Wire fits the async scraping API category: submit URLs, track job IDs, and consume structured page output without running your own browser pool.

Hosted browser sessions: use them when state matters

Some sources require login, cookies, local storage, or multi-step navigation. A stateless scrape request is a bad fit for that.

Example workflow:

Log in
Navigate to an account page
Open a report
Apply filters
Extract the rendered table

You can do this with a hosted browser session by connecting Playwright to a remote browser endpoint:

import { chromium } from "playwright";

// Attach to the provider's browser over CDP; the endpoint comes from the provider.
const browser = await chromium.connectOverCDP(process.env.REMOTE_BROWSER_WS);
const page = await browser.newPage();

// Log in with credentials read from the environment, never hard-coded.
await page.goto("https://example.com/login");
await page.fill("input[name=email]", process.env.APP_USER);
await page.fill("input[name=password]", process.env.APP_PASSWORD);
await page.click("button[type=submit]");

// Navigate with the logged-in session and apply a filter.
await page.goto("https://example.com/reports");
await page.click("text=Last 30 days");

// Read each rendered table row as text.
const rows = await page.locator("table tbody tr").evaluateAll(nodes =>
  nodes.map(row => row.innerText)
);

await browser.close();

The provider runs the browser. You control the interaction.

This is useful, but the billing model matters. Browser services often charge by session time. A script that waits on slow pages, sleeps between actions, or leaves sessions open after errors can become expensive.

Always close sessions in a finally block, so an error partway through does not leave a billed session open:

let browser;

try {
  browser = await chromium.connectOverCDP(process.env.REMOTE_BROWSER_WS);
  // work
} finally {
  await browser?.close();
}
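A finally block covers thrown errors, but not a step that simply hangs. Since billing is by session time, a hard deadline puts an upper bound on what one run can cost. Here is a minimal sketch: `withSessionDeadline` is a hypothetical helper, and `connect` is any function that returns a browser-like object with `close()`, for example `() => chromium.connectOverCDP(process.env.REMOTE_BROWSER_WS)`.

```javascript
// Sketch: run `work(browser)`, give up after `maxMs`, and always close.
async function withSessionDeadline(connect, work, maxMs) {
  let browser;
  let timer;

  try {
    browser = await connect();

    // The clock starts once the session is open.
    const deadline = new Promise((_, reject) => {
      timer = setTimeout(
        () => reject(new Error(`session exceeded ${maxMs} ms`)),
        maxMs
      );
    });

    // Whichever settles first wins: the work or the deadline.
    return await Promise.race([work(browser), deadline]);
  } finally {
    clearTimeout(timer);
    // Disconnect on every path. Whether disconnecting also stops billing
    // depends on the provider, so check their session lifecycle rules.
    await browser?.close();
  }
}
```

Note that the deadline only covers the work phase. A connection attempt that never returns would need its own timeout.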

Hosted browser sessions are a middle ground. You avoid maintaining browser hosts, but you still own workflow logic, selectors, validation, and credentials handling.

Self-managed Playwright: control is not free

Running Playwright yourself gives you the most control. You can install extensions, intercept network calls, tune launch flags, control geographies, and keep browser pools warm.

A basic worker looks harmless:

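import { chromium } from "playwright";

// Placeholder input; in practice this comes from your queue.
const urls = ["https://example.com"];

const browser = await chromium.launch({ headless: true });

for (const url of urls) {
  // One page per URL, closed in finally so failed pages do not pile up.
  const page = await browser.newPage();
  try {
    // "networkidle" waits for network activity to settle; pages that keep
    // connections open will run into the 30 s timeout instead.
    await page.goto(url, { waitUntil: "networkidle", timeout: 30000 });
    const title = await page.title();
    console.log({ url, title });
  } catch (err) {
    console.error(`failed ${url}:`, err.message);
  } finally {
    await page.close();
  }
}

await browser.close();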

Production adds the parts this example avoids:

browser pool sizing
queue backpressure
proxy rotation
retry budgets
per-domain rate limits
CAPTCHA and bot detection handling
memory leak monitoring
selector maintenance
alerting when extraction quality drops
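To give a sense of what one of these items involves, here is a minimal sketch of a retry budget with exponential backoff. `createRetryBudget` is a hypothetical helper for illustration. The idea is that every task gets a few attempts, but all tasks draw their retries from one shared pool, so an outage on the target site cannot multiply your load.

```javascript
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Sketch: a shared pool of retries across all tasks run through `run`.
function createRetryBudget(totalRetries) {
  let remaining = totalRetries;

  return async function run(task, { maxAttempts = 3, baseDelayMs = 500 } = {}) {
    let lastError;

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return await task(attempt);
      } catch (err) {
        lastError = err;
        const isLastAttempt = attempt === maxAttempts;
        // Stop when this task is out of attempts or the pool is empty.
        if (isLastAttempt || remaining <= 0) break;
        remaining -= 1;

        // Exponential backoff with jitter: between 50% and 100% of the delay.
        const delay = baseDelayMs * 2 ** (attempt - 1);
        await sleep(delay / 2 + (Math.random() * delay) / 2);
      }
    }
    throw lastError;
  };
}

// Usage (scrapePage is a placeholder for your own page worker):
// const run = createRetryBudget(100);
// await run(() => scrapePage(url));
```

A real deployment would also refill the budget over time and track it per domain. This version only shows the accounting.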

The common mistake is comparing API credits to server cost only. That ignores engineering time. If a site redesign breaks ten selectors and one developer spends a day fixing parsers, that cost belongs in the scraping budget.

Self-managed starts to make sense when volume is high enough, the workflows need custom behavior, and the team already has operational capacity. As a rough heuristic, below hundreds of thousands of pages per month, managed options often cost less once maintenance is included. Above that, run the numbers with your actual failure rate and developer time.
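Running the numbers can be as simple as one function applied to each option. This is a sketch under stated assumptions. Every input is a placeholder for your own measurements, and it assumes failed pages are retried until they succeed, with independent failures, so the expected number of attempts per page is 1 / (1 - failureRate).

```javascript
// Sketch: estimated monthly cost of one option.
function monthlyCost({
  pages,          // pages you need per month
  failureRate,    // fraction of attempts that fail, 0 <= failureRate < 1
  costPerRequest, // API credit price, or marginal proxy/compute cost per attempt
  fixedInfra,     // servers, proxies, monitoring per month
  devHours,       // maintenance hours per month (selectors, retries, incidents)
  devHourlyRate   // loaded cost of one developer hour
}) {
  if (failureRate < 0 || failureRate >= 1) {
    throw new RangeError("failureRate must be in [0, 1)");
  }
  // Expected attempts, counting retries of failed pages.
  const attempts = pages / (1 - failureRate);
  return attempts * costPerRequest + fixedInfra + devHours * devHourlyRate;
}

// Usage: fill in your own figures for each option and compare.
// const managed = monthlyCost({ ...yourManagedNumbers });
// const selfRun = monthlyCost({ ...yourSelfManagedNumbers });
```

Whether failed requests are billed at all varies by vendor, so adjust the `attempts` term to match your contract.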

A practical decision process

Ask these questions in order.

Does the source require authentication or multi-step state?

If yes, start with hosted browser sessions or self-managed Playwright. Async single-page extraction can work only if you can provide valid session state safely and the page does not need interaction.

Does the user need the data during the current request?

If yes, avoid long async scraping jobs in the hot path. Use cached content, stream progress, or design the product around delayed retrieval. If no, async jobs are usually simpler.
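One way to keep scraping out of the hot path is to serve whatever is cached right away and refresh stale entries in the background. Here is a minimal sketch. `createFreshnessCache` is a hypothetical helper, the in-memory `Map` stands in for your database or vector store, and `refresh` is whatever function submits and collects a scrape job.

```javascript
// Sketch: serve cached content immediately, refresh stale entries off-path.
function createFreshnessCache(refresh, maxAgeMs) {
  const entries = new Map();  // url -> { value, fetchedAt }
  const inFlight = new Map(); // url -> Promise, so refreshes are not duplicated

  function refreshInBackground(url) {
    if (inFlight.has(url)) return inFlight.get(url);

    const pending = Promise.resolve()
      .then(() => refresh(url))
      .then(value => {
        entries.set(url, { value, fetchedAt: Date.now() });
        return value;
      })
      .finally(() => inFlight.delete(url));

    inFlight.set(url, pending);
    return pending;
  }

  return {
    // Never waits on a scrape: returns the cached value (or null) right away.
    get(url) {
      const entry = entries.get(url);
      const stale = !entry || Date.now() - entry.fetchedAt > maxAgeMs;
      if (stale) {
        // Keep the old value if the refresh fails; log this in real code.
        refreshInBackground(url).catch(() => {});
      }
      return { value: entry ? entry.value : null, stale };
    },
    refresh: refreshInBackground
  };
}
```

The `stale` flag lets the product decide what to do: answer from the old copy, tell the user fresh data is on the way, or both.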

Do you need custom browser behavior?

If you need extensions, custom TLS behavior, request interception, or nonstandard anti-bot handling, self-managed may be justified. If not, owning that stack is probably unnecessary.

Can you tolerate selector maintenance?

If your extraction depends on CSS selectors, treat redesigns as expected events. Add validation and alerts for empty fields. A scraper that returns 200 OK with empty extracted data has still failed.

if (!result.price || !result.name) {
  await alerts.send({
    source: url,
    reason: "extraction returned missing required fields"
  });
}

The short version

Use async APIs for background ingestion from public pages. Use hosted browser sessions when login and navigation state matter. Use self-managed Playwright when you need control badly enough to pay for it with infrastructure and maintenance.

Do not decide from request volume alone. Include failure rates, selector churn, retry behavior, developer time, and whether stale data causes real product problems.

The full breakdown is here if you want the complete picture.

DEV Community