DEV Community

Massi

Puppeteer networkidle is not a scraping strategy

The most suspicious line in a lot of Puppeteer and Playwright scrapers is not page.goto.

It is this (Playwright's spelling; Puppeteer calls the equivalent options networkidle0 and networkidle2):

await page.goto(url, { waitUntil: "networkidle" })

It looks responsible.

It feels like you are saying:

Wait until the page is done.
Then scrape it.

The problem is that modern web pages are rarely "done."

Analytics keeps firing.

Personalization keeps polling.

Reviews load after the product.

Inventory loads after the variant.

Ads never stop.

Chat widgets wake up late.

Some pages go idle before the content you need appears.

Some never go idle at all.

So your scraper either waits too little, waits too long, or times out for reasons unrelated to the data you wanted.

That is not a scraping strategy.

That is outsourcing correctness to background noise.

I am building Webclaw, a web extraction API, CLI, and MCP server for AI agents and LLM apps. The rule I trust more is:

Do not wait for silence.
Wait for evidence.

Network idle is a browser lifecycle hint

networkidle is useful for some browser automation tasks.

It is not useless.

The problem is treating it like proof that the page contains the thing you need.

It does not mean:

the article body exists
the product price exists
the JSON-LD exists
the review count exists
the SPA finished hydrating
the page is not blocked
the extracted markdown is good

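It only means the browser saw a quiet enough network window according to that tool's definition.

In Playwright, networkidle means no network connections for at least 500 ms. Puppeteer's networkidle0 means the same, and networkidle2 allows up to two connections during that window.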

That is a very different claim.

For scraping, the readiness condition should come from the target data, not from the browser feeling calm.

The four bad waits

I see these patterns a lot.

1. Wait for network idle

await page.goto(url, { waitUntil: "networkidle" })
const html = await page.content()

This can hang on noisy pages and still miss content on pages that load important data late.

2. Wait a fixed number of seconds

await page.waitForTimeout(5000)

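This is honest but crude.

Recent Puppeteer versions removed page.waitForTimeout entirely, and Playwright's docs discourage it in production code.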

It passes locally, fails in production, and gets worse when latency changes.

3. Wait for any selector

await page.waitForSelector("main")

This is better, but still weak.

main can exist before the content exists.

#root can exist while the app is still empty.

.product can exist as a skeleton loader.
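
A selector wait gets stronger once it also inspects what the element contains. The sketch below is one way to do that for a price, assuming Puppeteer's page.waitForFunction(pageFunction, options, ...args) signature. The names PRICE_PATTERN, looksLikePrice, and waitForPriceText are hypothetical, and the currency pattern is only an illustration.

```javascript
// Illustrative pattern (an assumption): a currency symbol followed by digits,
// or digits followed by a currency code. Adjust per site and locale.
const PRICE_PATTERN =
  "(?:[$€£]\\s?\\d[\\d.,]*)|(?:\\d[\\d.,]*\\s?(?:USD|EUR|GBP))";

// Pure check: does this text look like a real price rather than a placeholder?
function looksLikePrice(text) {
  return new RegExp(PRICE_PATTERN).test(text || "");
}

// Hypothetical helper: resolves once `selector` holds price-like text.
// The page function is serialized and run inside the browser, so it cannot
// close over looksLikePrice; the pattern is passed in as an argument instead.
async function waitForPriceText(page, selector, timeoutMs = 15000) {
  return page.waitForFunction(
    (sel, pattern) => {
      const el = document.querySelector(sel);
      return Boolean(el) && new RegExp(pattern).test(el.textContent || "");
    },
    { timeout: timeoutMs },
    selector,
    PRICE_PATTERN
  );
}
```

A skeleton loader that renders an empty .price element never satisfies this check, so the wait keeps going until real text arrives or the timeout fires.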

4. Wait for DOMContentLoaded

await page.goto(url, { waitUntil: "domcontentloaded" })

This only tells you the initial document was parsed.

For a client-rendered app, that might be exactly the moment before anything useful happens.

What to wait for instead

You want evidence that the target content exists.

For an article:

headline exists
body has enough text
author or date is present
article schema is present
navigation is not most of the output

For a product page:

title exists
price exists
availability exists
variant data exists
product JSON-LD exists
reviews or rating exists when expected

For docs:

main heading exists
section headings exist
code blocks exist
body text crosses a minimum token count

That readiness check can be a selector, but it should not be just any selector.

It should be tied to the data you came for.
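
The checklists above can be written down as data, so a scraper can report which evidence is still missing instead of returning a bare true or false. This is a minimal sketch: the snapshot field names (headline, bodyText, priceText, and so on) and the 200-word threshold are assumptions, not part of any library.

```javascript
// Counts whitespace-separated words in a string.
function countWords(text) {
  return (text || "").split(/\s+/).filter(Boolean).length;
}

// Each check takes a plain snapshot object collected from the page
// (field names are hypothetical) and returns true when the evidence exists.
const EVIDENCE = {
  article: {
    headline: (s) => Boolean(s.headline && s.headline.trim()),
    body: (s) => countWords(s.bodyText) >= 200, // threshold is an assumption
    byline: (s) => Boolean(s.author || s.publishedAt),
  },
  product: {
    title: (s) => Boolean(s.title && s.title.trim()),
    price: (s) => /\d/.test(s.priceText || ""),
    availability: (s) => Boolean(s.availability),
  },
};

// Returns the names of the checks that are still failing for this page kind.
function missingEvidence(kind, snapshot) {
  const checks = EVIDENCE[kind];
  if (!checks) throw new Error(`unknown page kind: ${kind}`);
  return Object.entries(checks)
    .filter(([, check]) => !check(snapshot))
    .map(([name]) => name);
}
```

An empty result means the page is ready for extraction. A non-empty result doubles as a debugging message when the wait times out.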

Rendering should be a fallback

Before launching a browser, ask a cheaper question:

Does the initial HTML already contain the content?

If yes, extract it.

If no, classify why.

Useful categories:

blocked response
empty app shell
hydration payload with data
hydration payload without data
bad extraction rule
client-side data dependency

Only one of those clearly says:

render this page

That distinction matters.

If the HTML contains the article, you do not need Chrome.

If the JSON-LD contains the product data, you do not need Chrome.

If the page is blocked, Chrome may not be the first fix.

If the extractor missed the content, rendering gives you a more expensive failure.
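
A rough version of that pre-render question can be answered from the raw response alone. The sketch below is a crude heuristic, not how Webclaw does it: the function names, the status codes treated as blocks, the 150-word threshold, and the hydration markers are all assumptions. It also cannot detect a bad extraction rule or a client-side data dependency, which need the extractor's output or network traces.

```javascript
// Collects every JSON-LD @type found in raw HTML. Regex-based for brevity;
// a real HTML parser is more robust.
function jsonLdTypes(html) {
  const types = [];
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    try {
      const parsed = JSON.parse(match[1]);
      const nodes = Array.isArray(parsed) ? parsed : parsed["@graph"] || [parsed];
      for (const node of nodes) types.push(...[].concat(node["@type"] || []));
    } catch {
      // Malformed JSON-LD is ignored rather than treated as evidence.
    }
  }
  return types;
}

// Strips scripts, styles, and tags to approximate the text a reader would see.
function visibleText(html) {
  return html
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Returns one coarse label for the raw response.
function classifyResponse(status, html) {
  if ([401, 403, 429, 503].includes(status)) return "blocked";
  const types = jsonLdTypes(html);
  if (types.includes("Product") || types.includes("Article")) return "structured-data";
  const words = visibleText(html).split(" ").filter(Boolean).length;
  if (words >= 150) return "content-in-html";
  if (/captcha|access denied/i.test(html)) return "blocked";
  if (/__NEXT_DATA__|__NUXT__|__INITIAL_STATE__/.test(html)) return "hydration-payload";
  return "empty-shell";
}
```

Only the last two labels point toward rendering, and a hydration payload is worth parsing directly before a browser is launched.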

This is the longer Webclaw version of the idea:

JavaScript Rendering API for Web Scraping: when browser fallback is actually needed

A better rendering loop

The flow I prefer looks like this:

fetch raw HTML
classify the response
extract available content
score extraction quality
render only if needed
wait for target evidence
extract again
return clean markdown or JSON

When rendering is required, do not wait for generic browser quiet.

Wait for specific evidence:

price text is present
article body crosses 500 words
JSON-LD includes Product or Article
expected API response arrived
skeleton loader disappeared
main content hash stopped changing

That is slower to design than networkidle.

It is also much closer to correctness.
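
The flow can be sketched as a single function with its moving parts injected. Everything here is hypothetical: fetchHtml, extract, score, and renderHtml stand in for whatever HTTP client, extractor, quality scorer, and browser wrapper a project already has, and the 0.6 threshold is an arbitrary default.

```javascript
// Injected dependencies (all hypothetical):
//   fetchHtml(url)  -> Promise<{ status, html }>
//   extract(html)   -> extraction result, for example { markdown }
//   score(result)   -> number in [0, 1]
//   renderHtml(url) -> Promise<string>, expected to wait for target evidence
async function scrape(url, { fetchHtml, extract, score, renderHtml, minScore = 0.6 }) {
  const { status, html } = await fetchHtml(url);
  if (status >= 400) {
    // A blocked or failed response is not a rendering problem.
    return { ok: false, reason: `http-${status}`, rendered: false };
  }

  // Cheap path: extract from the raw HTML and trust it only if it scores well.
  const first = extract(html);
  const firstScore = score(first);
  if (firstScore >= minScore) {
    return { ok: true, result: first, score: firstScore, rendered: false };
  }

  // Fallback: render, then extract and score again before trusting the output.
  const second = extract(await renderHtml(url));
  const secondScore = score(second);
  const ok = secondScore >= minScore;
  return {
    ok,
    result: second,
    score: secondScore,
    rendered: true,
    reason: ok ? undefined : "low-quality-after-render",
  };
}
```

Keeping the rendered flag and the reason in the result makes it possible to tell afterwards whether a failure came from a block, a weak extraction, or a render that never produced the content.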

The content-quality check is the missing piece

Even after rendering, you still need to check the output.

The browser can finish.

The DOM can exist.

The extractor can run.

The result can still be trash.

Examples:

80 percent navigation
cookie banner text
empty product fields
missing article body
duplicate footer links
loading skeleton text

This is why a web extraction system should score the cleaned output before returning it as success.

For LLM apps, this matters even more.

Bad context does not just sit in a database.

It becomes an answer, a tool call, a summary, or a RAG chunk.
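
A first pass at that score does not need a model. The sketch below flags a few of the failure modes listed above on extracted markdown. The function name, the phrases treated as placeholder text, and every threshold are assumptions to tune against real pages.

```javascript
// Phrases that usually indicate placeholder or consent text (illustrative list).
const SKELETON_HINTS = /\b(loading|please wait|accept all cookies|enable javascript)\b/i;

// Inspects extracted markdown and reports why it might not be trustworthy.
function qualityReport(markdown) {
  const lines = markdown.split("\n").map((l) => l.trim()).filter(Boolean);
  const words = markdown.split(/\s+/).filter(Boolean).length;

  // Lines that are only a markdown link (optionally a list item) look like navigation.
  const linkOnly = lines.filter((l) => /^([-*]\s*)?\[[^\]]*\]\([^)]*\)$/.test(l)).length;
  const linkLineRatio = lines.length ? linkOnly / lines.length : 1;

  // Repeated lines often come from duplicated headers and footers.
  const duplicateLines = lines.length - new Set(lines).size;

  const problems = [];
  if (words < 100) problems.push("too-short");
  if (linkLineRatio > 0.5) problems.push("mostly-links");
  if (words < 100 && SKELETON_HINTS.test(markdown)) problems.push("placeholder-text");
  if (duplicateLines > lines.length * 0.3) problems.push("duplicated-lines");

  return { words, linkLineRatio, duplicateLines, problems, ok: problems.length === 0 };
}
```

Returning the list of problems, not just a number, keeps the failure debuggable: "mostly-links" points at the extractor, while "placeholder-text" points at the wait condition.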

Where anti-bot fits in

JavaScript rendering and anti-bot handling overlap, but they are not the same problem.

A page can be empty because it needs JavaScript.

A page can be empty because it is blocked.

A page can be empty because your extractor is wrong.

If you collapse all three into "use browser," you lose the ability to debug.

The API should hide the decision

From the outside, a scraping API should not make the caller guess:

should I pass render=true?
should I wait for network idle?
should I retry in Chrome?
should I parse JSON-LD instead?
should I call the XHR endpoint?

The API should classify the page and choose the cheapest correct path.

In Webclaw, the user-facing call stays boring:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

If the initial response is enough, return clean markdown.

If the page needs rendering, render.

If the user wants typed fields, extract them with a schema.

The rule

Do not wait for the page to be done.

Modern pages are not done.

Wait for the content you need.

Then verify the output you are about to trust.

That one change makes browser scraping less magical and a lot easier to reason about.
