Massi

Posted on May 23

Puppeteer networkidle is not a scraping strategy

#webscraping #javascript #puppeteer #playwright

The most suspicious line in a lot of Puppeteer and Playwright scrapers is not page.goto.

It is this:

// Playwright spelling; Puppeteer uses "networkidle0" or "networkidle2"
await page.goto(url, { waitUntil: "networkidle" })

It looks responsible.

It feels like you are saying:

Wait until the page is done.
Then scrape it.

The problem is that modern web pages are rarely "done."

Analytics keeps firing.

Personalization keeps polling.

Reviews load after the product.

Inventory loads after the variant.

Ads never stop.

Chat widgets wake up late.

Some pages go idle before the content you need appears.

Some never go idle at all.

So your scraper either waits too little, waits too long, or times out for reasons unrelated to the data you wanted.

That is not a scraping strategy.

That is outsourcing correctness to background noise.

I am building Webclaw, a web extraction API, CLI, and MCP server for AI agents and LLM apps. The rule I trust more is:

Do not wait for silence.
Wait for evidence.

Network idle is a browser lifecycle hint

networkidle is useful for some browser automation tasks.

It is not useless.

The problem is treating it like proof that the page contains the thing you need.

It does not mean:

the article body exists
the product price exists
the JSON-LD exists
the review count exists
the SPA finished hydrating
the page is not blocked
the extracted markdown is good

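It only means the browser saw a quiet enough network window according to that tool's definition: in Playwright, no network connections for at least 500 ms; in Puppeteer, no connections (networkidle0) or at most two (networkidle2) for the same window.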

That is a very different claim.

For scraping, the readiness condition should come from the target data, not from the browser feeling calm.

The four bad waits

I see these patterns a lot.

1. Wait for network idle

await page.goto(url, { waitUntil: "networkidle" })
const html = await page.content()

This can hang on noisy pages and still miss content on pages that load important data late.

2. Wait a fixed number of seconds

await page.waitForTimeout(5000) // removed in Puppeteer v22, discouraged in Playwright

This is honest but crude.

It passes locally, fails in production, and gets worse when latency changes.

3. Wait for any selector

await page.waitForSelector("main")

This is better, but still weak.

main can exist before the content exists.

#root can exist while the app is still empty.

.product can exist as a skeleton loader.

4. Wait for DOMContentLoaded

await page.goto(url, { waitUntil: "domcontentloaded" })

This only tells you the initial document was parsed.

For a client-rendered app, that might be exactly the moment before anything useful happens.

What to wait for instead

You want evidence that the target content exists.

For an article:

headline exists
body has enough text
author or date is present
article schema is present
navigation is not most of the output
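The article checklist above can be sketched as a small pure function. This is a hypothetical helper, not something from Puppeteer or Webclaw: the field names, the 200-word threshold, and the choice of which signals are required are all assumptions to tune per site.

```javascript
// Hypothetical evidence check for an article page.
// Input is whatever your extractor already pulled out of the DOM or HTML.
function countWords(text) {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function articleEvidence({
  headline = "",
  bodyText = "",
  navText = "",
  byline = "",
  hasArticleSchema = false,
} = {}) {
  const bodyWords = countWords(bodyText);
  const navWords = countWords(navText);

  const checks = {
    headline: headline.trim().length > 0,
    body: bodyWords >= 200, // assumed threshold
    bylineOrDate: byline.trim().length > 0,
    articleSchema: hasArticleSchema,
    navigationIsMinority: navWords < bodyWords,
  };

  // Assumption: three signals are required, the rest are reported for debugging.
  const ready = checks.headline && checks.body && checks.navigationIsMinority;
  return { ready, checks };
}
```

Returning the individual checks alongside `ready` makes a failed wait explainable: you can log which piece of evidence never showed up.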

For a product page:

title exists
price exists
availability exists
variant data exists
product JSON-LD exists
reviews or rating exists when expected

For docs:

main heading exists
section headings exist
code blocks exist
body text crosses a minimum token count

That readiness check can be a selector, but it should not be just any selector.

It should be tied to the data you came for.
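A minimal sketch of a selector wait tied to the data, assuming Puppeteer and a hypothetical `.product .price` selector. The skeleton class names are guesses. The predicate runs inside the page, so it cannot use anything from Node scope.

```javascript
// Runs in the browser context. "Ready" means the element holds a number,
// not just that the node exists.
function priceIsReady(selector) {
  const el = document.querySelector(selector);
  if (!el) return false;

  // Assumed skeleton markers; adjust to the target site's class names.
  const classes = el.getAttribute("class") || "";
  if (/skeleton|placeholder|loading/i.test(classes)) return false;

  const text = (el.textContent || "").trim();
  return /\d/.test(text);
}

// Puppeteer signature: waitForFunction(pageFunction, options, ...args).
async function waitForPrice(page, selector = ".product .price", timeout = 15000) {
  return page.waitForFunction(priceIsReady, { timeout }, selector);
}
```

Playwright takes the arguments in a different order, `page.waitForFunction(pageFunction, arg, options)`, so the last line needs adjusting there.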

Rendering should be a fallback

Before launching a browser, ask a cheaper question:

Does the initial HTML already contain the content?

If yes, extract it.

If no, classify why.

Useful categories:

blocked response
empty app shell
hydration payload with data
hydration payload without data
bad extraction rule
client-side data dependency

Only one of those clearly says:

render this page

That distinction matters.
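Here is a rough sketch of classifying a raw response before any browser is involved. It only covers the categories visible from status and HTML. "Bad extraction rule" and "client-side data dependency" need extractor output or network traces, so they are left out. The 150-word threshold, the challenge phrases, and the use of Next.js's `__NEXT_DATA__` script as the hydration payload are assumptions.

```javascript
// Strip scripts, styles, and tags to approximate the visible text.
function visibleText(html) {
  return html
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical heuristic classifier for a raw HTTP response.
function classifyResponse({ status, html }) {
  if (status === 403 || status === 429) return "blocked";

  const text = visibleText(html);
  const wordCount = text ? text.split(" ").length : 0;

  // Only treat challenge phrases as a block when the page is otherwise thin.
  if (wordCount < 150 && /captcha|access denied|verify you are human/i.test(text)) {
    return "blocked";
  }
  if (wordCount >= 150) return "content_in_html";

  const match = html.match(
    /<script[^>]*id=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (match) {
    try {
      const pageProps = JSON.parse(match[1])?.props?.pageProps ?? {};
      return Object.keys(pageProps).length > 0
        ? "hydration_with_data"
        : "hydration_without_data";
    } catch {
      return "hydration_without_data";
    }
  }
  return "empty_app_shell";
}
```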

If the HTML contains the article, you do not need Chrome.

If the JSON-LD contains the product data, you do not need Chrome.

If the page is blocked, Chrome may not be the first fix.

If the extractor missed the content, rendering gives you a more expensive failure.
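The JSON-LD case is cheap to check from raw HTML. This is a sketch with hypothetical helper names. It uses a regex to find the script blocks, which is a simplification; a real HTML parser is sturdier.

```javascript
// Collect every JSON-LD node from raw HTML, flattening arrays and @graph.
function extractJsonLd(html) {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const nodes = [];

  for (const match of html.matchAll(pattern)) {
    try {
      const parsed = JSON.parse(match[1]);
      const items = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (item && Array.isArray(item["@graph"])) nodes.push(...item["@graph"]);
        else nodes.push(item);
      }
    } catch {
      // Malformed block: skip it rather than fail the whole page.
    }
  }
  return nodes.filter((node) => node && typeof node === "object");
}

// "@type" may be a string or an array of strings.
function findByType(nodes, type) {
  return nodes.find((node) => [].concat(node["@type"] ?? []).includes(type)) ?? null;
}
```

If `findByType(extractJsonLd(html), "Product")` returns a node with the fields you need, the browser never has to start.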

This is the longer Webclaw version of the idea:

JavaScript Rendering API for Web Scraping: when browser fallback is actually needed

A better rendering loop

The flow I prefer looks like this:

fetch raw HTML
classify the response
extract available content
score extraction quality
render only if needed
wait for target evidence
extract again
return clean markdown or JSON
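The flow above can be sketched as one function with the pieces passed in. Everything here is hypothetical: the dependency names, the result shape, and the 0.6 score threshold are assumptions, and this is not how Webclaw is implemented.

```javascript
// deps: { fetchHtml, classify, extract, score, renderAndWait }
// renderAndWait is expected to render AND wait for target evidence.
async function scrapeWithFallback(url, deps, minScore = 0.6) {
  const { fetchHtml, classify, extract, score, renderAndWait } = deps;

  const response = await fetchHtml(url); // fetch raw HTML
  const category = classify(response); // classify the response
  if (category === "blocked") {
    // A browser is not automatically the fix for a block.
    return { ok: false, reason: "blocked", rendered: false };
  }

  const first = extract(response.html); // extract available content
  if (score(first) >= minScore) {
    // Good enough without a browser.
    return { ok: true, rendered: false, content: first };
  }

  const renderedHtml = await renderAndWait(url); // render only if needed
  const second = extract(renderedHtml); // extract again
  if (score(second) >= minScore) {
    return { ok: true, rendered: true, content: second };
  }
  return { ok: false, reason: "low_quality", rendered: true };
}
```

Passing the steps in as functions keeps the decision logic testable without a network or a browser.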

When rendering is required, do not wait for generic browser quiet.

Wait for specific evidence:

price text is present
article body crosses 500 words
JSON-LD includes Product or Article
expected API response arrived
skeleton loader disappeared
main content hash stopped changing

That is slower to design than networkidle.

It is also much closer to correctness.
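One of those signals, "main content stopped changing", can be sketched as a polling loop. This version compares the text directly instead of hashing it, which gives the same answer for a sketch. The option names and defaults are assumptions.

```javascript
// Polls readText() until the content is long enough and has been identical
// for `stableReads` consecutive comparisons, or throws after timeoutMs.
async function waitForStableContent(
  readText,
  { minLength = 200, stableReads = 3, intervalMs = 250, timeoutMs = 15000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  let previous = null;
  let unchanged = 0;

  while (Date.now() < deadline) {
    const current = await readText();

    if (current.length >= minLength && current === previous) {
      unchanged += 1;
      if (unchanged >= stableReads) return current;
    } else {
      unchanged = 0; // too short or still changing: start over
    }

    previous = current;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`content did not stabilize within ${timeoutMs} ms`);
}

// Example reader (works in Puppeteer and Playwright):
// const readMain = () =>
//   page.evaluate(() => document.querySelector("main")?.innerText ?? "");
// const text = await waitForStableContent(readMain);
```

The timeout is about the content you asked for, so a failure here means the evidence never arrived, not that an ad kept polling.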

The content-quality check is the missing piece

Even after rendering, you still need to check the output.

The browser can finish.

The DOM can exist.

The extractor can run.

The result can still be trash.

Examples:

80 percent navigation
cookie banner text
empty product fields
missing article body
duplicate footer links
loading skeleton text

This is why a web extraction system should score the cleaned output before returning it as success.
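A crude version of that scoring step might look like this. The thresholds, the penalty per issue, and the banner phrases are assumptions for illustration, not a real quality model.

```javascript
// Hypothetical quality check for cleaned markdown output.
function scoreMarkdown(markdown) {
  const lines = markdown
    .split("\n")
    .map((line) => line.trim())
    .filter(Boolean);
  if (lines.length === 0) return { score: 0, issues: ["empty"] };

  // Lines that are nothing but a single markdown link, optionally bulleted.
  const linkOnly = lines.filter((line) =>
    /^(?:[-*+]\s*)?\[[^\]]*\]\([^)]*\)$/.test(line)
  ).length;
  const linkRatio = linkOnly / lines.length;
  const duplicateRatio = 1 - new Set(lines).size / lines.length;
  const words = markdown.split(/\s+/).filter(Boolean).length;

  const issues = [];
  if (words < 100) issues.push("too_short");
  if (linkRatio > 0.5) issues.push("mostly_links");
  if (duplicateRatio > 0.3) issues.push("duplicate_lines");
  if (/accept (all )?cookies|loading\.\.\.|loading…/i.test(markdown)) {
    issues.push("banner_or_loader_text");
  }

  // Each issue costs a quarter of the score; floor at zero.
  const score = Math.max(0, 1 - 0.25 * issues.length);
  return { score, issues };
}
```

Returning the issue list next to the score lets the caller decide whether to retry, render, or fail loudly instead of shipping navigation as content.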

For LLM apps, this matters even more.

Bad context does not just sit in a database.

It becomes an answer, a tool call, a summary, or a RAG chunk.

Where anti-bot fits in

JavaScript rendering and anti-bot handling overlap, but they are not the same problem.

A page can be empty because it needs JavaScript.

A page can be empty because it is blocked.

A page can be empty because your extractor is wrong.

If you collapse all three into "use browser," you lose the ability to debug.

The API should hide the decision

From the outside, a scraping API should not make the caller guess:

should I pass render=true?
should I wait for network idle?
should I retry in Chrome?
should I parse JSON-LD instead?
should I call the XHR endpoint?

The API should classify the page and choose the cheapest correct path.

In Webclaw, the user-facing call stays boring:

curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'

If the initial response is enough, return clean markdown.

If the page needs rendering, render.

If the user wants typed fields, extract them with a schema.

The rule

Do not wait for the page to be done.

Modern pages are not done.

Wait for the content you need.

Then verify the output you are about to trust.

That one change makes browser scraping less magical and a lot easier to reason about.
