The most suspicious line in a lot of Puppeteer and Playwright scrapers is not page.goto itself.
It is the wait condition passed to it:
await page.goto(url, { waitUntil: "networkidle" })
It looks responsible.
It feels like you are saying:
Wait until the page is done.
Then scrape it.
The problem is that modern web pages are rarely "done."
Analytics keeps firing.
Personalization keeps polling.
Reviews load after the product.
Inventory loads after the variant.
Ads never stop.
Chat widgets wake up late.
Some pages go idle before the content you need appears.
Some never go idle at all.
So your scraper either waits too little, waits too long, or times out for reasons unrelated to the data you wanted.
That is not a scraping strategy.
That is outsourcing correctness to background noise.
I am building Webclaw, a web extraction API, CLI, and MCP server for AI agents and LLM apps. The rule I trust more is:
Do not wait for silence.
Wait for evidence.
Network idle is a browser lifecycle hint
networkidle is useful for some browser automation tasks.
It is not useless.
The problem is treating it like proof that the page contains the thing you need.
It does not mean:
the article body exists
the product price exists
the JSON-LD exists
the review count exists
the SPA finished hydrating
the page is not blocked
the extracted markdown is good
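It only means the browser saw a quiet enough network window according to that tool's definition. In Playwright, networkidle means no network connections for at least 500 ms. Puppeteer splits it into networkidle0 (no connections) and networkidle2 (at most two connections) over the same 500 ms window.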
That is a very different claim.
For scraping, the readiness condition should come from the target data, not from the browser feeling calm.
The four bad waits
I see these patterns a lot.
1. Wait for network idle
await page.goto(url, { waitUntil: "networkidle" })
const html = await page.content()
This can hang on noisy pages and still miss content on pages that load important data late.
2. Wait a fixed number of seconds
await page.waitForTimeout(5000)
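This is honest but crude. Playwright's own docs discourage waitForTimeout outside of debugging, and recent Puppeteer versions removed it.
It passes locally, fails in production, and breaks again whenever latency shifts.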
3. Wait for any selector
await page.waitForSelector("main")
This is better, but still weak.
main can exist before the content exists.
#root can exist while the app is still empty.
.product can exist as a skeleton loader.
4. Wait for DOMContentLoaded
await page.goto(url, { waitUntil: "domcontentloaded" })
This only tells you the initial document was parsed.
For a client-rendered app, that might be exactly the moment before anything useful happens.
What to wait for instead
You want evidence that the target content exists.
For an article:
headline exists
body has enough text
author or date is present
article schema is present
navigation is not most of the output
For a product page:
title exists
price exists
availability exists
variant data exists
product JSON-LD exists
reviews or rating exists when expected
For docs:
main heading exists
section headings exist
code blocks exist
body text crosses a minimum token count
That readiness check can be a selector, but it should not be just any selector.
It should be tied to the data you came for.
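Here is a minimal sketch of what "tied to the data" can look like, assuming your extractor already produces a plain object of fields. The function names, field names, and the 200-word default are hypothetical, chosen only for illustration.

```javascript
// Hypothetical readiness checks. Field names and thresholds are illustrative.
function countWords(text) {
  return (text || "").trim().split(/\s+/).filter(Boolean).length;
}

// `fields` is whatever your extractor pulled from the current HTML or DOM.
function articleEvidence(fields, minWords = 200) {
  const missing = [];
  if (!fields.headline) missing.push("headline");
  if (countWords(fields.body) < minWords) missing.push("body");
  if (!fields.author && !fields.date) missing.push("author_or_date");
  return { ready: missing.length === 0, missing };
}

function productEvidence(fields) {
  const missing = [];
  if (!fields.title) missing.push("title");
  // A price needs at least one digit, so "Loading..." or "$--" does not count.
  if (!/\d/.test(fields.price || "")) missing.push("price");
  if (!fields.availability) missing.push("availability");
  return { ready: missing.length === 0, missing };
}
```

Returning the list of missing fields, not just a boolean, is deliberate: when the check fails, you know which piece of evidence never showed up.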
Rendering should be a fallback
Before launching a browser, ask a cheaper question:
Does the initial HTML already contain the content?
If yes, extract it.
If no, classify why.
Useful categories:
blocked response
empty app shell
hydration payload with data
hydration payload without data
bad extraction rule
client-side data dependency
Only one of those clearly says:
render this page
That distinction matters.
If the HTML contains the article, you do not need Chrome.
If the JSON-LD contains the product data, you do not need Chrome.
If the page is blocked, Chrome may not be the first fix.
If the extractor missed the content, rendering gives you a more expensive failure.
This is the longer Webclaw version of the idea:
JavaScript Rendering API for Web Scraping: when browser fallback is actually needed
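As a rough sketch, the part of that classification you can do from the raw response alone might look like this. It is not how Webclaw does it. The markers, the thresholds, and the extra content_in_html label are my assumptions for the example, and it only covers the categories you can guess from status and HTML. "Bad extraction rule" and "client-side data dependency" need more context than one response.

```javascript
// Heuristic triage of a raw (non-rendered) response.
// Markers, thresholds, and the "content_in_html" label are assumptions.

// Strip scripts, styles, and tags to approximate the text a reader would see.
function visibleText(html) {
  return html
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function classifyResponse({ status, html }, minTextChars = 500) {
  const textLength = visibleText(html).length;
  const blockedStatus = [403, 429, 503].includes(status);
  const challengeText = /captcha|access denied|verify you are human/i.test(html);

  // A challenge phrase only counts when there is little real text around it.
  if (blockedStatus || (challengeText && textLength < minTextChars)) {
    return "blocked_response";
  }
  if (textLength >= minTextChars) return "content_in_html";

  // Next.js ships its hydration payload in a script tag with this id.
  const payload = html.match(
    /<script[^>]*id=["']__NEXT_DATA__["'][^>]*>([\s\S]*?)<\/script>/i
  );
  if (payload) {
    // Crude size check: a near-empty payload suggests data loads client-side.
    return payload[1].trim().length > 200
      ? "hydration_payload_with_data"
      : "hydration_payload_without_data";
  }
  return "empty_app_shell";
}
```

Under these assumptions, only empty_app_shell and hydration_payload_without_data are candidates for a browser. Everything else has a cheaper next step.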
A better rendering loop
The flow I prefer looks like this:
fetch raw HTML
classify the response
extract available content
score extraction quality
render only if needed
wait for target evidence
extract again
return clean markdown or JSON
When rendering is required, do not wait for generic browser quiet.
Wait for specific evidence:
price text is present
article body crosses 500 words
JSON-LD includes Product or Article
expected API response arrived
skeleton loader disappeared
main content hash stopped changing
That is slower to design than networkidle.
It is also much closer to correctness.
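One way to sketch "wait for evidence" is a small polling loop around whatever check you trust, plus a check for the "stopped changing" case. Both helpers below are hypothetical names, not library APIs, and the example compares snapshot strings directly instead of hashing them, which is equivalent for this purpose.

```javascript
// Poll an evidence check until it passes or the deadline expires.
// `check` is any (async) function that returns a truthy value once the
// target content is present. You supply it; nothing here is a library API.
async function waitForEvidence(check, { timeoutMs = 15000, intervalMs = 250 } = {}) {
  const deadline = Date.now() + timeoutMs;
  let attempts = 0;
  while (true) {
    attempts += 1;
    const evidence = await check();
    if (evidence) return { ok: true, evidence, attempts };
    // Stop if another sleep would push us past the deadline.
    if (Date.now() + intervalMs > deadline) {
      return { ok: false, evidence: null, attempts };
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Evidence check: content counts as settled once the same non-empty
// snapshot has been seen `requiredRepeats` times in a row.
function stableContentCheck(getSnapshot, requiredRepeats = 3) {
  let last = null;
  let repeats = 0;
  return async () => {
    const snapshot = await getSnapshot();
    repeats = snapshot && snapshot === last ? repeats + 1 : 1;
    last = snapshot;
    return snapshot && repeats >= requiredRepeats ? snapshot : null;
  };
}
```

In a real scraper, getSnapshot could be something like () => page.evaluate(() => document.querySelector("main")?.innerText ?? ""), which works in both Puppeteer and Playwright. The timeout result is a value, not an exception, so the caller can decide whether a failed wait means retry, fall back, or report the page as unextractable.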
The content-quality check is the missing piece
Even after rendering, you still need to check the output.
The browser can finish.
The DOM can exist.
The extractor can run.
The result can still be trash.
Examples:
80 percent navigation
cookie banner text
empty product fields
missing article body
duplicate footer links
loading skeleton text
This is why a web extraction system should score the cleaned output before returning it as success.
For LLM apps, this matters even more.
Bad context does not just sit in a database.
It becomes an answer, a tool call, a summary, or a RAG chunk.
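A scoring pass does not need to be clever to catch most of the examples above. This is a minimal sketch over extracted markdown. The thresholds and the boilerplate phrase list are illustrative assumptions, not tuned values, and scoreExtraction is a hypothetical name.

```javascript
// Rough quality gate for extracted markdown.
// Thresholds and phrases are illustrative assumptions, not tuned values.
const BOILERPLATE = [/accept (all )?cookies/i, /loading\.\.\./i, /enable javascript/i];

function scoreExtraction(markdown, minWords = 150) {
  const lines = markdown.split("\n").map((l) => l.trim()).filter(Boolean);
  const words = markdown.split(/\s+/).filter(Boolean).length;

  // A line that is only a markdown link (optionally a list item) looks like navigation.
  const linkOnly = lines.filter((l) => /^([-*]\s*)?\[[^\]]*\]\([^)]*\)$/.test(l)).length;
  const linkRatio = lines.length ? linkOnly / lines.length : 1;

  // Share of lines that repeat an earlier line, e.g. duplicated footer links.
  const duplicateRatio = lines.length ? 1 - new Set(lines).size / lines.length : 0;

  const problems = [];
  if (words < minWords) problems.push("too_short");
  if (linkRatio > 0.5) problems.push("mostly_navigation");
  if (duplicateRatio > 0.3) problems.push("duplicate_lines");
  if (BOILERPLATE.some((re) => re.test(markdown))) problems.push("boilerplate_text");

  return { ok: problems.length === 0, problems, words, linkRatio };
}
```

The point is that "success" becomes a property of the output, not of the browser run. A result with problems goes back into the loop or gets returned as a failure, not as context for a model.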
Where anti-bot fits in
JavaScript rendering and anti-bot handling overlap, but they are not the same problem.
A page can be empty because it needs JavaScript.
A page can be empty because it is blocked.
A page can be empty because your extractor is wrong.
If you collapse all three into "use browser," you lose the ability to debug.
For the anti-bot side:
- Anti-bot scraping API signals
- Browser fallback beats browser-first
- TLS fingerprinting in 2026
- Cloudflare error codes for scrapers
The API should hide the decision
From the outside, a scraping API should not make the caller guess:
should I pass render=true?
should I wait for network idle?
should I retry in Chrome?
should I parse JSON-LD instead?
should I call the XHR endpoint?
The API should classify the page and choose the cheapest correct path.
In Webclaw, the user-facing call stays boring:
curl -X POST https://api.webclaw.io/v1/scrape \
  -H "Authorization: Bearer $WEBCLAW_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "only_main_content": true
  }'
If the initial response is enough, return clean markdown.
If the page needs rendering, render.
If the user wants typed fields, extract them with a schema.
The rule
Do not wait for the page to be done.
Modern pages are not done.
Wait for the content you need.
Then verify the output you are about to trust.
That one change makes browser scraping less magical and a lot easier to reason about.