Node.js Web Scraping Best Practices: Lessons From Production Pipelines
Node.js is a great choice for web scraping.
The ecosystem is mature, async I/O is a natural fit, and you can move quickly with tools like fetch, cheerio, Playwright, Puppeteer, and scraping APIs.
But most Node.js scrapers fail for the same reason: they are written like one-off scripts and then quietly promoted into production.
This post is a practical checklist for building scrapers that survive real usage.
1. Start with a queue, not a loop
The first version of a scraper often looks like this:
for (const url of urls) {
  const html = await fetch(url).then(r => r.text());
  const data = parse(html);
  await save(data);
}
That is fine for a test. It is fragile in production.
Use jobs instead:
const job = {
  url: 'https://example.com/products/123',
  type: 'product',
  priority: 'normal',
  attempts: 0,
  createdAt: new Date().toISOString(),
};
A queue gives you:
- retries without losing work
- priority handling
- restartability
- deduplication
- visibility into failures
- safer concurrency control
You can start simple with a database table or a JSON-backed queue. You do not need Kafka on day one. You just need to stop treating URLs as a temporary in-memory array.
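As a rough illustration of those properties, here is a minimal sketch of a queue. It assumes an in-process `Map` standing in for a database table, and the names `JobQueue`, `enqueue`, `claim`, `complete`, and `fail` are hypothetical, not from any library.

```javascript
// Hypothetical minimal job queue. A Map stands in for a database table;
// in a real pipeline the same operations would be queries against a table.
class JobQueue {
  constructor({ maxAttempts = 3 } = {}) {
    this.jobs = new Map(); // keyed by URL, which gives deduplication for free
    this.maxAttempts = maxAttempts;
  }

  // Add a job unless the same URL is already known.
  enqueue(url, type = 'page', priority = 'normal') {
    if (this.jobs.has(url)) return false;
    this.jobs.set(url, {
      url,
      type,
      priority,
      attempts: 0,
      state: 'pending',
      createdAt: new Date().toISOString(),
    });
    return true;
  }

  // Hand out the next pending job, high priority first.
  claim() {
    const pending = [...this.jobs.values()].filter(j => j.state === 'pending');
    const job = pending.find(j => j.priority === 'high') ?? pending[0];
    if (!job) return null;
    job.state = 'running';
    job.attempts += 1;
    return job;
  }

  complete(url) {
    this.jobs.get(url).state = 'done';
  }

  // Put a failed job back, or park it once attempts are exhausted.
  fail(url, reason) {
    const job = this.jobs.get(url);
    job.lastError = reason;
    job.state = job.attempts >= this.maxAttempts ? 'failed' : 'pending';
    return job.state;
  }
}
```

Because every job carries its own state and attempt count, a restarted worker can pick up where the previous one stopped, provided the store is persisted rather than held in memory as it is in this sketch.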
2. Control concurrency per domain
A single global concurrency limit is too blunt.
// Too simple for production
const concurrency = 20;
Different domains tolerate different request patterns. Even within one domain, product pages, search pages, and detail pages may need different pacing.
A better structure:
const domainPolicy = {
  'example.com': {
    concurrency: 2,
    minDelayMs: 3000,
    maxDelayMs: 9000,
  },
  'another-site.com': {
    concurrency: 1,
    minDelayMs: 8000,
    maxDelayMs: 20000,
  },
  // Fallback for domains without an explicit entry (illustrative values)
  default: {
    concurrency: 1,
    minDelayMs: 5000,
    maxDelayMs: 15000,
  },
};
function randomDelay(min, max) {
  return min + Math.floor(Math.random() * (max - min));
}
Then make your workers respect the policy:
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
async function waitForDomainPolicy(domain) {
  const policy = domainPolicy[domain] ?? domainPolicy.default;
  await sleep(randomDelay(policy.minDelayMs, policy.maxDelayMs));
}
Your goal is not maximum requests per second. Your goal is stable, valid data over time.
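The delay helper above paces requests but does not cap how many run at once for a domain. One way to enforce the `concurrency` setting is a small per-domain limiter. The sketch below is an assumption about how that could look; `DomainLimiter` and its `run` method are hypothetical names.

```javascript
// Hypothetical per-domain limiter: at most `limit` tasks run at once per domain.
class DomainLimiter {
  constructor(limits, defaultLimit = 1) {
    this.limits = limits;        // e.g. { 'example.com': 2 }
    this.defaultLimit = defaultLimit;
    this.active = new Map();     // domain -> number of occupied slots
    this.waiting = new Map();    // domain -> queue of resolve callbacks
  }

  async run(domain, task) {
    const limit = this.limits[domain] ?? this.defaultLimit;
    if ((this.active.get(domain) ?? 0) >= limit) {
      // Park until a finishing task hands its slot over.
      await new Promise(resolve => {
        const queue = this.waiting.get(domain) ?? [];
        queue.push(resolve);
        this.waiting.set(domain, queue);
      });
    } else {
      this.active.set(domain, (this.active.get(domain) ?? 0) + 1);
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.get(domain)?.shift();
      if (next) next(); // the slot passes directly to the next waiter
      else this.active.set(domain, this.active.get(domain) - 1);
    }
  }
}
```

A worker would then wrap each request, for example `limiter.run(hostname, () => fetchPage(job.url))`, so the cap holds even when many workers pull jobs for the same site.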
3. Use the right fetch strategy for the page
Not every page needs a browser.
A practical hierarchy:
- HTTP fetch for static pages
- HTML parsing with Cheerio when the markup is available
- Browser rendering when JavaScript is required
- Residential proxy / scraping API when the site is protected
Example wrapper:
async function fetchPage(url, options = {}) {
  if (options.render) {
    return await fetchWithBrowser(url);
  }
  return await fetchWithHttp(url);
}
If you use a scraping API such as XCrawl, keep it behind the same wrapper:
async function fetchWithScrapingApi(url, options = {}) {
  const result = await xcrawl.scrapeMarkdown(url, {
    render: options.render ?? false,
    country: options.country,
  });
  return {
    status: result.status,
    markdown: result.data?.markdown || '',
    html: result.data?.html || '',
    finalUrl: result.finalUrl,
  };
}
The rest of your application should not care which fetch method was used.
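For that to hold, the plain HTTP path has to return the same shape as the scraping API wrapper. Here is one possible sketch of `fetchWithHttp`, assuming Node.js 18 or later for the global `fetch` and `AbortSignal.timeout`. The `timeoutMs` and `fetchImpl` options are hypothetical additions, the latter only to make the function easy to test.

```javascript
// Hypothetical HTTP fetcher returning the same shape as the scraping API wrapper.
// Assumes Node.js 18+ (global fetch and AbortSignal.timeout).
async function fetchWithHttp(url, { timeoutMs = 15000, fetchImpl = fetch } = {}) {
  const res = await fetchImpl(url, {
    redirect: 'follow',
    // Abort slow requests instead of letting them hang a worker.
    signal: AbortSignal.timeout(timeoutMs),
  });
  return {
    status: res.status,
    markdown: '',              // plain HTTP yields HTML only
    html: await res.text(),
    finalUrl: res.url || url,  // res.url reflects any redirects that were followed
  };
}
```

Non-2xx responses are returned rather than thrown, so status-based handling can stay in one place downstream.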
4. Detect blocks before parsing
Do not send every response directly into the parser.
A blocked response can still come back as 200 OK, with a CAPTCHA or an access-denied page as the body.
function detectBlock(response) {
  const text = `${response.html || ''}\n${response.markdown || ''}`.toLowerCase();
  if (response.status === 403) return 'forbidden';
  if (response.status === 429) return 'rate_limited';
  if (text.includes('captcha')) return 'captcha';
  if (text.includes('access denied')) return 'access_denied';
  if (text.includes('enable javascript')) return 'needs_rendering';
  if (text.length < 500) return 'too_little_content';
  return null;
}
Then route failures by type:
const response = await fetchPage(job.url);
const block = detectBlock(response);
if (block === 'needs_rendering') {
  return requeue(job, { render: true });
}
if (block === 'rate_limited') {
  return requeue(job, { delayMs: 30 * 60 * 1000 });
}
if (block) {
  return markBlocked(job, block);
}
Blind retries are a common way to make blocking worse. Typed retries are safer.
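One way to keep typed retries in a single place is a small policy table keyed by the block types that `detectBlock` returns. The sketch below is only an illustration: `retryPolicy` and `nextRetryDelayMs` are hypothetical names, and apart from the 30-minute rate-limit delay used above, the numbers are placeholder assumptions.

```javascript
// Hypothetical retry policy keyed by the block types returned by detectBlock.
const retryPolicy = {
  rate_limited: { baseDelayMs: 30 * 60 * 1000, maxAttempts: 3 },
  too_little_content: { baseDelayMs: 60 * 1000, maxAttempts: 2 },
  captcha: { baseDelayMs: 0, maxAttempts: 0 },        // do not retry automatically
  access_denied: { baseDelayMs: 0, maxAttempts: 0 },  // do not retry automatically
};

// Returns the delay before the next attempt, or null when the job should be parked.
function nextRetryDelayMs(blockType, attempts) {
  const policy = retryPolicy[blockType];
  if (!policy || attempts >= policy.maxAttempts) return null;
  return policy.baseDelayMs * 2 ** attempts; // exponential backoff per failure type
}
```

Unknown block types return `null` on purpose, so a new kind of failure gets parked for inspection instead of being retried blindly.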
5. Store raw responses first
This is one of the highest-ROI habits in scraping.
async function storeRawPage(job, response) {
  return db.rawPages.insert({
    url: job.url,
    status: response.status,
    html: response.html,
    markdown: response.markdown,
    finalUrl: response.finalUrl,
    fetchedAt: new Date(),
  });
}
Why it matters:
- you can re-parse old pages after fixing a parser
- you can inspect block pages later
- you can prove whether missing data came from the site or your code
- you reduce repeated requests during debugging
Storage is usually cheaper than re-scraping.
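To make the first point concrete, here is a sketch of a re-parse pass that runs entirely against stored rows. It assumes rows shaped like the ones written by `storeRawPage`; `reparseStoredPages` is a hypothetical helper, and `parse` is whatever parser you are currently fixing.

```javascript
// Hypothetical re-parse pass over stored raw pages: no network requests involved.
// `rawPages` is any iterable of stored rows; `parse` is your current parser.
function reparseStoredPages(rawPages, parse) {
  const results = [];
  for (const page of rawPages) {
    if (page.status !== 200) continue; // skip stored error pages
    try {
      results.push({ url: page.url, fetchedAt: page.fetchedAt, data: parse(page.html) });
    } catch (err) {
      // Keep going: one bad page should not abort the whole pass.
      results.push({ url: page.url, fetchedAt: page.fetchedAt, error: err.message });
    }
  }
  return results;
}
```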
6. Write parsers that degrade gracefully
A brittle parser:
const price = Number($('.price').text().replace('$', ''));
A more defensive parser:
function extractPrice(text) {
  const patterns = [
    /\$\s?([0-9,]+(?:\.\d{2})?)/,
    /USD\s?([0-9,]+(?:\.\d{2})?)/i,
    /price[^0-9]{0,20}([0-9,]+(?:\.\d{2})?)/i,
  ];
  for (const pattern of patterns) {
    const match = text.match(pattern);
    if (match) {
      return Number(match[1].replace(/,/g, ''));
    }
  }
  return null;
}
Returning null is often better than throwing. The validation layer can decide whether the record is good enough.
7. Validate before writing to your main tables
A successful parse is not the same as valid data.
function validateProduct(product) {
  const checks = {
    hasTitle: Boolean(product.title && product.title.length >= 8),
    hasPrice: typeof product.price === 'number' && product.price > 0,
    hasUrl: Boolean(product.url?.startsWith('https://')),
    priceReasonable: product.price == null || product.price < 10000,
  };
  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;
  return {
    valid: confidence >= 0.75,
    confidence,
    checks,
  };
}
Only write valid records to the main table:
const validation = validateProduct(product);
if (!validation.valid) {
  await db.reviewQueue.insert({ product, validation, rawPageId });
  return;
}
await db.products.upsert(product);
This prevents silent corruption. Your scraper will still fail sometimes, but bad data will not quietly become business data.
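One caveat with the scoring above: with four equally weighted checks and a 0.75 threshold, a record that fails any single check still passes, including one with no price at all. If some fields are non-negotiable, a possible variant is to separate required checks from soft ones. This is a sketch of that idea, with `validateWithRequired` as a hypothetical name and the choice of required fields as an assumption.

```javascript
// Hypothetical variant: required checks must all pass; soft checks only affect confidence.
function validateWithRequired(product) {
  const required = {
    hasTitle: typeof product.title === 'string' && product.title.length >= 8,
    hasPrice: typeof product.price === 'number' && product.price > 0,
  };
  const soft = {
    hasUrl: Boolean(product.url?.startsWith('https://')),
    priceReasonable: product.price == null || product.price < 10000,
  };
  const checks = { ...required, ...soft };
  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;
  return {
    valid: Object.values(required).every(Boolean) && confidence >= 0.75,
    confidence,
    checks,
  };
}
```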
8. Deduplicate aggressively
Do not scrape the same fresh URL again and again.
async function planJobs(urls) {
  const existing = await db.pages.findByUrls(urls);
  const sixHoursAgo = Date.now() - 6 * 60 * 60 * 1000;
  return urls.filter(url => {
    const record = existing.get(url);
    if (!record) return true;
    return new Date(record.fetchedAt).getTime() < sixHoursAgo;
  });
}
Deduplication reduces:
- cost
- block risk
- processing time
- noisy logs
Freshness should be a business decision. A stock price page and a company profile page do not need the same refresh interval.
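A sketch of how per-type freshness could be expressed, together with URL normalization so that trivially different spellings of one page dedupe to the same key. The interval values, the page type names, and the helpers `normalizeUrl` and `isStale` are all assumptions for illustration.

```javascript
// Hypothetical refresh intervals per page type, in milliseconds.
const refreshIntervalMs = {
  stock_price: 5 * 60 * 1000,
  product: 6 * 60 * 60 * 1000,
  company_profile: 7 * 24 * 60 * 60 * 1000,
};

// Normalize URLs so that fragments, tracking params, and param order
// do not create separate jobs for the same page.
function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = '';
  for (const key of [...url.searchParams.keys()]) {
    if (key.startsWith('utm_')) url.searchParams.delete(key);
  }
  url.searchParams.sort();
  return url.toString();
}

// A page is stale when it was never fetched or its type's interval has elapsed.
function isStale(record, type, now = Date.now()) {
  if (!record) return true;
  const interval = refreshIntervalMs[type] ?? refreshIntervalMs.product;
  return now - new Date(record.fetchedAt).getTime() > interval;
}
```

Whether a given query parameter is safe to drop depends on the site, so treat the `utm_` rule as a starting point rather than a general answer.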
9. Stream large jobs
If you process thousands of URLs, avoid loading everything into memory.
async function* scrapeStream(jobs) {
  for await (const job of jobs) {
    const response = await fetchPage(job.url, job.options);
    yield { job, response };
  }
}
for await (const { job, response } of scrapeStream(jobSource())) {
  await processResponse(job, response);
}
Generators make long-running jobs easier to control. They also pair well with backpressure, batching, and incremental checkpointing.
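Batching, for instance, can be layered on top of any async iterable with a few lines. The `inBatches` helper below is a hypothetical sketch, not part of any library.

```javascript
// Hypothetical helper: group items from any (async) iterable into fixed-size batches.
async function* inBatches(source, size) {
  let batch = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}
```

Because the consumer pulls one batch at a time, upstream fetching pauses while a batch is being written, and the point after each batch is a natural place to record a checkpoint.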
10. Add health checks
A scraper can look healthy while returning junk.
Create known test cases:
const healthChecks = [
  {
    url: 'https://example.com/products/test-product',
    expect: page => page.includes('Test Product'),
  },
];
async function runHealthChecks() {
  for (const check of healthChecks) {
    const response = await fetchPage(check.url, { render: true });
    const text = response.markdown || response.html || '';
    if (!check.expect(text)) {
      await sendAlert(`Health check failed for ${check.url}`);
      return false;
    }
  }
  return true;
}
Run health checks on a schedule. You want to know about broken extraction before your users or clients notice stale data.
11. Track the metrics that matter
Do not obsess over requests per second.
Track:
- valid records per run
- validation failure rate
- block rate by domain
- retry count by failure type
- render fallback rate
- cost per valid record
- freshness lag
- parser error rate
The most important metric is:
effective scrape rate = valid records / total attempts
A slower scraper with a high effective scrape rate is usually better than a fast scraper that fills your database with partial records.
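A sketch of how a few of these numbers could be computed at the end of a run. It assumes each job outcome is recorded as `{ domain, status }` with `status` being `'valid'`, `'invalid'`, or `'blocked'`; that shape and the `summarizeRun` name are assumptions.

```javascript
// Hypothetical per-run summary built from recorded job outcomes.
// Rates are computed over total attempts, matching the formula above.
function summarizeRun(outcomes) {
  const total = outcomes.length;
  const count = status => outcomes.filter(o => o.status === status).length;
  const blockedByDomain = {};
  for (const o of outcomes) {
    if (o.status === 'blocked') {
      blockedByDomain[o.domain] = (blockedByDomain[o.domain] ?? 0) + 1;
    }
  }
  return {
    total,
    effectiveScrapeRate: total ? count('valid') / total : 0,
    validationFailureRate: total ? count('invalid') / total : 0,
    blockedByDomain,
  };
}
```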
12. Keep secrets out of code
Do not hardcode API keys.
const xcrawl = new XCrawlScraper({
  apiKey: process.env.XCRAWL_API_KEY,
});
Use environment variables, secret managers, or platform-provided secrets.
This matters more than people think because scraping projects often start as local scripts and later get copied into workers, cron jobs, dashboards, or client repos.
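A missing environment variable silently becomes `undefined`, which tends to surface later as a confusing authentication error. A small guard at startup avoids that. `requireEnv` below is a hypothetical helper, shown as a sketch.

```javascript
// Hypothetical helper: fail fast at startup when a required secret is missing.
// The `env` parameter defaults to process.env and exists mainly for testing.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

It could be used as `apiKey: requireEnv('XCRAWL_API_KEY')`, so a misconfigured worker stops immediately with a clear message.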
A simple production skeleton
Here is the shape I like:
async function processJob(job) {
  await waitForDomainPolicy(new URL(job.url).hostname);
  const response = await fetchPage(job.url, job.options);
  const rawPageId = await storeRawPage(job, response);
  const block = detectBlock(response);
  if (block) return handleBlock(job, block, rawPageId);
  const parsed = parsePage(response);
  const validation = validate(parsed);
  if (!validation.valid) {
    return db.reviewQueue.insert({ job, parsed, validation, rawPageId });
  }
  await db.records.upsert(parsed);
  return markDone(job);
}
It is not fancy. That is the point.
Good production scrapers are usually boring systems with strong boundaries:
- queue
- pacing
- fetch
- raw storage
- parse
- validate
- monitor
Final checklist
Before you call a Node.js scraper production-ready, check this list:
- [ ] jobs are stored outside memory
- [ ] concurrency is controlled per domain
- [ ] delays are configurable
- [ ] JavaScript rendering is optional, not default
- [ ] block pages are detected
- [ ] raw responses are stored
- [ ] parsers return structured partial data instead of crashing
- [ ] validation happens before database writes
- [ ] failed records go to a review queue
- [ ] large jobs stream or checkpoint progress
- [ ] health checks run on a schedule
- [ ] metrics track valid data, not just requests
- [ ] secrets come from environment variables
If you have these pieces, your scraper can still break, but it will break in a way you can understand and recover from.
That is the difference between a scraping script and a production scraping pipeline.