Charles

Posted on Jun 5 • Edited on Jun 26

Node.js Web Scraping Best Practices: Lessons From Production Pipelines

#tutorial #javascript #node #scraping

Node.js is a great choice for web scraping.

The ecosystem is mature, async I/O is a natural fit, and you can move quickly with tools like fetch, cheerio, Playwright, Puppeteer, and scraping APIs.

But many Node.js scrapers fail for the same reason: they are written as one-off scripts and then quietly promoted to production.

This post is a practical checklist for building scrapers that survive real usage.

1. Start with a queue, not a loop

The first version of a scraper often looks like this:

for (const url of urls) {
  const html = await fetch(url).then(r => r.text());
  const data = parse(html);
  await save(data);
}

That is fine for a test. It is fragile in production.

Use jobs instead:

const job = {
  url: 'https://example.com/products/123',
  type: 'product',
  priority: 'normal',
  attempts: 0,
  createdAt: new Date().toISOString(),
};

A queue gives you:

retries without losing work
priority handling
restartability
deduplication
visibility into failures
safer concurrency control

You can start simple with a database table or a JSON-backed queue. You do not need Kafka on day one. You just need to stop treating URLs as a temporary in-memory array.
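As a rough illustration of the "JSON-backed queue" option, here is a minimal sketch. The class name `JsonJobQueue`, its method names, and the `status` field are assumptions made for this example, and it is only suitable for a single process. It covers deduplication, priority, retries, and recovery after a restart.

```javascript
const fs = require('node:fs');

// Minimal JSON-file-backed queue (hypothetical sketch, single process only).
class JsonJobQueue {
  constructor(filePath) {
    this.filePath = filePath;
    this.jobs = fs.existsSync(filePath)
      ? JSON.parse(fs.readFileSync(filePath, 'utf8'))
      : [];
    // Jobs left "running" by a crashed process go back to "pending" on restart.
    for (const job of this.jobs) {
      if (job.status === 'running') job.status = 'pending';
    }
  }

  persist() {
    // Write to a temp file, then rename, so a crash mid-write cannot truncate the queue.
    const tmpPath = `${this.filePath}.tmp`;
    fs.writeFileSync(tmpPath, JSON.stringify(this.jobs, null, 2));
    fs.renameSync(tmpPath, this.filePath);
  }

  enqueue(url, { type = 'page', priority = 'normal' } = {}) {
    if (this.jobs.some(job => job.url === url)) return false; // deduplicate by URL
    this.jobs.push({
      url,
      type,
      priority,
      attempts: 0,
      status: 'pending',
      createdAt: new Date().toISOString(),
    });
    this.persist();
    return true;
  }

  claimNext() {
    const pending = this.jobs.filter(job => job.status === 'pending');
    // High-priority jobs first, otherwise oldest pending job.
    const job = pending.find(j => j.priority === 'high') ?? pending[0];
    if (!job) return null;
    job.status = 'running';
    job.attempts += 1;
    this.persist();
    return job;
  }

  complete(job) {
    job.status = 'done';
    this.persist();
  }

  fail(job, maxAttempts = 3) {
    // Retry until the attempt budget is spent, then park the job as failed.
    job.status = job.attempts >= maxAttempts ? 'failed' : 'pending';
    this.persist();
  }
}
```

Once several workers or machines are involved, a database table with row locking is the more realistic next step.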

2. Control concurrency per domain

Global concurrency is too blunt.

// Too simple for production
const concurrency = 20;

Different domains tolerate different request patterns. Even within one domain, product pages, search pages, and detail pages may need different pacing.

A better structure:

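const domainPolicy = {
  'example.com': {
    concurrency: 2,
    minDelayMs: 3000,
    maxDelayMs: 9000,
  },
  'another-site.com': {
    concurrency: 1,
    minDelayMs: 8000,
    maxDelayMs: 20000,
  },
  // Fallback for domains without an explicit entry (example values; tune per project)
  default: {
    concurrency: 1,
    minDelayMs: 5000,
    maxDelayMs: 15000,
  },
};

// Random integer delay in [min, max) so requests do not land on a fixed rhythm
function randomDelay(min, max) {
  return min + Math.floor(Math.random() * (max - min));
}

Then make your workers respect the policy:

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function waitForDomainPolicy(domain) {
  const policy = domainPolicy[domain] || domainPolicy.default;
  await sleep(randomDelay(policy.minDelayMs, policy.maxDelayMs));
}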

Your goal is not maximum requests per second. Your goal is stable, valid data over time.
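The delay helper paces requests, but it does not cap how many requests to one domain are in flight at once. One way to enforce the `concurrency` value is a small per-domain semaphore. This is a sketch; `DomainLimiter` and `limiterFor` are hypothetical names, not part of any library.

```javascript
// Hypothetical per-domain limiter: at most `concurrency` tasks run at once.
class DomainLimiter {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.active = 0;
    this.waiting = [];
  }

  async run(task) {
    if (this.active >= this.concurrency) {
      // Wait for a finishing task to hand its slot over directly.
      await new Promise(resolve => this.waiting.push(resolve));
    } else {
      this.active += 1;
    }

    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // slot passes to the next waiter, so `active` stays unchanged
      } else {
        this.active -= 1;
      }
    }
  }
}

const limiters = new Map();

// One limiter per domain, created lazily.
function limiterFor(domain, concurrency = 1) {
  if (!limiters.has(domain)) {
    limiters.set(domain, new DomainLimiter(concurrency));
  }
  return limiters.get(domain);
}
```

A worker would then wrap each fetch, for example `await limiterFor(domain, policy.concurrency).run(() => fetchPage(job.url))`.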

3. Use the right fetch strategy for the page

Not every page needs a browser.

A practical hierarchy:

HTTP fetch for static pages
HTML parsing with Cheerio when the markup is available
Browser rendering when JavaScript is required
Residential proxy / scraping API when the site is protected

Example wrapper:

async function fetchPage(url, options = {}) {
  if (options.render) {
    return await fetchWithBrowser(url);
  }

  return await fetchWithHttp(url);
}

If you use a scraping API such as XCrawl, keep it behind the same wrapper:

async function fetchWithScrapingApi(url, options = {}) {
  const result = await xcrawl.scrapeMarkdown(url, {
    render: options.render ?? false,
    country: options.country,
  });

  return {
    status: result.status,
    markdown: result.data?.markdown || '',
    html: result.data?.html || '',
    finalUrl: result.finalUrl,
  };
}

The rest of your application should not care which fetch method was used.
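To make that concrete, the plain HTTP path can return the same shape as the scraping API wrapper. The sketch below assumes Node.js 18+ (global `fetch` and `AbortSignal.timeout`). The `timeoutMs` and `fetchImpl` options and the user-agent string are assumptions added for illustration; `fetchImpl` exists only so the function can be tested without network access.

```javascript
// Sketch of the HTTP path, normalized to { status, html, markdown, finalUrl }.
async function fetchWithHttp(url, { timeoutMs = 15000, fetchImpl = fetch } = {}) {
  const res = await fetchImpl(url, {
    redirect: 'follow',
    // Abort slow responses instead of letting a worker hang forever
    signal: AbortSignal.timeout(timeoutMs),
    // Hypothetical identifier; use whatever your project has agreed on
    headers: { 'user-agent': 'my-scraper/1.0 (+https://example.com/bot)' },
  });

  return {
    status: res.status,
    html: await res.text(),
    markdown: '', // the HTTP path produces no markdown
    finalUrl: res.url || url, // URL after redirects, when available
  };
}
```

Because every fetch method returns the same four fields, `detectBlock`, raw storage, and the parsers never need to know which one ran.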

4. Detect blocks before parsing

Do not send every response directly into the parser.

A blocked response can still come back as 200 OK, with a CAPTCHA or challenge page in the body.

function detectBlock(response) {
  const text = `${response.html || ''}\n${response.markdown || ''}`.toLowerCase();

  if (response.status === 403) return 'forbidden';
  if (response.status === 429) return 'rate_limited';
  if (text.includes('captcha')) return 'captcha';
  if (text.includes('access denied')) return 'access_denied';
  if (text.includes('enable javascript')) return 'needs_rendering';
  if (text.length < 500) return 'too_little_content';

  return null;
}

Then route failures by type:

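async function fetchAndRoute(job) {
  const response = await fetchPage(job.url);
  const block = detectBlock(response);

  // The page needs a browser: retry with rendering enabled
  if (block === 'needs_rendering') {
    return requeue(job, { render: true });
  }

  // Rate limits need time, not another immediate attempt
  if (block === 'rate_limited') {
    return requeue(job, { delayMs: 30 * 60 * 1000 });
  }

  // Everything else (captcha, forbidden, ...) is parked for inspection
  if (block) {
    return markBlocked(job, block);
  }

  return response;
}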

Blind retries are a common way to make blocking worse. Typed retries are safer.
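One way to keep typed retries in a single place is a small rule table with exponential backoff. This is a sketch: the rule values, `RETRY_RULES`, and `planRetry` are assumptions for illustration, and the block types match the strings returned by `detectBlock`.

```javascript
// Hypothetical retry rules per block type (example values, tune per site).
const RETRY_RULES = {
  rate_limited: { maxAttempts: 4, baseDelayMs: 5 * 60 * 1000 },
  needs_rendering: { maxAttempts: 2, baseDelayMs: 0, options: { render: true } },
  too_little_content: { maxAttempts: 2, baseDelayMs: 60 * 1000, options: { render: true } },
};

// Decide what to do with a blocked job given how many attempts it has used.
function planRetry(blockType, attempts) {
  const rule = RETRY_RULES[blockType];

  // Unknown block types (captcha, forbidden, ...) are never retried blindly.
  if (!rule || attempts >= rule.maxAttempts) {
    return { action: 'give_up', reason: blockType };
  }

  return {
    action: 'requeue',
    delayMs: rule.baseDelayMs * 2 ** attempts, // exponential backoff
    options: rule.options ?? {},
  };
}
```

With these example values, a rate-limited job waits 5 minutes after the first block, then 10, 20, and 40, and is given up on after the fourth attempt.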

5. Store raw responses first

This is one of the highest-ROI habits in scraping.

async function storeRawPage(job, response) {
  return db.rawPages.insert({
    url: job.url,
    status: response.status,
    html: response.html,
    markdown: response.markdown,
    finalUrl: response.finalUrl,
    fetchedAt: new Date(),
  });
}

Why it matters:

you can re-parse old pages after fixing a parser
you can inspect block pages later
you can prove whether missing data came from the site or your code
you reduce repeated requests during debugging

Storage is usually cheaper than re-scraping.
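To keep that storage cheap, raw bodies can be compressed and hashed before they are written. The sketch below uses Node's built-in `zlib` and `crypto` modules. The function names and the `contentHash` and `bodyGzip` fields are assumptions for illustration, not columns from the schema above.

```javascript
const crypto = require('node:crypto');
const zlib = require('node:zlib');

// Build a compact record for raw storage (hypothetical field names).
function packRawPage(job, response) {
  const body = response.html || response.markdown || '';
  return {
    url: job.url,
    status: response.status,
    finalUrl: response.finalUrl ?? job.url,
    // Same hash on two fetches means the page did not change
    contentHash: crypto.createHash('sha256').update(body).digest('hex'),
    bodyGzip: zlib.gzipSync(Buffer.from(body, 'utf8')),
    fetchedAt: new Date().toISOString(),
  };
}

// Recover the original body for re-parsing or inspection.
function unpackRawPage(record) {
  return zlib.gunzipSync(record.bodyGzip).toString('utf8');
}
```

The hash also gives you a cheap "unchanged since last fetch" signal, which can let you skip re-parsing.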

6. Write parsers that degrade gracefully

A brittle parser:

const price = Number($('.price').text().replace('$', ''));

A more defensive parser:

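function extractPrice(text) {
  // Missing or non-string input degrades to null instead of throwing
  if (typeof text !== 'string') return null;

  // Ordered from most to least specific
  const patterns = [
    /\$\s?([0-9,]+(?:\.\d{2})?)/,
    /USD\s?([0-9,]+(?:\.\d{2})?)/i,
    /price[^0-9]{0,20}([0-9,]+(?:\.\d{2})?)/i,
  ];

  for (const pattern of patterns) {
    const match = text.match(pattern);
    if (match) {
      return Number(match[1].replace(/,/g, ''));
    }
  }

  return null;
}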

Returning null is often better than throwing. The validation layer can decide whether the record is good enough.
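The same idea extends to whole records: extract each field independently, so one broken selector costs you one field rather than the whole page. A minimal sketch follows. `safeField`, `parseProduct`, and the regex-based extraction are assumptions for illustration; in a real parser the extractors would more likely be Cheerio selectors.

```javascript
// Run one extractor; on failure, record a warning and return null.
function safeField(name, extract, warnings) {
  try {
    const value = extract();
    if (value == null || value === '') {
      warnings.push(`${name}: not found`);
      return null;
    }
    return value;
  } catch (err) {
    warnings.push(`${name}: ${err.message}`);
    return null;
  }
}

// Returns partial data plus a list of what went wrong.
function parseProduct(html) {
  const warnings = [];

  const product = {
    title: safeField(
      'title',
      () => html.match(/<h1[^>]*>([^<]+)<\/h1>/i)?.[1].trim(),
      warnings
    ),
    price: safeField(
      'price',
      () => {
        const match = html.match(/\$\s?([0-9,]+(?:\.\d{2})?)/);
        return match ? Number(match[1].replace(/,/g, '')) : null;
      },
      warnings
    ),
  };

  return { product, warnings };
}
```

The `warnings` array is useful input for the validation layer and for the parser error rate metric.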

7. Validate before writing to your main tables

A successful parse is not the same as valid data.

function validateProduct(product) {
  const checks = {
    hasTitle: Boolean(product.title && product.title.length >= 8),
    hasPrice: typeof product.price === 'number' && product.price > 0,
    hasUrl: Boolean(product.url?.startsWith('https://')),
    priceReasonable: product.price == null || product.price < 10000,
  };

  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;

  return {
    valid: confidence >= 0.75,
    confidence,
    checks,
  };
}

Only write valid records to the main table:

const validation = validateProduct(product);

if (!validation.valid) {
  await db.reviewQueue.insert({ product, validation, rawPageId });
  return;
}

await db.products.upsert(product);

This prevents silent corruption. Your scraper will still fail sometimes, but bad data will not quietly become business data.
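One caveat with a pure score: in `validateProduct`, a record with no price fails only `hasPrice` (`priceReasonable` passes when the price is null), so it scores 3/4 = 0.75 and is still marked valid. If some fields are non-negotiable, a variant that treats them as required may be safer. This is a sketch; `validateWithRequired` and its parameters are assumptions for illustration.

```javascript
// Score the checks, but fail outright if any required check fails.
function validateWithRequired(checks, required, threshold = 0.75) {
  const names = Object.keys(checks);
  const passed = names.filter(name => checks[name]).length;
  const confidence = passed / names.length;
  const missingRequired = required.filter(name => !checks[name]);

  return {
    valid: missingRequired.length === 0 && confidence >= threshold,
    confidence,
    missingRequired,
  };
}

// Example: a product without a price no longer slips through.
const result = validateWithRequired(
  { hasTitle: true, hasPrice: false, hasUrl: true, priceReasonable: true },
  ['hasTitle', 'hasPrice']
);
// result.valid is false, result.missingRequired is ['hasPrice']
```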

8. Deduplicate aggressively

Do not scrape the same fresh URL again and again.

async function planJobs(urls) {
  // Assumes findByUrls returns a Map keyed by URL
  const existing = await db.pages.findByUrls(urls);
  const sixHoursAgo = Date.now() - 6 * 60 * 60 * 1000;

  return urls.filter(url => {
    const record = existing.get(url);
    if (!record) return true; // never fetched
    return new Date(record.fetchedAt).getTime() < sixHoursAgo; // stale
  });
}

Deduplication reduces:

cost
block risk
processing time
noisy logs

Freshness should be a business decision. A stock price page and a company profile page do not need the same refresh interval.
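A per-type refresh table is one way to encode that decision instead of a single six-hour cutoff. In this sketch the type names, the intervals, and the `isStale` helper are example assumptions, not recommendations.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Hypothetical refresh intervals per job type (example values).
const refreshIntervalMs = {
  stock_price: 0.25 * HOUR_MS,
  product: 6 * HOUR_MS,
  company_profile: 7 * 24 * HOUR_MS,
};

const DEFAULT_REFRESH_MS = 6 * HOUR_MS;

// A page is stale if it was never fetched or its interval has elapsed.
function isStale(record, type, now = Date.now()) {
  if (!record) return true;
  const interval = refreshIntervalMs[type] ?? DEFAULT_REFRESH_MS;
  return now - new Date(record.fetchedAt).getTime() >= interval;
}
```

Passing `now` as a parameter keeps the function deterministic, which makes the freshness rules easy to unit test.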

9. Stream large jobs

If you process thousands of URLs, avoid loading everything into memory.

async function* scrapeStream(jobs) {
  for await (const job of jobs) {
    const response = await fetchPage(job.url, job.options);
    yield { job, response };
  }
}

for await (const { job, response } of scrapeStream(jobSource())) {
  await processResponse(job, response);
}

Generators make long-running jobs easier to control. They also pair well with backpressure, batching, and incremental checkpointing.
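For batching, a small async generator can group items from any job source without holding the full list in memory. This is a minimal sketch; `inBatches` is a hypothetical helper name.

```javascript
// Group items from any (async) iterable into arrays of at most `size`.
async function* inBatches(source, size) {
  let batch = [];

  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }

  // Flush the final partial batch
  if (batch.length > 0) {
    yield batch;
  }
}
```

Each yielded batch is a natural checkpoint: write the results, record the last processed job, then pull the next batch.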

10. Add health checks

A scraper can look healthy while returning junk.

Create known test cases:

const healthChecks = [
  {
    url: 'https://example.com/products/test-product',
    expect: page => page.includes('Test Product'),
  },
];

async function runHealthChecks() {
  for (const check of healthChecks) {
    const response = await fetchPage(check.url, { render: true });
    const text = response.markdown || response.html || '';

    if (!check.expect(text)) {
      await sendAlert(`Health check failed for ${check.url}`);
      return false;
    }
  }

  return true;
}

Run health checks on a schedule. You want to know about broken extraction before your users or clients notice stale data.
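A variant worth considering runs every check, even after one fails, and treats a fetch error as a failed check rather than a crash. That way one alert can list everything that is broken. This is a sketch; `runAllHealthChecks` and the injected `fetchPageFn` parameter are assumptions for illustration.

```javascript
// Run every check and collect failures instead of stopping at the first one.
async function runAllHealthChecks(checks, fetchPageFn) {
  const results = [];

  for (const check of checks) {
    try {
      const response = await fetchPageFn(check.url);
      const text = response.markdown || response.html || '';
      results.push({ url: check.url, ok: Boolean(check.expect(text)), error: null });
    } catch (err) {
      // A fetch failure is itself a failed health check
      results.push({ url: check.url, ok: false, error: err.message });
    }
  }

  return {
    healthy: results.every(result => result.ok),
    failures: results.filter(result => !result.ok),
  };
}
```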

11. Track the metrics that matter

Do not obsess over requests per second.

Track:

valid records per run
validation failure rate
block rate by domain
retry count by failure type
render fallback rate
cost per valid record
freshness lag
parser error rate

The most important metric is:

effective scrape rate = valid records / total attempts

A slower scraper with a high effective scrape rate is usually better than a fast scraper that fills your database with partial records.
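Computing these numbers does not need a metrics stack to begin with. Here is a sketch that summarizes one run from a list of attempt records. The record shape (`domain`, `outcome`) and the outcome names are assumptions for illustration.

```javascript
// attempts: [{ domain, outcome }] where outcome is e.g.
// 'valid' | 'invalid' | 'blocked' | 'error' (hypothetical labels)
function summarizeRun(attempts) {
  const byDomain = {};
  let validRecords = 0;

  for (const { domain, outcome } of attempts) {
    const stats = (byDomain[domain] ??= { attempts: 0, blocked: 0 });
    stats.attempts += 1;
    if (outcome === 'blocked') stats.blocked += 1;
    if (outcome === 'valid') validRecords += 1;
  }

  const blockRateByDomain = {};
  for (const [domain, stats] of Object.entries(byDomain)) {
    blockRateByDomain[domain] = stats.blocked / stats.attempts;
  }

  return {
    totalAttempts: attempts.length,
    validRecords,
    // effective scrape rate = valid records / total attempts
    effectiveScrapeRate: attempts.length ? validRecords / attempts.length : 0,
    blockRateByDomain,
  };
}
```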

12. Keep secrets out of code

Do not hardcode API keys.

const xcrawl = new XCrawlScraper({
  apiKey: process.env.XCRAWL_API_KEY,
});

Use environment variables, secret managers, or platform-provided secrets.

This matters more than people think because scraping projects often start as local scripts and later get copied into workers, cron jobs, dashboards, or client repos.
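It also helps to fail fast at startup when a secret is missing, so the worker does not discover it halfway through a run. A minimal sketch follows; `requireEnv` is a hypothetical helper, and the `env` parameter exists only to make it testable.

```javascript
// Read a required environment variable or stop with a clear error.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

For example, `apiKey: requireEnv('XCRAWL_API_KEY')` stops the process with a readable message instead of sending unauthenticated requests.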

A simple production skeleton

Here is the shape I like:

async function processJob(job) {
  await waitForDomainPolicy(new URL(job.url).hostname);

  const response = await fetchPage(job.url, job.options);
  const rawPageId = await storeRawPage(job, response);

  const block = detectBlock(response);
  if (block) return handleBlock(job, block, rawPageId);

  const parsed = parsePage(response);
  const validation = validate(parsed);

  if (!validation.valid) {
    return db.reviewQueue.insert({ job, parsed, validation, rawPageId });
  }

  await db.records.upsert(parsed);
  return markDone(job);
}

It is not fancy. That is the point.

Good production scrapers are usually boring systems with strong boundaries:

queue
pacing
fetch
raw storage
parse
validate
monitor

Final checklist

Before you call a Node.js scraper production-ready, check this list:

[ ] jobs are stored outside memory
[ ] concurrency is controlled per domain
[ ] delays are configurable
[ ] JavaScript rendering is optional, not default
[ ] block pages are detected
[ ] raw responses are stored
[ ] parsers return structured partial data instead of crashing
[ ] validation happens before database writes
[ ] failed records go to a review queue
[ ] large jobs stream or checkpoint progress
[ ] health checks run on a schedule
[ ] metrics track valid data, not just requests
[ ] secrets come from environment variables

If you have these pieces, your scraper can still break — but it will break in a way you can understand and recover from.

That is the difference between a scraping script and a production scraping pipeline.
