Charles

Node.js Web Scraping Best Practices: Lessons From Production Pipelines

Node.js is a great choice for web scraping.

The ecosystem is mature, async I/O is a natural fit, and you can move quickly with tools like fetch, cheerio, Playwright, Puppeteer, and scraping APIs.

But most Node.js scrapers fail for the same reason: they are written like one-off scripts and then quietly promoted into production.

This post is a practical checklist for building scrapers that survive real usage.

1. Start with a queue, not a loop

The first version of a scraper often looks like this:

for (const url of urls) {
  const html = await fetch(url).then(r => r.text());
  const data = parse(html);
  await save(data);
}

That is fine for a test. It is fragile in production.

Use jobs instead:

const job = {
  url: 'https://example.com/products/123',
  type: 'product',
  priority: 'normal',
  attempts: 0,
  createdAt: new Date().toISOString(),
};

A queue gives you:

  • retries without losing work
  • priority handling
  • restartability
  • deduplication
  • visibility into failures
  • safer concurrency control

You can start simple with a database table or a JSON-backed queue. You do not need Kafka on day one. You just need to stop treating URLs as a temporary in-memory array.
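
As a minimal sketch of the JSON-backed option, the queue below persists every change to a file so a restart picks up where it left off. The class name `JsonJobQueue`, the `state` and `lastError` fields, and the `maxAttempts` option are assumptions made for illustration; they do not come from any library.

```javascript
const fs = require('node:fs');

// Hypothetical JSON-backed queue. Rewriting the whole file on each change is
// fine for small workloads; move to a database table once volume grows.
class JsonJobQueue {
  constructor(filePath, { maxAttempts = 3 } = {}) {
    this.filePath = filePath;
    this.maxAttempts = maxAttempts;
    this.jobs = fs.existsSync(filePath)
      ? JSON.parse(fs.readFileSync(filePath, 'utf8'))
      : [];
    // Jobs left "running" by a crashed process go back to pending.
    for (const job of this.jobs) {
      if (job.state === 'running') job.state = 'pending';
    }
  }

  save() {
    fs.writeFileSync(this.filePath, JSON.stringify(this.jobs, null, 2));
  }

  // Deduplicate by URL: a URL that is already known is not queued twice.
  enqueue(url, type = 'page', priority = 'normal') {
    if (this.jobs.some(job => job.url === url)) return false;
    this.jobs.push({
      url,
      type,
      priority,
      attempts: 0,
      state: 'pending',
      createdAt: new Date().toISOString(),
    });
    this.save();
    return true;
  }

  // Hand out the next pending job, high priority first.
  next() {
    const pending = this.jobs.filter(job => job.state === 'pending');
    const job = pending.find(j => j.priority === 'high') ?? pending[0];
    if (!job) return null;
    job.state = 'running';
    job.attempts += 1;
    this.save();
    return job;
  }

  complete(job) {
    job.state = 'done';
    this.save();
  }

  // Retry until maxAttempts, then park the job as failed for inspection.
  fail(job, reason) {
    job.lastError = reason;
    job.state = job.attempts >= this.maxAttempts ? 'failed' : 'pending';
    this.save();
  }
}
```

A worker would call `next()`, process the job, and finish with either `complete(job)` or `fail(job, reason)`. Failed jobs stay in the file, which is what gives you visibility into failures.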

2. Control concurrency per domain

Global concurrency is too blunt.

// Too simple for production
const concurrency = 20;

Different domains tolerate different request patterns. Even within one domain, product pages, search pages, and detail pages may need different pacing.

A better structure:

const domainPolicy = {
  'example.com': {
    concurrency: 2,
    minDelayMs: 3000,
    maxDelayMs: 9000,
  },
  'another-site.com': {
    concurrency: 1,
    minDelayMs: 8000,
    maxDelayMs: 20000,
  },
};

function randomDelay(min, max) {
  return min + Math.floor(Math.random() * (max - min));
}

Then make your workers respect the policy:

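// Fallback for unlisted domains (illustrative values); domainPolicy has no `default` key.
const defaultPolicy = { concurrency: 1, minDelayMs: 5000, maxDelayMs: 15000 };
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function waitForDomainPolicy(domain) {
  const policy = domainPolicy[domain] ?? defaultPolicy;
  await sleep(randomDelay(policy.minDelayMs, policy.maxDelayMs));
}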

Your goal is not maximum requests per second. Your goal is stable, valid data over time.
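
The delay handles pacing, but the `concurrency` field also needs something to enforce it. Here is one possible sketch of a per-domain limiter. `DomainLimiter` and its `run` method are hypothetical names, and the policy object is assumed to have the same shape as the one above.

```javascript
// Hypothetical limiter: caps in-flight tasks for each hostname separately.
class DomainLimiter {
  constructor(policies, fallback = { concurrency: 1 }) {
    this.policies = policies;
    this.fallback = fallback;
    this.active = new Map();  // hostname -> number of in-flight tasks
    this.waiting = new Map(); // hostname -> resolve callbacks of parked tasks
  }

  limitFor(domain) {
    return (this.policies[domain] ?? this.fallback).concurrency;
  }

  async run(domain, task) {
    if ((this.active.get(domain) ?? 0) >= this.limitFor(domain)) {
      // All slots are taken: park until a finishing task hands its slot over.
      await new Promise(resolve => {
        const queue = this.waiting.get(domain) ?? [];
        queue.push(resolve);
        this.waiting.set(domain, queue);
      });
    } else {
      this.active.set(domain, (this.active.get(domain) ?? 0) + 1);
    }

    try {
      return await task();
    } finally {
      const next = this.waiting.get(domain)?.shift();
      if (next) {
        next(); // the slot passes directly to the next parked task
      } else {
        this.active.set(domain, this.active.get(domain) - 1);
      }
    }
  }
}

// Usage (fetchPage is the wrapper introduced in the next section):
// const limiter = new DomainLimiter(domainPolicy);
// await limiter.run(new URL(job.url).hostname, () => fetchPage(job.url));
```

Handing the slot straight to the next waiter keeps the in-flight count from ever exceeding the limit, even when many jobs for the same domain arrive at once.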

3. Use the right fetch strategy for the page

Not every page needs a browser.

A practical hierarchy:

  1. HTTP fetch for static pages
  2. HTML parsing with Cheerio when the markup is available
  3. Browser rendering when JavaScript is required
  4. Residential proxy / scraping API when the site is protected

Example wrapper:

async function fetchPage(url, options = {}) {
  if (options.render) {
    return await fetchWithBrowser(url);
  }

  return await fetchWithHttp(url);
}

If you use a scraping API such as XCrawl, keep it behind the same wrapper:

async function fetchWithScrapingApi(url, options = {}) {
  const result = await xcrawl.scrapeMarkdown(url, {
    render: options.render ?? false,
    country: options.country,
  });

  return {
    status: result.status,
    markdown: result.data?.markdown || '',
    html: result.data?.html || '',
    finalUrl: result.finalUrl,
  };
}

The rest of your application should not care which fetch method was used.
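
For that to hold, the plain HTTP branch has to return the same shape as the API branch. A minimal sketch of `fetchWithHttp`, assuming Node 18+ where `fetch` is global; the `timeoutMs` and `fetchImpl` options are assumptions added here for timeouts and testing:

```javascript
// Sketch of the HTTP branch. Returns the same fields as fetchWithScrapingApi
// so callers never need to know which strategy ran.
async function fetchWithHttp(url, { timeoutMs = 15000, fetchImpl = fetch } = {}) {
  const res = await fetchImpl(url, {
    redirect: 'follow',
    // Abort the request if the server does not answer in time.
    signal: AbortSignal.timeout(timeoutMs),
  });

  return {
    status: res.status,
    html: await res.text(),
    markdown: '', // plain HTTP has no markdown rendition
    finalUrl: res.url || url,
  };
}
```

Passing a stub as `fetchImpl` lets you unit-test everything downstream without touching the network.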

4. Detect blocks before parsing

Do not send every response directly into the parser.

A blocked response can still be 200 OK.

function detectBlock(response) {
  const text = `${response.html || ''}\n${response.markdown || ''}`.toLowerCase();

  if (response.status === 403) return 'forbidden';
  if (response.status === 429) return 'rate_limited';
  if (text.includes('captcha')) return 'captcha';
  if (text.includes('access denied')) return 'access_denied';
  if (text.includes('enable javascript')) return 'needs_rendering';
  if (text.length < 500) return 'too_little_content';

  return null;
}

Then route failures by type:

const response = await fetchPage(job.url);
const block = detectBlock(response);

if (block === 'needs_rendering') {
  return requeue(job, { render: true });
}

if (block === 'rate_limited') {
  return requeue(job, { delayMs: 30 * 60 * 1000 });
}

if (block) {
  return markBlocked(job, block);
}

Blind retries are a common way to make blocking worse. Typed retries are safer.
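
Once there are more than two or three block types, the `if` chain can become a table. The sketch below is one way to do that. `retryPolicy` and `planRetry` are hypothetical names, and the limits are illustrative except for the 30-minute base delay, which matches the example above.

```javascript
// Hypothetical retry table: one rule per block type returned by detectBlock.
const retryPolicy = {
  needs_rendering: { maxAttempts: 2, baseDelayMs: 0, patch: { render: true } },
  rate_limited: { maxAttempts: 4, baseDelayMs: 30 * 60 * 1000 },
  too_little_content: { maxAttempts: 2, baseDelayMs: 5 * 60 * 1000 },
};

// Decide what to do with a blocked job. Block types without a rule
// (captcha, forbidden, access_denied) are never retried automatically.
function planRetry(job, block) {
  const rule = retryPolicy[block];
  if (!rule || job.attempts >= rule.maxAttempts) {
    return { action: 'mark_blocked', reason: block };
  }

  return {
    action: 'requeue',
    // Exponential backoff: the delay doubles with every earlier attempt.
    delayMs: rule.baseDelayMs * 2 ** job.attempts,
    options: { ...job.options, ...rule.patch },
  };
}
```

The function only returns a decision. Whoever owns the queue applies it, which keeps the retry rules easy to test on their own.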

5. Store raw responses first

This is one of the highest-ROI habits in scraping.

async function storeRawPage(job, response) {
  return db.rawPages.insert({
    url: job.url,
    status: response.status,
    html: response.html,
    markdown: response.markdown,
    finalUrl: response.finalUrl,
    fetchedAt: new Date(),
  });
}

Why it matters:

  • you can re-parse old pages after fixing a parser
  • you can inspect block pages later
  • you can prove whether missing data came from the site or your code
  • you reduce repeated requests during debugging

Storage is usually cheaper than re-scraping.
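
To illustrate the first point, a re-parse pass can run entirely against stored pages. This is a sketch that assumes the stored rows have been loaded into an array with the fields from `storeRawPage`; `reparseRawPages` is a hypothetical helper and `parse` is whatever parser you just fixed.

```javascript
// Re-run a parser over stored raw pages without making any network request.
function reparseRawPages(rawPages, parse) {
  const results = [];

  for (const page of rawPages) {
    // Skip responses that were never parseable in the first place.
    if (page.status !== 200) continue;

    try {
      results.push({ url: page.url, record: parse(page.html ?? '') });
    } catch (error) {
      // Keep going: one bad page should not abort the whole backfill.
      results.push({ url: page.url, record: null, error: error.message });
    }
  }

  return results;
}
```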

6. Write parsers that degrade gracefully

A brittle parser:

const price = Number($('.price').text().replace('$', ''));

A more defensive parser:

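function extractPrice(text) {
  // Most specific pattern first. Each amount must start with a digit,
  // so a stray "$," can never be parsed as a price of 0.
  const patterns = [
    /\$\s?(\d[\d,]*(?:\.\d{2})?)/,
    /USD\s?(\d[\d,]*(?:\.\d{2})?)/i,
    /price[^0-9]{0,20}(\d[\d,]*(?:\.\d{2})?)/i,
  ];

  for (const pattern of patterns) {
    const match = text.match(pattern);
    if (match) {
      // Strip thousands separators before converting.
      return Number(match[1].replace(/,/g, ''));
    }
  }

  return null;
}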

Returning null is often better than throwing. The validation layer can decide whether the record is good enough.
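
The same idea extends from one field to the whole record. Below is a rough sketch in which each field has its own extractor and a failure in one never loses the others. `parseProduct` and the `missing` list are hypothetical, and the extractors in the usage example are deliberately simplistic.

```javascript
// Run every field extractor independently and report which fields are missing.
function parseProduct(text, extractors) {
  const record = {};
  const missing = [];

  for (const [field, extract] of Object.entries(extractors)) {
    let value;
    try {
      value = extract(text);
    } catch {
      value = null; // a throwing extractor degrades to "field not found"
    }
    record[field] = value ?? null;
    if (record[field] === null) missing.push(field);
  }

  return { record, missing };
}

// Usage with illustrative extractors:
const { record, missing } = parseProduct('Mechanical Keyboard\nNow $89.99', {
  title: text => text.split('\n')[0].trim() || null,
  price: text => {
    const match = text.match(/\$(\d+(?:\.\d{2})?)/);
    return match ? Number(match[1]) : null;
  },
  sku: () => {
    throw new Error('selector changed');
  },
});
// record.title is 'Mechanical Keyboard', record.price is 89.99,
// record.sku is null, and missing is ['sku'].
```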

7. Validate before writing to your main tables

A successful parse is not the same as valid data.

function validateProduct(product) {
  const checks = {
    hasTitle: Boolean(product.title && product.title.length >= 8),
    hasPrice: typeof product.price === 'number' && product.price > 0,
    hasUrl: Boolean(product.url?.startsWith('https://')),
    priceReasonable: product.price == null || product.price < 10000,
  };

  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;

  return {
    valid: confidence >= 0.75,
    confidence,
    checks,
  };
}

Only write valid records to the main table:

const validation = validateProduct(product);

if (!validation.valid) {
  await db.reviewQueue.insert({ product, validation, rawPageId });
  return;
}

await db.products.upsert(product);

This prevents silent corruption. Your scraper will still fail sometimes, but bad data will not quietly become business data.
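
One thing to watch with a pure score threshold: with four checks and a 0.75 cutoff, a record that has no price at all still passes three checks and counts as valid. If some fields are non-negotiable, you can treat them as required on top of the score. A sketch, assuming title and price are the mandatory fields (`validateProductStrict` is a hypothetical variant, not a replacement for the function above):

```javascript
// Variant of validateProduct where some checks must pass regardless of score.
function validateProductStrict(product) {
  const required = {
    hasTitle: typeof product.title === 'string' && product.title.length >= 8,
    hasPrice: typeof product.price === 'number' && product.price > 0,
  };
  const optional = {
    hasUrl: Boolean(product.url?.startsWith('https://')),
    priceReasonable: product.price == null || product.price < 10000,
  };

  const checks = { ...required, ...optional };
  const passed = Object.values(checks).filter(Boolean).length;
  const confidence = passed / Object.keys(checks).length;

  return {
    // Every required check must pass, and the overall score must be high enough.
    valid: Object.values(required).every(Boolean) && confidence >= 0.75,
    confidence,
    checks,
  };
}
```

Which checks are required is a product decision. The point is only that the score and the hard requirements are separate questions.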

8. Deduplicate aggressively

Do not scrape the same fresh URL again and again.

async function planJobs(urls) {
  const existing = await db.pages.findByUrls(urls);
  const sixHoursAgo = Date.now() - 6 * 60 * 60 * 1000;

  return urls.filter(url => {
    const record = existing.get(url);
    if (!record) return true;
    return new Date(record.fetchedAt).getTime() < sixHoursAgo;
  });
}

Deduplication reduces:

  • cost
  • block risk
  • processing time
  • noisy logs

Freshness should be a business decision. A stock price page and a company profile page do not need the same refresh interval.
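
In code, that usually means replacing the single six-hour constant with a per-type table. A minimal sketch follows; the page types and intervals are made-up examples and `isStale` is a hypothetical helper.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Illustrative refresh intervals per page type; tune these with the business.
const freshnessPolicy = {
  stock_price: 0.25 * HOUR_MS,
  product: 6 * HOUR_MS,
  company_profile: 7 * 24 * HOUR_MS,
};
const DEFAULT_TTL_MS = 6 * HOUR_MS;

// A page is stale if it was never fetched or its type's interval has elapsed.
function isStale(record, type, now = Date.now()) {
  if (!record) return true;
  const ttl = freshnessPolicy[type] ?? DEFAULT_TTL_MS;
  return now - new Date(record.fetchedAt).getTime() >= ttl;
}
```

Inside `planJobs`, the filter callback would then call `isStale(existing.get(url), type)` instead of comparing against a fixed cutoff.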

9. Stream large jobs

If you process thousands of URLs, avoid loading everything into memory.

async function* scrapeStream(jobs) {
  for await (const job of jobs) {
    const response = await fetchPage(job.url, job.options);
    yield { job, response };
  }
}

for await (const { job, response } of scrapeStream(jobSource())) {
  await processResponse(job, response);
}

Generators make long-running jobs easier to control. They also pair well with backpressure, batching, and incremental checkpointing.
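
Batching and checkpointing can be layered on with another generator. A sketch, where `inBatches` and `runWithCheckpoints` are hypothetical helpers and `saveCheckpoint` stands in for whatever persistence you use:

```javascript
// Group items from any (async) iterable into arrays of at most `size`.
async function* inBatches(source, size) {
  let batch = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Process batch by batch and record progress after each one, so a restart
// can skip everything up to the last saved checkpoint.
async function runWithCheckpoints(source, size, handleBatch, saveCheckpoint) {
  let processed = 0;
  for await (const batch of inBatches(source, size)) {
    await handleBatch(batch);
    processed += batch.length;
    await saveCheckpoint(processed);
  }
  return processed;
}
```

Because the consumer awaits each batch before pulling the next, the source is never read faster than results can be stored. That is the backpressure.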

10. Add health checks

A scraper can look healthy while returning junk.

Create known test cases:

const healthChecks = [
  {
    url: 'https://example.com/products/test-product',
    expect: page => page.includes('Test Product'),
  },
];

async function runHealthChecks() {
  for (const check of healthChecks) {
    const response = await fetchPage(check.url, { render: true });
    const text = response.markdown || response.html || '';

    if (!check.expect(text)) {
      await sendAlert(`Health check failed for ${check.url}`);
      return false;
    }
  }

  return true;
}

Run health checks on a schedule. You want to know about broken extraction before your users or clients notice stale data.
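
If you do not have a scheduler yet, a timer inside a long-running worker is enough to start with. The sketch below assumes a single process; `scheduleHealthChecks` is a hypothetical helper, and in a larger deployment cron or your job queue would normally take over this role.

```javascript
// Run `runChecks` every `intervalMs`, never overlapping two runs.
// Returns a function that stops the schedule.
function scheduleHealthChecks(runChecks, intervalMs) {
  let running = false;

  const timer = setInterval(async () => {
    if (running) return; // previous run is still in progress, skip this tick
    running = true;
    try {
      await runChecks();
    } catch (error) {
      // A crash in the checker is itself worth knowing about.
      console.error('Health check run crashed:', error);
    } finally {
      running = false;
    }
  }, intervalMs);

  return () => clearInterval(timer);
}

// Usage: const stop = scheduleHealthChecks(runHealthChecks, 15 * 60 * 1000);
```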

11. Track the metrics that matter

Do not obsess over requests per second.

Track:

  • valid records per run
  • validation failure rate
  • block rate by domain
  • retry count by failure type
  • render fallback rate
  • cost per valid record
  • freshness lag
  • parser error rate

The most important metric is:

effective scrape rate = valid records / total attempts

A slower scraper with a high effective scrape rate is usually better than a fast scraper that fills your database with partial records.
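
As a worked example, the metric can be computed from a simple log of attempts. The sketch assumes each attempt is recorded as `{ domain, outcome }`, with `outcome` being one of `'valid'`, `'invalid'`, or `'blocked'`. That record shape and the `summarizeRun` name are assumptions.

```javascript
// Summarize one run: effective scrape rate plus block rate per domain.
function summarizeRun(attempts) {
  const byDomain = {};
  let valid = 0;

  for (const { domain, outcome } of attempts) {
    const stats = (byDomain[domain] ??= { attempts: 0, blocked: 0 });
    stats.attempts += 1;
    if (outcome === 'blocked') stats.blocked += 1;
    if (outcome === 'valid') valid += 1;
  }

  return {
    totalAttempts: attempts.length,
    validRecords: valid,
    // valid records / total attempts, guarding against an empty run
    effectiveScrapeRate: attempts.length === 0 ? 0 : valid / attempts.length,
    blockRateByDomain: Object.fromEntries(
      Object.entries(byDomain).map(([domain, s]) => [domain, s.blocked / s.attempts])
    ),
  };
}
```

For a run with five attempts of which three produced valid records, the effective scrape rate is 3 / 5 = 0.6, no matter how quickly those five requests were made.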

12. Keep secrets out of code

Do not hardcode API keys.

const xcrawl = new XCrawlScraper({
  apiKey: process.env.XCRAWL_API_KEY,
});

Use environment variables, secret managers, or platform-provided secrets.

This matters more than people think because scraping projects often start as local scripts and later get copied into workers, cron jobs, dashboards, or client repos.
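
A small guard at startup makes a missing secret fail loudly instead of surfacing later as a confusing 401. A sketch, where `requireEnv` is a hypothetical helper and the `env` parameter exists only to make it testable:

```javascript
// Read a required environment variable or fail immediately with a clear message.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage: const apiKey = requireEnv('XCRAWL_API_KEY');
```

The error message names the variable but never prints its value, so it is safe to log.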

A simple production skeleton

Here is the shape I like:

async function processJob(job) {
  await waitForDomainPolicy(new URL(job.url).hostname);

  const response = await fetchPage(job.url, job.options);
  const rawPageId = await storeRawPage(job, response);

  const block = detectBlock(response);
  if (block) return handleBlock(job, block, rawPageId);

  const parsed = parsePage(response);
  const validation = validate(parsed);

  if (!validation.valid) {
    return db.reviewQueue.insert({ job, parsed, validation, rawPageId });
  }

  await db.records.upsert(parsed);
  return markDone(job);
}

It is not fancy. That is the point.

Good production scrapers are usually boring systems with strong boundaries:

  • queue
  • pacing
  • fetch
  • raw storage
  • parse
  • validate
  • monitor

Final checklist

Before you call a Node.js scraper production-ready, check this list:

  • [ ] jobs are stored outside memory
  • [ ] concurrency is controlled per domain
  • [ ] delays are configurable
  • [ ] JavaScript rendering is optional, not default
  • [ ] block pages are detected
  • [ ] raw responses are stored
  • [ ] parsers return structured partial data instead of crashing
  • [ ] validation happens before database writes
  • [ ] failed records go to a review queue
  • [ ] large jobs stream or checkpoint progress
  • [ ] health checks run on a schedule
  • [ ] metrics track valid data, not just requests
  • [ ] secrets come from environment variables

If you have these pieces, your scraper can still break — but it will break in a way you can understand and recover from.

That is the difference between a scraping script and a production scraping pipeline.
