Charles

Node.js Scraping Best Practices: What 1000+ Hours Taught Me

After 1000+ hours scraping everything from Amazon to Zillow, here are the patterns that actually matter and the mistakes that will waste your time.

1. Use Residential Proxies From Day One

This is the most important advice in this post. Don't bother with datacenter IPs for anything beyond testing.

// This will get you blocked on most real sites
const response = await axios.get(url);

// This will keep you running for months
const xcrawl = new XCrawlScraper({ apiKey: process.env.XCRAWL_API_KEY });
const result = await xcrawl.scrapeMarkdown(url);

The cost difference is real ($10/month for datacenter vs. $49/month for residential), but the payoff is larger: I've seen scrapers go from a 20% success rate to 95% just by switching proxy types.
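If you run your own proxy pool rather than going through a scraping API, the same idea applies to plain axios. Here is a minimal sketch, assuming a hypothetical `PROXY_URL` environment variable of the form `http://user:pass@host:port`. The `proxyConfigFromUrl` helper is illustrative and not part of axios; it only builds the object that axios accepts in its `proxy` request option.

```javascript
// Hypothetical helper: convert a proxy URL into the shape of axios's `proxy` option.
function proxyConfigFromUrl(proxyUrl) {
  const parsed = new URL(proxyUrl);
  const defaultPort = parsed.protocol === 'https:' ? 443 : 80;

  const config = {
    protocol: parsed.protocol.replace(':', ''), // "http:" -> "http"
    host: parsed.hostname,
    port: parsed.port ? Number(parsed.port) : defaultPort,
  };

  // Credentials are percent-encoded inside a URL, so decode them before use.
  if (parsed.username) {
    config.auth = {
      username: decodeURIComponent(parsed.username),
      password: decodeURIComponent(parsed.password),
    };
  }
  return config;
}

// Usage (assumes axios is installed and PROXY_URL is set):
// const axios = require('axios');
// const response = await axios.get(url, { proxy: proxyConfigFromUrl(process.env.PROXY_URL) });
```

Keeping the proxy in one environment variable makes it easy to switch providers without touching the scraping code.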

2. Handle JavaScript-Rendered Pages Properly

More and more sites render everything with JavaScript. If you're using axios or fetch directly, you're getting partial data at best.

// Wrong: you'll get empty shells or "enable JavaScript" messages
const { data: html } = await axios.get(url);

// Right: renders the page like a real browser
const result = await xcrawl.scrapeMarkdown(url, {
  render: true,
});

The tradeoff: rendering is slower (3-5 seconds vs 0.5 seconds) and costs more. Only use it when you actually need JavaScript execution.
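One way to decide whether a site needs rendering is to fetch the static HTML once and check whether it looks like an empty shell. The sketch below is a heuristic only: the function name `looksUnrendered`, the 200-character threshold, and the marker phrases are all assumptions you would tune per site.

```javascript
// Heuristic: does this static HTML look like a page that still needs JavaScript?
function looksUnrendered(html, minVisibleChars = 200) {
  // Strip scripts, styles, and tags to estimate how much visible text came back.
  const visibleText = html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ')
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();

  // Common "turn on JavaScript" walls (phrases are assumptions, not exhaustive).
  const asksForJs = /enable javascript|javascript is (required|disabled)/i.test(html);

  return asksForJs || visibleText.length < minVisibleChars;
}

// Usage: check once per site, then remember the answer.
// const { data: html } = await axios.get(sampleUrl);
// const options = looksUnrendered(html) ? { render: true } : {};
// const result = await xcrawl.scrapeMarkdown(sampleUrl, options);
```

Some sites ship a `<noscript>` warning even when the static HTML is complete, so treat a positive result as a hint to verify rather than a final answer.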

3. Parse Multiple Possible Formats

HTML structure changes. Your parser should handle multiple formats:

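function extractPrice(markdown) {
  // Every pattern captures the numeric part in group 1
  const patterns = [
    /\$([\d,]+(?:\.\d{2})?)/,                        // $129.99 or $1,299.00
    /USD\s*([\d,]+(?:\.\d+)?)/i,                     // USD 129.99
    /class="price"[^>]*>\s*\$?([\d,]+(?:\.\d+)?)/,   // HTML remnants
    /"price":\s*"?([\d,]+(?:\.\d+)?)/,               // JSON remnants
  ];

  for (const pattern of patterns) {
    const match = markdown.match(pattern);
    if (match) {
      // Strip thousands separators before converting
      const price = parseFloat(match[1].replace(/,/g, ''));
      if (!Number.isNaN(price)) return price;
    }
  }

  console.warn('extractPrice: no price pattern matched');
  return null; // Don't throw - log and return null
}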

The key insight: return null instead of throwing. A null price is better than a crash.
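Returning null only helps if something downstream checks for it. Below is a minimal validation sketch to run before a record is saved; the `validatePrice` name and the default bounds are assumptions, so pick limits that fit your catalog.

```javascript
// Hypothetical validator: flags missing, non-numeric, and implausible prices.
function validatePrice(price, { min = 0.01, max = 100000 } = {}) {
  if (price === null || price === undefined) {
    return { ok: false, reason: 'missing' };
  }
  if (typeof price !== 'number' || !Number.isFinite(price)) {
    return { ok: false, reason: 'not a number' };
  }
  if (price < min || price > max) {
    return { ok: false, reason: 'out of range' };
  }
  return { ok: true, reason: null };
}

// Usage:
// const check = validatePrice(extractPrice(markdown));
// if (!check.ok) console.warn('Skipping ' + url + ': price ' + check.reason);
```

Counting the failure reasons over time also gives you an early signal that a site has changed its layout.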

4. Respect Rate Limits

When a site tells you to slow down, slow down:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeWithBackoff(url, retries = 3) {
  for (let i = 0; i < retries; i++) {
    const result = await xcrawl.scrape(url);

    if (result.status !== 429) return result;

    // No point waiting after the final attempt
    if (i < retries - 1) {
      const backoff = Math.pow(2, i) * 10000; // 10s, then 20s
      console.log('Rate limited. Waiting ' + backoff / 1000 + 's...');
      await sleep(backoff);
    }
  }

  throw new Error('Still rate limited after ' + retries + ' attempts');
}

Ignoring rate limits is how you get your entire IP range blocked.
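Many servers say exactly how long to wait through the standard `Retry-After` response header, which carries either a number of seconds or an HTTP date. The sketch below prefers that value and falls back to exponential backoff. It assumes your client exposes response headers (I'm not assuming the `xcrawl` result does), and `retryDelayMs` is a hypothetical helper name.

```javascript
// Returns how long to wait (in ms) before retry number `attempt` (0-based).
function retryDelayMs(retryAfter, attempt, now = Date.now()) {
  const baseMs = 10000;
  const fallback = Math.pow(2, attempt) * baseMs; // 10s, 20s, 40s, ...

  if (retryAfter === undefined || retryAfter === null || retryAfter === '') {
    return fallback;
  }

  const text = String(retryAfter).trim();

  // Form 1: delay in seconds, e.g. "Retry-After: 120"
  if (/^\d+$/.test(text)) return Number(text) * 1000;

  // Form 2: HTTP date, e.g. "Retry-After: Wed, 21 Oct 2015 07:28:00 GMT"
  const retryAt = Date.parse(text);
  if (!Number.isNaN(retryAt)) return Math.max(0, retryAt - now);

  return fallback; // unparseable header, use our own schedule
}

// Usage with a client that exposes headers (for example, an axios response):
// const waitMs = retryDelayMs(response.headers['retry-after'], attempt);
```

Adding a little random jitter on top of the returned delay helps keep parallel workers from retrying at the same instant.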

5. Store Raw Data First, Parse Later

This saved me countless times:

async function scrapeAndStore(url) {
  const result = await xcrawl.scrapeMarkdown(url);

  // Store raw markdown - you can re-parse anytime
  await db.rawScrapes.insert({
    url,
    markdown: result.data.markdown,
    scrapedAt: new Date(),
  });

  // Then extract what you need
  const price = extractPrice(result.data.markdown);
  return { url, price };
}

When Amazon changes their HTML structure, you can re-parse all 6 months of stored markdown without hitting their servers again.
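A re-parse job can then be a plain loop over stored rows with no network calls. This is a minimal sketch, assuming the rows have the `{ url, markdown, scrapedAt }` shape stored above; `reparseStored` and `extractPriceV2` are hypothetical names, and the pattern in `extractPriceV2` is only an illustration of a new page layout.

```javascript
// Re-run an extractor over already-stored rows. Nothing is fetched here.
function reparseStored(rows, extract) {
  return rows.map((row) => ({
    url: row.url,
    scrapedAt: row.scrapedAt,
    price: extract(row.markdown),
  }));
}

// Illustrative extractor for a changed layout that prints "Price: $1,299.00".
function extractPriceV2(markdown) {
  const match = markdown.match(/Price:\s*\$([\d,]+(?:\.\d{2})?)/);
  return match ? parseFloat(match[1].replace(/,/g, '')) : null;
}

// Usage (how you load the rows depends on your database client):
// const rows = await loadRawScrapes();
// const reparsed = reparseStored(rows, extractPriceV2);
```

Passing the extractor as an argument lets you run old and new parsers over the same rows and compare their output before switching.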

6. Build Health Checks Into Your Pipeline

Your scraper will fail silently. Always check for that:

async function healthCheck() {
  const testCases = [
    { url: 'https://amazon.com/dp/B09V3KXJPB', expectedPrice: null },
    { url: 'https://target.com/p/apple-airpods', expectedTitle: /airpods/i },
  ];

  for (const test of testCases) {
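    const result = await xcrawl.scrapeMarkdown(test.url);

    // Check the status first: a failed response may not carry any data
    if (result.status !== 200) {
      sendAlert('Health check failed: ' + test.url + ' returned ' + result.status);
      return false;
    }

    const markdown = result.data.markdown;

    if (markdown.length < 500) {
      sendAlert('Health check warning: ' + test.url + ' returned suspiciously little content');
    }

    // A 200 with the wrong content is still a failure
    if (test.expectedTitle && !test.expectedTitle.test(markdown)) {
      sendAlert('Health check failed: ' + test.url + ' is missing expected content');
      return false;
    }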
  }

  return true;
}

Run this every hour. If it fails, you'll know before your data goes stale.
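One way to run the check hourly from a long-lived Node process is `setInterval`; a cron job or your platform's scheduler works just as well. In this sketch, `check` stands in for the `healthCheck` function above, and `scheduleHealthCheck` and `onFailure` are hypothetical names.

```javascript
// Runs `check` on a fixed interval and reports failures through `onFailure`.
function scheduleHealthCheck(check, { intervalMs = 60 * 60 * 1000, onFailure = console.error } = {}) {
  let running = false;

  async function tick() {
    if (running) return; // previous run still in progress, skip this one
    running = true;
    try {
      const healthy = await check();
      if (!healthy) onFailure(new Error('Health check reported a failure'));
    } catch (err) {
      onFailure(err); // a crash inside the check is itself a failure signal
    } finally {
      running = false;
    }
  }

  const timer = setInterval(tick, intervalMs);
  return { tick, stop: () => clearInterval(timer) };
}

// Usage:
// const job = scheduleHealthCheck(healthCheck);
// job.tick(); // run once at startup instead of waiting an hour
```

The `running` flag keeps a slow check from overlapping with the next scheduled one.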

7. Don't Duplicate Work

If you're scraping 1000 URLs, don't start from scratch every time:

async function scrapeWithDeduplication(urls) {
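  const existingUrls = await db.getAllUrls(); // assumed to be a Map of url -> { lastChecked }
  const staleBefore = Date.now() - 6 * 3600 * 1000;

  const newUrls = urls.filter(u => !existingUrls.has(u));
  // Guard with has(): get() returns undefined for URLs we have never seen
  const staleUrls = urls.filter(u => existingUrls.has(u) && existingUrls.get(u).lastChecked < staleBefore);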

  // Only scrape URLs that are new or stale (>6 hours old)
  const toScrape = [...newUrls, ...staleUrls];

  console.log('Skipping ' + (urls.length - toScrape.length) + ' URLs (already fresh)');
  return toScrape;
}

This cuts your API calls by 60-80% once you've built up a dataset.
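Deduplication by URL only works if the same page always maps to the same string. Below is a minimal normalization sketch to apply before the lookup. The tracking-parameter list is an assumption to extend for your sources, and dropping the trailing slash is safe only on sites that treat both forms as the same page.

```javascript
// Tracking parameters to drop (assumed list, extend as needed).
const TRACKING_PARAMS = [
  'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
  'fbclid', 'gclid',
];

function normalizeUrl(rawUrl) {
  const url = new URL(rawUrl); // also lowercases the hostname

  url.hash = ''; // fragments never reach the server
  for (const name of TRACKING_PARAMS) {
    url.searchParams.delete(name);
  }
  url.searchParams.sort(); // ?b=2&a=1 and ?a=1&b=2 become identical

  // Drop a trailing slash on non-root paths
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

// Usage: const keys = urls.map(normalizeUrl);
```

Store the normalized form as the lookup key and keep the original URL alongside it if you need it for the actual request.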

8. Use Streaming for Large Datasets

If you're scraping thousands of URLs, don't do it all in memory:

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function* scrapeStream(urls) {
  for (const url of urls) {
    const result = await xcrawl.scrapeMarkdown(url);
    yield { url, result };
    // Pause 2-5 seconds (randomized) before fetching the next URL
    await sleep(2000 + Math.random() * 3000);
  }
}

// Usage
for await (const { url, result } of scrapeStream(largeUrlList)) {
  await processResult(result);
}

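Because the generator yields one result at a time, memory use stays flat no matter how long the URL list is, as long as the consumer doesn't accumulate the results itself.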

The Pattern That Changed Everything

Most developers think scraping is about writing clever parsers. It's not. It's about building systems that survive real-world chaos:

Reliable Scraping = Residential Proxies + Smart Retry + Data Validation + Health Checks + Raw Storage

Each piece alone isn't enough. Together, they keep a scraper running when individual parts fail.

Key Takeaways

  1. Residential proxies aren't optional — datacenter IPs are dead on protected sites
  2. Store raw data first — enables re-parsing when structure changes
  3. Validate everything — 200 OK means nothing without data validation
  4. Health checks catch silent failures — your scraper will break, you just won't know
  5. Don't duplicate work — cache aggressively, re-scrape only stale data
  6. Use generators for large jobs — prevent memory issues

1000 hours of lessons condensed into one post. Questions about your specific scraping challenge? Let's talk.