
Алексей Спинов
I Built 77 Web Scrapers — Here Are the 5 Patterns That Never Break

After building 77 production web scrapers, I've learned that most scrapers break within weeks. But a few patterns make them nearly indestructible.

Pattern 1: API-First, HTML-Last

Before writing a single CSS selector, check if the site has a JSON API.

```javascript
// Instead of this (breaks on the next redesign):
const title = $("h1.product-title").text();

// Do this (survives redesigns — JSON APIs change far less often):
const res = await fetch("https://site.com/api/products/123");
const { title } = await res.json();
```

Examples of hidden APIs:

  • YouTube: youtubei/v1/search (Innertube API)
  • Reddit: append .json to any URL
  • Hacker News: hn.algolia.com/api/v1/search
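As a minimal sketch of the Reddit example above: appending `.json` to a listing URL returns the page data as JSON, so there is no HTML to parse at all. (This assumes Node 18+ for the built-in `fetch`; the `data.children[].data.title` field names match Reddit's public listing format.)

```javascript
// Pure helper: convert a Reddit page URL to its hidden JSON endpoint.
function toJsonUrl(url) {
  return url.replace(/\/?$/, ".json"); // "…/r/node" -> "…/r/node.json"
}

// Fetch post titles from a subreddit via the JSON endpoint.
async function fetchSubredditTitles(subredditUrl) {
  const res = await fetch(toJsonUrl(subredditUrl), {
    headers: { "User-Agent": "demo-scraper/1.0" }, // Reddit rejects blank UAs
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const json = await res.json();
  return json.data.children.map((post) => post.data.title);
}
```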

Pattern 2: Use Official Public APIs First

9 APIs that need NO authentication:

| API | What You Get |
| --- | --- |
| Wikipedia | Market overviews, article snippets |
| Google News RSS | Latest 100 news articles |
| GitHub Search | Repos, stars, tech landscape |
| HN Algolia | Community discussions, points |
| Stack Overflow | Developer questions, votes |
| arXiv | Academic papers, abstracts |
| npm Registry | Package ecosystem data |
| Reddit JSON | Community sentiment |
| PyPI | Python packages |
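To show what "no authentication" looks like in practice, here is a sketch against one entry from the table, the HN Algolia search endpoint (`hn.algolia.com/api/v1/search`, which returns a `hits` array with `title`, `url`, and `points` fields). No key, no login:

```javascript
// Build the search URL for HN Algolia's public API.
function hnSearchUrl(query, { page = 0 } = {}) {
  const params = new URLSearchParams({ query, page: String(page) });
  return `https://hn.algolia.com/api/v1/search?${params}`;
}

// Return { title, url, points } for each matching story.
async function searchHN(query) {
  const res = await fetch(hnSearchUrl(query));
  if (!res.ok) throw new Error(`HN Algolia returned ${res.status}`);
  const { hits } = await res.json();
  return hits.map(({ title, url, points }) => ({ title, url, points }));
}
```

The same shape works for the other entries: swap the base URL and the field names you pluck out of the response.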

Pattern 3: Promise.allSettled() Over Promise.all()

```javascript
// BAD: one failed source rejects the whole Promise
const results = await Promise.all([wiki(), news(), github()]);

// GOOD: get results from every source that succeeded
const results = await Promise.allSettled([wiki(), news(), github()]);
const successful = results
  .filter((r) => r.status === "fulfilled")
  .map((r) => r.value); // unwrap the { status, value } entries
```

This is how I query 9 sources in parallel — if arXiv is slow, Wikipedia results still come through.

Pattern 4: Rate Limiting Built Into the Scraper

```javascript
const delay = (ms) => new Promise((r) => setTimeout(r, ms));

const results = [];
for (const url of urls) {
  const data = await scrape(url);
  results.push(data);
  await delay(1000 + Math.random() * 2000); // random 1-3s pause between requests
}
```

Never hammer a server. Random delays of 1 to 3 seconds keep your traffic from looking like a burst of bot requests and help avoid blocks.
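If a server does push back with `429 Too Many Requests`, the polite move is to back off exponentially rather than retry at the same pace. A sketch of that (the retry count and delay caps are my own assumptions, not from the original pattern):

```javascript
// Delay schedule: 1s, 2s, 4s, 8s… capped, plus jitter so retries
// from parallel scrapers don't all fire at the same instant.
function nextBackoff(attempt, baseMs = 1000, capMs = 30000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp + Math.random() * 1000;
}

// Retry a fetch until it stops returning 429, backing off each time.
async function scrapeWithRetry(url, maxRetries = 4) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    await new Promise((r) => setTimeout(r, nextBackoff(attempt)));
  }
  throw new Error(`Still rate-limited after ${maxRetries} retries: ${url}`);
}
```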

Pattern 5: Structured Output Validation

```javascript
function validateResult(item) {
  const required = ["title", "url", "date"];
  const missing = required.filter((f) => !item[f]);
  if (missing.length > 0) {
    console.warn(`Missing fields: ${missing.join(", ")}`);
    return null; // drop incomplete items instead of passing them on
  }
  return item;
}

const clean = results.map(validateResult).filter(Boolean);
```

Never return garbage. Validate every item before adding to output.

Need data from a specific website? I'll scrape it for $20 — any site, any format (JSON/CSV/Excel). 77 production scrapers. Delivered in 24 hours. Email: Spinov001@gmail.com | Hire me →
