
Алексей Спинов
I Built 77 Web Scrapers — Here Are the 5 Patterns That Never Break

After building 77 production web scrapers, I've learned that most scrapers break within weeks. But a few patterns make them nearly indestructible.

Pattern 1: API-First, HTML-Last

Before writing a single CSS selector, check if the site has a JSON API.

```javascript
// Instead of this (breaks on the next redesign):
const title = $("h1.product-title").text();

// Do this (survives redesigns — JSON APIs change far less often):
const res = await fetch("https://site.com/api/products/123");
const { title } = await res.json();
```

Examples of hidden APIs:

  • YouTube: youtubei/v1/search (Innertube API)
  • Reddit: append .json to any URL
  • Hacker News: hn.algolia.com/api/v1/search
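As a minimal sketch of the Reddit example above: appending `.json` to a listing URL returns the page data as JSON, so there is no HTML to parse at all. (This assumes Node 18+ for the built-in `fetch`; the `data.children[].data.title` field names match Reddit's public listing format.)

```javascript
// Pure helper: convert a Reddit page URL to its hidden JSON endpoint.
function toJsonUrl(url) {
  return url.replace(/\/?$/, ".json"); // "…/r/node" -> "…/r/node.json"
}

// Fetch post titles from a subreddit via the JSON endpoint.
async function fetchSubredditTitles(subredditUrl) {
  const res = await fetch(toJsonUrl(subredditUrl), {
    headers: { "User-Agent": "demo-scraper/1.0" }, // Reddit rejects blank UAs
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const json = await res.json();
  return json.data.children.map((post) => post.data.title);
}
```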

Pattern 2: Use Official Public APIs First

9 APIs that need NO authentication:

| API | What You Get |
| --- | --- |
| Wikipedia | Market overviews, article snippets |
| Google News RSS | Latest 100 news articles |
| GitHub Search | Repos, stars, tech landscape |
| HN Algolia | Community discussions, points |
| Stack Overflow | Developer questions, votes |
| arXiv | Academic papers, abstracts |
| npm Registry | Package ecosystem data |
| Reddit JSON | Community sentiment |
| PyPI | Python packages |
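To show what "no authentication" looks like in practice, here is a sketch against one entry from the table, the HN Algolia search endpoint (`hn.algolia.com/api/v1/search`, which returns a `hits` array with `title`, `url`, and `points` fields). No key, no login:

```javascript
// Build the search URL for HN Algolia's public API.
function hnSearchUrl(query, { page = 0 } = {}) {
  const params = new URLSearchParams({ query, page: String(page) });
  return `https://hn.algolia.com/api/v1/search?${params}`;
}

// Return { title, url, points } for each matching story.
async function searchHN(query) {
  const res = await fetch(hnSearchUrl(query));
  if (!res.ok) throw new Error(`HN Algolia returned ${res.status}`);
  const { hits } = await res.json();
  return hits.map(({ title, url, points }) => ({ title, url, points }));
}
```

The same shape works for the other entries: swap the base URL and the field names you pluck out of the response.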

Pattern 3: Promise.allSettled() Over Promise.all()

```javascript
// BAD: one failed source rejects the whole Promise
const results = await Promise.all([wiki(), news(), github()]);

// GOOD: get results from every source that succeeded
const results = await Promise.allSettled([wiki(), news(), github()]);
const successful = results
  .filter((r) => r.status === "fulfilled")
  .map((r) => r.value); // unwrap the { status, value } entries
```

This is how I query 9 sources in parallel — if arXiv is slow, Wikipedia results still come through.

Pattern 4: Rate Limiting Built Into the Scraper

```javascript
const delay = (ms) => new Promise((r) => setTimeout(r, ms));

const results = [];
for (const url of urls) {
  const data = await scrape(url);
  results.push(data);
  await delay(1000 + Math.random() * 2000); // random 1-3s pause between requests
}
```

Never hammer a server. Random delays of 1 to 3 seconds keep your traffic from looking like a burst of bot requests and help avoid blocks.
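If a server does push back with `429 Too Many Requests`, the polite move is to back off exponentially rather than retry at the same pace. A sketch of that (the retry count and delay caps are my own assumptions, not from the original pattern):

```javascript
// Delay schedule: 1s, 2s, 4s, 8s… capped, plus jitter so retries
// from parallel scrapers don't all fire at the same instant.
function nextBackoff(attempt, baseMs = 1000, capMs = 30000) {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return exp + Math.random() * 1000;
}

// Retry a fetch until it stops returning 429, backing off each time.
async function scrapeWithRetry(url, maxRetries = 4) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;
    await new Promise((r) => setTimeout(r, nextBackoff(attempt)));
  }
  throw new Error(`Still rate-limited after ${maxRetries} retries: ${url}`);
}
```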

Pattern 5: Structured Output Validation

```javascript
function validateResult(item) {
  const required = ["title", "url", "date"];
  const missing = required.filter((f) => !item[f]);
  if (missing.length > 0) {
    console.warn(`Missing fields: ${missing.join(", ")}`);
    return null; // drop incomplete items instead of passing them on
  }
  return item;
}

const clean = results.map(validateResult).filter(Boolean);
```

Never return garbage. Validate every item before adding to output.

Need data from a specific website? I'll scrape it for $20 — any site, any format (JSON/CSV/Excel). 77 production scrapers. Delivered in 24 hours. Email: Spinov001@gmail.com | Hire me →
