After building 77 production web scrapers, I've learned that most scrapers break within weeks. But a few patterns make them nearly indestructible.
Pattern 1: API-First, HTML-Last
Before writing a single CSS selector, check if the site has a JSON API.
// Instead of this (breaks on redesign):
const title = $("h1.product-title").text();

// Do this (survives redesigns):
const res = await fetch("https://site.com/api/products/123");
const { title } = await res.json();
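To make the preference concrete, here's a minimal sketch of an API-first fetch with an HTML fallback. The URLs and the regex-based title extraction are illustrative placeholders, not a real site's API:

```javascript
// Sketch: prefer the JSON API; only parse HTML if the API call fails.
// Both URLs and the <h1> regex are hypothetical placeholders.
async function getProduct(id) {
  const apiRes = await fetch(`https://site.com/api/products/${id}`);
  if (apiRes.ok) return apiRes.json(); // stable, structured data

  // Last resort: pull the title out of raw HTML (brittle, breaks on redesign).
  const pageRes = await fetch(`https://site.com/products/${id}`);
  const html = await pageRes.text();
  const match = html.match(/<h1[^>]*>([^<]*)<\/h1>/);
  return { title: match ? match[1].trim() : null };
}
```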
Examples of hidden APIs:
- YouTube: `youtubei/v1/search` (Innertube API)
- Reddit: append `.json` to any URL
- Hacker News: `hn.algolia.com/api/v1/search`
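The Reddit trick is easy to wrap in a helper. A small sketch (the function name is mine, not part of any Reddit API):

```javascript
// Sketch: turn any Reddit page URL into its JSON endpoint by appending .json.
function redditJsonUrl(pageUrl) {
  const u = new URL(pageUrl);
  u.pathname = u.pathname.replace(/\/$/, "") + ".json"; // drop trailing slash first
  return u.toString();
}

// Usage: fetch(redditJsonUrl("https://www.reddit.com/r/webdev/"))
//          .then((r) => r.json());
```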
Pattern 2: Use Official Public APIs First
9 APIs that need NO authentication:
| API | What You Get |
|---|---|
| Wikipedia | Market overviews, article snippets |
| Google News RSS | Latest 100 news articles |
| GitHub Search | Repos, stars, tech landscape |
| HN Algolia | Community discussions, points |
| Stack Overflow | Developer questions, votes |
| arXiv | Academic papers, abstracts |
| npm Registry | Package ecosystem data |
| Reddit JSON | Community sentiment |
| PyPI | Python packages |
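For example, the HN Algolia endpoint from the table can be queried with nothing but `fetch`. The `searchHN` wrapper below is my own sketch of that call, not an official client:

```javascript
// Sketch: build and call the no-auth HN Algolia search endpoint.
function hnSearchUrl(query, page = 0) {
  const url = new URL("https://hn.algolia.com/api/v1/search");
  url.searchParams.set("query", query);
  url.searchParams.set("page", String(page));
  return url.toString();
}

async function searchHN(query, page = 0) {
  const res = await fetch(hnSearchUrl(query, page));
  if (!res.ok) throw new Error(`HN API returned ${res.status}`);
  const { hits } = await res.json();
  // Keep only the fields we care about.
  return hits.map((h) => ({ title: h.title, points: h.points, url: h.url }));
}
```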
Pattern 3: Promise.allSettled() Over Promise.all()
// BAD: One failed source kills everything
const results = await Promise.all([wiki(), news(), github()]);
// GOOD: Get results from sources that work
const results = await Promise.allSettled([wiki(), news(), github()]);
const successful = results.filter(r => r.status === "fulfilled");
This is how I query 9 sources in parallel — if arXiv is slow, Wikipedia results still come through.
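A runnable version of that idea, with each source labeled so failures can be logged by name (the source names here are stand-ins for the real fetchers):

```javascript
// Sketch: run labeled sources in parallel; keep what succeeds, log what fails.
async function gatherSources(sources) {
  const names = Object.keys(sources);
  const settled = await Promise.allSettled(names.map((n) => sources[n]()));

  const ok = {};
  const failed = [];
  settled.forEach((r, i) => {
    if (r.status === "fulfilled") ok[names[i]] = r.value;
    else failed.push(`${names[i]} (${r.reason})`);
  });
  if (failed.length > 0) console.warn(`Sources failed: ${failed.join(", ")}`);
  return ok;
}

// Usage with stand-in fetchers:
// const data = await gatherSources({ wiki, news, github });
```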
Pattern 4: Rate Limiting Built Into the Scraper
const delay = (ms) => new Promise((r) => setTimeout(r, ms));

const results = [];
for (const url of urls) {
  const data = await scrape(url);
  results.push(data);
  await delay(1000 + Math.random() * 2000); // 1-3s random delay
}
Never hammer a server. Random delays between 1-3 seconds prevent blocks.
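You can go one step further and bake backoff into the fetch itself. The `politeFetch` wrapper below is a sketch: the retry count and backoff curve are arbitrary choices of mine, and it assumes the server signals limits with HTTP 429 and an optional `Retry-After` header.

```javascript
const delay = (ms) => new Promise((r) => setTimeout(r, ms));

// Sketch: retry on HTTP 429, honoring Retry-After when present,
// else exponential backoff with jitter.
async function politeFetch(url, retries = 3) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;

    const retryAfter = Number(res.headers.get("retry-after"));
    const wait = retryAfter > 0
      ? retryAfter * 1000
      : 500 * 2 ** attempt + Math.random() * 500;
    await delay(wait);
  }
  throw new Error(`Still rate-limited after ${retries} retries: ${url}`);
}
```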
Pattern 5: Structured Output Validation
function validateResult(item) {
  const required = ["title", "url", "date"];
  const missing = required.filter((f) => !item[f]);
  if (missing.length > 0) {
    console.warn(`Missing fields: ${missing.join(", ")}`);
    return null;
  }
  return item;
}
const clean = results.map(validateResult).filter(Boolean);
Never return garbage. Validate every item before adding to output.
Resources
- 77 Free Web Scrapers — all using these patterns
- MCP Market Research Server — queries 9 APIs in parallel
- 500 Free Market Research Reports
Need data from a specific website? I'll scrape it for $20 — any site, any format (JSON/CSV/Excel). 77 production scrapers. Delivered in 24 hours. Email: Spinov001@gmail.com | Hire me →