Last week, I challenged myself to build and publish 18 production-ready web scrapers on Apify Store. Not toy projects - real tools that handle pagination, anti-bot measures, and edge cases.
Here's what I learned (and the mistakes I made).
The Challenge
Goal: Build scrapers for different categories - jobs, news, crypto, social media, developer tools.
Stack: Node.js, Cheerio, Crawlee, and FireCrawl API for the tough sites.
Result: 18 working scrapers, 350+ test runs, ~1 paying user (we'll get there).
Lesson 1: Free APIs Are Everywhere (And Nobody Uses Them)
Before writing a single line of scraping code, I discovered something surprising: many "protected" sites have completely free, undocumented APIs.
Examples I Found:
| Site | API Type | Auth Required |
|---|---|---|
| Remotive.com | REST API | No |
| CoinGecko | Public API | No |
| Greenhouse Job Boards | JSON endpoints | No |
| Hacker News | Firebase API | No |
| Reddit | JSON append to URLs | No |
The lesson: Spend 30 minutes looking for APIs before writing a scraper. Check:
- The Network tab in DevTools
- `robots.txt` for API hints
- GitHub for unofficial API wrappers
- Adding `.json` to URLs
```js
// Instead of scraping Reddit HTML:
const url = 'https://www.reddit.com/r/webscraping.json';
const response = await fetch(url);
const data = await response.json();
// Clean JSON with all post data!
```
Lesson 2: The 403 Tier List
Not all websites are created equal. After building 18 scrapers, here's my tier list:
S-Tier (Easy - Use APIs)
- Hacker News
- CoinGecko
- GitHub API
- Stack Overflow API
- NPM Registry
A-Tier (Medium - Standard Scraping Works)
- Dev.to
- RemoteOK
- Arbeitnow
- Eventbrite
- Google News RSS
B-Tier (Hard - Need Stealth)
- Product Hunt
- Glassdoor
- TripAdvisor
- Bark.com
F-Tier (Basically Impossible Without $$$)
- LinkedIn (DataDome)
- Yelp (Custom WAF)
- DoorDash (Bot Detection)
- Amazon (CAPTCHA + IP blocks)
Lesson: Pick your battles. Start with S and A tier sites.
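To show how little code an S-tier source takes, here's a minimal sketch against the Hacker News Firebase API (public endpoints, no auth, no stealth; error handling omitted for brevity):

```js
// Fetch the current top stories from Hacker News' public Firebase API
const getTopStories = async (limit = 10) => {
  const idsRes = await fetch('https://hacker-news.firebaseio.com/v0/topstories.json');
  const ids = await idsRes.json(); // array of story IDs

  // Fetch each story's details in parallel
  const stories = await Promise.all(
    ids.slice(0, limit).map(async (id) => {
      const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
      return res.json();
    })
  );

  return stories.map(({ title, url, score }) => ({ title, url, score }));
};
```

That's the whole scraper. Compare that to the weeks of proxy and fingerprint work an F-tier site demands.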
Lesson 3: The "Works on My Machine" Problem
My scrapers worked perfectly locally. Then I deployed them.
What changed:
- Apify's IP ranges are well-known (blocked by many sites)
- No residential proxy by default
- Default User-Agents that anti-bot checks flag
Solution: Use external scraping APIs for tough sites:
```js
// For B-tier sites, use a scraping API
const scrapePage = async (url) => {
  // FireCrawl, ScrapingBee, or similar
  const response = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url, formats: ['markdown'] })
  });
  return response.json();
};
```
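If you'd rather stay on the Apify platform, the other route is residential proxies through Crawlee's proxy configuration. A minimal sketch, assuming the Apify SDK v3 API and that your plan includes the RESIDENTIAL proxy group:

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Residential IPs instead of Apify's well-known datacenter ranges
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, $ }) {
    // Push whatever fields the actor actually extracts
    await Actor.pushData({ url: request.url, title: $('title').text() });
  },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```

Residential proxies cost more per GB, so I only reach for them when the external scraping APIs fall short.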
Lesson 4: Pagination is Where Scrapers Die
Most scraper tutorials show you how to scrape one page. Real scrapers need to handle:
- Infinite scroll
- "Load more" buttons
- URL-based pagination (?page=2)
- Cursor-based pagination
- Rate limits between pages
My pagination pattern:
```js
const scrapeWithPagination = async (baseUrl, maxPages = 10) => {
  const results = [];
  let page = 1;

  while (page <= maxPages) {
    const url = `${baseUrl}?page=${page}`;
    const data = await scrapePage(url);

    if (!data.items?.length) break; // No more results

    results.push(...data.items);
    page++;

    // Be nice to servers
    await new Promise(r => setTimeout(r, 1000));
  }

  return results;
};
```
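Cursor-based pagination follows the same loop shape, except the next request comes from the previous response instead of a page counter. A sketch; the `nextCursor` field name is hypothetical and varies by API:

```js
const scrapeWithCursor = async (baseUrl, maxPages = 10) => {
  const results = [];
  let cursor = null;

  for (let page = 0; page < maxPages; page++) {
    const url = cursor ? `${baseUrl}?cursor=${encodeURIComponent(cursor)}` : baseUrl;
    const data = await scrapePage(url);

    if (!data.items?.length) break;
    results.push(...data.items);

    cursor = data.nextCursor; // hypothetical field; check the API's actual response shape
    if (!cursor) break;       // no cursor means we've hit the last page

    await new Promise(r => setTimeout(r, 1000)); // stay polite between requests
  }

  return results;
};
```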
Lesson 5: Error Handling > Feature Count
My first scrapers had great features and terrible error handling. They crashed on:
- Empty responses
- Changed HTML structure
- Rate limit responses
- Network timeouts
- Partial data
Now every scraper has:
```js
const safeScrape = async (url, retries = 3) => {
  try {
    const response = await fetch(url, {
      // Native fetch has no `timeout` option; abort via a signal instead
      signal: AbortSignal.timeout(30000),
      headers: { 'User-Agent': getRandomUA() }
    });

    if (response.status === 429) {
      if (retries === 0) return { success: false, error: 429 };
      console.log('Rate limited, waiting...');
      await sleep(60000);
      return safeScrape(url, retries - 1); // Retry, but only a bounded number of times
    }

    if (!response.ok) {
      console.log(`HTTP ${response.status} for ${url}`);
      return { success: false, error: response.status };
    }

    const data = await response.json();
    return { success: true, data };
  } catch (error) {
    console.log(`Error scraping ${url}: ${error.message}`);
    return { success: false, error: error.message };
  }
};
```
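In practice I combine the two patterns: the pagination loop from Lesson 4 goes through this wrapper, so one bad page ends the run cleanly instead of crashing it. A sketch, assuming the JSON response has an `items` array as in the earlier examples:

```js
const scrapeAllPages = async (baseUrl, maxPages = 10) => {
  const results = [];

  for (let page = 1; page <= maxPages; page++) {
    const result = await safeScrape(`${baseUrl}?page=${page}`);

    if (!result.success) break;              // give up on errors instead of crashing
    if (!result.data.items?.length) break;   // no more results

    results.push(...result.data.items);
    await new Promise(r => setTimeout(r, 1000)); // stay polite between pages
  }

  return results;
};
```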
Lesson 6: The MCP Revolution
The most exciting discovery: MCP (Model Context Protocol) lets AI agents use scrapers directly.
Instead of:
- Human requests data
- Human runs scraper
- Human processes results
- Human gives to AI
Now:
- AI agent calls scraper via MCP
- AI processes results automatically
This changes everything. Scrapers aren't just for developers anymore - they're tools for AI agents.
What Actually Worked (And What Didn't)
Worked:
- Job board scrapers - High demand, structured data
- News aggregators - RSS feeds are reliable
- Developer tools (GitHub, NPM, Stack Overflow) - Great APIs
- Crypto data - Free APIs everywhere
Didn't Work:
- E-commerce - Too protected, need expensive proxies
- Social media - API changes, legal gray area
- Review sites - Heavy anti-bot (Yelp, TripAdvisor)
The Numbers (Honest)
After one week:
- 18 scrapers published
- 350+ test runs
- ~21 MAU (Monthly Active Users)
- $0 revenue (so far)
The Apify $1M Challenge requires 50 MAU by January 31st. I'm getting there!
Lesson: Building is the easy part. Distribution is everything.
What's Next
- Content marketing - This article is part of that
- GitHub awesome-lists - PRs submitted
- Community engagement - Discord, Reddit (carefully)
- Better SEO - Optimizing actor descriptions
Try My Scrapers
All 18 scrapers are free to try on Apify Store, grouped into Jobs, Developer Tools, News & Social, and Other.
Questions?
Drop a comment if you want me to dive deeper into any of these topics:
- Anti-bot bypass techniques
- Pagination patterns
- MCP integration for AI agents
- Monetizing scrapers
Building in public. Follow the journey on Twitter/X or check the portfolio.