I recently built over 20 web scrapers and published them all on the Apify Store. Here's what I learned about building scrapers at scale, dealing with anti-bot systems, and making data extraction tools that actually work.
The Stack
- Runtime: Node.js with ES modules
- Framework: Crawlee — handles retries, proxy rotation, rate limiting
- Browsers: Playwright for JS-heavy sites, Cheerio for static HTML
- Proxies: Residential proxies via Apify proxy pool
- Platform: Apify for hosting, scaling, and monetization
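Crawlee earns its place in the stack mostly by handling retries and backoff so you don't have to. As a rough illustration of what the framework saves you, here is a hand-rolled retry-with-exponential-backoff helper — a sketch of the pattern, not Crawlee's actual implementation:

```javascript
// Sketch of the retry/backoff behavior Crawlee gives you out of the box.
// Not Crawlee's code; just the shape of the problem it solves.
async function withRetries(fn, { maxRetries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1s, 2s, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Crawlee layers proxy rotation and per-domain rate limiting on top of this same pattern, which is why it beats raw `fetch` loops for anything beyond a one-off script.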
What I Built
| Category | Scrapers |
|---|---|
| E-commerce | Amazon, Walmart |
| Real Estate | Zillow |
| Jobs | Indeed, LinkedIn Jobs |
| Social Media | Reddit, TikTok, Pinterest, Facebook |
| Reviews | Trustpilot, TripAdvisor, Google Maps Places |
| Tech/Dev | Hacker News, DEV.to, GitHub |
| Video | YouTube |
| SEO | Google SERP |
| Travel | Booking.com |
All of them output clean, structured JSON with the fields you'd expect — prices, ratings, URLs, dates, etc.

Hard Lessons Learned
1. CheerioCrawler vs PlaywrightCrawler
My biggest mistake was defaulting to CheerioCrawler (fast HTTP + HTML parsing). Modern websites are increasingly JS-rendered — Amazon, Walmart, Booking.com, Zillow all require a real browser.
Rule of thumb: If the site shows a loading spinner before content appears, you need Playwright.
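A cheap way to check before committing to a crawler class is to fetch the raw HTML and see whether the content you want is actually in it. This heuristic is my own shortcut, not a Crawlee API — the marker and shell patterns below are assumptions you'd tune per site:

```javascript
// Heuristic: if the server-sent HTML doesn't contain the marker text you
// care about, the page is probably rendered client-side and needs Playwright.
function needsBrowser(rawHtml, contentMarker) {
  const hasMarker = rawHtml.includes(contentMarker);
  // Common tell of a JS-rendered shell: an empty root div that a framework
  // like React or Next.js fills in after load.
  const looksLikeShell = /<div id="(root|app|__next)">\s*<\/div>/.test(rawHtml);
  return !hasMarker || looksLikeShell;
}
```

If `needsBrowser(html, 'data-price')` comes back `true`, reach for PlaywrightCrawler; otherwise CheerioCrawler will be an order of magnitude cheaper to run.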
2. Anti-Bot is No Joke
Some sites were practically impossible:
- Instagram — login wall blocks all unauthenticated access
- Twitter/X — aggressive bot detection even with residential proxies
- Glassdoor — 403 blocks on every approach
- Google Shopping — CAPTCHA walls after 2-3 requests
The sites that worked best were ones with server-rendered HTML or public APIs (Reddit's old.reddit.com JSON endpoints, Apple's iTunes API).
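For example, appending `.json` to most old.reddit.com URLs returns the listing as JSON. Here's a sketch of how I consumed it — the listing shape is Reddit's public format, but the field mapping is trimmed to a few fields for illustration:

```javascript
// Map Reddit's public listing JSON to flat records.
// Listing shape: { data: { children: [{ data: { title, permalink, score, ... } }] } }
function extractPosts(listing) {
  return listing.data.children.map(({ data }) => ({
    title: data.title,
    url: `https://old.reddit.com${data.permalink}`,
    score: data.score,
  }));
}

// Usage (Node 18+ ships a global fetch):
// const res = await fetch('https://old.reddit.com/r/programming/top.json?limit=25');
// const posts = extractPosts(await res.json());
```

No browser, no selectors, no anti-bot arms race — which is exactly why API-backed sites were the easiest of the twenty.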
3. Wrong Code in Production
During a rush to deploy, I accidentally pushed Zillow scraper code to the Walmart actor and Walmart code to the Indeed actor. Lesson: always verify what's actually running in production, not just what's in your local files.
4. Intercept APIs, Don't Scrape DOM
The Pinterest fix was a breakthrough moment. Instead of trying to scrape React's rendered DOM with fragile selectors, I intercepted Pinterest's internal API calls using `page.on('response')`. The API returns clean JSON — far more reliable than CSS selectors.
```javascript
page.on('response', async (response) => {
  // Pinterest's internal resource API carries the same data the page renders
  if (response.url().includes('/resource/')) {
    try {
      const data = await response.json();
      // Clean, structured data — no DOM parsing needed
    } catch (err) {
      // Non-JSON or aborted responses are safe to ignore
    }
  }
});
```
Pricing & Business Model
All scrapers use pay-per-result pricing: $1.50 per 1,000 results. Apify handles hosting, scaling, billing, and proxy infrastructure. As a developer, you get ~80% of revenue minus compute costs.
Try Them Out
All scrapers are free to try (Apify gives new users free credits):
👉 Browse all scrapers on Apify Store
What's your experience with web scraping at scale? Any tips for dealing with anti-bot systems? Let me know in the comments.