I recently built over 20 web scrapers and published them all on the Apify Store. Here's what I learned about building scrapers at scale, dealing with anti-bot systems, and making data extraction tools that actually work.
The Stack
- Runtime: Node.js with ES modules
- Framework: Crawlee — handles retries, proxy rotation, rate limiting
- Browsers: Playwright for JS-heavy sites, Cheerio for static HTML
- Proxies: Residential proxies via Apify proxy pool
- Platform: Apify for hosting, scaling, and monetization
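Crawlee earns its place in the stack mostly by handling retries and backoff so you don't have to. As a rough illustration of what the framework saves you, here is a hand-rolled retry-with-exponential-backoff helper — a sketch of the pattern, not Crawlee's actual implementation:

```javascript
// Sketch of the retry/backoff behavior Crawlee gives you out of the box.
// Not Crawlee's code; just the shape of the problem it solves.
async function withRetries(fn, { maxRetries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1s, 2s, ...
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Crawlee layers proxy rotation and per-domain rate limiting on top of this same pattern, which is why it beats raw `fetch` loops for anything beyond a one-off script.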
What I Built
| Category | Scrapers |
|---|---|
| E-commerce | Amazon, Walmart |
| Real Estate | Zillow |
| Jobs | Indeed, LinkedIn Jobs |
| Social Media | Reddit, TikTok, Pinterest, Facebook |
| Reviews | Trustpilot, TripAdvisor, Google Maps Places |
| Tech/Dev | Hacker News, DEV.to, GitHub |
| Video | YouTube |
| SEO | Google SERP |
| Travel | Booking.com |
All of them output clean, structured JSON with the fields you'd expect — prices, ratings, URLs, dates, etc.

Hard Lessons Learned
1. CheerioCrawler vs PlaywrightCrawler
My biggest mistake was defaulting to CheerioCrawler (fast HTTP + HTML parsing). Modern websites are increasingly JS-rendered — Amazon, Walmart, Booking.com, Zillow all require a real browser.
Rule of thumb: If the site shows a loading spinner before content appears, you need Playwright.
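A cheap way to check before committing to a crawler class is to fetch the raw HTML and see whether the content you want is actually in it. This heuristic is my own shortcut, not a Crawlee API — the marker and shell patterns below are assumptions you'd tune per site:

```javascript
// Heuristic: if the server-sent HTML doesn't contain the marker text you
// care about, the page is probably rendered client-side and needs Playwright.
function needsBrowser(rawHtml, contentMarker) {
  const hasMarker = rawHtml.includes(contentMarker);
  // Common tell of a JS-rendered shell: an empty root div that a framework
  // like React or Next.js fills in after load.
  const looksLikeShell = /<div id="(root|app|__next)">\s*<\/div>/.test(rawHtml);
  return !hasMarker || looksLikeShell;
}
```

If `needsBrowser(html, 'data-price')` comes back `true`, reach for PlaywrightCrawler; otherwise CheerioCrawler will be an order of magnitude cheaper to run.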
2. Anti-Bot is No Joke
Some sites were practically impossible:
- Instagram — login wall blocks all unauthenticated access
- Twitter/X — aggressive bot detection even with residential proxies
- Glassdoor — 403 blocks on every approach
- Google Shopping — CAPTCHA walls after 2-3 requests
The sites that worked best were ones with server-rendered HTML or public APIs (Reddit's old.reddit.com JSON endpoints, Apple's iTunes API).
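For example, appending `.json` to most old.reddit.com URLs returns the listing as JSON. Here's a sketch of how I consumed it — the listing shape is Reddit's public format, but the field mapping is trimmed to a few fields for illustration:

```javascript
// Map Reddit's public listing JSON to flat records.
// Listing shape: { data: { children: [{ data: { title, permalink, score, ... } }] } }
function extractPosts(listing) {
  return listing.data.children.map(({ data }) => ({
    title: data.title,
    url: `https://old.reddit.com${data.permalink}`,
    score: data.score,
  }));
}

// Usage (Node 18+ ships a global fetch):
// const res = await fetch('https://old.reddit.com/r/programming/top.json?limit=25');
// const posts = extractPosts(await res.json());
```

No browser, no selectors, no anti-bot arms race — which is exactly why API-backed sites were the easiest of the twenty.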
3. Wrong Code in Production
During a rush to deploy, I accidentally pushed Zillow scraper code to the Walmart actor and Walmart code to the Indeed actor. Lesson: always verify what's actually running in production, not just what's in your local files.
4. Intercept APIs, Don't Scrape DOM
The Pinterest fix was a breakthrough moment. Instead of trying to scrape React's rendered DOM with fragile selectors, I intercepted Pinterest's internal API calls using `page.on('response')`. The API returns clean JSON — far more reliable than CSS selectors.
```javascript
page.on('response', async (response) => {
  // Pinterest's internal resource API carries the same data the page renders
  if (response.url().includes('/resource/')) {
    try {
      const data = await response.json();
      // Clean, structured data — no DOM parsing needed
    } catch (err) {
      // Non-JSON or aborted responses are safe to ignore
    }
  }
});
```
Pricing & Business Model
All scrapers use pay-per-result pricing: $1.50 per 1,000 results. Apify handles hosting, scaling, billing, and proxy infrastructure. As a developer, you get ~80% of revenue minus compute costs.
Try Them Out
All scrapers are free to try (Apify gives new users free credits):
👉 Browse all scrapers on Apify Store
What's your experience with web scraping at scale? Any tips for dealing with anti-bot systems? Let me know in the comments.