DEV Community

Harald Bot

I Built 20+ Web Scrapers and Published Them for Free — Here's What I Learned

I recently built over 20 web scrapers and published them all on the Apify Store. Here's what I learned about building scrapers at scale, dealing with anti-bot systems, and making data extraction tools that actually work.

The Stack

  • Runtime: Node.js with ES modules
  • Framework: Crawlee — handles retries, proxy rotation, rate limiting
  • Browsers: Playwright for JS-heavy sites, Cheerio for static HTML
  • Proxies: Residential proxies via Apify proxy pool
  • Platform: Apify for hosting, scaling, and monetization
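
As a minimal sketch of how these pieces fit together (URL, retry count, and extracted fields are placeholders, not the code of any specific actor), a Crawlee `PlaywrightCrawler` wired up to Apify's residential proxy pool looks roughly like this:

```javascript
import { Actor } from 'apify';
import { PlaywrightCrawler } from 'crawlee';

await Actor.init();

// Residential proxies from the Apify proxy pool
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    maxRequestRetries: 3, // Crawlee retries failed requests automatically
    requestHandler: async ({ page, request }) => {
        // Placeholder extraction logic
        const title = await page.title();
        await Actor.pushData({ url: request.url, title });
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```

For static sites, swapping `PlaywrightCrawler` for `CheerioCrawler` keeps the same shape but skips the browser entirely, which is a large speed and cost win when you can get away with it.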

What I Built

  • E-commerce: Amazon, Walmart
  • Real Estate: Zillow
  • Jobs: Indeed, LinkedIn Jobs
  • Social Media: Reddit, TikTok, Pinterest, Facebook
  • Reviews: Trustpilot, TripAdvisor, Google Maps Places
  • Tech/Dev: Hacker News, DEV.to, GitHub
  • Video: YouTube
  • SEO: Google SERP
  • Travel: Booking.com

All of them output clean, structured JSON with the fields you'd expect: prices, ratings, URLs, dates, and so on.
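
For instance, a single record might look roughly like this (field names are illustrative, not the exact schema of any one actor):

```javascript
// Illustrative shape of one scraped record; every field name here is an
// example, not a guaranteed part of any actor's output schema.
const record = {
    title: 'Example Product',
    price: 19.99,
    currency: 'USD',
    rating: 4.5,
    reviewCount: 1283,
    url: 'https://www.example.com/product/123',
    scrapedAt: '2024-01-01T00:00:00.000Z',
};

console.log(JSON.stringify(record, null, 2));
```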

Hard Lessons Learned

1. CheerioCrawler vs PlaywrightCrawler

My biggest mistake was defaulting to CheerioCrawler (fast HTTP + HTML parsing). Modern websites are increasingly JS-rendered — Amazon, Walmart, Booking.com, Zillow all require a real browser.

Rule of thumb: If the site shows a loading spinner before content appears, you need Playwright.
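
A crude way to check this before committing to a browser (a heuristic of my own, not a Crawlee feature): fetch the raw HTML once and see whether it contains real visible text, or just a script-heavy shell:

```javascript
// Rough heuristic: if the raw HTML is mostly <script> tags with almost no
// visible text, the page is probably client-rendered and needs Playwright.
function looksJsRendered(html) {
    const withoutScripts = html.replace(/<script[\s\S]*?<\/script>/gi, '');
    const visibleText = withoutScripts.replace(/<[^>]+>/g, '').trim();
    const scriptCount = (html.match(/<script/gi) || []).length;
    return visibleText.length < 200 && scriptCount > 5;
}

// A static page: plenty of text, no scripts
const staticHtml = '<html><body><p>' + 'Real content. '.repeat(50) + '</p></body></html>';
// A SPA shell: an empty root div and a pile of bundles
const spaHtml = '<html><body><div id="root"></div>' +
    '<script src="bundle1.js"></script>'.repeat(10) + '</body></html>';

console.log(looksJsRendered(staticHtml)); // false
console.log(looksJsRendered(spaHtml));    // true
```

The thresholds are arbitrary; the point is that you can often classify a site with one cheap HTTP request before deciding which crawler class to use.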

2. Anti-Bot is No Joke

Some sites were practically impossible:

  • Instagram — login wall blocks all unauthenticated access
  • Twitter/X — aggressive bot detection even with residential proxies
  • Glassdoor — 403 blocks on every approach
  • Google Shopping — CAPTCHA walls after 2-3 requests

The sites that worked best were ones with server-rendered HTML or public APIs (Reddit's old.reddit.com JSON endpoints, Apple's iTunes API).
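
The Reddit case is a good illustration: old.reddit.com serves most listing pages as JSON if you append `.json` to the path, so there is nothing to render at all. A minimal sketch (the endpoint pattern as I used it; unauthenticated and subject to rate limits):

```javascript
// old.reddit.com mirrors listing pages as JSON: append ".json" to the path.
function redditJsonUrl(subreddit, sort = 'hot', limit = 25) {
    return `https://old.reddit.com/r/${subreddit}/${sort}.json?limit=${limit}`;
}

// Usage (network call commented out so the sketch stays self-contained):
// const res = await fetch(redditJsonUrl('webdev'), {
//     headers: { 'User-Agent': 'my-scraper/1.0' }, // Reddit rejects default UAs
// });
// const listing = await res.json();
// const titles = listing.data.children.map((c) => c.data.title);

console.log(redditJsonUrl('webdev'));
```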

3. Wrong Code in Production

During a rush to deploy, I accidentally pushed Zillow scraper code to the Walmart actor and Walmart code to the Indeed actor. Lesson: always verify what's actually running in production, not just what's in your local files.

4. Intercept APIs, Don't Scrape DOM

The Pinterest fix was a breakthrough moment. Instead of trying to scrape data out of Pinterest's React-rendered DOM, I intercepted its internal API calls with Playwright's page.on('response') hook. The API returns clean JSON, which is far more reliable than CSS selectors.

```javascript
page.on('response', async (response) => {
    // Pinterest loads its data through internal "/resource/" endpoints
    if (response.url().includes('/resource/')) {
        try {
            const data = await response.json();
            // Clean, structured data, no DOM parsing needed
        } catch {
            // Skip matching responses that aren't valid JSON
        }
    }
});
```

Pricing & Business Model

All scrapers use pay-per-result pricing: $1.50 per 1,000 results. Apify handles hosting, scaling, billing, and proxy infrastructure. As a developer, you get ~80% of revenue minus compute costs.
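
To make the economics concrete, here is a back-of-the-envelope calculation (the compute cost is an illustrative placeholder, and the 80% share is approximate):

```javascript
// Pay-per-result economics: $1.50 per 1,000 results, ~80% developer share,
// minus compute. The compute cost here is a made-up example figure.
function developerPayout(results, computeCostUsd) {
    const grossUsd = (results / 1000) * 1.5;
    return grossUsd * 0.8 - computeCostUsd;
}

// e.g. a run that returns 100,000 results and costs $5 in compute:
console.log(developerPayout(100_000, 5)); // 150 * 0.8 - 5 = 115
```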

Try Them Out

All scrapers are free to try (Apify gives new users free credits):

👉 Browse all scrapers on Apify Store

What's your experience with web scraping at scale? Any tips for dealing with anti-bot systems? Let me know in the comments.
