I Built 18 Web Scrapers in One Week - Here's What I Learned About Modern Scraping

Last week, I challenged myself to build and publish 18 production-ready web scrapers on Apify Store. Not toy projects - real tools that handle pagination, anti-bot measures, and edge cases.

Here's what I learned (and the mistakes I made).

The Challenge

Goal: Build scrapers for different categories - jobs, news, crypto, social media, developer tools.

Stack: Node.js, Cheerio, Crawlee, and FireCrawl API for the tough sites.

Result: 18 working scrapers, 350+ test runs, ~1 paying user (we'll get there).

Lesson 1: Free APIs Are Everywhere (And Nobody Uses Them)

Before writing a single line of scraping code, I discovered something surprising: many "protected" sites have completely free, undocumented APIs.

Examples I Found:

Site | API Type | Auth Required
---- | -------- | -------------
Remotive.com | REST API | No
CoinGecko | Public API | No
Greenhouse Job Boards | JSON endpoints | No
Hacker News | Firebase API | No
Reddit | JSON (append .json to URLs) | No

The lesson: Spend 30 minutes looking for APIs before writing a scraper. Check:

  • Network tab in DevTools
  • robots.txt for API hints
  • GitHub for unofficial API wrappers
  • Adding .json to URLs

Reddit is the classic example of that last trick:

// Instead of scraping Reddit HTML:
const url = 'https://www.reddit.com/r/webscraping.json';
const response = await fetch(url);
const data = await response.json();
// Clean JSON with all post data!
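
Hacker News is just as friendly: the Firebase API from the table above is a couple of unauthenticated GET requests. A quick sketch (not one of the published actors):

// Top story IDs from the official Hacker News Firebase API - no auth needed
const ids = await (await fetch('https://hacker-news.firebaseio.com/v0/topstories.json')).json();

// Hydrate the first 10 stories
const stories = await Promise.all(
  ids.slice(0, 10).map(id =>
    fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`).then(r => r.json())
  )
);

console.log(stories.map(s => s.title));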

Lesson 2: The 403 Tier List

Not all websites are created equal. After building 18 scrapers, here's my tier list:

S-Tier (Easy - Use APIs)

  • Hacker News
  • CoinGecko
  • GitHub API
  • Stack Overflow API
  • NPM Registry

A-Tier (Medium - Standard Scraping Works)

  • Dev.to
  • RemoteOK
  • Arbeitnow
  • Eventbrite
  • Google News RSS

B-Tier (Hard - Need Stealth)

  • Product Hunt
  • Glassdoor
  • TripAdvisor
  • Bark.com

F-Tier (Basically Impossible Without $$$)

  • LinkedIn (DataDome)
  • Yelp (Custom WAF)
  • DoorDash (Bot Detection)
  • Amazon (CAPTCHA + IP blocks)

Lesson: Pick your battles. Start with S and A tier sites.

Lesson 3: The "Works on My Machine" Problem

My scrapers worked perfectly locally. Then I deployed them.

What changed:

  • Apify's IP ranges are well-known (blocked by many sites)
  • No residential proxy by default
  • Different User-Agent detection

Solution: Use external scraping APIs for tough sites:

// For B-tier sites, use a scraping API (FireCrawl, ScrapingBee, or similar)
const API_KEY = process.env.FIRECRAWL_API_KEY; // set in your environment or Apify secrets

const scrapePage = async (url) => {
  const response = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url, formats: ['markdown'] })
  });
  return response.json();
};
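
Another route, since the stack already includes Crawlee on Apify: route requests through the platform's residential proxy group instead of the default datacenter IPs. A minimal sketch (assumes your Apify plan includes residential proxies):

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Residential IPs instead of Apify's well-known datacenter ranges
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, $ }) {
    // Selectors are site-specific; this just proves the request got through
    console.log(request.url, $('title').text());
  },
});

await crawler.run(['https://example.com']);
await Actor.exit();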

Lesson 4: Pagination is Where Scrapers Die

Most scraper tutorials show you how to scrape one page. Real scrapers need to handle:

  • Infinite scroll
  • "Load more" buttons
  • URL-based pagination (?page=2)
  • Cursor-based pagination
  • Rate limits between pages

My pagination pattern:

const scrapeWithPagination = async (baseUrl, maxPages = 10) => {
  const results = [];
  let page = 1;

  while (page <= maxPages) {
    const url = `${baseUrl}?page=${page}`;
    const data = await scrapePage(url);

    if (!data.items?.length) break; // No more results

    results.push(...data.items);
    page++;

    // Be nice to servers
    await new Promise(r => setTimeout(r, 1000));
  }

  return results;
};
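
That handles ?page=N URLs. For cursor-based APIs, the loop keys off whatever cursor field the response returns instead of a page counter. A sketch reusing scrapePage from earlier, with nextCursor as an assumed field name:

const scrapeWithCursor = async (baseUrl, maxRequests = 10) => {
  const results = [];
  let cursor = null;

  for (let i = 0; i < maxRequests; i++) {
    // "cursor" is a placeholder query param - match whatever the target API uses
    const url = cursor ? `${baseUrl}?cursor=${encodeURIComponent(cursor)}` : baseUrl;
    const data = await scrapePage(url);

    if (!data.items?.length) break;
    results.push(...data.items);

    cursor = data.nextCursor; // assumed field name
    if (!cursor) break;       // no cursor means we hit the last page

    await new Promise(r => setTimeout(r, 1000)); // same politeness delay as above
  }

  return results;
};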

Lesson 5: Error Handling > Feature Count

My first scrapers had great features and terrible error handling. They crashed on:

  • Empty responses
  • Changed HTML structure
  • Rate limit responses
  • Network timeouts
  • Partial data

Now every scraper has:

// Small helpers used below
const sleep = (ms) => new Promise(r => setTimeout(r, ms));
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
];
const getRandomUA = () => USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];

const safeScrape = async (url) => {
  try {
    const response = await fetch(url, {
      // Native fetch has no `timeout` option - abort via AbortSignal instead
      signal: AbortSignal.timeout(30000),
      headers: { 'User-Agent': getRandomUA() }
    });

    if (response.status === 429) {
      console.log('Rate limited, waiting...');
      await sleep(60000);
      return safeScrape(url); // Retry
    }

    if (!response.ok) {
      console.log(`HTTP ${response.status} for ${url}`);
      return { success: false, error: response.status };
    }

    const data = await response.json();
    return { success: true, data };

  } catch (error) {
    console.log(`Error scraping ${url}: ${error.message}`);
    return { success: false, error: error.message };
  }
};
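
One caveat with the snippet above: the 429 branch retries with no upper bound. If a site keeps rate-limiting you, a bounded retry with exponential backoff is safer. A minimal sketch wrapping safeScrape (the backoff values are arbitrary):

const withRetry = async (fn, maxAttempts = 4) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const result = await fn();
    if (result.success || attempt === maxAttempts) return result;

    // Wait 2s, 4s, 8s, ... between attempts
    await sleep(1000 * 2 ** attempt);
  }
};

// Usage
const result = await withRetry(() => safeScrape('https://example.com/jobs'));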

Lesson 6: The MCP Revolution

The most exciting discovery: MCP (Model Context Protocol) lets AI agents use scrapers directly.

Instead of:

  1. Human requests data
  2. Human runs scraper
  3. Human processes results
  4. Human gives to AI

Now:

  1. AI agent calls scraper via MCP
  2. AI processes results automatically

This changes everything. Scrapers aren't just for developers anymore - they're tools for AI agents.
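
To make step 1 concrete, here's a rough sketch of an MCP client calling a scraper tool with the official TypeScript SDK (@modelcontextprotocol/sdk). The server command, tool name, and arguments are placeholders - they depend on which MCP server actually exposes your actor:

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Placeholder: point this at whatever MCP server wraps your scraper
const transport = new StdioClientTransport({
  command: 'node',
  args: ['./my-scraper-mcp-server.js'],
});

const client = new Client({ name: 'scraper-agent', version: '1.0.0' }, { capabilities: {} });
await client.connect(transport);

// Discover the server's tools, then call one (tool name and arguments are hypothetical)
const { tools } = await client.listTools();
console.log(tools.map(t => t.name));

const result = await client.callTool({
  name: 'scrape-job-board',
  arguments: { query: 'remote javascript', maxItems: 20 },
});
console.log(result.content);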

What Actually Worked (And What Didn't)

Worked:

  • Job board scrapers - High demand, structured data
  • News aggregators - RSS feeds are reliable (quick sketch below)
  • Developer tools (GitHub, NPM, Stack Overflow) - Great APIs
  • Crypto data - Free APIs everywhere
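
On the RSS point: Google News (from the A-tier list) exposes a plain RSS search feed, so a feed library does all the work. A quick sketch with the rss-parser package (my choice for illustration, not necessarily what the published actors use):

import Parser from 'rss-parser';

const parser = new Parser();

// Google News serves search results as plain RSS - no key, no scraping
const feed = await parser.parseURL(
  'https://news.google.com/rss/search?q=web%20scraping&hl=en-US'
);

const articles = feed.items.map(item => ({
  title: item.title,
  link: item.link,
  published: item.pubDate,
}));

console.log(articles.slice(0, 5));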

Didn't Work:

  • E-commerce - Too protected, need expensive proxies
  • Social media - API changes, legal gray area
  • Review sites - Heavy anti-bot (Yelp, TripAdvisor)

The Numbers (Honest)

After one week:

  • 18 scrapers published
  • 350+ test runs
  • ~21 MAU (Monthly Active Users)
  • $0 revenue (so far)

The Apify $1M Challenge requires 50 MAU by January 31st. I'm getting there!

Lesson: Building is the easy part. Distribution is everything.

What's Next

  1. Content marketing - This article is part of that
  2. GitHub awesome-lists - PRs submitted
  3. Community engagement - Discord, Reddit (carefully)
  4. Better SEO - Optimizing actor descriptions

Try My Scrapers

All 18 scrapers are free to try on Apify Store:

Jobs:

Developer Tools:

News & Social:

Other:

Questions?

Drop a comment if you want me to dive deeper into any of these topics:

  • Anti-bot bypass techniques
  • Pagination patterns
  • MCP integration for AI agents
  • Monetizing scrapers

Building in public. Follow the journey on Twitter/X or check the portfolio.
