Last week, I challenged myself to build and publish 18 production-ready web scrapers on Apify Store. Not toy projects - real tools that handle pagination, anti-bot measures, and edge cases.
Here's what I learned (and the mistakes I made).
The Challenge
Goal: Build scrapers for different categories - jobs, news, crypto, social media, developer tools.
Stack: Node.js, Cheerio, Crawlee, and FireCrawl API for the tough sites.
Result: 18 working scrapers, 350+ test runs, ~1 paying user (we'll get there).
Lesson 1: Free APIs Are Everywhere (And Nobody Uses Them)
Before writing a single line of scraping code, I discovered something surprising: many "protected" sites have completely free, undocumented APIs.
Examples I Found:
| Site | API Type | Auth Required |
|---|---|---|
| Remotive.com | REST API | No |
| CoinGecko | Public API | No |
| Greenhouse Job Boards | JSON endpoints | No |
| Hacker News | Firebase API | No |
| Reddit | JSON append to URLs | No |
The lesson: Spend 30 minutes looking for APIs before writing a scraper. Check:
- The Network tab in DevTools
- `robots.txt` for API hints
- GitHub for unofficial API wrappers
- Adding `.json` to URLs
```js
// Instead of scraping Reddit HTML:
const url = 'https://www.reddit.com/r/webscraping.json';
const response = await fetch(url);
const data = await response.json();
// Clean JSON with all post data!
```
Lesson 2: The 403 Tier List
Not all websites are created equal. After building 18 scrapers, here's my tier list:
S-Tier (Easy - Use APIs)
- Hacker News
- CoinGecko
- GitHub API
- Stack Overflow API
- NPM Registry
A-Tier (Medium - Standard Scraping Works)
- Dev.to
- RemoteOK
- Arbeitnow
- Eventbrite
- Google News RSS
B-Tier (Hard - Need Stealth)
- Product Hunt
- Glassdoor
- TripAdvisor
- Bark.com
F-Tier (Basically Impossible Without $$$)
- LinkedIn (DataDome)
- Yelp (Custom WAF)
- DoorDash (Bot Detection)
- Amazon (CAPTCHA + IP blocks)
Lesson: Pick your battles. Start with S and A tier sites.
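To show how little code an S-tier source takes, here's a minimal sketch against the Hacker News Firebase API (public endpoints, no auth, no stealth; error handling omitted for brevity):

```js
// Fetch the current top stories from Hacker News' public Firebase API
const getTopStories = async (limit = 10) => {
  const idsRes = await fetch('https://hacker-news.firebaseio.com/v0/topstories.json');
  const ids = await idsRes.json(); // array of story IDs

  // Fetch each story's details in parallel
  const stories = await Promise.all(
    ids.slice(0, limit).map(async (id) => {
      const res = await fetch(`https://hacker-news.firebaseio.com/v0/item/${id}.json`);
      return res.json();
    })
  );

  return stories.map(({ title, url, score }) => ({ title, url, score }));
};
```

That's the whole scraper. Compare that to the weeks of proxy and fingerprint work an F-tier site demands.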
Lesson 3: The "Works on My Machine" Problem
My scrapers worked perfectly locally. Then I deployed them.
What changed:
- Apify's IP ranges are well-known (blocked by many sites)
- No residential proxy by default
- Default User-Agents that anti-bot checks flag
Solution: Use external scraping APIs for tough sites:
```js
// For B-tier sites, use a scraping API
const scrapePage = async (url) => {
  // FireCrawl, ScrapingBee, or similar
  const response = await fetch('https://api.firecrawl.dev/v1/scrape', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ url, formats: ['markdown'] })
  });
  return response.json();
};
```
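If you'd rather stay on the Apify platform, the other route is residential proxies through Crawlee's proxy configuration. A minimal sketch, assuming the Apify SDK v3 API and that your plan includes the RESIDENTIAL proxy group:

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Residential IPs instead of Apify's well-known datacenter ranges
const proxyConfiguration = await Actor.createProxyConfiguration({
  groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  async requestHandler({ request, $ }) {
    // Push whatever fields the actor actually extracts
    await Actor.pushData({ url: request.url, title: $('title').text() });
  },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```

Residential proxies cost more per GB, so I only reach for them when the external scraping APIs fall short.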
Lesson 4: Pagination is Where Scrapers Die
Most scraper tutorials show you how to scrape one page. Real scrapers need to handle:
- Infinite scroll
- "Load more" buttons
- URL-based pagination (?page=2)
- Cursor-based pagination
- Rate limits between pages
My pagination pattern:
```js
const scrapeWithPagination = async (baseUrl, maxPages = 10) => {
  const results = [];
  let page = 1;

  while (page <= maxPages) {
    const url = `${baseUrl}?page=${page}`;
    const data = await scrapePage(url);

    if (!data.items?.length) break; // No more results

    results.push(...data.items);
    page++;

    // Be nice to servers
    await new Promise(r => setTimeout(r, 1000));
  }

  return results;
};
```
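Cursor-based pagination follows the same loop shape, except the next request comes from the previous response instead of a page counter. A sketch; the `nextCursor` field name is hypothetical and varies by API:

```js
const scrapeWithCursor = async (baseUrl, maxPages = 10) => {
  const results = [];
  let cursor = null;

  for (let page = 0; page < maxPages; page++) {
    const url = cursor ? `${baseUrl}?cursor=${encodeURIComponent(cursor)}` : baseUrl;
    const data = await scrapePage(url);

    if (!data.items?.length) break;
    results.push(...data.items);

    cursor = data.nextCursor; // hypothetical field; check the API's actual response shape
    if (!cursor) break;       // no cursor means we've hit the last page

    await new Promise(r => setTimeout(r, 1000)); // stay polite between requests
  }

  return results;
};
```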
Lesson 5: Error Handling > Feature Count
My first scrapers had great features and terrible error handling. They crashed on:
- Empty responses
- Changed HTML structure
- Rate limit responses
- Network timeouts
- Partial data
Now every scraper has:
```js
const safeScrape = async (url, retries = 3) => {
  try {
    const response = await fetch(url, {
      // Native fetch has no `timeout` option; abort via a signal instead
      signal: AbortSignal.timeout(30000),
      headers: { 'User-Agent': getRandomUA() }
    });

    if (response.status === 429) {
      if (retries === 0) return { success: false, error: 429 };
      console.log('Rate limited, waiting...');
      await sleep(60000);
      return safeScrape(url, retries - 1); // Retry, but only a bounded number of times
    }

    if (!response.ok) {
      console.log(`HTTP ${response.status} for ${url}`);
      return { success: false, error: response.status };
    }

    const data = await response.json();
    return { success: true, data };
  } catch (error) {
    console.log(`Error scraping ${url}: ${error.message}`);
    return { success: false, error: error.message };
  }
};
```
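In practice I combine the two patterns: the pagination loop from Lesson 4 goes through this wrapper, so one bad page ends the run cleanly instead of crashing it. A sketch, assuming the JSON response has an `items` array as in the earlier examples:

```js
const scrapeAllPages = async (baseUrl, maxPages = 10) => {
  const results = [];

  for (let page = 1; page <= maxPages; page++) {
    const result = await safeScrape(`${baseUrl}?page=${page}`);

    if (!result.success) break;              // give up on errors instead of crashing
    if (!result.data.items?.length) break;   // no more results

    results.push(...result.data.items);
    await new Promise(r => setTimeout(r, 1000)); // stay polite between pages
  }

  return results;
};
```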
Lesson 6: The MCP Revolution
The most exciting discovery: MCP (Model Context Protocol) lets AI agents use scrapers directly.
Instead of:
- Human requests data
- Human runs scraper
- Human processes results
- Human gives to AI
Now:
- AI agent calls scraper via MCP
- AI processes results automatically
This changes everything. Scrapers aren't just for developers anymore - they're tools for AI agents.
What Actually Worked (And What Didn't)
Worked:
- Job board scrapers - High demand, structured data
- News aggregators - RSS feeds are reliable
- Developer tools (GitHub, NPM, Stack Overflow) - Great APIs
- Crypto data - Free APIs everywhere
Didn't Work:
- E-commerce - Too protected, need expensive proxies
- Social media - API changes, legal gray area
- Review sites - Heavy anti-bot (Yelp, TripAdvisor)
The Numbers (Honest)
After one week:
- 18 scrapers published
- 350+ test runs
- ~21 MAU (Monthly Active Users)
- $0 revenue (so far)
The Apify $1M Challenge requires 50 MAU by January 31st. I'm getting there!
Lesson: Building is the easy part. Distribution is everything.
What's Next
- Content marketing - This article is part of that
- GitHub awesome-lists - PRs submitted
- Community engagement - Discord, Reddit (carefully)
- Better SEO - Optimizing actor descriptions
Try My Scrapers
All 18 scrapers are free to try on Apify Store, grouped into Jobs, Developer Tools, News & Social, and Other.
Questions?
Drop a comment if you want me to dive deeper into any of these topics:
- Anti-bot bypass techniques
- Pagination patterns
- MCP integration for AI agents
- Monetizing scrapers
Building in public. Follow the journey on Twitter/X or check the portfolio.