## TL;DR

Crawlee is an open-source web scraping framework by Apify that handles browser fingerprinting, proxy rotation, and request queuing automatically. It supports HTTP, Cheerio, Playwright, and Puppeteer crawlers — all with the same unified interface.
## What Is Crawlee?

Crawlee bundles the pieces most scraping stacks assemble by hand:
- Anti-bot protection — automatic fingerprinting, headers, proxy rotation
- Multiple crawlers — HTTP, Cheerio, Playwright, Puppeteer
- Request queue — handles millions of URLs with retry logic
- Auto-scaling — adjusts concurrency based on system resources
- Session management — rotates sessions to avoid blocks
- Storage — built-in dataset and key-value storage
- Free — Apache 2.0
## Quick Start

```bash
npx crawlee create my-scraper
cd my-scraper
npm start
```
## HTTP Crawler (Fastest)

```ts
import { HttpCrawler } from "crawlee";

const crawler = new HttpCrawler({
  maxRequestsPerCrawl: 100,
  async requestHandler({ request, body, log }) {
    log.info(`Processing ${request.url}`);
    // body is the raw response body (a string for HTML pages)
    console.log(body.toString().substring(0, 200));
  },
});

await crawler.run(["https://example.com"]);
```
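Under the hood, every crawler pulls from a request queue with built-in retry logic. A minimal sketch of that loop — hypothetical names, not Crawlee's actual internals — looks like this:

```typescript
// Simplified sketch of a crawl loop with retries and a request cap.
// Failed requests are re-enqueued until maxRequestRetries is exhausted.
type CrawlRequest = { url: string; retryCount: number };

function runQueue(
  urls: string[],
  handler: (url: string) => void, // throws on failure
  maxRequestsPerCrawl: number,
  maxRequestRetries = 3,
): { handled: string[]; failed: string[] } {
  const queue: CrawlRequest[] = urls.map((url) => ({ url, retryCount: 0 }));
  const handled: string[] = [];
  const failed: string[] = [];

  while (queue.length > 0 && handled.length < maxRequestsPerCrawl) {
    const req = queue.shift()!;
    try {
      handler(req.url);
      handled.push(req.url);
    } catch {
      if (req.retryCount < maxRequestRetries) {
        // Re-enqueue with an incremented retry counter.
        queue.push({ ...req, retryCount: req.retryCount + 1 });
      } else {
        failed.push(req.url); // give up after maxRequestRetries attempts
      }
    }
  }
  return { handled, failed };
}
```

The real queue is persisted to disk (or to the Apify platform), so a crashed crawl can resume where it left off.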
## Cheerio Crawler (HTML Parsing)

```ts
import { CheerioCrawler, Dataset } from "crawlee";

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, enqueueLinks }) {
    // jQuery-like selectors via Cheerio
    const title = $("h1").text();
    const price = $(".price").text();
    const description = $(".description").text();

    // Save to the default dataset
    await Dataset.pushData({
      url: request.url,
      title,
      price,
      description,
    });

    // Follow links (de-duplicated automatically)
    await enqueueLinks({
      globs: ["https://example.com/products/*"],
    });
  },
  maxRequestsPerCrawl: 1000,
  maxConcurrency: 10,
});

await crawler.run(["https://example.com/products"]);

// Export results
const dataset = await Dataset.open();
await dataset.exportToCSV("results");
```
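The behavior behind `enqueueLinks` — glob filtering plus automatic de-duplication — can be sketched as a Set-keyed queue. This is a simplified illustration with hypothetical names, not Crawlee's real code:

```typescript
// Convert a glob like "https://example.com/products/*" into a RegExp:
// escape regex metacharacters, then turn "*" into ".*".
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  return new RegExp(`^${escaped.replace(/\*/g, ".*")}$`);
}

class SimpleRequestQueue {
  private seen = new Set<string>();
  public pending: string[] = [];

  // Returns how many URLs were actually added.
  enqueueLinks(urls: string[], globs: string[]): number {
    const patterns = globs.map(globToRegExp);
    let added = 0;
    for (const url of urls) {
      if (this.seen.has(url)) continue; // duplicate: skipped silently
      if (!patterns.some((p) => p.test(url))) continue; // glob mismatch
      this.seen.add(url);
      this.pending.push(url);
      added++;
    }
    return added;
  }
}
```

Because the queue remembers every URL it has seen, crawling a site with heavy cross-linking never processes the same page twice.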
## Playwright Crawler (JavaScript-Heavy Sites)

```ts
import { PlaywrightCrawler, Dataset } from "crawlee";

const crawler = new PlaywrightCrawler({
  launchContext: {
    launchOptions: { headless: true },
  },
  async requestHandler({ page, request, enqueueLinks }) {
    // Wait for dynamic content
    await page.waitForSelector(".product-card");

    // Extract data from the rendered page
    const products = await page.$$eval(".product-card", (cards) =>
      cards.map((card) => ({
        name: card.querySelector(".name")?.textContent?.trim(),
        price: card.querySelector(".price")?.textContent?.trim(),
        image: card.querySelector("img")?.src,
      }))
    );
    await Dataset.pushData(products);

    // Handle infinite scroll
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(2000);

    // Follow pagination
    await enqueueLinks({ selector: ".next-page" });
  },
});

await crawler.run(["https://example.com"]);
```
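A single scroll-and-wait, as above, only loads one extra batch. For true infinite scroll you keep scrolling until the page height stops growing. Here is a sketch of that loop with the page interactions abstracted into callbacks (the real Playwright version would `await page.evaluate()` between rounds; synchronous callbacks keep the logic easy to follow):

```typescript
// Scroll until the document height stops changing, or until maxRounds.
// Returns the number of rounds that loaded new content.
function scrollUntilStable(
  getHeight: () => number,      // e.g. () => document.body.scrollHeight
  scrollToBottom: () => void,   // triggers lazy loading in a real page
  maxRounds = 20,
): number {
  let lastHeight = getHeight();
  for (let round = 0; round < maxRounds; round++) {
    scrollToBottom();
    const height = getHeight();
    if (height === lastHeight) return round; // no new content appeared
    lastHeight = height;
  }
  return maxRounds;
}
```

The `maxRounds` cap matters: without it, a feed that loads content forever would never let the handler finish.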
## Proxy Rotation

```ts
import { CheerioCrawler, ProxyConfiguration } from "crawlee";

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    "http://user:pass@proxy1.com:8080",
    "http://user:pass@proxy2.com:8080",
    "http://user:pass@proxy3.com:8080",
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  sessionPoolOptions: {
    maxPoolSize: 100,
    sessionOptions: {
      maxUsageCount: 50, // retire each session after 50 uses
    },
  },
  async requestHandler({ $ }) {
    // Proxies are rotated automatically per request
  },
});
```
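The idea behind `maxUsageCount` is simple: a session (a cookie jar tied to a proxy and fingerprint) is retired after a fixed number of uses, so no single identity hits the target too often. A simplified sketch of that rotation — illustrative only, not Crawlee's actual `SessionPool`:

```typescript
// A session becomes unusable once blocked or past its usage quota.
class Session {
  usageCount = 0;
  blocked = false;
  constructor(
    public readonly id: number,
    private maxUsageCount: number,
  ) {}
  get usable(): boolean {
    return !this.blocked && this.usageCount < this.maxUsageCount;
  }
}

class SessionPool {
  private sessions: Session[] = [];
  private nextId = 0;
  constructor(
    private maxPoolSize: number,
    private maxUsageCount: number,
  ) {}

  getSession(): Session {
    // Drop retired sessions, then reuse an existing one or mint a new one.
    this.sessions = this.sessions.filter((s) => s.usable);
    let session = this.sessions[0];
    if (!session) {
      if (this.sessions.length >= this.maxPoolSize) {
        throw new Error("session pool exhausted");
      }
      session = new Session(this.nextId++, this.maxUsageCount);
      this.sessions.push(session);
    }
    session.usageCount++;
    return session;
  }
}
```

Marking a session `blocked` on a 403 response has the same effect as exhausting its quota: the next request gets a fresh identity.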
## Crawlee vs Alternatives

| Feature | Crawlee | Scrapy | Puppeteer | Selenium |
|---|---|---|---|---|
| Language | TypeScript/JavaScript | Python | JavaScript | Multi-language |
| Anti-bot | Built-in | Manual | Manual | Manual |
| Browser support | Playwright + Puppeteer | Via plugins (e.g. Splash) | Chrome/Chromium | Multi-browser |
| Request queue | Built-in | Built-in | Manual | Manual |
| Auto-scaling | Yes | AutoThrottle extension | Manual | Manual |
| Session rotation | Built-in | Manual | Manual | Manual |
| Dataset storage | Built-in | Built-in | Manual | Manual |
## Deploy to Apify

```bash
# Deploy your Crawlee scraper to the Apify cloud
npx apify-cli push

# Run it in the cloud via the API
curl -X POST https://api.apify.com/v2/acts/YOUR_ACTOR/runs \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"startUrls": [{"url": "https://example.com"}]}'
```
## Resources
- Crawlee Documentation
- GitHub Repository — 16K+ stars
- Examples
- Apify Platform — run Crawlee in the cloud
Ready to scrape at scale? Check out my production scrapers on Apify — built with Crawlee for reliable, scalable data extraction. Custom scraping solutions? Email spinov001@gmail.com