<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Apify</title>
    <description>The latest articles on DEV Community by Apify (@apify).</description>
    <link>https://dev.to/apify</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2171%2F8c96d506-957a-4ad7-8e96-f083077b4d3f.png</url>
      <title>DEV Community: Apify</title>
      <link>https://dev.to/apify</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/apify"/>
    <language>en</language>
    <item>
      <title>Firecrawl vs. Apify: 2025 guide for AI and data teams</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Mon, 11 Aug 2025 08:42:57 +0000</pubDate>
      <link>https://dev.to/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</link>
      <guid>https://dev.to/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed comparison of Firecrawl's unified AI-driven scraping and Apify's comprehensive, flexible ecosystem. We explain what each does best&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Fresh, structured &lt;a href="https://apify.com/use-cases/data-for-ai-agents" rel="noopener noreferrer"&gt;web data is the fuel for AI&lt;/a&gt;, including agents, RAG pipelines, competitive‑intelligence dashboards, and change‑monitoring services. Two platforms dominate that space in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firecrawl&lt;/strong&gt; – an API‑first crawler that turns any URL into LLM‑ready Markdown/JSON in seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apify&lt;/strong&gt; – a full‑stack scraping platform with thousands of reusable data collection tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll go deep into the differences and what they do best, so you can choose (or combine) wisely.&lt;/p&gt;

&lt;p&gt;Let's kick off with a table of each platform’s features, benefits, and trade‑offs so you can see where each one excels.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core value prop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, semantic scraping API geared for AI&lt;/td&gt;
&lt;td&gt;End-to-end scraping platform (6,000+ data collection tools, proxies, compliance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stand‑out features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-warmed browsers&lt;br&gt;NL extraction&lt;br&gt;AGPL OSS &amp;amp; self-host&lt;br&gt;Stealth proxies&lt;/td&gt;
&lt;td&gt;Scraper marketplace&lt;br&gt;JS/TS &amp;amp; Py SDKs&lt;br&gt;Global proxy pool + CAPTCHA&lt;br&gt;Cron-scheduler, retries, webhooks&lt;br&gt;SOC 2 Type II and GDPR compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary benefits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-second latency on cached pages&lt;br&gt;Prompt-based (no selectors)&lt;br&gt;Predictable credit pricing&lt;/td&gt;
&lt;td&gt;Fine‑grained session &amp;amp; proxy control&lt;br&gt;No‑code operation for analysts; devs extend via code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✔ Fast single‑page fetches&lt;br&gt;✔ Self‑host to avoid vendor lock‑in&lt;br&gt;✔ Low entry cost (free + $16 Hobby tier)&lt;/td&gt;
&lt;td&gt;✔ Breadth: 6,000 off-the-shelf scrapers&lt;br&gt;✔ Effective anti‑blocking technology&lt;br&gt;✔ Monetize your own scrapers; earn rev‑share&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✖ Credits can disappear fast on large crawls&lt;br&gt;✖ Limited built-in scheduling&lt;br&gt;✖ AGPL copyleft for forks&lt;/td&gt;
&lt;td&gt;✖ Actors / CU concepts add a learning curve&lt;br&gt;✖ Consumption costs can spike with inefficient code&lt;br&gt;✖ Cold-start ≈1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get data for AI with Apify&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl uses a simple credit-based model: 1 page = 1 credit (under standard conditions). That makes it very easy to predict costs. The free tier lets you scrape up to 500 pages before committing financially (no credit card required). Paid plans range from $16 to $333 per month on standard pricing.&lt;/p&gt;

&lt;p&gt;Extraction tasks that go beyond simple scraping consume additional credits or tokens, and Firecrawl offers dedicated “Extract” plans ranging from $89 to $719/month, depending on token volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" alt="Firecrawl pricing" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify combines a subscription and consumption model. You get a base amount of platform credit with each pricing plan, but the consumption rate depends on the resources used. 1 compute unit = 1 gigabyte-hour of RAM.&lt;/p&gt;

&lt;p&gt;Costs are harder to predict this way because scraping a JavaScript-rendered site that needs browser automation consumes far more compute units than a simple HTML scraper.&lt;/p&gt;

&lt;p&gt;That’s why Apify has introduced pay-per-event pricing. Scrapers with this pricing model charge for specific actions rather than just results. For example, &lt;strong&gt;a scraper that charges $5 per run start and $2 per 1,000 results would cost $15 for 5,000 results&lt;/strong&gt;.  This can make large-scale scraping jobs cheaper in the long run.&lt;/p&gt;
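&lt;p&gt;To make the arithmetic explicit, here is that example as a quick calculation (the prices are the illustrative figures from the sentence above, not any specific Actor's rates):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Pay-per-event example: $5 per run start + $2 per 1,000 results
const runStartUsd = 5;
const perThousandResultsUsd = 2;
const results = 5000;

const totalUsd = runStartUsd + (results / 1000) * perThousandResultsUsd;
console.log(totalUsd); // 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;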

&lt;p&gt;Apify's forever-free tier gives you $5 of credit that renews automatically every month, so you can test any scraper on Apify Store without financial commitment (no credit card required). Paid plans range from $39 to $999 per month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" alt="Apify pricing" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl and Apify pricing compared&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl (flat credits)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify (pre-paid credit + CU)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hobby: $16 ✅&lt;/td&gt;
&lt;td&gt;Starter: $39 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard: $83 ✅&lt;/td&gt;
&lt;td&gt;Scale: $199 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growth: $333 or Enterprise&lt;/td&gt;
&lt;td&gt;Business: $999 or Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 500k pages/month? &lt;strong&gt;Firecrawl&lt;/strong&gt; is usually cheaper.&lt;/td&gt;
&lt;td&gt;Millions of lightweight pages or heavy anti‑bot workflows? A well‑optimized &lt;strong&gt;Apify&lt;/strong&gt; scraper can win on total cost.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl: Unified AI-driven scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" alt="Firecrawl - Unified AI-driven scraping" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Firecrawl offers a single, consistent API that handles scraping, crawling, and AI‑driven site navigation, so developers never have to juggle multiple endpoints or bespoke parameters.&lt;/p&gt;

&lt;p&gt;When you request a page, the service decides on‑the‑fly whether it needs a headless browser, waits for all dynamic elements to render, and then applies extraction models that automatically ignore ads, menus, and other noise.&lt;/p&gt;

&lt;p&gt;Instead of writing brittle CSS or XPath rules, you ask for the data in plain language — “product prices and availability,” for example — and Firecrawl returns a clean JSON block that stays stable even when the site’s markup changes.&lt;/p&gt;
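&lt;p&gt;As a rough illustration, a natural-language extraction request could look something like the sketch below. Treat the endpoint path and request fields as assumptions based on Firecrawl's public v1 API and check the official docs for the exact schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Illustrative sketch only: endpoint and body fields may differ from the current Firecrawl API
const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
  },
  body: JSON.stringify({
    url: "https://example.com/product",
    formats: ["extract"],
    extract: { prompt: "product prices and availability" }, // plain language, no selectors
  }),
});

const { data } = await res.json();
console.log(data.extract); // clean JSON that stays stable when the markup changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;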

&lt;p&gt;The same intelligence powers the crawler: it goes through internal links without sitemaps, skips duplicate content, and can infer which pages matter most from their position in the site hierarchy and linking patterns, all while respecting any boundaries you set.&lt;/p&gt;

&lt;p&gt;For sites that hide information behind clicks, forms, or paginated views, you can enable the FIRE‑1 agent. It mimics human behavior by clicking “Load More,” filling search fields, and even solving simple CAPTCHAs, eliminating special‑case code.&lt;/p&gt;

&lt;p&gt;Performance optimizations run throughout the platform. Recently scraped pages come from cache in milliseconds, hundreds of URLs can be batched in a single call for parallel processing, and converting HTML to lightweight Markdown cuts the token count for downstream LLMs by roughly two‑thirds. The net effect is faster, lower‑maintenance data collection that remains reliable as websites evolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify: A comprehensive, flexible ecosystem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" alt="Apify: A comprehensive, flexible ecosystem, not just a web scraping API" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apify approaches web scraping as an ecosystem problem rather than a single‑tool exercise. Its foundation is the Actor system — self‑contained programs that run in Apify’s cloud with uniform input/output, shared storage, and common scheduling and monitoring. Because every Actor behaves the same way, you can link them into multistep workflows just by passing data from one to the next.&lt;/p&gt;

&lt;p&gt;This standardization fuels &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, a marketplace of more than 6,000 pre‑built scrapers and automation tools maintained by domain specialists. If you need Amazon product data, Instagram follower stats, or a one‑off government registry crawl, chances are an Actor already exists and is kept up to date as site layouts change, so you rarely start from a blank page.&lt;/p&gt;
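&lt;p&gt;Running one of those Store Actors programmatically takes only a few lines with the &lt;code&gt;apify-client&lt;/code&gt; package. A minimal sketch (the Actor name and input fields here are illustrative, not a recommendation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start an Actor from Apify Store and wait for the run to finish
const run = await client.actor("apify/website-content-crawler").call({
  startUrls: [{ url: "https://example.com" }],
});

// Read the results the Actor pushed to its default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} pages`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;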

&lt;p&gt;When an off‑the‑shelf Actor doesn’t fit, you can build your own with the Apify SDK (also released as the open‑source &lt;a href="https://crawlee.dev/?__hstc=160404322.3b3599158ad751038b19c751c187f023.1696600257709.1753948779266.1753948924434.739&amp;amp;__hssc=160404322.1.1753948924434&amp;amp;__hsfp=3246212229" rel="noopener noreferrer"&gt;Crawlee library&lt;/a&gt;). The SDK offers high‑level helpers — request queues, automatic retries, error handling, parallel processing — while still letting you drop down to raw Puppeteer, Playwright, or HTTP calls when necessary. It supports both JavaScript/TypeScript and Python, making it &lt;strong&gt;easy to slot into existing codebases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Infrastructure management is largely hands‑off. Apify autoscales compute instances, rotates datacenter or residential proxies by geography, and applies anti‑detection tactics such as browser‑fingerprint randomization, human‑like delays, and outsourced CAPTCHA solving. Enterprise‑grade features cover the operational side: detailed run statistics, alerting, conditional scheduling (e.g., “scrape 100 sites at 06:00 only if yesterday’s run succeeded”), real‑time webhooks for downstream systems, and configurable result retention to satisfy compliance audits or historical analyses.&lt;/p&gt;

&lt;p&gt;In practice, the platform lets teams mix and match ready‑made Actors, custom code, and reliable infrastructure to &lt;strong&gt;solve diverse scraping tasks without rebuilding each time&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Technical feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single base URL with multiple HTTP/JSON endpoints under one uniform API&lt;/td&gt;
&lt;td&gt;Each Actor exposes a standard REST interface: you POST to the “run” endpoint with a JSON input and then GET results from key‑value stores or datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic‑content handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatically spins up a headless browser when JS rendering is detected, with no extra configuration&lt;/td&gt;
&lt;td&gt;You choose per Actor: Puppeteer, Playwright, Cheerio, raw HTTP, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extraction approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Zero‑selector” extraction: ML/NLP models parse pages into JSON or Markdown out‑of‑the‑box&lt;/td&gt;
&lt;td&gt;Code‑based extraction inside each Actor using Crawlee's page handlers or raw selectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow composition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In a single request you can switch between scrape, crawl, or the FIRE‑1 “agent” mode using flags&lt;/td&gt;
&lt;td&gt;Chain Actors/tasks asynchronously using shared storage, datasets, or webhooks to build multistep pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser automation / navigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FIRE‑1 agent clicks buttons, paginates, fills forms, solves simple CAPTCHAs&lt;/td&gt;
&lt;td&gt;Automation coded per Actor; CAPTCHA solving baked in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti‑detection &amp;amp; proxy support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed proxies, rotating sessions, fingerprint spoofing and geotargeting are all built‑in&lt;/td&gt;
&lt;td&gt;Apify Proxy (datacenter &amp;amp; residential) with session rotation, geo‑targeting and spoofing; you enable it per Actor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching &amp;amp; batching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global intelligent cache for recently fetched pages; supports batching hundreds of URLs in one API call&lt;/td&gt;
&lt;td&gt;Request queues with autoscaled parallelism; per‑Actor caching logic is up to you (e.g. dataset dedupe, custom cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecrawl auto‑scales browser instances for concurrent requests&lt;/td&gt;
&lt;td&gt;Platform scales Actors and browser instances elastically according to queue depth and configured concurrency limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring &amp;amp; scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics and scheduling built into the single API dashboard&lt;/td&gt;
&lt;td&gt;Per‑Actor run metrics, alerts, conditional scheduling, and webhook triggers via Apify Console&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data format optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts HTML → Markdown to cut LLM tokens by ≈ 67%&lt;/td&gt;
&lt;td&gt;Returns whatever the Actor emits (HTML, JSON, CSV, etc.); no built‑in token reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language / SDK support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any language that can call HTTP/JSON (official clients in Python, Node, Go, Rust, C#, etc.)&lt;/td&gt;
&lt;td&gt;Official SDKs in JavaScript/TypeScript and Python (Apify SDK / Crawlee); other languages via raw REST calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How each platform scales
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl scales like a classic SaaS API. Your subscription tier defines a fixed pool of headless‑browser workers, and the service queues requests behind those workers. Because the queueing and resource allocation are handled centrally, you get stable latency and never touch infrastructure. Even the mid‑tier plans can move thousands of pages per minute thanks to caching and smart routing, so the raw worker counts rarely become a choke point.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify takes a looser, cloud‑native approach. You launch as many Actors as your budget allows, and the platform spins up containers to run them in parallel. That “elastic infinity” is perfect for bursty jobs — say, crawling an entire retail catalog overnight or tracking tens of thousands of social‑media accounts — because you can flood the platform with work and pay only for the compute you burn. Automatic retries and detailed run logs keep large Actor swarms reliable.&lt;/p&gt;

&lt;p&gt;The marketplace magnifies &lt;strong&gt;Apify’s scale advantage&lt;/strong&gt;: if your project hits 50 different sites, chances are someone has already published an Actor for most of them. You trade weeks of scraper authoring for coordination logic — managing many Actors’ versions, schedules, and output formats — but you &lt;strong&gt;gain speed and capacity on day one&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integrating Firecrawl into AI pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl treats integration the same way it treats scraping: make the routine path effortless and keep the escape hatches open.&lt;/p&gt;

&lt;p&gt;Official SDKs for Python, JavaScript, Rust, and Go expose idiomatic methods, but the real strength lies in direct hooks to AI tooling. A native LangChain loader, for example, can be wired up in just a few lines; it fetches pages, paginates automatically, preserves attribution metadata, and delivers chunks that are ready for embeddings or other RAG workflows. LlamaIndex receives similar first‑class support with retrievers that fetch, de‑duplicate, and format content for chatbots or summarization agents while keeping token counts under control.&lt;/p&gt;
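&lt;p&gt;As a sketch of how little wiring that takes (the import path and options follow LangChain's community loader for Firecrawl as we understand it; verify them against the current LangChain JS docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://example.com/blog",       // page or site to ingest
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "crawl",                         // "scrape" for a single page
});

// Documents arrive with attribution metadata, ready for chunking and embedding
const docs = await loader.load();
console.log(docs[0].metadata);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;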

&lt;p&gt;Outside pure code, Firecrawl blocks appear in Make.com and Zapier, so non‑developers can drag‑and‑drop flows — say, watch a competitor’s site, extract new product data in plain English, and update a shared spreadsheet — without touching a script. The result is a scraper that plugs into AI stacks and no‑code tools with equal ease.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify integrations for production workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify’s connectors reflect a more mature, enterprise‑centric agenda: it's designed to sit inside CI/CD pipelines and data‑engineering stacks, not just trigger one‑off jobs.&lt;/p&gt;

&lt;p&gt;The Zapier app goes beyond “run an Actor” by adding triggers for job completion, built‑in data filters, and error‑handling branches, so a sales team can scrape LinkedIn, enrich leads with a second Actor, and post qualified prospects to a CRM in a single Zap.&lt;/p&gt;

&lt;p&gt;The GitHub integration lets teams treat scraper code like any other service — commit, test, review, and deploy via GitHub Actions — bringing familiar DevOps discipline to crawling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" alt="LangChain and LlamaIndex integration - Apify" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AI tooling, Apify has loaders for LangChain and LlamaIndex, plus integrations with Hugging Face and Haystack, and with vector databases such as Pinecone and Qdrant.&lt;/p&gt;

&lt;p&gt;For data pipelines, Apify ships connectors that push results straight into S3, GCS, or Azure Blob, or stream them through webhooks to Kafka or Pub/Sub, &lt;strong&gt;turning the platform into a managed data‑ingestion layer&lt;/strong&gt; that feeds downstream analytics with no extra glue code.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Integration feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Official SDKs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python, JavaScript/TypeScript (official), Rust, Go (community)&lt;/td&gt;
&lt;td&gt;JavaScript/TypeScript &amp;amp; Python (via Apify SDK / Crawlee)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI‑framework hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native loaders/retrievers for LangChain and LlamaIndex (handles pagination, chunking, metadata, token‑cost control)&lt;/td&gt;
&lt;td&gt;Official loaders for LangChain and LlamaIndex; individual Actors may embed extra AI logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No‑code / automation tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make and Zapier blocks for drag‑and‑drop flows&lt;/td&gt;
&lt;td&gt;Zapier app with branching, filters, and error handling for complex automations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps / CI · CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;GitHub/Bitbucket integration for version control, tests, and CI‑based deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data‑pipeline connectors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;– (pull via API or SDK)&lt;/td&gt;
&lt;td&gt;Export Actor results to S3, GCS, Azure; real‑time streaming to Kafka/Pub/Sub/webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhook support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crawl/scrape lifecycle webhooks for async callbacks&lt;/td&gt;
&lt;td&gt;Webhooks on Actor events; plus metamorph &amp;amp; transform steps for streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token‑usage optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTML→Markdown plus loader‑level chunk sizing &amp;amp; token‑budget helpers&lt;/td&gt;
&lt;td&gt;Up to individual Actor / downstream loader; no platform‑wide token controls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Picking the right platform for your workload
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Firecrawl is the better fit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose Firecrawl when &lt;strong&gt;low‑latency&lt;/strong&gt; access to web data and &lt;strong&gt;tight coupling with AI pipelines&lt;/strong&gt; are top priorities. Its uniform API, natural‑language extraction, and sub‑second response times let you build chatbots, RAG systems, or research agents without wrestling with scraping logic. Pricing is credit‑based and predictable. If you know you’ll process about 50k pages a month, you can budget the spend to the dollar and count on the same performance every day. The open‑source core offers a self‑host option for teams with data‑residency mandates or those who simply want an exit ramp from the hosted service.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Apify delivers more value&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify is the best option when &lt;strong&gt;breadth&lt;/strong&gt;, &lt;strong&gt;flexibility&lt;/strong&gt;, and &lt;strong&gt;enterprise&lt;/strong&gt; guarantees matter. Its marketplace of 6,000‑plus maintained Actors means you can &lt;strong&gt;cover dozens of sites in hours&lt;/strong&gt; instead of building each scraper yourself. Non‑developers can launch and schedule those Actors through a web UI, while engineering teams still have full SDK control when needed. SOC 2 compliance, GDPR alignment, and a history of &lt;strong&gt;large‑scale deployments&lt;/strong&gt; satisfy procurement checklists, and the platform’s elastic Actor model handles everything from tiny HTML scrapes to overnight catalog crawls without manual capacity planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get started with Apify&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>5 best JavaScript web scraping libraries in 2025</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 18:30:00 +0000</pubDate>
      <link>https://dev.to/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</link>
      <guid>https://dev.to/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</guid>
      <description>&lt;h3&gt;
  
  
  Even if you're not a JavaScript developer, it's worth knowing these libraries for parsing HTML, interacting with pages, and dealing with dynamic content.
&lt;/h3&gt;

&lt;p&gt;Websites are becoming increasingly complex and dynamic. The modern web is full of JavaScript-rendered apps that load content asynchronously, use auth systems involving multiple steps and JavaScript-based token handling, and block scraping bots.&lt;/p&gt;

&lt;p&gt;That's why JavaScript is still a great choice for collecting web data in 2025. But if you're a developer new to web scraping or unfamiliar with the JavaScript language, you're probably wondering which libraries and frameworks you should try.&lt;/p&gt;

&lt;p&gt;At Apify, we've been scraping the web with JavaScript and Node.js for a decade. This selection of 5 libraries is informed by our experience of using them for data extraction, from parsing HTML to navigating web pages and scraping dynamic content.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Crawlee
&lt;/h2&gt;

&lt;p&gt;Tackling complexity, stealth, and scalability in one package&lt;/p&gt;

&lt;p&gt;Juggling multiple libraries for requests, parsing, browser automation, and crawling logic quickly becomes a maintenance headache. You end up writing glue code to handle queues, rotate proxies, and merge results, only to find you still get blocked or stranded when scale increases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/js?__hstc=160404322.210f49410a426b754a1e9d8c7a300289.1753096943609.1753284250052.1753339598956.12&amp;amp;__hssc=160404322.11.1753339598956&amp;amp;__hsfp=2557066438" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, developed by the Apify team, unifies everything under a single interface. Out of the box, it mimics real browsers (headers, TLS fingerprints, and even stealth plugins) so you avoid common anti-bot defenses without manual header or fingerprint tweaking. Instead of wiring together Cheerio + Playwright/Puppeteer + queue managers, Crawlee provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Switchable crawler classes&lt;/strong&gt;: &lt;code&gt;CheerioCrawler&lt;/code&gt; for static HTML, &lt;code&gt;PlaywrightCrawler&lt;/code&gt; or &lt;code&gt;PuppeteerCrawler&lt;/code&gt; for dynamic pages, all sharing a common configuration style.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in queue management&lt;/strong&gt;: Breadth-first or depth-first crawling with concurrency settings, retry logic, and automatic backoff. You define start URLs; Crawlee handles enqueuing, prioritization, and scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic proxy rotation and session handling&lt;/strong&gt;: Effectively rotate proxies or manage cookies and browser contexts, so you stay under rate limits and maintain logins across multiple pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable data storages&lt;/strong&gt;: Datasets (JSON, CSV, or key-value stores) appear in a local “datasets” directory, making it trivial to persist results or resume failed crawls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle hooks and customizability&lt;/strong&gt;: Logging, error handling, and custom request handlers via routers, so you can insert your own logic at enqueue, request success, or failure without rewriting core code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native integration with the Apify platform&lt;/strong&gt;: Once your crawler is ready, running &lt;code&gt;apify push&lt;/code&gt; deploys it, and Apify handles autoscaling, proxy billing, and data exports. No extra configuration needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starter templates and file structure&lt;/strong&gt;: When you run &lt;code&gt;npx crawlee create my-crawler&lt;/code&gt;, you get a &lt;code&gt;main.js&lt;/code&gt; and &lt;code&gt;routes.js&lt;/code&gt; setup. Boilerplate code means you can focus on selectors rather than instantiating browser instances, setting headers, or wiring queues. The default file structure looks like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/my-crawler
├── main.js      &lt;span class="c"&gt;# entry point: initializes crawler class and starts run()&lt;/span&gt;
├── routes.js    &lt;span class="c"&gt;# defines request handlers via createCheerioRouter/createPlaywrightRouter&lt;/span&gt;
├── storages /
       datasets/    &lt;span class="c"&gt;# where results are stored as JSON files per page&lt;/span&gt;
├──    key-value-stores/  &lt;span class="c"&gt;# storage for arbitrary binary data (images, videos, JSON files…)&lt;/span&gt;
└── package.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Cheerio crawler)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// routes.js&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;createCheerioRouter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createCheerioRouter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addDefaultHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`enqueueing new URLs`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Finds “next” pages and enqueues them&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/?p=*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Extract post URL, title, rank&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;postUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Push to dataset for automatic file output&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
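&lt;p&gt;The &lt;code&gt;routes.js&lt;/code&gt; above is only half the picture: &lt;code&gt;main.js&lt;/code&gt; picks the crawler class and kicks off the run. A minimal sketch of that entry point, using the same start URL as the handler above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// main.js
import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const crawler = new CheerioCrawler({
    requestHandler: router,   // the router defined in routes.js
    maxRequestsPerCrawl: 50,  // safety cap while developing
});

await crawler.run(["https://news.ycombinator.com/"]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;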



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; No more separate proxy rotation libraries, queue managers, or manual header generators. Crawlee handles it all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt;: You can start locally and then deploy to the Apify platform, where it auto‐scales, monitors memory/CPU, and logs failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt;: Switching from CheerioCrawler to PlaywrightCrawler only requires changing one import and maybe tweaking selectors. The core logic stays the same.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Impit
&lt;/h2&gt;

&lt;p&gt;Making browser impersonation simple&lt;/p&gt;

&lt;p&gt;Sending vanilla HTTP requests often gets you blocked by modern anti-scraping systems. You might spend hours rotating user-agents, randomizing delays, or solving CAPTCHAs manually, only to still find your IP banned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apify/impit" rel="noopener noreferrer"&gt;Impit&lt;/a&gt; is an HTTP client for Node.js and Python, based on Rust’s &lt;code&gt;reqwest&lt;/code&gt;, specifically tailored for scraping. Instead of wrestling with header spoofing or TLS fingerprinting yourself, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fingerprint spoofing:&lt;/strong&gt; Pick from a library of existing browser fingerprints, and impit builds a full set of realistic HTTP headers and matching TLS settings. This makes your requests indistinguishable from browser requests and reduces detection risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated &lt;code&gt;tough-cookie&lt;/code&gt; support&lt;/strong&gt;: Handle session cookies out of the box, so you can maintain login sessions or track redirects using the most popular JS cookie library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fetch&lt;/code&gt; API:&lt;/strong&gt; Impit implements a subset of the well-known &lt;code&gt;fetch&lt;/code&gt; API (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API" rel="noopener noreferrer"&gt;MDN&lt;/a&gt;), so you can write your scrapers without having to read lengthy docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy integration&lt;/strong&gt;: Support for HTTP and HTTPS proxies via a single option, so you can rotate IPs with minimal code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (impersonating Firefox)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Impit&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;impit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Impit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;http3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// raw HTML&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stealth:&lt;/strong&gt; You no longer manually assemble user-agent strings or randomize headers; impit covers 95% of common anti-bot checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling:&lt;/strong&gt; Configurable retries and timeouts mean fewer surprises when a request fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Cheerio
&lt;/h2&gt;

&lt;p&gt;Converting unruly HTML into structured data&lt;/p&gt;

&lt;p&gt;Plain HTML is cluttered: nested tags, inconsistent class names, and no programmatic way to navigate the DOM on the server. If you’ve written custom regex or string-based parsers, you know how brittle that can be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;Cheerio&lt;/a&gt; loads raw HTML into a fast, jQuery-like API on the server. You can query for elements, attributes, and text using familiar CSS selectors, then extract exactly what you need without worrying about manual string manipulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (parsing Hacker News)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;gotScraping&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;got-scraping&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;gotScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Robustness:&lt;/strong&gt; No more fragile regex. With Cheerio, you use &lt;code&gt;.find()&lt;/code&gt;, &lt;code&gt;.text()&lt;/code&gt;, and &lt;code&gt;.attr()&lt;/code&gt; just like jQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Cheerio is lightweight and blazingly fast, so parsing large HTML documents doesn’t become your scraper’s bottleneck, especially compared to full headless browsers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar syntax:&lt;/strong&gt; If you’ve used jQuery on the front end, there’s almost zero onboarding time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Playwright
&lt;/h2&gt;

&lt;p&gt;Handling JavaScript‐driven, dynamic content reliably&lt;/p&gt;

&lt;p&gt;Many modern websites rely on client-side JavaScript to populate the DOM, for lazy loading, infinite scrolling, or data fetched via XHR/AJAX. Cheerio has no power here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;, on the other hand, spins up a real browser (Chromium, Firefox, or WebKit), navigates pages as a human would, waits for selectors or network to idle, and then gives you a fully rendered DOM snapshot. You can even intercept requests to block ads or unwanted resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Amazon product page)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#productTitle&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;span.author a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;kindlePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#formats span.ebook-price-value&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;paperbackPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-PAPERBACK .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;hardcoverPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-HARDCOVER .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; If the data isn’t in the initial HTML, you need a browser to run the page’s JS. Playwright ensures you get exactly what a real user sees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible waits:&lt;/strong&gt; You can &lt;code&gt;await page.waitForSelector()&lt;/code&gt; or pass &lt;code&gt;waitUntil: "networkidle"&lt;/code&gt; to &lt;code&gt;page.goto()&lt;/code&gt; so you only scrape once all resources load, reducing flaky results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercepting resources:&lt;/strong&gt; Block images, CSS, or analytics endpoints to speed up scrapes and reduce noise in your logs (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
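&lt;p&gt;A minimal sketch of that last point, blocking heavyweight resources with Playwright's request interception (the glob pattern is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Abort requests for images, fonts, and stylesheets before they are fetched
await page.route("**/*.{png,jpg,jpeg,svg,woff,woff2,css}", (route) =&amp;gt; route.abort());

// Everything else proceeds normally
await page.goto("https://news.ycombinator.com/", { waitUntil: "networkidle" });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;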

&lt;h2&gt;
  
  
  5. Puppeteer
&lt;/h2&gt;

&lt;p&gt;A Chrome-centric approach to browser automation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pptr.dev/" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; and Playwright are pretty much the same thing, except for some minor differences in API, and unlike Playwright, it's limited to JavaScript and Node.js. Puppeteer is older, but only recently did Firefox &lt;a href="https://hacks.mozilla.org/2024/08/puppeteer-support-for-firefox/" rel="noopener noreferrer"&gt;add official support&lt;/a&gt; for Puppeteer.&lt;/p&gt;

&lt;p&gt;Switching to Playwright isn't difficult, but if you prefer Chromium’s engine and a Node.js-only tool, Puppeteer remains a good option.&lt;/p&gt;

&lt;p&gt;Puppeteer gives you a headless (or headed) Chrome instance with an easy API for navigation, selection, and evaluation. It supports intercepting requests, generating PDFs, and capturing screenshots. While it doesn’t include the same cross-browser support as Playwright, it’s been around for longer and integrates well with Chrome DevTools Protocol.&lt;/p&gt;
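&lt;p&gt;Those extras are essentially one-liners. For instance, a quick sketch of exporting an already-open page as a PDF and a full-page screenshot (file paths are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// With an open `page`, save it as a PDF and a full-page screenshot
await page.pdf({ path: "hn.pdf", format: "A4" });
await page.screenshot({ path: "hn.png", fullPage: true });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;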

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (basic Puppeteer scraper)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;networkidle2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing Puppeteer code:&lt;/strong&gt; Migrate incrementally or reuse libraries that depend on Puppeteer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome-only features:&lt;/strong&gt; Use DevTools Protocol to capture screenshots, trace performance, or emulate network conditions without additional dependencies (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight automation needs:&lt;/strong&gt; If you only need a headless Chrome for a few pages and already have an IP rotation or session management solution, Puppeteer might be the simplest choice.&lt;/li&gt;
&lt;/ul&gt;
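
&lt;p&gt;As a rough illustration of those DevTools-powered capabilities, the sketch below throttles the network and captures a full-page screenshot. It assumes a recent Puppeteer version where &lt;code&gt;page.createCDPSession()&lt;/code&gt; is available; the URL and throughput numbers are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from "puppeteer";

async function auditPage() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Talk to the Chrome DevTools Protocol directly to emulate a slow connection
  const client = await page.createCDPSession();
  await client.send("Network.emulateNetworkConditions", {
    offline: false,
    latency: 200, // extra round-trip latency in ms
    downloadThroughput: (750 * 1024) / 8, // ~750 kbps
    uploadThroughput: (250 * 1024) / 8, // ~250 kbps
  });

  await page.goto("https://example.com/", { waitUntil: "networkidle2" });

  // Capture the fully rendered page as an image
  await page.screenshot({ path: "page.png", fullPage: true });
  await browser.close();
}
auditPage();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
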

&lt;h2&gt;
  
  
  In summary
&lt;/h2&gt;

&lt;p&gt;JavaScript remains a top choice for web scraping in 2025, thanks to its solid ecosystem of open-source libraries, which make it easier to parse HTML, interact with web pages, and deal with dynamic content. In our opinion, these five are the best:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawlee&lt;/strong&gt; - A comprehensive, all-in-one scraping framework that handles browser automation, proxy rotation, session management, queuing, and data storage. It simplifies scaling and maintenance by unifying multiple tools under one interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impit&lt;/strong&gt; - A stealthy HTTP client tailored for scraping, with automatic realistic header generation, cookie jar support, and proxy integration - ideal for scraping without a full browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheerio&lt;/strong&gt; - A fast and lightweight HTML parser that mimics jQuery, perfect for extracting structured data from static HTML without using a browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; - A full browser automation library for scraping JavaScript-rendered sites. It supports multiple browsers, waits for content to load, and intercepts resources, making it highly reliable for dynamic pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Puppeteer&lt;/strong&gt; - A headless Chrome automation tool with strong DevTools support. It’s suitable for existing Puppeteer codebases or lightweight scraping needs focused on Chromium.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you’re parsing static HTML or navigating complex, JavaScript-rendered pages, this toolkit helps you choose and combine the best options for performance, stealth, and scalability.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Build and deploy MCP servers in minutes with a TypeScript template</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 07:33:07 +0000</pubDate>
      <link>https://dev.to/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</link>
      <guid>https://dev.to/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</guid>
      <description>&lt;h3&gt;
  
  
  Transform any stdio MCP server into a scalable, cloud-hosted service.
&lt;/h3&gt;

&lt;p&gt;Model Context Protocol (&lt;a href="https://blog.apify.com/what-is-model-context-protocol/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;) is transforming and simplifying how AI applications connect with external tools. While &lt;a href="https://blog.apify.com/how-to-use-mcp/" rel="noopener noreferrer"&gt;we’ve covered how to use MCP with tools that give agents context from the web&lt;/a&gt;, this guide digs deeper into the developer side: how to build and deploy your own MCP servers on the Apify platform.&lt;/p&gt;

&lt;p&gt;With Apify's MCP templates, you can transform any stdio or remote MCP server into a scalable, cloud-hosted service in minutes. There are currently two templates available: &lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;for Python&lt;/a&gt; and &lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;for TypeScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll show you how to build an MCP server on Apify with TypeScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why deploy MCP servers on Apify?
&lt;/h2&gt;

&lt;p&gt;Before we get into implementation, here’s why the Apify platform is ideal for hosting MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1. Instant scalability&lt;/strong&gt;: Apify's infrastructure automatically scales based on demand, from a single request to thousands of concurrent connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2. Built-in monetization&lt;/strong&gt;: With the &lt;a href="https://help.apify.com/en/articles/10700066-what-is-pay-per-event" rel="noopener noreferrer"&gt;pay-per-event&lt;/a&gt; (PPE) model, you can charge users for each tool request, API call, or custom event, turning your MCP server into a revenue stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3. Persistent URLs&lt;/strong&gt;: &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/standby" rel="noopener noreferrer"&gt;Standby mode&lt;/a&gt; provides stable endpoints like &lt;code&gt;https://your-username--your-mcp-server.apify.actor/sse&lt;/code&gt;, perfect for MCP client configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4. Zero infrastructure management&lt;/strong&gt;: No servers to maintain, no Docker orchestration, no SSL certificates. Just deploy and run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the architecture
&lt;/h2&gt;

&lt;p&gt;When you deploy an MCP server on Apify, you can work with two types of MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio MCP servers&lt;/strong&gt;: Local servers that communicate via standard input/output, which Apify converts to SSE (Server-Sent Events) for remote access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE MCP servers&lt;/strong&gt;: Remote servers that already communicate via HTTP/SSE, which Apify can proxy and enhance with monetization features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexible architecture allows you to turn any type of MCP server into an &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt;, whether it's a local stdio-based tool or a remote SSE endpoint, and expose it through a unified SSE interface with built-in scaling and monetization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actors are lightweight, containerized programs that take JSON inputs, execute tasks, and return structured outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step-by-step implementation
&lt;/h2&gt;

&lt;p&gt;Let’s walk through building an MCP server on Apify using the TypeScript template.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Choose and create your Actor from the template&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the TypeScript MCP server from template
apify create my-mcp-server --template ts-mcp-server
cd my-mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Configure your MCP server&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;src/main.ts&lt;/code&gt; and set the &lt;code&gt;MCP_COMMAND&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// For stdio servers:&lt;/span&gt;
&lt;span class="c1"&gt;// Example: Everything MCP server&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx @modelcontextprotocol/server-everything&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// For SSE servers (requires mcp-remote package):&lt;/span&gt;
&lt;span class="c1"&gt;// Custom SSE endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx mcp-remote https://your-domain.com/mcp-endpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Install your MCP server dependencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Update &lt;code&gt;package.json&lt;/code&gt; with the MCP server dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@modelcontextprotocol/server-everything"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2025.5.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcp-remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^0.1.16"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re developing the Actor locally, you can use &lt;code&gt;npm install&lt;/code&gt; instead of editing &lt;code&gt;package.json&lt;/code&gt; directly.&lt;/p&gt;
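
&lt;p&gt;For example, installing the two packages from the snippet above locally could look like this (versions are resolved by npm):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @modelcontextprotocol/server-everything mcp-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
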

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Set up monetization (optional)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/pay_per_event.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool-request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventTitle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tool Request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventDescription"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charge for each tool execution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventPriceUsd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger charges in your code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For TypeScript:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure Actor settings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/actor.json&lt;/code&gt;, set &lt;code&gt;webServerMcpPath&lt;/code&gt; to &lt;code&gt;/sse&lt;/code&gt; so the Actor is recognized as an MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usesStandbyMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"webServerMcpPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/sse"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;So&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Actor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;recognized&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;server&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Deploy to Apify&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apify login
apify push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 7: Configure standby mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In Apify Console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your Actor’s settings&lt;/li&gt;
&lt;li&gt;Enable “Standby mode”&lt;/li&gt;
&lt;li&gt;Set idle timeout (e.g., 300 seconds)&lt;/li&gt;
&lt;li&gt;Adjust memory allocation as needed&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 8: Connect your MCP client&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use this URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://your-username--my-mcp-server.apify.actor/sse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point your MCP client at that URL and set the &lt;code&gt;Authorization&lt;/code&gt; header with your bearer token: &lt;code&gt;Authorization: Bearer your-token&lt;/code&gt;.&lt;/p&gt;
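
&lt;p&gt;The exact configuration depends on your MCP client. As a rough sketch, a client that reads an &lt;code&gt;mcpServers&lt;/code&gt;-style JSON config could proxy the SSE endpoint through &lt;code&gt;mcp-remote&lt;/code&gt;, assuming your &lt;code&gt;mcp-remote&lt;/code&gt; version supports passing headers via &lt;code&gt;--header&lt;/code&gt;; the server name, URL, and token below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "my-mcp-server": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://your-username--my-mcp-server.apify.actor/sse",
        "--header",
        "Authorization: Bearer your-token"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
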

&lt;h2&gt;
  
  
  Advanced configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Environment variables&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To set non-sensitive environment variables, use &lt;code&gt;actor.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RATE_LIMIT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Do not put API tokens or other sensitive values in the &lt;code&gt;actor.json&lt;/code&gt; file. Instead, set sensitive environment variables in the Apify Console UI under your Actor's settings before building. For more information on setting custom environment variables, see the &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/environment-variables#custom-environment-variables" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
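
&lt;p&gt;However they are set, the variables arrive in your Actor as ordinary process environment variables. A minimal sketch of consuming them in &lt;code&gt;src/main.ts&lt;/code&gt; (the fallback value and the &lt;code&gt;MY_API_TOKEN&lt;/code&gt; name are illustrative assumptions, not part of the template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Non-sensitive value defined in actor.json, with a fallback when unset
const rateLimit = Number(process.env.RATE_LIMIT ?? "100");

// A secret configured in the Apify Console is read the same way
const apiToken = process.env.MY_API_TOKEN;

console.log(`Rate limit: ${rateLimit}, token set: ${Boolean(apiToken)}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
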

&lt;h3&gt;
  
  
  &lt;strong&gt;TypeScript template capabilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The TypeScript template supports both stdio and SSE server types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio servers&lt;/strong&gt;: Simple command string configuration for local MCP servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE servers&lt;/strong&gt;: Remote server proxying using mcp-remote for connecting to external SSE endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Debugging and monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Local development&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To run the MCP server locally, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;APIFY_META_ORIGIN="STANDBY" ACTOR_WEB_SERVER_PORT=8080 apify run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use the &lt;a href="https://github.com/modelcontextprotocol/inspector" rel="noopener noreferrer"&gt;MCP inspector&lt;/a&gt; on GitHub for debugging and testing locally.&lt;/p&gt;
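
&lt;p&gt;For example, you can launch the inspector with &lt;code&gt;npx&lt;/code&gt; and point it at the locally running server (the port and &lt;code&gt;/sse&lt;/code&gt; path follow from the configuration above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector
# then connect the inspector UI to http://localhost:8080/sse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
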

&lt;h3&gt;
  
  
  &lt;strong&gt;Production monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;View Actor logs in Apify Console&lt;/li&gt;
&lt;li&gt;Set up error/usage alerts&lt;/li&gt;
&lt;li&gt;Track monetization in Analytics&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Troubleshooting common issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory errors&lt;/strong&gt;: Increase memory or optimize code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth failures&lt;/strong&gt;: Check tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add more tools&lt;/li&gt;
&lt;li&gt;Integrate with Apify Actors&lt;/li&gt;
&lt;li&gt;Build custom clients&lt;/li&gt;
&lt;li&gt;Share your server on &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Start building&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying MCP servers on the Apify platform turns local tools into scalable, monetizable cloud services. With our TypeScript template and standby mode, you can get a production-ready server running in minutes, whether it's a local stdio server or an existing SSE server.&lt;/p&gt;

&lt;p&gt;Explore what’s possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;Python MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;TypeScript MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.apify.com/platform/integrations/mcp" rel="noopener noreferrer"&gt;Apify MCP documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Join the Apify Discord&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>10 web scraping challenges (+ solutions) in 2025</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Thu, 05 Dec 2024 15:04:45 +0000</pubDate>
      <link>https://dev.to/apify/10-web-scraping-challenges-solutions-in-2025-5bhd</link>
      <guid>https://dev.to/apify/10-web-scraping-challenges-solutions-in-2025-5bhd</guid>
      <description>&lt;p&gt;Web scraping comes with its fair share of challenges. Websites are becoming increasingly difficult to scrape due to the rise of anti-scraping measures like CAPTCHAs and browser fingerprinting. At the same time, the demand for data, especially to fuel AI, is higher than ever. &lt;/p&gt;

&lt;p&gt;As you probably know, web scraping isn’t always a stress-free process, but learning how to navigate these obstacles can be incredibly rewarding.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll cover 10 common problems you’re likely to encounter when scraping the web and, just as importantly, how to solve them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#1-dynamic-content" rel="noopener noreferrer"&gt;Dynamic content
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.apify.com/web-scraping-challenges/#2-user-agents-and-browser-fingerprinting" rel="noopener noreferrer"&gt;User agents and browser fingerprinting
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#3-rate-limiting" rel="noopener noreferrer"&gt;Rate limiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#4-ip-bans" rel="noopener noreferrer"&gt;IP bans
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#5-honeypot-traps" rel="noopener noreferrer"&gt;Honeypot traps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#6-captchas" rel="noopener noreferrer"&gt;CAPTCHAs
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#7-data-storage-and-organization" rel="noopener noreferrer"&gt;Data storage and organization
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt; &lt;a href="https://blog.apify.com/web-scraping-challenges/#8-automation-and-monitoring" rel="noopener noreferrer"&gt;Automation and monitoring
&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#9-scalability-and-reliability" rel="noopener noreferrer"&gt;Scalability and reliability
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#10-real-time-data-scraping" rel="noopener noreferrer"&gt;Real-time data scraping&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the solutions, we’ll use Crawlee, an open-source library for Python and Node.js, and the Apify platform. These tools make life easier, but the techniques we’ll talk about can be used with other tools as well. By the end, you’ll have a solid understanding of how to overcome some of the toughest hurdles web scraping can throw at you. &lt;/p&gt;

&lt;h2&gt;
  
  
  1. Dynamic content
&lt;/h2&gt;

&lt;p&gt;Modern websites often use JavaScript frameworks like React, Angular, or Vue.js to create dynamic and interactive experiences. These &lt;a href="https://blog.apify.com/scraping-single-page-applications-with-playwright/" rel="noopener noreferrer"&gt;single-page applications (SPAs)&lt;/a&gt; load content on the fly without refreshing the page, which is great for users but can complicate web scraping.&lt;/p&gt;

&lt;p&gt;Traditional scrapers that pull raw HTML often miss data generated by JavaScript after the page loads. To &lt;a href="https://blog.apify.com/scrape-dynamic-websites-with-python/" rel="noopener noreferrer"&gt;capture dynamically loaded content&lt;/a&gt;, scrapers need to execute JavaScript and interact with the page, just like a browser.&lt;/p&gt;

&lt;p&gt;That’s where headless browsers like Playwright, Puppeteer, or Selenium come in. They mimic real browsers, loading JavaScript and revealing the data you need.&lt;/p&gt;

&lt;p&gt;In the example below, we’re using &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, an open-source web scraping library, with Playwright to scrape a dynamic page (MintMobile). While Playwright alone could handle this, Crawlee adds powerful web scraping features you’ll learn about in the next sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;launchContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Here you can set options that are passed to the playwright .launch() function.&lt;/span&gt;
        &lt;span class="na"&gt;launchOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;launcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#WebPage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;h1[data-qa="device-name"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[data-qa="storage-selection"] p:nth-child(1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;devicePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[data-qa="storage-selection"] p:nth-child(2)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;productInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`No product info found on &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Extracted product info from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// Save the extracted data, e.g., push to Apify dataset&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productInfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Start the crawler with a list of product review pages&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.mintmobile.com/devices/samsung-galaxy-z-flip6/6473480/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. User agents and browser fingerprinting
&lt;/h2&gt;

&lt;p&gt;If a website blocks your scraper, you can’t access the data, which makes all your efforts pointless. To avoid this, you want to make your scrapers mimic real users as much as possible. Two basic elements of anti-bot defenses to keep in mind are &lt;strong&gt;user agents&lt;/strong&gt; and &lt;strong&gt;browser fingerprinting.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A user agent is a piece of metadata sent with every HTTP request, telling the website what browser and device are making the request. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your scraper uses something obvious like the default Axios user agent, &lt;code&gt;axios/1.7.2&lt;/code&gt;, the site will likely flag you as a bot and block your access.&lt;/p&gt;

&lt;p&gt;Fingerprinting takes it a step further. Websites analyze details like your screen resolution, installed fonts, timezone, language, and even whether the browser is running in headless mode. All this data creates a unique “fingerprint” for your scraper. If your fingerprint looks too uniform or lacks variety, like using the same resolution or timezone across all requests, you’re more likely to get caught. Some sites can even track you across sessions, bypassing tactics like IP rotation.&lt;/p&gt;

&lt;p&gt;As you can imagine, manually managing user agents and fingerprints can be a headache: it's time-consuming, error-prone, and hard to keep up with as websites constantly improve their defenses.&lt;/p&gt;

&lt;p&gt;Thankfully, modern open-source tools like Crawlee take care of these challenges for us. Crawlee automatically applies matching user agents and fingerprints to our requests so our bots appear “human-like.” Its &lt;code&gt;PlaywrightCrawler&lt;/code&gt; and &lt;code&gt;PuppeteerCrawler&lt;/code&gt; also make headless browsers behave like real ones, lowering your chances of detection, which is why I opted to use Playwright with Crawlee in the first section 😉&lt;/p&gt;
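
&lt;p&gt;If you want to tune those defaults rather than rely on them, Crawlee exposes fingerprint generation options on its browser pool. Here's a rough sketch; the option values are illustrative, not recommendations, and the URL is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Fingerprint injection is on by default; shown here for clarity
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{ name: 'firefox', minVersion: 96 }],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
    async requestHandler({ page, request, log }) {
        // The generated fingerprint and matching headers are already applied here
        log.info(`Visiting ${request.url}`);
    },
});

await crawler.run(['https://example.com/']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
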

&lt;h2&gt;
  
  
  3. Rate limiting
&lt;/h2&gt;

&lt;p&gt;Rate limiting is how websites keep things under control by capping the number of requests a user or IP can make within a set time frame. This helps prevent server overload, defend against DoS attacks, and discourage automated scrapers. If your scraper goes over the limit, the server might respond with a &lt;strong&gt;429 Too Many Requests&lt;/strong&gt; error or even block your IP temporarily. This can be a major roadblock, interrupting your data collection and leaving you with incomplete results.&lt;/p&gt;

&lt;p&gt;To solve this issue, you need to manage your request rates and stay within the website’s limits. Crawlee makes this easy by offering options to fine-tune how many requests your scraper sends at once, how many it sends per minute, and how it scales based on your system’s resources. This gives you the flexibility to adjust your scraper to avoid hitting rate limits while maintaining strong performance.&lt;/p&gt;

&lt;p&gt;Here’s an example of how to handle rate limiting using Crawlee’s &lt;strong&gt;CheerioCrawler&lt;/strong&gt; with adaptive concurrency to scrape &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News:&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// Ensure there will always be at least 2 concurrent requests&lt;/span&gt;
    &lt;span class="na"&gt;minConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// Prevent the crawler from exceeding 20 concurrent requests&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...but also ensure the crawler never exceeds 250 requests per minute&lt;/span&gt;
    &lt;span class="na"&gt;maxRequestsPerMinute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data using Cheerio&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the results to the default dataset.&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Find a link to the next page and enqueue it if it exists.&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.morelink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;infos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedRequests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is the last page!`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Run the crawler and wait for it to finish.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler finished.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. IP bans
&lt;/h2&gt;

&lt;p&gt;Building on the discussion about rate limiting, IP bans are another common issue you might have come across when scraping the web. Simply put, when a scraper sends too many requests too quickly or behaves in ways that don’t seem natural, the server might block the IP address, either temporarily or permanently. When that happens, your data collection comes to a complete halt, and naturally, we want to prevent this from happening.&lt;/p&gt;

&lt;p&gt;While managing your scraper’s concurrency can help avoid this, sometimes it’s not enough. If you’re still running into blocks, using proxy rotation is a great next step. By rotating IP addresses, you can spread out your requests and make it harder for websites to flag and block your crawler’s activity.&lt;/p&gt;

&lt;p&gt;With Crawlee, adding proxies is straightforward. Whether you’re using your own servers or working with a third-party provider, Crawlee handles the rotation automatically, ensuring your requests come from different IPs.&lt;/p&gt;

&lt;p&gt;If you already have a list of proxies ready, integrating them into your Crawlee scraper takes just a few lines of code. Here’s how you can do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;proxyUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://proxy-1.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://proxy-2.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...rest of the code&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can use a third-party tool like &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;Apify Proxy&lt;/a&gt; to access a large pool of residential and datacenter proxies, making proxy management even easier. It also gives you added flexibility by letting you control proxy groups and country codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apify&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RESIDENTIAL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;countryCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Honeypot traps
&lt;/h2&gt;

&lt;p&gt;Honeypot traps are hidden elements in a website’s HTML designed to detect and block automated bots and scrapers. These traps, like hidden links, forms, or buttons, are invisible to regular users but can be accidentally triggered by scrapers that process every element indiscriminately. When this happens, it signals bot activity to the website, often resulting in blocks, IP bans, and other issues. In short, you want to keep your scraper far away from these traps.&lt;/p&gt;

&lt;p&gt;One way to avoid these traps is by filtering out hidden elements. You can check for CSS properties such as &lt;code&gt;display: none&lt;/code&gt; and &lt;code&gt;visibility: hidden&lt;/code&gt; to exclude them from your scraping process.&lt;/p&gt;

&lt;p&gt;Another approach is to simulate real user behavior. Instead of scraping the entire HTML, focus on specific sections of the page where the data is located. Mimicking real interactions, like clicking on visible elements or navigating the page, helps your scraper appear more human-like and prevents it from interacting with invisible elements that a user wouldn’t be aware of.&lt;/p&gt;

&lt;p&gt;Here’s an example of how you could modify the Hacker News scraper from the earlier section to filter out Honeypot traps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Function to check if an element is visible (filter out Honeypots)&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isElementVisible&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;style&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;display&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visibility&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;opacity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;height&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;width&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]);&lt;/span&gt;
            &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;visibility&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hidden&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;opacity&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data using Cheerio while avoiding Honeypot traps&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;isElementVisible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the results to the default dataset.&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Find a link to the next page and enqueue it if it exists.&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.morelink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;infos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedRequests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is the last page!`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Run the crawler and wait for it to finish.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler finished.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. CAPTCHAs
&lt;/h2&gt;

&lt;p&gt;CAPTCHAs, or Completely Automated Public Turing tests to tell Computers and Humans Apart, are the familiar challenges we’ve all seen: clicking on traffic lights or selecting crosswalks in image grids. While frustrating for humans, they are designed to block bots, which makes them one of the toughest obstacles for scrapers. Encountering one during scraping can bring your process to a halt, as bots can’t solve these puzzles on their own.&lt;/p&gt;

&lt;p&gt;The good news is that much of what we’ve already covered, like &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;avoiding honeypot traps, rotating IPs, and making your scraper mimic human behavior&lt;/a&gt;, also helps reduce the chances of triggering CAPTCHAs. Websites generally try to show CAPTCHAs only when the activity looks suspicious. By blending in with regular traffic through techniques like rotating IPs, randomizing interactions, and managing request patterns thoughtfully, your scraper can often bypass CAPTCHAs entirely.&lt;/p&gt;

&lt;p&gt;However, CAPTCHAs can still appear, even when precautions are in place. In such cases, your best bet is to integrate a CAPTCHA-solving service. Tools like Apify’s &lt;a href="https://apify.com/petr_cermak/anti-captcha-recaptcha" rel="noopener noreferrer"&gt;Anti Captcha Recaptcha Actor&lt;/a&gt;, which works with &lt;a href="https://anti-captcha.com/" rel="noopener noreferrer"&gt;Anti-Captcha&lt;/a&gt;, can help you equip your crawlers with CAPTCHA-solving capabilities to handle these challenges automatically and avoid disrupting your scraping.&lt;/p&gt;

&lt;p&gt;Here is an example of how you could use the Apify API to integrate the Anti Captcha Recaptcha Actor into your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApifyClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apify-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize the ApifyClient with API token&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Prepare Actor input&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cookies&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name=value; name2=value2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anticaptcha-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyAddress&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;8.8.8.8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyLogin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;theLogin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyPassword&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thePassword&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyPort&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyType&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;siteKey&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;6LfD3PIbAAAAAJs_eEHvoOl75_83eXSqpPSRFJ_u&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;userAgent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Opera 6.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;webUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://2captcha.com/demo/recaptcha-v2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Run the Actor and wait for it to finish&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;petr_cermak/anti-captcha-recaptcha&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
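
&lt;p&gt;The &lt;code&gt;call()&lt;/code&gt; method waits for the Actor run to finish; to actually use the result, you then read the run’s output. Here’s a minimal sketch, assuming the solution is written to the run’s default key-value store under the &lt;code&gt;OUTPUT&lt;/code&gt; key (check the Actor’s documentation for the exact output format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// The same input object as in the example above
const input = { /* cookies, key, proxy settings, siteKey, userAgent, webUrl */ };

// Run the CAPTCHA-solving Actor and wait for it to finish
const run = await client.actor('petr_cermak/anti-captcha-recaptcha').call(input);

// Read the solved token from the run's default key-value store (assumed OUTPUT record)
const record = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('OUTPUT');
console.log('CAPTCHA solution:', record ? record.value : 'no OUTPUT record found');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;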



&lt;h2&gt;
  
  
  7. Data storage and organization
&lt;/h2&gt;

&lt;p&gt;Storing and organizing data effectively is often overlooked in smaller projects but is actually a core component of any successful web scraping operation.&lt;/p&gt;

&lt;p&gt;While collecting data is the first step, how you store, access, and present it has a huge impact on its usability and scalability. Web scraping generates a mix of data types, from structured information like prices and reviews to unstructured content like PDFs and images. This variety demands flexible storage solutions. For small projects, simple CSV or JSON files stored locally might work, but as your needs grow, these methods can quickly fall short.&lt;/p&gt;
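
&lt;p&gt;For example, a small one-off scraper might simply dump its results into a local JSON file. A minimal sketch (the file name and items are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { writeFile } from 'node:fs/promises';

// Whatever array of records your scraper produced
const items = [
    { title: 'Example story', rank: '1.', href: 'https://example.com' },
];

// Persist the results as pretty-printed JSON next to the script
await writeFile('results.json', JSON.stringify(items, null, 2));
console.log(`Saved ${items.length} items to results.json`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;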

&lt;p&gt;For larger datasets or ongoing scraping, cloud-based solutions like &lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;, &lt;a href="https://aws.amazon.com/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; or &lt;a href="https://apify.com/storage" rel="noopener noreferrer"&gt;Apify Storage&lt;/a&gt; become necessary. They’re designed to handle large volumes of data and offer quick querying capabilities.&lt;/p&gt;

&lt;p&gt;One standout advantage of &lt;a href="https://apify.com/storage" rel="noopener noreferrer"&gt;Apify Storage&lt;/a&gt; is that it’s specifically designed to meet the needs of web scraping. It offers Datasets for structured data, Key-Value Stores for storing metadata or configurations, and Request Queues to help manage and track your scraping workflows. It integrates seamlessly with tools like Crawlee, provides API access for straightforward data retrieval and management, and supports exporting data in multiple formats.&lt;/p&gt;
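
&lt;p&gt;Here’s a minimal sketch of how those three storage types are used from the Apify JavaScript SDK inside an Actor (the values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { Actor } from 'apify';

await Actor.init();

// Dataset: append structured results, one or more records per call
await Actor.pushData({ url: 'https://example.com', price: 42 });

// Key-value store: save metadata, configuration, or files
await Actor.setValue('LAST_RUN_INFO', { finishedAt: new Date().toISOString() });

// Request queue: enqueue URLs that still need to be processed
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com/page-2' });

await Actor.exit();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;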

&lt;p&gt;Best of all, Apify Storage is just one piece of the comprehensive Apify platform, which delivers a full-stack solution for all your web scraping needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ycky6muzpocyjz216v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ycky6muzpocyjz216v8.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Automation and monitoring
&lt;/h2&gt;

&lt;p&gt;Manually running scrapers every time you need fresh data is not practical, especially for projects requiring regular updates like price tracking, market research, or monitoring real-time changes.&lt;/p&gt;

&lt;p&gt;Automation ensures your workflows run on schedule, minimizing errors and keeping your data current, while monitoring helps detect and address issues like failed requests, CAPTCHAs, or website structure changes before they cause disruptions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/platform/monitoring" rel="noopener noreferrer"&gt;Apify Platform Monitoring&lt;/a&gt; simplifies this process by providing tools specifically designed for automating and monitoring web scraping workflows. With task scheduling, you can set your scrapers to run at specific intervals, ensuring consistent data updates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy82h2sk3qfw0rp1z04f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy82h2sk3qfw0rp1z04f.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As well as helping you automate scraping, Apify offers monitoring features that let you view task statuses, detailed logs, and error messages, keeping you informed about your scraper’s performance. Notifications and alerts can be configured to tell you about task completions or errors via email, Slack, or other integrations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0nw017pc4jkziffk03g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0nw017pc4jkziffk03g.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
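
&lt;p&gt;If you prefer to wire up alerting yourself, you can also register a webhook that calls an HTTP endpoint (for example, a Slack incoming webhook) whenever a run fails. A sketch with apify-client, where the token, Actor ID, and target URL are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Notify the given URL whenever a run of the specified Actor fails
await client.webhooks().create({
    eventTypes: ['ACTOR.RUN.FAILED'],
    condition: { actorId: 'YOUR_ACTOR_ID' },
    requestUrl: 'https://hooks.example.com/scraper-alerts',
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;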

&lt;h2&gt;
  
  
  9. Scalability and reliability
&lt;/h2&gt;

&lt;p&gt;Building a scalable and reliable scraping operation relies on the key principles we’ve covered: avoiding blocks, maintaining data consistency, storing collected data efficiently, and automating tasks with proper monitoring. Together, these elements create a solid foundation for a system that can grow with your needs while ensuring quality and performance remain intact.&lt;/p&gt;

&lt;p&gt;One crucial yet often overlooked aspect of scalability is infrastructure management. Handling your own servers can quickly turn into a costly and time-consuming challenge, especially as your project expands. That’s why choosing a robust cloud-based solution like Apify from the very start of your project is a smart choice. Designed for scalability, it automatically adjusts to your project’s needs, so you never have to worry about provisioning servers or hitting capacity limits. You only pay for what you use, keeping costs manageable while ensuring your scrapers keep running without interruption.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get a free Apify plan now!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  10. Real-time data scraping
&lt;/h2&gt;

&lt;p&gt;The idea behind real-time data scraping is to continuously collect data as soon as it becomes available. This is often a critical requirement for projects involving time-sensitive data, such as stock market analysis, price monitoring, news aggregation, and tracking live trends.&lt;/p&gt;

&lt;p&gt;To achieve this, you need to &lt;a href="https://docs.apify.com/academy/deploying-your-code/deploying" rel="noopener noreferrer"&gt;deploy your code to a cloud platform&lt;/a&gt; and automate your scraping process with a proper schedule. For example, you can deploy your scraping script as an Apify Actor and schedule it to run at intervals that match how “fresh” you need the data to be. Apify’s scheduling and monitoring tools make it easy for you to implement this automation, ensuring a constant flow of real-time data while helping you promptly handle any errors to maintain accuracy and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;And here we are at the end of the article. I hope you’ve found it helpful and can use it as a reference when dealing with the challenges we’ve discussed. Of course, every scraping project is unique, and it’s impossible to cover every scenario in one post. That’s where the value of a strong developer community comes in.&lt;/p&gt;

&lt;p&gt;Connecting with other developers who have faced and solved similar challenges can make a big difference. It’s a chance to exchange ideas, get advice, and share your own experiences.&lt;/p&gt;

&lt;p&gt;If you haven’t already, I encourage you to &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;join the Apify &amp;amp; Crawlee Developer Community on Discord&lt;/a&gt;. It’s a great space to learn, collaborate, and grow alongside others who share your interest in web scraping.&lt;/p&gt;

&lt;p&gt;Hope to see you there!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>crawlee</category>
      <category>antiscraping</category>
    </item>
    <item>
      <title>11 best open-source web crawlers and scrapers in 2024</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Tue, 29 Oct 2024 14:33:22 +0000</pubDate>
      <link>https://dev.to/apify/11-best-open-source-web-crawlers-and-scrapers-in-2024-16pe</link>
      <guid>https://dev.to/apify/11-best-open-source-web-crawlers-and-scrapers-in-2024-16pe</guid>
      <description>&lt;p&gt;Free software libraries, packages, and SDKs for web crawling? Or is it a web scraper that you need?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hey, we're &lt;a href="https://apify.com/pricing" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;. You can build, deploy, share, and monitor your scrapers and crawlers on the Apify platform. &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;Check us out&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're tired of the limitations and costs of proprietary &lt;a href="https://blog.apify.com/best-web-scraping-tools/" rel="noopener noreferrer"&gt;web scraping tools&lt;/a&gt; or being locked into a single vendor, open-source web crawlers and scrapers offer a flexible, customizable alternative.&lt;/p&gt;

&lt;p&gt;But not all open-source tools are the same.&lt;/p&gt;

&lt;p&gt;Some are full-fledged libraries capable of handling large-scale &lt;a href="https://blog.apify.com/web-data-extraction/" rel="noopener noreferrer"&gt;data extraction&lt;/a&gt; projects, while others excel at &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic content&lt;/a&gt; or are ideal for smaller, lightweight tasks. The right tool depends on your project’s complexity, the type of data you need, and your preferred programming language.&lt;/p&gt;

&lt;p&gt;The libraries, frameworks, and SDKs we cover here take into account the diverse needs of developers, so you can choose a tool that meets your requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are open-source web crawlers and web scrapers?
&lt;/h2&gt;

&lt;p&gt;Open-source web crawlers and scrapers let you adapt code to your needs without the cost of licenses or restrictions. Crawlers gather broad data, while scrapers target specific information. Open-source solutions like the ones below offer community-driven improvements, flexibility, and scalability—free from vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 11 open-source web crawlers and scrapers in 2024
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Crawlee
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Node.js, Python | GitHub: 15.4K+ stars | &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee is a complete web scraping and browser automation library designed for quickly and efficiently building reliable crawlers. With built-in anti-blocking features, it makes your bots look like real human users, &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;reducing the likelihood of getting blocked.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qrz5pe5ow097063ioo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qrz5pe5ow097063ioo.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both &lt;a href="https://crawlee.dev/?__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1730191216477.1730206072795.287&amp;amp;__hssc=160404322.1.1730206072795&amp;amp;__hsfp=3227366388&amp;amp;_gl=1*bl1l4a*_gcl_aw*R0NMLjE3MjU2MTE1OTEuQ2owS0NRancwT3EyQmhDQ0FSSXNBQTVodWJVMDVaYVVXd291ZmtZMzhCY2pGMmpadFdBandUa1Q3YmNWZHY0dHZuSVQ1aUF5R2Zva2tySWFBdl81RUFMd193Y0I.*_gcl_au*MjA3NDM2MDE2Ni4xNzI5NTA0MzkyLjE4MDExMDkzMzguMTcyOTg0OTE5NC4xNzI5ODQ5NDAw*_ga*MTYzNDgwNjc5NC4xNzEzNzg4OTYz*_ga_62P18XN9NS*MTczMDIwNjA3MS4zMTIuMS4xNzMwMjA2NTU5LjMxLjAuMA.." rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; and &lt;a href="https://crawlee.dev/python?__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1730191216477.1730206072795.287&amp;amp;__hssc=160404322.1.1730206072795&amp;amp;__hsfp=3227366388&amp;amp;_gl=1*bl1l4a*_gcl_aw*R0NMLjE3MjU2MTE1OTEuQ2owS0NRancwT3EyQmhDQ0FSSXNBQTVodWJVMDVaYVVXd291ZmtZMzhCY2pGMmpadFdBandUa1Q3YmNWZHY0dHZuSVQ1aUF5R2Zva2tySWFBdl81RUFMd193Y0I.*_gcl_au*MjA3NDM2MDE2Ni4xNzI5NTA0MzkyLjE4MDExMDkzMzguMTcyOTg0OTE5NC4xNzI5ODQ5NDAw*_ga*MTYzNDgwNjc5NC4xNzEzNzg4OTYz*_ga_62P18XN9NS*MTczMDIwNjA3MS4zMTIuMS4xNzMwMjA2NTU5LjMxLjAuMA.." rel="noopener noreferrer"&gt;Python&lt;/a&gt;, Crawlee offers a unified interface that supports HTTP and headless browser crawling, making it versatile for various scraping tasks. It integrates with libraries like Cheerio and &lt;a href="https://blog.apify.com/how-to-parse-html-in-python/" rel="noopener noreferrer"&gt;Beautiful Soup for efficient HTML parsing&lt;/a&gt; and headless browsers like Puppeteer and &lt;a href="https://blog.apify.com/scrape-dynamic-websites-with-python/" rel="noopener noreferrer"&gt;Playwright for JavaScript rendering.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The library excels in scalability, automatically managing concurrency based on system resources, &lt;a href="https://blog.apify.com/rotating-proxies/" rel="noopener noreferrer"&gt;rotating proxies to enhance efficiency&lt;/a&gt;, and employing human-like browser fingerprints to avoid detection. Crawlee also ensures robust data handling through persistent URL queuing and pluggable storage for data and files.&lt;/p&gt;
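
&lt;p&gt;As a quick taste, here’s a minimal sketch of a Crawlee crawler that renders pages in a headless browser; swapping &lt;code&gt;PlaywrightCrawler&lt;/code&gt; for &lt;code&gt;CheerioCrawler&lt;/code&gt; turns it into a plain HTTP crawler:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Extract the page title from the rendered DOM
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });

        // Follow links discovered on the page
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;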

&lt;p&gt;&lt;a href="https://crawlee.dev/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Check out Crawlee&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy switching between simple HTTP request/response handling and complex JavaScript-heavy pages by changing just a few lines of code.&lt;/li&gt;
&lt;li&gt;Built-in sophisticated anti-blocking features like proxy rotation and generation of human-like fingerprints.&lt;/li&gt;
&lt;li&gt;Integrating tools for common tasks like link extraction, infinite scrolling, and blocking unwanted assets, along with support for both Cheerio and JSDOM, provides a comprehensive scraping toolkit right out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its comprehensive feature set and the requirement to understand HTTP and browser-based scraping can create a steep learning curve.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🟧 &lt;a href="https://blog.apify.com/crawlee-web-scraping-tutorial/" rel="noopener noreferrer"&gt;Crawlee web scraping tutorial for Node.js&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Crawlee is ideal for developers and teams seeking to manage simple and complex web scraping and automation tasks in JavaScript/TypeScript and Python. It is particularly effective for scraping web applications that combine static and dynamic pages, as it allows easy switching between different types of crawlers to handle each scenario. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Deploy your scraping code to the cloud&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Python | GitHub: 52.9k stars | &lt;a href="https://github.com/scrapy/scrapy" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is one of the most complete and popular &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; frameworks within the Python ecosystem. It is written using Twisted, an event-driven networking framework, giving Scrapy asynchronous capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6kdxyqefyhlty3mvncp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6kdxyqefyhlty3mvncp.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a comprehensive &lt;a href="https://blog.apify.com/web-crawling-vs-web-scraping/" rel="noopener noreferrer"&gt;web crawling&lt;/a&gt; framework designed specifically for data extraction, Scrapy provides built-in support for handling requests, processing responses, and exporting data in multiple formats, including CSV, JSON, and XML.&lt;/p&gt;

&lt;p&gt;Its main drawback is that it cannot natively handle dynamic websites. However, you can &lt;a href="https://blog.apify.com/scrapy-playwright/" rel="noopener noreferrer"&gt;configure Scrapy with a browser automation tool like Playwright&lt;/a&gt; or Selenium to unlock these capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" rel="noopener noreferrer"&gt;Learn more about using Scrapy for web scraping&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significant performance boost due to its asynchronous nature.&lt;/li&gt;
&lt;li&gt;Specifically designed for web scraping, providing a robust foundation for such tasks.&lt;/li&gt;
&lt;li&gt;Extensible &lt;a href="https://blog.apify.com/scrapy-middleware/" rel="noopener noreferrer"&gt;middleware architecture&lt;/a&gt; makes adjusting Scrapy’s capabilities to fit various scraping scenarios easy.&lt;/li&gt;
&lt;li&gt;Supported by a well-established community with a wealth of resources available online.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Steep learning curve, which can be challenging for less experienced web scraping developers.&lt;/li&gt;
&lt;li&gt;Lacks the ability to handle content generated by JavaScript natively, requiring integration with tools like Selenium or Playwright to scrape dynamic pages.&lt;/li&gt;
&lt;li&gt;More complex than necessary for simple and small-scale scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Scrapy is ideally suited for developers, data scientists, and researchers embarking on large-scale web scraping projects who require a reliable and scalable solution for extracting and processing vast amounts of data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Run multiple Scrapy spiders in the cloud&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/cli/docs/integrating-scrapy" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. MechanicalSoup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Python | GitHub: 4.7K+ stars | &lt;a href="https://github.com/MechanicalSoup/MechanicalSoup" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MechanicalSoup is a Python library designed to automate website interactions. It provides a simple API to access and interact with HTML content, similar to interacting with web pages through a web browser, but programmatically. MechanicalSoup essentially combines the best features of libraries like Requests for HTTP requests and Beautiful Soup for HTML parsing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbshzbw663nkcome2ytc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbshzbw663nkcome2ytc7.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you might wonder when to use MechanicalSoup over the traditional combination of BS4+ Requests. MechanicalSoup provides some distinct features &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;particularly useful for specific web scraping tasks.&lt;/a&gt; These include submitting forms, handling login authentication, navigating through pages, and extracting data from HTML.&lt;/p&gt;

&lt;p&gt;MechanicalSoup makes it possible by creating a &lt;code&gt;StatefulBrowser&lt;/code&gt; object in Python that can store cookies and session data and handle other aspects of a browsing session.&lt;/p&gt;

&lt;p&gt;However, while MechanicalSoup offers some browser-like functionalities akin to what you'd expect from a browser automation tool such as Selenium, it does so without launching an actual browser. This approach has its advantages but also comes with certain limitations, which we'll explore next:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Great choice for simple automation tasks such as filling out forms and scraping data from pages that do not require JavaScript rendering.&lt;/li&gt;
&lt;li&gt;Lightweight tool that interacts with web pages through requests without a graphical browser interface. This makes it faster and less demanding on system resources.&lt;/li&gt;
&lt;li&gt;Directly integrates Beautiful Soup, offering all the benefits you would expect from BS4, plus some extra features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlike real browser automation tools like Playwright and Selenium, MechanicalSoup cannot execute JavaScript. Many modern websites require JavaScript for dynamic content loading and user interactions, which MechanicalSoup cannot handle.&lt;/li&gt;
&lt;li&gt;Unlike Selenium and Playwright, MechanicalSoup does not support advanced browser interactions such as moving the mouse, dragging and dropping, or keyboard actions that might be necessary to retrieve data from more complex websites.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; MechanicalSoup is a more efficient and lightweight option for more basic scraping tasks, especially for static websites and those with straightforward interactions and navigation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🍲 &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;Learn more about MechanicalSoup&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Node Crawler
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Node.js  | GitHub: 6.7K+ stars | &lt;a href="https://github.com/bda-research/node-crawler" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node Crawler, often referred to as 'Crawler,' is a popular web crawling library for Node.js. At its core, Crawler utilizes Cheerio as the default parser, but it can be configured to use JSDOM if needed. The library offers a wide range of customization options, including robust queue management that allows you to enqueue URLs for crawling while it manages concurrency, rate limiting, and retries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2gpkzblwaeu2tdwe7a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2gpkzblwaeu2tdwe7a2.png" alt=" " width="721" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built on Node.js, Node Crawler excels at efficiently handling multiple, simultaneous web requests, which makes it ideal for high-volume web scraping and crawling.&lt;/li&gt;
&lt;li&gt;Integrates directly with Cheerio (a fast, flexible, and lean implementation of core jQuery designed specifically for the server), simplifying the process of HTML parsing and data extraction.&lt;/li&gt;
&lt;li&gt;Provides extensive options for customization, from user-agent strings to request intervals, making it suitable for a wide range of web crawling scenarios.&lt;/li&gt;
&lt;li&gt;Easy to set up and use, even for those new to Node.js or web scraping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not handle JavaScript rendering natively. For dynamic JavaScript-heavy sites, you need to integrate it with something like Puppeteer or a &lt;a href="https://blog.apify.com/headless-browsers-what-are-they-and-how-do-they-work/" rel="noopener noreferrer"&gt;headless browser&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;While Node Crawler simplifies many tasks, the asynchronous model and event-driven architecture of Node.js can present a learning curve for those unfamiliar with such patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Node Crawler is a great choice for developers familiar with the Node.js ecosystem who need to handle large-scale or high-speed web scraping tasks. It provides a flexible solution for web crawling that leverages the strengths of Node.js's asynchronous capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;Related: Web scraping with Node.js guide&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5. Selenium
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Multi-language | GitHub: 30.6K stars | &lt;a href="https://github.com/SeleniumHQ/selenium" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selenium is a widely-used open-source framework for automating web browsers. It allows developers to write scripts in various programming languages to control browser actions. This makes it suitable for crawling and scraping dynamic content. Selenium provides a rich API that supports multiple browsers and platforms, so you can simulate user interactions like clicking buttons, filling forms, and navigating between pages. Its ability to handle JavaScript-heavy websites makes it particularly valuable for scraping modern web applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oylhhk9dpocsocegm9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oylhhk9dpocsocegm9f.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-browser support:&lt;/strong&gt; Works with all major browsers (Chrome, Firefox, Safari, etc.), allowing for extensive testing and scraping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content handling:&lt;/strong&gt; Capable of interacting with JavaScript-rendered content, making it effective for modern web applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich community and resources:&lt;/strong&gt; A large ecosystem of tools and libraries that enhance its capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource-intensive:&lt;/strong&gt; Running a full browser can consume significant system resources compared to headless solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steeper learning curve:&lt;/strong&gt; Requires understanding of browser automation concepts and may involve complex setup for advanced features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Selenium is ideal for developers and testers needing to automate web applications or scrape data from sites that heavily rely on JavaScript. Its versatility makes it suitable for both testing and data extraction tasks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;Related: How to do web scraping with Selenium in Python&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6. Heritrix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 2.8K+ stars | &lt;a href="https://github.com/internetarchive/heritrix3" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Heritrix is open-source web crawling software developed by the Internet Archive. It is primarily used for web archiving - collecting information from the web to build a digital library and support the Internet Archive's preservation efforts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqrmycqeb00sqiirxrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqrmycqeb00sqiirxrg.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for large-scale web archiving, making it ideal for institutions like libraries and archives needing to preserve digital content systematically.&lt;/li&gt;
&lt;li&gt;Detailed configuration options that allow users to customize crawl behavior deeply, including deciding which URLs to crawl, how to treat them, and how to manage the data collected.&lt;/li&gt;
&lt;li&gt;Able to handle large datasets, which is essential for archiving significant web portions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As it is written in Java, running Heritrix might require more substantial system resources than lighter, script-based crawlers, and it might limit usability for those unfamiliar with Java.&lt;/li&gt;
&lt;li&gt;Optimized for capturing and preserving web content rather than extracting data for immediate analysis or use.&lt;/li&gt;
&lt;li&gt;Does not render JavaScript, which means it cannot capture content from websites that rely heavily on JavaScript for dynamic content generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Heritrix is best suited for organizations and projects that aim to archive and preserve digital content on a large scale, such as libraries, archives, and other cultural heritage institutions. Its specialized nature makes it an excellent tool for its intended purpose but less adaptable for more general web scraping needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Apache Nutch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 2.9K+ stars | &lt;a href="https://github.com/apache/nutch" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Nutch is an extensible open-source web crawler often used in fields like data analysis. It can fetch content through protocols such as HTTPS, HTTP, or FTP and extract textual information from document formats like HTML, PDF, RSS, and ATOM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22mfcswqw3ffwiqf3bec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22mfcswqw3ffwiqf3bec.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly reliable for continuous, extensive crawling operations given its maturity and focus on enterprise-level crawling.&lt;/li&gt;
&lt;li&gt;Being part of the Apache project, Nutch benefits from strong community support, continuous updates, and improvements.&lt;/li&gt;
&lt;li&gt;Seamless integration with Apache Solr and other Lucene-based search technologies, making it a robust backbone for building search engines.&lt;/li&gt;
&lt;li&gt;Leveraging Hadoop allows Nutch to efficiently process large volumes of data, which is crucial for processing the web at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up Nutch and integrating it with Hadoop can be complex and daunting, especially for those new to these technologies.&lt;/li&gt;
&lt;li&gt;Overly complicated for simple or small-scale crawling tasks, whereas lighter, more straightforward tools could be more effective.&lt;/li&gt;
&lt;li&gt;Since Nutch is written in Java, it requires a Java environment, which might not be ideal for environments focused on other technologies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Apache Nutch is ideal for organizations building large-scale search engines or collecting and processing vast amounts of web data. Its capabilities are especially useful in scenarios where scalability, robustness, and integration with enterprise-level search technologies are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. WebMagic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 11.4K+ stars | &lt;a href="https://github.com/code4craft/webmagic" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WebMagic is an open-source, simple, and flexible Java framework dedicated to web scraping. Unlike large-scale data crawling frameworks like Apache Nutch, WebMagic is designed for more specific, targeted scraping tasks, which makes it suitable for individual and enterprise users who need to extract data from various web sources efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nh5arogcm1twjghch2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nh5arogcm1twjghch2l.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier to set up and use than more complex systems like Apache Nutch, which is designed for broader web indexing and requires more setup.&lt;/li&gt;
&lt;li&gt;Designed to be efficient for small to medium-scale scraping tasks, providing enough power without the overhead of larger frameworks.&lt;/li&gt;
&lt;li&gt;For projects already within the Java ecosystem, integrating WebMagic can be more seamless than integrating a tool from a different language or platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Being Java-based, it might not appeal to developers working with other programming languages who prefer libraries available in their chosen languages.&lt;/li&gt;
&lt;li&gt;WebMagic does not handle JavaScript rendering natively. For dynamic content loaded by JavaScript, you might need to integrate with headless browsers, which can complicate the setup.&lt;/li&gt;
&lt;li&gt;While it has good documentation, the community around WebMagic might not be as large or active as those surrounding more popular frameworks like Scrapy, potentially affecting the future availability of third-party extensions and support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; WebMagic is a suitable choice for developers looking for a straightforward, flexible Java-based web scraping framework that balances ease of use with sufficient power for most web scraping tasks. It's particularly beneficial for users within the Java ecosystem who need a tool that integrates smoothly into larger Java applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Nokogiri
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Ruby | GitHub: 6.1K+ stars | &lt;a href="https://github.com/sparklemotion/nokogiri" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like Beautiful Soup, Nokogiri is great at parsing HTML and XML documents, but it does so in Ruby. Nokogiri relies on native parsers such as libxml2, libgumbo, and xerces. If you want to programmatically read or edit an XML document in Ruby, Nokogiri is the way to go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6ou2ql0xc1zcqbo21k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6ou2ql0xc1zcqbo21k.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Due to its underlying implementation in C (libxml2 and libxslt), Nokogiri is extremely fast, especially compared to pure Ruby libraries.&lt;/li&gt;
&lt;li&gt;Able to handle both HTML and XML with equal proficiency, making it suitable for a wide range of tasks, from web scraping to RSS feed parsing.&lt;/li&gt;
&lt;li&gt;Straightforward and intuitive API for performing complex parsing and querying tasks.&lt;/li&gt;
&lt;li&gt;Strong, well-maintained community ensures regular updates and good support through forums and documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific to Ruby, which might not be suitable for those working in other programming environments.&lt;/li&gt;
&lt;li&gt;Installation can sometimes be problematic due to its dependencies on native C libraries.&lt;/li&gt;
&lt;li&gt;Can be relatively heavy regarding memory usage, especially when dealing with large documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Nokogiri is particularly well-suited for developers already working within the Ruby ecosystem who need a robust, efficient tool for parsing and manipulating HTML and XML data. Its speed, flexibility, and Ruby-native design make it an excellent choice for a wide range of web data extraction and transformation tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Playwright
&lt;/h3&gt;

&lt;p&gt;Language: Multi-language | GitHub: 67K+ stars | &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright&lt;/strong&gt;, an open-source Node.js library introduced in 2020, is widely used for automated browser testing and web scraping. It is cross-platform, supports multiple languages like TypeScript, JavaScript, Python, and Java, and works with Chromium, Firefox, and WebKit. Playwright offers unique features for web automation, including headless mode, auto-waits, browser contexts, authentication state persistence, and custom selector engines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v7l56mztdkayhs6uk01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v7l56mztdkayhs6uk01.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright supports multiple browsers including Chromium, Firefox, and WebKit, for consistent scraping across different platforms. It can also be utilized with various programming languages such as JavaScript, Python, Java, and .NET, which makes it accessible to a broader range of developers.&lt;/li&gt;
&lt;li&gt;Playwright can operate in headless mode, which reduces resource consumption and allows for faster execution of scraping tasks without a graphical interface. The framework automatically waits for elements to be ready before interacting with them. This reduces the need for manual delays and improves reliability.&lt;/li&gt;
&lt;li&gt;It effectively manages websites that rely on JavaScript and AJAX for content loading, so it's suitable for modern web applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running multiple browser instances can consume significant system resources, particularly when scraping large volumes of data.&lt;/li&gt;
&lt;li&gt;While capable, Playwright is primarily designed for browser automation and testing rather than dedicated web crawling, which can complicate extensive scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Playwright is best suited for developers looking to automate interactions with web applications that utilize modern frameworks like React or Angular. Its ability to handle dynamic content makes it ideal for scenarios where traditional HTTP request libraries fall short. It is particularly advantageous in projects that require frequent updates or interactions with complex web interfaces.&lt;/p&gt;
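&lt;p&gt;To give a feel for the API, here's a minimal Python sketch of that kind of use (the URL, selector, and &lt;code&gt;auth.json&lt;/code&gt; storage-state file are placeholders): it reuses a saved login session via a browser context and relies on Playwright's auto-waiting instead of manual delays.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # Reuse a previously saved login session (authentication state persistence)
    context = browser.new_context(storage_state="auth.json")
    page = context.new_page()

    page.goto("https://example.com/dashboard")

    # Playwright auto-waits for the selector to appear before proceeding
    page.wait_for_selector(".items")
    print(page.inner_text(".items"))

    browser.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;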

&lt;h3&gt;
  
  
  11. Katana
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Go | GitHub: 11.1K+ stars | &lt;a href="https://github.com/projectdiscovery/katana" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Katana is a web scraping framework focused on speed and efficiency. Developed by Project Discovery, it is designed to facilitate data collection from websites while providing a strong set of features tailored for security professionals and developers. Katana lets you create custom scraping workflows using a simple configuration format. It supports various output formats and integrates easily with other tools in the security ecosystem, which makes it a versatile choice for web crawling and scraping tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jdwahwfyc5ylp7s89b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jdwahwfyc5ylp7s89b.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High performance:&lt;/strong&gt; Built with efficiency in mind, allowing for fast data collection from multiple sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible architecture:&lt;/strong&gt; Easily integrates with other tools and libraries, enhancing its functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security-focused features:&lt;/strong&gt; Includes capabilities that cater specifically to the needs of security researchers and penetration testers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited community support:&lt;/strong&gt; As a newer tool, it does not have as extensive resources or community engagement as more established frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Niche use case focus:&lt;/strong&gt; Primarily designed for security professionals, which may limit its appeal for general-purpose web scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Katana is best suited for security professionals and developers looking for a fast, efficient framework tailored to web scraping needs within the cybersecurity domain. Its integration capabilities make it particularly useful in security testing scenarios where data extraction is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  All-in-one crawling and scraping solution: Apify
&lt;/h2&gt;

&lt;p&gt;Apify is a full-stack web scraping and browser automation platform for building crawlers and scrapers in any programming language. It provides infrastructure for successful scraping at scale: storage, integrations, scheduling, proxies, and more.&lt;/p&gt;

&lt;p&gt;So, whichever library you want to use for your scraping scripts, you can deploy them to the cloud and benefit from all the features the Apify platform has to offer.&lt;/p&gt;

&lt;p&gt;Apify also hosts a library of ready-made data extraction and automation tools (Actors) created by other developers, which you can customize for your use case. That means you don't have to build everything from scratch.&lt;/p&gt;
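&lt;p&gt;For a rough idea of what an Actor looks like in code, here's a minimal sketch using the Apify Python SDK. The scraping logic and field names are placeholders, and the exact API may differ slightly between SDK versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from apify import Actor

async def main():
    async with Actor:
        # Read the Actor input provided via the Apify platform or API
        actor_input = await Actor.get_input() or {}
        url = actor_input.get("url", "https://example.com")

        # ... run your scraping logic here (Playwright, BeautifulSoup, Scrapy, etc.) ...
        scraped_item = {"url": url, "title": "Example Domain"}  # placeholder result

        # Store results in the Actor's default dataset on the Apify platform
        await Actor.push_data(scraped_item)

if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;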

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ku3d1uzldkhzqglsl2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ku3d1uzldkhzqglsl2k.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Sign up now and start scraping&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>javascript</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to scrape dynamic websites with Python</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 07 Jun 2024 07:27:24 +0000</pubDate>
      <link>https://dev.to/apify/how-to-scrape-dynamic-websites-with-python-h0m</link>
      <guid>https://dev.to/apify/how-to-scrape-dynamic-websites-with-python-h0m</guid>
      <description>&lt;p&gt;Scraping dynamic websites that load content through JavaScript after the initial page load can be a pain in the neck, as the data you want to scrape may not exist in the raw HTML source code.&lt;/p&gt;

&lt;p&gt;I'm here to help you with that problem.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to scrape dynamic websites with Python and Playwright. By the end, you'll know how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up and install Playwright&lt;/li&gt;
&lt;li&gt;Create a browser instance&lt;/li&gt;
&lt;li&gt;Navigate to the page&lt;/li&gt;
&lt;li&gt;Interact with the page&lt;/li&gt;
&lt;li&gt;Scrape the data you need&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are dynamic websites?
&lt;/h2&gt;

&lt;p&gt;Dynamic websites load content dynamically using client-side scripting languages like JavaScript. Unlike static websites, where the content is pre-rendered on the server, dynamic websites generate content on the fly based on user interactions, data fetched from APIs, or other dynamic sources. This makes them more complex to scrape compared to static websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the difference between a dynamic and static web page?
&lt;/h2&gt;

&lt;p&gt;Static web pages are pre-rendered on the server and delivered as complete HTML files. Their content is fixed and does not change unless the underlying HTML file is modified. Dynamic web pages, on the other hand, generate content on-the-fly using client-side scripting languages like JavaScript.&lt;/p&gt;

&lt;p&gt;Dynamic content is often generated using JavaScript frameworks and libraries like React, Angular, and Vue.js, which manipulate the Document Object Model (DOM) based on user interactions or data fetched from APIs using technologies like AJAX (Asynchronous JavaScript and XML). This dynamic content is not initially present in the HTML source code and requires additional processing to be captured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Libraries for Scraping Dynamic Content
&lt;/h2&gt;

&lt;p&gt;To scrape dynamic content, you need tools that can execute JavaScript and interact with web pages like a real browser. One such tool is Playwright, a Python library for automating Chromium, Firefox, and WebKit browsers. Playwright allows you to simulate user interactions, execute JavaScript, and capture the resulting DOM changes.&lt;/p&gt;

&lt;p&gt;In addition to Playwright, you may also need libraries like BeautifulSoup for parsing HTML and extracting relevant data from the rendered DOM.&lt;/p&gt;
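&lt;p&gt;For example, one common pattern is to let Playwright render the page and then hand the resulting HTML to BeautifulSoup for parsing. Here's a minimal sketch (the URL and selector are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    # Grab the fully rendered HTML after JavaScript has run
    html = page.content()
    browser.close()

# Parse the rendered DOM with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
print(titles)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;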

&lt;h2&gt;
  
  
  Step-by-Step Guide to Using Playwright
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Setup and Installation&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Install the Python Playwright library: &lt;code&gt;pip install playwright&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the required browser binaries (e.g., Chromium): &lt;code&gt;playwright install chromium&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping a Dynamically-loaded Website&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Import the necessary Playwright modules and create a browser instance.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from Playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Launch a new browser context and create a new page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    page = browser.new_page()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Navigate to the target website.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;page.goto("&amp;lt;https://example.com/infinite-scroll&amp;gt;")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Interact with the page as needed (e.g., scroll, click buttons, fill forms) to trigger dynamic content loading.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
# Scroll to the bottom to load more content
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    new_content_loaded = page.wait_for_selector(".new-content", timeout=1000)
    if not new_content_loaded:
        break
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Wait for the desired content to load using Playwright's built-in wait mechanisms.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
new_content_loaded = page.wait_for_selector(".new-content", timeout=1000)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Extract the desired data from the rendered DOM using Playwright's evaluation mechanisms or in combination with BeautifulSoup.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
content = page.inner_html("body")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here's the complete example of scraping an infinite scrolling page using Playwright:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
from Playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a new Chromium browser instance
    browser = p.chromium.launch()

    # Create a new page object
    page = browser.new_page()

    # Navigate to the target website with infinite scrolling
    page.goto("&amp;lt;https://example.com/infinite-scroll&amp;gt;")

    # Scroll to the bottom to load more content
    while True:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load (timeout after 1 second)
        new_content_loaded = page.wait_for_selector(".new-content", timeout=1000) # Check for a specific class

        # If no new content is loaded, break out of the loop
        if not new_content_loaded:
            break

    # Extract the desired data from the rendered DOM
    content = page.inner_html("body")

    # Close the browser instance
    browser.close()
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;Web scraping dynamic content can present several challenges, such as handling CAPTCHAs, IP bans, and other anti-scraping measures implemented by websites. Here are some common solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CAPTCHAs&lt;/strong&gt;: CAPTCHAs can be handled by wiring third-party solving services or custom solutions into your Playwright workflow. You can leverage libraries like &lt;code&gt;python-anticaptchacloud&lt;/code&gt; or &lt;code&gt;python-anti-captcha&lt;/code&gt; to solve CAPTCHAs programmatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP bans&lt;/strong&gt;: Use rotating proxies or headless browsers to avoid IP bans and mimic real user behavior. Libraries like &lt;code&gt;requests-html&lt;/code&gt; and &lt;code&gt;selenium&lt;/code&gt; can be used in conjunction with proxy services like Bright Data or Oxylabs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-scraping measures&lt;/strong&gt;: Implement techniques like randomized delays, user agent rotation, and other tactics to make your scraper less detectable (see the sketch after this list). Libraries like &lt;code&gt;fake-useragent&lt;/code&gt; and &lt;code&gt;scrapy-fake-useragent&lt;/code&gt; can help with user agent rotation.&lt;/li&gt;
&lt;/ul&gt;
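&lt;p&gt;As a rough sketch of that last point, here's one way to combine randomized delays with user agent rotation in Playwright using &lt;code&gt;fake-useragent&lt;/code&gt; (the URLs are placeholders, and whether this is sufficient depends entirely on the target site):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import time

from fake_useragent import UserAgent
from playwright.sync_api import sync_playwright

ua = UserAgent()

with sync_playwright() as p:
    browser = p.chromium.launch()

    for url in ["https://example.com/page/1", "https://example.com/page/2"]:
        # New context with a randomly chosen user agent for each visit
        context = browser.new_context(user_agent=ua.random)
        page = context.new_page()
        page.goto(url)

        # ... extract data here ...

        context.close()

        # Randomized delay between requests to look less like a bot
        time.sleep(random.uniform(2, 6))

    browser.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;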

&lt;h3&gt;
  
  
  Summary and Next Steps
&lt;/h3&gt;

&lt;p&gt;Scraping dynamic websites requires tools that can execute JavaScript and interact with web pages like a real browser. Playwright is a powerful Python library that enables you to automate Chromium, Firefox, and WebKit browsers, making it suitable for scraping dynamic content.&lt;/p&gt;

&lt;p&gt;However, it's essential to understand that web scraping dynamic content can be more challenging than scraping static websites due to anti-scraping measures implemented by websites. You may need to employ additional techniques like rotating proxies, handling CAPTCHAs, and mimicking real user behavior to avoid detection and ensure successful scraping.&lt;/p&gt;

&lt;p&gt;For further learning and additional resources, consider exploring &lt;a href="https://playwright.dev/python/docs/intro" rel="noopener noreferrer"&gt;Playwright's official documentation&lt;/a&gt; or one of our more in-depth tutorials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/playwright-web-scraping/" rel="noopener noreferrer"&gt;Playwright web scraping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/python-playwright/" rel="noopener noreferrer"&gt;Python Playwright: a complete guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>beginners</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Web scraping in 2024: breakthroughs and challenges ahead</title>
      <dc:creator>Natasha Lekh</dc:creator>
      <pubDate>Sun, 28 Jan 2024 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-in-2024-breakthroughs-and-challenges-ahead-1kel</link>
      <guid>https://dev.to/apify/web-scraping-in-2024-breakthroughs-and-challenges-ahead-1kel</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was first published on December 15, 2023, and updated on January 29, 2024, to reflect recent updates in the legal landscape of web scraping.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;How did 2023 treat the web scraping industry? Let's take a short walk through the bad, the good, and the different of yesteryear. Welcome to a summary of the key events and trends that emerged in 2023, setting the stage for the landscape of 2024.&lt;/p&gt;

&lt;p&gt;🎄 &lt;strong&gt;Want to compare to what web scraping was like in 2022?&lt;/strong&gt; &lt;a href="https://blog.apify.com/future-of-web-scraping-in-2023/" rel="noopener noreferrer"&gt;&lt;strong&gt;Check out our overview from the last year&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🧑 Irony of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The year started off funny. In 2022, Meta &lt;a href="https://blog.apify.com/future-of-web-scraping-in-2023/#%F0%9F%A7%91%E2%80%8D%E2%9A%96-legal-developments" rel="noopener noreferrer"&gt;was very keen on suing individuals and companies for web scraping&lt;/a&gt;; in 2023, it continued to zero in even on its recent allies. The culprit in question, Bright Data, got &lt;a href="https://www.theregister.com/2023/02/02/meta_web_scraping/" rel="noopener noreferrer"&gt;sued by Facebook for scraping Facebook data&lt;/a&gt;. The trick is that Facebook had previously been using Bright Data's services for scraping data (just from other websites). Essentially, Meta inadvertently revealed its practice of collecting data from other websites through its lawsuit against a firm it employed for this very purpose. Quite the web scraping ouroboros. This situation once more highlighted the two sides of an age-old industry question: who really owns publicly accessible data, and is it okay to gather it?&lt;/p&gt;

&lt;p&gt;🆕 In 2024, the &lt;a href="https://techcrunch.com/2024/01/24/court-rules-in-favor-of-a-web-scraper-bright-data-which-meta-had-used-and-then-sued/" rel="noopener noreferrer"&gt;court ruled against Meta and in favor of web scraping&lt;/a&gt;. The judge dismissed Meta's breach of contract claim, arguing that even though Bright Data had accepted the terms of service of Facebook and Instagram, the company was not acting as a "user" of the services when it was scraping but only as a logged-out "visitor," who is not bound by the terms.&lt;/p&gt;

&lt;p&gt;In a cruel twist of fate, later last year, Meta got &lt;a href="https://qz.com/meta-s-new-record-setting-eu-fine-is-nearly-as-big-as-i-1850461159" rel="noopener noreferrer"&gt;another billion-sized fine&lt;/a&gt; (as big as the previous 6 combined, apparently) from the Irish DPC for not protecting the data of EU citizens from surveillance. The Irish Data Protection Commission and EU are not playing when it comes to data privacy. The penalty relates to an inquiry that was opened by the DPC &lt;a href="https://curia.europa.eu/juris/fiche.jsf?id=C%3B311%3B18%3BRP%3B1%3BP%3B1%3BC2018%2F0311%2FJ" rel="noopener noreferrer"&gt;back in 2020&lt;/a&gt;. And it seems like in 2024 Meta will be facing several other lawsuits regarding ad space and its pay-or-consent policy, this time &lt;a href="https://www.theregister.com/2023/12/05/spanish_media_meta_lawsuit/?td=keepreading" rel="noopener noreferrer"&gt;from Spanish media&lt;/a&gt; and &lt;a href="https://noyb.eu/en/noyb-files-gdpr-complaint-against-meta-over-pay-or-okay" rel="noopener noreferrer"&gt;Austrian data protection authority&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The pack of plaintiffs claiming violations of terms and conditions was extended by a new member, with Air Canada filing suit against travel search site &lt;em&gt;seats.aero&lt;/em&gt; in a &lt;a href="https://storage.courtlistener.com/recap/gov.uscourts.ded.83894/gov.uscourts.ded.83894.1.0_1.pdf" rel="noopener noreferrer"&gt;similar case&lt;/a&gt;, alleging unlawful scraping of its website in violation of its terms and conditions. Interestingly, however, Air Canada also claims a breach of criminal law under the Computer Fraud and Abuse Act (CFAA). This move could signal that, although courts have dismissed claims on these grounds in the past, in Van Buren (2021) and the hiQ ruling of April 2022 that built on it, the CFAA has still not lost its allure for websites wanting to sue web scraping companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;👀 The non-event of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are always a few things in life to be grateful for because they did not happen: the extinction of the bees, the eruption of Yellowstone Volcano, and the Google WEI Proposal. The Web Environment Integrity (WEI) proposal, which was pushed by Google, was eventually &lt;a href="https://www.theregister.com/2023/11/02/google_abandons_web_environment_integrity/" rel="noopener noreferrer"&gt;abandoned&lt;/a&gt; this year, not least due to the protests of defenders of the free web (see the screenshot of &lt;a href="https://github.com/explainers-by-googlers/Web-Environment-Integrity/issues?q=is%3Aissue+is%3Aclosed" rel="noopener noreferrer"&gt;explainers-by-googlers&lt;/a&gt; issues below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw1zcpyp51whsn2g62h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw1zcpyp51whsn2g62h5.png" alt="Issues in explainers-by-googlers after the announcement of the WEI Proposal" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Issues in &lt;a href="https://github.com/explainers-by-googlers/Web-Environment-Integrity/issues?q=is%3Aissue+is%3Aclosed" rel="noopener noreferrer"&gt;explainers-by-googlers&lt;/a&gt; after the announcement of the WEI Proposal&lt;/p&gt;

&lt;p&gt;Google was &lt;a href="https://github.com/explainers-by-googlers/Web-Environment-Integrity/blob/main/explainer.md" rel="noopener noreferrer"&gt;trying to follow&lt;/a&gt; the likes of Apple and replace CAPTCHAs with a digitally signed token (an API containing a digitally signed token, to be precise). The reason seemed innocuous: to help separate real users from bot users and real traffic from bot traffic, and to limit online fraud and abuse, all without enabling privacy issues like cross-site tracking or browser fingerprinting. Sounds like a dream, right?&lt;/p&gt;

&lt;p&gt;However, while it might aid in reducing ad fraud, Google's proposed method of authentication also carries the risk of curtailing web freedom by allowing websites or third parties to directly influence the choice of browsers and software used by visitors. It could also potentially lead to misuses, such as rejecting visitors using certain tools like ad blockers or download managers.&lt;/p&gt;

&lt;p&gt;Besides, Google intended to implement the Web Environment Integrity API in Chromium, the open-source base for Chrome and several other browsers, excluding Firefox and Safari. This, in comparison, makes Apple's &lt;a href="https://developer.apple.com/videos/play/wwdc2022/10077/" rel="noopener noreferrer"&gt;Private Access Token&lt;/a&gt; seem way less dangerous, not least because Safari has a much smaller browser market share than Chrome.&lt;/p&gt;

&lt;p&gt;The drawbacks were quickly &lt;a href="https://news.ycombinator.com/item?id=36875226" rel="noopener noreferrer"&gt;noticed&lt;/a&gt; by the open web proponents in the tech community. Critics quickly recognized the potential for this to evolve into a kind of digital rights/restriction management for the web. They also highlighted that this change would wildly benefit the ad companies but create high risks of disadvantaging the users. It would also make scraping and web automation activities significantly harder. Well, for everyone except for Google, of course.&lt;/p&gt;

&lt;p&gt;The rejection of WEI by the tech community again highlights the importance of maintaining an open and accessible web.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🁫 First domino of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Scraping social media is the most common web scraping use case. In the old internet days, websites kept their APIs free and accessible, and even if they backed down from that, they often left a free version for the developers. The year started with X (Twitter)'s move to a &lt;a href="https://techcrunch.com/2023/02/01/twitter-to-end-free-access-to-its-api/" rel="noopener noreferrer"&gt;paid API model&lt;/a&gt;, which meant discontinuing free access even for developers. A few months later, &lt;a href="https://techcrunch.com/2023/04/18/reddit-will-begin-charging-for-access-to-its-api/" rel="noopener noreferrer"&gt;Reddit followed suit&lt;/a&gt; with its API transition to a paid model which caused significant uproar and &lt;a href="https://gizmodo.com/reddit-news-blackout-protest-is-finally-over-reddit-won-1850707509" rel="noopener noreferrer"&gt;protests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;X's API policy changes might have contributed to the more frequent occurrences of Twitter scraping. With many projects forced to shut down due to the three price tiers, it's very likely that some developers had to turn to web scraping and browser automation as an alternative. We tried to keep up with these changes ourselves as providers of a more affordable &lt;a href="https://apify.com/quacker/twitter-scraper" rel="noopener noreferrer"&gt;Twitter API&lt;/a&gt; and &lt;a href="https://apify.com/trudax/reddit-scraper-lite" rel="noopener noreferrer"&gt;Reddit API&lt;/a&gt;. But it's becoming increasingly difficult or inconvenient to scrape these websites without a reliable infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;👺 Troublemaker of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Last year the web scraping case law &lt;a href="https://blog.apify.com/developments-in-hiq-v-linkedin-case/#the-district-courts-judgment-of-october-27-2022" rel="noopener noreferrer"&gt;made strides&lt;/a&gt; with the hiQ vs. LinkedIn case. 2023 had been rather calm on the legal side of things, if not for one particular persona. If, in 2022, Meta was the one suing individuals and companies for harvesting data, this year was a debut for X (Twitter). To be fair, the year 2023 was a debut for a lot of things for Twitter, but let's focus on the thing in question.&lt;/p&gt;

&lt;p&gt;Elon Musk, the tech mogul, made headlines with public promises to take legal action against web scraping companies. This move was &lt;a href="https://techcrunch.com/2023/07/05/twitter-silently-removes-login-requirement-for-viewing-tweets/" rel="noopener noreferrer"&gt;followed&lt;/a&gt; by X adding and then silently removing the login requirement for viewing posts (tweets) and following through with the promise by initiating lawsuits against 4 unknown individuals. And before all that, Musk had made the Twitter API paid. But let's take it step by step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqltpfhefxw85ehyaycw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqltpfhefxw85ehyaycw0.png" width="720" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgao2tfnr5te5ph15g338.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgao2tfnr5te5ph15g338.png" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In July 2023, Elon Musk (well, X Corp, if we're being precise) gave us all some heat by &lt;a href="https://www.theverge.com/2023/7/13/23794163/elon-musk-lawsuit-data-scraping-twitter-x-corp" rel="noopener noreferrer"&gt;initiating legal action&lt;/a&gt; against four anonymous entities who were scraping Twitter. Apparently, the four defendants overwhelmed Twitter's registration page with automated requests to such an extent that it caused a significant server strain and disruption of service for users. The culprits are accused of overburdening Twitter's servers, diminishing user experience, and profiting unjustly at the company's expense.&lt;/p&gt;

&lt;p&gt;And as a regular cherry on top, the lawsuit further accuses them of scraping Twitter user data in breach of the platform's user agreement. These days, breach of Terms of Service has become companies' &lt;a href="https://blog.apify.com/future-of-web-scraping-in-2023/#%F0%9F%A7%91%E2%80%8D%E2%9A%96-legal-developments" rel="noopener noreferrer"&gt;favorite reference&lt;/a&gt; when instigating lawsuits against web scraping, seconded only by scraping data for large language model training, a concern raised by Elon Musk as well. Despite these latter allegations, he did confirm that his recently launched firm, xAI, &lt;a href="https://techcrunch.com/2023/09/01/xs-privacy-policy-confirms-it-will-use-public-data-to-train-ai-models/" rel="noopener noreferrer"&gt;will use X posts&lt;/a&gt; for training purposes. So go figure.&lt;/p&gt;

&lt;p&gt;The lawsuit suggests that the intensive data scraping led to such severe performance issues that X had to enforce a login requirement for access for everyone. Users are now required to have an account to view tweets and must subscribe to Twitter Blue's "verified" service to see over 600 posts per day.&lt;/p&gt;

&lt;p&gt;Now, we don't know for sure whether the AI data scraping was so intense that it could have impacted the website as much. However, this lawsuit and the argumentation behind it raise concerns about the potential for misrepresenting ethical data scraping practices, especially companies that adhere to legal and ethical standards in data collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/" rel="noopener noreferrer"&gt;&lt;strong&gt;What is ethical web scraping and how do you do it? 5 principles of web scraping ethics&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/enforceability-of-terms-of-use/" rel="noopener noreferrer"&gt;&lt;strong&gt;Are website terms of use enforced?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://podcasts.apple.com/us/podcast/responsible-web-scraping-challenges-and-approaches/id1660735956?i=1000593712286" rel="noopener noreferrer"&gt;&lt;strong&gt;Ethical data, Explained. Responsible web scraping: challenges and approaches.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;📈 Trend of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;AI brings a new way to easily process large amounts of data, something that previously required developing complex, specialized machine learning models. These days anybody can do, for instance, sentiment analysis with LLMs.&lt;/p&gt;

&lt;p&gt;Marek Trunkát, CTO of Apify&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Web scraping really became a household term after the waves caused by ChatGPT and OpenAI this year. Why? Because web scraping was heavily involved in the training process. In Google Trends, among the regular adjacent topics such as point-and-click or proxy, we see AI. And this trend is here to stay.&lt;/p&gt;

&lt;p&gt;We were happy to observe that making a one-off regular web scraper using AI is so easy these days. The AI hype makes it seem simple and accessible even without coding knowledge. It's the reliability and continuity of scraping that the AI cannot guarantee, especially with websites employing their own blocking measures based on AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyme4f2h31ogki8iqiya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyme4f2h31ogki8iqiya.png" alt="AI is the adjacent trend of the year in web scraping" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI is the adjacent trend of the year in web scraping&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🦾 AI and the hunt for data&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The AI revolution of 2023 only underscored the already growing need for data from the web. All large language models (LLMs) like GPT-4 and LLaMA-2 were trained on data scraped from the web. As demand for AI and LLM applications continues to grow, so will the demand for web scraping and data extraction.&lt;/p&gt;

&lt;p&gt;Jan Čurn, Apify Founder &amp;amp; CEO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://www.fastcompany.com/90884581/what-is-a-large-language-model" rel="noopener noreferrer"&gt;large language models&lt;/a&gt; that power ChatGPT and other AI chatbots get their mastery of language from essentially two things: massive amounts of training data scraped from the web and massive amounts of compute power to learn from that data. That second ingredient is very expensive, but the first ingredient, so far, has been completely free.&lt;/p&gt;

&lt;p&gt;However, creators, publishers, and businesses increasingly see the data they put on the web as their property. If some tech company wants to use it to train its LLMs, they want to &lt;a href="https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html" rel="noopener noreferrer"&gt;be paid&lt;/a&gt;. Just ask the Associated Press, which struck a training data licensing deal with OpenAI. Meanwhile, X (née Twitter) has &lt;a href="https://cybernews.com/news/twitter-blocks-non-users-reading-tweets-ai-scraping/" rel="noopener noreferrer"&gt;taken steps&lt;/a&gt; to block AI companies from scraping content on the platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web data and RAG&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The knowledge of LLMs is limited to the public data they were trained on. Building AI applications that can retrieve proprietary data or public data introduced after a model's cutoff date and generate content based on it requires augmenting the knowledge of the model with specific information. That process is known as retrieval-augmented generation (RAG), and it has revolutionized search and information retrieval.&lt;/p&gt;

&lt;p&gt;While the likes of LangChain and LlamaIndex swiftly took center stage in this field, web scraping (being the most efficient way to collect web data) has remained a significant part of RAG solutions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To work around the training data cutoff problem and provide models with up-to-date knowledge, LLM applications often need to extract data from the web. This so-called retrieval-augmented generation (RAG) is what gives LLMs their superpowers, and arguably it is the strongest use case of LLMs.&lt;/p&gt;

&lt;p&gt;Jan Čurn, Apify Founder &amp;amp; CEO&lt;/p&gt;
&lt;/blockquote&gt;
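<br/>
&lt;p&gt;Stripped of any particular framework, the RAG loop itself is conceptually simple. In the sketch below, &lt;code&gt;index.search&lt;/code&gt; and &lt;code&gt;llm.complete&lt;/code&gt; are hypothetical stand-ins for whatever vector store and LLM client you happen to use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def retrieve(query, index, k=3):
    # Return the k most relevant documents (e.g., scraped web pages) for the query
    return index.search(query, top_k=k)

def answer_with_rag(query, index, llm):
    docs = retrieve(query, index)
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    # The retrieved context augments the model's built-in knowledge
    return llm.complete(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;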

&lt;h3&gt;
  
  
  &lt;strong&gt;Adding data to custom GPTs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OpenAI launched GPTs (custom versions of ChatGPT) in November 2023. This was a really big deal, as suddenly, everyone had the means to build their own AI models. These GPTs can be customized not only with instructions but also with extra knowledge (by uploading files) and a combination of skills (with API specifications). In other words, you can give such GPTs web scraping capabilities with the right specs or scrape websites to upload knowledge to a GPT so it can base generated content on that information.&lt;/p&gt;

&lt;p&gt;The hype around GPTs was quickly replaced by a huge furore around the firing and return of OpenAI's CEO. As a result, the debut of the GPT Store, which lets users monetize their GPTs, was postponed; it finally launched in early 2024.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;EU AI Act represents break-through legislation for AI and web scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the global shake-up in the world of personal data protection represented by GDPR, the EU reached a provisional agreement on the EU AI Act, which has similar ambitions for the world of artificial intelligence as GDPR had for personal data. Hailed by EU officials as &lt;em&gt;global first&lt;/em&gt; and &lt;em&gt;historic&lt;/em&gt;, the Act positions the EU as a frontrunner in the field of AI regulation.&lt;/p&gt;

&lt;p&gt;The EU adopted a risk-based approach, dividing AI systems into four categories: (1) unacceptable risk, (2) high risk, (3) limited risk, and (4) minimal/no risk.&lt;/p&gt;

&lt;p&gt;Firstly, the unacceptable risk category covers AI systems that contravene EU values and are considered a threat to fundamental rights. These systems will be banned entirely. Among others, this category will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;biometric categorization systems that use sensitive characteristics (e.g., political, religious, philosophical beliefs, sexual orientation, race, etc.); &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;untargeted scraping of facial images from the Internet or CCTV footage to create facial recognition databases; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;emotion recognition, social scoring, AI systems manipulating human behavior, or exploiting vulnerabilities of people (due to their age, disability, social or economic situation, etc.). &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the EU regulators incorporated several exceptions to using AI systems in this category, such as the use of biometric identification systems for law enforcement purposes, which will be subject to prior judicial authorization and only for a strictly defined list of crimes.&lt;/p&gt;

&lt;p&gt;Secondly, the Act will include some AI systems in the high-risk category due to their significant potential harm to health, safety, fundamental rights, the environment, democracy, and the rule of law. Among others, this category will include AI systems in the field of medical devices, certain critical infrastructure, systems used to influence the outcome of elections or voter behavior, and more. These systems will be subject to comprehensive mandatory compliance obligations, such as fundamental rights impact assessment, conducting model evaluations and testing, reporting serious incidents, etc.&lt;/p&gt;

&lt;p&gt;Thirdly, the AI systems classified as limited risk, such as chatbots, will be subject to minimal obligations, such as the requirement to inform users that they are interacting with an AI system and the obligation to mark the image, audio, or video content generated by AI.&lt;/p&gt;

&lt;p&gt;Lastly, all AI systems not classified in one of the other three categories will be classified as minimal/no risk. The Act allows for the free use of minimal and no-risk AI systems, with voluntary codes of conduct encouraged.&lt;/p&gt;

&lt;p&gt;Violations of the Act will be subject to fines, depending on the type of AI system, the size of the company, and the severity of the infringement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-use-langchain/" rel="noopener noreferrer"&gt;How to use LangChain&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;What is retrieval-augmented generation?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;How to create a custom AI chatbot&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/llamaindex-vs-langchain/" rel="noopener noreferrer"&gt;LlamaIndex vs. LangChain&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/add-custom-actions-to-your-gpts/" rel="noopener noreferrer"&gt;How to add custom actions to GPTs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/ai-web-scraping-trends-predictions/" rel="noopener noreferrer"&gt;AI and web scraping in 2024: trends and predictions&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-do-question-answering-from-a-pdf/" rel="noopener noreferrer"&gt;How to do question answering from a PDF&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;How to collect data for LLMs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;How Intercom uses Apify to feed web data to its AI chatbot&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🌟 Apify's contributions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Of course, we could not pass up an opportunity to contribute to the party. For the 8 years that Apify has been on the market, from the &lt;a href="https://blog.apify.com/our-experience-of-the-inaugural-y-combinator-fellowship-yc-f1-309cdcd021df/#.xh3dw7bzg" rel="noopener noreferrer"&gt;early days in Y Combinator&lt;/a&gt; through the transition from &lt;a href="https://blog.apify.com/apifier-is-now-apify/" rel="noopener noreferrer"&gt;Apifier&lt;/a&gt; to now, it's been our goal to develop the cloud computing platform for automation and web scraping tools. So here's what we did this year to come a little bit closer to that goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Support for Python users: SDK, code templates, and Scrapy spiders&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We started the year off pretty strong by taking a significant and probably unexpected step forward. In March 2023 (on Pi Day, to be precise), we launched the &lt;a href="https://blog.apify.com/apify-python-sdk/" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; to expand our toolset for Python developers. Now, if you know anything about Apify, you know that we have traditionally been on the Node.js/JavaScript side of things. But things change, and so do the market and the requests from our users. Being a start-up means venturing in different directions and trying different things when the situation calls for it. And since we consistently work on becoming the platform for web scraping and automation, launching a library for Python developers, giving them something to start from, just made sense.&lt;/p&gt;

&lt;p&gt;As a follow-up step, we've rolled out web scraping templates aimed to simplify and improve the developer experience on our platform. We've realized not everyone wants to use ready-made tools in the &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Store&lt;/a&gt; or have complete control over every single aspect when building a scraper like with &lt;a href="https://crawlee.dev/?%20__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1713784601604.1713788960146.2&amp;amp;__%20hssc=160404322.1.1713788960146&amp;amp;__hsfp=3439275840" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;. &lt;a href="https://apify.com/templates" rel="noopener noreferrer"&gt;Web scraping templates&lt;/a&gt; seemed like a great third option, so here they are: in JavaScript, TypeScript, and Python. We've also launched the &lt;a href="https://apify.com/pricing/creator-plan" rel="noopener noreferrer"&gt;$1/month Creator Plan&lt;/a&gt; to support our most avid and enthusiastic users who are interested in building Actors.&lt;/p&gt;

&lt;p&gt;Last but not least, we've made it possible to &lt;a href="https://apify.com/run-scrapy-in-cloud" rel="noopener noreferrer"&gt;deploy Scrapy spiders to our cloud&lt;/a&gt; platform. All you have to do is use a Scrapy wrapper. The platform provides proxies and API and allows our Python users to run, schedule, monitor, and monetize their spiders.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Store and community growth&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This year we've had to deal with unprecedented interest and growth of our Actors published in Store. The number of users engaging with &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Public Actors in Store&lt;/a&gt; &lt;strong&gt;has doubled&lt;/strong&gt; , soaring from 8,971 to 17,070. In terms of new contributions, we've seen a significant influx, with &lt;strong&gt;657 new Actors&lt;/strong&gt; being published this year, a substantial increase compared to the 290 in 2022. Moreover, our community has been enriched by the addition of &lt;strong&gt;96 new community developers&lt;/strong&gt; , who have joined us with their Public Actors, doubling the number from the 48 who joined in 2022. This growth not only reflects the rising popularity of our platform but also underscores the expanding ecosystem for web scraping and automation we're building together.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;New integrations and AI ventures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We've launched integrations with &lt;a href="https://llamahub.ai/l/apify-actor" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; and &lt;a href="https://help.apify.com/en/articles/7888045-how-to-integrate-langchain-with-apify-actors" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, marking a notable expansion in its collaboration network. These integrations mean you can load scraped datasets directly into LangChain or LlamaIndex vector indexes and build AI chatbots such as &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;Intercom's Fin&lt;/a&gt; or other apps that query text data crawled from websites.&lt;/p&gt;
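<br/>
&lt;p&gt;For instance, the LangChain integration lets you run an Actor and map its dataset items straight into documents, roughly as in the sketch below. Import paths and signatures vary between LangChain versions, so treat this as a sketch rather than copy-paste code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()  # expects APIFY_API_TOKEN in the environment

# Run an Actor and map each dataset item to a LangChain Document
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "",
        metadata={"source": item["url"]},
    ),
)

# The loader yields documents ready to be pushed into a vector index
docs = loader.load()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;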

&lt;p&gt;We've also introduced 3 AI tools in our Store to help fuel large language models and the like: &lt;a href="https://apify.com/drobnikj/gpt-scraper" rel="noopener noreferrer"&gt;GPT Scraper&lt;/a&gt; and &lt;a href="https://apify.com/drobnikj/extended-gpt-scraper" rel="noopener noreferrer"&gt;Extended GPT Scraper&lt;/a&gt;, &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;, and &lt;a href="https://apify.com/apify/ai-web-agent" rel="noopener noreferrer"&gt;AI Web Agent&lt;/a&gt;. Last but not least, we've launched a web scraping solution that isn't LLM-related but nevertheless has AI at its core, &lt;a href="https://apify.com/equidem/ai-product-matcher" rel="noopener noreferrer"&gt;AI Product Matcher&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Blog and YouTube&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Regarding content, you may have noticed that our blog has switched to a more technical approach, as have our &lt;a href="https://www.youtube.com/channel/UCTgwcoeGGKmZ3zzCXN2qo_A" rel="noopener noreferrer"&gt;YouTube tutorials&lt;/a&gt;. We've also recorded our first &lt;a href="https://www.youtube.com/channel/UCTgwcoeGGKmZ3zzCXN2qo_A" rel="noopener noreferrer"&gt;podcast about the legality of web scraping&lt;/a&gt;. We've held &lt;a href="https://www.youtube.com/@Apify/streams" rel="noopener noreferrer"&gt;three webinars&lt;/a&gt; on pretty extensive topics and experimented with posting &lt;a href="https://www.youtube.com/@Apify/shorts" rel="noopener noreferrer"&gt;Shorts&lt;/a&gt;. Our internal user engagement is as strong as ever, with our newsletter reaching over 68K people every month at around a 65% open rate. You can now subscribe to an &lt;a href="https://www.linkedin.com/newsletters/pro-web-scraping-7133073105995845632/" rel="noopener noreferrer"&gt;online version of it on LinkedIn&lt;/a&gt; if you don't like your inbox crowded.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify platform&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our crown jewel, the Apify platform, is evolving day by day, not only design and UX-wise but also functionality-wise. We are currently working on a video of a new tour of Apify that will showcase all the new features and changes made this past year. But for now, here's something to look back on and appreciate the progress:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you in the new year!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>Groupon reaches new merchants thanks to web data collection</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Mon, 04 Dec 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/groupon-reaches-new-merchants-thanks-to-web-data-collection-bb5</link>
      <guid>https://dev.to/apify/groupon-reaches-new-merchants-thanks-to-web-data-collection-bb5</guid>
      <description>&lt;p&gt;&lt;a href="https://www.groupon.com/" rel="noopener noreferrer"&gt;Groupon&lt;/a&gt; (NASDAQ: &lt;a href="https://www.nasdaq.com/market-activity/stocks/grpn" rel="noopener noreferrer"&gt;GRPN&lt;/a&gt;) is the worlds most popular marketplace to find deals for activities, travel, goods, and services offered by local merchants in hundreds of cities around the globe. Groupon, originally meant as "group" + "coupon, was founded on the idea that the collective bargaining power of a large number of people can get them better deals than they could get individually.&lt;/p&gt;

&lt;p&gt;In March 2023, Dušan Šenkypl from &lt;a href="https://palefirecapital.com/" rel="noopener noreferrer"&gt;Pale Fire Capital&lt;/a&gt; became Groupon's new interim CEO and set an ambitious goal to rapidly expand the business by reaching new merchants and thus offering more deals to consumers. Recognizing the potential of web data to find new leads and enrich existing ones, Šenkypl turned to &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; to leverage its expertise in web data collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4eu0vqc3kntsp963sjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4eu0vqc3kntsp963sjw.jpg" alt="Groupon is using web data collection for smart lead generation at scale" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Groupon is using web data collection for smart lead generation at scale&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Groupon was looking for a way to update information about existing merchants, as well as find new ones to ask them to join the network. Such information can be found on search engines, travel sites, online maps, and various other websites.&lt;/p&gt;

&lt;p&gt;The web data-based lead generation and enrichment pipeline had to provide accurate and up-to-date data about tens of thousands of businesses and seamlessly integrate into Groupon's existing Salesforce CRM platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The solution&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apify operates a cloud platform that provides serverless computation, data storage, proxies, open-source SDKs, and hundreds of &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;ready-made web scraping Actors&lt;/a&gt; built by community developers. Apify's Enterprise solutions team helped Groupon set up various Actors to extract the required data and run them at scale in the cloud.&lt;/p&gt;
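<br/>
&lt;p&gt;The specific Actors and inputs used for Groupon aren't public, but running an Actor and reading its results with the Apify Python client generally follows the same pattern, sketched here with placeholder names and inputs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from apify_client import ApifyClient

client = ApifyClient("MY-APIFY-TOKEN")  # placeholder token

# Start an Actor run and wait for it to finish (Actor name and input are illustrative;
# every Actor defines its own input schema)
run = client.actor("username/some-actor").call(
    run_input={"startUrls": [{"url": "https://example.com"}]},
)

# Iterate over the items the run stored in its default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;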

&lt;p&gt;To ensure the data fits into Groupon's specific Salesforce implementation, Apify built a new Actor to filter, organize, and match the business data. Thanks to the modularity of the Apify platform, this custom solution was prepared in a very short time, helping Groupon reach new merchants faster than with other solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The outcome&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Groupon's sales team now has a rich database filled with potential leads right at their fingertips. The automation of the entire data journey, from extraction to integration, translated into significant time savings, heightened efficiency, and, ultimately, a stronger position within the e-commerce space.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;*&lt;/em&gt;"We selected Apify because of their vast experience with web data collection. The project has been delivered on a short schedule, and our sales teams are now empowered with fresh, unique leads that drive targeted campaigns and strategic outreach."&lt;/p&gt;

&lt;p&gt;Filip Popovic, SVP Transformation &amp;amp; Product &amp;amp; HR at Groupon_*_&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical details&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The solution was composed of the following parts:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Configuring existing Actors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data extraction process commenced with a custom-designed Actor, &lt;strong&gt;&lt;em&gt;New Leads Runner&lt;/em&gt;&lt;/strong&gt;, delivered by Apify, to fine-tune Groupon's search criteria and ensure that the data sourced from other Actors is as relevant and targeted as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Mining business information&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After precise input preparation, Apify could pinpoint and collate business information aligning with Groupon's focus areas. This phase was not just about gathering data legally and ethically but doing so in a way that adhered to Groupon's stringent quality standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Ensuring data quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data duplication can be a significant issue when handling vast amounts of information. Thanks to Apify's &lt;a href="https://apify.com/lukaskrivka/dedup-datasets" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Merge, Dedup, and Transform Datasets&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; Actor, we could ensure that each business entry was unique by eliminating duplicates and contained the most relevant information by merging attributes from various sources.&lt;/p&gt;
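
&lt;p&gt;For illustration, here is a minimal sketch of how such a deduplication step could be invoked programmatically with the Apify JavaScript client. The input field names (&lt;code&gt;datasetIds&lt;/code&gt;, &lt;code&gt;fields&lt;/code&gt;) are assumptions made for the example - the Actor's actual input schema is the source of truth.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Call the dedup Actor and wait for the run to finish.
// NOTE: the input field names below are illustrative assumptions.
const run = await client.actor('lukaskrivka/dedup-datasets').call({
    datasetIds: ['DATASET_ID_1', 'DATASET_ID_2'],
    fields: ['name', 'address'],
});

// Read the deduplicated records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Deduplicated records: ${items.length}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;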

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Integrating data with Salesforce&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the lead generation pipeline was producing clean data, the next step was to integrate it into Groupon's existing CRM. With another custom-built Actor - &lt;strong&gt;&lt;em&gt;Salesforce Uploader&lt;/em&gt;&lt;/strong&gt; - Groupon could transfer its newfound leads into Salesforce. The uploader also cross-references the new data with existing entries to ensure that only new businesses are added.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Who are Groupon and Apify?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.groupon.com/" rel="noopener noreferrer"&gt;Groupon&lt;/a&gt; (NASDAQ: GRPN) is a global e-commerce marketplace based in Chicago that connects subscribers with local merchants.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; is a full-stack &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and browser automation platform. In addition to its vast range of pre-built data extraction tools, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;enterprise solutions&lt;/a&gt; with its team of experts who know how to handle the challenges of collecting data from arbitrary websites at scale.&lt;/p&gt;

</description>
      <category>casestudy</category>
      <category>webscraping</category>
      <category>data</category>
    </item>
    <item>
      <title>Crawlee data storage types: saving files, screenshots, and JSON results</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 27 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</link>
      <guid>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</guid>
      <description>&lt;p&gt;&lt;strong&gt;We're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a full-stack web scraping and browser automation platform. We are the maintainers of the open-source library&lt;/strong&gt; &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Crawlee&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing and storing the data you collect is a crucial part of any &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and data extraction project. It's often a complex task, especially when handling large datasets and ensuring output accuracy. Fortunately, Crawlee simplifies this process with its versatile storage types.&lt;/p&gt;

&lt;p&gt;In this article, we will look at Crawlee's storage types and demonstrate how they can make our lives easier when extracting data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up Crawlee&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a Crawlee project is straightforward, provided you &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;have Node&lt;/a&gt; and npm installed. To begin, create a new Crawlee project using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx crawlee create crawlee-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the command, you will be given a few template options to choose from. We will go with the CheerioCrawler JavaScript template. Remember, Crawlee's storage types are consistent across all crawlers, so the concepts we discuss here apply to any Crawlee crawler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" alt="Crawlee template options" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee template options&lt;/p&gt;

&lt;p&gt;Once installed, you'll find your new project in the &lt;code&gt;crawlee-data&lt;/code&gt; directory, ready with a template code that scrapes the &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;crawlee.dev&lt;/a&gt; website:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" alt="CheerioCrawler template code" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test it, simply run &lt;code&gt;npm start&lt;/code&gt; in your terminal. You'll notice a &lt;code&gt;storage&lt;/code&gt; folder appear with subfolders like &lt;code&gt;datasets&lt;/code&gt;, &lt;code&gt;key_value_stores&lt;/code&gt;, and &lt;code&gt;request_queues&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" alt="Crawlee storage" width="368" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee's storage can be divided into two categories: &lt;strong&gt;Request Storage (Request Queue and Request List)&lt;/strong&gt; and &lt;strong&gt;Results Storage (Datasets and Key Value Stores)&lt;/strong&gt;. Both are stored locally by default in the &lt;code&gt;./storage&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;Also, remember that Crawlee, by default, clears its storages before starting a crawler run. This action is taken to prevent old data from interfering with new crawling sessions. In case you need to clear the storages earlier than this, Crawlee provides a handy &lt;code&gt;purgeDefaultStorages()&lt;/code&gt; helper function for this purpose.&lt;/p&gt;
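
&lt;p&gt;As a small sketch, calling the helper explicitly looks like this (assuming a standard Crawlee project where storages live in &lt;code&gt;./storage&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { purgeDefaultStorages } from 'crawlee';

// Explicitly wipe the default local storages (datasets, key-value
// stores, request queues) before the crawler starts adding new data.
await purgeDefaultStorages();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;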

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request queue&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-queue" rel="noopener noreferrer"&gt;request queue&lt;/a&gt; is a storage of URLs to be crawled. It's particularly useful for deep crawling, where you start with a few URLs and then recursively follow links to other pages.&lt;/p&gt;

&lt;p&gt;Each Crawlee project run is associated with a default request queue, which is typically used to store URLs for that specific crawler run.&lt;/p&gt;

&lt;p&gt;To illustrate that, let's go to the &lt;code&gt;routes.js&lt;/code&gt; file in the template we just generated. There you will find the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ enqueueLinks, log }) =&amp;gt; { log.info(`enqueueing new URLs`); // Add links found on page to the queue await enqueueLinks({ globs: ['https://crawlee.dev/**'], label: 'detail', });});router.addHandler('detail', async ({ request, $, log, pushData }) =&amp;gt; { const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a closer look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function, particularly focusing on the &lt;code&gt;enqueueLinks&lt;/code&gt; function it contains. The &lt;code&gt;enqueueLinks&lt;/code&gt; function in Crawlee is designed to automatically detect all links on a page and add them to the request queue. However, its utility extends further as it allows us to specify certain options for more precise control over which links are added.&lt;/p&gt;

&lt;p&gt;For instance, in our example, we use the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#globs" rel="noopener noreferrer"&gt;&lt;strong&gt;globs&lt;/strong&gt;&lt;/a&gt; option to ensure that only links starting with &lt;code&gt;https://crawlee.dev/&lt;/code&gt; are queued. Furthermore, we assign a detail &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#globs" rel="noopener noreferrer"&gt;&lt;strong&gt;label&lt;/strong&gt;&lt;/a&gt; to these links. This labeling is particularly useful as it lets us refer to these links in subsequent handler functions, where we can define specific data extraction operations for pages associated with this label.&lt;/p&gt;

&lt;p&gt;💡 See all the available options for enqueueLinks in the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#label" rel="noopener noreferrer"&gt;Crawlee documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In line with our discussion on data storage types, we can now find all the links that our crawler has navigated through in the &lt;code&gt;request_queues&lt;/code&gt; storage, located within the crawler's &lt;code&gt;./storage/request_queues&lt;/code&gt; directory. Here, we can access detailed information about each request that has been processed in the request queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" alt="Request Queue" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request list&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;request list&lt;/a&gt; differs from the request queue as it's not a form of storage in the conventional sense. Instead, it's a predefined collection of URLs for the crawler to visit.&lt;/p&gt;

&lt;p&gt;This approach is particularly suited for situations where you have a set of known URLs to crawl and don't plan to add new ones as the crawl progresses. Essentially, the request list is set in stone once created, with no option to modify it by adding or removing URLs.&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we'll modify our template to utilize a predefined set of URLs in the request list rather than the request queue. We'll begin with adjustments to the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler, RequestList } from 'crawlee';import { router } from './routes.js';const sources = [{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },];const requestList = await RequestList.open('my-list', sources);const crawler = new CheerioCrawler({ requestList, requestHandler: router,});await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this new approach, we created a predefined list of URLs, named &lt;code&gt;sources&lt;/code&gt;, and passed this list into a newly established requestList. This requestList was then passed into our crawler object.&lt;/p&gt;

&lt;p&gt;As for the &lt;code&gt;routes.js&lt;/code&gt; file, we simplified it to include just a single request handler. This handler is now responsible for executing the data extraction logic on the URLs specified in the request list.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following these modifications, when you run your code, you'll observe that only the URLs explicitly defined in our request list are being crawled.&lt;/p&gt;

&lt;p&gt;This brings us to an important distinction between the &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;two types of request storages&lt;/a&gt;. The request queue is dynamic, allowing for the addition and removal of URLs as needed. On the other hand, the request list is static once initialized and is not meant for dynamic changes.&lt;/p&gt;
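
&lt;p&gt;To make that distinction concrete, here is a minimal sketch of the dynamic behavior: the queue is seeded with one URL, and new URLs discovered while crawling are added to it on the fly via &lt;code&gt;enqueueLinks&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // The request queue is dynamic: URLs discovered on the page
        // are pushed to it while the crawl is already running.
        await enqueueLinks({ globs: ['https://crawlee.dev/docs/**'] });
    },
});

// Seed the default request queue with a single starting URL.
await crawler.addRequests([{ url: 'https://crawlee.dev' }]);
await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;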

&lt;p&gt;With the request storage out of the way, let's now explore the result storage in Crawlee, starting with datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee datasets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/api/types/interface/Dataset" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt; in Crawlee serve as repositories for structured data, where every entry possesses consistent attributes.&lt;/p&gt;

&lt;p&gt;Datasets are designed for append-only operations. This means we can only add new records to a dataset, and altering or deleting existing ones is not an option. Each project run in Crawlee is linked to a default dataset, which is commonly utilized for storing precise results from our web crawling activities.&lt;/p&gt;

&lt;p&gt;You might have noticed that each time we ran the crawler, the folder &lt;code&gt;./storage/datasets&lt;/code&gt; was populated with a series of JSON files containing extracted data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing scraped data into a dataset is &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;remarkably simple&lt;/a&gt; using Crawlee's &lt;code&gt;Dataset.pushData()&lt;/code&gt; function. Each invocation of &lt;code&gt;Dataset.pushData()&lt;/code&gt; generates a new table row, with the property names of your data serving as the column headings. By default, these rows are stored as JSON files on your disk. However, Crawlee allows you to integrate other storage systems as well.&lt;/p&gt;

&lt;p&gt;For a practical example, let's take another look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function within &lt;code&gt;routes.js&lt;/code&gt;. Here, you can see how the &lt;code&gt;pushData()&lt;/code&gt; function was used to append the scraped results to the Dataset.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
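
&lt;p&gt;Beyond the default dataset used by &lt;code&gt;pushData()&lt;/code&gt;, you can also open and use named datasets directly. Here is a minimal sketch (the dataset name &lt;code&gt;crawlee-titles&lt;/code&gt; is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Dataset } from 'crawlee';

// Open (or create) a named dataset instead of the default one.
const dataset = await Dataset.open('crawlee-titles');

// Append a record; each call adds one row to the dataset.
await dataset.pushData({ url: 'https://crawlee.dev', title: 'Crawlee' });

// Read the stored items back, e.g. for a quick sanity check.
const { items } = await dataset.getData();
console.log(items);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;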



&lt;h2&gt;
  
  
  &lt;strong&gt;Key-value store&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value sto&lt;/a&gt;&lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;re in Crawlee i&lt;/a&gt;s &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;designed for st&lt;/a&gt;oring and retrieving data records or files. Each record is tagged with a unique key and linked to a specific MIME content type. This feature makes it perfect for storing various types of data, such as screenshots, PDFs, or even for maintaining the state of crawlers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving screenshots&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To showcase the flexibility of the &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value store&lt;/a&gt; in Crawlee, let's take a screenshot of each page we crawl and save it using Crawlee's key-value store.&lt;/p&gt;

&lt;p&gt;However, to do that, we need to switch our crawler from CheerioCrawler to PuppeteerCrawler. The good news is that adapting our code to different crawlers is quite straightforward. For this demonstration, we'll temporarily set aside the &lt;code&gt;routes.js&lt;/code&gt; file and concentrate our crawler logic in the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;To get started with PuppeteerCrawler, the first step is to install the Puppeteer library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, adapt the code in your main.js file as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await saveSnapshot({ key, saveHtml: false }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the code above, we should see three screenshots, one for each website crawled, appear in our crawler's &lt;code&gt;key_value_stores&lt;/code&gt; storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving pages as PDF files&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Suppose we want to convert the page content into a PDF file and save it in the key-value store. This is entirely feasible with Crawlee. Thanks to Crawlee's PuppeteerCrawler being built upon Puppeteer, we can fully utilize all the native features of Puppeteer. To achieve this, we simply need to tweak our code a bit. Here's how to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ page, request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save as PDF await page.pdf({ path: `./storage/key_value_stores/default/${key}.pdf`, format: 'A4', }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to the earlier example involving screenshots, executing this code will create three PDF files, each capturing the content of the accessed websites. These files will then be saved into Crawlee's key-value store.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Doing more with your Crawlee scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That's it for an introduction to Crawlee's data storage types. As a next step, I encourage you to take your scraper to the next level by &lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;deploying it on the Apify platform as an Actor.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With your scraper running on the Apify platform, you gain access to all of Apify's extensive list of features tailored for web scraping jobs, like cloud storage and various data export options. Not sure what it means or how to do it? Don't worry, all the information you need is in this &lt;a href="https://crawlee.dev/docs/deployment/apify-platform" rel="noopener noreferrer"&gt;link to the Crawlee documentation&lt;/a&gt;.&lt;/p&gt;
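
&lt;p&gt;For a rough idea of what that involves, here is a minimal sketch of the usual first step: wrapping the crawler in &lt;code&gt;Actor.init()&lt;/code&gt; and &lt;code&gt;Actor.exit()&lt;/code&gt; from the Apify SDK (assuming the &lt;code&gt;routes.js&lt;/code&gt; router from the template above).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

// Initialize the Actor environment (works locally and on the platform).
await Actor.init();

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://crawlee.dev']);

// Gracefully finish the Actor run.
await Actor.exit();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;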

&lt;p&gt;&lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;Deploy your Crawlee scrapers on the Apify platform&lt;/a&gt;&lt;/p&gt;

</description>
      <category>crawlee</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web scraping for machine learning</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 26 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-for-machine-learning-3834</link>
      <guid>https://dev.to/apify/web-scraping-for-machine-learning-3834</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a cloud platform that helps you build reliable web scrapers fast and automate anything you can do manually in a web browser. This article on web scraping for machine learning was inspired by our work on&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;collecting data for AI and ML applications&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At its simplest, &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; is the automated extraction of data from websites. This process is akin to &lt;a href="https://blog.apify.com/what-are-web-crawlers-and-how-do-they-work/" rel="noopener noreferrer"&gt;web crawling&lt;/a&gt;, which is about finding or discovering web links. The difference is that web scraping focuses on extracting the data from those pages.&lt;/p&gt;

&lt;p&gt;Initially, web scraping was a manual, cumbersome process, but with technological advances being what they are, it has become an automated, sophisticated practice. Web scrapers can navigate websites, understand their structure, and extract specific information based on predefined criteria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping 101: learn the basics&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is web scraping used in machine learning?&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;In most cases, you can't build high-quality predictive models with just internal data.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asif Syed, Vice President of Data Strategy, Hartford Steam Boiler&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ability to harvest and process data from a myriad of web sources is what makes web scraping indispensable for machine learning. Web scraping isn't just about accessing the data but transforming it from the unstructured format of web pages into &lt;a href="https://blog.apify.com/when-data-gets-too-big-why-you-need-structured-data/" rel="noopener noreferrer"&gt;structured&lt;/a&gt; datasets that can be efficiently used in machine learning algorithms.&lt;/p&gt;

&lt;p&gt;You can't teach a machine to make predictions or carry out tasks based on data unless you have an awful lot of data to train it. From social media analytics to competitive market research, web scraping enables the gathering of diverse datasets to teach machines, such as today's so-called 'AI models', and provide them with a rich and nuanced understanding of the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparing data collection methods for machine learning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are multiple ways to collect data for machine learning. These range from traditional surveys and manually curated databases to cutting-edge techniques that utilize IoT devices. So, why choose web scraping over other methods of data acquisition?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surveys:&lt;/strong&gt; They can provide highly specific data but often suffer from biases and limited scope.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Databases:&lt;/strong&gt; These offer structured information, yet they may lack the real-time aspect essential for certain machine learning applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IoT devices:&lt;/strong&gt; These bring in a wave of real-time, sensor-based data, but they're constrained by the type and quantity of data they can collect. It's worth noting that implementing &lt;a href="https://cedalo.com/blog/mqtt-authentication-and-authorization-on-mosquitto/" rel="noopener noreferrer"&gt;MQTT authentication&lt;/a&gt; enhances the security and efficiency of data transmission and allows these devices to communicate more reliably.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web scraping:&lt;/strong&gt; In contrast, web scraping provides access to an almost infinite amount of data available online, from text and images to metadata and more. Unlike surveys or databases, web scraping taps into real-time data, which is crucial for models requiring up-to-date information. Moreover, the diversity of data that can be scraped from the web is unparalleled, which allows for a more comprehensive training of machine learning models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/building-functional-ai-models-for-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn about building functional AI models for web scraping&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quality and quantity of data in ML&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You can have all of the fancy tools, but if your data quality is not good, you're nowhere.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veda Bawo, Director of Data Governance, Raymond James&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The adage "quality over quantity" holds a significant place in many fields, but in the world of machine learning, it's not a matter of choosing one over the other. The success of ML models is deeply rooted in the quality and quantity of data they're trained on.&lt;/p&gt;

&lt;p&gt;Quality of data refers to its accuracy, completeness, and relevance. High-quality data is free from errors, inconsistencies, and redundancies, making it indispensable for dependable analysis and sound decision-making. On the other hand, the quantity of data pertains to its volume. A larger dataset provides more information, leading to more reliable models and improved outcomes. However, an abundance of low-quality data can be detrimental, potentially leading to inaccurate predictions and suboptimal decision-making.&lt;/p&gt;

&lt;p&gt;When it comes to quantity, web scraping allows for the collection of vast amounts of data from various online sources. However, the web is full of low-quality data, so simply extracting raw data isn't enough. It needs to be cleaned and processed before it can be used for machine learning. More about that later.&lt;/p&gt;

&lt;p&gt;Another crucial aspect of data for machine learning is variety. Web scraping provides access to diverse data to enhance a model's ability to understand and interpret varied inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud-based real-time data acquisition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the context of machine learning, the ability to collect and process data in real time is increasingly becoming a necessity rather than a luxury. This is where cloud-based data acquisition plays a vital role: in contrast to edge-based data acquisition, it offers the scalability and flexibility that are critical for handling the voluminous and dynamic nature of web data.&lt;/p&gt;

&lt;p&gt;Cloud computing, with its vast storage and computational capabilities, allows for the handling of massive datasets that web scraping generates. It provides the infrastructure needed to collect, store, and process data from varied sources in real-time. This real-time aspect is especially important in applications like market analysis, social media monitoring, and predictive modeling, where the timeliness of data can be the difference between relevance and obsolescence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/edge-ai-vs-cloud-ai/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn about the differences between Edge AI and Cloud AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Web scraping challenges and techniques for machine learning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The efficacy of web scraping in machine learning hinges on several key techniques. These not only ensure the extraction of relevant data but also its transformation into a format that machine learning algorithms can effectively utilize.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic websites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A major challenge in web scraping is dealing with &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic websites&lt;/a&gt; that continually update their content. These sites often use technologies like JavaScript, AJAX, and infinite scrolling, making data extraction more complex. To effectively scrape such sites, you need advanced techniques and tools, sometimes drawing on the expertise of specialists such as companies providing &lt;a href="https://tsh.io/services/" rel="noopener noreferrer"&gt;software development services&lt;/a&gt;. These techniques include executing JavaScript, handling AJAX requests, and navigating through dynamically loaded content. Mastering them enables the scraping of real-time data from these complex websites, a critical requirement for many machine-learning applications.&lt;/p&gt;
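
&lt;p&gt;As a rough sketch of what this looks like in practice, the snippet below uses the open-source Crawlee library with a headless browser to render JavaScript and scroll through dynamically loaded content before extracting anything (the target URL and the &lt;code&gt;timeoutSecs&lt;/code&gt; value are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

// A headless browser crawler can execute JavaScript, wait for AJAX
// content, and scroll through dynamically loaded lists.
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, pushData }) {
        // Keep scrolling until no new content is loaded (infinite scroll).
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });

        const title = await page.title();
        await pushData({ url: request.url, title });
    },
});

await crawler.run(['https://crawlee.dev']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;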

&lt;h3&gt;
  
  
  &lt;strong&gt;Blocking and blacklisting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many websites have measures in place to detect and block scraping bots to prevent unauthorized data extraction. These measures include blacklisting IP addresses, deploying CAPTCHAs, and analyzing browser fingerprints. To &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;counteract blocking&lt;/a&gt;, web scrapers employ techniques like rotating proxies, mimicking real browser behaviors, and making use of CAPTCHA-solving services.&lt;/p&gt;
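
&lt;p&gt;For example, here is a minimal sketch of rotating requests through a pool of proxies with Crawlee (the proxy URLs are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Rotate requests across a pool of proxies to reduce the chance
// of IP-based blocking. Replace the placeholder URLs with real proxies.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});

await crawler.run(['https://example.com']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;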

&lt;h3&gt;
  
  
  &lt;strong&gt;Heavy server load&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Web scrapers can inadvertently overload servers with too many requests, leading to performance issues or even server crashes. To prevent this, it's essential to implement intelligent crawl delays, randomize scraping times, and distribute the load across multiple proxies. This approach ensures a polite and responsible scraping process that minimizes the impact on website servers.&lt;/p&gt;
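
&lt;p&gt;In Crawlee, for instance, such throttling can be expressed with a couple of crawler options (the numbers below are arbitrary examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler } from 'crawlee';

// Throttle the crawler so it stays polite to the target servers.
const crawler = new CheerioCrawler({
    maxConcurrency: 5,        // at most 5 requests in flight at once
    maxRequestsPerMinute: 60, // hard cap on the request rate
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});

await crawler.run(['https://example.com']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;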

&lt;h2&gt;
  
  
  &lt;strong&gt;What do you do with the scraped data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data preprocessing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We said earlier that scraping raw data isn't enough. The next critical step involves &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/#preprocessing" rel="noopener noreferrer"&gt;cleaning and transforming the raw data&lt;/a&gt; into a structured format suitable for machine learning models. This stage includes removing duplicates and inconsistencies, handling missing values, and normalizing data to ensure that it's free from noise and ready for analysis. Preprocessing ensures that the data fed into machine learning models is of high quality, which is essential for accurate results.&lt;/p&gt;
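
&lt;p&gt;A tiny sketch of what such a preprocessing pass might look like in JavaScript, assuming an array of scraped records called &lt;code&gt;scrapedItems&lt;/code&gt; with &lt;code&gt;url&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// scrapedItems is an assumed array of { url, text, ... } records.
const seen = new Set();

const cleaned = scrapedItems
    // Drop rows with missing fields.
    .filter((item) =&amp;gt; item.url)
    .filter((item) =&amp;gt; item.text)
    // Drop exact duplicates by URL.
    .filter((item) =&amp;gt; {
        if (seen.has(item.url)) return false;
        seen.add(item.url);
        return true;
    })
    // Normalize whitespace in the text before feeding it to a model.
    .map((item) =&amp;gt; ({
        ...item,
        text: item.text.replace(/\s+/g, ' ').trim(),
    }));

console.log(`Kept ${cleaned.length} of ${scrapedItems.length} records`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;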

&lt;h3&gt;
  
  
  &lt;strong&gt;Feature selection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the data is preprocessed, the next step is to identify and extract the most relevant features from the dataset. This involves analyzing the data to determine which attributes are most significant for the problem at hand. By focusing on the most relevant features, the efficiency and performance of machine learning models are significantly enhanced. This step - also known as &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/#feature-engineering" rel="noopener noreferrer"&gt;feature engineering&lt;/a&gt; - can also help reduce the complexity of the model, making it faster and more efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Integrating web data with ML applications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you have your data, you need a way to integrate it with other tools for machine learning. Here are some of the most renowned libraries and databases for ML:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This open-source framework is revolutionizing the way developers integrate large language models (LLMs) with external components in ML applications. It simplifies the interaction with LLMs, facilitating data communication and the generation of vector embeddings. LangChain's ability to connect with diverse model providers and data stores makes it the ML developer's library of choice for building on top of large language models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-langchain/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about LangChain&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hugging Face&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Renowned for its datasets library, Hugging Face is one of the most popular frameworks in the machine learning community. It provides a platform for easily accessing, sharing, and processing datasets for a variety of tasks, including audio, &lt;a href="https://blog.apify.com/data-collection-for-computer-vision/" rel="noopener noreferrer"&gt;computer vision&lt;/a&gt;, and &lt;a href="https://blog.apify.com/text-classification-in-nlp/" rel="noopener noreferrer"&gt;NLP&lt;/a&gt;, making it a crucial tool for ML data readiness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-hugging-face/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Hugging Face&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Haystack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This tool's ecosystem is vast, integrating with technologies like &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; and various model providers. It serves as a flexible and dynamic solution for developers looking to incorporate complex functionalities in their ML projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-haystack-nlp-framework/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Haystack&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LlamaIndex&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LlamaIndex represents a significant advancement in the field of machine learning, particularly in its ability to augment large language models with custom data. This tool addresses a key challenge in ML: the integration of LLMs with private or proprietary data. It offers an approachable platform for even those with limited ML expertise, allowing for the effective use of private data in generating personalized insights.&lt;/p&gt;

&lt;p&gt;With functionalities like &lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;retrieval-augmented generation (RAG)&lt;/a&gt;, LlamaIndex enhances the capabilities of LLMs, making them more precise and informed in their responses. Its indexing and querying stages, coupled with various types of indexes, such as List, Vector Store, Tree, and Keyword indexes, provide a stable infrastructure for precise data retrieval and use in ML applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/platform/integrations/llama" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn how to integrate Apify with LlamaIndex to feed vector databases and LLMs with data crawled from the web&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pinecone and other vector databases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ML models need numerical data, known as &lt;a href="https://blog.apify.com/what-are-embeddings-in-ai/" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt; in machine learning, so any data you've collected has to be stored in and retrieved from a vector database.&lt;/p&gt;
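
&lt;p&gt;To make the idea concrete, here is a conceptual sketch (not tied to any particular vector database) of what retrieval by similarity means: embeddings are plain arrays of numbers, and the database returns the stored vectors closest to a query vector.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Toy example: find the stored embedding most similar to a query
// embedding using cosine similarity. Real vector databases do this
// at scale with approximate nearest-neighbor indexes.
const cosineSimilarity = (a, b) =&amp;gt; {
    const dot = a.reduce((sum, x, i) =&amp;gt; sum + x * b[i], 0);
    const norm = (v) =&amp;gt; Math.sqrt(v.reduce((sum, x) =&amp;gt; sum + x * x, 0));
    return dot / (norm(a) * norm(b));
};

const stored = [
    { id: 'doc-1', vector: [0.1, 0.9, 0.3] },
    { id: 'doc-2', vector: [0.8, 0.1, 0.5] },
];
const query = [0.1, 0.8, 0.4];

const best = stored
    .map((d) =&amp;gt; ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((a, b) =&amp;gt; b.score - a.score)[0];

console.log(best); // the stored document closest to the query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;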

&lt;h3&gt;
  
  
  &lt;strong&gt;Pinecone&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This vector database stands out for its high performance and scalability, which are crucial for ML applications. It's developer-friendly and allows for the creation and management of indexes with simple API calls. Pinecone excels in efficiently retrieving insights and offers capabilities like metadata filtering and namespace partitioning, making it a reliable tool for ML projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Pinecone&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Chroma&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As an AI-native open-source embedding database, Chroma provides a comprehensive suite of tools for working with embeddings. It features rich search functionalities and integrates with other ML tools, including LangChain and LlamaIndex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more vector databases, check out&lt;/strong&gt; &lt;a href="https://blog.apify.com/pinecone-alternatives/" rel="noopener noreferrer"&gt;&lt;strong&gt;6 open-source Pinecone alternatives&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Your first web scraping challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you haven't done web scraping before, we've made it easy (and free) for you to get started. Apify has created a tool ideal for data acquisition for machine learning: &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn how to use Website Content Crawler&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tool was specifically designed to extract data for feeding, fine-tuning, or training machine learning models such as LLMs. You can retrieve the results via the API in formats such as JSON or CSV, which can be fed directly to your LLM or vector database. You can also integrate the data with LangChain using the &lt;a href="https://python.langchain.com/docs/integrations/tools/apify" rel="noopener noreferrer"&gt;Apify LangChain integration&lt;/a&gt;.&lt;/p&gt;
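
&lt;p&gt;As a rough sketch, running the Actor and downloading its results with the Apify JavaScript client could look like this (the start URL is just an example, and other input options are left at their defaults):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run Website Content Crawler on a site and wait for it to finish.
const run = await client.actor('apify/website-content-crawler').call({
    startUrls: [{ url: 'https://docs.apify.com' }],
});

// Fetch the scraped pages as JSON, ready for an LLM or vector database.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} pages scraped`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;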

&lt;p&gt;🌐 &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>10 Google search tricks (that are also Google scraping tricks)</title>
      <dc:creator>Natasha Lekh</dc:creator>
      <pubDate>Thu, 23 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/10-google-search-tricks-that-are-also-google-scraping-tricks-3hdf</link>
      <guid>https://dev.to/apify/10-google-search-tricks-that-are-also-google-scraping-tricks-3hdf</guid>
      <description>&lt;p&gt;Can you apply Google search tricks to scraping and data extraction as well? Let's put it to the test!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We're&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, the only full-stack web scraping platform. You can build, deploy, share, and monitor scrapers or APIs for any website on Apify.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The savvy Googlers among us always have a few tricks up their sleeve. The question is: can you apply these Google search tricks to scraping and data extraction as well? Let's put it to the test! But before we start, we need to address the elephant in the blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🤔 What do Google search shortcuts have to do with web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Millions of people rely on Google search every day. Be it for school, research, or simple entertainment, if you know a few Google search shortcuts, your search process is more efficient. The thing is, a lot of those Google search tricks can also apply to &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reason for this is simple: what a Google scraper does is very similar to what a Google visitor does: it goes to the google.com website, types in a query (even if it contains a shortcut), and receives results. The only difference is that the scraper also copies the results at lightning speed and packages them into a file.&lt;/p&gt;

&lt;p&gt;This means that, if you are familiar with Google tricks and shortcuts, you can use that knowledge to upgrade your Google scraping process. When we built our &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Search Scraper&lt;/a&gt; 🔗 back in the day, we didn't count on this. Now that Google Scraper (also known as &lt;a href="https://blog.apify.com/top-google-search-api/" rel="noopener noreferrer"&gt;Google SERP API&lt;/a&gt;) has over 40,000 users, we feel obliged to let everyone know about this interesting peculiarity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/" rel="noopener noreferrer"&gt;https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn how to use our&lt;/strong&gt; &lt;a href="https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Search Scraper&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;without tricks&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 10 Google search tricks (that are also Google scraping tricks)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So let's put 10 well-known tricks to the test and level up your Google scraping. In other words, let's learn how to scrape Google like a pro 😎&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use site: to scrape specific sites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklu7gwcxi35nkhixb6go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklu7gwcxi35nkhixb6go.png" alt="#1. Add site: after your keyword to narrow down the search to a specific website. For example, visualize site:blog.apify.com" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0xud7uce1mza313y3fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0xud7uce1mza313y3fg.png" alt="#1. Add site: after your keyword to narrow down the search to a specific website. For example, visualize site:blog.apify.com" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1&lt;/strong&gt;. Add &lt;code&gt;site:&lt;/code&gt; after your keyword to narrow down the search to a specific website. For example, &lt;code&gt;visualize site:&lt;/code&gt;&lt;a href="http://blog.apify.com" rel="noopener noreferrer"&gt;&lt;code&gt;blog.apify.com&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is probably the most well-known Google search shortcut to narrow down your search on a specific website without visiting it. The thing is, you can use this same trick not only to search but also to scrape content from that particular website. The syntax is very simple: &lt;code&gt;keyword + site:website.com&lt;/code&gt;. The screenshot above shows how you can apply it to our &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Scraper&lt;/a&gt; 🔗.&lt;/p&gt;

&lt;p&gt;Our query will scrape all content from Google related to the word &lt;code&gt;visualize&lt;/code&gt; but only from our Blog, &lt;a href="https://blog.apify.com/" rel="noopener noreferrer"&gt;blog.apify.com&lt;/a&gt;. All other scraping results will be filtered out. If you need to scrape specific content from a particular site, this is the shortcut to go for.&lt;/p&gt;
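
&lt;p&gt;If you prefer to run this programmatically, here is a hedged sketch using the Apify JavaScript client. Treat the &lt;code&gt;queries&lt;/code&gt; input field name as an assumption for the example and check the Actor's input schema before relying on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// The same site: operator typed into Google is simply passed along
// as part of the query. The 'queries' field name is an assumption.
const run = await client.actor('apify/google-search-scraper').call({
    queries: 'visualize site:blog.apify.com',
});

// Each dataset item corresponds to one scraped results page.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;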

&lt;h3&gt;
  
  
  &lt;strong&gt;Quotation marks for exact scraping queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps73zhxh50ig2w6almru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps73zhxh50ig2w6almru.png" alt="#2: surround your keyword or phrase with quotation marks to scrape accurate results" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favwlz5tz9kvihjsycp6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favwlz5tz9kvihjsycp6l.png" alt="#2. Surround your phrase or word with " width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; Surround your phrase or word with &lt;code&gt;" "&lt;/code&gt; quotation marks for exact scraping queries&lt;/p&gt;

&lt;p&gt;For a regular search, Google (and Google Scraper by extension) will get content containing the words of your query in any order. But you can use quotes to make your Google scraping query laser-accurate. No similar phrases, no swapping words around, no adjacent topics, just word-by-word accuracy.&lt;/p&gt;

&lt;p&gt;Let's see whether we can get away with this by choosing a specific, very long-tail keyword to scrape for: &lt;code&gt;"Headless browsers, infrastructure scaling, sophisticated blocking. Meet the full-stack platform that makes it all easy."&lt;/code&gt; This whole phrase can only be found on the Apify homepage. Will the Google SERP Scraper find it?&lt;/p&gt;

&lt;p&gt;It did! So, surrounding your scraping keyword with quotes will instruct the scraping tool to scrape Google with that specific phrase in mind. This tip can also piggyback on the previous one: you can include quotes to search for specific wording on any website. We'll come back to mixing up various tricks in tip #10.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hyphen to exclude certain results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsqpat0psttilcwy7gw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsqpat0psttilcwy7gw2.png" alt="#3. Add - in front to exclude certain results beforehand" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqf0plrc2ngfk3sqcqrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqf0plrc2ngfk3sqcqrq.png" alt="#3. Add - in front to exclude certain results beforehand" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#3.&lt;/strong&gt; Add &lt;code&gt;-&lt;/code&gt; in front to exclude certain results beforehand&lt;/p&gt;

&lt;p&gt;This shortcut is useful for cases when you want to scrape data about one topic but filter out content about another. In other words, when you don't want a specific term to show up among your Google scraped results. For example, you want to scrape information about web scraping (going slightly meta there) but exclude any Python-related pages.&lt;/p&gt;

&lt;p&gt;You can set this up by using &lt;code&gt;-&lt;/code&gt; in front of unwanted keywords. In our example, the hyphen instructs the Google scraping tool to ignore any content that contains the word Python. And you won't find any Python-related pages among the results. The best part about this trick is that you can filter out information you don't want even before you start scraping.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Link: to scrape websites with backlinks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb62j503ysqxr4awfvqi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb62j503ysqxr4awfvqi5.png" alt="#4. Use link: to scrape websites containing backlinks of your choice" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx8zr6adwnltuco4r0v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx8zr6adwnltuco4r0v7.png" alt="#4. Use link: to scrape websites containing backlinks of your choice" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#4.&lt;/strong&gt; Use &lt;code&gt;link:&lt;/code&gt; to scrape websites containing backlinks of your choice&lt;/p&gt;

&lt;p&gt;This Google scraping tip is no.1 for all SEO enthusiasts out there. Tracking backlinks is one of the most basic SEO practices because, as a rule, the more backlinks your page has, the better your Google ranking. Even better if those backlinks are "high quality," as in coming from domains with high domain authority. Essentially, the number of backlinks is a key indicator that your website's content is valuable (since it's trusted by websites that decide to share it).&lt;/p&gt;

&lt;p&gt;So the gist of this Google scraping trick is: instead of just scraping a page, we're going to scrape all pages that link to that specific page. Let's extract pages with a backlink to &lt;a href="http://apify.com" rel="noopener noreferrer"&gt;apify.com&lt;/a&gt;, a.k.a all pages that mention &lt;a href="http://apify.com" rel="noopener noreferrer"&gt;apify.com&lt;/a&gt; on their page. Phew, that was a mouthful, but with a simple &lt;code&gt;link:&lt;/code&gt; &lt;a href="http://apify.com" rel="noopener noreferrer"&gt;&lt;code&gt;apify.com&lt;/code&gt;&lt;/a&gt; we were able to catch them all.&lt;/p&gt;

&lt;p&gt;Keep in mind that the more targeted your query is (focusing on a specific URL, for example &lt;code&gt;link:&lt;/code&gt; &lt;a href="http://apify.com/product-matching-ai/faq" rel="noopener noreferrer"&gt;&lt;code&gt;apify.com/product-matching-ai/faq&lt;/code&gt;&lt;/a&gt;), the fewer results you'll get. This happens because most pages link to the main domain page rather than specific ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Related: to scrape similar websites or competition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnrg1hqooypk7abair8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnrg1hqooypk7abair8.png" alt="#5: use related: to scrape similar websites or competition" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7af6fikc4yga6mew243.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7af6fikc4yga6mew243.png" alt="#5: use related: to scrape similar websites or competition" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#5:&lt;/strong&gt; use &lt;code&gt;related:&lt;/code&gt; to scrape similar websites or competition&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;related:&lt;/code&gt; trick is a scraping technique that could be a game-changer for market researchers. When you apply &lt;code&gt;related:&lt;/code&gt; to, let's say, &lt;a href="http://amazon.com" rel="noopener noreferrer"&gt;amazon.com&lt;/a&gt;, you won't scrape links to Amazon. Instead, what you'll get are links to online stores &lt;em&gt;similar&lt;/em&gt; to Amazon: think Walmart, Kohl's, and other retailers that sell goods online. The scraping results will depend on the domain you've chosen.&lt;/p&gt;

&lt;p&gt;By scraping with &lt;code&gt;related:&lt;/code&gt;, you can see which companies, organizations, or other entities are perceived as competition to the page you've indicated. So you can think of this Google scraping trick as a fast way to identify competitors in a given industry, or at least the ones that count in the digital space.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OR to scrape Google using multiple keywords at once&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdnarwsgxohwtgipxr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdnarwsgxohwtgipxr7.png" alt="#6. Use OR to scrape using multiple keywords at once" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwmvs848wabcmcsd7plq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwmvs848wabcmcsd7plq.png" alt="#6. Use OR to scrape using multiple keywords at once" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#6.&lt;/strong&gt; Use &lt;code&gt;OR&lt;/code&gt; to scrape using multiple keywords at once&lt;/p&gt;

&lt;p&gt;This Google scraping trick allows you to scrape for multiple queries at once. For instance, let's say we want to scrape pages featuring recipes for both mustard dressing and vinaigrette dressing. By placing a simple &lt;code&gt;OR&lt;/code&gt; between these phrases, we make sure that our search (and subsequent scraping query) includes pages containing either of these delicious terms, covering both dressings in a single run. To make this Google scraping trick even more precise, consider using quotation marks around your queries.&lt;/p&gt;
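
&lt;p&gt;If you're feeding many keyword variations to the scraper, it can help to build the &lt;code&gt;OR&lt;/code&gt; query programmatically. Here's a quick sketch in plain Python (the phrases are just examples) that wraps each phrase in quotation marks and joins them with &lt;code&gt;OR&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Wrap each phrase in quotation marks and join them with OR
phrases = ["mustard dressing recipe", "vinaigrette dressing recipe"]
query = " OR ".join(f'"{phrase}"' for phrase in phrases)

print(query)
# "mustard dressing recipe" OR "vinaigrette dressing recipe"
&lt;/code&gt;&lt;/pre&gt;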

&lt;h3&gt;
  
  
  &lt;strong&gt;Asterisk to scrape wildcard data from Google&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw93z2ck3nl4u85ooz0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw93z2ck3nl4u85ooz0h.png" alt="#7. Use * asterisk to scrape wildcard data" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5sift5j6rzez73loslm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5sift5j6rzez73loslm.png" alt="#7. Use * asterisk to scrape wildcard data" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#7.&lt;/strong&gt; Use &lt;code&gt;*&lt;/code&gt; asterisk to scrape wildcard data&lt;/p&gt;

&lt;p&gt;The asterisk wildcard is another nifty trick for Google scraping. When you insert an &lt;code&gt;*&lt;/code&gt; into your scraping query, it acts as a flexible placeholder, which the Google scraper can later fill in. This tip is particularly handy when you don't have all the words at your fingertips. To best illustrate this, let's use an example with song lyrics. So, for our example, let's search for the lyrics of a famous Queen song by taking two random parts from verse three and placing an asterisk between them.&lt;/p&gt;

&lt;p&gt;As the scraping tool works its magic, Google treats the asterisk as a stand-in for any word or series of words bridging our phrases. More often than not, the result will include the exact lyrics of the song we're targeting. But this trick isn't limited to just song lyrics. Whether it's a specific social media post, an elusive item, a lengthy name, or an article title that's on the tip of your tongue, the asterisk wildcard can make your Google scraping just a little bit more flexible.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Filetype: to scrape files of specific format&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybhb2100fziivh0wuuqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybhb2100fziivh0wuuqd.png" alt="#8. Use keyword + filetype: to scrape files of specific format" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82cc9pjqem6liovyv10j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82cc9pjqem6liovyv10j.png" alt="#8. Use keyword + filetype: to scrape files of specific format" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#8.&lt;/strong&gt; Use keyword + &lt;code&gt;filetype:&lt;/code&gt; to scrape files of specific format&lt;/p&gt;

&lt;p&gt;&lt;code&gt;filetype:&lt;/code&gt; is as simple as it sounds. This Google scraping trick will get you files of a specific format from the open web. Just enter your keyword plus &lt;code&gt;filetype:&lt;/code&gt; followed by a file extension: PDF, DOCX, or HTML. So, for example, the scraping query &lt;code&gt;harry potter filetype:pdf&lt;/code&gt; will get you a collection of Harry Potter-related PDFs. But the scope of this scraping trick isn't confined to these formats alone. You can scrape Google for any type of file it indexes, including PowerPoint presentations (PPT), LaTeX documents (TEX), and even Google Earth maps (KML).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scrape results before, after, and between periods of time using BEFORE, AFTER and ..&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8eknc0y7b2ant7xsvww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8eknc0y7b2ant7xsvww.png" alt="#9: scrape results before, after, and between periods of time using BEFORE, AFTER and . ." width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap2x20kvytegr8ej70jb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap2x20kvytegr8ej70jb.png" alt="#9. Scrape results before, after, and between periods of time using BEFORE, AFTER and . ." width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#9.&lt;/strong&gt; Scrape results before, after, and between periods of time using &lt;code&gt;BEFORE&lt;/code&gt;, &lt;code&gt;AFTER&lt;/code&gt; and &lt;code&gt;..&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This trick lets you scrape Google results from a specific period of time. Enter your keyword followed by the desired time frame: before, after, or within a specific range. For example, if we're aiming to scrape Google Maps scraping tutorials published after 2022, our query would be: &lt;code&gt;google maps scraping tutorial AFTER:2022&lt;/code&gt;. After applying this, our Google scraping results will exclusively feature tutorials from 2022 onwards, sparing us the effort of sifting through older, irrelevant data. To scrape within a range instead, put two dots between the years, e.g. &lt;code&gt;google maps scraping tutorial 2022..2023&lt;/code&gt;. A little caveat, though: you can't scrape anything earlier than the dawn of the internet.&lt;/p&gt;
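
&lt;p&gt;If you want to cover several periods in one scraping session, you can generate the date-bounded queries programmatically and feed them to the scraper. Here's a quick sketch in plain Python (the base query and years are just examples):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Build one query per year so each run targets a single time window
base_query = "google maps scraping tutorial"

queries = [
    f"{base_query} AFTER:{year} BEFORE:{year + 1}"
    for year in range(2020, 2024)
]

for query in queries:
    print(query)
# google maps scraping tutorial AFTER:2020 BEFORE:2021
# google maps scraping tutorial AFTER:2021 BEFORE:2022
# ...
&lt;/code&gt;&lt;/pre&gt;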

&lt;h3&gt;
  
  
  &lt;strong&gt;Mix them up!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumpk9qarwuh1m41mckuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumpk9qarwuh1m41mckuy.png" alt="#10. Challenge the Google Pages Scraper by mixing up " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1kobytl4ss393vek9px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1kobytl4ss393vek9px.png" alt="#10. Challenge the Google Pages Scraper by mixing up " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4tcnk72gw0lxe9dg0ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4tcnk72gw0lxe9dg0ir.png" alt="#10. Challenge the Google Pages Scraper by mixing up " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#10.&lt;/strong&gt; Challenge the Google Pages Scraper by mixing up &lt;code&gt;" "&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, and &lt;code&gt;site:&lt;/code&gt; search&lt;/p&gt;

&lt;p&gt;Last but not least, you can combine a lot of the scraping tricks you've just learned; our Google SERP Scraper loves a challenge. In our example, we're looking for a very specific article whose exact title we don't remember, and we're narrowing the search down to &lt;a href="http://blog.apify.com" rel="noopener noreferrer"&gt;blog.apify.com&lt;/a&gt;. So we're using quotation marks, an asterisk, and a site search. Let's see if the search engine and scraper can find and fetch that article for us. They did! So go ahead and try out all of the tricks, one by one or all at once.&lt;/p&gt;

&lt;p&gt;🤹 &lt;strong&gt;Know any other tricks? Try them out on&lt;/strong&gt; &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Scraper&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcykd94a6dl02gx1hnm5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcykd94a6dl02gx1hnm5k.png" alt="Google scraping fueled by the platform" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google scraping fueled by the Apify platform&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Google scraping fueled by the Apify platform&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The best part of Google Scraper is that it lets you extract just about anything you need from Google search results. It can do that because Google SERP Scraper is more than just a standalone tool: it's supercharged by the versatility of the Apify platform.&lt;/p&gt;

&lt;p&gt;Because of the platform support, you're not limited to simply exporting scraped Google data in a range of formats or getting results for various Google domains. You also gain the convenience of &lt;a href="https://www.youtube.com/watch?v=ViYYDHSBAKM" rel="noopener noreferrer"&gt;accessing that data through an API&lt;/a&gt;, crafting &lt;a href="https://apify.com/integrations" rel="noopener noreferrer"&gt;custom integrations&lt;/a&gt; with other scrapers or your favorite apps, and &lt;a href="https://www.youtube.com/watch?v=GRFW_Loo2dk" rel="noopener noreferrer"&gt;scheduling&lt;/a&gt; and monitoring your scraping projects with ease.&lt;/p&gt;
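
&lt;p&gt;To give a feel for the API route, here's a minimal sketch using the &lt;code&gt;apify-client&lt;/code&gt; Python package to run Google Search Scraper with one of the operator queries from this article and read the results. The token is a placeholder, and the &lt;code&gt;queries&lt;/code&gt; input field is an assumption about the Actor's input, so check the input schema on the Actor's page before running it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from apify_client import ApifyClient

# Placeholder token - use your own Apify API token
client = ApifyClient("YOUR_APIFY_TOKEN")

# Run the Google Search Scraper Actor; "queries" is assumed to hold the search terms
run = client.actor("apify/google-search-scraper").call(
    run_input={"queries": "link:apify.com"}
)

# Read the scraped results from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
&lt;/code&gt;&lt;/pre&gt;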

&lt;p&gt;Last but not least, the Apify platform makes sure our 40K+ users can scrape Google pages with confidence, thanks to our specialized &lt;a href="https://apify.com/proxy#proxies-offered-by-apify" rel="noopener noreferrer"&gt;SERP proxies&lt;/a&gt; that are tailor-made for the job. All this to make data extraction from Google easy and reliable.&lt;/p&gt;

</description>
      <category>google</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>Edge AI vs. Cloud AI</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Wed, 22 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/edge-ai-vs-cloud-ai-oak</link>
      <guid>https://dev.to/apify/edge-ai-vs-cloud-ai-oak</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a cloud platform that helps you build reliable web scrapers fast and automate anything you can do manually in a web browser. This article on Edge AI vs. Cloud AI was inspired by our work on&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;collecting data for AI and machine learning applications&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The rise of Edge AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Five years ago, Gartner predicted that &lt;a href="https://www.gartner.com/smarterwithgartner/what-edge-computing-means-for-infrastructure-and-operations-leaders" rel="noopener noreferrer"&gt;75% of enterprise data would be created and processed outside the cloud&lt;/a&gt; by 2025. Whether that will prove entirely accurate remains to be seen. But what is clear is that Edge AI is rapidly growing in popularity.&lt;/p&gt;

&lt;p&gt;The rise of edge computing accelerated in the 2010s, as the explosion of IoT devices called for smarter, faster processing at the edge, that is, closer to the data source. This gave rise to Edge AI, where AI algorithms run locally on a hardware device.&lt;/p&gt;

&lt;p&gt;The growing interest in Edge AI has generated a myth that edge computing will replace cloud computing. But in reality, Edge and Cloud can work hand-in-hand by synchronizing a decentralized edge and a centralized cloud.&lt;/p&gt;

&lt;p&gt;The purpose of this article isn't to tell you which of the two - Edge or Cloud - is better but to highlight the pros and cons of each so you can know which is suitable for your AI tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Edge AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Sometimes referred to as AI at the edge, Edge AI is the implementation of artificial intelligence in an edge computing environment. In other words, Edge AI allows computation to be done close to where data is collected rather than at an offsite data center. Because the data is processed on the device itself, response times are swift.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pros and cons of Edge AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Edge AI advantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced latency and bandwidth:&lt;/strong&gt; By processing data close to the edge, the need to transmit information over the network is reduced.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Swift response times:&lt;/strong&gt; Fully on-device processing delivers quick responses, eliminating the wait times caused by round trips to remote servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Privacy and security:&lt;/strong&gt; Edge AI offers better security for personal data than transmitting it across networks, where it can be vulnerable to cyberattacks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Edge AI disadvantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited computing power:&lt;/strong&gt; Edge devices often have less computing power than cloud servers, limiting the complexity of AI models they can run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost and scalability challenges:&lt;/strong&gt; Scaling Edge AI solutions across numerous devices can be complex and expensive, given the cost of acquiring, maintaining, and operating the computing resources on each device.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance and upgrades:&lt;/strong&gt; Regular maintenance and updates of each edge device can be more challenging compared to centralized cloud updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine variations:&lt;/strong&gt; There's more variation in hardware among edge devices, which makes failures more common.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sm2ye6qbfxpl3jq8yz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sm2ye6qbfxpl3jq8yz.jpg" alt="Cloud AI is where data processing and AI model execution occur in cloud-based servers." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Cloud AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cloud AI refers to artificial intelligence systems where the data processing and AI model execution occur in cloud-based servers rather than on local devices.&lt;/p&gt;

&lt;p&gt;The foundation for Cloud AI was laid with the advent of cloud computing in the early noughties. The introduction of cloud AI services, such as Google's Cloud AI, AWS's SageMaker, and Microsoft's Azure AI, some ten years later, was a significant milestone. These platforms provided tools for machine learning, data analytics, and cognitive services (like &lt;a href="https://blog.apify.com/nlp-techniques/" rel="noopener noreferrer"&gt;natural language processing&lt;/a&gt; and &lt;a href="https://blog.apify.com/data-collection-for-computer-vision/" rel="noopener noreferrer"&gt;computer vision&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Because Cloud AI operates on data sent to remote servers, it's more scalable and flexible than Edge AI. That's the main thing that gives Cloud the edge (see what I did there?), but there are other advantages too:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pros and cons of Cloud AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud AI advantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Big data handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI algorithms thrive on voluminous data for training and accuracy. Cloud storage is integral here, providing the capacity to store and process terabytes of data. This capability is essential for developing machine learning models that learn from vast, varied datasets to enhance their predictive accuracy and reliability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parallel processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before cloud infrastructure, processing limitations were a significant bottleneck in AI development. Cloud computing introduced parallel processing nodes, which dramatically enhanced computing power. This means complex AI models can be computed much faster, accelerating the development and deployment of AI solutions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU acceleration&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advanced AI computations, especially those in machine learning and deep learning, require significant processing power. GPUs, known for their &lt;a href="https://blog.apify.com/concurrency-vs-parallelism/" rel="noopener noreferrer"&gt;parallel processing&lt;/a&gt; capabilities, are ideal for these tasks. Cloud AI utilizes GPU acceleration to handle intensive AI computations efficiently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability and flexibility&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most significant advantages of cloud storage in AI is scalability. Cloud-based AI systems can adapt to varying computational demands, scaling up or down as needed. This flexibility allows for efficient management of resources and costs, which is particularly vital for fluctuating AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud AI disadvantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency issues:&lt;/strong&gt; Depending on internet connectivity, there can be latency in data processing, which may not be suitable for real-time applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security concerns:&lt;/strong&gt; Transmitting data to and from cloud servers can pose security risks, especially if sensitive data is involved. That being said, cloud providers offer strong security measures and compliance standards, so they can be a viable option if properly configured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependence on internet connectivity:&lt;/strong&gt; Cloud AI's effectiveness is contingent on reliable internet connectivity, which can be a limitation in remote or unstable network areas.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key takeaways: when to use Edge and when to use Cloud&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge computing&lt;/strong&gt; minimizes latency by processing data locally but has limitations in terms of computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud computing&lt;/strong&gt; provides powerful processing capabilities but introduces latency due to data transmission.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The choice&lt;/strong&gt; between Edge and Cloud depends on the latency tolerance of your application, the available network bandwidth, and the computational needs of your AI tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Edge AI&lt;/strong&gt; when real-time processing, data privacy, and reduced bandwidth usage are critical.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Cloud AI&lt;/strong&gt; for complex computations, large-scale data analysis, and applications where latency is less of a concern.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apify as a data cloud platform for AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If data in the cloud is what you need, Apify is a cloud platform that helps you &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;build reliable web scrapers&lt;/a&gt; for real-time data collection, and automate anything you can do manually in a web browser. This makes it an ideal platform for extracting web data at scale for AI and machine learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🧑🏻💻&lt;/strong&gt; &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping for AI data&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Apify excels in extracting vast amounts of data from the web, which is crucial for training and fine-tuning AI models like ChatGPT and LLaMA. Its ability to crawl and extract relevant information from various sources makes it a go-to solution for feeding AI algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🧩&lt;/strong&gt; &lt;a href="https://python.langchain.com/docs/integrations/tools/apify" rel="noopener noreferrer"&gt;&lt;strong&gt;Easy integration with AI tools&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Apify facilitates the integration of scraped data into AI platforms. It supports loading data into &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt;, which can then be queried to &lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;enhance the capabilities of AI chatbots&lt;/a&gt; and other applications.&lt;/p&gt;
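
&lt;p&gt;As a rough sketch of what that integration can look like, here's the LangChain Apify wrapper turning an Actor run into documents ready for a vector database. It assumes the &lt;code&gt;APIFY_API_TOKEN&lt;/code&gt; environment variable is set and that the chosen Actor's dataset items expose &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;url&lt;/code&gt; fields, so adjust the mapping to your Actor's actual output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# Reads the Apify API token from the APIFY_API_TOKEN environment variable
apify = ApifyWrapper()

# Run an Actor and map each dataset item to a LangChain Document
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url")},
    ),
)

# These documents can now be embedded and stored in the vector database of your choice
docs = loader.load()
print(len(docs))
&lt;/code&gt;&lt;/pre&gt;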

&lt;h3&gt;
  
  
  &lt;strong&gt;📈&lt;/strong&gt; &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;&lt;strong&gt;Customizable and scalable&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Whether it's using &lt;a href="https://apify.com/store/categories/ai" rel="noopener noreferrer"&gt;pre-built scrapers&lt;/a&gt; or developing custom ones, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;tailored solutions&lt;/a&gt;. This flexibility is vital for AI applications that require specific, up-to-date data from diverse web sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Practical applications&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In customer service, Apify's web scraping abilities are already &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;enhancing AI chatbots&lt;/a&gt;, enabling them to provide accurate and relevant responses based on real-time data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🦾 You might be interested in how you can add &lt;a href="https://blog.apify.com/add-custom-actions-to-your-gpts/" rel="noopener noreferrer"&gt;custom actions to your GPTs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>webcrawling</category>
    </item>
  </channel>
</rss>
