<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Apify</title>
    <description>The latest articles on DEV Community by Apify (@apify).</description>
    <link>https://dev.to/apify</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2171%2F8c96d506-957a-4ad7-8e96-f083077b4d3f.png</url>
      <title>DEV Community: Apify</title>
      <link>https://dev.to/apify</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/apify"/>
    <language>en</language>
    <item>
      <title>Firecrawl vs. Apify: 2025 guide for AI and data teams</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Mon, 11 Aug 2025 08:42:57 +0000</pubDate>
      <link>https://dev.to/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</link>
      <guid>https://dev.to/apify/firecrawl-vs-apify-2025-guide-for-ai-and-data-teams-42e3</guid>
      <description>&lt;p&gt;&lt;em&gt;A detailed comparison of Firecrawl's unified AI-driven scraping and Apify's comprehensive, flexible ecosystem. We explain what each does best&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Fresh, structured &lt;a href="https://apify.com/use-cases/data-for-ai-agents" rel="noopener noreferrer"&gt;web data is the fuel for AI&lt;/a&gt;, including agents, RAG pipelines, competitive‑intelligence dashboards, and change‑monitoring services. Two platforms dominate that space in 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firecrawl&lt;/strong&gt; – an API‑first crawler that turns any URL into LLM‑ready Markdown/JSON in seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apify&lt;/strong&gt; – a full‑stack scraping platform with thousands of reusable data collection tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We'll go deep into the differences and what they do best, so you can choose (or combine) wisely.&lt;/p&gt;

&lt;p&gt;Let's kick off with a table of each platform’s features, benefits, and trade‑offs so you can see where each one excels.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core value prop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, semantic scraping API geared for AI&lt;/td&gt;
&lt;td&gt;End-to-end scraping platform (6,000+ data collection tools, proxies, compliance)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stand‑out features&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pre-warmed browsers&lt;br&gt;NL extraction&lt;br&gt;AGPL OSS &amp;amp; self-host&lt;br&gt;Stealth proxies&lt;/td&gt;
&lt;td&gt;Scraper marketplace&lt;br&gt;JS/TS &amp;amp; Py SDKs&lt;br&gt;Global proxy pool + CAPTCHA&lt;br&gt;Cron-scheduler, retries, webhooks&lt;br&gt;SOC 2 Type II and GDPR compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary benefits&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sub-second latency on cached pages&lt;br&gt;Prompt-based (no selectors)&lt;br&gt;Predictable credit pricing&lt;/td&gt;
&lt;td&gt;Fine‑grained session &amp;amp; proxy control&lt;br&gt;No‑code operation for analysts; devs extend via code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✔ Fast single‑page fetches&lt;br&gt;✔ Self‑host to avoid vendor lock‑in&lt;br&gt;✔ Low entry cost (free + $16 Hobby tier)&lt;/td&gt;
&lt;td&gt;✔ Breadth: 6,000 off-the-shelf scrapers&lt;br&gt;✔ Effective anti‑blocking technology&lt;br&gt;✔ Monetize your own scrapers; earn rev‑share&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✖ Credits can disappear fast on large crawls&lt;br&gt;✖ Limited built-in scheduling&lt;br&gt;✖ AGPL copyleft for forks&lt;/td&gt;
&lt;td&gt;✖ Actors / CU concepts add a learning curve&lt;br&gt;✖ Consumption costs can spike with inefficient code&lt;br&gt;✖ Cold-start ≈1.5s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get data for AI with Apify&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pricing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl uses a simple credit-based model: 1 page = 1 credit (under standard conditions). That makes it very easy to predict costs. The free tier lets you scrape up to 500 pages before committing financially (no credit card required). Paid plans range from $16 to $333 per month on standard pricing.&lt;/p&gt;

&lt;p&gt;Extraction tasks that go beyond simple scraping consume additional credits or tokens, and Firecrawl offers dedicated “Extract” plans ranging from $89 to $719/month, depending on token volume.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff614pdy4bwhfc10bhzcm.jpeg" alt="Firecrawl pricing" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify’s pricing model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify combines a subscription and consumption model. You get a base amount of platform credit with each pricing plan, but the consumption rate depends on the resources used. 1 compute unit = 1 gigabyte-hour of RAM.&lt;/p&gt;

&lt;p&gt;Costs are harder to predict this way because scraping a JavaScript-rendered site that needs browser automation consumes far more compute units than a simple HTML scraper.&lt;/p&gt;

&lt;p&gt;That’s why Apify has introduced pay-per-event pricing. Scrapers with this pricing model charge for specific actions rather than just results. For example, &lt;strong&gt;a scraper that charges $5 per run start and $2 per 1,000 results would cost $15 for 5,000 results&lt;/strong&gt;.  This can make large-scale scraping jobs cheaper in the long run.&lt;/p&gt;
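&lt;p&gt;To make the arithmetic explicit, here is that example as a quick calculation (the prices are the illustrative figures from the sentence above, not any specific Actor's rates):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Pay-per-event example: $5 per run start + $2 per 1,000 results
const runStartUsd = 5;
const perThousandResultsUsd = 2;
const results = 5000;

const totalUsd = runStartUsd + (results / 1000) * perThousandResultsUsd;
console.log(totalUsd); // 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;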

&lt;p&gt;Apify's forever-free tier gives you $5 of credit that renews automatically every month, so you can test any scraper on Apify Store without financial commitment (no credit card required). Paid plans range from $39 to $999 per month.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3a2v1j16kjq8a0dindk.png" alt="Apify pricing" width="800" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl and Apify pricing compared&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl (flat credits)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify (pre-paid credit + CU)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hobby: $16 ✅&lt;/td&gt;
&lt;td&gt;Starter: $39 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;100,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Standard: $83 ✅&lt;/td&gt;
&lt;td&gt;Scale: $199 ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;500,000 pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Growth: $333 or Enterprise&lt;/td&gt;
&lt;td&gt;Business: $999 or Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Summary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Up to 500k pages/month? &lt;strong&gt;Firecrawl&lt;/strong&gt; is usually cheaper.&lt;/td&gt;
&lt;td&gt;Millions of lightweight pages or heavy anti‑bot workflows? A well‑optimized &lt;strong&gt;Apify&lt;/strong&gt; scraper can win on total cost.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl: Unified AI-driven scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxj0swzvdr3vkuzfzotx.png" alt="Firecrawl - Unified AI-driven scraping" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Firecrawl offers a single, consistent API that handles scraping, crawling, and AI‑driven site navigation, so developers never have to juggle multiple endpoints or bespoke parameters.&lt;/p&gt;

&lt;p&gt;When you request a page, the service decides on‑the‑fly whether it needs a headless browser, waits for all dynamic elements to render, and then applies extraction models that automatically ignore ads, menus, and other noise.&lt;/p&gt;

&lt;p&gt;Instead of writing brittle CSS or XPath rules, you ask for the data in plain language — “product prices and availability,” for example — and Firecrawl returns a clean JSON block that stays stable even when the site’s markup changes.&lt;/p&gt;
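&lt;p&gt;As a rough illustration, a natural-language extraction request could look something like the sketch below. Treat the endpoint path and request fields as assumptions based on Firecrawl's public v1 API and check the official docs for the exact schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Illustrative sketch only: endpoint and body fields may differ from the current Firecrawl API
const res = await fetch("https://api.firecrawl.dev/v1/scrape", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
  },
  body: JSON.stringify({
    url: "https://example.com/product",
    formats: ["extract"],
    extract: { prompt: "product prices and availability" }, // plain language, no selectors
  }),
});

const { data } = await res.json();
console.log(data.extract); // clean JSON that stays stable when the markup changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;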

&lt;p&gt;The same intelligence powers the crawler: it goes through internal links without sitemaps, skips duplicate content, and can infer which pages matter most from their position in the site hierarchy and linking patterns, all while respecting any boundaries you set.&lt;/p&gt;

&lt;p&gt;For sites that hide information behind clicks, forms, or paginated views, you can enable the FIRE‑1 agent. It mimics human behavior by clicking “Load More,” filling search fields, and even solving simple CAPTCHAs, eliminating special‑case code.&lt;/p&gt;

&lt;p&gt;Performance optimizations run throughout the platform. Recently scraped pages come from cache in milliseconds, hundreds of URLs can be batched in a single call for parallel processing, and converting HTML to lightweight Markdown cuts the token count for downstream LLMs by roughly two‑thirds. The net effect is faster, lower‑maintenance data collection that remains reliable as websites evolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify: A comprehensive, flexible ecosystem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2w9320o95aph7gfz8g8u.jpeg" alt="Apify: A comprehensive, flexible ecosystem, not just a web scraping API" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apify approaches web scraping as an ecosystem problem rather than a single‑tool exercise. Its foundation is the Actor system — self‑contained programs that run in Apify’s cloud with uniform input/output, shared storage, and common scheduling and monitoring. Because every Actor behaves the same way, you can link them into multistep workflows just by passing data from one to the next.&lt;/p&gt;

&lt;p&gt;This standardization fuels &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;, a marketplace of more than 6,000 pre‑built scrapers and automation tools maintained by domain specialists. If you need Amazon product data, Instagram follower stats, or a one‑off government registry crawl, chances are an Actor already exists and is kept up to date as site layouts change, so you rarely start from a blank page.&lt;/p&gt;
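&lt;p&gt;Running one of those Store Actors programmatically takes only a few lines with the &lt;code&gt;apify-client&lt;/code&gt; package. A minimal sketch (the Actor name and input fields here are illustrative, not a recommendation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start an Actor from Apify Store and wait for the run to finish
const run = await client.actor("apify/website-content-crawler").call({
  startUrls: [{ url: "https://example.com" }],
});

// Read the results the Actor pushed to its default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Scraped ${items.length} pages`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;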

&lt;p&gt;When an off‑the‑shelf Actor doesn’t fit, you can build your own with the Apify SDK (also released as the open‑source &lt;a href="https://crawlee.dev/?__hstc=160404322.3b3599158ad751038b19c751c187f023.1696600257709.1753948779266.1753948924434.739&amp;amp;__hssc=160404322.1.1753948924434&amp;amp;__hsfp=3246212229" rel="noopener noreferrer"&gt;Crawlee library&lt;/a&gt;). The SDK offers high‑level helpers — request queues, automatic retries, error handling, parallel processing — while still letting you drop down to raw Puppeteer, Playwright, or HTTP calls when necessary. It supports both JavaScript/TypeScript and Python, making it &lt;strong&gt;easy to slot into existing codebases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Infrastructure management is largely hands‑off. Apify autoscales compute instances, rotates datacenter or residential proxies by geography, and applies anti‑detection tactics such as browser‑fingerprint randomization, human‑like delays, and outsourced CAPTCHA solving. Enterprise‑grade features cover the operational side: detailed run statistics, alerting, conditional scheduling (e.g., “scrape 100 sites at 06:00 only if yesterday’s run succeeded”), real‑time webhooks for downstream systems, and configurable result retention to satisfy compliance audits or historical analyses.&lt;/p&gt;

&lt;p&gt;In practice, the platform lets teams mix and match ready‑made Actors, custom code, and reliable infrastructure to &lt;strong&gt;solve diverse scraping tasks without rebuilding each time&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Technical feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API surface&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single base URL with multiple HTTP/JSON endpoints under one uniform API&lt;/td&gt;
&lt;td&gt;Each Actor exposes a standard REST interface: you POST to the “run” endpoint with a JSON input and then GET results from key‑value stores or datasets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic‑content handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automatically spins up a headless browser when JS rendering is detected, with no extra configuration&lt;/td&gt;
&lt;td&gt;You choose per Actor: Puppeteer, Playwright, Cheerio, raw HTTP, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extraction approach&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Zero‑selector” extraction: ML/NLP models parse pages into JSON or Markdown out‑of‑the‑box&lt;/td&gt;
&lt;td&gt;Code‑based extraction inside each Actor using Crawlee's page handlers or raw selectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Workflow composition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In a single request you can switch between scrape, crawl, or the FIRE‑1 “agent” mode using flags&lt;/td&gt;
&lt;td&gt;Chain Actors/tasks asynchronously using shared storage, datasets, or webhooks to build multistep pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browser automation / navigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FIRE‑1 agent clicks buttons, paginates, fills forms, solves simple CAPTCHAs&lt;/td&gt;
&lt;td&gt;Automation coded per Actor; CAPTCHA solving baked in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anti‑detection &amp;amp; proxy support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully managed proxies, rotating sessions, fingerprint spoofing and geotargeting are all built‑in&lt;/td&gt;
&lt;td&gt;Apify Proxy (datacenter &amp;amp; residential) with session rotation, geo‑targeting and spoofing; you enable it per Actor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Caching &amp;amp; batching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global intelligent cache for recently fetched pages; supports batching hundreds of URLs in one API call&lt;/td&gt;
&lt;td&gt;Request queues with autoscaled parallelism; per‑Actor caching logic is up to you (e.g. dataset dedupe, custom cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Firecrawl auto‑scales browser instances for concurrent requests&lt;/td&gt;
&lt;td&gt;Platform scales Actors and browser instances elastically according to queue depth and configured concurrency limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitoring &amp;amp; scheduling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metrics and scheduling built into the single API dashboard&lt;/td&gt;
&lt;td&gt;Per‑Actor run metrics, alerts, conditional scheduling, and webhook triggers via Apify Console&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data format optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Converts HTML → Markdown to cut LLM tokens by ≈ 67%&lt;/td&gt;
&lt;td&gt;Returns whatever the Actor emits (HTML, JSON, CSV, etc.); no built‑in token reduction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language / SDK support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any language that can call HTTP/JSON (official clients in Python, Node, Go, Rust, C#, etc.)&lt;/td&gt;
&lt;td&gt;Official SDKs in JavaScript/TypeScript and Python (Apify SDK / Crawlee); other languages via raw REST calls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How each platform scales
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Firecrawl&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl scales like a classic SaaS API. Your subscription tier defines a fixed pool of headless‑browser workers, and the service queues requests behind those workers. Because the queueing and resource allocation are handled centrally, you get stable latency and never touch infrastructure. Even the mid‑tier plans can move thousands of pages per minute thanks to caching and smart routing, so the raw worker counts rarely become a choke point.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify takes a looser, cloud‑native approach. You launch as many Actors as your budget allows, and the platform spins up containers to run them in parallel. That “elastic infinity” is perfect for bursty jobs — say, crawling an entire retail catalog overnight or tracking tens of thousands of social‑media accounts — because you can flood the platform with work and pay only for the compute you burn. Automatic retries and detailed run logs keep large Actor swarms reliable.&lt;/p&gt;

&lt;p&gt;The marketplace magnifies &lt;strong&gt;Apify’s scale advantage&lt;/strong&gt;: if your project hits 50 different sites, chances are someone has already published an Actor for most of them. You trade weeks of scraper authoring for coordination logic — managing many Actors’ versions, schedules, and output formats — but you &lt;strong&gt;gain speed and capacity on day one&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Integrating Firecrawl into AI pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Firecrawl treats integration the same way it treats scraping: make the routine path effortless and keep the escape hatches open.&lt;/p&gt;

&lt;p&gt;Official SDKs for Python, JavaScript, Rust, and Go expose idiomatic methods, but the real strength lies in direct hooks to AI tooling. A native LangChain loader, for example, can be wired up in just a few lines; it fetches pages, paginates automatically, preserves attribution metadata, and delivers chunks that are ready for embeddings or other RAG workflows. LlamaIndex receives similar first‑class support with retrievers that fetch, de‑duplicate, and format content for chatbots or summarization agents while keeping token counts under control.&lt;/p&gt;
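&lt;p&gt;As a sketch of how little wiring that takes (the import path and options follow LangChain's community loader for Firecrawl as we understand it; verify them against the current LangChain JS docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;import { FireCrawlLoader } from "@langchain/community/document_loaders/web/firecrawl";

const loader = new FireCrawlLoader({
  url: "https://example.com/blog",       // page or site to ingest
  apiKey: process.env.FIRECRAWL_API_KEY,
  mode: "crawl",                         // "scrape" for a single page
});

// Documents arrive with attribution metadata, ready for chunking and embedding
const docs = await loader.load();
console.log(docs[0].metadata);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;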

&lt;p&gt;Outside pure code, Firecrawl blocks appear in Make.com and Zapier, so non‑developers can drag‑and‑drop flows — say, watch a competitor’s site, extract new product data in plain English, and update a shared spreadsheet — without touching a script. The result is a scraper that plugs into AI stacks and no‑code tools with equal ease.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify integrations for production workflows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify’s connectors reflect a more mature, enterprise‑centric agenda: it's designed to sit inside CI/CD pipelines and data‑engineering stacks, not just trigger one‑off jobs.&lt;/p&gt;

&lt;p&gt;The Zapier app goes beyond “run an Actor” by adding triggers for job completion, built‑in data filters, and error‑handling branches, so a sales team can scrape LinkedIn, enrich leads with a second Actor, and post qualified prospects to a CRM in a single Zap.&lt;/p&gt;

&lt;p&gt;The GitHub integration lets teams treat scraper code like any other service — commit, test, review, and deploy via GitHub Actions — bringing familiar DevOps discipline to crawling logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foubbz5i3hiyhctiz1706.jpeg" alt="LangChain and LlamaIndex integration - Apify" width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For AI tooling, Apify has loaders for LangChain and LlamaIndex, plus integrations with Hugging Face and Haystack, and with vector databases such as Pinecone and Qdrant.&lt;/p&gt;

&lt;p&gt;For data pipelines, Apify ships connectors that push results straight into S3, GCS, or Azure Blob, or stream them through webhooks to Kafka or Pub/Sub, &lt;strong&gt;turning the platform into a managed data‑ingestion layer&lt;/strong&gt; that feeds downstream analytics with no extra glue code.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Integration feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Firecrawl&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Official SDKs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python, JavaScript/TypeScript (official), Rust, Go (community)&lt;/td&gt;
&lt;td&gt;JavaScript/TypeScript &amp;amp; Python (via Apify SDK / Crawlee)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI‑framework hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native loaders/retrievers for LangChain and LlamaIndex (handles pagination, chunking, metadata, token‑cost control)&lt;/td&gt;
&lt;td&gt;Official loaders for LangChain and LlamaIndex; individual Actors may embed extra AI logic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No‑code / automation tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make and Zapier blocks for drag‑and‑drop flows&lt;/td&gt;
&lt;td&gt;Zapier app with branching, filters, and error handling for complex automations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps / CI · CD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;GitHub/Bitbucket integration for version control, tests, and CI‑based deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data‑pipeline connectors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;– (pull via API or SDK)&lt;/td&gt;
&lt;td&gt;Export Actor results to S3, GCS, Azure; real‑time streaming to Kafka/Pub/Sub/webhooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Webhook support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Crawl/scrape lifecycle webhooks for async callbacks&lt;/td&gt;
&lt;td&gt;Webhooks on Actor events; plus metamorph &amp;amp; transform steps for streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token‑usage optimization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTML→Markdown plus loader‑level chunk sizing &amp;amp; token‑budget helpers&lt;/td&gt;
&lt;td&gt;Up to individual Actor / downstream loader; no platform‑wide token controls&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Picking the right platform for your workload
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Firecrawl is the better fit&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Choose Firecrawl when &lt;strong&gt;low‑latency&lt;/strong&gt; access to web data and &lt;strong&gt;tight coupling with AI pipelines&lt;/strong&gt; are top priorities. Its uniform API, natural‑language extraction, and sub‑second response times let you build chatbots, RAG systems, or research agents without wrestling with scraping logic. Pricing is credit‑based and predictable. If you know you’ll process about 50k pages a month, you can budget the spend to the dollar and count on the same performance every day. The open‑source core offers a self‑host option for teams with data‑residency mandates or those who simply want an exit ramp from the hosted service.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;When Apify delivers more value&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Apify is the best option when &lt;strong&gt;breadth&lt;/strong&gt;, &lt;strong&gt;flexibility&lt;/strong&gt;, and &lt;strong&gt;enterprise&lt;/strong&gt; guarantees matter. Its marketplace of 6,000‑plus maintained Actors means you can &lt;strong&gt;cover dozens of sites in hours&lt;/strong&gt; instead of building each scraper yourself. Non‑developers can launch and schedule those Actors through a web UI, while engineering teams still have full SDK control when needed. SOC 2 compliance, GDPR alignment, and a history of &lt;strong&gt;large‑scale deployments&lt;/strong&gt; satisfy procurement checklists, and the platform’s elastic Actor model handles everything from tiny HTML scrapes to overnight catalog crawls without manual capacity planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get started with Apify&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>5 best JavaScript web scraping libraries in 2025</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 18:30:00 +0000</pubDate>
      <link>https://dev.to/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</link>
      <guid>https://dev.to/apify/5-best-javascript-web-scraping-libraries-in-2025-4mf2</guid>
      <description>&lt;h3&gt;
  
  
  Even if you're not a JavaScript developer, it's worth knowing these libraries for parsing HTML, interacting with pages, and dealing with dynamic content.
&lt;/h3&gt;

&lt;p&gt;Websites are becoming increasingly complex and dynamic. The modern web is full of JavaScript-rendered apps that load content asynchronously, use auth systems involving multiple steps and JavaScript-based token handling, and block scraping bots.&lt;/p&gt;

&lt;p&gt;That's why JavaScript is still a great choice for collecting web data in 2025. But if you're a developer new to web scraping or unfamiliar with the JavaScript language, you're probably wondering which libraries and frameworks you should try.&lt;/p&gt;

&lt;p&gt;At Apify, we've been scraping the web with JavaScript and Node.js for a decade. This selection of 5 libraries is informed by our experience of using them for data extraction, from parsing HTML to navigating web pages and scraping dynamic content.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Crawlee
&lt;/h2&gt;

&lt;p&gt;Tackling complexity, stealth, and scalability in one package&lt;/p&gt;

&lt;p&gt;Juggling multiple libraries for requests, parsing, browser automation, and crawling logic quickly becomes a maintenance headache. You end up writing glue code to handle queues, rotate proxies, and merge results, only to find you still get blocked or stranded when scale increases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/js?__hstc=160404322.210f49410a426b754a1e9d8c7a300289.1753096943609.1753284250052.1753339598956.12&amp;amp;__hssc=160404322.11.1753339598956&amp;amp;__hsfp=2557066438" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, developed by the Apify team, unifies everything under a single interface. Out of the box, it mimics real browsers (headers, TLS fingerprints, and even stealth plugins) so you avoid common anti-bot defenses without manual header or fingerprint tweaking. Instead of wiring together Cheerio + Playwright/Puppeteer + queue managers, Crawlee provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Switchable crawler classes&lt;/strong&gt;: &lt;code&gt;CheerioCrawler&lt;/code&gt; for static HTML, &lt;code&gt;PlaywrightCrawler&lt;/code&gt; or &lt;code&gt;PuppeteerCrawler&lt;/code&gt; for dynamic pages, all sharing a common configuration style.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Built-in queue management&lt;/strong&gt;: Breadth-first or depth-first crawling with concurrency settings, retry logic, and automatic backoff. You define start URLs; Crawlee handles enqueuing, prioritization, and scheduling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic proxy rotation and session handling&lt;/strong&gt;: Effectively rotate proxies or manage cookies and browser contexts, so you stay under rate limits and maintain logins across multiple pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable data storages&lt;/strong&gt;: Datasets (JSON, CSV, or key-value stores) appear in a local “datasets” directory, making it trivial to persist results or resume failed crawls.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle hooks and customizability&lt;/strong&gt;: Logging, error handling, and custom request handlers via routers, so you can insert your own logic at enqueue, request success, or failure without rewriting core code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Native integration with the Apify platform&lt;/strong&gt;: Once your crawler is ready, running &lt;code&gt;apify push&lt;/code&gt; deploys it, and Apify handles autoscaling, proxy billing, and data exports. No extra configuration needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Starter templates and file structure&lt;/strong&gt;: When you run &lt;code&gt;npx crawlee create my-crawler&lt;/code&gt;, you get a &lt;code&gt;main.js&lt;/code&gt; and &lt;code&gt;routes.js&lt;/code&gt; setup. Boilerplate code means you can focus on selectors rather than instantiating browser instances, setting headers, or wiring queues. The default file structure looks like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/my-crawler
├── main.js      &lt;span class="c"&gt;# entry point: initializes crawler class and starts run()&lt;/span&gt;
├── routes.js    &lt;span class="c"&gt;# defines request handlers via createCheerioRouter/createPlaywrightRouter&lt;/span&gt;
├── storages /
       datasets/    &lt;span class="c"&gt;# where results are stored as JSON files per page&lt;/span&gt;
├──    key-value-stores/  &lt;span class="c"&gt;# storage for arbitrary binary data (images, videos, JSON files…)&lt;/span&gt;
└── package.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Cheerio crawler)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// routes.js&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;createCheerioRouter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createCheerioRouter&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;router&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addDefaultHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`enqueueing new URLs`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// Finds “next” pages and enqueues them&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;globs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/?p=*&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Extract post URL, title, rank&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;postUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toArray&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Push to dataset for automatic file output&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
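&lt;p&gt;The &lt;code&gt;routes.js&lt;/code&gt; above is only half the picture: &lt;code&gt;main.js&lt;/code&gt; picks the crawler class and kicks off the run. A minimal sketch of that entry point, using the same start URL as the handler above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// main.js
import { CheerioCrawler } from "crawlee";
import { router } from "./routes.js";

const crawler = new CheerioCrawler({
    requestHandler: router,   // the router defined in routes.js
    maxRequestsPerCrawl: 50,  // safety cap while developing
});

await crawler.run(["https://news.ycombinator.com/"]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;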



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity:&lt;/strong&gt; No more separate proxy rotation libraries, queue managers, or manual header generators. Crawlee handles it all.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt;: You can start locally and then deploy to the Apify platform, where it auto‐scales, monitors memory/CPU, and logs failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt;: Switching from CheerioCrawler to PlaywrightCrawler only requires changing one import and maybe tweaking selectors. The core logic stays the same.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  2. Impit
&lt;/h2&gt;

&lt;p&gt;Making browser impersonation simple&lt;/p&gt;

&lt;p&gt;Sending vanilla HTTP requests often gets you blocked by modern anti-scraping systems. You might spend hours rotating user-agents, randomizing delays, or solving CAPTCHAs manually, only to still find your IP banned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/apify/impit" rel="noopener noreferrer"&gt;Impit&lt;/a&gt; is an HTTP client for Node.js and Python, based on Rust’s &lt;code&gt;reqwest&lt;/code&gt;, specifically tailored for scraping. Instead of wrestling with header spoofing or TLS fingerprinting yourself, you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic fingerprint spoofing:&lt;/strong&gt; Pick from a library of existing browser fingerprints, and impit builds a full set of realistic HTTP headers and matching TLS settings. This makes your requests indistinguishable from browser requests and reduces detection risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated &lt;code&gt;tough-cookie&lt;/code&gt; support&lt;/strong&gt;: Handle session cookies out of the box, so you can maintain login sessions or track redirects using the most popular JS cookie library.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fetch&lt;/code&gt; API:&lt;/strong&gt; Impit implements a subset of the well-known &lt;code&gt;fetch&lt;/code&gt; API (&lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Fetch_API" rel="noopener noreferrer"&gt;MDN&lt;/a&gt;), so you can write your scrapers without having to read lengthy docs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy integration&lt;/strong&gt;: Support for HTTP and HTTPS proxies via a single option, so you can rotate IPs with minimal code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (impersonating Firefox)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Impit&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;impit&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Impit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;firefox&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;http3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;impit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// raw HTML&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;fetchHtml&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stealth:&lt;/strong&gt; You no longer manually assemble user-agent strings or randomize headers; impit covers 95% of common anti-bot checks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling:&lt;/strong&gt; Configurable retries and timeouts mean fewer surprises when a request fails.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Cheerio
&lt;/h2&gt;

&lt;p&gt;Converting unruly HTML into structured data&lt;/p&gt;

&lt;p&gt;Plain HTML is cluttered: nested tags, inconsistent class names, and no programmatic way to navigate the DOM on the server. If you’ve written custom regex or string-based parsers, you know how brittle that can be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cheerio.js.org/" rel="noopener noreferrer"&gt;Cheerio&lt;/a&gt; loads raw HTML into a fast, jQuery-like API on the server. You can query for elements, attributes, and text using familiar CSS selectors, then extract exactly what you need without worrying about manual string manipulation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (parsing Hacker News)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;gotScraping&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;got-scraping&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cheerio&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;gotScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cheerio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;fetchTitles&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Robustness:&lt;/strong&gt; No more fragile regex. With Cheerio, you use &lt;code&gt;.find()&lt;/code&gt;, &lt;code&gt;.text()&lt;/code&gt;, and &lt;code&gt;.attr()&lt;/code&gt; just like jQuery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Cheerio is lightweight and blazingly fast, so parsing large HTML documents doesn’t become your scraper’s bottleneck, especially compared to full headless browsers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Familiar syntax:&lt;/strong&gt; If you’ve used jQuery on the front end, there’s almost zero onboarding time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Playwright
&lt;/h2&gt;

&lt;p&gt;Handling JavaScript‐driven, dynamic content reliably&lt;/p&gt;

&lt;p&gt;Many modern websites rely on client-side JavaScript to populate the DOM, for lazy loading, infinite scrolling, or data fetched via XHR/AJAX. Cheerio has no power here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://playwright.dev/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt;, on the other hand, spins up a real browser (Chromium, Firefox, or WebKit), navigates pages as a human would, waits for selectors or network to idle, and then gives you a fully rendered DOM snapshot. You can even intercept requests to block ads or unwanted resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (Amazon product page)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.amazon.com/Hitchhikers-Guide-Galaxy-Douglas-Adams-ebook/dp/B000XUBC2C/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;book&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#productTitle&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;author&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;span.author a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;kindlePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#formats span.ebook-price-value&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;paperbackPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-PAPERBACK .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;hardcoverPrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;#tmm-grid-swatch-HARDCOVER .slot-price span&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;book&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeAmazon&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reliability:&lt;/strong&gt; If the data isn’t in the initial HTML, you need a browser to run the page’s JS. Playwright ensures you get exactly what a real user sees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible waits:&lt;/strong&gt; You can &lt;code&gt;await page.waitForSelector()&lt;/code&gt; or pass &lt;code&gt;waitUntil: "networkidle"&lt;/code&gt; to &lt;code&gt;page.goto()&lt;/code&gt; so you only scrape once all resources load, reducing flaky results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercepting resources:&lt;/strong&gt; Block images, CSS, or analytics endpoints to speed up scrapes and reduce noise in your logs (see the sketch below).&lt;/li&gt;
&lt;/ul&gt;
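&lt;p&gt;A minimal sketch of that last point, blocking heavyweight resources with Playwright's request interception (the glob pattern is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// Abort requests for images, fonts, and stylesheets before they are fetched
await page.route("**/*.{png,jpg,jpeg,svg,woff,woff2,css}", (route) =&amp;gt; route.abort());

// Everything else proceeds normally
await page.goto("https://news.ycombinator.com/", { waitUntil: "networkidle" });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;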

&lt;h2&gt;
  
  
  5. Puppeteer
&lt;/h2&gt;

&lt;p&gt;A Chrome-centric approach to browser automation&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pptr.dev/" rel="noopener noreferrer"&gt;Puppeteer&lt;/a&gt; and Playwright are pretty much the same thing, except for some minor differences in API, and unlike Playwright, it's limited to JavaScript and Node.js. Puppeteer is older, but only recently did Firefox &lt;a href="https://hacks.mozilla.org/2024/08/puppeteer-support-for-firefox/" rel="noopener noreferrer"&gt;add official support&lt;/a&gt; for Puppeteer.&lt;/p&gt;

&lt;p&gt;Switching to Playwright isn't difficult, but if you prefer Chromium’s engine and a Node.js-only tool, Puppeteer remains a good option.&lt;/p&gt;

&lt;p&gt;Puppeteer gives you a headless (or headed) Chrome instance with an easy API for navigation, selection, and evaluation. It supports intercepting requests, generating PDFs, and capturing screenshots. While it doesn’t include the same cross-browser support as Playwright, it’s been around for longer and integrates well with Chrome DevTools Protocol.&lt;/p&gt;
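&lt;p&gt;Those extras are essentially one-liners. For instance, a quick sketch of exporting an already-open page as a PDF and a full-page screenshot (file paths are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;// With an open `page`, save it as a PDF and a full-page screenshot
await page.pdf({ path: "hn.pdf", format: "A4" });
await page.screenshot({ path: "hn.png", fullPage: true });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;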

&lt;h3&gt;
  
  
  &lt;strong&gt;Code snapshot (basic Puppeteer scraper)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;puppeteer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;networkidle2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;articles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nx"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
      &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)?.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;articles&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;scrapeSite&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Why this matters for your workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Existing Puppeteer code:&lt;/strong&gt; Migrate incrementally or reuse libraries that depend on Puppeteer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome-only features:&lt;/strong&gt; Use DevTools Protocol to capture screenshots, trace performance, or emulate network conditions without additional dependencies (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight automation needs:&lt;/strong&gt; If you only need a headless Chrome for a few pages and already have an IP rotation or session management solution, Puppeteer might be the simplest choice.&lt;/li&gt;
&lt;/ul&gt;
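
&lt;p&gt;As a rough illustration of those DevTools-powered capabilities, the sketch below throttles the network and captures a full-page screenshot. It assumes a recent Puppeteer version where &lt;code&gt;page.createCDPSession()&lt;/code&gt; is available; the URL and throughput numbers are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from "puppeteer";

async function auditPage() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Talk to the Chrome DevTools Protocol directly to emulate a slow connection
  const client = await page.createCDPSession();
  await client.send("Network.emulateNetworkConditions", {
    offline: false,
    latency: 200, // extra round-trip latency in ms
    downloadThroughput: (750 * 1024) / 8, // ~750 kbps
    uploadThroughput: (250 * 1024) / 8, // ~250 kbps
  });

  await page.goto("https://example.com/", { waitUntil: "networkidle2" });

  // Capture the fully rendered page as an image
  await page.screenshot({ path: "page.png", fullPage: true });
  await browser.close();
}
auditPage();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
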

&lt;h2&gt;
  
  
  In summary
&lt;/h2&gt;

&lt;p&gt;JavaScript remains a top choice for web scraping in 2025, thanks to its solid ecosystem of open-source libraries, which make it easier to parse HTML, interact with web pages, and deal with dynamic content. In our opinion, these five are the best:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Crawlee&lt;/strong&gt; - A comprehensive, all-in-one scraping framework that handles browser automation, proxy rotation, session management, queuing, and data storage. It simplifies scaling and maintenance by unifying multiple tools under one interface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impit&lt;/strong&gt; - A stealthy HTTP client tailored for scraping, with automatic realistic header generation, cookie jar support, and proxy integration - ideal for scraping without a full browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cheerio&lt;/strong&gt; - A fast and lightweight HTML parser that mimics jQuery, perfect for extracting structured data from static HTML without using a browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright&lt;/strong&gt; - A full browser automation library for scraping JavaScript-rendered sites. It supports multiple browsers, waits for content to load, and intercepts resources, making it highly reliable for dynamic pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Puppeteer&lt;/strong&gt; - A headless Chrome automation tool with strong DevTools support. It’s suitable for existing Puppeteer codebases or lightweight scraping needs focused on Chromium.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whether you’re parsing static HTML or navigating complex, JavaScript-rendered pages, this toolkit helps you choose and combine the best options for performance, stealth, and scalability.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Build and deploy MCP servers in minutes with a TypeScript template</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Thu, 24 Jul 2025 07:33:07 +0000</pubDate>
      <link>https://dev.to/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</link>
      <guid>https://dev.to/apify/build-and-deploy-mcp-servers-in-minutes-with-a-typescript-template-4mi5</guid>
      <description>&lt;h3&gt;
  
  
  Transform any stdio MCP server into a scalable, cloud-hosted service.
&lt;/h3&gt;

&lt;p&gt;Model Context Protocol (&lt;a href="https://blog.apify.com/what-is-model-context-protocol/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt;) is transforming and simplifying how AI applications connect with external tools. While &lt;a href="https://blog.apify.com/how-to-use-mcp/" rel="noopener noreferrer"&gt;we’ve covered how to use MCP with tools that give agents context from the web&lt;/a&gt;, this guide digs deeper into the developer side: how to build and deploy your own MCP servers on the Apify platform.&lt;/p&gt;

&lt;p&gt;With Apify's MCP templates, you can transform any stdio or remote MCP server into a scalable, cloud-hosted service in minutes. There are currently two templates available: &lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;for Python&lt;/a&gt; and &lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;for TypeScript&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll show you how to build an MCP server on Apify with TypeScript.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why deploy MCP servers on Apify?
&lt;/h2&gt;

&lt;p&gt;Before we get into implementation, here’s why the Apify platform is ideal for hosting MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1. Instant scalability&lt;/strong&gt;: Apify's infrastructure automatically scales based on demand, from a single request to thousands of concurrent connections.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2. Built-in monetization&lt;/strong&gt;: With the &lt;a href="https://help.apify.com/en/articles/10700066-what-is-pay-per-event" rel="noopener noreferrer"&gt;pay-per-event&lt;/a&gt; (PPE) model, you can charge users for each tool request, API call, or custom event, turning your MCP server into a revenue stream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3. Persistent URLs&lt;/strong&gt;: &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/standby" rel="noopener noreferrer"&gt;Standby mode&lt;/a&gt; provides stable endpoints like &lt;code&gt;https://your-username--your-mcp-server.apify.actor/sse&lt;/code&gt;, perfect for MCP client configurations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4. Zero infrastructure management&lt;/strong&gt;: No servers to maintain, no Docker orchestration, no SSL certificates. Just deploy and run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding the architecture
&lt;/h2&gt;

&lt;p&gt;When you deploy an MCP server on Apify, you can work with two types of MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio MCP servers&lt;/strong&gt;: Local servers that communicate via standard input/output, which Apify converts to SSE (Server-Sent Events) for remote access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE MCP servers&lt;/strong&gt;: Remote servers that already communicate via HTTP/SSE, which Apify can proxy and enhance with monetization features&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexible architecture allows you to turn any type of MCP server into an &lt;a href="https://apify.com/actors" rel="noopener noreferrer"&gt;Apify Actor&lt;/a&gt;, whether it's a local stdio-based tool or a remote SSE endpoint, and expose it through a unified SSE interface with built-in scaling and monetization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Actors are lightweight, containerized programs that take JSON inputs, execute tasks, and return structured outputs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step-by-step implementation
&lt;/h2&gt;

&lt;p&gt;Let’s walk through building an MCP server on Apify using the TypeScript template.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Choose and create your Actor from the template&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the TypeScript MCP server from template
apify create my-mcp-server --template ts-mcp-server
cd my-mcp-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Configure your MCP server&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Open &lt;code&gt;src/main.ts&lt;/code&gt; and set the &lt;code&gt;MCP_COMMAND&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// For stdio servers:&lt;/span&gt;
&lt;span class="c1"&gt;// Example: Everything MCP server&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx @modelcontextprotocol/server-everything&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// For SSE servers (requires mcp-remote package):&lt;/span&gt;
&lt;span class="c1"&gt;// Custom SSE endpoint&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MCP_COMMAND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;npx mcp-remote https://your-domain.com/mcp-endpoint&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Install your MCP server dependencies&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Update &lt;code&gt;package.json&lt;/code&gt; with the MCP server dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dependencies"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"@modelcontextprotocol/server-everything"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^2025.5.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"mcp-remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"^0.1.16"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you’re developing the Actor locally, you can use &lt;code&gt;npm install&lt;/code&gt; instead of editing &lt;code&gt;package.json&lt;/code&gt; directly.&lt;/p&gt;
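
&lt;p&gt;For example, installing the two packages from the snippet above locally could look like this (versions are resolved by npm):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @modelcontextprotocol/server-everything mcp-remote
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
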

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Set up monetization (optional)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/pay_per_event.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool-request"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventTitle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Tool Request"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventDescription"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Charge for each tool execution"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"eventPriceUsd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trigger charges in your code:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For TypeScript:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;eventName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tool-request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure Actor settings&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;.actor/actor.json&lt;/code&gt;, set &lt;code&gt;webServerMcpPath&lt;/code&gt; to &lt;code&gt;/sse&lt;/code&gt; so the Actor is recognized as an MCP server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"actorSpecification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-mcp-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usesStandbyMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"minMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"maxMemoryMbytes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"webServerMcpPath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/sse"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;So&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Actor&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;recognized&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;MCP&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;server&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 6: Deploy to Apify&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apify login
apify push

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 7: Configure standby mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In Apify Console:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to your Actor’s settings&lt;/li&gt;
&lt;li&gt;Enable “Standby mode”&lt;/li&gt;
&lt;li&gt;Set idle timeout (e.g., 300 seconds)&lt;/li&gt;
&lt;li&gt;Adjust memory allocation as needed&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 8: Connect your MCP client&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use this URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://your-username--my-mcp-server.apify.actor/sse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Point your MCP client at that URL and set the &lt;code&gt;Authorization&lt;/code&gt; header with your bearer token: &lt;code&gt;Authorization: Bearer your-token&lt;/code&gt;.&lt;/p&gt;
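
&lt;p&gt;The exact configuration depends on your MCP client. As a rough sketch, a client that reads an &lt;code&gt;mcpServers&lt;/code&gt;-style JSON config could proxy the SSE endpoint through &lt;code&gt;mcp-remote&lt;/code&gt;, assuming your &lt;code&gt;mcp-remote&lt;/code&gt; version supports passing headers via &lt;code&gt;--header&lt;/code&gt;; the server name, URL, and token below are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "mcpServers": {
    "my-mcp-server": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://your-username--my-mcp-server.apify.actor/sse",
        "--header",
        "Authorization: Bearer your-token"
      ]
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
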

&lt;h2&gt;
  
  
  Advanced configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Environment variables&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To set non-sensitive environment variables, use &lt;code&gt;actor.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"environmentVariables"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"RATE_LIMIT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Do not put API tokens or other sensitive values in the &lt;code&gt;actor.json&lt;/code&gt; file. Instead, set sensitive environment variables in the Apify Console UI under your Actor's settings before building. For more information on setting custom environment variables, see the &lt;a href="https://docs.apify.com/platform/actors/development/programming-interface/environment-variables#custom-environment-variables" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
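
&lt;p&gt;However they are set, the variables arrive in your Actor as ordinary process environment variables. A minimal sketch of consuming them in &lt;code&gt;src/main.ts&lt;/code&gt; (the fallback value and the &lt;code&gt;MY_API_TOKEN&lt;/code&gt; name are illustrative assumptions, not part of the template):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Non-sensitive value defined in actor.json, with a fallback when unset
const rateLimit = Number(process.env.RATE_LIMIT ?? "100");

// A secret configured in the Apify Console is read the same way
const apiToken = process.env.MY_API_TOKEN;

console.log(`Rate limit: ${rateLimit}, token set: ${Boolean(apiToken)}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
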

&lt;h3&gt;
  
  
  &lt;strong&gt;TypeScript template capabilities&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The TypeScript template supports both stdio and SSE server types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;stdio servers&lt;/strong&gt;: Simple command string configuration for local MCP servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE servers&lt;/strong&gt;: Remote server proxying using mcp-remote for connecting to external SSE endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Debugging and monitoring
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Local development&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To run the MCP server locally, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;APIFY_META_ORIGIN="STANDBY" ACTOR_WEB_SERVER_PORT=8080 apify run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use the &lt;a href="https://github.com/modelcontextprotocol/inspector" rel="noopener noreferrer"&gt;MCP inspector&lt;/a&gt; on GitHub for debugging and testing locally.&lt;/p&gt;
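
&lt;p&gt;For example, you can launch the inspector with &lt;code&gt;npx&lt;/code&gt; and point it at the locally running server (the port and &lt;code&gt;/sse&lt;/code&gt; path follow from the configuration above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx @modelcontextprotocol/inspector
# then connect the inspector UI to http://localhost:8080/sse
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
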

&lt;h3&gt;
  
  
  &lt;strong&gt;Production monitoring&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;View Actor logs in Apify Console&lt;/li&gt;
&lt;li&gt;Set up error/usage alerts&lt;/li&gt;
&lt;li&gt;Track monetization in Analytics&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Troubleshooting common issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory errors&lt;/strong&gt;: Increase memory or optimize code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth failures&lt;/strong&gt;: Check tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What next?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add more tools&lt;/li&gt;
&lt;li&gt;Integrate with Apify Actors&lt;/li&gt;
&lt;li&gt;Build custom clients&lt;/li&gt;
&lt;li&gt;Share your server on &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Apify Store&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Start building&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Deploying MCP servers on the Apify platform turns local tools into scalable, monetizable cloud services. With our TypeScript template and standby mode, you can get a production-ready server running in minutes, whether it's a local stdio server or an existing SSE server.&lt;/p&gt;

&lt;p&gt;Explore what’s possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/python-mcp-server" rel="noopener noreferrer"&gt;Python MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apify.com/templates/ts-mcp-server" rel="noopener noreferrer"&gt;TypeScript MCP server template&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.apify.com/platform/integrations/mcp" rel="noopener noreferrer"&gt;Apify MCP documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol specification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;Join the Apify Discord&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>webdev</category>
      <category>javascript</category>
    </item>
    <item>
      <title>10 web scraping challenges (+ solutions) in 2025</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Thu, 05 Dec 2024 15:04:45 +0000</pubDate>
      <link>https://dev.to/apify/10-web-scraping-challenges-solutions-in-2025-5bhd</link>
      <guid>https://dev.to/apify/10-web-scraping-challenges-solutions-in-2025-5bhd</guid>
      <description>&lt;p&gt;Web scraping comes with its fair share of challenges. Websites are becoming increasingly difficult to scrape due to the rise of anti-scraping measures like CAPTCHAs and browser fingerprinting. At the same time, the demand for data, especially to fuel AI, is higher than ever. &lt;/p&gt;

&lt;p&gt;As you probably know, web scraping isn’t always a stress-free process, but learning how to navigate these obstacles can be incredibly rewarding.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll cover 10 common problems you’re likely to encounter when scraping the web and, just as importantly, how to solve them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#1-dynamic-content" rel="noopener noreferrer"&gt;Dynamic content
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://blog.apify.com/web-scraping-challenges/#2-user-agents-and-browser-fingerprinting" rel="noopener noreferrer"&gt;User agents and browser fingerprinting
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#3-rate-limiting" rel="noopener noreferrer"&gt;Rate limiting&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#4-ip-bans" rel="noopener noreferrer"&gt;IP bans
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#5-honeypot-traps" rel="noopener noreferrer"&gt;Honeypot traps&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#6-captchas" rel="noopener noreferrer"&gt;CAPTCHAs
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#7-data-storage-and-organization" rel="noopener noreferrer"&gt;Data storage and organization
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt; &lt;a href="https://blog.apify.com/web-scraping-challenges/#8-automation-and-monitoring" rel="noopener noreferrer"&gt;Automation and monitoring
&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#9-scalability-and-reliability" rel="noopener noreferrer"&gt;Scalability and reliability
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/web-scraping-challenges/#10-real-time-data-scraping" rel="noopener noreferrer"&gt;Real-time data scraping&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the solutions, we’ll use Crawlee, an open-source library for Python and Node.js, and the Apify platform. These tools make life easier, but the techniques we’ll talk about can be used with other tools as well. By the end, you’ll have a solid understanding of how to overcome some of the toughest hurdles web scraping can throw at you. &lt;/p&gt;

&lt;h2&gt;
  
  
  1. Dynamic content
&lt;/h2&gt;

&lt;p&gt;Modern websites often use JavaScript frameworks like React, Angular, or Vue.js to create dynamic and interactive experiences. These &lt;a href="https://blog.apify.com/scraping-single-page-applications-with-playwright/" rel="noopener noreferrer"&gt;single-page applications (SPAs)&lt;/a&gt; load content on the fly without refreshing the page, which is great for users but can complicate web scraping.&lt;/p&gt;

&lt;p&gt;Traditional scrapers that pull raw HTML often miss data generated by JavaScript after the page loads. To &lt;a href="https://blog.apify.com/scrape-dynamic-websites-with-python/" rel="noopener noreferrer"&gt;capture dynamically loaded content&lt;/a&gt;, scrapers need to execute JavaScript and interact with the page, just like a browser.&lt;/p&gt;

&lt;p&gt;That’s where headless browsers like Playwright, Puppeteer, or Selenium come in. They mimic real browsers, loading JavaScript and revealing the data you need.&lt;/p&gt;

&lt;p&gt;In the example below, we’re using &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;, an open-source web scraping library, with Playwright to scrape a dynamic page (MintMobile). While Playwright alone could handle this, Crawlee adds powerful web scraping features you’ll learn about in the next sections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;playwright&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;PlaywrightCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;launchContext&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Here you can set options that are passed to the playwright .launch() function.&lt;/span&gt;
        &lt;span class="na"&gt;launchOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;headless&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="na"&gt;launcher&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;firefox&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productInfo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;$eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;#WebPage&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;h1[data-qa="device-name"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[data-qa="storage-selection"] p:nth-child(1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;devicePrice&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a[data-qa="storage-selection"] p:nth-child(2)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
                &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;productInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`No product info found on &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Extracted product info from &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// Save the extracted data, e.g., push to Apify dataset&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productInfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Start the crawler with a list of product review pages&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://www.mintmobile.com/devices/samsung-galaxy-z-flip6/6473480/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. User agents and browser fingerprinting
&lt;/h2&gt;

&lt;p&gt;If a website blocks your scraper, you can’t access the data, which makes all your efforts pointless. To avoid this, you want to make your scrapers mimic real users as much as possible. Two basic elements of anti-bot defenses to keep in mind are &lt;strong&gt;user agents&lt;/strong&gt; and &lt;strong&gt;browser fingerprinting.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A user agent is a piece of metadata sent with every HTTP request, telling the website what browser and device are making the request. It looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your scraper uses something obvious like the default Axios user agent, &lt;code&gt;axios/1.7.2&lt;/code&gt;, the site will likely flag you as a bot and block your access.&lt;/p&gt;

&lt;p&gt;Fingerprinting takes it a step further. Websites analyze details like your screen resolution, installed fonts, timezone, language, and even whether the browser is running in headless mode. All this data creates a unique “fingerprint” for your scraper. If your fingerprint looks too uniform or lacks variety, like using the same resolution or timezone across all requests, you’re more likely to get caught. Some sites can even track you across sessions, bypassing tactics like IP rotation.&lt;/p&gt;

&lt;p&gt;As you can imagine, manually managing user agents and fingerprints can be a headache: it's time-consuming, error-prone, and hard to keep up with as websites constantly improve their defenses.&lt;/p&gt;

&lt;p&gt;Thankfully, modern open-source tools like Crawlee take care of these challenges for us. Crawlee automatically applies matching user agents and fingerprints to our requests so our bots appear “human-like.” Its &lt;code&gt;PlaywrightCrawler&lt;/code&gt; and &lt;code&gt;PuppeteerCrawler&lt;/code&gt; also make headless browsers behave like real ones, lowering your chances of detection, which is why I opted to use Playwright with Crawlee in the first section 😉&lt;/p&gt;
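
&lt;p&gt;If you want to tune those defaults rather than rely on them, Crawlee exposes fingerprint generation options on its browser pool. Here's a rough sketch; the option values are illustrative, not recommendations, and the URL is a placeholder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    browserPoolOptions: {
        // Fingerprint injection is on by default; shown here for clarity
        useFingerprints: true,
        fingerprintOptions: {
            fingerprintGeneratorOptions: {
                browsers: [{ name: 'firefox', minVersion: 96 }],
                devices: ['desktop'],
                operatingSystems: ['windows'],
            },
        },
    },
    async requestHandler({ page, request, log }) {
        // The generated fingerprint and matching headers are already applied here
        log.info(`Visiting ${request.url}`);
    },
});

await crawler.run(['https://example.com/']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
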

&lt;h2&gt;
  
  
  3. Rate limiting
&lt;/h2&gt;

&lt;p&gt;Rate limiting is how websites keep things under control by capping the number of requests a user or IP can make within a set time frame. This helps prevent server overload, defend against DoS attacks, and discourage automated scrapers. If your scraper goes over the limit, the server might respond with a &lt;strong&gt;429 Too Many Requests&lt;/strong&gt; error or even block your IP temporarily. This can be a major roadblock, interrupting your data collection and leaving you with incomplete results.&lt;/p&gt;

&lt;p&gt;To solve this issue, you need to manage your request rates and stay within the website’s limits. Crawlee makes this easy by offering options to fine-tune how many requests your scraper sends at once, how many it sends per minute, and how it scales based on your system’s resources. This gives you the flexibility to adjust your scraper to avoid hitting rate limits while maintaining strong performance.&lt;/p&gt;

&lt;p&gt;Here’s an example of how to handle rate limiting using Crawlee’s &lt;strong&gt;CheerioCrawler&lt;/strong&gt; with adaptive concurrency to scrape &lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News:&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="c1"&gt;// Ensure there will always be at least 2 concurrent requests&lt;/span&gt;
    &lt;span class="na"&gt;minConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// Prevent the crawler from exceeding 20 concurrent requests&lt;/span&gt;
    &lt;span class="na"&gt;maxConcurrency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...but also ensure the crawler never exceeds 250 requests per minute&lt;/span&gt;
    &lt;span class="na"&gt;maxRequestsPerMinute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data using Cheerio&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the results to the default dataset.&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Find a link to the next page and enqueue it if it exists.&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.morelink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;infos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedRequests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is the last page!`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Run the crawler and wait for it to finish.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler finished.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. IP bans
&lt;/h2&gt;

&lt;p&gt;Building on the discussion about rate limiting, IP bans are another common issue you might have come across when scraping the web. Simply put, when a scraper sends too many requests too quickly or behaves in ways that don’t seem natural, the server might block the IP address, either temporarily or permanently. When that happens, your data collection comes to a complete halt, and naturally, we want to prevent this from happening.&lt;/p&gt;

&lt;p&gt;While managing your scraper’s concurrency can help avoid this, sometimes it’s not enough. If you’re still running into blocks, using proxy rotation is a great next step. By rotating IP addresses, you can spread out your requests and make it harder for websites to flag and block your crawler’s activity.&lt;/p&gt;

&lt;p&gt;With Crawlee, adding proxies is straightforward. Whether you’re using your own servers or working with a third-party provider, Crawlee handles the rotation automatically, ensuring your requests come from different IPs.&lt;/p&gt;

&lt;p&gt;If you already have a list of proxies ready, integrating them into your Crawlee scraper takes just a few lines of code. Here’s how you can do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;proxyUrls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://proxy-1.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://proxy-2.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;// ...rest of the code&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can use a third-party tool like &lt;a href="https://apify.com/proxy" rel="noopener noreferrer"&gt;Apify Proxy&lt;/a&gt; to access a large pool of residential and datacenter proxies, making proxy management even easier. It also gives you added flexibility by letting you control proxy groups and country codes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apify&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Actor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createProxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;RESIDENTIAL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="na"&gt;countryCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;US&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proxyUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;proxyConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newUrl&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Honeypot traps
&lt;/h2&gt;

&lt;p&gt;Honeypot traps are hidden elements in a website’s HTML designed to detect and block automated bots and scrapers. These traps, like hidden links, forms, or buttons, are invisible to regular users but can be accidentally triggered by scrapers that process every element indiscriminately. When this happens, it signals bot activity to the website, often resulting in blocks, IP bans, and other issues. In short, you want to keep your scraper far away from these traps.&lt;/p&gt;

&lt;p&gt;One way to avoid these traps is by filtering out hidden elements. You can check for CSS properties such as &lt;code&gt;display: none&lt;/code&gt; and &lt;code&gt;visibility: hidden&lt;/code&gt; to exclude them from your scraping process.&lt;/p&gt;

&lt;p&gt;Another approach is to simulate real user behavior. Instead of scraping the entire HTML, focus on specific sections of the page where the data is located. Mimicking real interactions, like clicking on visible elements or navigating the page, helps your scraper appear more human-like and prevents it from interacting with invisible elements that a user wouldn’t be aware of.&lt;/p&gt;

&lt;p&gt;Here’s an example of how you could modify the Hacker News scraper from the earlier section to filter out Honeypot traps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;crawlee&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CheerioCrawler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;requestHandler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;log&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Processing &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Function to check if an element is visible (filter out Honeypots)&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isElementVisible&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;style&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;display&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visibility&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;opacity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;height&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;width&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]);&lt;/span&gt;
            &lt;span class="k"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;visibility&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hidden&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
                &lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;opacity&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract data using Cheerio while avoiding Honeypot traps&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.athing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;isElementVisible&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.rank&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                    &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;$element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.title a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;href&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Store the results to the default dataset.&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pushData&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Find a link to the next page and enqueue it if it exists.&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;infos&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;enqueueLinks&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;.morelink&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;infos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;processedRequests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; is the last page!`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addRequests&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://news.ycombinator.com/&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

&lt;span class="c1"&gt;// Run the crawler and wait for it to finish.&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;crawler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Crawler finished.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. CAPTCHAs
&lt;/h2&gt;

&lt;p&gt;CAPTCHAs, or Completely Automated Public Turing tests to tell Computers and Humans Apart, are the familiar challenges we’ve all seen: clicking on traffic lights or selecting crosswalks in image grids. While frustrating for humans, they are designed to block bots, which makes them one of the toughest obstacles for scrapers. Encountering one during scraping can bring your process to a halt, as bots can’t solve these puzzles on their own.&lt;/p&gt;

&lt;p&gt;The good news is that much of what we’ve already covered, like &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;avoiding honeypot traps, rotating IPs, and making your scraper mimic human behavior&lt;/a&gt;, also helps reduce the chances of triggering CAPTCHAs. Websites generally try to show CAPTCHAs only when the activity looks suspicious. By blending in with regular traffic through techniques like rotating IPs, randomizing interactions, and managing request patterns thoughtfully, your scraper can often bypass CAPTCHAs entirely.&lt;/p&gt;

&lt;p&gt;However, CAPTCHAs can still appear, even when precautions are in place. In such cases, your best bet is to integrate a CAPTCHA-solving service. Tools like Apify’s &lt;a href="https://apify.com/petr_cermak/anti-captcha-recaptcha" rel="noopener noreferrer"&gt;Anti Captcha Recaptcha Actor&lt;/a&gt;, which works with &lt;a href="https://anti-captcha.com/" rel="noopener noreferrer"&gt;Anti-Captcha&lt;/a&gt;, can help you equip your crawlers with CAPTCHA-solving capabilities to handle these challenges automatically and avoid disrupting your scraping.&lt;/p&gt;

&lt;p&gt;Here is an example of how you could use the Apify API to integrate the Anti Captcha Recaptcha Actor into your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ApifyClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;apify-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize the ApifyClient with API token&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ApifyClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Prepare Actor input&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cookies&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;name=value; name2=value2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;anticaptcha-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyAddress&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;8.8.8.8&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyLogin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;theLogin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyPassword&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thePassword&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyPort&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;proxyType&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;siteKey&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;6LfD3PIbAAAAAJs_eEHvoOl75_83eXSqpPSRFJ_u&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;userAgent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Opera 6.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;webUrl&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://2captcha.com/demo/recaptcha-v2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Run the Actor and wait for it to finish&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;petr_cermak/anti-captcha-recaptcha&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;})();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
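
&lt;p&gt;The &lt;code&gt;call()&lt;/code&gt; method waits for the Actor run to finish; to actually use the result, you then read the run’s output. Here’s a minimal sketch, assuming the solution is written to the run’s default key-value store under the &lt;code&gt;OUTPUT&lt;/code&gt; key (check the Actor’s documentation for the exact output format):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// The same input object as in the example above
const input = { /* cookies, key, proxy settings, siteKey, userAgent, webUrl */ };

// Run the CAPTCHA-solving Actor and wait for it to finish
const run = await client.actor('petr_cermak/anti-captcha-recaptcha').call(input);

// Read the solved token from the run's default key-value store (assumed OUTPUT record)
const record = await client.keyValueStore(run.defaultKeyValueStoreId).getRecord('OUTPUT');
console.log('CAPTCHA solution:', record ? record.value : 'no OUTPUT record found');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;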



&lt;h2&gt;
  
  
  7. Data storage and organization
&lt;/h2&gt;

&lt;p&gt;Storing and organizing data effectively is often overlooked in smaller projects but is actually a core component of any successful web scraping operation.&lt;/p&gt;

&lt;p&gt;While collecting data is the first step, how you store, access, and present it has a huge impact on its usability and scalability. Web scraping generates a mix of data types, from structured information like prices and reviews to unstructured content like PDFs and images. This variety demands flexible storage solutions. For small projects, simple CSV or JSON files stored locally might work, but as your needs grow, these methods can quickly fall short.&lt;/p&gt;
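
&lt;p&gt;For example, a small one-off scraper might simply dump its results into a local JSON file. A minimal sketch (the file name and items are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { writeFile } from 'node:fs/promises';

// Whatever array of records your scraper produced
const items = [
    { title: 'Example story', rank: '1.', href: 'https://example.com' },
];

// Persist the results as pretty-printed JSON next to the script
await writeFile('results.json', JSON.stringify(items, null, 2));
console.log(`Saved ${items.length} items to results.json`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;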

&lt;p&gt;For larger datasets or ongoing scraping, cloud-based solutions like &lt;a href="https://www.mongodb.com/" rel="noopener noreferrer"&gt;MongoDB&lt;/a&gt;, &lt;a href="https://aws.amazon.com/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; or &lt;a href="https://apify.com/storage" rel="noopener noreferrer"&gt;Apify Storage&lt;/a&gt; become necessary. They’re designed to handle large volumes of data and offer quick querying capabilities.&lt;/p&gt;

&lt;p&gt;One standout advantage of &lt;a href="https://apify.com/storage" rel="noopener noreferrer"&gt;Apify Storage&lt;/a&gt; is that it’s specifically designed to meet the needs of web scraping. It offers Datasets for structured data, Key-Value Stores for storing metadata or configurations, and Request Queues to help manage and track your scraping workflows. It integrates seamlessly with tools like Crawlee, provides API access for straightforward data retrieval and management, and supports exporting data in multiple formats.&lt;/p&gt;
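
&lt;p&gt;Here’s a minimal sketch of how those three storage types are used from the Apify JavaScript SDK inside an Actor (the values are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { Actor } from 'apify';

await Actor.init();

// Dataset: append structured results, one or more records per call
await Actor.pushData({ url: 'https://example.com', price: 42 });

// Key-value store: save metadata, configuration, or files
await Actor.setValue('LAST_RUN_INFO', { finishedAt: new Date().toISOString() });

// Request queue: enqueue URLs that still need to be processed
const requestQueue = await Actor.openRequestQueue();
await requestQueue.addRequest({ url: 'https://example.com/page-2' });

await Actor.exit();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;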

&lt;p&gt;Best of all, Apify Storage is just one piece of the comprehensive Apify platform, which delivers a full-stack solution for all your web scraping needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ycky6muzpocyjz216v8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ycky6muzpocyjz216v8.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Automation and monitoring
&lt;/h2&gt;

&lt;p&gt;Manually running scrapers every time you need fresh data is not practical, especially for projects requiring regular updates like price tracking, market research, or monitoring real-time changes.&lt;/p&gt;

&lt;p&gt;Automation ensures your workflows run on schedule, minimizing errors and keeping your data current, while monitoring helps detect and address issues like failed requests, CAPTCHAs, or website structure changes before they cause disruptions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/platform/monitoring" rel="noopener noreferrer"&gt;Apify Platform Monitoring&lt;/a&gt; simplifies this process by providing tools specifically designed for automating and monitoring web scraping workflows. With task scheduling, you can set your scrapers to run at specific intervals, ensuring consistent data updates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy82h2sk3qfw0rp1z04f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy82h2sk3qfw0rp1z04f.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As well as helping you automate scraping, Apify offers monitoring features that let you view task statuses, detailed logs, and error messages, keeping you informed about your scraper’s performance. Notifications and alerts can be configured to tell you about task completions or errors via email, Slack, or other integrations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0nw017pc4jkziffk03g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0nw017pc4jkziffk03g.png" alt=" " width="800" height="440"&gt;&lt;/a&gt;&lt;/p&gt;
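
&lt;p&gt;If you prefer to wire up alerting yourself, you can also register a webhook that calls an HTTP endpoint (for example, a Slack incoming webhook) whenever a run fails. A sketch with apify-client, where the token, Actor ID, and target URL are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Notify the given URL whenever a run of the specified Actor fails
await client.webhooks().create({
    eventTypes: ['ACTOR.RUN.FAILED'],
    condition: { actorId: 'YOUR_ACTOR_ID' },
    requestUrl: 'https://hooks.example.com/scraper-alerts',
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;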

&lt;h2&gt;
  
  
  9. Scalability and reliability
&lt;/h2&gt;

&lt;p&gt;Building a scalable and reliable scraping operation relies on the key principles we’ve covered: avoiding blocks, maintaining data consistency, storing collected data efficiently, and automating tasks with proper monitoring. Together, these elements create a solid foundation for a system that can grow with your needs while ensuring quality and performance remain intact.&lt;/p&gt;

&lt;p&gt;One crucial yet often overlooked aspect of scalability is infrastructure management. Handling your own servers can quickly turn into a costly and time-consuming challenge, especially as your project expands. That’s why choosing a robust cloud-based solution like Apify from the very start of your project is a smart choice. Designed for scalability, it automatically adjusts to your project’s needs, so you never have to worry about provisioning servers or hitting capacity limits. You only pay for what you use, keeping costs manageable while ensuring your scrapers keep running without interruption.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" rel="noopener noreferrer"&gt;Get a free Apify plan now!&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  10. Real-time data scraping
&lt;/h2&gt;

&lt;p&gt;The idea behind real-time data scraping is to continuously collect data as soon as it becomes available. This is often a critical requirement for projects involving time-sensitive data, such as stock market analysis, price monitoring, news aggregation, and tracking live trends.&lt;/p&gt;

&lt;p&gt;To achieve this, you need to &lt;a href="https://docs.apify.com/academy/deploying-your-code/deploying" rel="noopener noreferrer"&gt;deploy your code to a cloud platform&lt;/a&gt; and automate your scraping process with a proper schedule. For example, you can deploy your scraping script as an Apify Actor and schedule it to run at intervals that match how “fresh” you need the data to be. Apify’s scheduling and monitoring tools make it easy for you to implement this automation, ensuring a constant flow of real-time data while helping you promptly handle any errors to maintain accuracy and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;And here we are at the end of the article. I hope you’ve found it helpful and can use it as a reference when dealing with the challenges we’ve discussed. Of course, every scraping project is unique, and it’s impossible to cover every scenario in one post. That’s where the value of a strong developer community comes in.&lt;/p&gt;

&lt;p&gt;Connecting with other developers who have faced and solved similar challenges can make a big difference. It’s a chance to exchange ideas, get advice, and share your own experiences.&lt;/p&gt;

&lt;p&gt;If you haven’t already, I encourage you to &lt;a href="https://discord.com/invite/jyEM2PRvMU" rel="noopener noreferrer"&gt;join the Apify &amp;amp; Crawlee Developer Community on Discord&lt;/a&gt;. It’s a great space to learn, collaborate, and grow alongside others who share your interest in web scraping.&lt;/p&gt;

&lt;p&gt;Hope to see you there!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>crawlee</category>
      <category>antiscraping</category>
    </item>
    <item>
      <title>11 best open-source web crawlers and scrapers in 2024</title>
      <dc:creator>Dávid Lukáč</dc:creator>
      <pubDate>Tue, 29 Oct 2024 14:33:22 +0000</pubDate>
      <link>https://dev.to/apify/11-best-open-source-web-crawlers-and-scrapers-in-2024-16pe</link>
      <guid>https://dev.to/apify/11-best-open-source-web-crawlers-and-scrapers-in-2024-16pe</guid>
      <description>&lt;p&gt;Free software libraries, packages, and SDKs for web crawling? Or is it a web scraper that you need?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hey, we're &lt;a href="https://apify.com/pricing" rel="noopener noreferrer"&gt;Apify&lt;/a&gt;. You can build, deploy, share, and monitor your scrapers and crawlers on the Apify platform. &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;Check us out&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're tired of the limitations and costs of proprietary &lt;a href="https://blog.apify.com/best-web-scraping-tools/" rel="noopener noreferrer"&gt;web scraping tools&lt;/a&gt; or being locked into a single vendor, open-source web crawlers and scrapers offer a flexible, customizable alternative.&lt;/p&gt;

&lt;p&gt;But not all open-source tools are the same.&lt;/p&gt;

&lt;p&gt;Some are full-fledged libraries capable of handling large-scale &lt;a href="https://blog.apify.com/web-data-extraction/" rel="noopener noreferrer"&gt;data extraction&lt;/a&gt; projects, while others excel at &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic content&lt;/a&gt; or are ideal for smaller, lightweight tasks. The right tool depends on your project’s complexity, the type of data you need, and your preferred programming language.&lt;/p&gt;

&lt;p&gt;The libraries, frameworks, and SDKs we cover here take into account the diverse needs of developers, so you can choose a tool that meets your requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are open-source web crawlers and web scrapers?
&lt;/h2&gt;

&lt;p&gt;Open-source web crawlers and scrapers let you adapt code to your needs without the cost of licenses or restrictions. Crawlers gather broad data, while scrapers target specific information. Open-source solutions like the ones below offer community-driven improvements, flexibility, and scalability—free from vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top 11 open-source web crawlers and scrapers in 2024
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Crawlee
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Node.js, Python | GitHub: 15.4K+ stars | &lt;a href="https://github.com/apify/crawlee" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee is a complete web scraping and browser automation library designed for quickly and efficiently building reliable crawlers. With built-in anti-blocking features, it makes your bots look like real human users, &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;reducing the likelihood of getting blocked.&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qrz5pe5ow097063ioo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qrz5pe5ow097063ioo.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Available in both &lt;a href="https://crawlee.dev/?__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1730191216477.1730206072795.287&amp;amp;__hssc=160404322.1.1730206072795&amp;amp;__hsfp=3227366388&amp;amp;_gl=1*bl1l4a*_gcl_aw*R0NMLjE3MjU2MTE1OTEuQ2owS0NRancwT3EyQmhDQ0FSSXNBQTVodWJVMDVaYVVXd291ZmtZMzhCY2pGMmpadFdBandUa1Q3YmNWZHY0dHZuSVQ1aUF5R2Zva2tySWFBdl81RUFMd193Y0I.*_gcl_au*MjA3NDM2MDE2Ni4xNzI5NTA0MzkyLjE4MDExMDkzMzguMTcyOTg0OTE5NC4xNzI5ODQ5NDAw*_ga*MTYzNDgwNjc5NC4xNzEzNzg4OTYz*_ga_62P18XN9NS*MTczMDIwNjA3MS4zMTIuMS4xNzMwMjA2NTU5LjMxLjAuMA.." rel="noopener noreferrer"&gt;Node.js&lt;/a&gt; and &lt;a href="https://crawlee.dev/python?__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1730191216477.1730206072795.287&amp;amp;__hssc=160404322.1.1730206072795&amp;amp;__hsfp=3227366388&amp;amp;_gl=1*bl1l4a*_gcl_aw*R0NMLjE3MjU2MTE1OTEuQ2owS0NRancwT3EyQmhDQ0FSSXNBQTVodWJVMDVaYVVXd291ZmtZMzhCY2pGMmpadFdBandUa1Q3YmNWZHY0dHZuSVQ1aUF5R2Zva2tySWFBdl81RUFMd193Y0I.*_gcl_au*MjA3NDM2MDE2Ni4xNzI5NTA0MzkyLjE4MDExMDkzMzguMTcyOTg0OTE5NC4xNzI5ODQ5NDAw*_ga*MTYzNDgwNjc5NC4xNzEzNzg4OTYz*_ga_62P18XN9NS*MTczMDIwNjA3MS4zMTIuMS4xNzMwMjA2NTU5LjMxLjAuMA.." rel="noopener noreferrer"&gt;Python&lt;/a&gt;, Crawlee offers a unified interface that supports HTTP and headless browser crawling, making it versatile for various scraping tasks. It integrates with libraries like Cheerio and &lt;a href="https://blog.apify.com/how-to-parse-html-in-python/" rel="noopener noreferrer"&gt;Beautiful Soup for efficient HTML parsing&lt;/a&gt; and headless browsers like Puppeteer and &lt;a href="https://blog.apify.com/scrape-dynamic-websites-with-python/" rel="noopener noreferrer"&gt;Playwright for JavaScript rendering.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The library excels in scalability, automatically managing concurrency based on system resources, &lt;a href="https://blog.apify.com/rotating-proxies/" rel="noopener noreferrer"&gt;rotating proxies to enhance efficiency&lt;/a&gt;, and employing human-like browser fingerprints to avoid detection. Crawlee also ensures robust data handling through persistent URL queuing and pluggable storage for data and files.&lt;/p&gt;
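
&lt;p&gt;As a quick taste, here’s a minimal sketch of a Crawlee crawler that renders pages in a headless browser; swapping &lt;code&gt;PlaywrightCrawler&lt;/code&gt; for &lt;code&gt;CheerioCrawler&lt;/code&gt; turns it into a plain HTTP crawler:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page, enqueueLinks }) {
        // Extract the page title from the rendered DOM
        const title = await page.title();
        await Dataset.pushData({ url: request.url, title });

        // Follow links discovered on the page
        await enqueueLinks();
    },
});

await crawler.run(['https://crawlee.dev']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;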

&lt;p&gt;&lt;a href="https://crawlee.dev/" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Check out Crawlee&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy switching between simple HTTP request/response handling and complex JavaScript-heavy pages by changing just a few lines of code.&lt;/li&gt;
&lt;li&gt;Built-in sophisticated anti-blocking features like proxy rotation and generation of human-like fingerprints.&lt;/li&gt;
&lt;li&gt;Integrating tools for common tasks like link extraction, infinite scrolling, and blocking unwanted assets, along with support for both Cheerio and JSDOM, provides a comprehensive scraping toolkit right out of the box.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Its comprehensive feature set and the requirement to understand HTTP and browser-based scraping can create a steep learning curve.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🟧 &lt;a href="https://blog.apify.com/crawlee-web-scraping-tutorial/" rel="noopener noreferrer"&gt;Crawlee web scraping tutorial for Node.js&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Crawlee is ideal for developers and teams seeking to manage simple and complex web scraping and automation tasks in JavaScript/TypeScript and Python. It is particularly effective for scraping web applications that combine static and dynamic pages, as it allows easy switching between different types of crawlers to handle each scenario. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Deploy your scraping code to the cloud&lt;/a&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scrapy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Python | GitHub: 52.9k stars | &lt;a href="https://github.com/scrapy/scrapy" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scrapy is one of the most complete and popular &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; frameworks within the Python ecosystem. It is written using Twisted, an event-driven networking framework, giving Scrapy asynchronous capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6kdxyqefyhlty3mvncp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6kdxyqefyhlty3mvncp.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a comprehensive &lt;a href="https://blog.apify.com/web-crawling-vs-web-scraping/" rel="noopener noreferrer"&gt;web crawling&lt;/a&gt; framework designed specifically for data extraction, Scrapy provides built-in support for handling requests, processing responses, and exporting data in multiple formats, including CSV, JSON, and XML.&lt;/p&gt;

&lt;p&gt;Its main drawback is that it cannot natively handle dynamic websites. However, you can &lt;a href="https://blog.apify.com/scrapy-playwright/" rel="noopener noreferrer"&gt;configure Scrapy with a browser automation tool like Playwright&lt;/a&gt; or Selenium to unlock these capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;a href="https://blog.apify.com/web-scraping-with-scrapy/" rel="noopener noreferrer"&gt;Learn more about using Scrapy for web scraping&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Significant performance boost due to its asynchronous nature.&lt;/li&gt;
&lt;li&gt;Specifically designed for web scraping, providing a robust foundation for such tasks.&lt;/li&gt;
&lt;li&gt;Extensible &lt;a href="https://blog.apify.com/scrapy-middleware/" rel="noopener noreferrer"&gt;middleware architecture&lt;/a&gt; makes adjusting Scrapy’s capabilities to fit various scraping scenarios easy.&lt;/li&gt;
&lt;li&gt;Supported by a well-established community with a wealth of resources available online.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Steep learning curve, which can be challenging for less experienced web scraping developers.&lt;/li&gt;
&lt;li&gt;Lacks the ability to handle content generated by JavaScript natively, requiring integration with tools like Selenium or Playwright to scrape dynamic pages.&lt;/li&gt;
&lt;li&gt;More complex than necessary for simple and small-scale scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Scrapy is ideally suited for developers, data scientists, and researchers embarking on large-scale web scraping projects who require a reliable and scalable solution for extracting and processing vast amounts of data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Run multiple Scrapy spiders in the cloud&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/cli/docs/integrating-scrapy" rel="noopener noreferrer"&gt;Read the docs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. MechanicalSoup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Python | GitHub: 4.7K+ stars | &lt;a href="https://github.com/MechanicalSoup/MechanicalSoup" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MechanicalSoup is a Python library designed to automate website interactions. It provides a simple API to access and interact with HTML content, similar to interacting with web pages through a web browser, but programmatically. MechanicalSoup essentially combines the best features of libraries like Requests for HTTP requests and Beautiful Soup for HTML parsing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbshzbw663nkcome2ytc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbshzbw663nkcome2ytc7.png" alt=" " width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you might wonder when to use MechanicalSoup over the traditional combination of BS4+ Requests. MechanicalSoup provides some distinct features &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;particularly useful for specific web scraping tasks.&lt;/a&gt; These include submitting forms, handling login authentication, navigating through pages, and extracting data from HTML.&lt;/p&gt;

&lt;p&gt;MechanicalSoup makes it possible by creating a &lt;code&gt;StatefulBrowser&lt;/code&gt; object in Python that can store cookies and session data and handle other aspects of a browsing session.&lt;/p&gt;

&lt;p&gt;However, while MechanicalSoup offers some browser-like functionalities akin to what you'd expect from a browser automation tool such as Selenium, it does so without launching an actual browser. This approach has its advantages but also comes with certain limitations, which we'll explore next:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Great choice for simple automation tasks such as filling out forms and scraping data from pages that do not require JavaScript rendering.&lt;/li&gt;
&lt;li&gt;Lightweight tool that interacts with web pages through requests without a graphical browser interface. This makes it faster and less demanding on system resources.&lt;/li&gt;
&lt;li&gt;Directly integrates Beautiful Soup, offering all the benefits you would expect from BS4, plus some extra features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlike real browser automation tools like Playwright and Selenium, MechanicalSoup cannot execute JavaScript. Many modern websites require JavaScript for dynamic content loading and user interactions, which MechanicalSoup cannot handle.&lt;/li&gt;
&lt;li&gt;Unlike Selenium and Playwright, MechanicalSoup does not support advanced browser interactions such as moving the mouse, dragging and dropping, or keyboard actions that might be necessary to retrieve data from more complex websites.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; MechanicalSoup is a more efficient and lightweight option for more basic scraping tasks, especially for static websites and those with straightforward interactions and navigation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🍲 &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;Learn more about MechanicalSoup&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Node Crawler
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Node.js  | GitHub: 6.7K+ stars | &lt;a href="https://github.com/bda-research/node-crawler" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node Crawler, often referred to as 'Crawler,' is a popular web crawling library for Node.js. At its core, Crawler utilizes Cheerio as the default parser, but it can be configured to use JSDOM if needed. The library offers a wide range of customization options, including robust queue management that allows you to enqueue URLs for crawling while it manages concurrency, rate limiting, and retries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2gpkzblwaeu2tdwe7a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2gpkzblwaeu2tdwe7a2.png" alt=" " width="721" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built on Node.js, Node Crawler excels at efficiently handling multiple, simultaneous web requests, which makes it ideal for high-volume web scraping and crawling.&lt;/li&gt;
&lt;li&gt;Integrates directly with Cheerio (a fast, flexible, and lean implementation of core jQuery designed specifically for the server), simplifying the process of HTML parsing and data extraction.&lt;/li&gt;
&lt;li&gt;Provides extensive options for customization, from user-agent strings to request intervals, making it suitable for a wide range of web crawling scenarios.&lt;/li&gt;
&lt;li&gt;Easy to set up and use, even for those new to Node.js or web scraping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not handle JavaScript rendering natively. For dynamic JavaScript-heavy sites, you need to integrate it with something like Puppeteer or a &lt;a href="https://blog.apify.com/headless-browsers-what-are-they-and-how-do-they-work/" rel="noopener noreferrer"&gt;headless browser&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;While Node Crawler simplifies many tasks, the asynchronous model and event-driven architecture of Node.js can present a learning curve for those unfamiliar with such patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Node Crawler is a great choice for developers familiar with the Node.js ecosystem who need to handle large-scale or high-speed web scraping tasks. It provides a flexible solution for web crawling that leverages the strengths of Node.js's asynchronous capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://blog.apify.com/mechanicalsoup-tutorial/" rel="noopener noreferrer"&gt;Related: Web scraping with Node.js guide&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  5. Selenium
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Multi-language | GitHub: 30.6K stars | &lt;a href="https://github.com/SeleniumHQ/selenium" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Selenium is a widely-used open-source framework for automating web browsers. It allows developers to write scripts in various programming languages to control browser actions. This makes it suitable for crawling and scraping dynamic content. Selenium provides a rich API that supports multiple browsers and platforms, so you can simulate user interactions like clicking buttons, filling forms, and navigating between pages. Its ability to handle JavaScript-heavy websites makes it particularly valuable for scraping modern web applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oylhhk9dpocsocegm9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5oylhhk9dpocsocegm9f.png" alt=" " width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cross-browser support:&lt;/strong&gt; Works with all major browsers (Chrome, Firefox, Safari, etc.), allowing for extensive testing and scraping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic content handling:&lt;/strong&gt; Capable of interacting with JavaScript-rendered content, making it effective for modern web applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich community and resources:&lt;/strong&gt; A large ecosystem of tools and libraries that enhance its capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource-intensive:&lt;/strong&gt; Running a full browser can consume significant system resources compared to headless solutions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Steeper learning curve:&lt;/strong&gt; Requires understanding of browser automation concepts and may involve complex setup for advanced features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Selenium is ideal for developers and testers needing to automate web applications or scrape data from sites that heavily rely on JavaScript. Its versatility makes it suitable for both testing and data extraction tasks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 &lt;a href="https://blog.apify.com/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;Related: How to do web scraping with Selenium in Python&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  6. Heritrix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 2.8K+ stars | &lt;a href="https://github.com/internetarchive/heritrix3" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Heritrix is open-source web crawling software developed by the Internet Archive. It is primarily used for web archiving - collecting information from the web to build a digital library and support the Internet Archive's preservation efforts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqrmycqeb00sqiirxrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjqrmycqeb00sqiirxrg.png" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Optimized for large-scale web archiving, making it ideal for institutions like libraries and archives needing to preserve digital content systematically.&lt;/li&gt;
&lt;li&gt;Detailed configuration options that allow users to customize crawl behavior deeply, including deciding which URLs to crawl, how to treat them, and how to manage the data collected.&lt;/li&gt;
&lt;li&gt;Able to handle large datasets, which is essential for archiving significant web portions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As it is written in Java, running Heritrix might require more substantial system resources than lighter, script-based crawlers, and it might limit usability for those unfamiliar with Java.&lt;/li&gt;
&lt;li&gt;Optimized for capturing and preserving web content rather than extracting data for immediate analysis or use.&lt;/li&gt;
&lt;li&gt;Does not render JavaScript, which means it cannot capture content from websites that rely heavily on JavaScript for dynamic content generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Heritrix is best suited for organizations and projects that aim to archive and preserve digital content on a large scale, such as libraries, archives, and other cultural heritage institutions. Its specialized nature makes it an excellent tool for its intended purpose but less adaptable for more general web scraping needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Apache Nutch
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 2.9K+ stars | &lt;a href="https://github.com/apache/nutch" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Nutch is an extensible open-source web crawler often used in fields like data analysis. It can fetch content through protocols such as HTTPS, HTTP, or FTP and extract textual information from document formats like HTML, PDF, RSS, and ATOM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22mfcswqw3ffwiqf3bec.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F22mfcswqw3ffwiqf3bec.png" alt=" " width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly reliable for continuous, extensive crawling operations given its maturity and focus on enterprise-level crawling.&lt;/li&gt;
&lt;li&gt;Being part of the Apache project, Nutch benefits from strong community support, continuous updates, and improvements.&lt;/li&gt;
&lt;li&gt;Seamless integration with Apache Solr and other Lucene-based search technologies, making it a robust backbone for building search engines.&lt;/li&gt;
&lt;li&gt;Leveraging Hadoop allows Nutch to efficiently process large volumes of data, which is crucial for processing the web at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up Nutch and integrating it with Hadoop can be complex and daunting, especially for those new to these technologies.&lt;/li&gt;
&lt;li&gt;Overly complicated for simple or small-scale crawling tasks, whereas lighter, more straightforward tools could be more effective.&lt;/li&gt;
&lt;li&gt;Since Nutch is written in Java, it requires a Java environment, which might not be ideal for environments focused on other technologies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Apache Nutch is ideal for organizations building large-scale search engines or collecting and processing vast amounts of web data. Its capabilities are especially useful in scenarios where scalability, robustness, and integration with enterprise-level search technologies are required.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. WebMagic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Java | GitHub: 11.4K+ stars | &lt;a href="https://github.com/code4craft/webmagic" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WebMagic is an open-source, simple, and flexible Java framework dedicated to web scraping. Unlike large-scale data crawling frameworks like Apache Nutch, WebMagic is designed for more specific, targeted scraping tasks, which makes it suitable for individual and enterprise users who need to extract data from various web sources efficiently.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nh5arogcm1twjghch2l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3nh5arogcm1twjghch2l.png" alt=" " width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier to set up and use than more complex systems like Apache Nutch, which is designed for broader web indexing and requires more setup.&lt;/li&gt;
&lt;li&gt;Designed to be efficient for small to medium-scale scraping tasks, providing enough power without the overhead of larger frameworks.&lt;/li&gt;
&lt;li&gt;For projects already within the Java ecosystem, integrating WebMagic can be more seamless than integrating a tool from a different language or platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Being Java-based, it might not appeal to developers working with other programming languages who prefer libraries available in their chosen languages.&lt;/li&gt;
&lt;li&gt;WebMagic does not handle JavaScript rendering natively. For dynamic content loaded by JavaScript, you might need to integrate with headless browsers, which can complicate the setup.&lt;/li&gt;
&lt;li&gt;While it has good documentation, the community around WebMagic might not be as large or active as those surrounding more popular frameworks like Scrapy, potentially affecting the future availability of third-party extensions and support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; WebMagic is a suitable choice for developers looking for a straightforward, flexible Java-based web scraping framework that balances ease of use with sufficient power for most web scraping tasks. It's particularly beneficial for users within the Java ecosystem who need a tool that integrates smoothly into larger Java applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  9. Nokogiri
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Ruby | GitHub: 6.1K+ stars | &lt;a href="https://github.com/sparklemotion/nokogiri" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like Beautiful Soup, Nokogiri is great at parsing HTML and XML documents, but it does so in Ruby. Nokogiri relies on native parsers such as libxml2, libgumbo, and xerces. If you want to programmatically read or edit an XML document in Ruby, Nokogiri is the way to go.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6ou2ql0xc1zcqbo21k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9y6ou2ql0xc1zcqbo21k.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Due to its underlying implementation in C (libxml2 and libxslt), Nokogiri is extremely fast, especially compared to pure Ruby libraries.&lt;/li&gt;
&lt;li&gt;Able to handle both HTML and XML with equal proficiency, making it suitable for a wide range of tasks, from web scraping to RSS feed parsing.&lt;/li&gt;
&lt;li&gt;Straightforward and intuitive API for performing complex parsing and querying tasks.&lt;/li&gt;
&lt;li&gt;Strong, well-maintained community ensures regular updates and good support through forums and documentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Specific to Ruby, which might not be suitable for those working in other programming environments.&lt;/li&gt;
&lt;li&gt;Installation can sometimes be problematic due to its dependencies on native C libraries.&lt;/li&gt;
&lt;li&gt;Can be relatively heavy regarding memory usage, especially when dealing with large documents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Nokogiri is particularly well-suited for developers already working within the Ruby ecosystem who need a robust, efficient tool for parsing and manipulating HTML and XML data. Its speed, flexibility, and Ruby-native design make it an excellent choice for a wide range of web data extraction and transformation tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Playwright
&lt;/h3&gt;

&lt;p&gt;Language: Multi-language | GitHub: 67K+ stars | &lt;a href="https://github.com/microsoft/playwright" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright&lt;/strong&gt;, an open-source Node.js library introduced in 2020, is widely used for automated browser testing and web scraping. It is cross-platform, supports multiple languages like TypeScript, JavaScript, Python, and Java, and works with Chromium, Firefox, and WebKit. Playwright offers unique features for web automation, including headless mode, auto-waits, browser contexts, authentication state persistence, and custom selector engines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v7l56mztdkayhs6uk01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v7l56mztdkayhs6uk01.png" alt=" " width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright supports multiple browsers including Chromium, Firefox, and WebKit, for consistent scraping across different platforms. It can also be utilized with various programming languages such as JavaScript, Python, Java, and .NET, which makes it accessible to a broader range of developers.&lt;/li&gt;
&lt;li&gt;Playwright can operate in headless mode, which reduces resource consumption and allows for faster execution of scraping tasks without a graphical interface. The framework automatically waits for elements to be ready before interacting with them. This reduces the need for manual delays and improves reliability.&lt;/li&gt;
&lt;li&gt;It effectively manages websites that rely on JavaScript and AJAX for content loading, so it's suitable for modern web applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running multiple browser instances can consume significant system resources, particularly when scraping large volumes of data.&lt;/li&gt;
&lt;li&gt;While capable, Playwright is primarily designed for browser automation and testing rather than dedicated web crawling, which can complicate extensive scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Playwright is best suited for developers looking to automate interactions with web applications that utilize modern frameworks like React or Angular. Its ability to handle dynamic content makes it ideal for scenarios where traditional HTTP request libraries fall short. It is particularly advantageous in projects that require frequent updates or interactions with complex web interfaces.&lt;/p&gt;
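&lt;p&gt;To give a feel for the API, here's a minimal Python sketch of that kind of use (the URL, selector, and &lt;code&gt;auth.json&lt;/code&gt; storage-state file are placeholders): it reuses a saved login session via a browser context and relies on Playwright's auto-waiting instead of manual delays.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # Reuse a previously saved login session (authentication state persistence)
    context = browser.new_context(storage_state="auth.json")
    page = context.new_page()

    page.goto("https://example.com/dashboard")

    # Playwright auto-waits for the selector to appear before proceeding
    page.wait_for_selector(".items")
    print(page.inner_text(".items"))

    browser.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;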

&lt;h3&gt;
  
  
  11. Katana
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Language:&lt;/strong&gt; Go | GitHub: 11.1K+ stars | &lt;a href="https://github.com/projectdiscovery/katana" rel="noopener noreferrer"&gt;link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Katana is a web scraping framework focused on speed and efficiency. Developed by Project Discovery, it is designed to facilitate data collection from websites while providing a strong set of features tailored for security professionals and developers. Katana lets you create custom scraping workflows using a simple configuration format. It supports various output formats and integrates easily with other tools in the security ecosystem, which makes it a versatile choice for web crawling and scraping tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jdwahwfyc5ylp7s89b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2jdwahwfyc5ylp7s89b.png" alt=" " width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High performance:&lt;/strong&gt; Built with efficiency in mind, allowing for fast data collection from multiple sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible architecture:&lt;/strong&gt; Easily integrates with other tools and libraries, enhancing its functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security-focused features:&lt;/strong&gt; Includes capabilities that cater specifically to the needs of security researchers and penetration testers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limited community support:&lt;/strong&gt; As a newer tool, it does not have as extensive resources or community engagement as more established frameworks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Niche use case focus:&lt;/strong&gt; Primarily designed for security professionals, which may limit its appeal for general-purpose web scraping tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Katana is best suited for security professionals and developers looking for a fast, efficient framework tailored to web scraping needs within the cybersecurity domain. Its integration capabilities make it particularly useful in security testing scenarios where data extraction is required.&lt;/p&gt;

&lt;h2&gt;
  
  
  All-in-one crawling and scraping solution: Apify
&lt;/h2&gt;

&lt;p&gt;Apify is a full-stack web scraping and browser automation platform for building crawlers and scrapers in any programming language. It provides infrastructure for successful scraping at scale: storage, integrations, scheduling, proxies, and more.&lt;/p&gt;

&lt;p&gt;So, whichever library you want to use for your scraping scripts, you can deploy them to the cloud and benefit from all the features the Apify platform has to offer.&lt;/p&gt;

&lt;p&gt;Apify also hosts a library of ready-made data extraction and automation tools (Actors) created by other developers, which you can customize for your use case. That means you don't have to build everything from scratch.&lt;/p&gt;
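&lt;p&gt;For a rough idea of what an Actor looks like in code, here's a minimal sketch using the Apify Python SDK. The scraping logic and field names are placeholders, and the exact API may differ slightly between SDK versions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio

from apify import Actor

async def main():
    async with Actor:
        # Read the Actor input provided via the Apify platform or API
        actor_input = await Actor.get_input() or {}
        url = actor_input.get("url", "https://example.com")

        # ... run your scraping logic here (Playwright, BeautifulSoup, Scrapy, etc.) ...
        scraped_item = {"url": url, "title": "Example Domain"}  # placeholder result

        # Store results in the Actor's default dataset on the Apify platform
        await Actor.push_data(scraped_item)

if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;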

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ku3d1uzldkhzqglsl2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ku3d1uzldkhzqglsl2k.png" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://console.apify.com/sign-up" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;Sign up now and start scraping&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>javascript</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to scrape dynamic websites with Python</title>
      <dc:creator>Saurav Jain</dc:creator>
      <pubDate>Fri, 07 Jun 2024 07:27:24 +0000</pubDate>
      <link>https://dev.to/apify/how-to-scrape-dynamic-websites-with-python-h0m</link>
      <guid>https://dev.to/apify/how-to-scrape-dynamic-websites-with-python-h0m</guid>
      <description>&lt;p&gt;Scraping dynamic websites that load content through JavaScript after the initial page load can be a pain in the neck, as the data you want to scrape may not exist in the raw HTML source code.&lt;/p&gt;

&lt;p&gt;I'm here to help you with that problem.&lt;/p&gt;

&lt;p&gt;In this article, you'll learn how to scrape dynamic websites with Python and Playwright. By the end, you'll know how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up and install Playwright&lt;/li&gt;
&lt;li&gt;Create a browser instance&lt;/li&gt;
&lt;li&gt;Navigate to the page&lt;/li&gt;
&lt;li&gt;Interact with the page&lt;/li&gt;
&lt;li&gt;Scrape the data you need&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are dynamic websites?
&lt;/h2&gt;

&lt;p&gt;Dynamic websites load content dynamically using client-side scripting languages like JavaScript. Unlike static websites, where the content is pre-rendered on the server, dynamic websites generate content on the fly based on user interactions, data fetched from APIs, or other dynamic sources. This makes them more complex to scrape compared to static websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the difference between a dynamic and static web page?
&lt;/h2&gt;

&lt;p&gt;Static web pages are pre-rendered on the server and delivered as complete HTML files. Their content is fixed and does not change unless the underlying HTML file is modified. Dynamic web pages, on the other hand, generate content on-the-fly using client-side scripting languages like JavaScript.&lt;/p&gt;

&lt;p&gt;Dynamic content is often generated using JavaScript frameworks and libraries like React, Angular, and Vue.js, which manipulate the Document Object Model (DOM) based on user interactions or data fetched from APIs using technologies like AJAX (Asynchronous JavaScript and XML). This dynamic content is not initially present in the HTML source code and requires additional processing to be captured.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Libraries for Scraping Dynamic Content
&lt;/h2&gt;

&lt;p&gt;To scrape dynamic content, you need tools that can execute JavaScript and interact with web pages like a real browser. One such tool is Playwright, a Python library for automating Chromium, Firefox, and WebKit browsers. Playwright allows you to simulate user interactions, execute JavaScript, and capture the resulting DOM changes.&lt;/p&gt;

&lt;p&gt;In addition to Playwright, you may also need libraries like BeautifulSoup for parsing HTML and extracting relevant data from the rendered DOM.&lt;/p&gt;
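&lt;p&gt;For example, one common pattern is to let Playwright render the page and then hand the resulting HTML to BeautifulSoup for parsing. Here's a minimal sketch (the URL and selector are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    # Grab the fully rendered HTML after JavaScript has run
    html = page.content()
    browser.close()

# Parse the rendered DOM with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("h2")]
print(titles)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;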

&lt;h2&gt;
  
  
  Step-by-Step Guide to Using Playwright
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Setup and Installation&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Install the Python Playwright library: &lt;code&gt;pip install playwright&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install the required browser binaries (e.g., Chromium): &lt;code&gt;playwright install chromium&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping a Dynamically-loaded Website&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Import the necessary Playwright modules and create a browser instance.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from Playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Launch a new browser context and create a new page.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    page = browser.new_page()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Navigate to the target website.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;page.goto("&amp;lt;https://example.com/infinite-scroll&amp;gt;")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Interact with the page as needed (e.g., scroll, click buttons, fill forms) to trigger dynamic content loading.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
# Scroll to the bottom to load more content
while True:
    page.evaluate("window.scrollTo(0, document.body.scrollHeight);")
    new_content_loaded = page.wait_for_selector(".new-content", timeout=1000)
    if not new_content_loaded:
        break
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Wait for the desired content to load using Playwright's built-in wait mechanisms.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
new_content_loaded = page.wait_for_selector(".new-content", timeout=1000)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Extract the desired data from the rendered DOM using Playwright's evaluation mechanisms or in combination with BeautifulSoup.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
content = page.inner_html("body")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here's the complete example of scraping an infinite scrolling page using Playwright:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
from Playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch a new Chromium browser instance
    browser = p.chromium.launch()

    # Create a new page object
    page = browser.new_page()

    # Navigate to the target website with infinite scrolling
    page.goto("&amp;lt;https://example.com/infinite-scroll&amp;gt;")

    # Scroll to the bottom to load more content
    while True:
        # Execute JavaScript to scroll to the bottom of the page
        page.evaluate("window.scrollTo(0, document.body.scrollHeight);")

        # Wait for new content to load (timeout after 1 second)
        new_content_loaded = page.wait_for_selector(".new-content", timeout=1000) # Check for a specific class

        # If no new content is loaded, break out of the loop
        if not new_content_loaded:
            break

    # Extract the desired data from the rendered DOM
    content = page.inner_html("body")

    # Close the browser instance
    browser.close()
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;Web scraping dynamic content can present several challenges, such as handling CAPTCHAs, IP bans, and other anti-scraping measures implemented by websites. Here are some common solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CAPTCHAs&lt;/strong&gt;: CAPTCHAs can be handled by wiring third-party solving services or custom solutions into your Playwright workflow. You can leverage libraries like &lt;code&gt;python-anticaptchacloud&lt;/code&gt; or &lt;code&gt;python-anti-captcha&lt;/code&gt; to solve CAPTCHAs programmatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP bans&lt;/strong&gt;: Use rotating proxies or headless browsers to avoid IP bans and mimic real user behavior. Libraries like &lt;code&gt;requests-html&lt;/code&gt; and &lt;code&gt;selenium&lt;/code&gt; can be used in conjunction with proxy services like Bright Data or Oxylabs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-scraping measures&lt;/strong&gt;: Implement techniques like randomized delays, user agent rotation, and other tactics to make your scraper less detectable (see the sketch after this list). Libraries like &lt;code&gt;fake-useragent&lt;/code&gt; and &lt;code&gt;scrapy-fake-useragent&lt;/code&gt; can help with user agent rotation.&lt;/li&gt;
&lt;/ul&gt;
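&lt;p&gt;As a rough sketch of that last point, here's one way to combine randomized delays with user agent rotation in Playwright using &lt;code&gt;fake-useragent&lt;/code&gt; (the URLs are placeholders, and whether this is sufficient depends entirely on the target site):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import time

from fake_useragent import UserAgent
from playwright.sync_api import sync_playwright

ua = UserAgent()

with sync_playwright() as p:
    browser = p.chromium.launch()

    for url in ["https://example.com/page/1", "https://example.com/page/2"]:
        # New context with a randomly chosen user agent for each visit
        context = browser.new_context(user_agent=ua.random)
        page = context.new_page()
        page.goto(url)

        # ... extract data here ...

        context.close()

        # Randomized delay between requests to look less like a bot
        time.sleep(random.uniform(2, 6))

    browser.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;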

&lt;h3&gt;
  
  
  Summary and Next Steps
&lt;/h3&gt;

&lt;p&gt;Scraping dynamic websites requires tools that can execute JavaScript and interact with web pages like a real browser. Playwright is a powerful Python library that enables you to automate Chromium, Firefox, and WebKit browsers, making it suitable for scraping dynamic content.&lt;/p&gt;

&lt;p&gt;However, it's essential to understand that web scraping dynamic content can be more challenging than scraping static websites due to anti-scraping measures implemented by websites. You may need to employ additional techniques like rotating proxies, handling CAPTCHAs, and mimicking real user behavior to avoid detection and ensure successful scraping.&lt;/p&gt;

&lt;p&gt;For further learning and additional resources, consider exploring &lt;a href="https://playwright.dev/python/docs/intro" rel="noopener noreferrer"&gt;Playwright's official documentation&lt;/a&gt; or one of our more in-depth tutorials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/playwright-web-scraping/" rel="noopener noreferrer"&gt;Playwright web scraping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.apify.com/python-playwright/" rel="noopener noreferrer"&gt;Python Playwright: a complete guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>beginners</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Web scraping in 2024: breakthroughs and challenges ahead</title>
      <dc:creator>Natasha Lekh</dc:creator>
      <pubDate>Sun, 28 Jan 2024 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-in-2024-breakthroughs-and-challenges-ahead-1kel</link>
      <guid>https://dev.to/apify/web-scraping-in-2024-breakthroughs-and-challenges-ahead-1kel</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was first published on December 15, 2023, and updated on January 29, 2024, to reflect recent updates in the legal landscape of web scraping.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;How did 2023 treat the web scraping industry? Let's take a short walk through the bad, the good, and the different of yesteryear. Welcome to a summary of the key events and trends that emerged in 2023, setting the stage for the landscape of 2024.&lt;/p&gt;

&lt;p&gt;🎄 &lt;strong&gt;Want to compare to what web scraping was like in 2022?&lt;/strong&gt; &lt;a href="https://blog.apify.com/future-of-web-scraping-in-2023/" rel="noopener noreferrer"&gt;&lt;strong&gt;Check out our overview from the last year&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🧑 Irony of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The year started off funny. In 2022, Meta &lt;a href="https://blog.apify.com/future-of-web-scraping-in-2023/#%F0%9F%A7%91%E2%80%8D%E2%9A%96-legal-developments" rel="noopener noreferrer"&gt;was very keen on suing individuals and companies for web scraping&lt;/a&gt;; in 2023, it continued to zero in even on its recent allies. The culprit in question, Bright Data, got &lt;a href="https://www.theregister.com/2023/02/02/meta_web_scraping/" rel="noopener noreferrer"&gt;sued by Facebook for scraping Facebook data&lt;/a&gt;. The trick is that Facebook had previously been using Bright Data's services for scraping data (just from other websites). Essentially, Meta inadvertently revealed its practice of collecting data from other websites through its lawsuit against a firm it employed for this very purpose. Quite the web scraping ouroboros. This situation once more highlighted the two sides of an age-old industry question: who really owns publicly accessible data, and is it okay to gather it?&lt;/p&gt;

&lt;p&gt;🆕 In 2024, the &lt;a href="https://techcrunch.com/2024/01/24/court-rules-in-favor-of-a-web-scraper-bright-data-which-meta-had-used-and-then-sued/" rel="noopener noreferrer"&gt;court ruled against Meta and in favor of web scraping&lt;/a&gt;. The judge dismissed Meta's breach of contract claim, arguing that even though Bright Data had accepted the terms of service of Facebook and Instagram, the company was not acting as a "user" of the services when it was scraping but only as a logged-out "visitor," who is not bound by the terms.&lt;/p&gt;

&lt;p&gt;In a cruel twist of fate, later last year, Meta got &lt;a href="https://qz.com/meta-s-new-record-setting-eu-fine-is-nearly-as-big-as-i-1850461159" rel="noopener noreferrer"&gt;another billion-sized fine&lt;/a&gt; (as big as the previous 6 combined, apparently) from the Irish DPC for not protecting the data of EU citizens from surveillance. The Irish Data Protection Commission and EU are not playing when it comes to data privacy. The penalty relates to an inquiry that was opened by the DPC &lt;a href="https://curia.europa.eu/juris/fiche.jsf?id=C%3B311%3B18%3BRP%3B1%3BP%3B1%3BC2018%2F0311%2FJ" rel="noopener noreferrer"&gt;back in 2020&lt;/a&gt;. And it seems like in 2024 Meta will be facing several other lawsuits regarding ad space and its pay-or-consent policy, this time &lt;a href="https://www.theregister.com/2023/12/05/spanish_media_meta_lawsuit/?td=keepreading" rel="noopener noreferrer"&gt;from Spanish media&lt;/a&gt; and &lt;a href="https://noyb.eu/en/noyb-files-gdpr-complaint-against-meta-over-pay-or-okay" rel="noopener noreferrer"&gt;Austrian data protection authority&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The pack of plaintiffs claiming violations of terms and conditions was extended by a new member, with Air Canada filing suit against travel search site &lt;em&gt;seats.aero&lt;/em&gt; in a &lt;a href="https://storage.courtlistener.com/recap/gov.uscourts.ded.83894/gov.uscourts.ded.83894.1.0_1.pdf" rel="noopener noreferrer"&gt;similar case&lt;/a&gt;, alleging unlawful scraping of its website in violation of its terms and conditions. Interestingly, however, Air Canada also claims a breach of criminal law under the Computer Fraud and Abuse Act (CFAA). This move could signal that, although courts have dismissed claims on these grounds in the past, in Van Buren (2021) and the hiQ ruling of April 2022 that built on it, the CFAA has still not lost its allure for websites wanting to sue web scraping companies.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;👀 The non-event of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are always a few things in life to be grateful for because they did not happen: the extinction of the bees, the eruption of Yellowstone Volcano, and the Google WEI Proposal. The Web Environment Integrity (WEI) proposal, which was pushed by Google, was eventually &lt;a href="https://www.theregister.com/2023/11/02/google_abandons_web_environment_integrity/" rel="noopener noreferrer"&gt;abandoned&lt;/a&gt; this year, not least due to the protests of defenders of the free web (see the screenshot of &lt;a href="https://github.com/explainers-by-googlers/Web-Environment-Integrity/issues?q=is%3Aissue+is%3Aclosed" rel="noopener noreferrer"&gt;explainers-by-googlers&lt;/a&gt; issues below).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw1zcpyp51whsn2g62h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw1zcpyp51whsn2g62h5.png" alt="Issues in explainers-by-googlers after the announcement of the WEI Proposal" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Issues in &lt;a href="https://github.com/explainers-by-googlers/Web-Environment-Integrity/issues?q=is%3Aissue+is%3Aclosed" rel="noopener noreferrer"&gt;explainers-by-googlers&lt;/a&gt; after the announcement of the WEI Proposal&lt;/p&gt;

&lt;p&gt;Google was &lt;a href="https://github.com/explainers-by-googlers/Web-Environment-Integrity/blob/main/explainer.md" rel="noopener noreferrer"&gt;trying to follow&lt;/a&gt; the likes of Apple and replace CAPTCHAs with a digitally signed token (an API containing a digitally signed token, to be precise). The reason seemed innocuous: to help separate real users from bot users and real traffic from bot traffic, and to limit online fraud and abuse, all without enabling privacy issues like cross-site tracking or browser fingerprinting. Sounds like a dream, right?&lt;/p&gt;

&lt;p&gt;However, while it might aid in reducing ad fraud, Google's proposed method of authentication also carries the risk of curtailing web freedom by allowing websites or third parties to directly influence the choice of browsers and software used by visitors. It could also potentially lead to misuses, such as rejecting visitors using certain tools like ad blockers or download managers.&lt;/p&gt;

&lt;p&gt;Besides, Google intended to implement the Web Environment Integrity API in Chromium, the open-source base for Chrome and several other browsers, excluding Firefox and Safari. This, in comparison, makes Apple's &lt;a href="https://developer.apple.com/videos/play/wwdc2022/10077/" rel="noopener noreferrer"&gt;Private Access Token&lt;/a&gt; seem way less dangerous, not least because Safari has a much smaller browser market share than Chrome.&lt;/p&gt;

&lt;p&gt;The drawbacks were quickly &lt;a href="https://news.ycombinator.com/item?id=36875226" rel="noopener noreferrer"&gt;noticed&lt;/a&gt; by the open web proponents in the tech community. Critics quickly recognized the potential for this to evolve into a kind of digital rights/restriction management for the web. They also highlighted that this change would wildly benefit the ad companies but create high risks of disadvantaging the users. It would also make scraping and web automation activities significantly harder. Well, for everyone except for Google, of course.&lt;/p&gt;

&lt;p&gt;The rejection of WEI by the tech community again highlights the importance of maintaining an open and accessible web.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🁫 First domino of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Scraping social media is the most common web scraping use case. In the old internet days, websites kept their APIs free and accessible, and even if they backed down from that, they often left a free version for the developers. The year started with X (Twitter)'s move to a &lt;a href="https://techcrunch.com/2023/02/01/twitter-to-end-free-access-to-its-api/" rel="noopener noreferrer"&gt;paid API model&lt;/a&gt;, which meant discontinuing free access even for developers. A few months later, &lt;a href="https://techcrunch.com/2023/04/18/reddit-will-begin-charging-for-access-to-its-api/" rel="noopener noreferrer"&gt;Reddit followed suit&lt;/a&gt; with its API transition to a paid model which caused significant uproar and &lt;a href="https://gizmodo.com/reddit-news-blackout-protest-is-finally-over-reddit-won-1850707509" rel="noopener noreferrer"&gt;protests&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;X's API policy changes might have contributed to the more frequent occurrences of Twitter scraping. With many projects forced to shut down due to the three price tiers, it's very likely that some developers had to turn to web scraping and browser automation as an alternative. We tried to keep up with these changes ourselves as providers of a more affordable &lt;a href="https://apify.com/quacker/twitter-scraper" rel="noopener noreferrer"&gt;Twitter API&lt;/a&gt; and &lt;a href="https://apify.com/trudax/reddit-scraper-lite" rel="noopener noreferrer"&gt;Reddit API&lt;/a&gt;. But it's becoming increasingly difficult or inconvenient to scrape these websites without a reliable infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;👺 Troublemaker of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Last year the web scraping case law &lt;a href="https://blog.apify.com/developments-in-hiq-v-linkedin-case/#the-district-courts-judgment-of-october-27-2022" rel="noopener noreferrer"&gt;made strides&lt;/a&gt; with the hiQ vs. LinkedIn case. 2023 had been rather calm on the legal side of things, if not for one particular persona. If, in 2022, Meta was the one suing individuals and companies for harvesting data, this year was a debut for X (Twitter). To be fair, the year 2023 was a debut for a lot of things for Twitter, but let's focus on the thing in question.&lt;/p&gt;

&lt;p&gt;Elon Musk, the tech mogul, made headlines with public promises to take legal action against web scraping companies. This move was &lt;a href="https://techcrunch.com/2023/07/05/twitter-silently-removes-login-requirement-for-viewing-tweets/" rel="noopener noreferrer"&gt;followed&lt;/a&gt; by X adding and then silently removing the login requirement for viewing posts (tweets) and following through with the promise by initiating lawsuits against 4 unknown individuals. And before all that, Musk had made the Twitter API paid. But let's take it step by step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqltpfhefxw85ehyaycw0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqltpfhefxw85ehyaycw0.png" width="720" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgao2tfnr5te5ph15g338.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgao2tfnr5te5ph15g338.png" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In July 2023, Elon Musk (well, X Corp, if we're being precise) gave us all some heat by &lt;a href="https://www.theverge.com/2023/7/13/23794163/elon-musk-lawsuit-data-scraping-twitter-x-corp" rel="noopener noreferrer"&gt;initiating legal action&lt;/a&gt; against four anonymous entities who were scraping Twitter. Apparently, the four defendants overwhelmed Twitter's registration page with automated requests to such an extent that it caused a significant server strain and disruption of service for users. The culprits are accused of overburdening Twitter's servers, diminishing user experience, and profiting unjustly at the company's expense.&lt;/p&gt;

&lt;p&gt;And as a regular cherry on top, the lawsuit further accuses them of scraping Twitter user data in breach of the platform's user agreement. These days, breach of Terms of Service has become companies' &lt;a href="https://blog.apify.com/future-of-web-scraping-in-2023/#%F0%9F%A7%91%E2%80%8D%E2%9A%96-legal-developments" rel="noopener noreferrer"&gt;favorite reference&lt;/a&gt; when instigating lawsuits against web scraping, seconded only by scraping data for large language model training, a concern raised by Elon Musk as well. Despite these latter allegations, he did confirm that his recently launched firm, xAI, &lt;a href="https://techcrunch.com/2023/09/01/xs-privacy-policy-confirms-it-will-use-public-data-to-train-ai-models/" rel="noopener noreferrer"&gt;will use X posts&lt;/a&gt; for training purposes. So go figure.&lt;/p&gt;

&lt;p&gt;The lawsuit suggests that the intensive data scraping led to such severe performance issues that X had to enforce a login requirement for access for everyone. Users are now required to have an account to view tweets and must subscribe to Twitter Blue's "verified" service to see over 600 posts per day.&lt;/p&gt;

&lt;p&gt;Now, we don't know for sure whether the AI data scraping was so intense that it could have impacted the website as much. However, this lawsuit and the argumentation behind it raise concerns about the potential for misrepresenting ethical data scraping practices, especially companies that adhere to legal and ethical standards in data collection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-ethical-web-scraping-and-how-do-you-do-it/" rel="noopener noreferrer"&gt;&lt;strong&gt;What is ethical web scraping and how do you do it? 5 principles of web scraping ethics&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/enforceability-of-terms-of-use/" rel="noopener noreferrer"&gt;&lt;strong&gt;Are website terms of use enforced?&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://podcasts.apple.com/us/podcast/responsible-web-scraping-challenges-and-approaches/id1660735956?i=1000593712286" rel="noopener noreferrer"&gt;&lt;strong&gt;Ethical data, Explained. Responsible web scraping: challenges and approaches.&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;📈 Trend of the year&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;AI brings a new way to easily process large amounts of data, something that previously required developing complex, specialized machine learning models. These days anybody can do, for instance, sentiment analysis with LLMs.&lt;/p&gt;

&lt;p&gt;Marek Trunkát, CTO of Apify&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Web scraping really became a household term after the waves caused by ChatGPT and OpenAI this year. Why? Because web scraping was heavily involved in the training process. In Google Trends, among the regular adjacent topics such as point-and-click or proxy, we see AI. And this trend is here to stay.&lt;/p&gt;

&lt;p&gt;We were happy to observe that making a one-off regular web scraper using AI is so easy these days. The AI hype makes it seem simple and accessible even without coding knowledge. It's the reliability and continuity of scraping that the AI cannot guarantee, especially with websites employing their own blocking measures based on AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyme4f2h31ogki8iqiya.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdyme4f2h31ogki8iqiya.png" alt="AI is the adjacent trend of the year in web scraping" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI is the adjacent trend of the year in web scraping&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🦾 AI and the hunt for data&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The AI revolution of 2023 only underscored the already growing need for data from the web. All large language models (LLMs) like GPT-4 and LLaMA-2 were trained on data scraped from the web. As demand for AI and LLM applications continues to grow, so will the demand for web scraping and data extraction.&lt;/p&gt;

&lt;p&gt;Jan Čurn, Apify Founder &amp;amp; CEO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The &lt;a href="https://www.fastcompany.com/90884581/what-is-a-large-language-model" rel="noopener noreferrer"&gt;large language models&lt;/a&gt; that power ChatGPT and other AI chatbots get their mastery of language from essentially two things: massive amounts of training data scraped from the web and massive amounts of compute power to learn from that data. That second ingredient is very expensive, but the first ingredient, so far, has been completely free.&lt;/p&gt;

&lt;p&gt;However, creators, publishers, and businesses increasingly see the data they put on the web as their property. If some tech company wants to use it to train its LLMs, they want to &lt;a href="https://www.nytimes.com/2023/04/18/technology/reddit-ai-openai-google.html" rel="noopener noreferrer"&gt;be paid&lt;/a&gt;. Just ask the Associated Press, which struck a training data licensing deal with OpenAI. Meanwhile, X (née Twitter) has &lt;a href="https://cybernews.com/news/twitter-blocks-non-users-reading-tweets-ai-scraping/" rel="noopener noreferrer"&gt;taken steps&lt;/a&gt; to block AI companies from scraping content on the platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Web data and RAG&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The knowledge of LLMs is limited to the public data they were trained on. Building AI applications that can retrieve proprietary data or public data introduced after a model's cutoff date and generate content based on it requires augmenting the knowledge of the model with specific information. That process is known as retrieval-augmented generation (RAG), and it has revolutionized search and information retrieval.&lt;/p&gt;

&lt;p&gt;While the likes of LangChain and LlamaIndex swiftly took center stage in this field, web scraping (being the most efficient way to collect web data) has remained a significant part of RAG solutions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To work around the training data cutoff problem and provide models with up-to-date knowledge, LLM applications often need to extract data from the web. This so-called retrieval-augmented generation (RAG) is what gives LLMs their superpowers, and arguably it is the strongest use case of LLMs.&lt;/p&gt;

&lt;p&gt;Jan Čurn, Apify Founder &amp;amp; CEO&lt;/p&gt;
&lt;/blockquote&gt;
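<br/>
&lt;p&gt;Stripped of any particular framework, the RAG loop itself is conceptually simple. In the sketch below, &lt;code&gt;index.search&lt;/code&gt; and &lt;code&gt;llm.complete&lt;/code&gt; are hypothetical stand-ins for whatever vector store and LLM client you happen to use:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def retrieve(query, index, k=3):
    # Return the k most relevant documents (e.g., scraped web pages) for the query
    return index.search(query, top_k=k)

def answer_with_rag(query, index, llm):
    docs = retrieve(query, index)
    context = "\n\n".join(doc.text for doc in docs)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    # The retrieved context augments the model's built-in knowledge
    return llm.complete(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;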

&lt;h3&gt;
  
  
  &lt;strong&gt;Adding data to custom GPTs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;OpenAI launched GPTs (custom versions of ChatGPT) in November 2023. This was a really big deal, as suddenly, everyone had the means to build their own AI models. These GPTs can be customized not only with instructions but also with extra knowledge (by uploading files) and a combination of skills (with API specifications). In other words, you can give such GPTs web scraping capabilities with the right specs or scrape websites to upload knowledge to a GPT so it can base generated content on that information.&lt;/p&gt;

&lt;p&gt;The hype around GPTs was quickly replaced by a huge furore around the firing and return of OpenAI's CEO. As a result, the debut of the GPT Store, which lets users monetize their GPTs, was postponed; it finally launched in early 2024.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;EU AI Act represents break-through legislation for AI and web scraping&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After the global shake-up in the world of personal data protection represented by GDPR, the EU reached a provisional agreement on the EU AI Act, which has similar ambitions for the world of artificial intelligence as GDPR had for personal data. Hailed by EU officials as &lt;em&gt;global first&lt;/em&gt; and &lt;em&gt;historic&lt;/em&gt;, the Act positions the EU as a frontrunner in the field of AI regulation.&lt;/p&gt;

&lt;p&gt;The EU adopted a risk-based approach, dividing AI systems into four categories: (1) unacceptable risk, (2) high risk, (3) limited risk, and (4) minimal/no risk.&lt;/p&gt;

&lt;p&gt;Firstly, the unacceptable risk category covers AI systems that contravene EU values and are considered a threat to fundamental rights. These systems will be banned entirely. Among others, this category will include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;biometric categorization systems that use sensitive characteristics (e.g., political, religious, philosophical beliefs, sexual orientation, race, etc.); &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;untargeted scraping of facial images from the Internet or CCTV footage to create facial recognition databases; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;emotion recognition, social scoring, AI systems manipulating human behavior, or exploiting vulnerabilities of people (due to their age, disability, social or economic situation, etc.). &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the EU regulators incorporated several exceptions to using AI systems in this category, such as the use of biometric identification systems for law enforcement purposes, which will be subject to prior judicial authorization and only for a strictly defined list of crimes.&lt;/p&gt;

&lt;p&gt;Secondly, the Act will include some AI systems in the high-risk category due to their significant potential harm to health, safety, fundamental rights, the environment, democracy, and the rule of law. Among others, this category will include AI systems in the field of medical devices, certain critical infrastructure, systems used to influence the outcome of elections or voter behavior, and more. These systems will be subject to comprehensive mandatory compliance obligations, such as fundamental rights impact assessment, conducting model evaluations and testing, reporting serious incidents, etc.&lt;/p&gt;

&lt;p&gt;Thirdly, the AI systems classified as limited risk, such as chatbots, will be subject to minimal obligations, such as the requirement to inform users that they are interacting with an AI system and the obligation to mark the image, audio, or video content generated by AI.&lt;/p&gt;

&lt;p&gt;Lastly, all AI systems not classified in one of the other three categories will be classified as minimal/no risk. The Act allows for the free use of minimal and no-risk AI systems, with voluntary codes of conduct encouraged.&lt;/p&gt;

&lt;p&gt;Violations of the Act will be subject to fines, depending on the type of AI system, the size of the company, and the severity of the infringement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-use-langchain/" rel="noopener noreferrer"&gt;How to use LangChain&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;What is retrieval-augmented generation?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;How to create a custom AI chatbot&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/llamaindex-vs-langchain/" rel="noopener noreferrer"&gt;LlamaIndex vs. LangChain&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/add-custom-actions-to-your-gpts/" rel="noopener noreferrer"&gt;How to add custom actions to GPTs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/ai-web-scraping-trends-predictions/" rel="noopener noreferrer"&gt;AI and web scraping in 2024: trends and predictions&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/how-to-do-question-answering-from-a-pdf/" rel="noopener noreferrer"&gt;How to do question answering from a PDF&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;How to collect data for LLMs&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;How Intercom uses Apify to feed web data to its AI chatbot&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🌟 Apify's contributions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Of course, we could not pass up an opportunity to contribute to the party. For the 8 years that Apify has been on the market, from the &lt;a href="https://blog.apify.com/our-experience-of-the-inaugural-y-combinator-fellowship-yc-f1-309cdcd021df/#.xh3dw7bzg" rel="noopener noreferrer"&gt;early days in Y Combinator&lt;/a&gt; through the transition from &lt;a href="https://blog.apify.com/apifier-is-now-apify/" rel="noopener noreferrer"&gt;Apifier&lt;/a&gt; to now, it's been our goal to develop the cloud computing platform for automation and web scraping tools. So here's what we did this year to come a little bit closer to that goal.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Support for Python users: SDK, code templates, and Scrapy spiders&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We started the year off pretty strong by taking a significant and probably unexpected step forward. In March 2023 (on Pi Day, to be precise), we launched the &lt;a href="https://blog.apify.com/apify-python-sdk/" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; to expand our toolset for Python developers. Now, if you know anything about Apify, you know that we have traditionally been on the Node.js/JavaScript side of things. But things change, and so do the market and the requests from our users. Being a start-up means venturing in different directions and trying different things when the situation calls for it. And since we consistently work on becoming the platform for web scraping and automation, launching a library for Python developers, giving them something to start from, just made sense.&lt;/p&gt;

&lt;p&gt;As a follow-up step, we've rolled out web scraping templates aimed to simplify and improve the developer experience on our platform. We've realized not everyone wants to use ready-made tools in the &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Store&lt;/a&gt; or have complete control over every single aspect when building a scraper like with &lt;a href="https://crawlee.dev/?%20__hstc=160404322.72540665235755e5af5a21367ab1294a.1713784601604.1713784601604.1713788960146.2&amp;amp;__%20hssc=160404322.1.1713788960146&amp;amp;__hsfp=3439275840" rel="noopener noreferrer"&gt;Crawlee&lt;/a&gt;. &lt;a href="https://apify.com/templates" rel="noopener noreferrer"&gt;Web scraping templates&lt;/a&gt; seemed like a great third option, so here they are: in JavaScript, TypeScript, and Python. We've also launched the &lt;a href="https://apify.com/pricing/creator-plan" rel="noopener noreferrer"&gt;$1/month Creator Plan&lt;/a&gt; to support our most avid and enthusiastic users who are interested in building Actors.&lt;/p&gt;

&lt;p&gt;Last but not least, we've made it possible to &lt;a href="https://apify.com/run-scrapy-in-cloud" rel="noopener noreferrer"&gt;deploy Scrapy spiders to our cloud&lt;/a&gt; platform. All you have to do is use a Scrapy wrapper. The platform provides proxies and API and allows our Python users to run, schedule, monitor, and monetize their spiders.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Store and community growth&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This year we've had to deal with unprecedented interest and growth of our Actors published in Store. The number of users engaging with &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;Public Actors in Store&lt;/a&gt; &lt;strong&gt;has doubled&lt;/strong&gt; , soaring from 8,971 to 17,070. In terms of new contributions, we've seen a significant influx, with &lt;strong&gt;657 new Actors&lt;/strong&gt; being published this year, a substantial increase compared to the 290 in 2022. Moreover, our community has been enriched by the addition of &lt;strong&gt;96 new community developers&lt;/strong&gt; , who have joined us with their Public Actors, doubling the number from the 48 who joined in 2022. This growth not only reflects the rising popularity of our platform but also underscores the expanding ecosystem for web scraping and automation we're building together.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;New integrations and AI ventures&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We've launched integrations with &lt;a href="https://llamahub.ai/l/apify-actor" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; and &lt;a href="https://help.apify.com/en/articles/7888045-how-to-integrate-langchain-with-apify-actors" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, marking a notable expansion in its collaboration network. These integrations mean you can load scraped datasets directly into LangChain or LlamaIndex vector indexes and build AI chatbots such as &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;Intercom's Fin&lt;/a&gt; or other apps that query text data crawled from websites.&lt;/p&gt;
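<br/>
&lt;p&gt;For instance, the LangChain integration lets you run an Actor and map its dataset items straight into documents, roughly as in the sketch below. Import paths and signatures vary between LangChain versions, so treat this as a sketch rather than copy-paste code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper()  # expects APIFY_API_TOKEN in the environment

# Run an Actor and map each dataset item to a LangChain Document
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "",
        metadata={"source": item["url"]},
    ),
)

# The loader yields documents ready to be pushed into a vector index
docs = loader.load()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;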

&lt;p&gt;We've also introduced 3 AI tools in our Store to help fuel large language models and the like: &lt;a href="https://apify.com/drobnikj/gpt-scraper" rel="noopener noreferrer"&gt;GPT Scraper&lt;/a&gt; and &lt;a href="https://apify.com/drobnikj/extended-gpt-scraper" rel="noopener noreferrer"&gt;Extended GPT Scraper&lt;/a&gt;, &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;, and &lt;a href="https://apify.com/apify/ai-web-agent" rel="noopener noreferrer"&gt;AI Web Agent&lt;/a&gt;. Last but not least, we've launched a web scraping solution that isn't LLM-related but nevertheless has AI at its core, &lt;a href="https://apify.com/equidem/ai-product-matcher" rel="noopener noreferrer"&gt;AI Product Matcher&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Blog and YouTube&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Regarding content, you may have noticed that our blog has switched to a more technical approach, as have our &lt;a href="https://www.youtube.com/channel/UCTgwcoeGGKmZ3zzCXN2qo_A" rel="noopener noreferrer"&gt;YouTube tutorials&lt;/a&gt;. We've also recorded our first &lt;a href="https://www.youtube.com/channel/UCTgwcoeGGKmZ3zzCXN2qo_A" rel="noopener noreferrer"&gt;podcast about the legality of web scraping&lt;/a&gt;. We've held &lt;a href="https://www.youtube.com/@Apify/streams" rel="noopener noreferrer"&gt;three webinars&lt;/a&gt; on pretty extensive topics and experimented with posting &lt;a href="https://www.youtube.com/@Apify/shorts" rel="noopener noreferrer"&gt;Shorts&lt;/a&gt;. Our internal user engagement is as strong as ever, with our newsletter reaching over 68K people every month at around a 65% open rate. You can now subscribe to an &lt;a href="https://www.linkedin.com/newsletters/pro-web-scraping-7133073105995845632/" rel="noopener noreferrer"&gt;online version of it on LinkedIn&lt;/a&gt; if you don't like your inbox crowded.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apify platform&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our crown jewel, the Apify platform, is evolving day by day, not only design and UX-wise but also functionality-wise. We are currently working on a video of a new tour of Apify that will showcase all the new features and changes made this past year. But for now, here's something to look back on and appreciate the progress:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you in the new year!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>ai</category>
    </item>
    <item>
      <title>Groupon reaches new merchants thanks to web data collection</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Mon, 04 Dec 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/groupon-reaches-new-merchants-thanks-to-web-data-collection-bb5</link>
      <guid>https://dev.to/apify/groupon-reaches-new-merchants-thanks-to-web-data-collection-bb5</guid>
      <description>&lt;p&gt;&lt;a href="https://www.groupon.com/" rel="noopener noreferrer"&gt;Groupon&lt;/a&gt; (NASDAQ: &lt;a href="https://www.nasdaq.com/market-activity/stocks/grpn" rel="noopener noreferrer"&gt;GRPN&lt;/a&gt;) is the worlds most popular marketplace to find deals for activities, travel, goods, and services offered by local merchants in hundreds of cities around the globe. Groupon, originally meant as "group" + "coupon, was founded on the idea that the collective bargaining power of a large number of people can get them better deals than they could get individually.&lt;/p&gt;

&lt;p&gt;In March 2023, Dušan Šenkypl from &lt;a href="https://palefirecapital.com/" rel="noopener noreferrer"&gt;Pale Fire Capital&lt;/a&gt; became Groupon's new interim CEO and set an ambitious goal to rapidly expand the business by reaching new merchants and thus offering more deals to consumers. Recognizing the potential of web data to find new leads and enrich existing ones, Šenkypl turned to &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; to leverage its expertise in web data collection.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4eu0vqc3kntsp963sjw.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa4eu0vqc3kntsp963sjw.jpg" alt="Groupon is using web data collection for smart lead generation at scale" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Groupon is using web data collection for smart lead generation at scale&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Groupon was looking for a way to update information about existing merchants, as well as find new ones to ask them to join the network. Such information can be found on search engines, travel sites, online maps, and various other websites.&lt;/p&gt;

&lt;p&gt;The web data-based lead generation and enrichment pipeline had to provide accurate and up-to-date data about tens of thousands of businesses and seamlessly integrate into Groupon's existing Salesforce CRM platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The solution&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apify operates a cloud platform that provides serverless computation, data storage, proxies, open-source SDKs, and hundreds of &lt;a href="https://apify.com/store" rel="noopener noreferrer"&gt;ready-made web scraping Actors&lt;/a&gt; built by community developers. Apify's Enterprise solutions team helped Groupon set up various Actors to extract the required data and run them at scale in the cloud.&lt;/p&gt;
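<br/>
&lt;p&gt;The specific Actors and inputs used for Groupon aren't public, but running an Actor and reading its results with the Apify Python client generally follows the same pattern, sketched here with placeholder names and inputs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from apify_client import ApifyClient

client = ApifyClient("MY-APIFY-TOKEN")  # placeholder token

# Start an Actor run and wait for it to finish (Actor name and input are illustrative;
# every Actor defines its own input schema)
run = client.actor("username/some-actor").call(
    run_input={"startUrls": [{"url": "https://example.com"}]},
)

# Iterate over the items the run stored in its default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;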

&lt;p&gt;To ensure the data fits into Groupon's specific Salesforce implementation, Apify built a new Actor to filter, organize, and match the business data. Thanks to the modularity of the Apify platform, this custom solution was prepared in a very short time, helping Groupon reach new merchants faster than with other solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The outcome&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Groupon's sales team now has a rich database filled with potential leads right at their fingertips. The automation of the entire data journey, from extraction to integration, translated into significant time savings, heightened efficiency, and, ultimately, a stronger position within the e-commerce space.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;*&lt;/em&gt;"We selected Apify because of their vast experience with web data collection. The project has been delivered on a short schedule, and our sales teams are now empowered with fresh, unique leads that drive targeted campaigns and strategic outreach."&lt;/p&gt;

&lt;p&gt;Filip Popovic, SVP Transformation &amp;amp; Product &amp;amp; HR at Groupon_*_&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Technical details&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The solution was composed of the following parts:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Configuring existing Actors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The data extraction process commenced with a custom-designed Actor, &lt;strong&gt;&lt;em&gt;New Leads Runner&lt;/em&gt;&lt;/strong&gt;, delivered by Apify, to fine-tune Groupon's search criteria and ensure that the data sourced from other Actors is as relevant and targeted as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Mining business information&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After precise input preparation, Apify could pinpoint and collate business information aligning with Groupon's focus areas. This phase was not just about gathering data legally and ethically but doing so in a way that adhered to Groupon's stringent quality standards.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Ensuring data quality&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data duplication can be a significant issue when handling vast amounts of information. Thanks to Apify's &lt;a href="https://apify.com/lukaskrivka/dedup-datasets" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;Merge, Dedup, and Transform Datasets&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; Actor, we could ensure that each business entry was unique by eliminating duplicates and contained the most relevant information by merging attributes from various sources.&lt;/p&gt;
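
&lt;p&gt;For illustration, here is a minimal sketch of how such a deduplication step could be invoked programmatically with the Apify JavaScript client. The input field names (&lt;code&gt;datasetIds&lt;/code&gt;, &lt;code&gt;fields&lt;/code&gt;) are assumptions made for the example - the Actor's actual input schema is the source of truth.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Call the dedup Actor and wait for the run to finish.
// NOTE: the input field names below are illustrative assumptions.
const run = await client.actor('lukaskrivka/dedup-datasets').call({
    datasetIds: ['DATASET_ID_1', 'DATASET_ID_2'],
    fields: ['name', 'address'],
});

// Read the deduplicated records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`Deduplicated records: ${items.length}`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;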

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Integrating data with Salesforce&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the lead generation pipeline was producing clean data, the next step was to integrate it into Groupon's existing CRM. With another custom-built Actor - &lt;strong&gt;&lt;em&gt;Salesforce Uploader&lt;/em&gt;&lt;/strong&gt; - Groupon could transfer its newfound leads into Salesforce. The uploader also cross-references the new data with existing entries to ensure that only new businesses are added.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Who are Groupon and Apify?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.groupon.com/" rel="noopener noreferrer"&gt;Groupon&lt;/a&gt; (NASDAQ: GRPN) is a global e-commerce marketplace based in Chicago that connects subscribers with local merchants.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;Apify&lt;/a&gt; is a full-stack &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and browser automation platform. In addition to its vast range of pre-built data extraction tools, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;enterprise solutions&lt;/a&gt; with its team of experts who know how to handle the challenges of collecting data from arbitrary websites at scale.&lt;/p&gt;

</description>
      <category>casestudy</category>
      <category>webscraping</category>
      <category>data</category>
    </item>
    <item>
      <title>Crawlee data storage types: saving files, screenshots, and JSON results</title>
      <dc:creator>Percival Villalva</dc:creator>
      <pubDate>Mon, 27 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</link>
      <guid>https://dev.to/apify/crawlee-data-storage-types-saving-files-screenshots-and-json-results-j9o</guid>
      <description>&lt;p&gt;&lt;strong&gt;We're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a full-stack web scraping and browser automation platform. We are the maintainers of the open-source library&lt;/strong&gt; &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;&lt;strong&gt;Crawlee&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Managing and storing the data you collect is a crucial part of any &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; and data extraction project. It's often a complex task, especially when handling large datasets and ensuring output accuracy. Fortunately, Crawlee simplifies this process with its versatile storage types.&lt;/p&gt;

&lt;p&gt;In this article, we will look at Crawlee's storage types and demonstrate how they can make our lives easier when extracting data from the web.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Setting up Crawlee&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Setting up a Crawlee project is straightforward, provided you &lt;a href="https://blog.apify.com/how-to-install-nodejs/" rel="noopener noreferrer"&gt;have Node&lt;/a&gt; and npm installed. To begin, create a new Crawlee project using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx crawlee create crawlee-data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the command, you will be given a few template options to choose from. We will go with the CheerioCrawler JavaScript template. Remember, Crawlee's storage types are consistent across all crawlers, so the concepts we discuss here apply to any Crawlee crawler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z38knby90ahxbr4bsu3.png" alt="Crawlee template options" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee template options&lt;/p&gt;

&lt;p&gt;Once installed, you'll find your new project in the &lt;code&gt;crawlee-data&lt;/code&gt; directory, ready with a template code that scrapes the &lt;a href="https://crawlee.dev/" rel="noopener noreferrer"&gt;crawlee.dev&lt;/a&gt; website:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08o7ya1h2bylhtofqbjl.png" alt="CheerioCrawler template code" width="800" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To test it, simply run &lt;code&gt;npm start&lt;/code&gt; in your terminal. You'll notice a &lt;code&gt;storage&lt;/code&gt; folder appear with subfolders like &lt;code&gt;datasets&lt;/code&gt;, &lt;code&gt;key_value_stores&lt;/code&gt;, and &lt;code&gt;request_queues&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6t97892ffkx2ji0wxjw.png" alt="Crawlee storage" width="368" height="670"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Crawlee's storage can be divided into two categories: &lt;strong&gt;Request Storage (Request Queue and Request List)&lt;/strong&gt; and &lt;strong&gt;Results Storage (Datasets and Key Value Stores)&lt;/strong&gt;. Both are stored locally by default in the &lt;code&gt;./storage&lt;/code&gt; directory.&lt;/p&gt;

&lt;p&gt;Also, remember that Crawlee, by default, clears its storages before starting a crawler run. This action is taken to prevent old data from interfering with new crawling sessions. In case you need to clear the storages earlier than this, Crawlee provides a handy &lt;code&gt;purgeDefaultStorages()&lt;/code&gt; helper function for this purpose.&lt;/p&gt;
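
&lt;p&gt;As a small sketch, calling the helper explicitly looks like this (assuming a standard Crawlee project where storages live in &lt;code&gt;./storage&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { purgeDefaultStorages } from 'crawlee';

// Explicitly wipe the default local storages (datasets, key-value
// stores, request queues) before the crawler starts adding new data.
await purgeDefaultStorages();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;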

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request queue&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-queue" rel="noopener noreferrer"&gt;request queue&lt;/a&gt; is a storage of URLs to be crawled. It's particularly useful for deep crawling, where you start with a few URLs and then recursively follow links to other pages.&lt;/p&gt;

&lt;p&gt;Each Crawlee project run is associated with a default request queue, which is typically used to store URLs for that specific crawler run.&lt;/p&gt;

&lt;p&gt;To illustrate that, let's go to the &lt;code&gt;routes.js&lt;/code&gt; file in the template we just generated. There you will find the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ enqueueLinks, log }) =&amp;gt; { log.info(`enqueueing new URLs`); // Add links found on page to the queue await enqueueLinks({ globs: ['https://crawlee.dev/**'], label: 'detail', });});router.addHandler('detail', async ({ request, $, log, pushData }) =&amp;gt; { const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a closer look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function, particularly focusing on the &lt;code&gt;enqueueLinks&lt;/code&gt; function it contains. The &lt;code&gt;enqueueLinks&lt;/code&gt; function in Crawlee is designed to automatically detect all links on a page and add them to the request queue. However, its utility extends further as it allows us to specify certain options for more precise control over which links are added.&lt;/p&gt;

&lt;p&gt;For instance, in our example, we use the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#globs" rel="noopener noreferrer"&gt;&lt;strong&gt;globs&lt;/strong&gt;&lt;/a&gt; option to ensure that only links starting with &lt;code&gt;https://crawlee.dev/&lt;/code&gt; are queued. Furthermore, we assign a detail &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#globs" rel="noopener noreferrer"&gt;&lt;strong&gt;label&lt;/strong&gt;&lt;/a&gt; to these links. This labeling is particularly useful as it lets us refer to these links in subsequent handler functions, where we can define specific data extraction operations for pages associated with this label.&lt;/p&gt;

&lt;p&gt;💡 See all the available options for enqueueLinks in the &lt;a href="https://crawlee.dev/api/core/interface/EnqueueLinksOptions#label" rel="noopener noreferrer"&gt;Crawlee documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In line with our discussion on data storage types, we can now find all the links that our crawler has navigated through in the &lt;code&gt;request_queues&lt;/code&gt; storage, located within the crawler's &lt;code&gt;./storage/request_queues&lt;/code&gt; directory. Here, we can access detailed information about each request that has been processed in the request queue.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwg9im75xbf1hllzmabt.png" alt="Request Queue" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee request list&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;request list&lt;/a&gt; differs from the request queue as it's not a form of storage in the conventional sense. Instead, it's a predefined collection of URLs for the crawler to visit.&lt;/p&gt;

&lt;p&gt;This approach is particularly suited for situations where you have a set of known URLs to crawl and don't plan to add new ones as the crawl progresses. Essentially, the request list is set in stone once created, with no option to modify it by adding or removing URLs.&lt;/p&gt;

&lt;p&gt;To demonstrate this concept, we'll modify our template to utilize a predefined set of URLs in the request list rather than the request queue. We'll begin with adjustments to the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler, RequestList } from 'crawlee';import { router } from './routes.js';const sources = [{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },];const requestList = await RequestList.open('my-list', sources);const crawler = new CheerioCrawler({ requestList, requestHandler: router,});await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this new approach, we created a predefined list of URLs, named &lt;code&gt;sources&lt;/code&gt;, and passed this list into a newly established requestList. This requestList was then passed into our crawler object.&lt;/p&gt;

&lt;p&gt;As for the &lt;code&gt;routes.js&lt;/code&gt; file, we simplified it to include just a single request handler. This handler is now responsible for executing the data extraction logic on the URLs specified in the request list.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { createCheerioRouter } from 'crawlee';export const router = createCheerioRouter();router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following these modifications, when you run your code, you'll observe that only the URLs explicitly defined in our request list are being crawled.&lt;/p&gt;

&lt;p&gt;This brings us to an important distinction between the &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;two types of request storages&lt;/a&gt;. The request queue is dynamic, allowing for the addition and removal of URLs as needed. On the other hand, the request list is static once initialized and is not meant for dynamic changes.&lt;/p&gt;
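
&lt;p&gt;To make that distinction concrete, here is a minimal sketch of the dynamic behavior: the queue is seeded with one URL, and new URLs discovered while crawling are added to it on the fly via &lt;code&gt;enqueueLinks&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // The request queue is dynamic: URLs discovered on the page
        // are pushed to it while the crawl is already running.
        await enqueueLinks({ globs: ['https://crawlee.dev/docs/**'] });
    },
});

// Seed the default request queue with a single starting URL.
await crawler.addRequests([{ url: 'https://crawlee.dev' }]);
await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;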

&lt;p&gt;With the request storage out of the way, let's now explore the result storage in Crawlee, starting with datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Crawlee datasets&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crawlee.dev/api/types/interface/Dataset" rel="noopener noreferrer"&gt;Datasets&lt;/a&gt; in Crawlee serve as repositories for structured data, where every entry possesses consistent attributes.&lt;/p&gt;

&lt;p&gt;Datasets are designed for append-only operations. This means we can only add new records to a dataset, and altering or deleting existing ones is not an option. Each project run in Crawlee is linked to a default dataset, which is commonly utilized for storing precise results from our web crawling activities.&lt;/p&gt;

&lt;p&gt;You might have noticed that each time we ran the crawler, the folder &lt;code&gt;./storage/datasets&lt;/code&gt; was populated with a series of JSON files containing extracted data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5awwq5f9vrp9l4rlx0un.png" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Storing scraped data into a dataset is &lt;a href="https://crawlee.dev/docs/guides/request-storage#request-list" rel="noopener noreferrer"&gt;remarkably simple&lt;/a&gt; using Crawlee's &lt;code&gt;Dataset.pushData()&lt;/code&gt; function. Each invocation of &lt;code&gt;Dataset.pushData()&lt;/code&gt; generates a new table row, with the property names of your data serving as the column headings. By default, these rows are stored as JSON files on your disk. However, Crawlee allows you to integrate other storage systems as well.&lt;/p&gt;

&lt;p&gt;For a practical example, let's take another look at the &lt;code&gt;addDefaultHandler&lt;/code&gt; function within &lt;code&gt;routes.js&lt;/code&gt;. Here, you can see how the &lt;code&gt;pushData()&lt;/code&gt; function was used to append the scraped results to the Dataset.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;routes.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;router.addDefaultHandler(async ({ request, $, log, pushData }) =&amp;gt; { log.info(`Extracting data...`); const title = $('title').text(); log.info(`${title}`, { url: request.loadedUrl }); await pushData({ url: request.loadedUrl, title, });});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
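
&lt;p&gt;Beyond the default dataset used by &lt;code&gt;pushData()&lt;/code&gt;, you can also open and use named datasets directly. Here is a minimal sketch (the dataset name &lt;code&gt;crawlee-titles&lt;/code&gt; is just an example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Dataset } from 'crawlee';

// Open (or create) a named dataset instead of the default one.
const dataset = await Dataset.open('crawlee-titles');

// Append a record; each call adds one row to the dataset.
await dataset.pushData({ url: 'https://crawlee.dev', title: 'Crawlee' });

// Read the stored items back, e.g. for a quick sanity check.
const { items } = await dataset.getData();
console.log(items);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;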



&lt;h2&gt;
  
  
  &lt;strong&gt;Key-value store&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value sto&lt;/a&gt;&lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;re in Crawlee i&lt;/a&gt;s &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;designed for st&lt;/a&gt;oring and retrieving data records or files. Each record is tagged with a unique key and linked to a specific MIME content type. This feature makes it perfect for storing various types of data, such as screenshots, PDFs, or even for maintaining the state of crawlers.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving screenshots&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To showcase the flexibility of the &lt;a href="https://crawlee.dev/api/core/class/KeyValueStore" rel="noopener noreferrer"&gt;key-value store&lt;/a&gt; in Crawlee, let's take a screenshot of each page we crawl and save it using Crawlee's key-value store.&lt;/p&gt;

&lt;p&gt;However, to do that, we need to switch our crawler from CheerioCrawler to PuppeteerCrawler. The good news is that adapting our code to different crawlers is quite straightforward. For this demonstration, we'll temporarily set aside the &lt;code&gt;routes.js&lt;/code&gt; file and concentrate our crawler logic in the &lt;code&gt;main.js&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;To get started with PuppeteerCrawler, the first step is to install the Puppeteer library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, adapt the code in your main.js file as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;main.js&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Capture the screenshot await saveSnapshot({ key, saveHtml: false }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the code above, we should see three screenshots, one for each website crawled, appear in our crawler's &lt;code&gt;key_value_stores&lt;/code&gt; storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhq2r985ohebqewar73p9.png" width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Saving pages as PDF files&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Suppose we want to convert the page content into a PDF file and save it in the key-value store. This is entirely feasible with Crawlee. Thanks to Crawlee's PuppeteerCrawler being built upon Puppeteer, we can fully utilize all the native features of Puppeteer. To achieve this, we simply need to tweak our code a bit. Here's how to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PuppeteerCrawler } from 'crawlee';// Create a PuppeteerCrawlerconst crawler = new PuppeteerCrawler({ async requestHandler({ page, request, saveSnapshot }) { // Convert the URL into a valid key const key = request.url.replace(/[:/]/g, '_'); // Save as PDF await page.pdf({ path: `./storage/key_value_stores/default/${key}.pdf`, format: 'A4', }); },});await crawler.addRequests([{ url: 'https://crawlee.dev' }, { url: 'https://crawlee.dev/docs/3.0/quick-start' }, { url: 'https://crawlee.dev/api/core' },]);await crawler.run();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to the earlier example involving screenshots, executing this code will create three PDF files, each capturing the content of the accessed websites. These files will then be saved into Crawlee's key-value store.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Doing more with your Crawlee scraper&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;That's it for an introduction to Crawlee's data storage types. As a next step, I encourage you to take your scraper to the next level by &lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;deploying it on the Apify platform as an Actor.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With your scraper running on the Apify platform, you gain access to all of Apify's extensive list of features tailored for web scraping jobs, like cloud storage and various data export options. Not sure what it means or how to do it? Don't worry, all the information you need is in this &lt;a href="https://crawlee.dev/docs/deployment/apify-platform" rel="noopener noreferrer"&gt;link to the Crawlee documentation&lt;/a&gt;.&lt;/p&gt;
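
&lt;p&gt;For a rough idea of what that involves, here is a minimal sketch of the usual first step: wrapping the crawler in &lt;code&gt;Actor.init()&lt;/code&gt; and &lt;code&gt;Actor.exit()&lt;/code&gt; from the Apify SDK (assuming the &lt;code&gt;routes.js&lt;/code&gt; router from the template above).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';
import { router } from './routes.js';

// Initialize the Actor environment (works locally and on the platform).
await Actor.init();

const crawler = new CheerioCrawler({ requestHandler: router });
await crawler.run(['https://crawlee.dev']);

// Gracefully finish the Actor run.
await Actor.exit();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;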

&lt;p&gt;&lt;a href="https://crawlee.dev/docs/introduction/deployment" rel="noopener noreferrer"&gt;Deploy your Crawlee scrapers on the Apify platform&lt;/a&gt;&lt;/p&gt;

</description>
      <category>crawlee</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Web scraping for machine learning</title>
      <dc:creator>Theo Vasilis</dc:creator>
      <pubDate>Sun, 26 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/web-scraping-for-machine-learning-3834</link>
      <guid>https://dev.to/apify/web-scraping-for-machine-learning-3834</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a cloud platform that helps you build reliable web scrapers fast and automate anything you can do manually in a web browser. This article on web scraping for machine learning was inspired by our work on&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;collecting data for AI and ML applications&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At its simplest, &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt; is the automated extraction of data from websites. This process is akin to &lt;a href="https://blog.apify.com/what-are-web-crawlers-and-how-do-they-work/" rel="noopener noreferrer"&gt;web crawling&lt;/a&gt;, which is about finding or discovering web links. The difference is that web scraping focuses on extracting the data from those pages.&lt;/p&gt;

&lt;p&gt;Initially, web scraping was a manual, cumbersome process, but with technological advances being what they are, it has become an automated, sophisticated practice. Web scrapers can navigate websites, understand their structure, and extract specific information based on predefined criteria.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping 101: learn the basics&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is web scraping used in machine learning?&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;In most cases, you can't build high-quality predictive models with just internal data.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asif Syed, Vice President of Data Strategy, Hartford Steam Boiler&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The ability to harvest and process data from a myriad of web sources is what makes web scraping indispensable for machine learning. Web scraping isn't just about accessing the data but transforming it from the unstructured format of web pages into &lt;a href="https://blog.apify.com/when-data-gets-too-big-why-you-need-structured-data/" rel="noopener noreferrer"&gt;structured&lt;/a&gt; datasets that can be efficiently used in machine learning algorithms.&lt;/p&gt;

&lt;p&gt;You can't teach a machine to make predictions or carry out tasks based on data unless you have an awful lot of data to train it. From social media analytics to competitive market research, web scraping enables the gathering of diverse datasets to teach machines, such as today's so-called 'AI models', and provide them with a rich and nuanced understanding of the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparing data collection methods for machine learning&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are multiple ways to collect data for machine learning. These range from traditional surveys and manually curated databases to cutting-edge techniques that utilize IoT devices. So, why choose web scraping over other methods of data acquisition?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surveys:&lt;/strong&gt; They can provide highly specific data but often suffer from biases and limited scope.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Databases:&lt;/strong&gt; These offer structured information, yet they may lack the real-time aspect essential for certain machine learning applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IoT devices:&lt;/strong&gt; These bring in a wave of real-time, sensor-based data, but they're constrained by the type and quantity of data they can collect. It's worth noting that implementing &lt;a href="https://cedalo.com/blog/mqtt-authentication-and-authorization-on-mosquitto/" rel="noopener noreferrer"&gt;MQTT authentication&lt;/a&gt; enhances the security and efficiency of data transmission and allows these devices to communicate more reliably.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Web scraping:&lt;/strong&gt; In contrast, web scraping provides access to an almost infinite amount of data available online, from text and images to metadata and more. Unlike surveys or databases, web scraping taps into real-time data, which is crucial for models requiring up-to-date information. Moreover, the diversity of data that can be scraped from the web is unparalleled, which allows for a more comprehensive training of machine learning models.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/building-functional-ai-models-for-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn about building functional AI models for web scraping&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Quality and quantity of data in ML&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You can have all of the fancy tools, but if your data quality is not good, you're nowhere.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Veda Bawo, Director of Data Governance, Raymond James&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The adage "quality over quantity" holds a significant place in many fields, but in the world of machine learning, it's not a matter of choosing one over the other. The success of ML models is deeply rooted in the quality and quantity of data they're trained on.&lt;/p&gt;

&lt;p&gt;Quality of data refers to its accuracy, completeness, and relevance. High-quality data is free from errors, inconsistencies, and redundancies, making it indispensable for dependable analysis and sound decision-making. On the other hand, the quantity of data pertains to its volume. A larger dataset provides more information, leading to more reliable models and improved outcomes. However, an abundance of low-quality data can be detrimental, potentially leading to inaccurate predictions and suboptimal decision-making.&lt;/p&gt;

&lt;p&gt;When it comes to quantity, web scraping allows for the collection of vast amounts of data from various online sources. However, the web is full of low-quality data, so simply extracting raw data isn't enough. It needs to be cleaned and processed before it can be used for machine learning. More about that later.&lt;/p&gt;

&lt;p&gt;Another crucial aspect of data for machine learning is variety. Web scraping provides access to diverse data to enhance a model's ability to understand and interpret varied inputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud-based real-time data acquisition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In the context of machine learning, the ability to collect and process data in real time is increasingly becoming a necessity rather than a luxury. This is where cloud-based data acquisition plays a vital role: in contrast to edge-based data acquisition, it offers the scalability and flexibility that are critical for handling the voluminous and dynamic nature of web data.&lt;/p&gt;

&lt;p&gt;Cloud computing, with its vast storage and computational capabilities, allows for the handling of massive datasets that web scraping generates. It provides the infrastructure needed to collect, store, and process data from varied sources in real-time. This real-time aspect is especially important in applications like market analysis, social media monitoring, and predictive modeling, where the timeliness of data can be the difference between relevance and obsolescence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/edge-ai-vs-cloud-ai/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn about the differences between Edge AI and Cloud AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Web scraping challenges and techniques for machine learning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The efficacy of web scraping in machine learning hinges on several key techniques. These not only ensure the extraction of relevant data but also its transformation into a format that machine learning algorithms can effectively utilize.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling dynamic websites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A major challenge in web scraping is dealing with &lt;a href="https://blog.apify.com/what-is-a-dynamic-page/" rel="noopener noreferrer"&gt;dynamic websites&lt;/a&gt; that continually update their content. These sites often use technologies like JavaScript, AJAX, and infinite scrolling, making data extraction more complex. To effectively scrape such sites, you need advanced techniques and tools, sometimes drawing on the expertise of specialists such as companies providing &lt;a href="https://tsh.io/services/" rel="noopener noreferrer"&gt;software development services&lt;/a&gt;. These techniques include executing JavaScript, handling AJAX requests, and navigating through dynamically loaded content. Mastering them enables the scraping of real-time data from these complex websites, a critical requirement for many machine-learning applications.&lt;/p&gt;
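
&lt;p&gt;As a rough sketch of what this looks like in practice, the snippet below uses the open-source Crawlee library with a headless browser to render JavaScript and scroll through dynamically loaded content before extracting anything (the target URL and the &lt;code&gt;timeoutSecs&lt;/code&gt; value are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

// A headless browser crawler can execute JavaScript, wait for AJAX
// content, and scroll through dynamically loaded lists.
const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request, pushData }) {
        // Keep scrolling until no new content is loaded (infinite scroll).
        await playwrightUtils.infiniteScroll(page, { timeoutSecs: 30 });

        const title = await page.title();
        await pushData({ url: request.url, title });
    },
});

await crawler.run(['https://crawlee.dev']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;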

&lt;h3&gt;
  
  
  &lt;strong&gt;Blocking and blacklisting&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Many websites have measures in place to detect and block scraping bots to prevent unauthorized data extraction. These measures include blacklisting IP addresses, deploying CAPTCHAs, and analyzing browser fingerprints. To &lt;a href="https://blog.apify.com/crawl-without-getting-blocked/" rel="noopener noreferrer"&gt;counteract blocking&lt;/a&gt;, web scrapers employ techniques like rotating proxies, mimicking real browser behaviors, and making use of CAPTCHA-solving services.&lt;/p&gt;
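
&lt;p&gt;For example, here is a minimal sketch of rotating requests through a pool of proxies with Crawlee (the proxy URLs are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// Rotate requests across a pool of proxies to reduce the chance
// of IP-based blocking. Replace the placeholder URLs with real proxies.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});

await crawler.run(['https://example.com']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;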

&lt;h3&gt;
  
  
  &lt;strong&gt;Heavy server load&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Web scrapers can inadvertently overload servers with too many requests, leading to performance issues or even server crashes. To prevent this, it's essential to implement intelligent crawl delays, randomize scraping times, and distribute the load across multiple proxies. This approach ensures a polite and responsible scraping process that minimizes the impact on website servers.&lt;/p&gt;
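
&lt;p&gt;In Crawlee, for instance, such throttling can be expressed with a couple of crawler options (the numbers below are arbitrary examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { CheerioCrawler } from 'crawlee';

// Throttle the crawler so it stays polite to the target servers.
const crawler = new CheerioCrawler({
    maxConcurrency: 5,        // at most 5 requests in flight at once
    maxRequestsPerMinute: 60, // hard cap on the request rate
    async requestHandler({ request, $ }) {
        console.log(request.url, $('title').text());
    },
});

await crawler.run(['https://example.com']);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;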

&lt;h2&gt;
  
  
  &lt;strong&gt;What do you do with the scraped data?&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data preprocessing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We said earlier that scraping raw data isn't enough. The next critical step involves &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/#preprocessing" rel="noopener noreferrer"&gt;cleaning and transforming the raw data&lt;/a&gt; into a structured format suitable for machine learning models. This stage includes removing duplicates and inconsistencies, handling missing values, and normalizing data to ensure that it's free from noise and ready for analysis. Preprocessing ensures that the data fed into machine learning models is of high quality, which is essential for accurate results.&lt;/p&gt;
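
&lt;p&gt;A tiny sketch of what such a preprocessing pass might look like in JavaScript, assuming an array of scraped records called &lt;code&gt;scrapedItems&lt;/code&gt; with &lt;code&gt;url&lt;/code&gt; and &lt;code&gt;text&lt;/code&gt; fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// scrapedItems is an assumed array of { url, text, ... } records.
const seen = new Set();

const cleaned = scrapedItems
    // Drop rows with missing fields.
    .filter((item) =&amp;gt; item.url)
    .filter((item) =&amp;gt; item.text)
    // Drop exact duplicates by URL.
    .filter((item) =&amp;gt; {
        if (seen.has(item.url)) return false;
        seen.add(item.url);
        return true;
    })
    // Normalize whitespace in the text before feeding it to a model.
    .map((item) =&amp;gt; ({
        ...item,
        text: item.text.replace(/\s+/g, ' ').trim(),
    }));

console.log(`Kept ${cleaned.length} of ${scrapedItems.length} records`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;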

&lt;h3&gt;
  
  
  &lt;strong&gt;Feature selection&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the data is preprocessed, the next step is to identify and extract the most relevant features from the dataset. This involves analyzing the data to determine which attributes are most significant for the problem at hand. By focusing on the most relevant features, the efficiency and performance of machine learning models are significantly enhanced. This step - also known as &lt;a href="https://blog.apify.com/what-is-data-ingestion-for-large-language-models/#feature-engineering" rel="noopener noreferrer"&gt;feature engineering&lt;/a&gt; - can also help reduce the complexity of the model, making it faster and more efficient.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Integrating web data with ML applications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Once you have your data, you need a way to integrate it with other tools for machine learning. Here are some of the most renowned libraries and databases for ML:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LangChain&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This open-source framework is revolutionizing the way developers integrate large language models (LLMs) with external components in ML applications. It simplifies the interaction with LLMs, facilitating data communication and the generation of vector embeddings. LangChain's ability to connect with diverse model providers and data stores makes it the ML developer's library of choice for building on top of large language models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-langchain/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about LangChain&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hugging Face&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Renowned for its datasets library, Hugging Face is one of the most popular frameworks in the machine learning community. It provides a platform for easily accessing, sharing, and processing datasets for a variety of tasks, including audio, &lt;a href="https://blog.apify.com/data-collection-for-computer-vision/" rel="noopener noreferrer"&gt;computer vision&lt;/a&gt;, and &lt;a href="https://blog.apify.com/text-classification-in-nlp/" rel="noopener noreferrer"&gt;NLP&lt;/a&gt;, making it a crucial tool for ML data readiness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-hugging-face/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Hugging Face&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Haystack&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This tool's ecosystem is vast, integrating with technologies like &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt; and various model providers. It serves as a flexible and dynamic solution for developers looking to incorporate complex functionalities in their ML projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-haystack-nlp-framework/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Haystack&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;LlamaIndex&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LlamaIndex represents a significant advancement in the field of machine learning, particularly in its ability to augment large language models with custom data. This tool addresses a key challenge in ML: the integration of LLMs with private or proprietary data. It offers an approachable platform for even those with limited ML expertise, allowing for the effective use of private data in generating personalized insights.&lt;/p&gt;

&lt;p&gt;With functionalities like &lt;a href="https://blog.apify.com/what-is-retrieval-augmented-generation/" rel="noopener noreferrer"&gt;retrieval-augmented generation (RAG)&lt;/a&gt;, LlamaIndex enhances the capabilities of LLMs, making them more precise and informed in their responses. Its indexing and querying stages, coupled with various types of indexes, such as List, Vector Store, Tree, and Keyword indexes, provide a stable infrastructure for precise data retrieval and use in ML applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.apify.com/platform/integrations/llama" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn how to integrate Apify with LlamaIndex to feed vector databases and LLMs with data crawled from the web&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pinecone and other vector databases&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ML models need numerical data, known as &lt;a href="https://blog.apify.com/what-are-embeddings-in-ai/" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt; in machine learning, so any data you've collected has to be stored in and retrieved from a vector database.&lt;/p&gt;
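
&lt;p&gt;To make the idea concrete, here is a conceptual sketch (not tied to any particular vector database) of what retrieval by similarity means: embeddings are plain arrays of numbers, and the database returns the stored vectors closest to a query vector.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Toy example: find the stored embedding most similar to a query
// embedding using cosine similarity. Real vector databases do this
// at scale with approximate nearest-neighbor indexes.
const cosineSimilarity = (a, b) =&amp;gt; {
    const dot = a.reduce((sum, x, i) =&amp;gt; sum + x * b[i], 0);
    const norm = (v) =&amp;gt; Math.sqrt(v.reduce((sum, x) =&amp;gt; sum + x * x, 0));
    return dot / (norm(a) * norm(b));
};

const stored = [
    { id: 'doc-1', vector: [0.1, 0.9, 0.3] },
    { id: 'doc-2', vector: [0.8, 0.1, 0.5] },
];
const query = [0.1, 0.8, 0.4];

const best = stored
    .map((d) =&amp;gt; ({ id: d.id, score: cosineSimilarity(query, d.vector) }))
    .sort((a, b) =&amp;gt; b.score - a.score)[0];

console.log(best); // the stored document closest to the query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;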

&lt;h3&gt;
  
  
  &lt;strong&gt;Pinecone&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This vector database stands out for its high performance and scalability, which are crucial for ML applications. It's developer-friendly and allows for the creation and management of indexes with simple API calls. Pinecone excels in efficiently retrieving insights and offers capabilities like metadata filtering and namespace partitioning, making it a reliable tool for ML projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/what-is-pinecone-why-use-it-with-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn more about Pinecone&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Chroma&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As an AI-native open-source embedding database, Chroma provides a comprehensive suite of tools for working with embeddings. It features rich search functionalities and integrates with other ML tools, including LangChain and LlamaIndex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For more vector databases, check out&lt;/strong&gt; &lt;a href="https://blog.apify.com/pinecone-alternatives/" rel="noopener noreferrer"&gt;&lt;strong&gt;6 open-source Pinecone alternatives&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Your first web scraping challenge&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If you haven't done web scraping before, we've made it easy (and free) for you to get started. Apify has created a tool ideal for data acquisition for machine learning: &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/webscraping-ai-data-for-llms/" rel="noopener noreferrer"&gt;&lt;strong&gt;Learn how to use Website Content Crawler&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This tool was specifically designed to extract data for feeding, fine-tuning, or training machine learning models such as LLMs. You can retrieve the results via the API in formats such as JSON or CSV, which can be fed directly to your LLM or vector database. You can also integrate the data with LangChain using the &lt;a href="https://python.langchain.com/docs/integrations/tools/apify" rel="noopener noreferrer"&gt;Apify LangChain integration&lt;/a&gt;.&lt;/p&gt;
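
&lt;p&gt;As a rough sketch, running the Actor and downloading its results with the Apify JavaScript client could look like this (the start URL is just an example, and other input options are left at their defaults):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Run Website Content Crawler on a site and wait for it to finish.
const run = await client.actor('apify/website-content-crawler').call({
    startUrls: [{ url: 'https://docs.apify.com' }],
});

// Fetch the scraped pages as JSON, ready for an LLM or vector database.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} pages scraped`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;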

&lt;p&gt;🌐 &lt;a href="https://apify.com/apify/website-content-crawler" rel="noopener noreferrer"&gt;Website Content Crawler&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>10 Google search tricks (that are also Google scraping tricks)</title>
      <dc:creator>Natasha Lekh</dc:creator>
      <pubDate>Thu, 23 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/10-google-search-tricks-that-are-also-google-scraping-tricks-3hdf</link>
      <guid>https://dev.to/apify/10-google-search-tricks-that-are-also-google-scraping-tricks-3hdf</guid>
      <description>&lt;p&gt;Can you apply Google search tricks to scraping and data extraction as well? Let's put it to the test!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We're&lt;/strong&gt; &lt;a href="https://apify.it/platform-pricing" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, the only full-stack web scraping platform. You can build, deploy, share, and monitor scrapers or APIs for any website on Apify.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The savvy Googlers among us always have a few tricks up their sleeve. The question is: can you apply these Google search tricks to scraping and data extraction as well? Let's put it to the test! But before we start, we need to address the elephant in the blog.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;🤔 What do Google search shortcuts have to do with web scraping?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Millions of people rely on Google search every day. Be it for school, research, or simple entertainment, if you know a few Google search shortcuts, your search process is more efficient. The thing is, a lot of those Google search tricks can also apply to &lt;a href="https://blog.apify.com/what-is-web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The reason for this is simple: what a Google scraper does is very similar to what a Google visitor does: it goes to the google.com website, types in a query (even if it contains a shortcut), and receives results. The only difference is that the scraper also copies the results at lightning speed and packages them into a file.&lt;/p&gt;

&lt;p&gt;This means that, if you are familiar with Google tricks and shortcuts, you can use that knowledge to upgrade your Google scraping process. When we built our &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Search Scraper&lt;/a&gt; 🔗 back in the day, we didn't count on this. Now that Google Scraper (also known as &lt;a href="https://blog.apify.com/top-google-search-api/" rel="noopener noreferrer"&gt;Google SERP API&lt;/a&gt;) has over 40,000 users, we feel obliged to let everyone know about this interesting peculiarity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/" rel="noopener noreferrer"&gt;https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Learn how to use our&lt;/strong&gt; &lt;a href="https://blog.apify.com/unofficial-google-search-api-from-apify-22a20537a951/" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Search Scraper&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;without tricks&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;💡 10 Google search tricks (that are also Google scraping tricks)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;So let's put 10 well-known tricks to the test and level up your Google scraping. In other words, let's learn how to scrape Google like a pro 😎&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use site: to scrape specific sites&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklu7gwcxi35nkhixb6go.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fklu7gwcxi35nkhixb6go.png" alt="#1. Add site: after your keyword to narrow down the search to a specific website. For example, visualize site:blog.apify.com" width="800" height="555"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0xud7uce1mza313y3fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0xud7uce1mza313y3fg.png" alt="#1. Add site: after your keyword to narrow down the search to a specific website. For example, visualize site:blog.apify.com" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#1&lt;/strong&gt;. Add &lt;code&gt;site:&lt;/code&gt; after your keyword to narrow down the search to a specific website. For example, &lt;code&gt;visualize site:&lt;/code&gt;&lt;a href="http://blog.apify.com" rel="noopener noreferrer"&gt;&lt;code&gt;blog.apify.com&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is probably the most well-known Google search shortcut to narrow down your search on a specific website without visiting it. The thing is, you can use this same trick not only to search but also to scrape content from that particular website. The syntax is very simple: &lt;code&gt;keyword + site:website.com&lt;/code&gt;. The screenshot above shows how you can apply it to our &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;Google Scraper&lt;/a&gt; 🔗.&lt;/p&gt;

&lt;p&gt;Our query will scrape all content from Google related to the word &lt;code&gt;visualize&lt;/code&gt; but only from our Blog, &lt;a href="https://blog.apify.com/" rel="noopener noreferrer"&gt;blog.apify.com&lt;/a&gt;. All other scraping results will be filtered out. If you need to scrape specific content from a particular site, this is the shortcut to go for.&lt;/p&gt;
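
&lt;p&gt;If you prefer to run this programmatically, here is a hedged sketch using the Apify JavaScript client. Treat the &lt;code&gt;queries&lt;/code&gt; input field name as an assumption for the example and check the Actor's input schema before relying on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// The same site: operator typed into Google is simply passed along
// as part of the query. The 'queries' field name is an assumption.
const run = await client.actor('apify/google-search-scraper').call({
    queries: 'visualize site:blog.apify.com',
});

// Each dataset item corresponds to one scraped results page.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;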

&lt;h3&gt;
  
  
  &lt;strong&gt;Quotation marks for exact scraping queries&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps73zhxh50ig2w6almru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fps73zhxh50ig2w6almru.png" alt="#2: surround your keyword or phrase with quotation marks to scrape accurate results" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favwlz5tz9kvihjsycp6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favwlz5tz9kvihjsycp6l.png" alt="#2. Surround your phrase or word with " width="800" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#2.&lt;/strong&gt; Surround your phrase or word with &lt;code&gt;" "&lt;/code&gt; quotation marks for exact scraping queries&lt;/p&gt;

&lt;p&gt;For a regular search, Google (and Google Scraper by extension) will get content containing the words of your query in any order. But you can use quotes to make your Google scraping query laser-accurate. No similar phrases, no swapping words around, no adjacent topics, just word-by-word accuracy.&lt;/p&gt;

&lt;p&gt;Let's see whether we can get away with this by choosing a specific, very long-tail keyword to scrape for: &lt;code&gt;"Headless browsers, infrastructure scaling, sophisticated blocking. Meet the full-stack platform that makes it all easy."&lt;/code&gt; This whole phrase can only be found on the Apify homepage. Will the Google SERP Scraper find it?&lt;/p&gt;

&lt;p&gt;It did! So, surrounding your scraping keyword with quotes will instruct the scraping tool to scrape Google with that specific phrase in mind. This tip can also piggyback on the previous one: you can include quotes to search for specific wording on any website. We'll come back to mixing up various tricks in tip #10.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Hyphen to exclude certain results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsqpat0psttilcwy7gw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsqpat0psttilcwy7gw2.png" alt="#3. Add - in front to exclude certain results beforehand" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqf0plrc2ngfk3sqcqrq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frqf0plrc2ngfk3sqcqrq.png" alt="#3. Add - in front to exclude certain results beforehand" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#3.&lt;/strong&gt; Add &lt;code&gt;-&lt;/code&gt; in front to exclude certain results beforehand&lt;/p&gt;

&lt;p&gt;This shortcut is useful for cases when you want to scrape data about one topic but filter out content about another. In other words, when you don't want a specific term to show up among your Google scraped results. For example, you want to scrape information about web scraping (going slightly meta there) but exclude any Python-related pages.&lt;/p&gt;

&lt;p&gt;You can set this up by using &lt;code&gt;-&lt;/code&gt; in front of unwanted keywords. In our example, the hyphen instructs the Google scraping tool to ignore any content that contains the word Python. And you won't find any Python-related pages among the results. The best part about this trick is that you can filter out information you don't want even before you start scraping.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Link: to scrape websites with backlinks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb62j503ysqxr4awfvqi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb62j503ysqxr4awfvqi5.png" alt="#4. Use link: to scrape websites containing backlinks of your choice" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx8zr6adwnltuco4r0v7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx8zr6adwnltuco4r0v7.png" alt="#4. Use link: to scrape websites containing backlinks of your choice" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#4.&lt;/strong&gt; Use &lt;code&gt;link:&lt;/code&gt; to scrape websites containing backlinks of your choice&lt;/p&gt;

&lt;p&gt;This Google scraping tip is no.1 for all SEO enthusiasts out there. Tracking backlinks is one of the most basic SEO practices because, as a rule, the more backlinks your page has, the better your Google ranking. Even better if those backlinks are "high quality," as in coming from domains with high domain authority. Essentially, the number of backlinks is a key indicator that your website's content is valuable (since it's trusted by websites that decide to share it).&lt;/p&gt;

&lt;p&gt;So the gist of this Google scraping trick is: instead of just scraping a page, we're going to scrape all pages that link to that specific page. Let's extract pages with a backlink to &lt;a href="http://apify.com" rel="noopener noreferrer"&gt;apify.com&lt;/a&gt;, a.k.a all pages that mention &lt;a href="http://apify.com" rel="noopener noreferrer"&gt;apify.com&lt;/a&gt; on their page. Phew, that was a mouthful, but with a simple &lt;code&gt;link:&lt;/code&gt; &lt;a href="http://apify.com" rel="noopener noreferrer"&gt;&lt;code&gt;apify.com&lt;/code&gt;&lt;/a&gt; we were able to catch them all.&lt;/p&gt;

&lt;p&gt;Keep in mind that the more targeted your query is (focusing on a specific URL, for example &lt;code&gt;link:&lt;/code&gt; &lt;a href="http://apify.com/product-matching-ai/faq" rel="noopener noreferrer"&gt;&lt;code&gt;apify.com/product-matching-ai/faq&lt;/code&gt;&lt;/a&gt;), the fewer results you'll get. This happens because most pages link to the main domain page rather than specific ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Related: to scrape similar websites or competition&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnrg1hqooypk7abair8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnlnrg1hqooypk7abair8.png" alt="#5: use related: to scrape similar websites or competition" width="800" height="443"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7af6fikc4yga6mew243.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7af6fikc4yga6mew243.png" alt="#5: use related: to scrape similar websites or competition" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#5:&lt;/strong&gt; use &lt;code&gt;related:&lt;/code&gt; to scrape similar websites or competition&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;related:&lt;/code&gt; trick is a scraping technique that could be a game-changer for market researchers. When you apply &lt;code&gt;related:&lt;/code&gt; to, let's say, &lt;a href="http://amazon.com" rel="noopener noreferrer"&gt;amazon.com&lt;/a&gt;, you won't scrape links to Amazon. Instead, what you'll get are links to online stores &lt;em&gt;similar&lt;/em&gt; to Amazon: think Walmart, Kohl's, and other retailers that sell goods online. The scraping results will depend on the domain you've chosen.&lt;/p&gt;

&lt;p&gt;By scraping with &lt;code&gt;related:&lt;/code&gt;, you can see which companies, organizations, or other entities are perceived as competition to the page you've indicated. So you can think of this Google scraping trick as a fast way to identify competitors in a given industry, or at least the ones that count in the digital space.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;OR to scrape Google using multiple keywords at once&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdnarwsgxohwtgipxr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kdnarwsgxohwtgipxr7.png" alt="#6. Use OR to scrape using multiple keywords at once" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwmvs848wabcmcsd7plq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwmvs848wabcmcsd7plq.png" alt="#6. Use OR to scrape using multiple keywords at once" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#6.&lt;/strong&gt; Use &lt;code&gt;OR&lt;/code&gt; to scrape using multiple keywords at once&lt;/p&gt;

&lt;p&gt;This Google scraping trick allows you to scrape for multiple queries at once. For instance, let's say we want to scrape pages featuring recipes for both mustard dressing and vinaigrette dressing. By placing a simple &lt;code&gt;OR&lt;/code&gt; between these phrases, we make sure that our search (and subsequent scraping query) includes pages containing either of these delicious terms, covering both dressings in a single run. To make this Google scraping trick even more precise, consider using quotation marks around your queries.&lt;/p&gt;
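
&lt;p&gt;If you're feeding many keyword variations to the scraper, it can help to build the &lt;code&gt;OR&lt;/code&gt; query programmatically. Here's a quick sketch in plain Python (the phrases are just examples) that wraps each phrase in quotation marks and joins them with &lt;code&gt;OR&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Wrap each phrase in quotation marks and join them with OR
phrases = ["mustard dressing recipe", "vinaigrette dressing recipe"]
query = " OR ".join(f'"{phrase}"' for phrase in phrases)

print(query)
# "mustard dressing recipe" OR "vinaigrette dressing recipe"
&lt;/code&gt;&lt;/pre&gt;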

&lt;h3&gt;
  
  
  &lt;strong&gt;Asterisk to scrape wildcard data from Google&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw93z2ck3nl4u85ooz0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzw93z2ck3nl4u85ooz0h.png" alt="#7. Use * asterisk to scrape wildcard data" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5sift5j6rzez73loslm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5sift5j6rzez73loslm.png" alt="#7. Use * asterisk to scrape wildcard data" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#7.&lt;/strong&gt; Use &lt;code&gt;*&lt;/code&gt; asterisk to scrape wildcard data&lt;/p&gt;

&lt;p&gt;The asterisk wildcard is another nifty trick for Google scraping. When you insert an &lt;code&gt;*&lt;/code&gt; into your scraping query, it acts as a flexible placeholder, which the Google scraper can later fill in. This tip is particularly handy when you don't have all the words at your fingertips. To best illustrate this, let's use an example with song lyrics. So, for our example, let's search for the lyrics of a famous Queen song by taking two random parts from verse three and placing an asterisk between them.&lt;/p&gt;

&lt;p&gt;As the scraping tool works its magic, Google treats the asterisk as a stand-in for any word or series of words bridging our phrases. More often than not, the result will include the exact lyrics of the song we're targeting. But this trick isn't limited to just song lyrics. Whether it's a specific social media post, an elusive item, a lengthy name, or an article title that's on the tip of your tongue, the asterisk wildcard can make your Google scraping just a little bit more flexible.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Filetype: to scrape files of specific format&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybhb2100fziivh0wuuqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybhb2100fziivh0wuuqd.png" alt="#8. Use keyword + filetype: to scrape files of specific format" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82cc9pjqem6liovyv10j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82cc9pjqem6liovyv10j.png" alt="#8. Use keyword + filetype: to scrape files of specific format" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#8.&lt;/strong&gt; Use keyword + &lt;code&gt;filetype:&lt;/code&gt; to scrape files of specific format&lt;/p&gt;

&lt;p&gt;&lt;code&gt;filetype:&lt;/code&gt; is as simple as it sounds. This Google scraping trick will get you files of a specific format from the open web. Just enter your keyword plus &lt;code&gt;filetype:&lt;/code&gt; followed by a file extension: PDF, DOCX, or HTML. So, for example, the scraping query &lt;code&gt;harry potter filetype:pdf&lt;/code&gt; will get you a collection of Harry Potter-related PDFs. But the scope of this scraping trick isn't confined to these formats alone. You can scrape Google for any type of file it indexes, including PowerPoint presentations (PPT), LaTeX documents (TEX), and even Google Earth maps (KML).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scrape results before, after, and between periods of time using BEFORE, AFTER and ..&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8eknc0y7b2ant7xsvww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy8eknc0y7b2ant7xsvww.png" alt="#9: scrape results before, after, and between periods of time using BEFORE, AFTER and . ." width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap2x20kvytegr8ej70jb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fap2x20kvytegr8ej70jb.png" alt="#9. Scrape results before, after, and between periods of time using BEFORE, AFTER and . ." width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#9.&lt;/strong&gt; Scrape results before, after, and between periods of time using &lt;code&gt;BEFORE&lt;/code&gt;, &lt;code&gt;AFTER&lt;/code&gt; and &lt;code&gt;..&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This trick lets you scrape Google results from a specific period of time. Enter your keyword followed by the desired time frame: before, after, or within a specific range. For example, if we're aiming to scrape Google Maps scraping tutorials published after 2022, our query would be: &lt;code&gt;google maps scraping tutorial AFTER:2022&lt;/code&gt;. After applying this, our Google scraping results will exclusively feature tutorials from 2022 onwards, sparing us the effort of sifting through older, irrelevant data. To scrape within a range instead, put two dots between the years, e.g. &lt;code&gt;google maps scraping tutorial 2022..2023&lt;/code&gt;. A little caveat, though: you can't scrape anything earlier than the dawn of the internet.&lt;/p&gt;
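
&lt;p&gt;If you want to cover several periods in one scraping session, you can generate the date-bounded queries programmatically and feed them to the scraper. Here's a quick sketch in plain Python (the base query and years are just examples):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Build one query per year so each run targets a single time window
base_query = "google maps scraping tutorial"

queries = [
    f"{base_query} AFTER:{year} BEFORE:{year + 1}"
    for year in range(2020, 2024)
]

for query in queries:
    print(query)
# google maps scraping tutorial AFTER:2020 BEFORE:2021
# google maps scraping tutorial AFTER:2021 BEFORE:2022
# ...
&lt;/code&gt;&lt;/pre&gt;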

&lt;h3&gt;
  
  
  &lt;strong&gt;Mix them up!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumpk9qarwuh1m41mckuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumpk9qarwuh1m41mckuy.png" alt="#10. Challenge the Google Pages Scraper by mixing up " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1kobytl4ss393vek9px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1kobytl4ss393vek9px.png" alt="#10. Challenge the Google Pages Scraper by mixing up " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4tcnk72gw0lxe9dg0ir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4tcnk72gw0lxe9dg0ir.png" alt="#10. Challenge the Google Pages Scraper by mixing up " width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;#10.&lt;/strong&gt; Challenge the Google Pages Scraper by mixing up &lt;code&gt;" "&lt;/code&gt;, &lt;code&gt;*&lt;/code&gt;, and &lt;code&gt;site:&lt;/code&gt; search&lt;/p&gt;

&lt;p&gt;Last but not least, you can combine a lot of the scraping tricks you've just learned; our Google SERP Scraper loves a challenge. In our example, we're looking for a very specific article whose exact title we don't remember, and we're narrowing the search down to &lt;a href="http://blog.apify.com" rel="noopener noreferrer"&gt;blog.apify.com&lt;/a&gt;. So we're using quotation marks, an asterisk, and a site search. Let's see if the search engine and scraper can find and fetch that article for us. They did! So go ahead and try out all of the tricks, one by one or all at once.&lt;/p&gt;

&lt;p&gt;🤹 &lt;strong&gt;Know any other tricks? Try them out on&lt;/strong&gt; &lt;a href="https://apify.com/apify/google-search-scraper" rel="noopener noreferrer"&gt;&lt;strong&gt;Google Scraper&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcykd94a6dl02gx1hnm5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcykd94a6dl02gx1hnm5k.png" alt="Google scraping fueled by the platform" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google scraping fueled by the Apify platform&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Google scraping fueled by the Apify platform&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The best part of Google Scraper is that it lets you extract just about anything you need from Google search results. It can do that because Google SERP Scraper is more than just a standalone tool: it's supercharged by the versatility of the Apify platform.&lt;/p&gt;

&lt;p&gt;Because of the platform support, you're not limited to simply exporting scraped Google data in a range of formats or getting results for various Google domains. You also gain the convenience of &lt;a href="https://www.youtube.com/watch?v=ViYYDHSBAKM" rel="noopener noreferrer"&gt;accessing that data through an API&lt;/a&gt;, crafting &lt;a href="https://apify.com/integrations" rel="noopener noreferrer"&gt;custom integrations&lt;/a&gt; with other scrapers or your favorite apps, and &lt;a href="https://www.youtube.com/watch?v=GRFW_Loo2dk" rel="noopener noreferrer"&gt;scheduling&lt;/a&gt; and monitoring your scraping projects with ease.&lt;/p&gt;
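
&lt;p&gt;To give a feel for the API route, here's a minimal sketch using the &lt;code&gt;apify-client&lt;/code&gt; Python package to run Google Search Scraper with one of the operator queries from this article and read the results. The token is a placeholder, and the &lt;code&gt;queries&lt;/code&gt; input field is an assumption about the Actor's input, so check the input schema on the Actor's page before running it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from apify_client import ApifyClient

# Placeholder token - use your own Apify API token
client = ApifyClient("YOUR_APIFY_TOKEN")

# Run the Google Search Scraper Actor; "queries" is assumed to hold the search terms
run = client.actor("apify/google-search-scraper").call(
    run_input={"queries": "link:apify.com"}
)

# Read the scraped results from the run's default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
&lt;/code&gt;&lt;/pre&gt;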

&lt;p&gt;Last but not least, the Apify platform makes sure our 40K+ users can scrape Google pages with confidence, thanks to our specialized &lt;a href="https://apify.com/proxy#proxies-offered-by-apify" rel="noopener noreferrer"&gt;SERP proxies&lt;/a&gt; that are tailor-made for the job. All this to make data extraction from Google easy and reliable.&lt;/p&gt;

</description>
      <category>google</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>Edge AI vs. Cloud AI</title>
      <dc:creator>Apify</dc:creator>
      <pubDate>Wed, 22 Nov 2023 23:00:00 +0000</pubDate>
      <link>https://dev.to/apify/edge-ai-vs-cloud-ai-oak</link>
      <guid>https://dev.to/apify/edge-ai-vs-cloud-ai-oak</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, we're&lt;/strong&gt; &lt;a href="https://apify.com/" rel="noopener noreferrer"&gt;&lt;strong&gt;Apify&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;, a cloud platform that helps you build reliable web scrapers fast and automate anything you can do manually in a web browser. This article on Edge AI vs. Cloud AI was inspired by our work on&lt;/strong&gt; &lt;a href="https://apify.it/data-for-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;collecting data for AI and machine learning applications&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;The rise of Edge AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Five years ago, Gartner predicted that &lt;a href="https://www.gartner.com/smarterwithgartner/what-edge-computing-means-for-infrastructure-and-operations-leaders" rel="noopener noreferrer"&gt;75% of enterprise data would be created and processed outside the cloud&lt;/a&gt; by 2025. Whether that will prove entirely accurate remains to be seen. But what is clear is that Edge AI is rapidly growing in popularity.&lt;/p&gt;

&lt;p&gt;The rise of edge computing accelerated in the 2010s, as the explosion of IoT devices called for smarter, faster processing at the edge, that is, closer to the data source. This gave rise to Edge AI, where AI algorithms run locally on a hardware device.&lt;/p&gt;

&lt;p&gt;The growing interest in Edge AI has generated a myth that edge computing will replace cloud computing. But in reality, Edge and Cloud can work hand-in-hand by synchronizing a decentralized edge and a centralized cloud.&lt;/p&gt;

&lt;p&gt;The purpose of this article isn't to tell you which of the two - Edge or Cloud - is better but to highlight the pros and cons of each so you can know which is suitable for your AI tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Edge AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Sometimes referred to as AI at the edge, Edge AI is the implementation of artificial intelligence in an edge computing environment. In other words, Edge AI allows computation to be done close to where data is collected rather than at an offsite data center. Because the data is processed on the device itself, response times are swift.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pros and cons of Edge AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Edge AI advantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduced latency and bandwidth:&lt;/strong&gt; By processing data close to the edge, the need to transmit information over the network is reduced.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Swift response times:&lt;/strong&gt; Fully on-device processing delivers quick responses, eliminating the wait times caused by round trips to remote servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Privacy and security:&lt;/strong&gt; Edge AI offers better security for personal data than transmitting it across networks, where it can be vulnerable to cyberattacks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Edge AI disadvantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Limited computing power:&lt;/strong&gt; Edge devices often have less computing power than cloud servers, limiting the complexity of AI models they can run.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost and scalability challenges:&lt;/strong&gt; Scaling Edge AI solutions across numerous devices can be complex and expensive, given the cost of acquiring, maintaining, and operating the computing resources on each device.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Maintenance and upgrades:&lt;/strong&gt; Regular maintenance and updates of each edge device can be more challenging compared to centralized cloud updates.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Machine variations:&lt;/strong&gt; There's more variation in hardware among edge devices, which makes failures more common.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sm2ye6qbfxpl3jq8yz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sm2ye6qbfxpl3jq8yz.jpg" alt="Cloud AI is where data processing and AI model execution occur in cloud-based servers." width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Cloud AI?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Cloud AI refers to artificial intelligence systems where the data processing and AI model execution occur in cloud-based servers rather than on local devices.&lt;/p&gt;

&lt;p&gt;The foundation for Cloud AI was laid with the advent of cloud computing in the early noughties. The introduction of cloud AI services, such as Google's Cloud AI, AWS's SageMaker, and Microsoft's Azure AI, some ten years later, was a significant milestone. These platforms provided tools for machine learning, data analytics, and cognitive services (like &lt;a href="https://blog.apify.com/nlp-techniques/" rel="noopener noreferrer"&gt;natural language processing&lt;/a&gt; and &lt;a href="https://blog.apify.com/data-collection-for-computer-vision/" rel="noopener noreferrer"&gt;computer vision&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Because Cloud AI operates on data sent to remote servers, it's more scalable and flexible than Edge AI. That's the main thing that gives Cloud the edge (see what I did there?), but there are other advantages too:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Pros and cons of Cloud AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud AI advantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Big data handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI algorithms thrive on voluminous data for training and accuracy. Cloud storage is integral here, providing the capacity to store and process terabytes of data. This capability is essential for developing machine learning models that learn from vast, varied datasets to enhance their predictive accuracy and reliability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Parallel processing&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before cloud infrastructure, processing limitations were a significant bottleneck in AI development. Cloud computing introduced parallel processing nodes, which dramatically enhanced computing power. This means complex AI models can be computed much faster, accelerating the development and deployment of AI solutions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU acceleration&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advanced AI computations, especially those in machine learning and deep learning, require significant processing power. GPUs, known for their &lt;a href="https://blog.apify.com/concurrency-vs-parallelism/" rel="noopener noreferrer"&gt;parallel processing&lt;/a&gt; capabilities, are ideal for these tasks. Cloud AI utilizes GPU acceleration to handle intensive AI computations efficiently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Scalability and flexibility&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the most significant advantages of cloud storage in AI is scalability. Cloud-based AI systems can adapt to varying computational demands, scaling up or down as needed. This flexibility allows for efficient management of resources and costs, which is particularly vital for fluctuating AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Cloud AI disadvantages&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Latency issues:&lt;/strong&gt; Depending on internet connectivity, there can be latency in data processing, which may not be suitable for real-time applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security concerns:&lt;/strong&gt; Transmitting data to and from cloud servers can pose security risks, especially if sensitive data is involved. That being said, cloud providers offer strong security measures and compliance standards, so they can be a viable option if properly configured.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dependence on internet connectivity:&lt;/strong&gt; Cloud AI's effectiveness is contingent on reliable internet connectivity, which can be a limitation in remote or unstable network areas.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key takeaways: when to use Edge and when to use Cloud&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge computing&lt;/strong&gt; minimizes latency by processing data locally but has limitations in terms of computational resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud computing&lt;/strong&gt; provides powerful processing capabilities but introduces latency due to data transmission.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The choice&lt;/strong&gt; between Edge and Cloud depends on the latency tolerance of your application, the available network bandwidth, and the computational needs of your AI tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Edge AI&lt;/strong&gt; when real-time processing, data privacy, and reduced bandwidth usage are critical.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Cloud AI&lt;/strong&gt; for complex computations, large-scale data analysis, and applications where latency is less of a concern.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apify as a data cloud platform for AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;If data in the cloud is what you need, Apify is a cloud platform that helps you &lt;a href="https://apify.com/web-scraping" rel="noopener noreferrer"&gt;build reliable web scrapers&lt;/a&gt; for real-time data collection, and automate anything you can do manually in a web browser. This makes it an ideal platform for extracting web data at scale for AI and machine learning.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🧑🏻💻&lt;/strong&gt; &lt;a href="https://apify.com/data-for-generative-ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Web scraping for AI data&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Apify excels in extracting vast amounts of data from the web, which is crucial for training and fine-tuning AI models like ChatGPT and LLaMA. Its ability to crawl and extract relevant information from various sources makes it a go-to solution for feeding AI algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;🧩&lt;/strong&gt; &lt;a href="https://python.langchain.com/docs/integrations/tools/apify" rel="noopener noreferrer"&gt;&lt;strong&gt;Easy integration with AI tools&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Apify facilitates the integration of scraped data into AI platforms. It supports loading data into &lt;a href="https://blog.apify.com/what-is-a-vector-database/" rel="noopener noreferrer"&gt;vector databases&lt;/a&gt;, which can then be queried to &lt;a href="https://blog.apify.com/how-to-ai-chatbot-python/" rel="noopener noreferrer"&gt;enhance the capabilities of AI chatbots&lt;/a&gt; and other applications.&lt;/p&gt;
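
&lt;p&gt;As a rough sketch of what that integration can look like, here's the LangChain Apify wrapper turning an Actor run into documents ready for a vector database. It assumes the &lt;code&gt;APIFY_API_TOKEN&lt;/code&gt; environment variable is set and that the chosen Actor's dataset items expose &lt;code&gt;text&lt;/code&gt; and &lt;code&gt;url&lt;/code&gt; fields, so adjust the mapping to your Actor's actual output:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

# Reads the Apify API token from the APIFY_API_TOKEN environment variable
apify = ApifyWrapper()

# Run an Actor and map each dataset item to a LangChain Document
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("text") or "",
        metadata={"source": item.get("url")},
    ),
)

# These documents can now be embedded and stored in the vector database of your choice
docs = loader.load()
print(len(docs))
&lt;/code&gt;&lt;/pre&gt;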

&lt;h3&gt;
  
  
  &lt;strong&gt;📈&lt;/strong&gt; &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;&lt;strong&gt;Customizable and scalable&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Whether it's using &lt;a href="https://apify.com/store/categories/ai" rel="noopener noreferrer"&gt;pre-built scrapers&lt;/a&gt; or developing custom ones, Apify offers &lt;a href="https://apify.com/enterprise" rel="noopener noreferrer"&gt;tailored solutions&lt;/a&gt;. This flexibility is vital for AI applications that require specific, up-to-date data from diverse web sources.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;&lt;strong&gt;Practical applications&lt;/strong&gt;&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;In customer service, Apify's web scraping abilities are already &lt;a href="https://blog.apify.com/intercom-customer-support-ai-chatbot-web-scraping/" rel="noopener noreferrer"&gt;enhancing AI chatbots&lt;/a&gt;, enabling them to provide accurate and relevant responses based on real-time data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🦾 You might be interested in how you can add &lt;a href="https://blog.apify.com/add-custom-actions-to-your-gpts/" rel="noopener noreferrer"&gt;custom actions to your GPTs&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>webcrawling</category>
    </item>
  </channel>
</rss>
