Ibrahim Salami

Overcoming Web Scraping challenges with Firecrawl, an open-source AI tool

Web scraping is an art, and Firecrawl is your paintbrush. Scraping can be difficult because we’re constantly faced with blockers like JavaScript-heavy content, CAPTCHAs, and strict rate limits. Fortunately, Firecrawl is designed to address these common problems. This guide will take you through Firecrawl’s capabilities, showing you how to scrape, crawl, and extract data like a pro.

Getting Started with Firecrawl

Let’s begin with a quick setup. To scrape a single page and extract clean markdown data, with Firecrawl handling all the complexities in the background, use the /scrape endpoint.

Here’s a simple example using Python:

# pip install firecrawl-py
from firecrawl import FirecrawlApp

# Authenticate with your Firecrawl API key
app = FirecrawlApp(api_key="YOUR_API_KEY")

# Scrape a single page; Firecrawl handles rendering and cleanup behind the scenes
content = app.scrape_url("https://docs.firecrawl.dev")

# The SDK returns the scraped data directly, with the page as clean markdown
print(content["markdown"])

But Firecrawl isn’t just about scraping plain web pages. Let’s dive into some advanced options that make Firecrawl truly shine.

Advanced Scraping Options

Scraping PDFs

By default, the /scrape endpoint extracts text content from PDFs. If you want to skip PDF parsing, simply set pageOptions.parsePDF to false.
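
As a minimal sketch, here’s what that might look like through the Python SDK, assuming the v0 firecrawl-py client accepts these options as a params dict (the PDF URL below is just a placeholder):

# Hypothetical sketch: fetch a PDF without parsing it to text.
# Assumes the v0 firecrawl-py SDK, which takes a params dict as the
# second argument to scrape_url.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

content = app.scrape_url(
    "https://example.com/whitepaper.pdf",  # placeholder PDF URL
    {"pageOptions": {"parsePDF": False}},  # skip PDF text extraction
)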

Page Options: Fine-Tuning Your Scrape

Firecrawl gives you control over what and how you scrape. Here’s a breakdown of the key pageOptions parameters:

  • onlyMainContent: Scrape the main content of a page and ignore headers, footers, and sidebars.
  • includeHtml: Enable this to add an html key with the processed HTML of the page to the response.
  • includeRawHtml: Enable this to add a rawHtml key with the unprocessed HTML to the response.
  • screenshot: This option captures a screenshot of the top of the page.
  • waitFor: Sometimes pages take time to load. Use this to specify a wait time in milliseconds before scraping.

Example: Combining Page Options

Here’s how you might combine these options in a single request:

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml": true,
        "screenshot": true,
        "waitFor": 5000
      }
    }'

With this request, Firecrawl waits 5 seconds for the page to fully load, returns only the main content along with both processed and raw HTML, and captures a screenshot of the top of the page.
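
If you’d rather stay in Python, the same options should pass straight through the SDK. Here’s a minimal sketch, assuming the v0 firecrawl-py client’s params-dict interface:

# Sketch: the combined pageOptions passed through the Python SDK.
# The params-dict second argument is assumed from the v0 firecrawl-py client.
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

content = app.scrape_url(
    "https://docs.firecrawl.dev",
    {
        "pageOptions": {
            "onlyMainContent": True,
            "includeHtml": True,
            "includeRawHtml": True,
            "screenshot": True,
            "waitFor": 5000,
        }
    },
)

print(content["html"])  # processed HTML, present because includeHtml is enabled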

Extractor Options: Getting Structured Data

Beyond scraping, Firecrawl helps you extract structured data from any content using the extractorOptions parameter.

  • mode: Choose between llm-extraction (from cleaned data) and llm-extraction-from-raw-html (directly from raw HTML).
  • extractionPrompt: Describe what information you want to extract.
  • extractionSchema: Define the structure of the extracted data.

Example: Extracting Data with a Schema

curl -X POST https://api.firecrawl.dev/v0/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "extractorOptions": {
        "mode": "llm-extraction",
        "extractionPrompt": "Extract the company mission, SSO support, open-source status, and YC status.",
        "extractionSchema": {
          "type": "object",
          "properties": {
            "company_mission": { "type": "string" },
            "supports_sso": { "type": "boolean" },
            "is_open_source": { "type": "boolean" },
            "is_in_yc": { "type": "boolean" }
          },
          "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"]
        }
      }
    }'

This request not only scrapes the page but also extracts the specific fields defined in your schema: the company mission, SSO support, open-source status, and YC affiliation, returned as a structured object.
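
For completeness, here’s a sketch of the same extraction through the Python SDK, again assuming the v0 params-dict interface; the llm_extraction response key is also an assumption based on the v0 response shape:

# Sketch: LLM extraction via the Python SDK (v0 interface assumed).
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

result = app.scrape_url(
    "https://docs.firecrawl.dev/",
    {
        "extractorOptions": {
            "mode": "llm-extraction",
            "extractionPrompt": "Extract the company mission, SSO support, open-source status, and YC status.",
            "extractionSchema": {
                "type": "object",
                "properties": {
                    "company_mission": {"type": "string"},
                    "supports_sso": {"type": "boolean"},
                    "is_open_source": {"type": "boolean"},
                    "is_in_yc": {"type": "boolean"},
                },
                "required": ["company_mission", "supports_sso", "is_open_source", "is_in_yc"],
            },
        }
    },
)

# Key name assumed from the v0 response shape
print(result["llm_extraction"])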

Crawling Multiple Pages

Sometimes one page isn’t enough. That’s where the /crawl endpoint comes in; it allows you to scrape an entire site. You can specify a base URL, and Firecrawl will handle the rest, capturing all accessible subpages.

Example: Customizing Your Crawl

This setup shows how to customize your crawl with specific options:

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "excludes": ["/admin/*", "/login/*"],
        "returnOnlyUrls": false,
        "maxDepth": 2,
        "mode": "fast",
        "limit": 1000
      }
    }'

In this configuration, Firecrawl will:

  • Crawl pages matching the /blog/* and /products/* subpaths.
  • Skip pages matching /admin/* and /login/*.
  • Crawl up to two levels deep and up to 1000 pages in total.
  • Use the fast crawling mode for quicker results.
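
One practical note: in the v0 API, /crawl kicks off an asynchronous job and responds with a job ID rather than the scraped pages, so you poll a status endpoint until the job completes. Here’s a minimal sketch, assuming the v0 firecrawl-py client’s crawl_url and check_crawl_status helpers:

# Sketch: start a crawl, then poll until it finishes (v0 SDK assumed).
import time

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="YOUR_API_KEY")

# wait_until_done=False returns immediately with a job ID instead of blocking
job = app.crawl_url(
    "https://docs.firecrawl.dev",
    {"crawlerOptions": {"includes": ["/blog/*"], "maxDepth": 2, "limit": 100}},
    wait_until_done=False,
)

# Poll until the crawl completes, then read the scraped pages
while True:
    status = app.check_crawl_status(job["jobId"])
    if status["status"] == "completed":
        break
    time.sleep(5)

for page in status["data"]:
    print(page["metadata"]["sourceURL"])  # URL each document was scraped from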

Combining Page and Crawler Options

For more control, combine pageOptions with crawlerOptions in a single request:

curl -X POST https://api.firecrawl.dev/v0/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "pageOptions": {
        "onlyMainContent": true,
        "includeHtml": true,
        "includeRawHtml": true,
        "screenshot": true,
        "waitFor": 5000
      },
      "crawlerOptions": {
        "includes": ["/blog/*", "/products/*"],
        "maxDepth": 2,
        "mode": "fast"
      }
    }'

With this setup, Firecrawl will deliver precisely the data you need, exactly how you need it.

You can get started with $500 in free Firecrawl credits (no credit card required), or you can self-host the open-source version.
