Murrough Foley
How to Use rs-trafilatura with Firecrawl

Firecrawl is an API service for scraping web pages. It handles JavaScript rendering, anti-bot bypass, and rate limiting — you send it a URL, it gives you back the page content. By default, Firecrawl returns Markdown. But if you request the raw HTML, you can run rs-trafilatura on it for page-type-aware extraction with quality scoring.

This is useful when you need structured metadata (title, author, date, page type) or when you want to know how confident the extraction is.

Install

pip install rs-trafilatura firecrawl-py

You also need a Firecrawl API key from firecrawl.dev.

Basic Usage

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

# Request HTML format (required for rs-trafilatura)
result = app.scrape("https://example.com/blog/post", formats=["html"])

# Extract with rs-trafilatura
extracted = extract_firecrawl_result(result)

print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Date: {extracted.date}")
print(f"Page type: {extracted.page_type}")
print(f"Quality: {extracted.extraction_quality:.2f}")
print(f"Content: {extracted.main_content[:200]}")

The key is formats=["html"] — this tells Firecrawl to return the raw HTML alongside whatever else it produces. Without it, you only get Markdown and there's nothing for rs-trafilatura to extract from.
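Because forgetting that option is easy, it can help to fail loudly before calling the extractor. A minimal sketch of a guard helper (the function name and error message are my own, not part of either library) that handles both v4 Document-style results and legacy dict responses:

```python
def html_from_result(result):
    """Return the raw HTML from a Firecrawl scrape result, or raise.

    Handles both attribute-style result objects (v4 SDK) and
    legacy dict responses.
    """
    if isinstance(result, dict):
        html = result.get("html")
    else:
        html = getattr(result, "html", None)
    if not html:
        raise ValueError(
            "Scrape result has no HTML — request it with formats=['html']"
        )
    return html
```

Calling this before extraction turns a silent empty result into an immediate, explainable error.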

Why Not Just Use Firecrawl's Markdown?

Firecrawl's built-in Markdown output is good for articles. The difference shows on non-article pages:

  • Product pages: Firecrawl may include navigation, filters, and "related products" sections in its Markdown. rs-trafilatura recognises the page type and extracts just the product description, falling back to JSON-LD structured data when needed.
  • Forums: Firecrawl treats the entire page as content. rs-trafilatura identifies user posts and excludes voting controls, user profile panels, and moderation UI.
  • Service pages: Firecrawl may over-extract or under-extract multi-section layouts. rs-trafilatura's multi-candidate merge handles hero + features + testimonials + pricing sections.

The other advantage is the quality score. Firecrawl doesn't tell you how confident it is. rs-trafilatura's extraction_quality field gives you a 0.0–1.0 score so you can flag unreliable extractions.
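One way to act on that score is to bucket extractions into accept/review/reject decisions. This is a sketch of a hypothetical gating helper — the thresholds here are arbitrary assumptions you would tune for your own corpus, not values from rs-trafilatura:

```python
def classify_extraction(quality: float, review_floor: float = 0.5) -> str:
    """Bucket an extraction_quality score (assumed 0.0-1.0) into a decision.

    Thresholds are illustrative: >= 0.8 is trusted outright, anything
    between review_floor and 0.8 goes to manual review, below is rejected.
    """
    if quality >= 0.8:
        return "accept"
    if quality >= review_floor:
        return "review"
    return "reject"
```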

Getting Both Firecrawl Markdown and rs-trafilatura Extraction

You can request both and compare:

result = app.scrape("https://example.com", formats=["html", "markdown"])

# Firecrawl's own Markdown
firecrawl_markdown = result.markdown

# rs-trafilatura extraction
extracted = extract_firecrawl_result(result, output_markdown=True)
rs_markdown = extracted.content_markdown
rs_quality = extracted.extraction_quality

print(f"Firecrawl markdown: {len(firecrawl_markdown)} chars")
print(f"rs-trafilatura markdown: {len(rs_markdown)} chars")
print(f"Extraction quality: {rs_quality:.2f}")
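Once you have both outputs, a simple policy is to prefer the rs-trafilatura Markdown when its score clears a bar and fall back to Firecrawl's otherwise. A sketch, assuming a threshold you would pick yourself (0.6 below is arbitrary):

```python
def choose_output(firecrawl_md: str, rs_md: str,
                  rs_quality: float, min_quality: float = 0.6) -> str:
    """Pick which Markdown to keep.

    Uses rs-trafilatura's output when its confidence clears min_quality,
    otherwise falls back to Firecrawl's built-in Markdown.
    """
    return rs_md if rs_quality >= min_quality else firecrawl_md
```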

Batch Scraping

Firecrawl supports batch scraping. Combine it with rs-trafilatura for structured extraction at scale:

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

urls = [
    "https://example.com/products/widget",
    "https://example.com/docs/getting-started",
    "https://example.com/blog/announcement",
    "https://forum.example.com/thread/help",
]

batch = app.batch_scrape(urls, formats=["html"])

for doc in batch.data:
    extracted = extract_firecrawl_result(doc)
    print(f"[{extracted.page_type}] {extracted.title} (quality: {extracted.extraction_quality:.2f})")

Note: the batch API returns a result object with a .data attribute containing a list of Document objects. The extract_firecrawl_result adapter handles both Document objects (v4) and legacy dicts (v1).
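At scale you usually want to filter the batch down to records worth keeping. The helper below works on plain dicts for illustration — in practice you would build each record from an ExtractResult's page_type and extraction_quality fields; the field names and the 0.6 default are assumptions:

```python
def filter_reliable(records, min_quality=0.6, page_types=None):
    """Keep extraction records that clear a quality bar and, optionally,
    match an allowed set of page types.

    Each record is a dict with "page_type" and "quality" keys.
    """
    kept = []
    for rec in records:
        if rec["quality"] < min_quality:
            continue
        if page_types is not None and rec["page_type"] not in page_types:
            continue
        kept.append(rec)
    return kept
```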

Options

# Stricter filtering — less noise
extracted = extract_firecrawl_result(result, favor_precision=True)

# More inclusive — captures more content
extracted = extract_firecrawl_result(result, favor_recall=True)

# Get Markdown output
extracted = extract_firecrawl_result(result, output_markdown=True)

What You Get

extract_firecrawl_result returns an ExtractResult with:

  • title, author, date — structured metadata
  • main_content — clean extracted text
  • content_markdown — GFM Markdown (when enabled)
  • page_type — article, forum, product, collection, listing, documentation, service
  • extraction_quality — 0.0–1.0 confidence score
  • language, sitename, description — additional metadata
  • images — extracted image data with src, alt, caption
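The images field can be folded back into Markdown if you want them alongside the extracted text. A sketch, assuming each image record behaves like a dict with src, alt, and caption keys (the dict access pattern here is an assumption about the shape, not documented API):

```python
def images_to_markdown(images):
    """Render extracted image records as GFM image lines.

    Prefers the caption over the alt text when both are present;
    falls back to an empty alt string.
    """
    lines = []
    for img in images:
        text = img.get("caption") or img.get("alt") or ""
        lines.append(f"![{text}]({img['src']})")
    return "\n".join(lines)
```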
