Murrough Foley
How to Use rs-trafilatura with Firecrawl

Firecrawl is an API service for scraping web pages. It handles JavaScript rendering, anti-bot bypass, and rate limiting — you send it a URL, it gives you back the page content. By default, Firecrawl returns Markdown. But if you request the raw HTML, you can run rs-trafilatura on it for page-type-aware extraction with quality scoring.

This is useful when you need structured metadata (title, author, date, page type) or when you want to know how confident the extraction is.

Install

pip install rs-trafilatura firecrawl-py

You also need a Firecrawl API key from firecrawl.dev.

Basic Usage

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

# Request HTML format (required for rs-trafilatura)
result = app.scrape("https://example.com/blog/post", formats=["html"])

# Extract with rs-trafilatura
extracted = extract_firecrawl_result(result)

print(f"Title: {extracted.title}")
print(f"Author: {extracted.author}")
print(f"Date: {extracted.date}")
print(f"Page type: {extracted.page_type}")
print(f"Quality: {extracted.extraction_quality:.2f}")
print(f"Content: {extracted.main_content[:200]}")

The key is formats=["html"] — this tells Firecrawl to return the raw HTML alongside whatever else it produces. Without it, you only get Markdown and there's nothing for rs-trafilatura to extract from.
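Because forgetting that option is easy, it can help to fail loudly before calling the extractor. A minimal sketch of a guard helper (the function name and error message are my own, not part of either library) that handles both v4 Document-style results and legacy dict responses:

```python
def html_from_result(result):
    """Return the raw HTML from a Firecrawl scrape result, or raise.

    Handles both attribute-style result objects (v4 SDK) and
    legacy dict responses.
    """
    if isinstance(result, dict):
        html = result.get("html")
    else:
        html = getattr(result, "html", None)
    if not html:
        raise ValueError(
            "Scrape result has no HTML — request it with formats=['html']"
        )
    return html
```

Calling this before extraction turns a silent empty result into an immediate, explainable error.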

Why Not Just Use Firecrawl's Markdown?

Firecrawl's built-in Markdown output is good for articles. The difference shows on non-article pages:

  • Product pages: Firecrawl may include navigation, filters, and "related products" sections in its Markdown. rs-trafilatura recognises the page type and extracts just the product description, falling back to JSON-LD structured data when needed.
  • Forums: Firecrawl treats the entire page as content. rs-trafilatura identifies user posts and excludes voting controls, user profile panels, and moderation UI.
  • Service pages: Firecrawl may over-extract or under-extract multi-section layouts. rs-trafilatura's multi-candidate merge handles hero + features + testimonials + pricing sections.

The other advantage is the quality score. Firecrawl doesn't tell you how confident it is. rs-trafilatura's extraction_quality field gives you a 0.0–1.0 score so you can flag unreliable extractions.
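One way to act on that score is to bucket extractions into accept/review/reject decisions. This is a sketch of a hypothetical gating helper — the thresholds here are arbitrary assumptions you would tune for your own corpus, not values from rs-trafilatura:

```python
def classify_extraction(quality: float, review_floor: float = 0.5) -> str:
    """Bucket an extraction_quality score (assumed 0.0-1.0) into a decision.

    Thresholds are illustrative: >= 0.8 is trusted outright, anything
    between review_floor and 0.8 goes to manual review, below is rejected.
    """
    if quality >= 0.8:
        return "accept"
    if quality >= review_floor:
        return "review"
    return "reject"
```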

Getting Both Firecrawl Markdown and rs-trafilatura Extraction

You can request both and compare:

result = app.scrape("https://example.com", formats=["html", "markdown"])

# Firecrawl's own Markdown
firecrawl_markdown = result.markdown

# rs-trafilatura extraction
extracted = extract_firecrawl_result(result, output_markdown=True)
rs_markdown = extracted.content_markdown
rs_quality = extracted.extraction_quality

print(f"Firecrawl markdown: {len(firecrawl_markdown)} chars")
print(f"rs-trafilatura markdown: {len(rs_markdown)} chars")
print(f"Extraction quality: {rs_quality:.2f}")
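Once you have both outputs, a simple policy is to prefer the rs-trafilatura Markdown when its score clears a bar and fall back to Firecrawl's otherwise. A sketch, assuming a threshold you would pick yourself (0.6 below is arbitrary):

```python
def choose_output(firecrawl_md: str, rs_md: str,
                  rs_quality: float, min_quality: float = 0.6) -> str:
    """Pick which Markdown to keep.

    Uses rs-trafilatura's output when its confidence clears min_quality,
    otherwise falls back to Firecrawl's built-in Markdown.
    """
    return rs_md if rs_quality >= min_quality else firecrawl_md
```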

Batch Scraping

Firecrawl supports batch scraping. Combine it with rs-trafilatura for structured extraction at scale:

from firecrawl import FirecrawlApp
from rs_trafilatura.firecrawl import extract_firecrawl_result

app = FirecrawlApp(api_key="fc-your-api-key")

urls = [
    "https://example.com/products/widget",
    "https://example.com/docs/getting-started",
    "https://example.com/blog/announcement",
    "https://forum.example.com/thread/help",
]

batch = app.batch_scrape(urls, formats=["html"])

for doc in batch.data:
    extracted = extract_firecrawl_result(doc)
    print(f"[{extracted.page_type}] {extracted.title} (quality: {extracted.extraction_quality:.2f})")

Note: the batch API returns a result object with a .data attribute containing a list of Document objects. The extract_firecrawl_result adapter handles both Document objects (v4) and legacy dicts (v1).
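At scale you usually want to filter the batch down to records worth keeping. The helper below works on plain dicts for illustration — in practice you would build each record from an ExtractResult's page_type and extraction_quality fields; the field names and the 0.6 default are assumptions:

```python
def filter_reliable(records, min_quality=0.6, page_types=None):
    """Keep extraction records that clear a quality bar and, optionally,
    match an allowed set of page types.

    Each record is a dict with "page_type" and "quality" keys.
    """
    kept = []
    for rec in records:
        if rec["quality"] < min_quality:
            continue
        if page_types is not None and rec["page_type"] not in page_types:
            continue
        kept.append(rec)
    return kept
```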

Options

# Stricter filtering — less noise
extracted = extract_firecrawl_result(result, favor_precision=True)

# More inclusive — captures more content
extracted = extract_firecrawl_result(result, favor_recall=True)

# Get Markdown output
extracted = extract_firecrawl_result(result, output_markdown=True)

What You Get

extract_firecrawl_result returns an ExtractResult with:

  • title, author, date — structured metadata
  • main_content — clean extracted text
  • content_markdown — GFM Markdown (when enabled)
  • page_type — article, forum, product, collection, listing, documentation, service
  • extraction_quality — 0.0–1.0 confidence score
  • language, sitename, description — additional metadata
  • images — extracted image data with src, alt, caption
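The images field can be folded back into Markdown if you want them alongside the extracted text. A sketch, assuming each image record behaves like a dict with src, alt, and caption keys (the dict access pattern here is an assumption about the shape, not documented API):

```python
def images_to_markdown(images):
    """Render extracted image records as GFM image lines.

    Prefers the caption over the alt text when both are present;
    falls back to an empty alt string.
    """
    lines = []
    for img in images:
        text = img.get("caption") or img.get("alt") or ""
        lines.append(f"![{text}]({img['src']})")
    return "\n".join(lines)
```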
