Traditional scrapers break when a site changes its HTML structure. You wake up to a broken pipeline, hunt for the new CSS selector, push a fix, and repeat the cycle next month.
LLM-powered scrapers take a different approach: instead of brittle selectors, they use a language model to understand the page and extract what you need regardless of structure changes. Here's how to build one.
## When LLM Extraction Makes Sense
LLM extraction adds cost and latency. It's not always the right choice.
**Use LLM extraction when:**
- The target site changes layout frequently (news sites, job boards, e-commerce)
- The data you want is contextual (summaries, sentiment, classifications)
- You need to extract from unstructured text (articles, product descriptions, reviews)
- Selector-based scraping has failed multiple times on the same target
**Use traditional selectors when:**
- The structure is stable and machine-readable (APIs, structured data)
- You're scraping thousands of similar pages (cost adds up fast)
- Sub-second latency matters
The hybrid approach (selectors first, LLM fallback) is often optimal.
## The Basic Architecture
```python
import asyncio
import json
from typing import Optional

import httpx
from anthropic import AsyncAnthropic
from pydantic import BaseModel


class ProductData(BaseModel):
    title: str
    price: Optional[float] = None
    currency: str = "USD"
    availability: str
    description: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None


class LLMScraper:
    def __init__(self, model: str = "claude-3-haiku-20240307"):
        self.client = AsyncAnthropic()
        self.model = model
        self.http = httpx.AsyncClient(
            headers={"User-Agent": "Mozilla/5.0 (compatible)"},
            follow_redirects=True,
            timeout=20,
        )

    async def extract(self, url: str, schema: type[BaseModel],
                      context: str = "") -> BaseModel:
        """Fetch URL and extract data matching the schema."""
        response = await self.http.get(url)
        response.raise_for_status()

        # Clean HTML to reduce token usage
        cleaned_html = self._clean_html(response.text)

        # Build prompt
        prompt = f"""Extract the following data from this webpage HTML.

Target data: {schema.model_json_schema()}
{f'Additional context: {context}' if context else ''}

Return ONLY valid JSON matching the schema. No explanation.

HTML:
{cleaned_html[:8000]}"""  # Limit to 8K chars to control cost

        message = await self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )

        # Parse response, stripping a Markdown code fence if present
        json_str = message.content[0].text.strip()
        if json_str.startswith("```"):
            json_str = json_str.split("```")[1]
            if json_str.startswith("json"):
                json_str = json_str[4:]

        data = json.loads(json_str)
        return schema(**data)

    def _clean_html(self, html: str) -> str:
        """Remove scripts, styles, and boilerplate to reduce tokens."""
        from selectolax.parser import HTMLParser

        tree = HTMLParser(html)

        # Remove noisy elements
        for tag in tree.css("script, style, nav, footer, header, aside"):
            tag.decompose()

        # Prefer the main content area if one exists
        main = tree.css_first("main, article, .product-detail, .content, #content")
        if main:
            return main.text(separator="\n", strip=True)
        return tree.body.text(separator="\n", strip=True) if tree.body else html[:5000]
```
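The fence-stripping inside `extract` is deliberately minimal. If you prefer it factored out, a slightly more defensive standalone helper might look like this (a sketch, not part of the class above; the `FENCE` constant just avoids embedding literal backtick fences in the snippet):

```python
FENCE = "`" * 3  # the Markdown code-fence marker

def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown fence (with optional 'json' tag) from an LLM reply."""
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split(FENCE, 2)[1]  # keep the content between the fences
        if text.startswith("json"):
            text = text[4:]
    return text.strip()
```

Requesting "ONLY valid JSON" in the prompt reduces how often this path is needed, but models still wrap output in fences occasionally, so it is cheap insurance before `json.loads`.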
## Hybrid Selector + LLM Fallback
The most cost-effective pattern: try CSS selectors first, fall back to LLM when they fail.
```python
class HybridScraper:
    def __init__(self):
        self.llm = LLMScraper(model="claude-3-haiku-20240307")
        self.selector_cache: dict = {}  # Cache working selectors per domain

    async def extract_price(self, url: str) -> Optional[float]:
        """Try known selectors, fall back to LLM."""
        response = await self.llm.http.get(url)
        html = response.text

        from selectolax.parser import HTMLParser
        tree = HTMLParser(html)

        # Try the cached selector for this domain first
        domain = url.split("/")[2]
        if domain in self.selector_cache:
            node = tree.css_first(self.selector_cache[domain])
            if node:
                try:
                    price_text = node.text(strip=True)
                    return float(price_text.replace("$", "").replace(",", "").strip())
                except ValueError:
                    pass

        # Try common selectors
        price_selectors = [
            "[class*='price'] [class*='amount']",
            ".price-box .price",
            "[data-product-price]",
            ".product-price",
            "span[itemprop='price']",
        ]
        for selector in price_selectors:
            node = tree.css_first(selector)
            if node:
                try:
                    price_text = node.text(strip=True)
                    price = float(price_text.replace("$", "").replace(",", "").strip())
                    # Cache the working selector for next time
                    self.selector_cache[domain] = selector
                    return price
                except ValueError:
                    continue

        # Fall back to LLM
        print(f"Selectors failed for {domain}, using LLM fallback")
        result = await self.llm.extract(url, ProductData)
        return result.price  # may be None
```
## Using Claude Haiku for Cost-Effective Extraction
Model choice matters enormously for cost:
| Model | Cost per 1M input tokens | Cost per page (~2K tokens) |
|---|---|---|
| Claude 3 Opus | $15 | ~$0.030 |
| Claude 3.5 Sonnet | $3 | ~$0.006 |
| Claude 3 Haiku | $0.25 | ~$0.0005 |
| GPT-4o | $5 | ~$0.010 |
| GPT-4o-mini | $0.15 | ~$0.0003 |
For most extraction tasks, Haiku or GPT-4o-mini are sufficient. Reserve Sonnet for complex, multi-step reasoning tasks.
```python
# Cost-optimized extraction with Haiku
scraper = LLMScraper(model="claude-3-haiku-20240307")

# For complex extraction requiring reasoning, use Sonnet
complex_scraper = LLMScraper(model="claude-3-5-sonnet-20241022")
```
At $0.0005 per page with Haiku, extracting 10,000 pages costs $5. At Sonnet pricing, the same job costs $60.
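Those figures are simple arithmetic over input tokens; a throwaway helper (the function name is mine) makes it easy to re-run the estimate for your own volumes and token counts:

```python
def estimate_cost(pages: int, tokens_per_page: int, usd_per_mtok: float) -> float:
    """Approximate input-token cost in USD for a scraping batch."""
    return pages * tokens_per_page * usd_per_mtok / 1_000_000

print(estimate_cost(10_000, 2_000, 0.25))  # Haiku:  5.0
print(estimate_cost(10_000, 2_000, 3.00))  # Sonnet: 60.0
```

Note this ignores output tokens, which cost more per token but are a small fraction of the total for extraction tasks (the JSON response is tiny compared to the page HTML).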
## Structured Extraction With Pydantic
Define exactly what you want, and the LLM returns it or fails clearly:
```python
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator


class JobPosting(BaseModel):
    title: str = Field(description="Exact job title")
    company: str = Field(description="Company name")
    location: str = Field(description="City, State or Remote")
    salary_min: Optional[int] = Field(default=None, description="Minimum salary in USD/year")
    salary_max: Optional[int] = Field(default=None, description="Maximum salary in USD/year")
    experience_years: Optional[int] = Field(default=None, description="Required years of experience")
    skills: List[str] = Field(description="Required technical skills")
    remote_ok: bool = Field(description="Whether remote work is accepted")
    posted_date: Optional[str] = Field(default=None, description="Date posted in ISO format")

    @field_validator("skills", mode="before")
    @classmethod
    def normalize_skills(cls, v):
        # The LLM sometimes returns a comma-separated string instead of a list
        if isinstance(v, str):
            return [s.strip() for s in v.split(",")]
        return v


# Batch extraction with bounded concurrency
async def extract_jobs(urls: list[str]) -> list[JobPosting]:
    scraper = LLMScraper(model="claude-3-haiku-20240307")
    semaphore = asyncio.Semaphore(5)

    async def extract_one(url: str) -> Optional[JobPosting]:
        async with semaphore:
            try:
                return await scraper.extract(
                    url, JobPosting,
                    context="This is a job posting page. Extract all available fields.")
            except Exception as e:
                print(f"Failed {url}: {e}")
                return None

    results = await asyncio.gather(*[extract_one(url) for url in urls])
    return [r for r in results if r is not None]
```
## Handling Rate Limits and Retries
```python
from anthropic import APITimeoutError, RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)


@retry(
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
)
async def extract_with_retry(scraper, url, schema):
    return await scraper.extract(url, schema)
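To get a feel for what that `wait_exponential(multiplier=1, min=4, max=60)` schedule does between attempts, here is the shape of the delays in plain Python (an illustration of capped exponential backoff, not tenacity's exact internals):

```python
def backoff_delays(retries: int, multiplier: float = 1.0,
                   min_s: float = 4.0, max_s: float = 60.0) -> list[float]:
    """Capped exponential backoff: multiplier * 2**n, clamped to [min_s, max_s]."""
    return [min(max_s, max(min_s, multiplier * 2 ** n)) for n in range(retries)]

print(backoff_delays(6))  # [4.0, 4.0, 4.0, 8.0, 16.0, 32.0]
```

The floor matters: Anthropic rate-limit responses usually want you to wait a few seconds, so retrying after 1–2 seconds just burns attempts.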
## Real-World Example: Monitoring 50 Competitor Product Pages
```python
import asyncio
import os
from datetime import datetime

import asyncpg

DATABASE_URL = os.environ["DATABASE_URL"]


async def monitor_competitor_prices():
    scraper = HybridScraper()

    # Load competitors from DB
    conn = await asyncpg.connect(DATABASE_URL)
    competitors = await conn.fetch("SELECT url, sku FROM tracked_products")

    results = []
    for comp in competitors:
        try:
            price = await scraper.extract_price(comp['url'])
            if price:
                results.append({
                    'sku': comp['sku'],
                    'price': price,
                    'url': comp['url'],
                    'scraped_at': datetime.utcnow(),
                })
        except Exception as e:
            print(f"Error for {comp['url']}: {e}")

    # Bulk insert
    await conn.executemany("""
        INSERT INTO price_observations (sku, price, url, scraped_at)
        VALUES ($1, $2, $3, $4)
    """, [(r['sku'], r['price'], r['url'], r['scraped_at']) for r in results])

    print(f"Extracted {len(results)}/{len(competitors)} prices")
    await conn.close()


asyncio.run(monitor_competitor_prices())
```
## Cost Estimate for Common Use Cases
| Use case | Pages/month | LLM model | Monthly cost |
|---|---|---|---|
| Price monitoring (50 products, daily) | 1,500 | Haiku | $0.75 |
| Job board monitoring (500 postings/week) | 2,000 | Haiku | $1.00 |
| News sentiment analysis (100 articles/day) | 3,000 | Haiku | $1.50 |
| Complex B2B lead enrichment (100/day) | 3,000 | Sonnet | $18.00 |
LLM extraction is economically competitive with most SaaS data tools once you factor in the eliminated maintenance cost of selector-based scrapers.
## Pre-Built Scraping Infrastructure
If you want production-ready scrapers without building the infrastructure from scratch, I maintain 35 Apify actors covering the most common use cases — each with structured output that integrates cleanly with LLM post-processing pipelines.
Apify Scrapers Bundle — €29 — one-time download, all 35 actors with workflow guides.