Traditional scrapers break when a site changes its HTML structure. You wake up to a broken pipeline, hunt for the new CSS selector, push a fix, and repeat the cycle next month.
LLM-powered scrapers take a different approach: instead of brittle selectors, they use a language model to understand the page and extract what you need regardless of structure changes. Here's how to build one.
## When LLM Extraction Makes Sense
LLM extraction adds cost and latency. It's not always the right choice.
**Use LLM extraction when:**
- The target site changes layout frequently (news sites, job boards, e-commerce)
- The data you want is contextual (summaries, sentiment, classifications)
- You need to extract from unstructured text (articles, product descriptions, reviews)
- Selector-based scraping has failed multiple times on the same target
**Use traditional selectors when:**
- The structure is stable and machine-readable (APIs, structured data)
- You're scraping thousands of similar pages (cost adds up fast)
- Sub-second latency matters
The hybrid approach (selectors first, LLM fallback) is often optimal.
## The Basic Architecture
```python
import asyncio
import json
from typing import Optional

import httpx
from anthropic import AsyncAnthropic
from pydantic import BaseModel


class ProductData(BaseModel):
    title: str
    price: Optional[float] = None
    currency: str = "USD"
    availability: str
    description: Optional[str] = None
    rating: Optional[float] = None
    review_count: Optional[int] = None


class LLMScraper:
    def __init__(self, model: str = "claude-3-haiku-20240307"):
        self.client = AsyncAnthropic()
        self.model = model
        self.http = httpx.AsyncClient(
            headers={"User-Agent": "Mozilla/5.0 (compatible)"},
            follow_redirects=True,
            timeout=20,
        )

    async def extract(self, url: str, schema: type[BaseModel],
                      context: str = "") -> BaseModel:
        """Fetch URL and extract data matching the schema."""
        response = await self.http.get(url)
        response.raise_for_status()

        # Clean HTML to reduce token usage
        cleaned_html = self._clean_html(response.text)

        # Build prompt
        prompt = f"""Extract the following data from this webpage HTML.

Target data: {schema.model_json_schema()}
{f'Additional context: {context}' if context else ''}

Return ONLY valid JSON matching the schema. No explanation.

HTML:
{cleaned_html[:8000]}"""  # Limit to 8K chars to control cost

        message = await self.client.messages.create(
            model=self.model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )

        # Parse response, stripping a Markdown code fence if present
        json_str = message.content[0].text.strip()
        if json_str.startswith("```"):
            json_str = json_str.split("```")[1]
            if json_str.startswith("json"):
                json_str = json_str[4:]

        data = json.loads(json_str)
        return schema(**data)

    def _clean_html(self, html: str) -> str:
        """Remove scripts, styles, and boilerplate to reduce tokens."""
        from selectolax.parser import HTMLParser

        tree = HTMLParser(html)

        # Remove noisy elements
        for tag in tree.css("script, style, nav, footer, header, aside"):
            tag.decompose()

        # Prefer the main content area if one exists
        main = tree.css_first("main, article, .product-detail, .content, #content")
        if main:
            return main.text(separator="\n", strip=True)
        return tree.body.text(separator="\n", strip=True) if tree.body else html[:5000]
```
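The fence-stripping inside `extract` is deliberately minimal. If you prefer it factored out, a slightly more defensive standalone helper might look like this (a sketch, not part of the class above; the `FENCE` constant just avoids embedding literal backtick fences in the snippet):

```python
FENCE = "`" * 3  # the Markdown code-fence marker

def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown fence (with optional 'json' tag) from an LLM reply."""
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split(FENCE, 2)[1]  # keep the content between the fences
        if text.startswith("json"):
            text = text[4:]
    return text.strip()
```

Requesting "ONLY valid JSON" in the prompt reduces how often this path is needed, but models still wrap output in fences occasionally, so it is cheap insurance before `json.loads`.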
## Hybrid Selector + LLM Fallback
The most cost-effective pattern: try CSS selectors first, fall back to LLM when they fail.
```python
class HybridScraper:
    def __init__(self):
        self.llm = LLMScraper(model="claude-3-haiku-20240307")
        self.selector_cache: dict = {}  # Cache working selectors per domain

    async def extract_price(self, url: str) -> Optional[float]:
        """Try known selectors, fall back to LLM."""
        response = await self.llm.http.get(url)
        html = response.text

        from selectolax.parser import HTMLParser
        tree = HTMLParser(html)

        # Try the cached selector for this domain first
        domain = url.split("/")[2]
        if domain in self.selector_cache:
            node = tree.css_first(self.selector_cache[domain])
            if node:
                try:
                    price_text = node.text(strip=True)
                    return float(price_text.replace("$", "").replace(",", "").strip())
                except ValueError:
                    pass

        # Try common selectors
        price_selectors = [
            "[class*='price'] [class*='amount']",
            ".price-box .price",
            "[data-product-price]",
            ".product-price",
            "span[itemprop='price']",
        ]
        for selector in price_selectors:
            node = tree.css_first(selector)
            if node:
                try:
                    price_text = node.text(strip=True)
                    price = float(price_text.replace("$", "").replace(",", "").strip())
                    # Cache the working selector for next time
                    self.selector_cache[domain] = selector
                    return price
                except ValueError:
                    continue

        # Fall back to LLM
        print(f"Selectors failed for {domain}, using LLM fallback")
        result = await self.llm.extract(url, ProductData)
        return result.price  # may be None
```
## Using Claude Haiku for Cost-Effective Extraction
Model choice matters enormously for cost:
| Model | Cost per 1M input tokens | Cost per page (~2K tokens) |
|---|---|---|
| Claude 3 Opus | $15 | ~$0.030 |
| Claude 3.5 Sonnet | $3 | ~$0.006 |
| Claude 3 Haiku | $0.25 | ~$0.0005 |
| GPT-4o | $5 | ~$0.010 |
| GPT-4o-mini | $0.15 | ~$0.0003 |
For most extraction tasks, Haiku or GPT-4o-mini are sufficient. Reserve Sonnet for complex, multi-step reasoning tasks.
```python
# Cost-optimized extraction with Haiku
scraper = LLMScraper(model="claude-3-haiku-20240307")

# For complex extraction requiring reasoning, use Sonnet
complex_scraper = LLMScraper(model="claude-3-5-sonnet-20241022")
```
At $0.0005 per page with Haiku, extracting 10,000 pages costs $5. At Sonnet pricing, the same job costs $60.
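Those figures are simple arithmetic over input tokens; a throwaway helper (the function name is mine) makes it easy to re-run the estimate for your own volumes and token counts:

```python
def estimate_cost(pages: int, tokens_per_page: int, usd_per_mtok: float) -> float:
    """Approximate input-token cost in USD for a scraping batch."""
    return pages * tokens_per_page * usd_per_mtok / 1_000_000

print(estimate_cost(10_000, 2_000, 0.25))  # Haiku:  5.0
print(estimate_cost(10_000, 2_000, 3.00))  # Sonnet: 60.0
```

Note this ignores output tokens, which cost more per token but are a small fraction of the total for extraction tasks (the JSON response is tiny compared to the page HTML).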
## Structured Extraction With Pydantic
Define exactly what you want, and the LLM returns it or fails clearly:
```python
from typing import List, Optional

from pydantic import BaseModel, Field, field_validator


class JobPosting(BaseModel):
    title: str = Field(description="Exact job title")
    company: str = Field(description="Company name")
    location: str = Field(description="City, State or Remote")
    salary_min: Optional[int] = Field(default=None, description="Minimum salary in USD/year")
    salary_max: Optional[int] = Field(default=None, description="Maximum salary in USD/year")
    experience_years: Optional[int] = Field(default=None, description="Required years of experience")
    skills: List[str] = Field(description="Required technical skills")
    remote_ok: bool = Field(description="Whether remote work is accepted")
    posted_date: Optional[str] = Field(default=None, description="Date posted in ISO format")

    @field_validator("skills", mode="before")
    @classmethod
    def normalize_skills(cls, v):
        # The LLM sometimes returns a comma-separated string instead of a list
        if isinstance(v, str):
            return [s.strip() for s in v.split(",")]
        return v


# Batch extraction with bounded concurrency
async def extract_jobs(urls: list[str]) -> list[JobPosting]:
    scraper = LLMScraper(model="claude-3-haiku-20240307")
    semaphore = asyncio.Semaphore(5)

    async def extract_one(url: str) -> Optional[JobPosting]:
        async with semaphore:
            try:
                return await scraper.extract(
                    url, JobPosting,
                    context="This is a job posting page. Extract all available fields.")
            except Exception as e:
                print(f"Failed {url}: {e}")
                return None

    results = await asyncio.gather(*[extract_one(url) for url in urls])
    return [r for r in results if r is not None]
```
## Handling Rate Limits and Retries
```python
from anthropic import APITimeoutError, RateLimitError
from tenacity import (retry, retry_if_exception_type,
                      stop_after_attempt, wait_exponential)


@retry(
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60),
)
async def extract_with_retry(scraper, url, schema):
    return await scraper.extract(url, schema)
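To get a feel for what that `wait_exponential(multiplier=1, min=4, max=60)` schedule does between attempts, here is the shape of the delays in plain Python (an illustration of capped exponential backoff, not tenacity's exact internals):

```python
def backoff_delays(retries: int, multiplier: float = 1.0,
                   min_s: float = 4.0, max_s: float = 60.0) -> list[float]:
    """Capped exponential backoff: multiplier * 2**n, clamped to [min_s, max_s]."""
    return [min(max_s, max(min_s, multiplier * 2 ** n)) for n in range(retries)]

print(backoff_delays(6))  # [4.0, 4.0, 4.0, 8.0, 16.0, 32.0]
```

The floor matters: Anthropic rate-limit responses usually want you to wait a few seconds, so retrying after 1–2 seconds just burns attempts.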
## Real-World Example: Monitoring 50 Competitor Product Pages
```python
import asyncio
import os
from datetime import datetime

import asyncpg

DATABASE_URL = os.environ["DATABASE_URL"]


async def monitor_competitor_prices():
    scraper = HybridScraper()

    # Load competitors from DB
    conn = await asyncpg.connect(DATABASE_URL)
    competitors = await conn.fetch("SELECT url, sku FROM tracked_products")

    results = []
    for comp in competitors:
        try:
            price = await scraper.extract_price(comp['url'])
            if price:
                results.append({
                    'sku': comp['sku'],
                    'price': price,
                    'url': comp['url'],
                    'scraped_at': datetime.utcnow(),
                })
        except Exception as e:
            print(f"Error for {comp['url']}: {e}")

    # Bulk insert
    await conn.executemany("""
        INSERT INTO price_observations (sku, price, url, scraped_at)
        VALUES ($1, $2, $3, $4)
    """, [(r['sku'], r['price'], r['url'], r['scraped_at']) for r in results])

    print(f"Extracted {len(results)}/{len(competitors)} prices")
    await conn.close()


asyncio.run(monitor_competitor_prices())
```
## Cost Estimate for Common Use Cases
| Use case | Pages/month | LLM model | Monthly cost |
|---|---|---|---|
| Price monitoring (50 products, daily) | 1,500 | Haiku | $0.75 |
| Job board monitoring (500 postings/week) | 2,000 | Haiku | $1.00 |
| News sentiment analysis (100 articles/day) | 3,000 | Haiku | $1.50 |
| Complex B2B lead enrichment (100/day) | 3,000 | Sonnet | $18.00 |
LLM extraction is economically competitive with most SaaS data tools once you factor in the eliminated maintenance cost of selector-based scrapers.
## Pre-Built Scraping Infrastructure
If you want production-ready scrapers without building the infrastructure from scratch, I maintain 35 Apify actors covering the most common use cases — each with structured output that integrates cleanly with LLM post-processing pipelines.
Apify Scrapers Bundle — €29 — one-time download, all 35 actors with workflow guides.