DEV Community

zhongqiyue

Why CSS selectors failed me: Using LLMs to scrape inconsistent web pages

A few months ago, I needed to scrape about 5,000 product pages from a moderately popular e-commerce site. I'd done this before, no big deal – right? Wrong.

Every single page had a different HTML structure. One page had the price in a <span class="price">, another buried it inside a <div> with no class, and a third used a <p> with inline styles. The product titles sometimes had an h1, sometimes an h2, and occasionally were just in a meta tag. My beautiful CSS selector soup broke on the tenth page.

I've been writing web scrapers for years, and this was a new level of chaos. I tried XPath, regex, even a few heuristic-based parsers. Nothing worked reliably across even 80% of the pages. I was sinking hours into maintaining fragile selectors, and I knew there had to be a better way.

What I tried that didn't work

First, I went full classic: use BeautifulSoup with a list of potential selectors. I wrote a score-based system that tried multiple patterns and picked the most common result. It worked for a while, but as soon as the site A/B tested a new layout, my scores went haywire. False positives everywhere.

Then I tried using a headless browser (Playwright) to wait for JavaScript rendering, then snapshot the DOM. That helped with dynamic content, but the structure was still a nightmare. I ended up with a 500-line file of try/except fallbacks. It was brittle, ugly, and I knew it would break again next week.

I even toyed with building a simple ML classifier to detect price spans – but that required labeled training data. For every new site, I'd need to annotate thousands of pages. Not scalable.

What eventually worked: Let an LLM do the parsing

The breakthrough came when I stopped trying to fight the DOM and instead gave up on structure entirely. What if I just rendered the page, extracted all visible text, and asked an AI to find the fields I cared about?

That sounds crazy, right? But modern LLMs are surprisingly good at extracting structured data from messy text if you give them clear instructions. Here's the basic flow:

  1. Use Playwright to fetch the page and wait for rendering
  2. Extract all visible text as a flat string (no HTML tags)
  3. Craft a prompt that asks the LLM to output specific fields as JSON
  4. Validate and store the result

No more fragile selectors. No more HTML-specific logic. Just: "Here's a page's text. Give me the title, price, and description in JSON."
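Step 4 of that flow ("validate and store") might look something like the following. This is a minimal sketch, not the validation from the original scraper: the function name `validate_product`, the extra `price_value` key, and the specific checks are assumptions chosen for illustration.

```python
import re
from decimal import Decimal, InvalidOperation

EXPECTED_KEYS = ("title", "price", "description", "availability")

def validate_product(record):
    """Return (cleaned_record, problems) for one LLM extraction result."""
    if not isinstance(record, dict):
        return None, ["result is not a JSON object"]

    problems = []
    # Keep only the expected keys; keys the model left out become None.
    cleaned = {key: record.get(key) for key in EXPECTED_KEYS}

    title = cleaned["title"]
    if not isinstance(title, str) or not title.strip():
        problems.append("missing title")

    price = cleaned["price"]
    if price is None:
        problems.append("missing price")
    else:
        # Pull the numeric part out of strings such as "$1,299.00".
        match = re.search(r"\d+(?:,\d{3})*(?:\.\d+)?", str(price))
        try:
            cleaned["price_value"] = Decimal(match.group().replace(",", ""))
        except (AttributeError, InvalidOperation):
            # AttributeError: no digits at all, so match is None.
            problems.append(f"unparseable price: {price!r}")

    return cleaned, problems
```

Records with an empty `problems` list can be stored directly; the rest can be queued for a retry or a manual look.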

Code example (Python + OpenAI)

Here's a simplified version of the scraper I built. I'm using Playwright to render the page and OpenAI's GPT-4o-mini for extraction – it's cheap and fast enough for most use cases.

import json
from playwright.sync_api import sync_playwright
from openai import OpenAI

client = OpenAI(api_key="sk-...")

def extract_text_from_url(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        # Get all visible text, no HTML
        text = page.inner_text("body")
        browser.close()
        return text

def parse_product_with_llm(raw_text):
    # Limit context to avoid token bloat. The slice happens outside the
    # f-string so that this comment does not end up inside the prompt.
    page_text = raw_text[:8000]
    prompt = f"""
You are a data extraction assistant. Given the raw text of a product page,
extract the following fields and return them as a JSON object with keys:
title, price, description, availability.

If a field is not present, set its value to null.

Raw page text:
{page_text}

Output only JSON, no markdown.
"""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(response.choices[0].message.content)

# Usage
url = "https://example.com/product/42"
text = extract_text_from_url(url)
product = parse_product_with_llm(text)
print(product)
# {'title': 'Widget 3000', 'price': '$29.99', 'description': 'A shiny widget...', 'availability': 'In stock'}

I added a batch retry loop for error handling, but the core is just those two functions. The weird HTML variations stop mattering because the model never sees the markup, only the rendered text, and it reads that text the way a person would. The price "$29.99" can appear anywhere – the model finds it.
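The retry loop itself is not shown in the post. A generic version could be sketched like this, assuming exponential backoff with jitter; the helper name `call_with_retries` and its parameters are hypothetical, not the author's code.

```python
import random
import time

def call_with_retries(func, *args, max_attempts=4, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args); on a listed exception, back off and try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Waits of about 1s, 2s, 4s, ... plus jitter so that parallel
            # workers do not all retry at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))

# Usage (not executed here):
# product = call_with_retries(parse_product_with_llm, text)
```

In practice you would probably narrow `retry_on` to the errors worth retrying, such as malformed JSON or whatever rate-limit and timeout exceptions your client library raises, rather than retrying on every `Exception`.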

Lessons learned and trade-offs

This approach is not a silver bullet. Here's what I discovered after running it on 5,000 pages:

  • Cost: GPT-4o-mini costs about $0.15 per million input tokens. A typical page with ~2,000 tokens costs ~$0.0003 in input. For 5,000 pages, that's ~$1.50 (output tokens are billed separately, but the JSON responses are short). Very cheap. But if you use GPT-4, costs skyrocket.
  • Latency: Each API call takes 2–5 seconds. For 5,000 pages, that's roughly 3–7 hours of sequential calls (5,000 × 2–5 s ≈ 10,000–25,000 s). You can parallelize with asyncio, but watch out for rate limits.
  • Accuracy: I got about 97% correct extraction for common fields (title, price). But for rare fields (SKU, specs), accuracy dropped to ~85%. The model sometimes confuses similar-looking numbers or hallucinates when text is ambiguous.
  • Prompt engineering matters: A vague prompt gives garbage. I iterated many times to get reliable JSON output. Using response_format={"type": "json_object"} (OpenAI) helps enforce JSON output.
  • Token limits: Entire page text can be huge. I truncate to 8,000 characters (very roughly 2,000 tokens of English text). Sometimes important data is deeper in the page, so a fixed cut-off loses it. I had to tweak the truncation strategy.
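For the latency point, bounded concurrency is one way to parallelize without tripping rate limits. The sketch below assumes the extraction function is blocking (like the synchronous OpenAI call above) and runs it in worker threads behind a semaphore; `process_all`, `worker` and the default limit of 5 are illustrative names and values, not part of the original scraper.

```python
import asyncio

async def process_all(items, worker, max_concurrency=5):
    """Run the blocking worker(item) for every item, a few at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        async with semaphore:
            try:
                # worker blocks, so run it in a thread (Python 3.9+)
                # instead of stalling the event loop.
                return item, await asyncio.to_thread(worker, item)
            except Exception as exc:
                # Keep the batch going; failures are returned, not raised.
                return item, exc

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))

# Usage (not executed here), with page texts fetched beforehand:
# results = asyncio.run(process_all(texts, parse_product_with_llm))
```

Each result is an `(item, value_or_exception)` pair, so failed pages can be collected and retried afterwards. The right `max_concurrency` depends on your account's rate limits.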

I also explored using a specialized API service that wraps this same idea with optimized models. For example, if you don't want to manage your own prompt engineering or rate limiting, there are tools that do exactly this. One I looked at was ai.interwestinfo.com, which offers a similar extraction endpoint. But the technique is what matters – you can implement it yourself with any LLM provider.

What I'd do differently next time

First, I'd build a rule-based fast path upfront. Before sending to the LLM, I'd run a quick regex or rule-based check on known patterns (e.g., "$XX.XX" for prices). If the regex matches unambiguously, use that directly – skip the API call. That saves cost and latency for the easy cases.
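A quick regex check of that kind might be sketched as follows. It only covers the price field, and the dollar-sign format is an assumption; `quick_price` is a hypothetical helper. It deliberately gives up when a page shows more than one distinct price, since picking between a sale price, an old price and a shipping fee is exactly the judgment call the LLM is there for.

```python
import re

# Matches "$29.99", "$1299" or "$1,299.00". Currency and format are assumptions.
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

def quick_price(raw_text):
    """Return the price if the text holds exactly one distinct match, else None."""
    matches = set(PRICE_RE.findall(raw_text))
    # Zero matches or several different ones: ambiguous, leave it to the LLM.
    return matches.pop() if len(matches) == 1 else None
```

Other fields such as title and description would still need the model, so this saves a call only when price is all you are after or when the remaining fields have their own cheap checks.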

Second, I'd create a feedback loop. When a human corrects a bad extraction (e.g., wrong price), I'd store that page text and corrected output, then use it to fine-tune a small model. Over time, accuracy improves without prompt tweaking.
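The storage half of that feedback loop can be very small. Here is a sketch that appends each human correction to a JSON Lines file; the record layout (`input`, `model_output`, `expected_output`) is an assumption, and whichever fine-tuning tool you use later will probably need the records converted to its own format.

```python
import json
from pathlib import Path

def record_correction(path, page_text, llm_output, corrected_output):
    """Append one human-corrected example to a JSON Lines file."""
    example = {
        "input": page_text,
        "model_output": llm_output,
        "expected_output": corrected_output,
    }
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(example, ensure_ascii=False) + "\n")

def load_corrections(path):
    """Read the stored examples back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```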

Third, I'd add a fallback strategy. If the LLM returns null for a critical field, re-fetch the page and try with a larger context window, or fall back to a traditional XPath selector based on the most common structure found so far.
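The "larger context window" part of that fallback could be sketched like this. It assumes an `extract(text)` callable that takes already-truncated text and returns a dict, which means the `[:8000]` slice would have to move out of `parse_product_with_llm`; the function name, the character limits and the `needs_review` flag are all illustrative.

```python
def extract_with_fallback(raw_text, extract, critical=("title", "price"),
                          limits=(8000, 32000)):
    """Call extract() with growing character limits until critical fields are set."""
    result = {}
    missing = list(critical)
    for limit in limits:
        result = extract(raw_text[:limit])
        missing = [key for key in critical if result.get(key) is None]
        if not missing or len(raw_text) <= limit:
            break  # complete, or a larger window would add no new text
    # Still incomplete after the last attempt: flag for a selector-based
    # fallback or manual review.
    result["needs_review"] = bool(missing)
    return result
```

Most pages cost one call; only the ones with a missing critical field pay for the second, larger one.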

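Finally, I'd monitor the LLM's output for confidence. Some providers can return token log probabilities (logprobs) – from those you can compute a simple confidence score and flag low-confidence extractions for manual review. Token usage counts, by contrast, say nothing about how sure the model was.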
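One simple way to turn logprobs into a score is the geometric mean of the token probabilities. The sketch below works on a plain list of floats; with OpenAI's chat completions API you can request them with `logprobs=True` and read each token's `.logprob` from `response.choices[0].logprobs.content`. The 0.9 threshold is an arbitrary assumption to tune on your own data, and a high score only means the model was confident, not that it was right.

```python
import math

def confidence_from_logprobs(token_logprobs):
    """Geometric-mean token probability in (0, 1]; None if no logprobs given."""
    if not token_logprobs:
        return None
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def needs_review(token_logprobs, threshold=0.9):
    """Flag an extraction whose average token probability is below threshold."""
    score = confidence_from_logprobs(token_logprobs)
    return score is None or score < threshold
```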

When NOT to use this approach

  • If the site structure is stable and you can get away with simple CSS selectors – do that. LLMs are overkill.
  • If you need real-time scraping (sub-second responses) – LLM latency kills you. Use a trained small model or traditional parsing.
  • If the page text is gibberish or auto-generated – the model will hallucinate. Stick to human-readable pages.
  • If you're scraping sensitive data – sending raw text to an external API might violate privacy. Consider a local model (Llama, Mistral) instead.

Final thoughts

I started this project frustrated with DOM fragility, and ended up with a solution that felt almost too simple. The key insight was to stop treating web pages as structured documents and start treating them as natural language texts. LLMs are excellent at reading messy text and pulling out what you need.

Of course, this isn't a replacement for good old-fashioned scraping when the data is clean. But for those nightmare sites where every page is a snowflake, an LLM-based parser saved me weeks of maintenance. I'm curious how others handle this – do you still fight the DOM, or have you started throwing AI at it too?
