zhongqiyue

Posted on Jun 6

Why My CSS Selectors Kept Breaking (and How LLMs Fixed It)

#python #ai #webdev #api

Every developer who has scraped the web knows the pain of brittle parsers.

I was building a small side project to aggregate job listings from a handful of startup pages. Nothing fancy — just grab title, company, location, and description. The sites were all different, but they had one thing in common: they changed their markup every few weeks, and my carefully crafted CSS selectors would snap.

At first I thought I could outsmart them. Use more generic selectors? XPath? Regex? No. Each change meant hours of debugging. I needed a different approach.

The Breaking Point

Last month, one of the target sites rolled out a redesign. My scraper returned zero listings. The HTML was completely reorganized. I spent an afternoon updating selectors, only to realize the next site in my list was also due for a refresh. I was fighting entropy.

I knew about headless browsers and waiting for elements, but the problem wasn't dynamic content — it was structural volatility. The data was there, just in different shapes.

What Didn't Work

I tried several "smart" scraping libraries that claimed auto-detection. Most relied on heuristics: look for tables, look for lists, look for text patterns. They worked until they didn’t. One tool produced a nested mess when a site used divs instead of semantic tags.

I also experimented with training a simple ML model to identify job fields in HTML. Labeling data for each site was impractical. Overkill for a side project.

The Idea That Stuck

Then it hit me: I didn't need to understand the HTML structure — I just needed to tell a language model "here's the raw HTML, find the job listings and give me JSON." Large language models (LLMs) are surprisingly good at understanding the semantic content of text, even when wrapped in tags.

The approach is simple: instead of writing parsers, write a prompt that describes the output schema. Pass the HTML (or a cleaned version) as context. The LLM returns structured data.

Let me show you how I implemented it with Python and OpenAI's API (though any LLM works).

Code That Works

import json
import re

from bs4 import BeautifulSoup
from openai import OpenAI  # requires openai>=1.0

def extract_jobs(html: str, api_key: str) -> list[dict]:
    client = OpenAI(api_key=api_key)

    # Clean HTML: drop tags that carry no listing content
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'meta', 'link']):
        tag.decompose()
    # Crude size cap: this truncates by characters, not tokens
    clean_html = soup.prettify()[:8000]

    prompt = f"""
Extract all job listings from the following HTML.
Return a JSON array of objects with fields: title, company, location, description, url.
If a field is missing, use null.
Only return the JSON array, no explanation.

HTML:
{clean_html}
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=2000,
    )

    content = response.choices[0].message.content or ""
    # Grab the outermost [...] so markdown code fences around the JSON are ignored
    match = re.search(r'\[.*\]', content, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group())
    except json.JSONDecodeError:
        # Truncated or malformed output: treat it as "nothing extracted"
        return []

That's it. The same function works for any site. When the HTML changes, the LLM adapts because it understands the concept of a job listing.

Lessons Learned

Accuracy: It's not perfect. Sometimes the LLM hallucinates a field (e.g., invents a company name). I mitigate this by asking it to only extract visible text, and by post-processing: check that the URL matches the domain, etc.
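
Here is a rough sketch of what that kind of post-processing could look like. It is an illustration under assumptions, not the exact code from the project: the name `validate_jobs` is hypothetical, and so is the rule that a title or company must appear verbatim in the page text to be trusted.

```python
from urllib.parse import urljoin, urlparse


def validate_jobs(jobs: list[dict], page_url: str, page_text: str) -> list[dict]:
    """Keep only listings whose fields can be traced back to the page."""
    page_host = urlparse(page_url).netloc.lower()
    haystack = page_text.lower()
    valid = []
    for job in jobs:
        title = job.get("title")
        # A listing whose title never appears on the page is treated as invented.
        if not title or title.lower() not in haystack:
            continue
        cleaned = dict(job)
        company = cleaned.get("company")
        # Null out a company name that is not in the visible text.
        if company and company.lower() not in haystack:
            cleaned["company"] = None
        url = cleaned.get("url")
        if url:
            absolute = urljoin(page_url, url)  # resolve relative links
            host = urlparse(absolute).netloc.lower()
            # Drop links that point to a different domain than the page.
            cleaned["url"] = absolute if host == page_host else None
        valid.append(cleaned)
    return valid
```

The verbatim check is deliberately strict. It will reject a title the model has reworded, which is usually the safer failure for an aggregator, but sites that link out to an external applicant tracker would need a looser domain rule.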

Latency: Each call takes 2–5 seconds. For a one-time scrape of a few pages it's fine. For thousands of pages you'd batch the requests or run them concurrently.
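
For larger jobs, one way to hide the per-call latency is to run several calls at once. The sketch below uses a thread pool, which suits I/O-bound work like API requests. The helper name `extract_many` and the default of four workers are assumptions for illustration, and the extractor is passed in as a plain callable so the pattern does not depend on any particular LLM client.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def extract_many(
    pages: dict[str, str],
    extract: Callable[[str], list[dict]],
    max_workers: int = 4,
) -> dict[str, list[dict]]:
    """Run one extraction call per page in parallel threads.

    `pages` maps a page URL to its HTML; the result maps the same URL
    to whatever `extract` returned for that HTML.
    """
    urls = list(pages)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with urls.
        results = list(pool.map(extract, (pages[u] for u in urls)))
    return dict(zip(urls, results))
```

With the function from this post it could be called as `extract_many(pages, lambda html: extract_jobs(html, api_key))`. Keep `max_workers` small enough to stay within the API's rate limits.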

Cost: GPT-4o-mini is cheap (~$0.15 per million input tokens at the time of writing; output tokens cost more). A typical page of 3,000 tokens costs about $0.0005 in input, a fraction of a cent. Way cheaper than my time fixing selectors.
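
The arithmetic behind that estimate is short enough to write down. In this sketch the $0.15 input price is the figure quoted above, while the $0.60 output price is an assumption; check the provider's current pricing before relying on either number. `estimate_cost_usd` is a hypothetical helper name.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int = 0,
    input_price_per_million: float = 0.15,   # figure quoted in this post
    output_price_per_million: float = 0.60,  # assumed; verify current pricing
) -> float:
    """Estimate the cost of one call from token counts and per-million prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# One 3,000-token page, input side only: roughly $0.00045.
per_page = estimate_cost_usd(3000)
# A thousand such pages: roughly $0.45.
per_thousand_pages = per_page * 1000
```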

Token limits: HTML can be huge. I truncate or summarize the page first. Sometimes I use a two-step approach: first extract the main content area, then send that to the LLM.
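
The first step of that two-step approach can be done without a model at all. The sketch below assumes the listings live inside a `<main>` or `<article>` element, which is not true of every site; `main_content_html` is a hypothetical helper name and the 8,000-character cap mirrors the one used in the code earlier in this post.

```python
from bs4 import BeautifulSoup


def main_content_html(html: str, max_chars: int = 8000) -> str:
    """Return the most likely content container, trimmed to max_chars."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely contain listing data.
    for tag in soup(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    # Prefer a semantic container; fall back to <body>, then the whole document.
    container = soup.find("main") or soup.find("article") or soup.body or soup
    # The cap counts characters, not tokens, so it is only a rough limit.
    return str(container)[:max_chars]
```

Removing every `<footer>` is a blunt rule: some sites put an apply link in a footer inside each job card, so it is worth checking the output on a sample page before trusting it.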

When NOT to Use This

This approach shines for complex, changing pages. But if you're scraping 10,000 identical pages (say, product pages on Amazon), a classic parser is faster and cheaper. For one-off extractions from diverse sources, LLMs are a godsend.

Also, be mindful of terms of service. Some sites explicitly forbid scraping, even with AI. And don't hit the same server too fast.

Alternatives I Considered

Using a specialized service: There are tools like https://ai.interwestinfo.com/ that wrap this concept into an API — you send a URL and a schema, get back structured data. Handy if you don't want to manage API keys or prompt engineering.
Local LLMs: I tested Ollama with Llama 3. It works, but it is slower and less accurate for complex extraction. Fine if you need privacy.
Traditional scraping + LLM validation: Hybrid approach — use CSS selectors as a first pass, then feed ambiguous results to an LLM. Best of both worlds.
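
The hybrid option can be sketched as a small dispatcher. Everything here is an assumption made for illustration: the name `extract_hybrid`, the choice of required fields, and the rule that an empty or incomplete selector result counts as "ambiguous" and triggers the LLM. Both extractors are passed in as callables, so the sketch makes no claim about how either one is implemented.

```python
from typing import Callable

# Hypothetical definition of a "complete" listing.
REQUIRED_FIELDS = ("title", "company", "location")

Extractor = Callable[[str], list[dict]]


def is_complete(job: dict) -> bool:
    """A listing is complete when every required field is non-empty."""
    return all(job.get(field) for field in REQUIRED_FIELDS)


def extract_hybrid(
    html: str, selector_extract: Extractor, llm_extract: Extractor
) -> tuple[list[dict], str]:
    """Try the cheap selector path first; fall back to the LLM if it looks off."""
    try:
        jobs = selector_extract(html)
    except Exception:
        # A broken selector should degrade to the fallback, not crash the run.
        jobs = []
    if jobs and all(is_complete(job) for job in jobs):
        return jobs, "selectors"
    return llm_extract(html), "llm"
```

Returning the source label alongside the data makes it easy to log how often the selectors fail, which is a useful signal that a site has changed its markup.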

What I'd Do Differently

I should have moved to this pattern earlier. I spent weeks maintaining scrapers that broke every month. Now I treat the LLM as a "human reader" that doesn't get bored.

One improvement: I now include examples in the prompt (few-shot). Showing it one correct extraction improves consistency. Also, I always trim HTML to the relevant section (e.g., remove header/footer) to reduce noise.
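
A one-shot prompt along those lines might be assembled like this. The example card, its values, and the helper name `build_few_shot_messages` are all made up for illustration; in practice the example would be a real snippet from one of the target sites. The returned list uses the role/content message shape already used for the `messages` argument in the code earlier in this post.

```python
import json

# Hypothetical example pair: one small HTML card and its correct extraction.
EXAMPLE_HTML = (
    '<div class="card"><h3>Data Engineer</h3><span>Acme</span>'
    '<span>Remote</span><a href="/jobs/42">Apply</a></div>'
)
EXAMPLE_OUTPUT = [
    {
        "title": "Data Engineer",
        "company": "Acme",
        "location": "Remote",
        "description": None,
        "url": "/jobs/42",
    }
]

INSTRUCTIONS = (
    "Extract all job listings from the HTML the user sends. "
    "Return a JSON array of objects with fields: "
    "title, company, location, description, url. "
    "If a field is missing, use null. Only return the JSON array."
)


def build_few_shot_messages(clean_html: str) -> list[dict]:
    """Build chat messages containing one worked example before the real page."""
    return [
        {"role": "system", "content": INSTRUCTIONS},
        # The example is presented as an earlier turn of the conversation.
        {"role": "user", "content": f"HTML:\n{EXAMPLE_HTML}"},
        {"role": "assistant", "content": json.dumps(EXAMPLE_OUTPUT)},
        {"role": "user", "content": f"HTML:\n{clean_html}"},
    ]
```

Serializing the example with `json.dumps` guarantees that the sample answer is valid JSON, so the model is never shown a malformed target format.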

Final Thoughts

Web scraping doesn't have to be a race against changing layouts. By focusing on semantics instead of structure, we can build scrapers that survive redesigns. The LLM isn't perfect, but it's a shift from "how do I parse this?" to "what do I want to extract?"

That shift has saved my side project. I'm curious: what's your approach to scraping sites that change often? Have you tried using LLMs for extraction?


DEV Community