zhongqiyue

Posted on Jun 29

Why I Stopped Writing Regex for Web Scraping and Used an LLM


I spent three days trying to scrape a single table from a government website. The HTML was a nightmare—no semantic classes, random whitespace, and the structure changed every time the page loaded. I had a regex pattern that worked for exactly one hour before the site pushed an update and broke everything. That’s when I decided to let a language model do the parsing.

This isn’t a post about a specific tool. It’s about a technique I wish I’d known earlier: using an LLM to extract structured data from messy or unpredictable HTML, instead of fighting with CSS selectors or regex.

The Problem That Broke Me

I was building a small side project to track local municipal meeting minutes. Each city publishes them differently—some as PDFs, some as tables, some as nested div soup. I wanted one consistent JSON record per meeting.

My first attempt was Beautiful Soup plus regex. It worked for one city. Then I added a second. The codebase grew into a fragile forest of find_all('div', class_=lambda x: x and 'meeting' in x). Every new site required hours of debugging.

When a city redesigned its entire portal, my scraper failed silently for two weeks before I noticed. That was the last straw.

Dead Ends I Hit

Regex on raw HTML: Too brittle. A single newline or unescaped character broke extraction.
XPath queries: Powerful but site-specific. Maintaining a library of XPaths was worse than maintaining regex.
Headless browser with clever waits: Helped with dynamic content, but parsing the loaded DOM still required fragile selectors.
Visual diffing / html2text on its own: Lost structure. Plain text alone didn't give me the entities I needed, like meeting date, location, and agenda items.

The common thread: every approach assumed the HTML would stay consistent. That assumption is almost always wrong for real-world sites.

What Actually Worked: LLM-Based Extraction

Instead of trying to find the data structure, I switched to asking: what data is in this text? I took the raw text (extracted with a simple HTML-to-text converter) and sent it to an LLM with a structured output prompt.

Here’s the core idea in Python:

import json

import openai  # reads the API key from the OPENAI_API_KEY environment variable
from html2text import HTML2Text

# Convert HTML to plain text (reduces noise)
converter = HTML2Text()
converter.ignore_links = False
converter.ignore_images = True

raw_html = """<div class="meeting-item">
  <h3>May 12, 2024 Meeting</h3>
  <p>Location: City Hall, Room 200</p>
  <ul><li>Agenda item 1: Budget review</li></ul>
</div>"""

text = converter.handle(raw_html)

# Define what we want the LLM to extract
system_prompt = """Extract structured data from the following meeting text. Return valid JSON with keys:
- date: string in YYYY-MM-DD
- location: string
- items: list of strings
If you can't determine a value, use null."""

response = openai.chat.completions.create(
    model="gpt-4o-mini",  # cheaper than gpt-4, good for extraction
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text}
    ],
    response_format={"type": "json_object"},  # JSON mode: valid JSON syntax, but the keys are not enforced
    temperature=0.1  # Low randomness for consistent extraction
)

meeting_data = json.loads(response.choices[0].message.content)
print(meeting_data)
# Example output (exact wording can vary between runs):
# {'date': '2024-05-12', 'location': 'City Hall, Room 200', 'items': ['Budget review']}

I used OpenAI’s API here, but the same pattern works with any LLM provider (Anthropic, local models via Ollama, etc.). The key is the combination of text simplification and structured prompting.

Lessons Learned (The Hard Way)

1. It’s not magic – you still need some human oversight

LLMs hallucinate. If a page has two meetings, the model might merge them into one record. I add a validation step: check that each extracted date parses and falls within a plausible range, and that each location is a real place.
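Here's a minimal sketch of that kind of validation step. The function name validate_meeting, the date window, and the "location must appear in the source text" check are assumptions I'm using for illustration, so treat them as a starting point rather than a complete set of checks.

```python
from datetime import date, timedelta


def validate_meeting(record, source_text, today=None, max_age_days=3650):
    """Return a list of problems found in one extracted record (empty means OK)."""
    today = today or date.today()
    problems = []

    # The date must parse as YYYY-MM-DD and fall inside a plausible window.
    raw_date = record.get("date")
    if raw_date is None:
        problems.append("date is missing")
    else:
        try:
            parsed = date.fromisoformat(raw_date)
        except (TypeError, ValueError):
            problems.append(f"date is not YYYY-MM-DD: {raw_date!r}")
        else:
            earliest = today - timedelta(days=max_age_days)
            latest = today + timedelta(days=365)
            if not (earliest <= parsed <= latest):
                problems.append(f"date outside plausible range: {raw_date}")

    # Cheap hallucination guard: the location should appear in the source text.
    location = record.get("location")
    if location and location.lower() not in source_text.lower():
        problems.append(f"location not found in source text: {location!r}")

    # Items must be a list of non-empty strings.
    items = record.get("items")
    if not isinstance(items, list) or not all(
        isinstance(item, str) and item.strip() for item in items
    ):
        problems.append("items is not a list of non-empty strings")

    return problems


text = "May 12, 2024 Meeting\nLocation: City Hall, Room 200\n* Agenda item 1: Budget review"
good = {"date": "2024-05-12", "location": "City Hall, Room 200", "items": ["Budget review"]}
print(validate_meeting(good, text, today=date(2024, 6, 29)))  # []
```

Anything that comes back with a non-empty problem list goes into a review pile instead of the database. The substring check is deliberately strict: it will flag a location the model reworded, which is usually what you want to look at by hand anyway.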

2. Cost can add up

For my side project (scraping ~100 pages per week), I spend about $2/month using gpt-4o-mini. That’s fine. But if you scrape millions of pages, the token bill might exceed the cost of the engineering time you’d save by not writing selectors by hand.
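To make the scaling concrete, here's a back-of-the-envelope extrapolation from the two numbers above. It assumes cost grows linearly with page count, that pages stay roughly the same length, and that pricing doesn't change, none of which is guaranteed.

```python
# Figures from the paragraph above; everything derived from them is an estimate.
PAGES_PER_WEEK = 100
MONTHLY_COST_USD = 2.00
WEEKS_PER_MONTH = 52 / 12  # about 4.33

pages_per_month = PAGES_PER_WEEK * WEEKS_PER_MONTH
cost_per_page = MONTHLY_COST_USD / pages_per_month


def estimated_cost(pages: int) -> float:
    """Linear extrapolation: assumes similar page length and unchanged pricing."""
    return pages * cost_per_page


for pages in (1_000, 100_000, 1_000_000):
    print(f"{pages:>9,} pages -> ${estimated_cost(pages):,.2f}")
```

That works out to a bit under half a cent per page, or roughly $4,600 per million pages. At that point a week of writing selectors starts to look cheap.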

3. Latency is real

Each extraction call takes 1–3 seconds. For batch processing that’s acceptable, but for real-time scraping it’s painful. I pre-filter with lightweight rules (e.g., skip pages under 500 characters) to avoid unnecessary API calls.
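A pre-filter like that can be a few lines of plain Python. In this sketch the 500-character threshold comes from the text above, while the keyword list and the function name worth_sending are hypothetical choices for a meeting-minutes project.

```python
import re

MIN_CHARS = 500  # threshold mentioned above
KEYWORDS = ("meeting", "agenda", "minutes")  # hypothetical hints for this project


def worth_sending(text: str) -> bool:
    """Cheap checks that run before any API call."""
    # Collapse whitespace so padding doesn't count toward the length.
    stripped = re.sub(r"\s+", " ", text).strip()
    if len(stripped) < MIN_CHARS:
        return False
    lowered = stripped.lower()
    return any(word in lowered for word in KEYWORDS)
```

Pages that fail the check never reach the model, so they cost nothing and add no latency.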

4. The HTML-to-text step is critical

Garbage in, garbage out. If you feed raw HTML tokens like <div class="mt-4">, the model might get confused. Using a robust converter (I like html2text or trafilatura) reduces noise and improves accuracy.
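To show what the conversion step is doing, here's a standard-library-only stand-in. It is not how html2text or trafilatura work internally (both do far more); it's just a sketch of the idea: keep the visible text, drop tags and the contents of script and style blocks.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    '<div class="mt-4"><style>.mt-4{margin:1rem}</style>'
    "<h3>May 12, 2024 Meeting</h3><p>Location: City Hall, Room 200</p></div>"
)
print(html_to_text(sample))
```

The model then sees two short lines of text instead of class names and CSS, which also means fewer tokens per call.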

When NOT to Use This Approach

You have a small, stable set of sites: Classic selectors are faster and cheaper. Don’t over-engineer.
You need real-time results: LLM latency is too high for live user requests unless you cache aggressively.
Your data is extremely sensitive: Sending text to an external API may violate privacy policies. Consider a local model (e.g., Llama 3.1 8B via Ollama).
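For the local-model route, here's a rough sketch of the same extraction call against Ollama's HTTP chat endpoint, using only the standard library. It assumes Ollama is running on its default port and that you've already pulled a model tagged llama3.1:8b; the helper names build_payload and extract_locally are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local address


def build_payload(system_prompt: str, text: str, model: str = "llama3.1:8b") -> dict:
    """Build the request body; model tag is an assumption, use whatever you pulled."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
        "format": "json",  # ask for JSON output
        "stream": False,  # one complete response instead of chunks
        "options": {"temperature": 0.1},
    }


def extract_locally(system_prompt: str, text: str) -> dict:
    """Send the text to the local model and parse its JSON reply."""
    body = json.dumps(build_payload(system_prompt, text)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    return json.loads(reply["message"]["content"])
```

Nothing leaves the machine this way. Expect a smaller model to make more extraction mistakes, so the validation step matters even more here.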

What I’d Do Differently Next Time

Start with a fallback chain: try CSS selectors first, and only fall back to LLM parsing when selectors fail. That hybrid approach saves both money and time.
Use a JSON schema validation library (like pydantic) to enforce the output shape and catch errors immediately.
Keep a log of every LLM extraction with the original text and extracted data, so I can review misparses and tweak the prompt.
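Put together, those three ideas might look like the sketch below. The two extractors are passed in as plain functions, so selector_extract and llm_extract are placeholders for your own code. The is_complete check is a hand-rolled stand-in for a pydantic model, and the JSON Lines log file name is an arbitrary choice.

```python
import json
import time

REQUIRED_KEYS = ("date", "location", "items")


def is_complete(record) -> bool:
    """Stand-in for schema validation: every required key present and non-null."""
    return isinstance(record, dict) and all(
        record.get(key) is not None for key in REQUIRED_KEYS
    )


def extract_with_fallback(html, selector_extract, llm_extract, log_path="extractions.jsonl"):
    """Try cheap selectors first; call the LLM only when they fail."""
    method = "selectors"
    try:
        record = selector_extract(html)
    except Exception:
        record = None  # a broken selector should fall through, not crash the run

    if not is_complete(record):
        method = "llm"
        record = llm_extract(html)

    # Append one JSON line per extraction so misparses can be reviewed later.
    entry = {"ts": time.time(), "method": method, "input": html, "output": record}
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")

    return record
```

A nice side effect of logging the method is that you can see how often the fallback fires per site. A sudden jump usually means a redesign, which is exactly the failure I missed for two weeks.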

The Real Takeaway

Using an LLM for extraction doesn’t mean you “solve scraping.” It means you move the fragility from code to language. Instead of chasing changing HTML structures, you manage the ambiguity with a prompt. That tradeoff works beautifully for messy, low-volume data sources.

There are commercial services that package this idea (Interwest Info’s AI extraction tool comes to mind, though I haven’t tried it). But building it yourself gives you full control and a much deeper understanding of when the approach fails.

What’s your most cursed scraping experience? Have you ever thrown an LLM at a parsing problem?
