I spent three days trying to scrape a single table from a government website. The HTML was a nightmare—no semantic classes, random whitespace, and the structure changed every time the page loaded. I had a regex pattern that worked for exactly one hour before the site pushed an update and broke everything. That’s when I decided to let a language model do the parsing.
This isn’t a post about a specific tool. It’s about a technique I wish I’d known earlier: using an LLM to extract structured data from messy or unpredictable HTML, instead of fighting with CSS selectors or regex.
The Problem That Broke Me
I was building a small side project to track local municipal meeting minutes. Each city publishes them differently—some as PDFs, some as tables, some as nested div soup. I wanted one consistent JSON record per meeting.
My first attempt was Beautiful Soup + regex. It worked for one city. Then I added a second. The codebase grew into a fragile forest of find_all('div', class_=lambda x: x and 'meeting' in x). Every new site required hours of debugging.
When a city redesigned its entire portal, my scraper failed silently for two weeks before I noticed. That was the last straw.
Dead Ends I Hit
- Regex on raw HTML: Too brittle. A single newline or unescaped character broke extraction.
- XPath queries: Powerful but site-specific. Maintaining a library of XPaths was worse than maintaining regex.
- Headless browser with clever waits: Helped with dynamic content, but parsing the loaded DOM still required fragile selectors.
- Visual diffing / HTML2text: Lost structure. I needed entities like meeting date, location, and agenda items.
The common thread: every approach assumed the HTML would stay consistent. That assumption is almost always wrong for real-world sites.
What Actually Worked: LLM-Based Extraction
Instead of trying to find the data structure, I switched to asking: what data is in this text? I took the raw text (extracted with a simple HTML-to-text converter) and sent it to an LLM with a structured output prompt.
Here’s the core idea in Python:
import json
import openai
from html2text import HTML2Text
# Convert HTML to plain text (reduces noise)
converter = HTML2Text()
converter.ignore_links = False
converter.ignore_images = True
raw_html = """<div class="meeting-item">
<h3>May 12, 2024 Meeting</h3>
<p>Location: City Hall, Room 200</p>
<ul><li>Agenda item 1: Budget review</li></ul>
</div>"""
text = converter.handle(raw_html)
# Define what we want the LLM to extract
system_prompt = """Extract structured data from the following meeting text. Return valid JSON with keys:
- date: string in YYYY-MM-DD
- location: string
- items: list of strings
If you can't determine a value, use null."""
response = openai.chat.completions.create(
    model="gpt-4o-mini",  # cheaper than gpt-4, good for extraction
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ],
    # JSON mode: asks for syntactically valid JSON, but does not enforce the keys above
    response_format={"type": "json_object"},
    temperature=0.1,  # Low randomness for consistent extraction
)
meeting_data = json.loads(response.choices[0].message.content)
print(meeting_data)
# Example output (can vary by model and run):
# {'date': '2024-05-12', 'location': 'City Hall, Room 200', 'items': ['Budget review']}
I used OpenAI’s API here, but the same pattern works with any LLM provider (Anthropic, local models via Ollama, etc.). The key is the combination of text simplification and structured prompting.
Lessons Learned (The Hard Way)
1. It’s not magic – you still need some human oversight
LLMs hallucinate. If a page has two meetings, the model might merge them. I add a validation step: check that each extracted date falls within a plausible calendar range, and that each location refers to a real place.
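A validation step like that can be sketched with the standard library alone. This is a minimal version under assumptions of my own: the function name validate_meeting and the age limits are hypothetical, and the location check only tests for a non-empty string (confirming a real place would need a reference list that this post doesn't define).

```python
from datetime import date, timedelta


def validate_meeting(record, today=None, max_age_days=3650, max_future_days=365):
    """Return a list of problems found in an extracted meeting record.

    The limits are illustrative defaults, not values from the original project.
    """
    today = today or date.today()
    problems = []

    # The prompt asks for YYYY-MM-DD, so anything unparseable is suspect.
    raw_date = record.get("date")
    try:
        parsed = date.fromisoformat(raw_date)
    except (TypeError, ValueError):
        problems.append(f"date is missing or not an ISO date: {raw_date!r}")
    else:
        if parsed < today - timedelta(days=max_age_days):
            problems.append(f"date is implausibly old: {parsed}")
        if parsed > today + timedelta(days=max_future_days):
            problems.append(f"date is implausibly far in the future: {parsed}")

    # Only checks presence; it does not confirm the place exists.
    location = record.get("location")
    if not isinstance(location, str) or not location.strip():
        problems.append("location is missing or empty")

    items = record.get("items")
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        problems.append("items is not a list of strings")

    return problems
```

An empty list means the record passed; anything else is worth a manual look before it goes into the dataset.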
2. Cost can add up
For my side project (scraping ~100 pages per week), I spend about $2/month using gpt-4o-mini. That’s fine. But if you scrape millions of pages, the token bill might exceed the cost of the engineering time you would otherwise spend maintaining selectors.
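To see where that crossover sits for your own workload, a back-of-the-envelope estimate helps. The sketch below is only arithmetic; the function name is hypothetical, and every number you pass in (tokens per page, price per million tokens) is something you have to measure or look up on your provider's current price list.

```python
def monthly_cost_usd(pages_per_week, input_tokens_per_page, output_tokens_per_page,
                     usd_per_million_input, usd_per_million_output,
                     weeks_per_month=4.33):
    """Rough monthly API spend for one extraction call per page."""
    pages = pages_per_week * weeks_per_month
    input_cost = pages * input_tokens_per_page * usd_per_million_input / 1_000_000
    output_cost = pages * output_tokens_per_page * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder inputs, not real prices or measurements:
estimate = monthly_cost_usd(
    pages_per_week=100,
    input_tokens_per_page=2000,
    output_tokens_per_page=200,
    usd_per_million_input=1.0,
    usd_per_million_output=2.0,
)
print(f"${estimate:.2f} per month")
```

Because cost scales linearly with page count, multiplying pages_per_week by 10,000 multiplies the bill by 10,000 too, which is the point where selectors start to look cheap again.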
3. Latency is real
Each extraction call takes 1–3 seconds. For batch processing that’s acceptable, but for real-time scraping it’s painful. I pre-filter with lightweight rules (e.g., skip pages under 500 characters) to avoid unnecessary API calls.
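A pre-filter along those lines can be sketched in a few lines. The 500-character threshold comes from the rule above; the keyword list and the function name should_call_llm are assumptions I added for illustration, so tune or drop them for your own sources.

```python
MIN_CHARS = 500  # threshold mentioned above; tune per source


def should_call_llm(text, keywords=("meeting", "agenda", "minutes"), min_chars=MIN_CHARS):
    """Cheap gate that decides whether a page is worth an API call."""
    stripped = text.strip()
    # Very short pages are usually error pages or empty listings.
    if len(stripped) < min_chars:
        return False
    # Hypothetical extra rule: require at least one topical keyword.
    lowered = stripped.lower()
    return any(keyword in lowered for keyword in keywords)
```

Running this on the converted text rather than the raw HTML keeps markup from inflating the character count.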
4. The HTML-to-text step is critical
Garbage in, garbage out. If you feed raw HTML tokens like <div class="mt-4">, the model might get confused. Using a robust converter (I like html2text or trafilatura) reduces noise and improves accuracy.
When NOT to Use This Approach
- You have a small, stable set of sites: Classic selectors are faster and cheaper. Don’t over-engineer.
- You need real-time results: LLM latency is too high for live user requests unless you cache aggressively.
- Your data is extremely sensitive: Sending text to an external API may violate privacy policies. Consider a local model (e.g., Llama 3.1 8B via Ollama).
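On the caching point, one simple option is to key results on a hash of the converted text so an unchanged page never triggers a second call. This is a minimal in-memory sketch, not something from my project: the class name ExtractionCache is hypothetical, and extract_fn stands in for whatever function wraps your LLM call.

```python
import hashlib


class ExtractionCache:
    """Memoize extraction results by the SHA-256 of the input text."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn  # e.g. a function that calls the LLM
        self._store = {}

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # Only call the (slow, paid) extractor on text we have not seen.
        if key not in self._store:
            self._store[key] = self._extract_fn(text)
        return self._store[key]
```

The dictionary disappears when the process exits, so a real deployment would probably swap it for a file or database keyed the same way.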
What I’d Do Differently Next Time
- Start with a fallback chain: try CSS selectors first, and only fall back to LLM parsing when selectors fail. That hybrid approach saves money and speed.
- Use a JSON schema validation library (like pydantic) to enforce the output shape and catch errors immediately.
- Keep a log of every LLM extraction with the original text and extracted data, so I can review misparses and tweak the prompt.
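The fallback chain and the extraction log fit together naturally, so here is a rough sketch of both. Everything named here is hypothetical: rule_based stands in for a selector-based extractor that returns None when it fails, llm_based stands in for the LLM call, and jsonl_logger is just one possible log format (one JSON object per line).

```python
import json
import time


def jsonl_logger(path):
    """Return a function that appends one JSON line per log entry to path."""
    def log(entry):
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(entry) + "\n")
    return log


def extract_with_fallback(text, rule_based, llm_based, log=None):
    """Try the cheap extractor first; fall back to the LLM only on failure.

    rule_based and llm_based are callables taking text and returning a dict,
    with rule_based returning None when its selectors do not match.
    """
    record = rule_based(text)
    source = "rules"
    if record is None:
        record = llm_based(text)
        source = "llm"
        # Keep input and output together so misparses can be reviewed later.
        if log is not None:
            log({"ts": time.time(), "input": text, "output": record})
    return record, source
```

Calling it as extract_with_fallback(text, my_selectors, my_llm_call, log=jsonl_logger("extractions.jsonl")) would leave a reviewable trail of every page that needed the LLM; the file name is only an example.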
The Real Takeaway
Using an LLM for extraction doesn’t mean you “solve scraping.” It means you move the fragility from code to language. Instead of chasing changing HTML structures, you manage the ambiguity with a prompt. That tradeoff works beautifully for messy, low-volume data sources.
There are commercial services that package this idea (Interwest Info’s AI extraction tool comes to mind, though I haven’t tried it). But building it yourself gives you full control and a much deeper understanding of when the approach fails.
What’s your most cursed scraping experience? Have you ever thrown an LLM at a parsing problem?