Last month, I had to scrape product data from a dozen e-commerce sites. Each site had its own HTML structure, inconsistent CSS classes, and the worst part? The product descriptions were nested inside a dozen different containers. I did what any sane developer would do: I reached for regex.
Three days later, I had a pile of brittle patterns that worked for one site, failed for another, and broke the moment the page layout changed by a single `<div>`. I was this close to giving up and hiring a VA to copy-paste data.
Then it hit me: Instead of trying to describe the pattern explicitly, why not show an AI a few examples and let it figure out the rest?
## What I tried that didn't work
### 1. Pure CSS selectors
For sites with clean classes, `document.querySelectorAll('.price')` works. But the real world gives you `.productPrice_3xK2f` or `.css-1a2b3c`: auto-generated class names that can change on every deploy.
### 2. XPath
XPath is more flexible, but writing `//div[contains(@class, 'price')]//span[2]` felt like guessing. And when the page structure varied across product pages, my XPath expressions crumbled.
### 3. Regex hell
I tried to extract JSON from `<script>` tags, parse inline styles, and split by keywords. Each new site demanded a custom regex. I had a file with 40+ patterns, and I still missed data.
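
To make the brittleness concrete, here's a tiny sketch. The pattern and both HTML snippets are made up for illustration (they aren't from the sites I scraped): the same price, wrapped slightly differently, is enough to make the pattern miss.

```python
import re

# Pattern written against the first site's markup (hypothetical example).
PRICE_RE = re.compile(r'<span class="price">\$([\d.]+)</span>')

site_a = '<div><span class="price">$19.99</span></div>'
# Same data, but with an auto-generated class name and an extra wrapper element.
site_b = '<div><span class="css-1a2b3c"><b>$19.99</b></span></div>'

match_a = PRICE_RE.search(site_a)
match_b = PRICE_RE.search(site_b)

print(match_a.group(1) if match_a else None)  # finds the price on site A
print(match_b.group(1) if match_b else None)  # finds nothing on site B
```

Multiply that by a dozen sites and 40+ patterns and you get the picture.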
## The approach that finally worked: Few-shot text extraction with an LLM
Instead of writing rules, I decided to treat extraction as a language task. I gave an AI (OpenAI's GPT-4) a few examples of the raw HTML and the exact fields I wanted, and asked it to output JSON.
Here's the core idea:
- Don't parse the HTML structurally—just feed it as plain text.
- Provide 2-3 examples (few-shot prompting) to show the desired output format.
- Use a single API call with `temperature=0` so the output is as repeatable as possible (it reduces run-to-run variation but doesn't strictly guarantee identical results).
### The code
First, I wrote a Python function that takes raw HTML and returns a list of products as dictionaries.
```python
import json
import os
from typing import Any, Dict, List

import openai

# Read the key from the environment instead of hard-coding it.
openai.api_key = os.environ["OPENAI_API_KEY"]


def extract_products(html_text: str) -> List[Dict[str, Any]]:
    """
    Extract product name, price, and description from HTML using GPT-4.
    """
    # Literal braces in the JSON example are doubled ({{ }}) because this is an f-string.
    prompt = f"""You are a data extraction assistant. Given the raw HTML of a product listing page, extract each product as a JSON object with fields: "name", "price", "description". Return a JSON array.
Examples:
Input HTML:
<div class="item">
<h2>Widget A</h2>
<span class="cost">$19.99</span>
<p class="desc">This widget does X and Y.</p>
</div>
<div class="item">
<h2>Widget B</h2>
<span class="cost">$29.99</span>
<p class="desc">This widget does Z.</p>
</div>
Output JSON:
[{{"name": "Widget A", "price": "$19.99", "description": "This widget does X and Y."}},
{{"name": "Widget B", "price": "$29.99", "description": "This widget does Z."}}]
Now extract from this HTML:
{html_text}
Output only the JSON array, no extra text."""
    # Note: openai.ChatCompletion is the pre-1.0 SDK interface; newer SDK versions
    # expose the same call as client.chat.completions.create(...).
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=2000,
    )
    content = response.choices[0].message.content.strip()
    # Sometimes the model wraps JSON in ```json ... ```
    if content.startswith("```"):
        content = content.split("\n", 1)[1].rsplit("\n", 1)[0]
    return json.loads(content)
```
**Important:** This is a minimal example. In production, you'd want to:
- Use `gpt-3.5-turbo` for cost savings (works well for simple extractions).
- Add retry logic and error handling.
- Validate the output schema before using.
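
For the last point, here's a minimal sketch of what "validate the output schema" could look like using only the standard library. `validate_products` and `REQUIRED_FIELDS` are names I'm making up for illustration, and the rule "every field is a non-empty string" is an assumption about what you want, not something the model guarantees.

```python
from typing import Any, Dict, List

# Fields the prompt asks for; adjust to match your own prompt.
REQUIRED_FIELDS = ("name", "price", "description")


def validate_products(data: Any) -> List[Dict[str, str]]:
    """Return the items unchanged if they match the expected shape, else raise ValueError."""
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {i} is not an object")
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            # Assumption: every field should be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"item {i} has a missing or empty {field!r}")
    return data
```

You'd call it on the parsed JSON, e.g. `validate_products(json.loads(content))`, and treat a `ValueError` like any other failed extraction.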
### How I used it
I fetched the page with `requests` and kept the raw HTML as one unparsed string (BeautifulSoup isn't needed for this step). Then I called the function above.
```python
import requests

url = "https://example.com/products"
# A timeout keeps a stalled server from hanging the whole scrape.
html = requests.get(url, timeout=30).text
products = extract_products(html)
for p in products:
    print(p["name"], p["price"])
```
I also added a fallback: if the AI request failed (e.g., timeout), I fell back to a simple regex for basic data. But 95% of the time, the AI worked.
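
The fallback looked roughly like the sketch below. This is a reconstruction for illustration, not my exact code: `regex_fallback`, `extract_with_fallback`, and the `<h2>`-plus-dollar-price pattern are assumptions, and the extractor is passed in as a parameter so the wrapper doesn't care which model sits behind it.

```python
import re
from typing import Callable, Dict, List

# Deliberately crude: pairs each <h2> title with the first $-price that follows it.
BASIC_RE = re.compile(r"<h2[^>]*>(.*?)</h2>.*?(\$\d+(?:\.\d{2})?)", re.DOTALL)


def regex_fallback(html: str) -> List[Dict[str, str]]:
    """Recover just name and price; the description is left empty."""
    return [
        {"name": name.strip(), "price": price, "description": ""}
        for name, price in BASIC_RE.findall(html)
    ]


def extract_with_fallback(
    html: str, extractor: Callable[[str], List[Dict[str, str]]]
) -> List[Dict[str, str]]:
    """Try the LLM-backed extractor first; on any error, fall back to the regex."""
    try:
        return extractor(html)
    except Exception:
        # In real code you'd probably narrow this to timeouts and JSON errors.
        return regex_fallback(html)
```

Usage would be `extract_with_fallback(html, extract_products)`.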
## Lessons learned / trade-offs
### The good
- **No more regex maintenance.** New site? Show the AI an example or two and it usually just works.
- **Resilient to layout changes.** As long as the data is in the HTML somewhere, the AI can find it.
- **Handles variations.** One site had prices in `<span>`, another in `<strong>`. The AI adapted.
### The bad
- **Cost.** Each page cost ~$0.02 with GPT-4 on long HTML. For hundreds of pages, that adds up. Using GPT-3.5-turbo cut it to $0.002 per page.
- **Latency.** 2–5 seconds per request. That's fine for a one-time scrape, but not for real-time APIs. I had to batch and cache.
- **Hallucinations.** Occasionally the AI invented a price if the HTML was ambiguous. I added a sanity check: ensure price matches a regex for currency format. If not, re-prompt.
- **Privacy.** Sending raw HTML (possibly containing user data) to OpenAI's API is a no-go for sensitive environments. Consider a local model or a service like `ai.interwestinfo.com` that offers on-premise deployment for compliance.
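
Here's a sketch of the price sanity check from the hallucination bullet. The currency pattern and the function names are assumptions for illustration. I've also added one check beyond what I described above: the price string has to appear verbatim in the source HTML, which catches invented prices that happen to be well formatted.

```python
import re
from typing import Callable, Dict, List

# Assumed format: a currency symbol, digits with optional thousands separators,
# and optional cents. Adjust for the currencies you actually expect.
PRICE_RE = re.compile(r"^[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?$")


def price_looks_valid(price: str, html: str) -> bool:
    """True if the price is well formatted and literally present in the page."""
    price = price.strip()
    return bool(PRICE_RE.match(price)) and price in html


def extract_checked(
    html: str,
    extractor: Callable[[str], List[Dict[str, str]]],
    max_attempts: int = 2,
) -> List[Dict[str, str]]:
    """Re-run the extractor while any price fails the check, up to max_attempts."""
    products: List[Dict[str, str]] = []
    for _ in range(max_attempts):
        products = extractor(html)
        if all(price_looks_valid(str(p.get("price", "")), html) for p in products):
            return products
    # Still suspicious after re-prompting: keep only the rows that pass.
    return [p for p in products if price_looks_valid(str(p.get("price", "")), html)]
```

One caveat: with `temperature=0`, re-sending the identical prompt will often return the same answer, so a real re-prompt probably needs a tweaked instruction rather than a plain retry.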
### When NOT to use this approach
- Your data is perfectly structured and static—CSS selectors are faster and cheaper.
- You need sub-second response times.
- You cannot send data outside your network for legal reasons—but you could run a local LLM (like Llama 2) with a similar prompt.
- The HTML is enormous (megabytes per page): it will overflow the model's context window. Truncate or chunk the page first.
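
For that last case, a rough sketch of "truncate or chunk" might look like this. Both helpers and the 12,000-character default are assumptions for illustration. Note that `shrink_html` throws away `<script>` contents, so skip it if the data you want lives in embedded JSON.

```python
import re
from typing import List


def shrink_html(html: str) -> str:
    """Drop comments, scripts, and styles, then collapse whitespace."""
    html = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    html = re.sub(
        r"<(script|style)\b.*?</\1\s*>", "", html, flags=re.DOTALL | re.IGNORECASE
    )
    return re.sub(r"\s+", " ", html).strip()


def chunk_html(html: str, max_chars: int = 12000) -> List[str]:
    """Split into pieces of at most max_chars, preferring to cut just before a tag."""
    chunks = []
    while len(html) > max_chars:
        # Look for the last "<" inside the budget so we don't cut through a tag.
        cut = html.rfind("<", 1, max_chars)
        if cut <= 0:
            cut = max_chars  # no tag boundary found; hard cut
        chunks.append(html[:cut])
        html = html[cut:]
    if html:
        chunks.append(html)
    return chunks
```

This is crude: a product can still straddle two chunks, so you may want to cut on a repeating container (one chunk per product card) when the site has one, and deduplicate the merged results.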
## What I'd do differently next time
1. **Start with few-shot prompting from day one.** I wasted days on regex when the AI would've solved it in minutes.
2. **Measure recall and precision.** I manually checked 50 pages per site to tune the prompt and add validation rules.
3. **Cache AI results.** Every page extraction was saved to a local JSON file. If I reran, I skipped already-processed pages.
4. **Use a library for retries.** The `tenacity` library saved me from transient API errors.
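
The caching from point 3 can be as small as the sketch below. The function name, the `extraction_cache` directory, and keying the file on a hash of the URL are my assumptions for illustration; any stable key works.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict, List


def cached_extract(
    url: str,
    html: str,
    extractor: Callable[[str], List[Dict[str, str]]],
    cache_dir: Path = Path("extraction_cache"),
) -> List[Dict[str, str]]:
    """Return the saved result for this URL if present; otherwise extract and save it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so it is safe to use as a file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    products = extractor(html)
    path.write_text(json.dumps(products, ensure_ascii=False, indent=2), encoding="utf-8")
    return products
```

On a rerun, pages that already have a file on disk never hit the API, which saves both money and the 2–5 seconds per request.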
## Final thoughts
Look, I'm not saying throw away all your parsing knowledge. But for messy, semi-structured data extraction, LLMs are a legit tool now. They understand context, they infer missing parts, and they adapt quickly. The technique is the real hero here—the AI is just an implementation detail.
I still keep my regex cheatsheet close, but now I reach for it a lot less.
What's your go-to approach for extracting data from messy HTML? Ever tried letting an AI do the heavy lifting?