DEV Community

zhongqiyue
I spent 3 days writing regexes. Then I asked an AI to do it in 10 minutes.

Last month, I had to scrape product data from a dozen e-commerce sites. Each site had its own HTML structure, inconsistent CSS classes, and the worst part? The product descriptions were nested inside a dozen different containers. I did what any sane developer would do: I reached for regex.

Three days later, I had a pile of brittle patterns that worked for one site, failed for another, and broke the moment the page layout changed by a single <div>. I was this close to giving up and hiring a VA to copy-paste data.

Then it hit me: Instead of trying to describe the pattern explicitly, why not show an AI a few examples and let it figure out the rest?

What I tried that didn't work

1. Pure CSS selectors

For sites with clean classes, document.querySelectorAll('.price') works. But the real world gives you .productPrice_3xK2f or .css-1a2b3c—auto-generated class names that change on every deploy.

2. XPath

XPath is more flexible, but writing //div[contains(@class, 'price')]//span[2] felt like guessing. And when the page structure varied across product pages, my XPath expressions crumbled.

3. Regex hell

I tried to extract JSON from <script> tags, parse inline styles, and split by keywords. Each new site demanded a custom regex. I had a file with 40+ patterns, and I still missed data.
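
Here's a minimal sketch of that brittleness. The pattern and both HTML snippets are made up for illustration (they are not taken from the real sites), but they show how one extra wrapper element is enough to make a regex silently return nothing.

```python
import re
from typing import List

# Hypothetical pattern, written against one site's original markup.
PRICE_RE = re.compile(r'<span class="cost">\$(\d+\.\d{2})</span>')

original = '<div class="item"><span class="cost">$19.99</span></div>'
# Same data, but the site wrapped the amount in one extra element.
redesigned = '<div class="item"><span class="cost"><b>$19.99</b></span></div>'


def find_prices(html: str) -> List[str]:
    """Return every price the pattern can still see in the HTML."""
    return PRICE_RE.findall(html)


print(find_prices(original))    # the pattern matches the old layout
print(find_prices(redesigned))  # the pattern finds nothing after the change
```

The second call doesn't raise an error, it just returns an empty list, which is why this kind of breakage is easy to miss.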

The approach that finally worked: Few-shot text extraction with an LLM

Instead of writing rules, I decided to treat extraction as a language task. I gave an AI (OpenAI's GPT-4) a few examples of the raw HTML and the exact fields I wanted, and asked it to output JSON.

Here's the core idea:

  • Don't parse the HTML structurally—just feed it as plain text.
  • Provide 2-3 examples (few-shot prompting) to show the desired output format.
  • Use a single API call with temperature=0 so results are as repeatable as possible (outputs can still vary slightly between runs).

The code

First, I wrote a Python function that takes raw HTML and returns a list of products as dictionaries.

import openai
from typing import List, Dict, Any
import json

openai.api_key = "sk-..."  # Keep this in an environment variable, not in source code

def extract_products(html_text: str) -> List[Dict[str, Any]]:
    """
    Extract product name, price, and description from HTML using GPT-4.
    """
    prompt = f"""You are a data extraction assistant. Given the raw HTML of a product listing page, extract each product as a JSON object with fields: "name", "price", "description". Return a JSON array.

Examples:

Input HTML:
<div class="item">
  <h2>Widget A</h2>
  <span class="cost">$19.99</span>
  <p class="desc">This widget does X and Y.</p>
</div>
<div class="item">
  <h2>Widget B</h2>
  <span class="cost">$29.99</span>
  <p class="desc">This widget does Z.</p>
</div>

Output JSON:
[{{"name": "Widget A", "price": "$19.99", "description": "This widget does X and Y."}},
 {{"name": "Widget B", "price": "$29.99", "description": "This widget does Z."}}]

Now extract from this HTML:

{html_text}

Output only the JSON array, no extra text."""

    # Legacy interface (openai<1.0); newer SDK versions use a client object instead.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=2000
    )

    content = response.choices[0].message.content.strip()
    # Sometimes the model wraps the JSON in a Markdown code fence (```json ... ```)
    if content.startswith("```"):
        # Drop the opening fence line and the closing fence line
        content = content.split("\n", 1)[1].rsplit("\n", 1)[0]
    return json.loads(content)
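
One variation worth considering: instead of pasting the examples into a single f-string (where every literal brace has to be doubled), the few-shot examples can be sent as alternating user/assistant messages, with the expected output produced by json.dumps. This is only a sketch of that idea. The build_messages helper and the system prompt wording are assumptions for illustration, not the code used in this post.

```python
import json
from typing import Any, Dict, List, Tuple

# One few-shot example: (input HTML, expected list of product dicts).
Example = Tuple[str, List[Dict[str, Any]]]


def build_messages(html_text: str, examples: List[Example]) -> List[Dict[str, str]]:
    """Build a few-shot chat message list; each example becomes a user/assistant pair."""
    messages = [{
        "role": "system",
        "content": (
            "You are a data extraction assistant. Extract each product as a JSON "
            'object with fields "name", "price", "description". '
            "Output only the JSON array, no extra text."
        ),
    }]
    for example_html, expected in examples:
        messages.append({"role": "user", "content": example_html})
        # json.dumps writes the literal braces, so nothing needs escaping.
        messages.append({"role": "assistant", "content": json.dumps(expected)})
    # The page to extract from always goes last.
    messages.append({"role": "user", "content": html_text})
    return messages
```

The returned list can be passed as the messages argument of the chat completion call in place of the single user message.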


**Important:** This is a minimal example. In production, you'd want to:
- Use `gpt-3.5-turbo` for cost savings (works well for simple extractions).
- Add retry logic and error handling.
- Validate the output schema before using.
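
A minimal sketch of the schema validation step, using only the standard library. The helper names (`strip_code_fence`, `parse_products`) and the exact rules are assumptions for illustration. A real project might prefer a schema library instead.

```python
import json
from typing import Any, Dict, List

REQUIRED_FIELDS = ("name", "price", "description")
FENCE = "`" * 3  # a Markdown code fence marker


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines)
    return text.strip()


def parse_products(raw: str) -> List[Dict[str, Any]]:
    """Parse model output; raise ValueError if it does not match the expected schema."""
    try:
        data = json.loads(strip_code_fence(raw))
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of products")
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {i} is not a JSON object")
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            raise ValueError(f"item {i} is missing fields: {missing}")
    return data
```

Raising a single exception type makes it easy for the caller to decide whether to retry, re-prompt, or skip the page.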

### How I used it

I scraped the page with `requests` and `BeautifulSoup` to get the raw HTML (not parsed, just the full string kept as-is), then called the function above.

```python
import requests

url = "https://example.com/products"
html = requests.get(url).text
products = extract_products(html)
for p in products:
    print(p["name"], p["price"])
```

I also added a fallback: if the AI request failed (e.g., timeout), I fell back to a simple regex for basic data. But 95% of the time, the AI worked.
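
A rough sketch of what such a fallback could look like, with a few retries before giving up on the AI. The names (`extract_with_fallback`, `regex_fallback`), the retry counts, and the fallback pattern are all assumptions for illustration. The extractor is passed in as a function so the sketch doesn't depend on any API client.

```python
import re
import time
from typing import Any, Callable, Dict, List

Product = Dict[str, Any]

# Hypothetical last-resort pattern: anything that looks like a dollar price.
FALLBACK_PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")


def regex_fallback(html: str) -> List[Product]:
    """Return price-only records; name and description stay unknown (None)."""
    return [{"name": None, "price": price, "description": None}
            for price in FALLBACK_PRICE_RE.findall(html)]


def extract_with_fallback(
    html: str,
    extractor: Callable[[str], List[Product]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> List[Product]:
    """Try the LLM extractor a few times, then fall back to the regex."""
    for attempt in range(attempts):
        try:
            return extractor(html)
        except Exception as exc:  # timeouts, API errors, invalid JSON, ...
            print(f"attempt {attempt + 1} failed: {exc!r}")
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, then 2x, then 4x, ...
                time.sleep(base_delay * 2 ** attempt)
    return regex_fallback(html)
```

Usage would look like `extract_with_fallback(html, extract_products)`. The fallback records are deliberately incomplete, so downstream code can tell them apart from full extractions.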

## Lessons learned / trade-offs

### The good
- **No more regex maintenance.** New site? Add an example or two to the prompt and it usually works.
- **Resilient to layout changes.** As long as the data is in the HTML somewhere, the AI can usually find it.
- **Handles variations.** One site had prices in `<span>`, another in `<strong>`. The AI adapted.

### The bad
- **Cost.** Each page cost ~$0.02 with GPT-4 on long HTML. For hundreds of pages, that adds up. Using GPT-3.5-turbo cut it to $0.002 per page.
- **Latency.** 2–5 seconds per request. That's fine for a one-time scrape, but not for real-time APIs. I had to batch and cache.
- **Hallucinations.** Occasionally the AI invented a price if the HTML was ambiguous. I added a sanity check: ensure price matches a regex for currency format. If not, re-prompt.
- **Privacy.** Sending raw HTML (possibly containing user data) to OpenAI's API is a no-go for sensitive environments. Consider a local model or a service like `ai.interwestinfo.com` that offers on-premise deployment for compliance.
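
The hallucination check can be sketched roughly like this. The currency regex is an assumed format (symbol, digits, optional thousands separators and cents), and the extra test that the price string literally appears in the page is an addition for illustration, not something described above. It can produce false alarms when the page encodes the currency symbol as an HTML entity.

```python
import re
from typing import Any, Dict, List, Tuple

Product = Dict[str, Any]

# Assumed currency format, e.g. "$19.99", "€5", "$1,299.00".
PRICE_RE = re.compile(r"^[$€£]\s?\d+(?:,\d{3})*(?:\.\d{2})?$")


def split_trusted(products: List[Product], html: str) -> Tuple[List[Product], List[Product]]:
    """Separate products whose price looks real from ones that need a re-prompt."""
    trusted: List[Product] = []
    suspect: List[Product] = []
    for product in products:
        price = str(product.get("price", "")).strip()
        # Keep a product only if the price is well-formed AND present in the source page.
        if PRICE_RE.match(price) and price in html:
            trusted.append(product)
        else:
            suspect.append(product)
    return trusted, suspect
```

Anything that lands in the second list is a candidate for re-prompting or manual review.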

### When NOT to use this approach
- Your data is perfectly structured and static—CSS selectors are faster and cheaper.
- You need sub-second response times.
- You cannot send data outside your network for legal reasons—but you could run a local LLM (like Llama 2) with a similar prompt.
- The HTML is enormous (megabytes per page): the model's context window will overflow. Truncate or chunk the page first.
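
For the last point, chunking can be as simple as the sketch below. It assumes a character budget (`max_chars`) as a rough stand-in for the model's token limit, and it only avoids cutting a tag in half. A product block can still straddle two chunks, so results near chunk boundaries deserve a second look.

```python
from typing import List


def chunk_html(html: str, max_chars: int = 12000) -> List[str]:
    """Split HTML into pieces of at most max_chars, cutting after a '>' where possible."""
    chunks: List[str] = []
    start = 0
    while start < len(html):
        end = min(start + max_chars, len(html))
        if end < len(html):
            # Back up to the last '>' in the window so a tag is not cut in half.
            cut = html.rfind(">", start, end)
            if cut > start:
                end = cut + 1
        chunks.append(html[start:end])
        start = end
    return chunks
```

Each chunk can then be sent through the same extraction function and the resulting lists concatenated.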

## What I'd do differently next time

1. **Start with few-shot prompting from day one.** I wasted days on regex when the AI would've solved it in minutes.
2. **Measure recall and precision.** I manually checked 50 pages per site to tune the prompt and add validation rules.
3. **Cache AI results.** Every page extraction was saved to a local JSON file. If I reran, I skipped already-processed pages.
4. **Use a library for retries.** The `tenacity` library saved me from transient API errors.
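
A minimal sketch of the caching from point 3. The file layout (one JSON file per URL, named by a hash of the URL, in a `cache` directory) and the `cached_extract` helper are assumptions for illustration. The extractor is passed in so the sketch works with any extraction function.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable, Dict, List

Product = Dict[str, Any]


def cached_extract(
    url: str,
    html: str,
    extractor: Callable[[str], List[Product]],
    cache_dir: Path = Path("cache"),
) -> List[Product]:
    """Return cached products for this URL, calling the extractor only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    products = extractor(html)
    path.write_text(json.dumps(products, ensure_ascii=False, indent=2), encoding="utf-8")
    return products
```

Keying on the URL means a changed page is not re-extracted until its cache file is deleted. Hashing the HTML instead would trade that for more API calls.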

## Final thoughts

Look, I'm not saying throw away all your parsing knowledge. But for messy, semi-structured data extraction, LLMs are a legit tool now. They understand context, they infer missing parts, and they adapt quickly. The technique is the real hero here—the AI is just an implementation detail.

I still keep my regex cheatsheet close, but now I reach for it a lot less.

What's your go-to approach for extracting data from messy HTML? Ever tried letting an AI do the heavy lifting?