How I Finally Scraped That Impossible Site with AI Parsing

#ai #python #webdev #tutorial

I’ve been scraping the web for years. But last month, I hit a wall that made me question my whole toolkit.

I needed to pull product data from a modern e‑commerce site—prices, descriptions, ratings, the usual. Simple, right? I’d done this a hundred times. Open DevTools, find the API endpoint, copy the token, and fetch JSON. But this site was different.

The HTML was a sea of div and span with random class names, all generated by a React framework. Every page loaded via JavaScript, and the real data was buried in scripts or fetched after the initial render. Requests + BeautifulSoup returned only a skeleton. Selenium worked, but the site had aggressive rate‑limiting: after five requests my IP was blocked. I tried rotating proxies, adding delays, even mimicking browser fingerprints. Still, after a few dozen pages, I got CAPTCHAs everywhere. I was spending more effort fighting the bot detection than writing actual logic.

What I tried that didn’t work

I went through the usual playbook:

Requests + BeautifulSoup – got zero dynamic content.
Selenium + headless Chrome – worked, but slow and brittle. A single CSS class change broke my selectors.
Playwright – faster than Selenium, but still needed anti‑detection tricks (user‑agent, viewport, etc.).
Scrapy with Splash – overkill for this project, and Splash’s JavaScript engine didn’t handle all of the async rendering.
Apify / scraping APIs – they worked, but the cost added up fast, and they still required manual selectors.

Every approach demanded I keep up with the site’s front‑end changes. If they shuffled classes or added a new micro‑frontend, my parser broke. I was maintaining a fragile tower of XPath expressions and regexes.

The turning point

A coworker mentioned using an LLM to extract data directly from HTML—no selectors at all. At first I laughed. “You want me to shove raw HTML into GPT and ask for JSON?” But I was desperate. I gave it a shot with a small sample.

I grabbed the full HTML of the product page (about 30 KB), fed it into OpenAI’s gpt-4o with a simple prompt, and it returned perfect JSON. No selectors, no waiting for JavaScript: just the raw source, which I could get with a single requests call, because the server‑rendered skeleton included the product data in a <script> tag. On pages where the data was lazy‑loaded instead, I waited for the full DOM in Playwright, saved the outerHTML, and sent that to the LLM.

How it works

Here’s the core idea: instead of writing brittle extraction rules, you describe the data model and let the LLM parse the HTML into structured output. You don’t need to understand the site’s specific markup. The LLM sees the whole block and “understands” where the price is, even if it’s inside a random <span class="a1b2c3">.

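To make this reliable, use function calling (tool use) to constrain the output to a JSON schema. The model then returns the fields you asked for, in the types you specified, as the arguments of the function call. Plain function calling steers the model toward the schema rather than guaranteeing it, so you should still validate the result; OpenAI’s strict mode is the option to look at if you need a hard guarantee.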

Code example

Below is a Python function that extracts a product’s name, price, description, and rating from HTML using OpenAI’s API. It assumes you already have the raw HTML (from a fetch or from a headless browser).

import json
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable; avoid hard-coding keys

PRODUCT_SCHEMA = {
    "name": "extract_product",
    "description": "Extract product information from HTML",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "string"},
            "description": {"type": "string"},
            # JSON Schema has no "nullable" keyword; allow null via a type union.
            "rating": {"type": ["number", "null"]},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY"]}
        },
        "required": ["name", "price", "description"]
    }
}

def extract_product_from_html(html: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert data extractor. "
                    "Given the raw HTML of a product page, extract the requested fields. "
                    "Return only valid JSON matching the schema."
                )
            },
            {
                "role": "user",
                # Crude cap of 15,000 characters (not tokens); anything past
                # that point is never seen by the model.
                "content": f"Here is the HTML: {html[:15000]}"
            }
        ],
        tools=[{
            "type": "function",
            "function": PRODUCT_SCHEMA
        }],
        # Force a call to extract_product instead of a free-text reply
        tool_choice={"type": "function", "function": {"name": "extract_product"}}
    )

    # The extracted fields arrive as the JSON-encoded arguments of the tool call
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)

# Usage example (after you've fetched the HTML; needs `import requests`)
# html = requests.get(url).text
# data = extract_product_from_html(html)
# print(data["price"])  # e.g. "$29.99"

This approach gives you a single function that works across dozens of different sites without any per‑site configuration—as long as the information is present in the HTML.
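One caveat on the schema above: plain function calling nudges the model toward the schema but does not enforce it. Here is a minimal sketch of a helper that tightens a schema for OpenAI's strict mode. The helper name to_strict_function is mine, and the rules it applies (set "strict": True, set "additionalProperties": False, list every property as required, and express optional fields by allowing null) are my reading of how strict mode works, so check them against the current API documentation before relying on this.

```python
import copy


def to_strict_function(schema: dict) -> dict:
    """Return a tightened copy of a function-calling schema.

    Hypothetical helper. Assumes every property declares a "type" and that
    strict mode wants all properties required, with optional ones nullable.
    """
    strict = copy.deepcopy(schema)  # leave the caller's schema untouched
    params = strict["parameters"]
    originally_required = set(params.get("required", []))

    for field, spec in params["properties"].items():
        # "nullable" is an OpenAPI flag, not a JSON Schema keyword.
        spec.pop("nullable", None)
        if field in originally_required:
            continue
        # Optional field: keep it in the output, but let the model return null.
        raw = spec.get("type", [])
        types = [raw] if isinstance(raw, str) else list(raw)
        if "null" not in types:
            types.append("null")
        spec["type"] = types
        # An enum must also admit null, or a null value would fail validation.
        if "enum" in spec and None not in spec["enum"]:
            spec["enum"] = spec["enum"] + [None]

    params["required"] = list(params["properties"])
    params["additionalProperties"] = False
    strict["strict"] = True
    return strict


# Usage with the schema defined earlier:
# tools = [{"type": "function", "function": to_strict_function(PRODUCT_SCHEMA)}]
```

With this variant, fields that were optional (rating, currency) always appear in the result and come back as null when the page doesn't contain them, so downstream code should check for None rather than for a missing key.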

I soon discovered that several hosted tools automate this pattern. One is Interwest Info AI, which wraps LLM‑based extraction into an API. But the technique itself is completely open—you can implement it with any LLM that supports tool use.

Lessons learned and trade‑offs

This approach isn’t a silver bullet. Here’s what I learned:

Cost: Each extraction costs roughly $0.01–0.05 (depending on HTML size and model). For a few hundred pages it’s fine; for millions you’d need a cheaper fallback.
Latency: The LLM call takes 2–5 seconds. If you need real‑time scraping, this is too slow. I use it for offline batch jobs.
Accuracy: It’s >95% accurate for common fields, but it can hallucinate or pick the wrong value when the HTML is ambiguous (e.g., a sale price next to a crossed‑out list price). Always validate outputs.
Token limits: Large HTML pages can exceed the model’s context window. You may need to strip irrelevant parts (like long scripts) before sending.
Anti‑bot detection still exists: Getting the raw HTML still requires bypassing Cloudflare, etc. This technique only replaces the extraction layer, not the fetching layer.
When NOT to use it: If the site exposes a clean API or JSON‑LD, use that. If you need to scrape thousands of pages daily, a traditional parser is cheaper and faster. Also, if the data is in images or PDFs, you’d need a different approach (multimodal LLM).
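To make "always validate outputs" concrete, here is a rough sketch of a post-extraction check. The function name validate_product, the 0 to 5 rating scale, and the specific rules are my assumptions, not something the site or the API dictates. The most useful rule is the grounding check: if the number in the extracted price never appears in the source HTML, the model probably invented or reformatted it.

```python
import re
from typing import List


def validate_product(data: dict, html: str) -> List[str]:
    """Return a list of problems; an empty list means the record passed."""
    problems = []

    # Required text fields must be present and non-empty.
    for field in ("name", "price", "description"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # Grounding check: the numeric part of the price should occur verbatim
    # in the source HTML.
    price = data.get("price")
    if isinstance(price, str):
        match = re.search(r"\d+(?:[.,]\d+)*", price)
        if match is None:
            problems.append("price contains no number")
        elif match.group(0) not in html:
            problems.append("price not found in source HTML")

    # Rating is optional, but when present it should be a sane number.
    rating = data.get("rating")
    if rating is not None:
        if isinstance(rating, bool) or not isinstance(rating, (int, float)):
            problems.append("rating is not a number")
        elif not 0 <= rating <= 5:  # assumed 0-5 star scale
            problems.append("rating outside 0-5")

    return problems
```

The grounding check will produce false alarms when the model normalizes a price (turning "1.299,00" into "1299.00", for example), so treat a failure as a flag for review or a retry, not as proof of a hallucination.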

What I’d do differently next time

I’d start with the simplest option: check for embedded JSON (<script type="application/ld+json">) before using an LLM, and only fall back to the LLM if that fails. I’d also pre‑process the HTML to strip elements that are unlikely to contain data (like footers, navigation, or large scripts) to reduce token usage and cost.
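Both steps fit in one pass over the page using only the standard library. This is a sketch under a few assumptions: the names prepare_page and SKIP_TAGS are mine, the tag list is a guess at what is usually noise, and the pruning is more aggressive than simply removing elements, since it keeps only visible text and drops all markup and attributes. It also only recognizes a top-level JSON-LD object (or list of objects) whose "@type" is exactly "Product", so nested "@graph" layouts would need extra handling.

```python
import json
from html.parser import HTMLParser
from typing import Optional, Tuple

# Assumed list of tags whose contents rarely hold product data.
SKIP_TAGS = {"script", "style", "noscript", "svg", "nav", "footer"}


class _PageScanner(HTMLParser):
    """Collects JSON-LD blocks and the visible text outside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.jsonld_blocks = []
        self.text_parts = []
        self._skip_depth = 0
        self._in_jsonld = False
        self._jsonld_buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._jsonld_buf = []
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._jsonld_buf))
            self._in_jsonld = False
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def _find_product(blocks) -> Optional[dict]:
    """Return the first JSON-LD object typed as a Product, if any."""
    for raw in blocks:
        try:
            doc = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: ignore it and keep looking
        for item in doc if isinstance(doc, list) else [doc]:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None


def prepare_page(html: str) -> Tuple[Optional[dict], str]:
    """Return (JSON-LD product or None, slimmed-down page text)."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    return _find_product(scanner.jsonld_blocks), "\n".join(scanner.text_parts)
```

If the first element of the result is not None, you already have structured data and can skip the LLM call entirely. Otherwise, the second element is a much smaller input to send than the raw HTML, at the cost of losing anything that lived only in attributes.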

For high‑volume jobs, I’d use a hybrid: cheap traditional selectors for 90% of fields, and an LLM only for the tricky fields (like arbitrary descriptions).
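The hybrid could look roughly like the sketch below. Everything named here is hypothetical: og_meta is a deliberately naive regex extractor for Open Graph meta tags (it assumes the property attribute comes before content and that both use double quotes), and llm_fallback stands in for whatever LLM call you use, taking the HTML plus the list of fields that are still missing and returning a dict.

```python
import re
from typing import Callable, Dict, List, Optional

FieldExtractor = Callable[[str], Optional[str]]


def og_meta(property_name: str) -> FieldExtractor:
    """Build a cheap extractor for <meta property="..." content="..."> tags."""
    pattern = re.compile(
        r'<meta\s+property="' + re.escape(property_name) + r'"\s+content="([^"]*)"',
        re.IGNORECASE,
    )

    def extract(html: str) -> Optional[str]:
        match = pattern.search(html)
        return match.group(1) if match else None

    return extract


def extract_hybrid(
    html: str,
    cheap: Dict[str, FieldExtractor],
    llm_fallback: Callable[[str, List[str]], dict],
) -> dict:
    """Try cheap extractors first; ask the LLM only for fields they missed."""
    result = {}
    missing = []
    for field, extractor in cheap.items():
        value = extractor(html)
        if value:
            result[field] = value
        else:
            missing.append(field)

    if missing:
        # One LLM call covers all remaining fields.
        llm_data = llm_fallback(html, missing)
        for field in missing:
            result[field] = llm_data.get(field)
    return result
```

When every cheap extractor succeeds, the LLM is never called, which is where the cost savings come from on high-volume jobs.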

Final thoughts

LLM‑powered extraction changed my mental model of web scraping. Instead of fighting the DOM, I just describe what I want and let the model figure out the structure. It’s not the right tool for every job, but when your XPath breaks for the fifth time, it’s a lifesaver.

I’m curious: have you tried using an LLM to parse HTML? What’s your go‑to hack when a site fights back?