DEV Community

zhongqiyue

I Spent 3 Days Scraping a Site — Then AI Did It in 10 Minutes

I’ve been building web scrapers for years. BeautifulSoup, Selenium, Playwright — I thought I’d seen it all. But last month I hit a wall so stubborn that I almost gave up on the entire project.

Here’s the story of how traditional scraping failed me, and why I now treat AI as a legitimate tool in my data extraction toolbox.

The Problem: A Site That Hates Scrapers

A client needed me to extract product listings from a fashion retailer. Not exactly rocket science, right? I opened the page, saw the usual suspects: div.product-card, CSS classes like price, title, image. I wrote a quick BeautifulSoup script, ran it, and… nothing.

The HTML was completely dynamic. Every product card was rendered by JavaScript, and the CSS class names changed every time I reloaded the page (likely a React app with generated class names, the kind CSS Modules or a CSS-in-JS library produces). Worse, they’d added a Cloudflare challenge that blocked headless browsers after a few requests.

What I Tried (and What Broke)

  1. Static parsing with requests + BeautifulSoup — returned an empty div. Classic.
  2. Selenium with Chrome — worked for 5-10 pages, then Cloudflare flagged my IP. Used stealth settings and proxies, still got blocked.
  3. Playwright with stealth plugins — same result. The site’s anti-bot logic was aggressive.
  4. OCR on screenshots — tried Tesseract to read the rendered page. Accuracy was terrible (fancy fonts, overlapping elements).
  5. Third-party scraping APIs — tried a few, but they either cost too much or returned incomplete data.
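To illustrate the first failure: a JavaScript-rendered page typically ships an almost empty HTML shell, so a static parser has nothing to find. Here is a minimal sketch using only the standard library's html.parser as a stand-in for BeautifulSoup. The SHELL_HTML markup and the product-card class name are made-up examples, not the retailer's real page.

```python
from html.parser import HTMLParser

# Hypothetical markup: what a plain HTTP fetch of a JS-rendered page often returns.
SHELL_HTML = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""


class CardCounter(HTMLParser):
    """Counts elements whose class attribute contains 'product-card'."""

    def __init__(self):
        super().__init__()
        self.cards = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "product-card" in classes:
            self.cards += 1


def count_product_cards(html):
    parser = CardCounter()
    parser.feed(html)
    return parser.cards


print(count_product_cards(SHELL_HTML))  # 0: the cards only exist after the script runs
```

The product cards are created in the browser after app.js executes, which is why only a real browser (or a screenshot of one) ever sees them.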

After three days of debugging, I was about to tell the client it was impossible.

The Accidental Discovery

While venting to a friend, he mentioned he’d been using AI to extract data from PDF invoices. “Why not try it on web pages?” he said. “Take a screenshot, send it to a vision model, and ask it to return JSON.”

I was skeptical. I’d used GPT-4 for text summarization, but for structured data? And wouldn’t it be slow and expensive?

But I was desperate. So I wrote a quick script:

import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

def fetch_and_extract(url):
    # Render the page in headless Chromium and capture a full-page PNG.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        screenshot = page.screenshot(full_page=True)
        browser.close()

    # The API accepts inline images as base64 data URLs.
    base64_image = base64.b64encode(screenshot).decode("utf-8")

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Extract product information from this screenshot.
                        Return a JSON object with a single key "products" whose value is an array of objects with fields: name, price (in USD), image_url (if visible), and availability (in stock/out of stock)."""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                    }
                ]
            }
        ],
        # JSON mode returns one top-level JSON object, so the prompt asks for an object.
        response_format={"type": "json_object"}
    )

    # The reply is a JSON string; parse it with json.loads before using it.
    return response.choices[0].message.content

data = fetch_and_extract("https://example-fashion-site.com/products")
print(data)

I ran it once. Ten seconds later, I had clean JSON with product names, prices formatted as “$49.99”, and stock status (it even read the “Add to cart” button and inferred “in stock”). I couldn’t believe it.
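The script returns the model's reply as a raw string, so it still has to be parsed before you can use it. Here is a minimal sketch, assuming the reply is either a bare JSON array or an object that wraps the array under a key such as "products". The key name and the parse_products helper are my own assumptions, not part of any API.

```python
import json


def parse_products(raw_reply, key="products"):
    """Turn the model's JSON string into a list of product dicts."""
    parsed = json.loads(raw_reply)
    # Accept a bare array, or an object that wraps the array under `key`.
    if isinstance(parsed, dict):
        parsed = parsed.get(key, [])
    if not isinstance(parsed, list):
        raise ValueError("expected a list of products")
    # Keep only entries that are JSON objects; drop anything else.
    return [item for item in parsed if isinstance(item, dict)]


sample = '{"products": [{"name": "Linen shirt", "price": "$49.99", "availability": "in stock"}]}'
print(parse_products(sample))
```

json.loads raises an exception on malformed output, which is usually what you want in a batch job: fail loudly on that page and move on.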

Why This Works (and Why It’s Not Magic)

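The key insight: modern vision models can read rendered text and understand layout almost as well as a human. They don’t care about class names or dynamic IDs, because they work from the same pixels the user sees. The browser still has to load the page to take the screenshot, so anti-bot blocking remains a separate problem; what disappears is the dependence on the DOM.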

In my case, the site was heavy on JavaScript but the final rendered page was clean. The model ignored the navigation bar, ads, and footer simply because my prompt asked it to “extract product information.”

The Trade-offs (Be Honest)

This approach isn’t a silver bullet. Here’s what I learned:

  • Cost: Each screenshot + prompt costs about $0.01–$0.03 with GPT-4o. For 1,000 pages, that’s $10–$30. Cheaper than manual extraction, but more expensive than a traditional scraper (when one works).
  • Latency: 5-15 seconds per page. Too slow for real-time scraping, but fine for batch jobs.
  • Accuracy: It’s not perfect. Sometimes the model hallucinates prices (“$19.99” when it’s actually $19.87). I had to add a post-processing validation step to check for obvious errors.
  • Privacy: Sending screenshots to OpenAI’s servers is something some clients won’t allow. Local models such as LLaVA or Qwen-VL are an alternative, though they tend to be less accurate.
  • Rate limits: The OpenAI API enforces rate limits, so I had to batch requests and add delays between them.
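The validation step mentioned under Accuracy can be quite simple. The sketch below rests on my own assumptions: prices arrive as strings like "$49.99", and the plausible range (1 to 5,000 USD here) is a hypothetical threshold you would tune per site. It flags suspicious items rather than fixing them, since a misread digit such as $19.99 versus $19.87 passes any format check and can only be caught by re-checking the page.

```python
import re

# Assumed price format: dollar sign, up to five digits, two decimals.
PRICE_PATTERN = re.compile(r"^\$(\d{1,5})\.(\d{2})$")


def validate_product(item, min_price=1.0, max_price=5000.0):
    """Return a list of problems found in one extracted product dict."""
    problems = []
    if not str(item.get("name") or "").strip():
        problems.append("missing name")
    match = PRICE_PATTERN.match(str(item.get("price") or ""))
    if not match:
        problems.append("price not in $0.00 format")
    else:
        value = float(f"{match.group(1)}.{match.group(2)}")
        if not min_price <= value <= max_price:
            problems.append("price outside plausible range")
    if item.get("availability") not in ("in stock", "out of stock"):
        problems.append("unknown availability")
    return problems


good = {"name": "Linen shirt", "price": "$49.99", "availability": "in stock"}
bad = {"name": "", "price": "49,99", "availability": "maybe"}
print(validate_product(good))  # []
print(validate_product(bad))
```

Items with a non-empty problem list can be queued for a second extraction pass or a manual spot check.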

When Should You Use This?

I now use this technique only when:

  • Traditional parsing is impossible (dynamic CSS, heavy JS, anti-bot walls that allow screenshots but block DOM access).
  • The data is partially in images (e.g., size charts, ratings as stars).
  • I need a quick prototype and don’t care about cost.

I still use BeautifulSoup + requests for simple sites. It’s faster, cheaper, and more reliable. But for the really nasty ones, AI is my new hammer.

What I’d Do Differently Next Time

  1. Try a local vision model first. For sensitive data, I’d run LLaVA 13B on a GPU. Slower but no data leaving my server.
  2. Use a better prompt. I learned to ask for specific fields, with examples of the desired output format. Few-shot prompting improved accuracy a lot.
  3. Add a caching layer. If the same page appears again, skip the API call.
  4. Test with a small batch first. Don’t send 1000 screenshots only to find the model confuses “size” with “price”.

By the way, the tool at ai.interwestinfo.com provides a similar service — but the technique is what matters. You can implement it yourself with any vision-capable API.

Over to You

I’m still not fully comfortable replacing parsing with “ask the AI.” But this experience made me realize that our old approaches have limits, and sometimes the best tool is the one that just looks at the page.

What’s your go-to method for extracting data from unfriendly websites? Have you tried using vision models, or do you still rely on XPath and regex wars?
