
zhongqiyue


I Thought I Knew Web Scraping — Until I Hit JavaScript

I've been scraping websites for years. Give me a static HTML page with clean CSS selectors and I'll have a BeautifulSoup script running in 10 minutes. No sweat.

But lately, the web has changed. Everyone's using React, Vue, or some other framework that renders content client-side. The data I want isn't in the initial HTML — it's hidden behind API calls, populating divs after the page loads.

A few weeks ago, I needed to extract product details from a modern e-commerce site. I had the URL, I knew the structure (or so I thought). I wrote my trusty requests + BeautifulSoup script, ran it, and… got nothing. Empty containers. None values everywhere.

That's when the real problem hit me: dynamic content.
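To make the failure concrete, here is a minimal sketch of what a static parser sees on a client-rendered page. It uses the standard library's `html.parser` in place of BeautifulSoup so it runs without installs; the HTML snippet, the `product-title` class, and the helper names are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML of a client-rendered page: the container exists,
# but JavaScript has not filled it in yet.
INITIAL_HTML = """
<html><body>
  <div id="root"><div class="product-title"></div></div>
  <script src="/static/app.js"></script>
</body></html>
"""


class TextInClass(HTMLParser):
    """Collect text found inside elements that carry a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, class_name):
    """Return the text inside the first-level match, or None if it is empty."""
    parser = TextInClass(class_name)
    parser.feed(html)
    return " ".join(parser.chunks) or None


print(extract_text(INITIAL_HTML, "product-title"))
```

The container is present in the response, so the selector matches, but there is no text inside it until the browser runs the JavaScript. The depth counter is deliberately naive and would miscount void tags such as `<br>`, which is fine for this snippet.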

What I Tried (And What Failed)

First, I dug into the network tab. Found the XHR requests, mimicked them with my own headers and cookies. It worked for a while, but then the site added CSRF tokens and rate-limiting. My script broke again.
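The token handling in that kind of XHR replay can be sketched roughly as below. This is an assumption-heavy illustration: the `csrf-token` meta tag, the `X-CSRF-Token` header name, and both helper functions are hypothetical, and real sites place and name their tokens differently.

```python
import re


def extract_csrf_token(html):
    """Return the content of <meta name="csrf-token" ...>, or None if absent."""
    match = re.search(r'<meta\s+name="csrf-token"\s+content="([^"]+)"', html)
    return match.group(1) if match else None


def build_xhr_headers(token, cookie_header):
    """Headers shaped like a browser XHR; the header names are assumptions."""
    return {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "X-CSRF-Token": token,
        "Cookie": cookie_header,
    }


# Hypothetical page source and session cookie, for illustration only
page_html = '<head><meta name="csrf-token" content="abc123"></head>'
token = extract_csrf_token(page_html)
headers = build_xhr_headers(token, "session=xyz")
print(headers["X-CSRF-Token"])
# These headers would then be passed to an HTTP client,
# e.g. requests.get(api_url, headers=headers)
```

Every piece here is something the site can change without notice, which is why this route kept breaking.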

Next, I tried Playwright driving headless Chromium. I could load the page, wait for selectors, and extract text. But the selectors were unstable: class names changed with every deploy, and I had to rewrite them weekly. Maintenance was a nightmare.

I even experimented with regex on the raw JavaScript bundles, trying to find JSON embedded in script tags. That worked for exactly one version of the app. Then they code-split and minified everything.
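For reference, the embedded-JSON trick looks something like the sketch below. The variable name `window.__INITIAL_STATE__` is an assumption (apps pick their own), and the sample HTML is invented. It uses `json.JSONDecoder.raw_decode` rather than a greedy regex so the object boundaries are found by the JSON parser itself.

```python
import json
import re


def extract_embedded_state(html, variable="window.__INITIAL_STATE__"):
    """Find `<variable> = {...}` in the page source and decode the JSON value."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (semicolons, more script code)
        state, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return state


# Hypothetical page source, for illustration only
sample = (
    "<script>window.__INITIAL_STATE__ = "
    '{"product": {"name": "Widget", "price": "19.99"}};</script>'
)
state = extract_embedded_state(sample)
print(state["product"]["name"])
```

Once the bundle is code-split or the state moves behind an API call, the function simply returns `None`, which matches how this approach stopped working for me.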

I was stuck. Every approach required constant babysitting.

The Approach That Finally Worked

Then a colleague suggested a different mindset: Stop trying to parse HTML. Let the browser render the page, then ask an AI what it sees.

The idea is simple: use a headless browser to load the page and wait for JavaScript to finish. Take a screenshot (or get the rendered DOM), then pass that visual or textual representation to a vision-capable language model. The model can extract structured data based on natural language instructions.
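For the textual variant (rendered DOM instead of a screenshot), one possible preprocessing step is to reduce the rendered HTML to its visible text before sending it to a model. The sketch below assumes the HTML string came from something like Playwright's `await page.content()`; the class and function names and the `max_chars` limit are hypothetical choices, not part of any library.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping tags whose content is never displayed."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth:
            self.parts.append(text)


def visible_text(rendered_html, max_chars=8000):
    """Return one text chunk per line, truncated to keep the prompt small."""
    parser = VisibleText()
    parser.feed(rendered_html)
    parser.close()
    return "\n".join(parser.parts)[:max_chars]


# Hypothetical rendered markup, for illustration only
html = "<body><h1>Widget</h1><script>var x = 1;</script><p>$19.99</p></body>"
print(visible_text(html))
```

Text input is usually cheaper than an image, but it loses layout cues such as which price sits next to which product, so it is a trade-off rather than a strict improvement.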

This completely changes the game. Instead of fragile CSS selectors or reverse-engineering APIs, you say: "Find the product name, price, and description from the main content area." The AI figures out the layout.

Here's a working example using Python, Playwright, and OpenAI's GPT-4o vision model:

import asyncio
import base64
from playwright.async_api import async_playwright
from openai import AsyncOpenAI

# You could also use a different AI service, e.g., from https://ai.interwestinfo.com/ for vision
# or Anthropic Claude, Gemini, etc.

client = AsyncOpenAI()

async def scrape_with_vision(url, instruction):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")

        # Wait for dynamic content to appear
        await page.wait_for_timeout(2000)

        # Take a screenshot as base64
        screenshot = await page.screenshot(type="png", full_page=True)
        screenshot_b64 = base64.b64encode(screenshot).decode()

        await browser.close()

    # Send to vision model
    response = await client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model that supports JSON mode
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Extract the following from the webpage screenshot: {instruction}. Return as JSON."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        response_format={"type": "json_object"}
    )
    return response.choices[0].message.content

# Example usage
if __name__ == "__main__":
    url = "https://example.com/dynamic-product-page"
    instruction = "product name, price, and a short description"
    result = asyncio.run(scrape_with_vision(url, instruction))
    print(result)

Now, instead of maintaining dozens of selectors, I maintain one natural-language instruction per template. If the site redesigns, I re-run the script unchanged, and so far the model has still found the fields.

Lessons Learned

This approach isn't perfect. Here are the trade-offs I discovered:

Pros:

  • Resistant to layout changes — The AI sees the visual output, not the DOM structure.
  • No need to reverse-engineer JS — Just load and wait for network idle.
  • Works with SPAs, shadow DOM, Canvas — Anything the human eye can read.

Cons:

  • Cost — Each page scrape costs API credits (GPT-4 Vision is ~$0.01 per image). For small jobs it's fine; for thousands of pages, it adds up.
  • Latency — A single scrape takes 5–15 seconds (browser startup + screenshot + API call). Not suitable for real-time scraping.
  • Accuracy — The model can hallucinate numbers or misread prices in images. You need validation logic on the output.
  • Rate limits — You're at the mercy of both the headless browser's concurrency and the AI provider's limits.
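The validation logic mentioned under Accuracy could start out like the sketch below. The required field names mirror the example instruction (name, price, description), but the schema, the `price_value` key, and the function name are assumptions, and the price cleanup assumes a dot as the decimal separator.

```python
import json
import re
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("name", "price", "description")  # hypothetical schema


def validate_product(raw_json):
    """Return (data, errors); data is None when the reply is not a JSON object."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    errors = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not str(data.get(field) or "").strip()
    ]

    price_text = str(data.get("price") or "")
    if price_text:
        # Assumes a dot as decimal separator: "$1,299.00" becomes "1299.00"
        cleaned = re.sub(r"[^\d.]", "", price_text)
        try:
            price = Decimal(cleaned)
        except InvalidOperation:
            errors.append(f"unparseable price: {price_text!r}")
        else:
            if price > 0:
                data["price_value"] = price
            else:
                errors.append("price must be positive")
    return data, errors


good = '{"name": "Widget", "price": "$1,299.00", "description": "A sturdy widget"}'
data, errors = validate_product(good)
print(data["price_value"], errors)
```

This only catches malformed output. A price that is well-formed but wrong still gets through, so spot checks against the live page remain worthwhile.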

When NOT to use this:

  • If the target site has a clean, versioned API — just use that directly.
  • If you need to crawl millions of pages — the cost and time will kill you.
  • If the data is behind login or CAPTCHA — you'd still need to handle authentication.
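To put rough numbers on the cost and time concern, here is a back-of-the-envelope estimate. It takes the ~$0.01 per image figure above at face value (real cost varies with screenshot size and detail setting) and assumes 10 seconds per page, the middle of the 5–15 second range. The function and its parameters are hypothetical.

```python
def estimate_job(pages, cost_per_page=0.01, seconds_per_page=10.0, concurrency=1):
    """Return (total cost in dollars, wall-clock hours) for a scraping job."""
    total_cost = pages * cost_per_page
    total_hours = pages * seconds_per_page / concurrency / 3600
    return total_cost, total_hours


for pages in (1_000, 1_000_000):
    cost, hours = estimate_job(pages)
    print(f"{pages:,} pages: ~${cost:,.0f}, ~{hours:,.1f} hours sequential")
```

Under those assumptions a thousand pages is about $10 and under three hours, while a million pages is about $10,000 and roughly 2,800 hours (around 116 days) if run one at a time. Concurrency shortens the wall-clock time but not the bill.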

What I'd Do Differently Next Time

If I had to start over, I'd first check if there's a public API or a simpler data source. Only after exhausting cheap options would I reach for the vision model.

Also, I'd cache the screenshots and results aggressively. If you scrape a product catalog daily, you probably don't need to re-analyze every product — just detect changes.
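One way to sketch that caching idea is to key results by a hash of the screenshot bytes, so an unchanged page never reaches the model a second time. `ResultCache`, `extract_cached`, and the stand-in `fake_analyze` function are all hypothetical names. In a real run `analyze` would wrap the vision API call.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class ResultCache:
    """Store one JSON result per screenshot, keyed by the SHA-256 of its bytes."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, screenshot_bytes):
        digest = hashlib.sha256(screenshot_bytes).hexdigest()
        return self.directory / f"{digest}.json"

    def get(self, screenshot_bytes):
        path = self._path(screenshot_bytes)
        return json.loads(path.read_text()) if path.exists() else None

    def put(self, screenshot_bytes, result):
        self._path(screenshot_bytes).write_text(json.dumps(result))


def extract_cached(screenshot_bytes, cache, analyze):
    """Return the cached result, calling analyze() only for unseen screenshots."""
    cached = cache.get(screenshot_bytes)
    if cached is not None:
        return cached
    result = analyze(screenshot_bytes)
    cache.put(screenshot_bytes, result)
    return result


calls = []


def fake_analyze(image_bytes):
    """Stand-in for the model call; records that it was invoked."""
    calls.append(len(image_bytes))
    return {"name": "Widget"}


with tempfile.TemporaryDirectory() as tmp:
    cache = ResultCache(tmp)
    first = extract_cached(b"fake-png-bytes", cache, fake_analyze)
    second = extract_cached(b"fake-png-bytes", cache, fake_analyze)

print(first == second, len(calls))
```

A caveat: screenshots can differ byte for byte because of ads, carousels, or timestamps even when the product data is unchanged, so hashing the visible text or a cropped region may detect real changes more reliably.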

Finally, I'd explore open-source vision models (like LLaVA or Qwen-VL) to run locally, cutting costs and latency. The trade-off is lower accuracy, but for predictable layouts it might be enough.

Your Turn

This method saved me from rewriting my scraper every week — but it's not a silver bullet. I'm still experimenting with hybrid approaches: using traditional selectors for stable elements and AI for dynamic ones.
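The hybrid idea could be wired up roughly like this: try the cheap selector path for every field, then make a single model call for whatever is still missing. The function and the two callables are hypothetical; `selector_extract` would wrap Playwright or BeautifulSoup lookups and `ai_extract` would wrap the vision call.

```python
def extract_hybrid(fields, selector_extract, ai_extract):
    """Use selectors where they work; ask the model once for the remaining fields.

    Returns (result, missing) where missing lists the fields that fell back.
    """
    result = {}
    missing = []
    for field in fields:
        value = selector_extract(field)
        if value:
            result[field] = value
        else:
            missing.append(field)
    if missing:
        result.update(ai_extract(missing))
    return result, missing


# Stand-ins for illustration: only "name" has a working selector
stable_values = {"name": "Widget"}


def by_selector(field):
    return stable_values.get(field)


def by_model(fields):
    return {field: f"<{field} from model>" for field in fields}


result, fell_back = extract_hybrid(["name", "price"], by_selector, by_model)
print(result, fell_back)
```

Logging which fields fall back, and how often, would also show when a selector has quietly broken after a deploy.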

Have you faced similar challenges with dynamic web scraping? What's your go-to technique when CSS selectors fail? I'd love to hear what's working (or not working) for you in the comments.
