I've been scraping websites for years. Give me a static HTML page with clean CSS selectors and I'll have a BeautifulSoup script running in 10 minutes. No sweat.
But lately, the web has changed. Everyone's using React, Vue, or some other framework that renders content client-side. The data I want isn't in the initial HTML — it's hidden behind API calls, populating divs after the page loads.
A few weeks ago, I needed to extract product details from a modern e-commerce site. I had the URL, I knew the structure (or so I thought). I wrote my trusty requests + BeautifulSoup script, ran it, and… got nothing. Empty containers. None values everywhere.
That's when the real problem hit me: dynamic content.
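To make the failure concrete, here is a minimal sketch of what a client-rendered page looks like to a plain HTTP client. The HTML shell and the `product-name` id are invented for illustration, and I use the standard library's `html.parser` instead of BeautifulSoup so the snippet runs without extra installs. The tiny parser ignores edge cases such as unclosed void tags.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML of a client-rendered page: the container exists,
# but JavaScript has not filled it in yet.
SPA_SHELL = """
<html>
  <body>
    <div id="root">
      <h1 id="product-name"></h1>
    </div>
    <script src="/static/bundle.js"></script>
  </body>
</html>
"""


class TextById(HTMLParser):
    """Collect the text inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # > 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the stripped text of the element with target_id ('' if empty or absent)."""
    parser = TextById(target_id)
    parser.feed(html)
    return "".join(parser.chunks).strip()
```

Calling `text_by_id(SPA_SHELL, "product-name")` gives an empty string: the element is there, but the text only arrives after the JavaScript bundle runs in a browser.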
What I Tried (And What Failed)
First, I dug into the network tab. Found the XHR requests, mimicked them with my own headers and cookies. It worked for a while, but then the site added CSRF tokens and rate-limiting. My script broke again.
Next, I tried Playwright with headless Chromium. I could load the page, wait for selectors, and extract text. But the selectors were unstable: class names changed with every deploy, and I had to rewrite them weekly. Maintenance was a nightmare.
I even experimented with regex on the raw JavaScript bundles, trying to find JSON embedded in script tags. That worked for exactly one version of the app. Then they code-split and minified everything.
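For reference, the embedded-JSON trick looked roughly like the sketch below. The page source and the `__APP_STATE__` script id are made up for this example; real sites use different ids and shapes, which is exactly why this kept breaking for me.

```python
import json
import re

# Hypothetical page source. Some frameworks embed their initial state as JSON
# in a script tag; the id "__APP_STATE__" is invented for this example.
PAGE_SOURCE = """
<html><body>
<div id="root"></div>
<script id="__APP_STATE__" type="application/json">
{"product": {"name": "Blue Mug", "price": "12.50"}}
</script>
</body></html>
"""

# Capture everything between the opening and closing script tags (non-greedy).
STATE_PATTERN = re.compile(
    r'<script[^>]*id="__APP_STATE__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_embedded_state(html):
    """Return the parsed embedded JSON, or None if it is missing or invalid."""
    match = STATE_PATTERN.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```

Returning `None` instead of raising made it easy to notice when a deploy had moved or renamed the script tag.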
I was stuck. Every approach required constant babysitting.
The Approach That Finally Worked
Then a colleague suggested a different mindset: Stop trying to parse HTML. Let the browser render the page, then ask an AI what it sees.
The idea is simple: use a headless browser to load the page and wait for JavaScript to finish. Take a screenshot (or get the rendered DOM), then pass that visual or textual representation to a vision-capable language model. The model can extract structured data based on natural language instructions.
This shifts the work from describing where the data lives to describing what the data is. Instead of writing fragile CSS selectors or reverse-engineering APIs, you say: "Find the product name, price, and description from the main content area." The AI figures out the layout.
Here's a working example using Python, Playwright, and OpenAI's GPT-4o, which accepts image input:
```python
import asyncio
import base64

from playwright.async_api import async_playwright
from openai import AsyncOpenAI

# You could also use a different AI service, e.g., from https://ai.interwestinfo.com/ for vision
# or Anthropic Claude, Gemini, etc.
client = AsyncOpenAI()


async def scrape_with_vision(url, instruction):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        # Wait for dynamic content to appear
        await page.wait_for_timeout(2000)
        # Take a screenshot as base64
        screenshot = await page.screenshot(type="png", full_page=True)
        screenshot_b64 = base64.b64encode(screenshot).decode()
        await browser.close()

    # Send to vision model
    response = await client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model that supports JSON mode
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"Extract the following from the webpage screenshot: {instruction}. Return as JSON.",
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{screenshot_b64}",
                            "detail": "high",
                        },
                    },
                ],
            }
        ],
        response_format={"type": "json_object"},
    )
    # The reply is a JSON string; parse it with json.loads() before use
    return response.choices[0].message.content


# Example usage
url = "https://example.com/dynamic-product-page"
instruction = "product name, price, and a short description"
result = asyncio.run(scrape_with_vision(url, instruction))
print(result)
```
Now, instead of maintaining dozens of selectors, I maintain one natural-language instruction per template. If the site redesigns, I re-run the script, and as long as the same information is still visible on the page, the model usually finds it again.
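`scrape_with_vision` returns the model's reply as a raw JSON string, so I run it through a check before trusting it. The sketch below is one way to do that. It assumes the reply uses the keys `name`, `price`, and `description` (you would need to name those keys in the instruction to get them reliably) and that prices use a US-style format such as `$1,299.00`. The helper `validate_product` is my own name, not part of any library.

```python
import json
import re
from decimal import Decimal, InvalidOperation

REQUIRED_FIELDS = ("name", "price", "description")  # assumed key names


def validate_product(raw_json):
    """Parse the model's JSON reply and return (product, errors)."""
    errors = []
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]

    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not str(data.get(field) or "").strip():
            errors.append(f"missing field: {field}")

    price = None
    # Pull the first number out of strings such as "$1,299.00".
    match = re.search(r"\d[\d,]*(?:\.\d+)?", str(data.get("price") or ""))
    if match:
        try:
            price = Decimal(match.group(0).replace(",", ""))
        except InvalidOperation:
            price = None
    if price is None or price <= 0:
        errors.append("price is not a positive number")

    product = {
        "name": data.get("name"),
        "price": price,
        "description": data.get("description"),
    }
    return product, errors
```

A non-empty `errors` list is my signal to retry the page or flag it for manual review rather than write the record to the database.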
Lessons Learned
This approach isn't perfect. Here are the trade-offs I discovered:
Pros:
- Resistant to layout changes — The AI sees the visual output, not the DOM structure.
- No need to reverse-engineer JS — Just load and wait for network idle.
- Works with SPAs, shadow DOM, Canvas — Anything the human eye can read.
Cons:
- Cost — Each page scrape costs API credits (GPT-4 Vision is ~$0.01 per image). For small jobs it's fine; for thousands of pages, it adds up.
- Latency — A single scrape takes 5–15 seconds (browser startup + screenshot + API call). Not suitable for real-time scraping.
- Accuracy — The model can hallucinate numbers or misread prices in images. You need validation logic on the output.
- Rate limits — You're at the mercy of both the headless browser's concurrency and the AI provider's limits.
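To stay under both limits, I cap how many scrapes run at once and back off when a call fails. The sketch below shows one way to do that with `asyncio.Semaphore`. The function name `run_limited` and its default values are my own choices, and `scrape` stands for any coroutine function that takes a URL.

```python
import asyncio
import random


async def run_limited(urls, scrape, max_concurrency=3, max_retries=3, base_delay=1.0):
    """Run scrape(url) for every URL with bounded concurrency and retries.

    Returns a dict mapping each URL to its result, or to the last exception.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            for attempt in range(max_retries + 1):
                try:
                    return url, await scrape(url)
                except Exception as exc:  # narrow this to rate-limit errors in real use
                    if attempt == max_retries:
                        return url, exc
                    # Exponential backoff with jitter before the next attempt.
                    delay = base_delay * (2 ** attempt)
                    await asyncio.sleep(delay + random.uniform(0, delay / 2))

    pairs = await asyncio.gather(*(worker(url) for url in urls))
    return dict(pairs)
```

With the earlier example, `scrape` could be `lambda u: scrape_with_vision(u, instruction)`. A retrying worker keeps its semaphore slot while it sleeps, which deliberately slows the whole job down when the provider is pushing back.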
When NOT to use this:
- If the target site has a clean, versioned API — just use that directly.
- If you need to crawl millions of pages — the cost and time will kill you.
- If the data is behind login or CAPTCHA — you'd still need to handle authentication.
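A quick back-of-the-envelope calculation shows why scale matters. The defaults below reuse my rough figures from above ($0.01 per page and about 10 seconds per scrape); both are assumptions that vary with the model, the image size, and the detail setting, so plug in your own numbers.

```python
def estimate_job(pages, cost_per_page=0.01, seconds_per_page=10.0, concurrency=1):
    """Return (total_cost_usd, wall_clock_hours) for a scraping job."""
    total_cost = pages * cost_per_page
    hours = pages * seconds_per_page / concurrency / 3600
    return total_cost, hours


for pages in (100, 10_000, 1_000_000):
    cost, hours = estimate_job(pages, concurrency=5)
    print(f"{pages:>9,} pages: ~${cost:,.0f}, ~{hours:,.1f} hours at 5 concurrent scrapes")
```

Under these assumptions, a million pages comes to roughly $10,000 and about 556 hours of wall-clock time (over three weeks) even with five scrapes running in parallel.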
What I'd Do Differently Next Time
If I had to start over, I'd first check if there's a public API or a simpler data source. Only after exhausting cheap options would I reach for the vision model.
Also, I'd cache the screenshots and results aggressively. If you scrape a product catalog daily, you probably don't need to re-analyze every product — just detect changes.
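One simple way to detect changes is to hash the screenshot bytes and only call the model when the hash differs from the last run. The class below is a minimal in-memory sketch. `ScreenshotCache` and `analyze` are hypothetical names, and `analyze` is synchronous here for brevity, so an async version would await it instead.

```python
import hashlib


class ScreenshotCache:
    """In-memory cache keyed by URL, invalidated when the screenshot bytes change."""

    def __init__(self):
        self._entries = {}  # url -> (screenshot_digest, result)

    def get_or_analyze(self, url, screenshot, analyze):
        """Return the cached result for url, or call analyze(screenshot) if the page changed."""
        digest = hashlib.sha256(screenshot).hexdigest()
        cached = self._entries.get(url)
        if cached is not None and cached[0] == digest:
            return cached[1]  # unchanged page: skip the model call
        result = analyze(screenshot)
        self._entries[url] = (digest, result)
        return result
```

Byte-identical screenshots are less common than you might hope: timestamps, carousels, and ads all change the pixels. Hashing only the product area, or the rendered text instead of the image, would probably be more forgiving, though I haven't tested that.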
Finally, I'd explore open-source vision models (like LLaVA or Qwen-VL) to run locally, cutting costs and latency. The trade-off is lower accuracy, but for predictable layouts it might be enough.
Your Turn
This method saved me from rewriting my scraper every week — but it's not a silver bullet. I'm still experimenting with hybrid approaches: using traditional selectors for stable elements and AI for dynamic ones.
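The hybrid idea, roughly: try the cheap extractors first and ask the model only for the fields they fail to return. This is a sketch of where I'm heading, not settled code. `extract_hybrid`, `selector_extractors`, and `ai_extract` are placeholder names, and `ai_extract` is assumed to take a list of field names and return a dict.

```python
def extract_hybrid(page, selector_extractors, ai_extract):
    """Use cheap extractors first, then ask the model only for missing fields.

    selector_extractors maps a field name to a function that takes the page
    and returns a value (or None). ai_extract takes a list of field names
    and returns a dict with values for them.
    """
    result = {}
    missing = []
    for field, extractor in selector_extractors.items():
        try:
            value = extractor(page)
        except Exception:
            value = None  # a broken selector should not abort the scrape
        if value:
            result[field] = value
        else:
            missing.append(field)
    if missing:
        # Only pay for a model call when a selector came back empty.
        result.update(ai_extract(missing))
    return result
```

When every selector still works, the model is never called, so the cost of the vision step only shows up on the pages that actually need it.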
Have you faced similar challenges with dynamic web scraping? What's your go-to technique when CSS selectors fail? I'd love to hear what's working (or not working) for you in the comments.