If your "scraper" is a `requests.get()` followed by `re.findall(r'<div class="price">.*?</div>', html)`, I have bad news.
You don't have a scraper. You have a layout sensor. The first time the dev team renames the class, adds a wrapper <span>, or A/B tests a new pricing component, your pipeline goes silent. Not loud, not error-throwing — silent. Empty rows in the dataset. No alarm. You find out a week later when a stakeholder asks why the dashboard looks weird.
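The failure mode is easy to reproduce. When the markup shifts, `re.findall` doesn't raise; it just returns an empty list. A minimal sketch (the HTML snippets are hypothetical):

```python
import re

PATTERN = r'<div class="price">(.*?)</div>'

old_html = '<div class="price">€350,000</div>'
new_html = '<span class="listing-price">€350,000</span>'  # after a redesign

print(re.findall(PATTERN, old_html))  # ['€350,000']
print(re.findall(PATTERN, new_html))  # []: no exception, no warning, just nothing
```

That empty list flows straight into your dataset as blank rows, which is exactly the silent failure described above.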
I rebuilt our Idealista scraper this quarter and the regex stage was the thing I deleted first.
## The 3-item checklist
Before you write another `re.findall` against HTML, check:

- Is there a stable accessibility role or label? (`getByRole('heading', { name: /price/i })` survives class renames.)
- Is the data actually in the rendered page, or is it injected via JSON? (Often the JSON-LD `<script>` block has everything you need, no DOM walking.)
- Can you assert the schema fails loud? (If a field is missing, throw; don't silently default to `None`.)
If the answer to all three is no, you're not scraping. You're guessing.
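The third point is the one most scrapers skip. A minimal fail-loud sketch (the field names are illustrative, matching the extractor later in this post):

```python
REQUIRED = ("price", "currency", "address", "url")

def validate(record: dict) -> dict:
    """Raise on missing fields instead of silently emitting half-empty rows."""
    missing = [k for k in REQUIRED if not record.get(k)]
    if missing:
        raise ValueError(f"listing missing fields: {missing}")
    return record

validate({"price": "350000", "currency": "EUR",
          "address": "Calle Mayor 1", "url": "https://example.com/1"})  # passes

# validate({"price": "350000"})  # raises ValueError, loudly
```

The point is that a schema change now shows up as a stack trace in your logs the same day, not as blank cells in a CSV a week later.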
## The 10-line replacement
Here's the pattern I keep copying into new actors:
```python
import json

from playwright.async_api import async_playwright


async def extract_listing(page):
    # Pull JSON-LD first — it's the spec, not the styling.
    ld_json = await page.locator('script[type="application/ld+json"]').first.text_content()
    data = json.loads(ld_json)
    return {
        "price": data["offers"]["price"],
        "currency": data["offers"]["priceCurrency"],
        "address": data["address"]["streetAddress"],
        "url": page.url,
    }
```
Ten lines. No regex. No CSS class names. No BeautifulSoup chain that breaks when someone wraps the price in a new <div>.
Why this works: JSON-LD is what Idealista, Bayut, and most listing sites publish for Google. It's stable because it's a contract with search engines, not with your scraper. When the visual layout changes, the JSON-LD almost always doesn't.
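To make the contract concrete, here's a simplified, hypothetical JSON-LD payload of the shape many listing pages embed for search engines, pushed through the same key paths the extractor above reads:

```python
import json

# Simplified schema.org-style payload; real pages carry more fields,
# but the offers/address key paths are what the extractor relies on.
ld_json = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "offers": {"@type": "Offer", "price": "350000", "priceCurrency": "EUR"},
  "address": {"@type": "PostalAddress", "streetAddress": "Calle Mayor 1"}
}
"""

data = json.loads(ld_json)
print(data["offers"]["price"], data["offers"]["priceCurrency"])  # 350000 EUR
print(data["address"]["streetAddress"])  # Calle Mayor 1
```

Notice there isn't a CSS class name anywhere in that payload: the dev team can rename every wrapper on the page and this structure stays put.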
## Quick case
Our Idealista actor went from 4 selector-related breakages per month to zero in the quarter after I switched extraction to JSON-LD + accessibility selectors. The breakages we still see are real changes — new property types, new fields — and they fail loud now, with a schema validation error, instead of silently returning empty strings.
That's the bar: when the site changes, your scraper either keeps working or throws an error you can read. "Returns empty rows" is not acceptable behaviour.
## The CTA you didn't ask for
This pattern is now the default starter for every actor we ship — visible in the Idealista actor. Faster runs, fewer 3am Slack messages from clients asking why their CSV is half-empty. We turned the JSON-LD-first extractor into a reusable module that drops into any new actor in about a minute.
So:
Open your scraper. Search for `re.findall`, `re.search`, or BeautifulSoup chained more than two `.find()` calls deep. Drop the worst offender in the comments and I'll show you the JSON-LD or selector replacement.
Agree, disagree, or got a site where this falls apart? Reply.
Written by **Nova Chen**, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.
