DEV Community

SIÁN Agency

Posted on • Originally published at apify.com

Stop Fighting the DOM. Selector-First Thinking Will Save Your Scraper.

Most broken scrapers I see have the same shape: someone wrote the extraction logic first and the selectors second. The selectors were an afterthought — whatever worked in DevTools at 2am.

That's backwards. Selectors are the contract between your code and the page. Get them wrong and the rest of your scraper is irrelevant.

The mindset shift

Selector-first thinking means: before you write a single line of extraction code, you decide how the data is identified. Not "how do I get the price?" but "what does the page tell me, programmatically, that this thing is a price?"

Three answers, in order of preference:

  1. Semantics: getByRole, getByLabel, getByText. These mirror what an accessibility tree exposes. They survive design changes.
  2. Data attributes: data-testid, data-product-id, itemprop. Devs often add these for their own tests; you get to free-ride.
  3. Structured data: JSON-LD, microdata, OpenGraph. The page is already telling Google what's a price; let it tell you too.

CSS classes are the last resort. Class names are styling, not identity. They change when the design changes. They're the equivalent of asking for "the third button from the top": it works until someone rearranges the menu.

The 3-item checklist

Before you write a selector:

  1. Open the accessibility tree in DevTools (Chrome: Elements → Accessibility tab). If the data has a role and an accessible name, use getByRole.
  2. Search the page source for application/ld+json. If it's there and contains your fields, parse it directly. No DOM walking needed.
  3. Look for data-* attributes near the data. Devs leave testing hooks everywhere. Use theirs.
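Step 2 of the checklist is worth a sketch, because real JSON-LD is messier than a single tidy object. The helper below is a minimal, assumption-heavy version: the names findOfferPrice and priceFromJsonLd are hypothetical, and it assumes the price lives under an offers key, either directly, inside an array, or nested in an @graph wrapper.

```javascript
// Walk parsed JSON-LD and return the first offers.price found, or null.
function findOfferPrice(node) {
  if (Array.isArray(node)) {
    for (const item of node) {
      const price = findOfferPrice(item);
      if (price != null) return price;
    }
    return null;
  }
  if (node === null || typeof node !== 'object') return null;

  // offers may be a single object or an array of offers.
  const offers = Array.isArray(node.offers) ? node.offers : [node.offers];
  for (const offer of offers) {
    if (offer && offer.price != null) return offer.price;
  }

  // Some sites wrap their entities in an @graph array.
  return findOfferPrice(node['@graph'] ?? null);
}

// Try every raw JSON-LD block in order; skip blocks that are not valid JSON.
function priceFromJsonLd(rawBlocks) {
  for (const raw of rawBlocks) {
    try {
      const price = findOfferPrice(JSON.parse(raw));
      if (price != null) return price;
    } catch {
      // Malformed block: ignore it and try the next one.
    }
  }
  return null;
}
```

With Playwright, the raw blocks could come from page.locator('script[type="application/ld+json"]').allTextContents(), which resolves to an empty array when the page has no JSON-LD at all.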

If none of those work, then fall back to CSS or XPath. And when you do, anchor to something stable (a parent landmark, an aria-label, a data-* attribute), not just a class chain.
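Here is roughly what "anchor to something stable" can look like in Playwright. Treat it as a sketch: anchoredPriceLocator is a hypothetical helper, and both the listing-card test id and the .price-tag class are assumptions about the page, not something every site exposes.

```javascript
// Hypothetical helper: scope a fragile class selector to stable anchors,
// so a restyle elsewhere on the page cannot change what it matches.
function anchoredPriceLocator(page) {
  return page
    .getByRole('main')                         // landmark anchor
    .locator('[data-testid="listing-card"]')   // assumed testing hook
    .locator('.price-tag');                    // fragile class, now scoped
}
```

The class is still there, but it only has to be unique inside one card inside the main landmark, instead of across the whole document.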

The 10-line replacement

Here's the priority I use in every new actor:

async function extractPrice(page) {
  // 1. Structured data first. allTextContents() resolves to [] when the page
  //    has no JSON-LD, so we fall through instead of waiting for a timeout.
  const blocks = await page.locator('script[type="application/ld+json"]')
                           .allTextContents();
  for (const raw of blocks) {
    try {
      const data = JSON.parse(raw);
      if (data?.offers?.price != null) return data.offers.price;
    } catch {
      // Malformed JSON-LD block: skip it and try the next one.
    }
  }

  // 2. Semantic selectors. first() avoids a strict-mode error on multiple matches.
  const priceByLabel = page.getByLabel(/^price$/i);
  if (await priceByLabel.count()) return priceByLabel.first().textContent();

  // 3. Data attributes.
  const priceByData = page.locator('[data-testid="price"]');
  if (await priceByData.count()) return priceByData.first().textContent();

  // 4. Last resort: CSS class. Logged loudly so we know we're in fallback.
  console.warn('Falling back to CSS selector — selector audit needed.');
  return page.locator('.price-tag').first().textContent();
}

Notice the console.warn() in the fallback path. When that warning starts appearing in your logs, the site has changed its higher-priority signals and you're one design refresh away from breakage. Fix it before it breaks, not after.
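A warning is easier to act on when you can see how often it fires. One way to do that, sketched below, is to count which rung of the ladder served each extraction. createRungStats, the rung names, and the 5% threshold are all hypothetical choices for illustration.

```javascript
// Hypothetical helper: count which rung of the ladder produced each value.
function createRungStats() {
  const counts = new Map();
  let total = 0;
  return {
    record(rung) {
      counts.set(rung, (counts.get(rung) ?? 0) + 1);
      total += 1;
    },
    // Share of extractions served by `rung`, between 0 and 1.
    share(rung) {
      return total === 0 ? 0 : (counts.get(rung) ?? 0) / total;
    },
  };
}

const stats = createRungStats();
// In a real run, call stats.record(...) at each return point of the extractor.
['json-ld', 'json-ld', 'json-ld', 'css'].forEach((rung) => stats.record(rung));

// Assumed threshold: flag the run when more than 5% of hits used the CSS rung.
if (stats.share('css') > 0.05) {
  const percent = (stats.share('css') * 100).toFixed(0);
  console.warn(`CSS fallback served ${percent}% of extractions.`);
}
```

A rising share on the bottom rung is the early signal; the absolute number matters less than the trend between runs.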

Fig. 1 — Selector-priority ladder. Top is most stable. Bottom is most fragile.

Quick case

On our Idealista actor, the priority order above turned a "fix the selector every 6 weeks" routine into a "fix the selector twice a year" routine. The JSON-LD path catches 95% of listings without ever touching the DOM. The accessibility-role fallback catches another 4%. The CSS fallback fires on edge-case property types and tells us when a new layout has shipped — usually a week before any of our other monitoring would have noticed.

The CTA you didn't ask for

This selector ladder is the second thing every actor we ship gets, right after the request blocking from last week's post — see it in action in the Idealista actor. It's so consistent we made it a util.

So:

Open your scraper's selector code right now. Count how many class-name chains you have versus semantic / structured-data lookups. Drop the ratio in the comments. Bonus points for the longest CSS chain — I bet someone has .product-grid > .item:nth-child(3) > .price > span > strong.
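If counting by hand sounds tedious, a rough regex pass gets you a usable ratio. This is a heuristic sketch, not a parser: auditSelectors is a hypothetical name, it only sees string literals passed to locator(), querySelector(), querySelectorAll(), $() or $$(), and it treats any selector containing a data-*, aria-*, itemprop, or role attribute (or the JSON-LD script type) as stable.

```javascript
// Heuristic audit: classify selectors found in scraper source code.
function auditSelectors(source) {
  // Playwright's built-in semantic locators.
  const semanticCalls =
    source.match(/\.getBy(?:Role|Label|Text|TestId|AltText|Title|Placeholder)\(/g) ?? [];

  // String literals passed to locator(), querySelector(All)() or $()/$$().
  const selectorCall =
    /(?:locator|querySelectorAll|querySelector|\$\$?)\(\s*(['"`])(.*?)\1/g;

  let stable = semanticCalls.length;
  let classBased = 0;
  let longestClassSelector = '';

  for (const match of source.matchAll(selectorCall)) {
    const selector = match[2];
    const hasStableAnchor =
      /\[(?:data-|aria-|itemprop|role)/.test(selector) ||
      selector.includes('application/ld+json');
    const usesClass = /\.[A-Za-z_-]/.test(selector);

    if (hasStableAnchor) {
      stable += 1;
    } else if (usesClass) {
      classBased += 1;
      if (selector.length > longestClassSelector.length) {
        longestClassSelector = selector;
      }
    }
  }
  return { stable, classBased, longestClassSelector };
}
```

Feed it the contents of your scraper file as a string and compare stable against classBased; longestClassSelector is your entry for the bonus round.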

Agree, disagree, or have a site that genuinely needs CSS chains? Reply.


Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.
