Most broken scrapers I see have the same shape: someone wrote the extraction logic first and the selectors second. The selectors were an afterthought — whatever worked in DevTools at 2am.
That's backwards. Selectors are the contract between your code and the page. Get them wrong and the rest of your scraper is irrelevant.
The mindset shift
Selector-first thinking means: before you write a single line of extraction code, you decide how the data is identified. Not "how do I get the price?" but "what does the page tell me, programmatically, that this thing is a price?"
Three answers, in order of preference:
- Semantics: getByRole, getByLabel, getByText. These mirror what an accessibility tree exposes. They survive design changes.
- Data attributes: data-testid, data-product-id, itemprop. Devs often add these for their own tests; you get to free-ride.
- Structured data: JSON-LD, microdata, OpenGraph. The page is already telling Google what's a price; let it tell you too.
CSS classes are last resort. Class names are styling, not identity. They change when the design changes. They're the equivalent of asking for "the third button from the top" — works until someone rearranges the menu.
The 3-item checklist
Before you write a selector:
- Open the accessibility tree in DevTools (Chrome: Elements → Accessibility tab). If the data has a role and an accessible name, use getByRole.
- Search the page source for application/ld+json. If it's there and contains your fields, parse it directly. No DOM walking needed.
- Look for data-* attributes near the data. Devs leave testing hooks everywhere. Use theirs.
If none of those work, then fall back to CSS or XPath. When you do, anchor to something stable (a parent landmark, an aria-label, a data-* attribute), not just a class chain.
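The JSON-LD item on the checklist can be sketched without a browser at all. This is a minimal example, assuming the page uses schema.org-style offers.price; priceFromJsonLd is a hypothetical helper name, and the regex over raw HTML is a shortcut for illustration rather than a real HTML parser.

```javascript
// Hypothetical helper: pull the first offer price out of raw HTML via JSON-LD.
// Assumes schema.org-style data where a node carries offers.price.
function priceFromJsonLd(html) {
  const scriptRe =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(scriptRe)) {
    let data;
    try {
      data = JSON.parse(match[1]);
    } catch {
      continue; // Malformed block: skip it and try the next one.
    }
    // A block may hold one object, an array of objects, or an @graph wrapper.
    const nodes = Array.isArray(data)
      ? data
      : Array.isArray(data?.['@graph'])
        ? data['@graph']
        : [data];
    for (const node of nodes) {
      const offers = node?.offers;
      // offers may be a single offer or a list of offers.
      const offer = Array.isArray(offers) ? offers[0] : offers;
      if (offer?.price != null) return String(offer.price);
    }
  }
  return null; // No usable JSON-LD price: move on to the next signal.
}
```

A null result is the cue to try the accessibility tree or data-* attributes next.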
The 10-line replacement
Here's the priority I use in every new actor:
async function extractPrice(page) {
  // 1. Structured data first. A page may have several JSON-LD blocks, or none.
  const blocks = await page
    .locator('script[type="application/ld+json"]')
    .allTextContents();
  for (const block of blocks) {
    try {
      const data = JSON.parse(block);
      // offers may be a single object or an array of offers.
      const offer = Array.isArray(data?.offers) ? data.offers[0] : data?.offers;
      if (offer?.price) return offer.price;
    } catch {
      // Malformed JSON-LD: skip this block and try the next one.
    }
  }
  // 2. Semantic selectors. first() avoids a strict-mode error on multiple matches.
  const priceByLabel = page.getByLabel(/^price$/i);
  if (await priceByLabel.count()) return priceByLabel.first().textContent();
  // 3. Data attributes.
  const priceByData = page.locator('[data-testid="price"]');
  if (await priceByData.count()) return priceByData.first().textContent();
  // 4. Last resort: CSS class. Logged loudly so we know we're in fallback.
  console.warn('Falling back to CSS selector: selector audit needed.');
  return page.locator('.price-tag').first().textContent();
}
Notice the console.warn() in the fallback path. When that warning starts appearing in your logs, it means the site changed its higher-priority signals and you're one design refresh away from breakage. Fix it before it breaks, not after.
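The same ladder works for fields other than price. Here is a minimal sketch of a generic version. It is an illustration, not the actual util we ship: extractWithLadder and the { name, fallback, run } strategy shape are hypothetical names chosen for this example.

```javascript
// Hypothetical util: try extraction strategies in priority order.
// Each strategy is { name, fallback, run }, where run() resolves to a value
// or to null/'' on a miss. Strategies marked fallback: true warn when they win.
async function extractWithLadder(strategies, warn = console.warn) {
  for (const { name, fallback = false, run } of strategies) {
    let value = null;
    try {
      value = await run();
    } catch {
      value = null; // A throwing strategy counts as a miss.
    }
    if (value != null && value !== '') {
      if (fallback) warn(`Fell back to "${name}": selector audit needed.`);
      return { value, source: name };
    }
  }
  return { value: null, source: null }; // Every rung missed.
}
```

With Playwright, each run would wrap one rung of the ladder, for example a count() check followed by textContent(). Returning source alongside the value makes it easy to track which rung is doing the work over time.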
Quick case
On our Idealista actor, the priority order above turned a "fix the selector every 6 weeks" routine into a "fix the selector twice a year" routine. The JSON-LD path catches 95% of listings without ever touching the DOM. The accessibility-role fallback catches another 4%. The CSS fallback fires on edge-case property types and tells us when a new layout has shipped — usually a week before any of our other monitoring would have noticed.
The CTA you didn't ask for
This selector ladder is the second thing every actor we ship gets, right after the request blocking from last week's post — see it in action in the Idealista actor. It's so consistent we made it a util.
So:
Open your scraper's selector code right now. Count how many class-name chains you have versus semantic / structured-data lookups. Drop the ratio in the comments. Bonus points for the longest CSS chain — I bet someone has .product-grid > .item:nth-child(3) > .price > span > strong.
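If you want a head start on the count, here is a rough sketch. It assumes you have already collected your selector strings into an array; classifySelector and auditSelectors are hypothetical names, and the regexes are a heuristic, not a CSS parser. getByRole and friends are not selector strings, so count those separately.

```javascript
// Rough heuristic: classify a selector string by the signal it leans on.
function classifySelector(selector) {
  // data-*, itemprop, aria-* or role attribute selectors.
  if (/\[(data-[\w-]+|itemprop|aria-[\w-]+|role)\b/.test(selector)) return 'attribute';
  // JSON-LD script lookups.
  if (/application\/ld\+json/.test(selector)) return 'structured';
  // Anything containing a .class-name token.
  if (/\.[A-Za-z_-][\w-]*/.test(selector)) return 'class';
  return 'other';
}

// Tally selectors by kind and remember the longest class-based one.
function auditSelectors(selectors) {
  const counts = { attribute: 0, structured: 0, class: 0, other: 0 };
  let longestClassChain = '';
  for (const selector of selectors) {
    const kind = classifySelector(selector);
    counts[kind] += 1;
    if (kind === 'class' && selector.length > longestClassChain.length) {
      longestClassChain = selector;
    }
  }
  return { counts, longestClassChain };
}
```

A selector that mixes a data-* anchor with a class counts as 'attribute' here, which matches the advice above: anchored class selectors are fine, bare class chains are the problem.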
Agree, disagree, or have a site that genuinely needs CSS chains? Reply.
Written by Nova Chen, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.
