TL;DR — Most scraper "bugs" aren't bugs. They're the source site changing its data shape underneath you while your selectors and your code keep returning success. This is schema drift, and you cannot prevent it. You can only detect it. The detection has to be designed in. Here's how we do it.
I have a low opinion of any scraper that does not log a per-field availability rate. It's the single most useful number you can produce, and almost nobody produces it.
The premise: every record you scrape has a set of expected fields. After every run, you compute, for each field, the percentage of records that had a non-null value for it. You log that number. You alarm on it.
That's it. That's the whole technique.
Why this matters
A scraper has three failure modes you actually care about:
- Total failure — the run errors out, you get a stack trace, you fix it.
- Partial failure — some URLs fail, you log them, you retry.
- Schema drift — every URL "succeeds," every record looks fine, but a field has silently gone from 98% present to 30% present.
The first two are loud. The third is silent. Schema drift is what produces "the dashboard looks weird" support tickets a week after the cause.
Real example, from our Sephora product info actor: in March, the site moved the "ingredients" field from a top-level dropdown into a tab inside a modal. Our existing selector still found something on the page (a placeholder div), and our code happily wrote ingredients="" to the dataset. No error, no alarm. The CSV still had an ingredients column; the values were just empty for new products. It was detected eight days later by a customer who tried to filter by allergen.
If we had been logging field availability, we would have seen the ingredients field drop from 96% present to 11% present from one run to the next and caught it inside an hour.
The teardown of why this gets missed
Most scrapers track:
- Rows extracted per run.
- Errors per run.
- Run duration.
None of those move when schema drift happens. The row count is the same. The error rate is zero. The run duration is the same. You have to be looking at field-level data to see it.
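A toy illustration of that blind spot, using made-up records (the field names and values are assumptions for the example, not real scraped data): two runs that agree on every run-level metric while one field has quietly emptied out.

```python
def run_level_metrics(records, errors):
    """The metrics most scrapers log: row count and error count."""
    return {"rows": len(records), "errors": len(errors)}


# Hypothetical records from two consecutive runs of the same scraper.
run_before = [
    {"title": "Serum A", "price": 30.0, "ingredients": "water, glycerin"},
    {"title": "Cream B", "price": 42.0, "ingredients": "water, shea butter"},
]
run_after = [
    {"title": "Serum A", "price": 30.0, "ingredients": ""},
    {"title": "Cream B", "price": 42.0, "ingredients": ""},
]

before = run_level_metrics(run_before, errors=[])
after = run_level_metrics(run_after, errors=[])

# Run-level metrics are identical, so nothing alarms.
print(before == after)

# Only a field-level look shows the difference.
filled_before = sum(1 for r in run_before if r.get("ingredients"))
filled_after = sum(1 for r in run_after if r.get("ingredients"))
print(filled_before, filled_after)
```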
The replacement pattern
After every run, compute and log this:
from collections import Counter

def field_availability(records, expected_fields):
    """Return the % of records where each field is non-empty."""
    counts = Counter()
    total = len(records)
    if total == 0:
        # Empty run: avoid dividing by zero and report 0% for every field.
        return {field: 0.0 for field in expected_fields}
    for record in records:
        for field in expected_fields:
            # None, empty string, and empty list all count as missing.
            if record.get(field) not in (None, "", []):
                counts[field] += 1
    return {field: round(counts[field] / total * 100, 1) for field in expected_fields}
At the end of the run:
availability = field_availability(records, EXPECTED_FIELDS)
# Nest under one key: a scraped field called "name" or "message" would
# otherwise collide with built-in LogRecord attributes and raise KeyError.
log.info("field_availability", extra={"field_availability": availability})

# Alarm on regression vs last run.
prev = await KeyValueStore.getValue("last_field_availability") or {}
for field, pct in availability.items():
    delta = pct - prev.get(field, pct)
    if delta < -10:  # 10-point drop is suspicious
        log.warning(f"availability regression: {field} {prev[field]}% → {pct}%")
await KeyValueStore.setValue("last_field_availability", availability)
One log line per run, plus a warning for each regressed field. Persistent state across runs. An alarm when any field drops more than 10 percentage points.
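The comparison step can also be pulled out into a pure function, which makes it testable without a key-value store. This is a minimal sketch, assuming availability is stored as a plain dict of field name to percentage; `find_regressions` and its `threshold` parameter are names made up for this example.

```python
def find_regressions(prev, current, threshold=10.0):
    """Return {field: (previous_pct, current_pct)} for every field whose
    availability dropped by more than `threshold` percentage points.

    Fields missing from `prev` (first run, or newly added fields) are
    skipped because there is nothing to compare against.
    """
    regressions = {}
    for field, pct in current.items():
        if field not in prev:
            continue
        if prev[field] - pct > threshold:
            regressions[field] = (prev[field], pct)
    return regressions


# Made-up numbers for illustration.
prev = {"title": 100.0, "price": 99.5, "ingredients": 96.0}
current = {"title": 100.0, "price": 95.0, "ingredients": 11.0, "brand": 80.0}

for field, (old, new) in find_regressions(prev, current).items():
    print(f"availability regression: {field} {old}% -> {new}%")
```

With these numbers only `ingredients` is flagged: the 4.5-point dip in `price` stays under the threshold, and `brand` has no previous value to compare against.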
What to monitor specifically
Field availability is the one that catches the most. Two more I find pay for themselves:
- Value distribution shift. For numeric fields (price, rating, count), log the median and p95. If price suddenly goes from "median ~$30" to "median 0.0" you have a parser bug, not just availability drift.
- Selector hit count. When you fall back from primary to secondary selector, log it. If your fallback rate goes from 1% to 40%, the primary selector is on its way out — you have a week or so before it goes to zero.
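For the value-distribution signal, a small sketch using only the standard library might look like this. The function name `numeric_summary`, the nearest-rank definition of p95, and the sample prices are all assumptions for illustration.

```python
import statistics


def numeric_summary(records, field):
    """Median and p95 of a numeric field, ignoring missing or non-numeric values."""
    values = sorted(
        r[field]
        for r in records
        if isinstance(r.get(field), (int, float)) and not isinstance(r.get(field), bool)
    )
    if not values:
        return {"count": 0, "median": None, "p95": None}
    # Nearest-rank p95: ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(values) + 99) // 100
    return {
        "count": len(values),
        "median": statistics.median(values),
        "p95": values[rank - 1],
    }


# Made-up prices: a healthy run, then a run where the parser started returning 0.0.
healthy = [{"price": p} for p in (10, 20, 30, 40)] + [{"price": None}, {"price": "n/a"}]
broken = [{"price": 0.0} for _ in range(6)]

healthy_summary = numeric_summary(healthy, "price")
broken_summary = numeric_summary(broken, "price")
print(healthy_summary)
print(broken_summary)
```

In the broken run every record has a price, so availability alone would read 100%. Only the median collapsing to 0.0 shows that something is off.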
These three together (availability, distribution, fallback rate) catch ~90% of schema drift before it produces customer-visible bugs.
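The fallback-rate signal needs the extraction code to record which selector actually produced a value. One way to sketch it, assuming each field's candidate values have already been extracted in priority order (the `SelectorStats` class and its method names are hypothetical, not part of any scraping library):

```python
from collections import Counter


class SelectorStats:
    """Counts which selector (primary, fallback, or none) produced each field."""

    def __init__(self):
        self.hits = Counter()

    def extract(self, field, candidates):
        """Return the first non-empty candidate value and record its position.

        `candidates` is an ordered list of already-extracted values:
        index 0 from the primary selector, later indexes from fallbacks.
        """
        for index, value in enumerate(candidates):
            if value not in (None, "", []):
                kind = "primary" if index == 0 else "fallback"
                self.hits[(field, kind)] += 1
                return value
        self.hits[(field, "miss")] += 1
        return None

    def fallback_rate(self, field):
        """Percentage of successful extractions that needed a fallback selector."""
        primary = self.hits[(field, "primary")]
        fallback = self.hits[(field, "fallback")]
        found = primary + fallback
        return round(fallback / found * 100, 1) if found else 0.0


# Made-up extraction results for four pages.
stats = SelectorStats()
stats.extract("price", ["$30", None])   # primary selector worked
stats.extract("price", [None, "$28"])   # fell back
stats.extract("price", ["", "$12"])     # placeholder from primary, fell back
stats.extract("price", [None, None])    # nothing found
print(stats.fallback_rate("price"))
```

Logging `fallback_rate` per field at the end of each run gives the same kind of number as availability, so it can be compared against the previous run in the same way.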
Result
We added per-field availability logging across the Sephora actor portfolio in February. In the four months since:
- 6 schema-drift incidents caught and fixed within 48 hours of the source-site change.
- Mean detection lag went from "a customer noticed" (~7 days) to "the alarm fired" (~12 hours, the gap being our run cadence).
- One incident where the field availability dropped in a way that was expected (Sephora removed a field site-wide); we acknowledged and updated the schema. Net cost: 20 minutes, including writing the postmortem.
The cost: about 30 lines of code per actor, run-time overhead measured in milliseconds.
When this is wrong
Field availability is a poor signal when your input is inherently heterogeneous. If you're scraping listings where some products have ingredients and most don't, "30% have ingredients" might be normal. The technique still works — you just compare to the previous run, not to an absolute target. A 10-point drop is the alarm; the absolute number doesn't matter.
If you're scraping a homogeneous catalogue (every product has a title and a price), absolute thresholds work fine. Title <99% present? Something is wrong.
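Both kinds of threshold can live side by side. The sketch below assumes a per-field policy where an absolute floor is set only for fields every record should have; `check_field`, `FLOORS`, and the numbers are illustrative assumptions.

```python
def check_field(field, pct, prev_pct=None, floor=None, max_drop=10.0):
    """Return a list of alarm messages for one field.

    floor:    absolute minimum availability (for fields every record should have).
    max_drop: largest tolerated drop, in percentage points, versus the previous run.
    """
    alarms = []
    if floor is not None and pct < floor:
        alarms.append(f"{field}: {pct}% is below the {floor}% floor")
    if prev_pct is not None and prev_pct - pct > max_drop:
        alarms.append(f"{field}: dropped from {prev_pct}% to {pct}%")
    return alarms


# Hypothetical policy: "ingredients" has no floor on purpose, because
# a low absolute number is normal for a heterogeneous field.
FLOORS = {"title": 99.0, "price": 99.0}

prev = {"title": 100.0, "price": 99.8, "ingredients": 31.0}
current = {"title": 97.5, "price": 99.9, "ingredients": 30.0}

alarms = []
for field, pct in current.items():
    alarms += check_field(field, pct, prev.get(field), FLOORS.get(field))

for message in alarms:
    print(message)
```

Here `title` trips its floor even though it only moved 2.5 points, while `ingredients` at 30% stays quiet because it is judged purely against the previous run.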
We packaged the field-availability + distribution + fallback-rate triple into a small middleware that sits at the end of every actor we ship — first deployed on the Sephora product info actor and rolled out portfolio-wide. Three lines to wire up, alarms in your inbox the day a source site decides to change their schema.
Which of the three signals is missing from your scraper right now? Drop it in the comments — I'll show you the smallest version that works.
Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.
