Alex Spinov

Posted on Jun 7 • Originally published at blog.spinov.online

Your Scraper Passes Every Run. It's Still Rotting.

#webscraping #python #dataengineering #monitoring

A scraper run finished green. Exit 0. Schema valid. Row count looked normal. So did the one before it, and the forty before that.

Then one afternoon you glance at a number you don't usually look at — total rows this month vs the same source last month — and you're collecting noticeably less than you used to. No errors. No traceback. No alert fired, because nothing was ever wrong with any single run.

That's the failure I want to talk about. Not a crash. A slow rot, measured across runs, that every single-run check on earth is blind to.

TL;DR

A scraper can pass every per-run gate (exit 0, schema ok, count plausible) while its rolling yield slides down for weeks.
Single-run checks can't see it. The signal only exists when you compare today's run to your own past, not to a declared total.
The obvious detector, "median of the last K runs", is a boiling-frog trap. On a slow drift the baseline sinks with the signal, so it never fires. I ran it. Zero warnings while yield slid from 48.0 to 29.8 rows/page.
Fix: a lagged baseline. Compare today against the median of runs from K..2K runs ago — your settled past, not the part already eaten by decay. ~20 lines over your run log. No Grafana, no SRE.
Numbers below are a deterministic synthetic run log, not a claim about our real slide. What's real is the volume that makes such a curve observable: 2,190 production runs, one Trustpilot scraper alone at 962.

Why I trust this surface exists

I've shipped 32 scrapers and they've logged 2,190 runs in production. One of them, a Trustpilot review scraper, has run 962 times against the same source. That's the thing most scraping tutorials don't have: a long line of runs hitting one place over real calendar time.

When you have 962 runs of one source, "yield per run" stops being a single number and becomes a curve. And a curve has a shape. Most of the time the shape is flat and boring. Sometimes it tilts down so gently that no individual run ever looks off — and that's exactly the case nobody writes about, because you only see it if you have the history.

To be honest about the limits up front: I don't have a clean, published figure for how far our real yield slid on any specific site, so I'm not going to invent one. That number is n/d. What I can do is show you the mechanism on a deterministic run log you can execute yourself, and tell you the detector I'd reach for. The 2,190 / 962 is the part that's real — it's the volume that makes the curve visible at all.

This is not the other failures

If you've read the rest of this series, draw the boundary clearly, because the failures rhyme and the fixes don't.

This isn't a bad status code — HTTP 200 lying about a broken response shape is the schema canary. This isn't a wrong field value inside a row that's otherwise valid. It isn't bytes you paid for that came back empty, and it isn't a crash you resume from.

Every run here is green. Exit 0, schema valid, row count plausible for a single run. There is no declared total to check against — you don't know how many rows "should" come back. The decay only exists when you line today's run up against your own past. The rolling yield has been drifting down for weeks, and nothing ever threw.

New axis: time. Not the shape of one response, not a field, not bytes, not a crash. The trend of one source against its own history.

How a healthy run can be a rotting series

Here's the mechanism, because it's quieter than it sounds.

Say your scraper walks 80 pages of a source each run. When the source is healthy, each page hands back about 48 rows (call it a full page minus a little). So a run pulls roughly 3,800 rows. Plausible. Your per-run sanity gate says "fail if rows < 2,000," and it never trips.

Now the source starts thinning. Maybe a soft rate-throttle kicks in and pages return fewer items. Maybe the result set genuinely shrinks. Maybe an A/B test on their side trims the page size. Whatever the cause, each run quietly returns a hair less than the last, say 0.9 fewer rows per page. One run that goes from 48 to 47.1 rows/page looks identical to the one before. Nobody blinks.

Roll that forward. Twenty runs later you're at 30 rows/page. The run still walks 80 pages. Still exits 0. Still has a valid schema on every row. Still clears the rows < 2,000 gate, because 80 pages at 30 rows is 2,400. Your per-run gate has no memory, so it can't tell that 2,400 rows used to be 3,800. The frog has been boiling the whole time and the thermometer only ever read "alive."

That's the trap of single-run validation: every check answers "is this run OK by itself?" None of them answers "is this run OK compared to what this source used to give me?"
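To make the "no memory" point concrete, here is a minimal sketch of a per-run gate applied to a thinning source. The gate function, the 2,000-row floor, and the 21-run series are illustrative assumptions, not code from a production scraper.

```python
MIN_ROWS = 2000   # assumed absolute floor for the per-run gate
PAGES = 80        # pages walked per run, as in the example above

def per_run_gate(run):
    """A memoryless check: it only ever sees the run it is handed."""
    return run["exit_code"] == 0 and run["schema_ok"] and run["rows"] >= MIN_ROWS

# 21 runs: healthy at 48 rows/page, then 0.9 fewer rows/page on each later run.
runs = []
for step in range(21):
    rows_per_page = 48.0 - 0.9 * step
    runs.append({"exit_code": 0, "schema_ok": True,
                 "rows": round(rows_per_page * PAGES)})

all_green = all(per_run_gate(r) for r in runs)
first_rows, last_rows = runs[0]["rows"], runs[-1]["rows"]
print(all_green, first_rows, last_rows)
```

Every run passes the gate even though the last run returns 2,400 rows against 3,840 for the first. Nothing in `per_run_gate` can notice, because it never sees two runs at once.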

The detector: ~20 lines over your run log

You don't need a metrics stack for this. You need three things: log the yield of every run, keep the log, and run a baseline check over it.
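The first two of those three things (log the yield, keep the log) can be as small as an append-only JSON Lines file. The sketch below is one possible shape, assuming a hypothetical `runs.jsonl` file and the same field names the probe uses; a real log would live somewhere durable, not in a temp directory.

```python
import json
import os
import tempfile

def append_run(log_path, run_id, exit_code, schema_ok, pages, rows):
    """Append one run as a single JSON line (field names mirror the probe)."""
    record = {
        "run_id": run_id, "exit_code": exit_code, "schema_ok": schema_ok,
        "pages": pages, "rows": rows,
        # Normalise on pages so runs with different page budgets stay comparable.
        "yield_per_page": round(rows / pages, 3) if pages else 0.0,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record

def load_runs(log_path):
    """Read the whole log back in append order, oldest run first."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

# Demo against a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "runs.jsonl")  # hypothetical file name
    append_run(demo_path, 1, 0, True, 80, 3840)
    append_run(demo_path, 2, 0, True, 80, 3792)
    demo_runs = load_runs(demo_path)
```

The list returned by `load_runs` has the same shape as the run dicts the probe iterates over, so the baseline check can read it directly.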

Here's the whole probe. Pure stdlib, no network, no browser, no paid API — a deterministic synthetic run log stands in for the source so you can run it in seconds and get the exact output I did.

import statistics

# --- knobs ---
K = 7              # baseline window size (median of K runs)
GAP = K            # how far back the window sits (LAGGED, not trailing)
THRESHOLD = 0.15   # WARN if today's yield is >15% below the lagged baseline
MIN_HISTORY = K + GAP

def synth_run_log():
    """Deterministic log of one source over 60 runs. Every run is 'green':
    exit 0, schema valid, plausible single-run row count. Yield slowly decays
    after run 40 (a soft throttle / thinning source) with small jitter."""
    runs = []
    base_yield = 48.0
    jitter = [0.0, -0.6, 0.4, -0.3, 0.5, -0.4, 0.2, -0.5, 0.3, -0.2]
    for i in range(60):
        run_id = i + 1
        decay = 0.0 if run_id <= 40 else (run_id - 40) * 0.9
        y = base_yield - decay + jitter[i % len(jitter)]
        pages = 80
        rows = round(y * pages)
        runs.append({
            "run_id": run_id, "exit_code": 0, "schema_ok": True,
            "pages": pages, "rows": rows,
            "yield_per_page": round(rows / pages, 3),
        })
    return runs

def yield_decay_probe(runs, k=K, gap=GAP, threshold=THRESHOLD):
    """For each run, compare its yield to the median of a LAGGED window:
    runs [idx-k-gap : idx-gap]. The gap is what defeats the boiling-frog trap."""
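    # Derive the warm-up length from the arguments, so a custom k or gap
    # can never produce a negative slice start that wraps around the list.
    min_history = k + gap
    first_warn = None
    for idx, run in enumerate(runs):
        if idx < min_history: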
            run["verdict"] = "BUILDING_BASELINE"
            run["baseline"] = run["drop"] = None
            continue
        window = [r["yield_per_page"] for r in runs[idx - k - gap:idx - gap]]
        baseline = statistics.median(window)
        drop = (baseline - run["yield_per_page"]) / baseline
        run["baseline"], run["drop"] = round(baseline, 3), round(drop, 3)
        run["verdict"] = "DECAY_WARN" if drop > threshold else "OK"
        if drop > threshold and first_warn is None:
            first_warn = run
    return first_warn

The heart of it is one line:

window = [r["yield_per_page"] for r in runs[idx - k - gap:idx - gap]]

Today's yield isn't compared to the last 7 runs. It's compared to the 7 runs that sit 14 down to 8 runs back: your settled past. I'll explain in a second why that gap is the whole game.
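To check which runs that slice actually selects, here is a small sketch that applies the same index arithmetic to a plain list of run ids. The choice of run 48 as "today" is just an example.

```python
K, GAP = 7, 7

run_ids = list(range(1, 61))   # run ids 1..60, stored at indexes 0..59
idx = 47                       # index of run 48, treated as "today"

# Same slice as the probe: skip the GAP most recent runs, then take K runs.
window_ids = run_ids[idx - K - GAP:idx - GAP]
today = run_ids[idx]
offsets = [today - r for r in window_ids]   # how many runs back each one sits

print(window_ids)
print(offsets)
```

For run 48 the window is runs 34 through 40, which sit 14 down to 8 runs back. The 7 most recent runs (41 through 47) are deliberately left out of the baseline.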

Run it and you get:

=== YIELD DECAY PROBE ===
runs in log              : 60
every run exit 0         : True
every run schema ok      : True
baseline window (K)      : 7 runs, lagged by 7 (settled past)
warn threshold           : 15% below baseline
min single-run rows      : 2384
max single-run rows      : 3880
--------------------------------------------------------
FIRST DECAY WARN         : run 48
  baseline yield/page    : 47.8
  this run  yield/page   : 40.3
  drop vs baseline       : 15%
  this run exit code     : 0  (GREEN)
  this run rows          : 3224  (plausible)
--------------------------------------------------------
latest run (run 60)         : yield 29.8/page, rows 2384, exit 0
latest verdict           : DECAY_WARN (drop 25% vs baseline 40.2)
========================================================

Read what that output is actually saying:

every run exit 0 = True, every run schema ok = True. All 60 runs pass every single-run check. There is nothing here a status code or schema validator would catch.
min single-run rows : 2384. Even the worst run pulled 2,384 rows. A rows < 3000 gate passes the early decay and first trips at run 52 (2,928 rows), four runs after the probe's warning. A rows < 2000 gate never trips at all.
The decay starts at run 41. The probe's first warning fires at run 48, roughly 40% of the way down the slide, while the run is fully green: exit 0, 3,224 rows, perfectly plausible. You get the flag well before this becomes "we're collecting half of what we used to."
By run 60, yield is 25% below even the lagged baseline, about 38% below the healthy 48 rows/page, and rows have slid to 2,384. With a low enough floor, that's still a green run. With the probe, you'd have known seven runs after the decay began.

The part that surprised me: the obvious detector is silent

The first version I'd reach for, and the version most people write, uses a trailing median. Baseline = median of the last K runs. No gap. It feels right.

It's a boiling-frog trap, and the analogy is exact. On a slow drift the baseline sinks at the same rate as the signal. Every step from one run to the next is within threshold, because the thing you're comparing against already moved down too. The detector congratulates itself the whole way down.

I didn't take that on faith. I ran the same log through a trailing median (GAP=0):

NAIVE trailing-median (GAP=0):
  warns fired         : 0
  first warn at run   : None
  latest run yield    : 29.8 (started at 48.0, now down 25%)

Zero warnings. Yield fell from 48.0 to 29.8, about 38% off its healthy level and 25% below even the lagged baseline of 40.2, and the "obvious" detector never said a word. It would have caught a cliff. It is structurally blind to a slide.

The fix is the lag. Compare today not against the recent past (which the decay has already infected) but against a settled window further back — runs K..2K ago. That window remembers what healthy looked like. The drop is measured against memory, not against the slowly-poisoned present. Same probe, GAP=7, and it fires at run 48.

If you take one thing from this post: a trend detector whose baseline includes recent data can't see a slow trend. Make the baseline lag.
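The trailing-versus-lagged contrast can be reproduced in a few lines. The sketch below is a simplified stand-in, not the probe itself: it assumes a jitter-free series (40 flat runs at 48 rows/page, then one row/page lost per run), and the helper name `first_warn_run` is made up for illustration.

```python
import statistics

def first_warn_run(yields, k=7, gap=0, threshold=0.15):
    """Return the 1-based run number of the first warning, or None.

    gap=0 gives a trailing baseline; gap>0 pushes the window into the past.
    """
    for idx in range(k + gap, len(yields)):
        window = yields[idx - k - gap:idx - gap]
        baseline = statistics.median(window)
        if baseline > 0 and (baseline - yields[idx]) / baseline > threshold:
            return idx + 1
    return None

# Assumed series: flat at 48.0 through run 40, then down 1.0 per run to run 60.
yields = [48.0 if run <= 40 else 48.0 - (run - 40) for run in range(1, 61)]

trailing = first_warn_run(yields, gap=0)   # baseline = the 7 runs just before today
lagged = first_warn_run(yields, gap=7)     # baseline = the 7 runs before those

print("trailing:", trailing)
print("lagged:", lagged)
```

On this series the trailing baseline never warns: its median is always the value four runs back, so the measured drop tops out at 4/32, or 12.5%. The lagged baseline still remembers 48.0 when run 48 arrives at 40.0, which is a 16.7% drop and crosses the 15% threshold.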

What this does NOT catch (and where it cries wolf)

I'm not going to oversell 20 lines.

It needs runs. With K + GAP = 14, the probe says BUILDING_BASELINE until you have enough history. Brand-new scraper, sparse schedule — no signal yet. This is a tool for sources you hit repeatedly, which is exactly where slow decay hides.

A genuine cliff also trips it — correctly, but you'll want context. If a source legitimately halves overnight (they really did remove half the listings), the probe fires. That's not a false positive, but it's not decay either; it's a step change. The probe tells you something moved, not why.

Seasonality and legitimate shrinkage will cry wolf. A source that's genuinely quieter on weekends, or a category that's actually emptying out, will look like decay. The probe has no idea your source is supposed to shrink. You'll get warnings you have to read and dismiss. A single global threshold is blunt; per-source thresholds are better, and I haven't built the per-source version into these 20 lines.
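One way the per-source idea could look, as a sketch only: a lookup table of thresholds with a global default. The source names and threshold values here are hypothetical placeholders, and this is not part of the probe above.

```python
DEFAULT_THRESHOLD = 0.15

# Hypothetical overrides: looser for sources known to be noisy or seasonal,
# tighter for sources whose yield is normally very stable.
SOURCE_THRESHOLDS = {
    "noisy-marketplace": 0.30,
    "stable-reviews": 0.10,
}

def threshold_for(source):
    """Fall back to the global threshold when a source has no override."""
    return SOURCE_THRESHOLDS.get(source, DEFAULT_THRESHOLD)

def verdict(source, baseline, today_yield):
    """Same drop formula as the probe, with a per-source threshold."""
    drop = (baseline - today_yield) / baseline
    return "DECAY_WARN" if drop > threshold_for(source) else "OK"

# The same 48.0 -> 40.0 move (about a 16.7% drop) reads differently per source.
for name in ("noisy-marketplace", "stable-reviews", "unlisted-source"):
    print(name, verdict(name, 48.0, 40.0))
```

This only moves the bluntness around: someone still has to pick each number, and a threshold does nothing for weekly seasonality, which would need a baseline built from comparable days.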

It assumes yield is comparable run to run. If your page budget changes between runs, normalize on rows-per-page (as the probe does), not raw rows. If even the per-page meaning drifts, you need a smarter denominator than I've shown here.

So: it's a smoke alarm, not a diagnosis. It earns its 20 lines by catching the one failure that every green log hides — and it will occasionally beep at burnt toast.

What to actually do Monday

Three changes, smallest first:

Log yield per run. Not only the exit code. One number (rows, or rows-per-page if your budget varies) written to a durable run log. If you're not logging it, you can't see the curve, full stop.
Alert on trend, not on an absolute floor. A rows < N gate is a tripwire at one height; decay walks under it. Compare each run to a lagged baseline of its own source.
Make the baseline lag. Trailing windows go blind to slow drift. Median of runs K..2K ago. That's the difference between "no decay detected" and "warn at run 48."

You don't need a metrics platform to start. You need the run log you probably already half-have, plus this probe reading it. Grafana is great once you've decided what to watch. This tells you what to watch before you've stood anything up.

One open question I haven't settled: across 962 runs of one source, how much of the yield wobble is the source genuinely changing vs. our own throttling/proxy behavior leaking into the curve? I can see the curve move; cleanly attributing each dip is harder than I'd like. If you've separated "the source changed" from "my client changed" in a long run history, I'd genuinely like to hear how — I read every comment.

Follow for the next numbers from the run log. And tell me the slowest, sneakiest scraper decay you've watched happen — the one no alert ever caught.

Written by Aleksei Spinov — I run production scrapers (2,190 runs across 32 actors; one Trustpilot scraper at 962). Proof: blog.spinov.online and my Apify profile.

AI disclosure: drafted with AI assistance, then edited, fact-checked, and the code run and verified by me. The run log is synthetic and deterministic; the output above is real stdout from executing the script.