Alex Spinov

Posted on Jun 9 • Originally published at blog.spinov.online

Your Scraper Re-Downloads Everything. Most Didn't Change.

#webscraping #python #performance #dataengineering

Your scheduled scraper re-downloaded the whole corpus last night. A few thousand records. About forty of them actually changed since the run before.

It downloaded all of them anyway, because it had no idea which forty.

That's the failure I want to talk about. Not a block, not a crash, not bad data. A scraper that works perfectly and still does an enormous amount of work it didn't need to — because it decides what to fetch after fetching, instead of before.

TL;DR

A full re-scrape is the dumb default. On a scheduled re-run you pay to re-download records that didn't change since last time, and the scraper can't skip them because it learns "did this change" only after the request.
The cheap re-scrape isn't "download faster." It's "don't download what didn't change" — and you decide that from a manifest of what you knew last run, before the first request goes out.
Three levers, in priority order: a trustworthy validator (ETag / Last-Modified) → CONDITIONAL (a 304 transfers zero body); no validator but a stored content hash → FETCH then compare; new URL → FETCH. RFC 7232 (since folded into RFC 9110) is the whole spec.
The trap I hit in production: a weak, per-request-rotating ETag never returns 304. Every conditional request comes back as a full 200, so you "saved" nothing and trusted a validator that lies. The planner has to flag it and fall back to content-hash.
The savings number below is a deterministic synthetic manifest you run yourself — not a measurement of any real site. What's real is the exposure: 2,190 production runs across 32 actors, one Trustpilot scraper at 962.

Why I get to talk about re-scrape cost

I run production scrapers. Thirty-two published actors, 2,190 runs logged in production as of this week — that's the live counter on my own Apify dashboard, not a rounded brag. One of them, a Trustpilot review scraper, has run 962 times.

That 962 is the relevant number here. A review scraper isn't a one-shot job. It runs on a schedule, against the same companies, over and over — which means it re-visits the same pages it already saw last week, and the week before. Most of those pages have one new review, or none. Re-pulling the unchanged bulk on every scheduled run is, in my experience, the quiet majority of the work a long-lived scraper does. Not the failures. The redundant success.

Now the honest part. I do not have a clean, published figure for the compute-units, proxy-GB, or wall-clock time of a full re-crawl versus a delta crawl on our real corpus. That number is n/d, and I'm not going to invent one to make a point. What I can give you is the mechanism, on a manifest you can run in two seconds and get the exact count I did. The 2,190 / 962 is the real part — it's the reason I think about this at all.

This is not the other failures in the series

These failures rhyme, and the fixes don't. Draw the boundary hard.

Not this	That post is about	This post is about
Raw HTML token tax	the cost of one fetch (raw HTML → markdown tokens, a polite conditional GET)	how many fetches to make at all on a re-run — work at the level of the whole record set
Corpus near-duplicates	removing duplicates inside data you already collected	not re-collecting what didn't change, before collection
Resume a dead run	finishing an interrupted run without re-doing work	deliberately not re-pulling unchanged records on a successful, scheduled re-run
Yield decay over time	detecting that output is silently rotting vs a baseline	deciding what to re-collect to keep the corpus fresh cheaply
Poisoned data	values that are valid in form but false in fact	the volume of re-collection — nothing about trusting the content

The thin line is with the token-tax post. A conditional GET shows up there too, as politeness on a single request. Here a conditional GET is just one of three levers in a plan over the whole set, and it's not even the center of gravity — the center is SKIP by manifest. The moment a re-scrape post starts reading like "how to send one conditional GET," it has drifted into the token-tax post. The question that keeps it here: for N records on a scheduled re-run, how many do I touch?

New axis: work. Not the cost of one fetch, not deduping, not resume, not detection. The size of the job on a planned repeat.

The decision belongs before the request, not after

Here's the whole reframe. A naive scraper's loop is: fetch the page, then notice it's identical to last time. The noticing is too late — the bytes already crossed the wire, the proxy already burned, the parser already ran.

A planner inverts it. Before any request, it reads last run's manifest and assigns a plan. The manifest is one small row per record with what you knew: {url, etag, last_modified, content_hash}, plus a flag for a validator you've learned not to trust (etag_weak in the code below). RFC 7232 gives you two of the three levers for free, and they've been in HTTP for years; almost nobody uses them on the re-scrape path.

The priority order:

Trustworthy validator → CONDITIONAL. If last run gave you an ETag or a Last-Modified, send If-None-Match / If-Modified-Since. If the page is unchanged, the server answers 304 Not Modified: status line and headers, no body. You confirmed "nothing changed" for the cost of a header round-trip, not a full download.
No validator, but a stored content hash → FETCH, then compare. Some servers give you nothing to precondition on. You still hold last run's body hash. Fetch, hash the new body, and if it matches, stop — don't re-parse, don't re-write downstream, don't re-embed. You paid for the bytes but skipped everything after.
New URL → FETCH. Not in the manifest, so there's nothing to compare. Pull it.
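
As a minimal sketch of the CONDITIONAL lever, assuming a manifest row shaped like the {url, etag, last_modified, content_hash} record described earlier: the helper names conditional_headers and outcome are illustrative, not from the original script, and the ETag and date values are placeholders.

```python
def conditional_headers(rec):
    """Build precondition headers from one manifest row (assumed shape)."""
    headers = {}
    if rec.get("etag"):
        headers["If-None-Match"] = rec["etag"]
    if rec.get("last_modified"):
        headers["If-Modified-Since"] = rec["last_modified"]
    return headers


def outcome(status_code):
    """Map a response status to what the re-scrape loop should do next."""
    if status_code == 304:
        return "unchanged"      # no body was sent; keep the stored copy
    if status_code == 200:
        return "process_body"   # changed, or the server ignored the precondition
    return "error"


row = {
    "url": "https://shop.example.com/p/1001",
    "etag": '"abc123"',                                # placeholder value
    "last_modified": "Wed, 21 Oct 2015 07:28:00 GMT",  # placeholder value
    "content_hash": None,
}
print(conditional_headers(row))
print(outcome(304))  # unchanged
```

Sending both headers is harmless: RFC 7232 §3.3 has the server ignore If-Modified-Since whenever If-None-Match is present.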

The planner's only job is to assign one of FETCH / SKIP / CONDITIONAL to every record, from the manifest alone, before the loop starts. (The demo planner below only ever emits FETCH and CONDITIONAL.) That plan is the artifact. Everything downstream just executes it.

The trap: a validator that lies

I trusted ETags completely until one source taught me not to.

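A conditional GET assumes the validator is stable: same content, same ETag, so an unchanged page returns 304. But RFC 7232 §2.1 explicitly allows weak validators: metadata "that might not change for every change to the representation data … or a desire of the resource owner to group representations by some self-determined set of equivalency". A weak ETag is written W/"...". Weakness alone doesn't block a 304, because If-None-Match uses weak comparison (RFC 7232 §3.2) and a stable weak ETag still matches; the risk runs the other way, since a weak ETag may stay the same across a small change to the body.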

What bit me was worse than weak — it rotated. The server emitted a different ETag on every response for the same unchanged page. So my If-None-Match never matched, the server never returned 304, and every conditional request came back as a fresh 200 with a full body. I'd "optimized" the re-scrape and saved exactly nothing on that source, while believing I had. The data was fine. The plan was a lie.

The fix is to treat the validator as untrustworthy and downgrade: when an ETag is weak or known to rotate, don't plan CONDITIONAL, plan FETCH and compare the content hash after. You lose the 304 savings on that one source, but you stop trusting a number that can't be trusted. The planner below carries that downgrade as an explicit branch — it's the production detail that turns a tutorial into something that survives contact with a real site.
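
One way the etag_weak flag could be derived, as a sketch rather than the method used in production: the function names and the (etag, content_hash) pair shape are assumptions for illustration. It flags an ETag that carries the W/ prefix, or one that differs across two fetches whose bodies hashed the same.

```python
def is_weak(etag):
    """True when the entity-tag carries the case-sensitive weak prefix W/."""
    return etag is not None and etag.startswith("W/")


def rotates(prev, curr):
    """True when two observations share a body hash but not an ETag.

    prev and curr are (etag, content_hash) pairs from two fetches of one URL.
    """
    prev_etag, prev_hash = prev
    curr_etag, curr_hash = curr
    if prev_etag is None or curr_etag is None:
        return False
    return prev_hash == curr_hash and prev_etag != curr_etag


def etag_untrustworthy(prev, curr):
    """Candidate value for the etag_weak flag that the planner reads."""
    return is_weak(curr[0]) or rotates(prev, curr)


# Same body hash, different ETag on each fetch: the rotating case.
print(etag_untrustworthy(('"a1"', "h1"), ('"b2"', "h1")))  # True
# Same ETag, same body: a stable validator.
print(etag_untrustworthy(('"a1"', "h1"), ('"a1"', "h1")))  # False
```

Rotation can only be observed after at least two full fetches of a source, so a new source starts out trusted under this sketch until the second run proves otherwise.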

The planner, in pieces

Pure stdlib, no network, no browser, no keys, no random. A deterministic synthetic manifest stands in for last run's stored state, so you get the exact output I did. The transport is irrelevant to the mechanism — the planning is just a decision over a table.

First, the decision for a single record. This is the entire idea:

def plan_for(rec):
    """Return (plan, note) for one manifest record, BEFORE any request."""
    if rec["etag"] is not None and rec["etag_weak"]:
        # A rotating ETag never produces a 304, and a weak one can hide changes. Don't trust it.
        return "FETCH", "untrustworthy_validator -> hash-compare"
    if rec["etag"] is not None or rec["last_modified"] is not None:
        return "CONDITIONAL", "send If-None-Match / If-Modified-Since"
    if rec["content_hash"] is not None:
        return "FETCH", "no validator -> compare content_hash after fetch"
    return "FETCH", "no prior knowledge"

Read the order. The weak-ETag downgrade is first, on purpose — a record can have an ETag and still be untrustworthy, and if you check "has an ETag" before "is the ETag weak," you plan CONDITIONAL on a validator that lies. Order is the bug surface here.
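
To make the ordering bug concrete, here is a small sketch with a hypothetical wrong-order variant next to a right-order one. Both functions are trimmed illustrations (they return only the plan label), and the record values are placeholders.

```python
def plan_wrong_order(rec):
    """Hypothetical buggy variant: presence of a validator is checked first."""
    if rec["etag"] is not None or rec["last_modified"] is not None:
        return "CONDITIONAL"
    if rec["etag"] is not None and rec["etag_weak"]:
        return "FETCH"          # unreachable: the branch above already returned
    return "FETCH"


def plan_right_order(rec):
    """Same checks with the trust test first, mirroring plan_for."""
    if rec["etag"] is not None and rec["etag_weak"]:
        return "FETCH"
    if rec["etag"] is not None or rec["last_modified"] is not None:
        return "CONDITIONAL"
    return "FETCH"


trap = {"etag": 'W/"r-1"', "etag_weak": True, "last_modified": None}
print(plan_wrong_order(trap))   # CONDITIONAL
print(plan_right_order(trap))   # FETCH
```

The two functions contain identical conditions; only the sequence differs, which is why a test on a flagged record is worth keeping next to the planner.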

Then the simulation that counts the savings. It does not hit a network; it models each server's outcome from a fixed rule so the count is reproducible. Unchanged + CONDITIONAL → a 304 (no body). Changed + CONDITIONAL → a 200 with the new body. Anything FETCH → a body. Only two records actually changed since last run:

CHANGED_THIS_RUN = {  # the few records whose body really changed
    "https://shop.example.com/p/1002",
    "https://shop.example.com/p/1006",
}

def simulate(plan):
    bodies = 0          # full bodies transferred
    not_modified = 0    # 304s — zero body
    for item in plan:
        changed = item["url"] in CHANGED_THIS_RUN
        if item["plan"] == "CONDITIONAL":
            if changed:
                bodies += 1                 # 200 + new body
            else:
                not_modified += 1           # 304, zero body
        else:                               # FETCH (incl. weak-ETag fallback, new urls)
            bodies += 1
    return bodies, not_modified

The baseline it compares against is the dumb default: a full re-scrape downloads every record's body, every run. So fetches_saved = total − bodies_transferred.

The live run

Twelve records in scope: ten carried over from last run's manifest, two new this run. Seven have a trustworthy validator and get planned CONDITIONAL. Five get FETCH: two with a stored hash but no validator, two new URLs, and the one weak-ETag trap that got downgraded out of CONDITIONAL. Run the script and you get:

=== RE-SCRAPE PLANNER (deterministic synthetic manifest, not a real site) ===
records in scope          : 12  (10 from manifest + 2 new)
plan decided BEFORE any request:
  CONDITIONAL             : 7  (If-None-Match / If-Modified-Since)
  FETCH                   : 5
    of which weak-ETag fallback: 1  (untrustworthy validator -> hash-compare)
--------------------------------------------------------
simulated run outcomes:
  304 not-modified (no body): 5
  bodies transferred        : 7
--------------------------------------------------------
naive full re-scrape bodies : 12
planner bodies transferred  : 7
fetches_saved               : 5  (5/12 bodies not re-downloaded)
--------------------------------------------------------
per-record plan:
  1001  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1002  CONDITIONAL 200 (changed, body transferred) send If-None-Match / If-Modified-Since
  1003  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1004  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1005  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1006  CONDITIONAL 200 (changed, body transferred) send If-None-Match / If-Modified-Since
  1007  CONDITIONAL 304 (unchanged, no body)     send If-None-Match / If-Modified-Since
  1008  FETCH       200 (fetched body)           no validator -> compare content_hash after fetch
  1009  FETCH       200 (fetched body)           no validator -> compare content_hash after fetch
  1010  FETCH       200 (fetched body)           untrustworthy_validator -> hash-compare *TRAP*
  1011  FETCH       200 (fetched body)           new url (not in manifest)
  1012  FETCH       200 (fetched body)           new url (not in manifest)
========================================================
verdict: planned 7 conditional + 5 fetch; transferred 7 bodies vs 12 naive (5 saved).

Read what that output is saying:

CONDITIONAL: 7 and 304 not-modified: 5. Seven records were checked with a header round-trip; five of them came back 304 — confirmed unchanged, zero body transferred. Those five are the win. The naive scraper would have downloaded all five in full.
200 (changed, body transferred) on 1002 and 1006. The two records that actually changed still get their new body. A conditional GET costs nothing extra when the page did change; you just get a normal 200. The plan never hides a real update.
1010 *TRAP*. The weak-ETag record did not stay CONDITIONAL. The planner downgraded it to FETCH with untrustworthy_validator -> hash-compare, so it transfers the body and compares the hash, instead of trusting a rotating ETag that would have come back as a full 200 on every run.
fetches_saved: 5. Five of twelve bodies not re-downloaded, decided entirely before the first request. On a real corpus the unchanged share is usually far higher than 5/12 — but that's the n/d number I won't fabricate. The 5/12 here is what you can reproduce.
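
The arithmetic behind that verdict can be re-derived from the per-record lines alone. This sketch copies the record ids and plan labels from the output above and applies the same counting rule; the variable names are mine.

```python
# (record id, plan) pairs copied from the per-record output
PLAN = [
    ("1001", "CONDITIONAL"), ("1002", "CONDITIONAL"), ("1003", "CONDITIONAL"),
    ("1004", "CONDITIONAL"), ("1005", "CONDITIONAL"), ("1006", "CONDITIONAL"),
    ("1007", "CONDITIONAL"), ("1008", "FETCH"), ("1009", "FETCH"),
    ("1010", "FETCH"), ("1011", "FETCH"), ("1012", "FETCH"),
]
CHANGED = {"1002", "1006"}  # the two records whose body really changed

# A body crosses the wire for every FETCH and for every changed CONDITIONAL.
bodies = sum(1 for rid, plan in PLAN if plan == "FETCH" or rid in CHANGED)
# A 304 comes back only for an unchanged CONDITIONAL.
not_modified = sum(
    1 for rid, plan in PLAN if plan == "CONDITIONAL" and rid not in CHANGED
)
saved = len(PLAN) - bodies  # naive baseline transfers len(PLAN) bodies

print(bodies, not_modified, saved)   # 7 5 5
print(f"{saved / len(PLAN):.1%}")    # 41.7%
```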

Where this breaks (and I'm not overselling it)

A re-scrape planner is a work-reducer, not magic. The limits are the point.

Content-hash savings happen after the bytes, not before. The CONDITIONAL lever skips the download. The hash-compare lever only skips parsing and downstream work — you still pay for the body. On a source with no validators, you cut CPU and write amplification, not bandwidth. That's a real win, but a smaller one, and worth being honest about.
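
A sketch of the hash-compare lever, assuming SHA-256 as the digest (the post doesn't say which hash the manifest stores) and a hypothetical needs_downstream helper. The bytes are already paid for by the time this runs; it only decides whether parsing and writing happen.

```python
import hashlib


def body_hash(body: bytes) -> str:
    """Hex digest of a response body; SHA-256 is an assumed choice."""
    return hashlib.sha256(body).hexdigest()


def needs_downstream(stored_hash, body: bytes) -> bool:
    """True when the fetched body differs from what last run stored."""
    return stored_hash is None or body_hash(body) != stored_hash


old_body = b"<li>review A</li>"
stored = body_hash(old_body)

print(needs_downstream(stored, old_body))                        # False
print(needs_downstream(stored, old_body + b"<li>review B</li>"))  # True
print(needs_downstream(None, old_body))                          # True
```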

A moving timestamp in the body breaks naive hashing. If the page embeds a "last viewed" or a server time, the body hash changes on every fetch even when the data didn't. You end up hashing noise. The workaround is to hash a normalized projection — strip the volatile fields first — which is fine until the timestamp is the data you came for.
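
Hashing a normalized projection might look like the sketch below. The two volatile patterns (a data-last-viewed attribute and a server-time span) are invented for the example; a real page needs its own list, and the approach fails in exactly the case described above, when the stripped field is the data you want.

```python
import hashlib
import re

# Hypothetical volatile fragments; these patterns are assumptions.
VOLATILE = [
    re.compile(r'<span class="server-time">.*?</span>', re.S),
    re.compile(r'data-last-viewed="[^"]*"'),
]


def normalized_hash(html: str) -> str:
    """Hash the page with volatile fragments and extra whitespace removed."""
    for pattern in VOLATILE:
        html = pattern.sub("", html)
    html = re.sub(r"\s+", " ", html).strip()
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


a = ('<div data-last-viewed="2024-06-01T10:00:00Z"><p>4.5 stars</p>'
     '<span class="server-time">10:00:01</span></div>')
b = ('<div data-last-viewed="2024-06-02T11:30:00Z"><p>4.5 stars</p>'
     '<span class="server-time">11:30:07</span></div>')
c = b.replace("4.5 stars", "4.4 stars")

print(normalized_hash(a) == normalized_hash(b))  # True: only noise differed
print(normalized_hash(a) == normalized_hash(c))  # False: the rating changed
```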

Weak ETags are one liar; there are others. A server can return a stable strong ETag and still serve changed content (broken cache), or change content without touching Last-Modified. The downgrade catches the rotating case I hit. It does not make validators trustworthy in general — it makes you stop assuming they are.

So treat the planner as what it is: a cheap way to turn "re-download everything" into "touch what plausibly changed, confirm the rest with a header." It won't make a re-scrape free. It makes the dumb default stop being the default.

What to do Monday

Three moves, smallest first:

Persist a manifest. One row per record: url, the ETag and Last-Modified the server gave you, and a hash of the body you stored. It's a tiny JSON or SQLite table. If you don't keep it, every run starts blind and a full re-scrape is your only option — the manifest is what makes a plan possible at all.
Send conditional requests on the re-run path. If-None-Match from the stored ETag, If-Modified-Since from the stored Last-Modified. Honor a 304 as "unchanged, skip." Both requests and httpx let you set these headers per request and read the 304 status; neither adds them for you, so it's a few lines of your own code, and it's the one change that pays back the most.
Distrust weak and rotating validators. If an ETag starts with W/, or you see it change across two fetches of an unchanged page, downgrade that source to fetch-and-hash. Log the downgrade so you know which sources you can't precondition — a plan that knows what it can't trust beats one that trusts a liar.
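
For the first move, a manifest in SQLite can be this small. The table layout, column names, and helper names are assumptions for illustration; the sketch uses an in-memory database so it runs anywhere, and a file path would go in its place for real persistence.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS manifest (
    url           TEXT PRIMARY KEY,
    etag          TEXT,
    last_modified TEXT,
    content_hash  TEXT,
    etag_weak     INTEGER NOT NULL DEFAULT 0
)
"""


def open_manifest(path=":memory:"):
    """Open (or create) the manifest store."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row   # rows become addressable by column name
    conn.execute(SCHEMA)
    return conn


def remember(conn, url, etag=None, last_modified=None,
             content_hash=None, etag_weak=False):
    """Insert or overwrite the single row for this URL."""
    conn.execute(
        "INSERT OR REPLACE INTO manifest VALUES (?, ?, ?, ?, ?)",
        (url, etag, last_modified, content_hash, int(etag_weak)),
    )
    conn.commit()


def recall(conn, url):
    """Return last run's row as a dict, or None for a new URL."""
    row = conn.execute(
        "SELECT * FROM manifest WHERE url = ?", (url,)
    ).fetchone()
    return dict(row) if row else None


conn = open_manifest()
remember(conn, "https://shop.example.com/p/1001", etag='"v1"', content_hash="abc")
print(recall(conn, "https://shop.example.com/p/1001"))
print(recall(conn, "https://shop.example.com/p/9999"))  # None
```

A row that comes back as None is the "new URL" case; anything else has what the planner needs to decide before the request.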

You don't need a crawl framework rewrite. You need to stop deciding what to fetch after you've already fetched it.

One thing I haven't solved cleanly: how do you decide a record changed when the source gives you no validator and the body carries a moving timestamp inside it? I hash with the timestamp stripped — but that breaks the moment the timestamp is the data I came to collect, and I don't have a general rule for telling those two cases apart automatically. If you've got a heuristic that holds up in production, I want to hear it. I read every comment.

Follow for the next numbers from the run log. And tell me: what's the worst re-scrape waste you've found in your own pipeline — the job that was re-downloading the most for the least?

Written by Aleksei Spinov — I run production scrapers (2,190 runs across 32 actors; one Trustpilot scraper at 962). Proof: blog.spinov.online and my Apify profile.

AI disclosure: drafted with AI assistance, then edited, fact-checked, and the code run and verified by me. The manifest is synthetic and deterministic; the output above is real stdout from executing the script.
