What changed since the last scrape? A small change-detection layer (stdlib only)

#python #webscraping #showdev #opensource

Most of my scrapers answer one question: what's on the site right now. But that's almost never the question I actually have. What I care about is what changed since the last run. A new listing showed up, a price dropped, a product disappeared, a status flipped from open to closed. The current snapshot on its own doesn't tell me any of that.

For a while I rebuilt the same thing on every project: load last run's JSON, compare it to this run, work out what's new, what's gone, and what changed. It's never hard, but it's fiddly, and I kept getting the same details wrong (more on that below). So I pulled it into one small reusable piece and stopped rewriting it. It's called scrape-sentinel.

This post is about the design more than the tool, because the interesting part is the handful of decisions that make change detection annoying to get right.

The core idea

You give it the records from this run and a key, and it tells you what was added, removed, and changed since last time. For changed records, it tells you which fields moved and from what to what.

The diff itself is a plain function with no I/O:

from scrape_sentinel import diff

cs = diff(previous_records, current_records, key="sku", ignore_fields=["scraped_at"])

for r in cs.added:
    print("new:", r["sku"])
for changed in cs.changed:
    for d in changed.deltas:
        print(changed.key, d.field, d.old, "->", d.new)

Which prints something like:

new: W-104
W-101 price 39.0 -> 35.0
W-101 in_stock True -> False
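
To make the shape of that diff concrete, here's a minimal sketch of a keyed diff in plain Python. It is not scrape-sentinel's actual implementation: the names keyed_diff, ChangeSet, ChangedRecord, and FieldDelta are made up for illustration, and it assumes flat dict records with a single-field key.

```python
from dataclasses import dataclass, field as dc_field
from typing import Any


@dataclass
class FieldDelta:
    field: str
    old: Any
    new: Any


@dataclass
class ChangedRecord:
    key: Any
    deltas: list[FieldDelta]


@dataclass
class ChangeSet:
    added: list[dict] = dc_field(default_factory=list)
    removed: list[dict] = dc_field(default_factory=list)
    changed: list[ChangedRecord] = dc_field(default_factory=list)
    unchanged: int = 0


def keyed_diff(previous, current, key, ignore_fields=()):
    """Compare two lists of dicts by a key field (hypothetical helper, no I/O)."""
    ignored = set(ignore_fields)
    # Index both runs by key so the order of the input lists is irrelevant.
    prev_by_key = {rec[key]: rec for rec in previous}
    curr_by_key = {rec[key]: rec for rec in current}

    cs = ChangeSet()
    for k, rec in curr_by_key.items():
        if k not in prev_by_key:
            cs.added.append(rec)
            continue
        old = prev_by_key[k]
        # Union of field names from both versions, minus the ignored ones.
        fields = [f for f in {**old, **rec} if f not in ignored]
        deltas = [
            FieldDelta(f, old.get(f), rec.get(f))
            for f in fields
            if old.get(f) != rec.get(f)
        ]
        if deltas:
            cs.changed.append(ChangedRecord(k, deltas))
        else:
            cs.unchanged += 1
    cs.removed = [rec for k, rec in prev_by_key.items() if k not in curr_by_key]
    return cs


if __name__ == "__main__":
    previous = [
        {"sku": "W-101", "price": 39.0, "in_stock": True, "scraped_at": "run-1"},
        {"sku": "W-102", "price": 12.5, "in_stock": True, "scraped_at": "run-1"},
    ]
    current = [
        {"sku": "W-104", "price": 8.0, "in_stock": True, "scraped_at": "run-2"},
        {"sku": "W-101", "price": 35.0, "in_stock": False, "scraped_at": "run-2"},
    ]
    cs = keyed_diff(previous, current, key="sku", ignore_fields=["scraped_at"])
    for r in cs.added:
        print("new:", r["sku"])
    for changed in cs.changed:
        for d in changed.deltas:
            print(changed.key, d.field, d.old, "->", d.new)
```

Because both runs are indexed by key before anything is compared, feeding it the same records in a different order reports nothing added, removed, or changed.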

The details that kept biting me

A few decisions are the whole reason this is worth extracting instead of rewriting inline every time:

Match by key, not by position. This is the big one. If you diff two lists positionally, a re-sorted page or a reordered API response looks like every row changed. Matching on a stable key (one field or a few) means a reordered run shows zero changes, which is correct.
The first run is a baseline. With no previous snapshot, everything looks new, so the first run just records state and stays quiet instead of firing an alert for all 4,000 items.
Ignore the noisy fields. A scraped_at timestamp or a session token changes every single run. You drop those from the comparison, or restrict it to an allow-list of fields you actually care about.
Write snapshots atomically. The state file is written to a temp file and renamed, so a run that dies halfway can't leave you with a corrupted snapshot that breaks the next comparison.
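
The baseline and atomic-write points are easy to sketch with the standard library. This is only an illustration of the technique, not the library's code: save_snapshot, load_snapshot, and record_run are hypothetical names, and it assumes the snapshot is one JSON file holding a list of dicts.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_snapshot(path: Path, records: list[dict]) -> None:
    """Write records to a temp file in the same directory, then rename it over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(records, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see the old file or the new one
    except BaseException:
        # Clean up the partial temp file; the old snapshot is still intact.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def load_snapshot(path: Path) -> Optional[list[dict]]:
    """Return the stored records, or None when there is no snapshot yet."""
    try:
        with path.open(encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return None


def record_run(path: Path, records: list[dict]) -> Optional[list[dict]]:
    """Store this run and return the previous one (None means baseline run)."""
    previous = load_snapshot(path)
    save_snapshot(path, records)
    return previous


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as state_dir:
        snapshot = Path(state_dir) / "catalog.json"
        first = record_run(snapshot, [{"sku": "W-101", "price": 39.0}])
        print("baseline run, stay quiet" if first is None else "compare and alert")
        second = record_run(snapshot, [{"sku": "W-101", "price": 35.0}])
        print("baseline run, stay quiet" if second is None else "compare and alert")
```

Returning None for a missing snapshot keeps "no previous run" distinct from "previous run was empty", which is what lets the caller skip alerting on the baseline without hiding a real case where every item disappeared.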

Using it for real

In practice you want the diff plus the I/O around it: load the last snapshot, run your scraper, compare, alert, save the new snapshot.

from scrape_sentinel import (
    CallableSource, PipelineConfig, SnapshotStore,
    ConsoleAlerter, WebhookAlerter, run_once,
)

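def scrape() -> list[dict]:
    # Your requests / Playwright / API code goes here and returns a list of dicts.
    # fetch_products() is a placeholder for that code.
    return fetch_products()

# SLACK_URL is your Slack incoming-webhook URL, defined elsewhere
# (for example, read from an environment variable).
config = PipelineConfig(
    key="sku",
    ignore_fields=["scraped_at"],
    alerters=[
        ConsoleAlerter(title="catalog", key_fields=("sku",)),
        WebhookAlerter(SLACK_URL, title="catalog", key_fields=("sku",)),
    ],
)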

changes = run_once(CallableSource(scrape), SnapshotStore("./.state"), config)
print(changes.summary())   # {'added': 1, 'removed': 1, 'changed': 1, 'unchanged': 2}

Alerts go to the console, a webhook (Slack or Telegram), or a JSON change log. There's also a CLI with a --fail-on-change flag that reports changes through the exit code, so you can put it on a cron job or a CI step and have the next step run only when something actually moved:

scrape-sentinel run --source json:catalog.json --key sku --state ./.state --webhook "$SLACK_WEBHOOK"
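
On the webhook side, the moving parts are small enough to sketch with urllib alone. This is an assumption-heavy illustration rather than what WebhookAlerter does internally: build_webhook_request and send are hypothetical helpers, and the payload is the simple {"text": ...} JSON body that Slack incoming webhooks accept. Telegram's bot API expects a different shape (a chat_id plus text), so it would need its own builder.

```python
import json
import urllib.request


def build_webhook_request(
    url: str, title: str, lines: list[str]
) -> urllib.request.Request:
    """Build (but do not send) a Slack-style JSON POST for a list of change lines."""
    text = "\n".join([f"*{title}*: {len(lines)} change(s)", *lines])
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL; building the request does not touch the network.
    req = build_webhook_request(
        "https://example.invalid/hook",
        "catalog",
        ["W-101 price 39.0 -> 35.0", "W-101 in_stock True -> False"],
    )
    print(req.get_method(), req.full_url)
    print(json.loads(req.data)["text"])
```

Splitting "build the request" from "send it" keeps the formatting testable without a network, in the same spirit as keeping the diff free of I/O.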

What it is not

It's not a scraper. It doesn't crawl anything for you. You bring your own requests, Playwright, or API client and hand it a list of dicts, and it owns the diff, the alert, and the snapshot. It's also standard library only, with no dependencies, so dropping it into an existing project doesn't pull in a dependency tree. Keeping the diff a pure function is what made it easy to test heavily, and that's where most of the test suite lives.

Repo

MIT licensed: https://github.com/vinimabreu/scrape-sentinel

Honestly curious how other people handle this. Do you diff inside the database, keep snapshots on disk like this, hash each record, or something cleaner? It feels like the kind of thing everyone quietly rebuilds, so I'd like to know what I missed.
