Alex Spinov

Posted on Jun 1 • Originally published at blog.spinov.online

Your Scraper Returned a Clean Row. It Was Wrong.

#ai #python #webscraping #llm

The row looked perfect. rating: 7. Valid JSON, right type, no nulls, no missing keys. My schema check waved it through. The page had returned HTTP 200. The selectors hadn't moved. Everything green.

A rating of 7 on a 5-star site is impossible. The model invented it, formatted it correctly, and handed it to me with total confidence.

That's the failure I want to talk about. Not the scraper that breaks loudly. The one that hands you a clean-looking row that is quietly, plausibly false — and sails past every check you have, because your checks are all looking at the shape of the data, and the lie is in the value.

TL;DR

HTTP 200, intact selectors, and valid JSON tell you the form is fine. They say nothing about whether the value is true.
When an LLM extracts from messy free-text, structured-output mode guarantees you get valid JSON. It does not guarantee the content is real. The model fills uncertain fields rather than leaving them empty — because the schema demands a complete row.
A ~60-line value-level sanity gate (ranges, dates, cross-field, reference, language) catches the obvious lies before they hit your database. Real code and real output below.
The honest catch: this gate catches rule violations, not plausible lies inside the allowed range. A rating: 4 where the truth is 2 slides right through. I'll be specific about where the gate stops.

Two different ways a scraper lies to you

I wrote about source drift last week — the case where the page changes underneath you and a 30-line schema check catches the structure shifting. That's an input problem. The source mutated; your agreement with the page broke; you detect it by watching the shape.

This is the other end of the pipe. The source is fine. The page is intact, the selectors are correct, the structure is exactly what you expected. The thing that lied to you is the model, on the extraction step, when you asked it to pull structured fields out of a paragraph of human prose.

Those two failures feel similar and they are not. One is "the grammar of the page changed." The other is "the grammar is identical, but the fact is wrong." A schema check is built for the first one and blind to the second. I learned that the annoying way, by trusting a green schema check and shipping a row that was structurally flawless and semantically garbage.

Why structured output makes this worse, not better

Here's the part that surprised me. Turning on response_format: json_schema (or Bedrock's tool-result schemas, or whatever your stack calls it) feels like it should fix hallucination. It does the opposite for value correctness.

Paul SANTUS put it cleanly in his Dev.to piece on May 29 ("LLMs suck at generating large, structured data"): structured output modes "help with syntax — you'll get valid JSON. But they don't solve the semantic problem." And the kicker, which matches exactly what I see on real pages: "Models are more likely to hallucinate when producing structured output in one shot. They fill in fields they're uncertain about rather than leaving them empty, because the structure demands completeness."

Read that again, because it's the whole article in one sentence. The schema demands a complete row. So when the model is unsure what the rating was, it doesn't return null. Null would feel like failure. It returns a number. A confident, well-formatted, in-the-right-place number. Sometimes that number is 7.

The schema got what it asked for: a complete, valid object. It just got a fabricated value inside it.

There's a downstream version of this that's even nastier. israelhen153 wrote it up on May 26, title literally "Sonnet hallucinated. My agent stored it as fact." His memory layer took an incorrect model denial, extracted it, and tagged it [fact] in a SQLite table with no verification. The agent then repeated the false claim in later sessions. Same shape as my scraper bug: a model said something untrue with structural confidence, and a dumb downstream layer wrote it down as truth. Nobody validated the value at the boundary before it became a permanent record.

Where my numbers come from (and where they don't)

I run scrapers in production. The Trustpilot scraper on my Apify profile has 962 runs; across all my actors it's about 2,190 lifetime. That's real-world usage, not a lab.

I want to be careful here, because this is exactly the spot where it's tempting to fabricate. Those runs are HTTP scraping, not a controlled LLM benchmark. I do not have a datestamped log that says "on run 412, the model returned rating=7 for vendor X." I never instrumented a clean before/after experiment on the extraction step, so I'm not going to pretend I have a hit-rate number for you. Anyone who hands you a precise "LLMs hallucinate N% of extracted fields" from production scraping without a controlled setup is guessing with decimals.

What the volume does buy me is the right to talk about the classes of silent corruption, because I've cleaned up after all of them by hand. The examples below are constructed to illustrate those failure classes — they are not transcribed incidents. The shape of each one is real; the specific row is mine, written to make the rule obvious.

The classes I keep hitting:

A rating outside 1–5. The model read a number off the wrong part of the page, or invented one.
A review date in the future. Free-text dates are a swamp ("last Tuesday", "vor 3 Tagen", "2 days ago"), and a model normalizing them will occasionally produce a date that hasn't happened yet.
A verified flag set true with nothing backing it. The word "verified" appeared somewhere on the page; the model decided that meant the review was verified.
Scraped count wildly disagreeing with the displayed count. The page says 40 reviews; the extracted object claims 500. One of those is fiction.
Country says US, the review text is plainly German. The model copied a locale field from the wrong block.

Every one of these is valid JSON. Every one passes a schema check. Every one is wrong.
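A minimal sketch of what "passes a schema check" means for rows like these, assuming a simple key-and-type contract. The `EXPECTED_TYPES` map, the `shape_ok` helper, and the sample row are hypothetical, written only to illustrate the point; they are not the schema check from the earlier post.

```python
import json

# Hypothetical shape contract: required keys and their Python types.
EXPECTED_TYPES = {
    "rating": int,
    "review_date": str,
    "verified": bool,
    "country": str,
    "text": str,
}

def shape_ok(row):
    """Return True when every expected key is present with the expected type."""
    return all(
        key in row and isinstance(row[key], typ)
        for key, typ in EXPECTED_TYPES.items()
    )

# Constructed example: valid JSON, right keys, right types, implausible values.
raw = (
    '{"rating": 7, "review_date": "2999-01-01", "verified": true, '
    '"country": "US", "text": "Das Produkt ist sehr gut und nicht teuer."}'
)
bad_row = json.loads(raw)

print(shape_ok(bad_row))  # True: the shape is fine, the values are not
```

Nothing in a check of this kind ever looks at whether 7 is a possible rating, which is why it stays green.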

The fix: check the value, not the shape

The cheap, boring, effective move is a second gate that runs after parsing and before writing to your database. It doesn't ask "is this a complete object with the right keys and types?" — your schema check already did. It asks "is this value plausible for what it claims to be?"

stdlib only. json, re, datetime. No model call, no network, no API key. It's the kind of thing you can drop in this afternoon. Here's the core — I'll paste the rules and then the real output from running it.

import re
from datetime import date

def _rating_in_range(row):
    r = row.get("rating")
    if r is None:
        return None  # absence is a shape problem; we judge values here
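    # bool is an int subclass and NaN fails both < and >, so test the range positively.
    if isinstance(r, bool) or not isinstance(r, (int, float)) or not (1 <= r <= 5):
        return ("RANGE", f"rating={r!r} outside [1,5]")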
    return None

def _date_not_in_future(row):
    raw = row.get("review_date")
    if raw is None:
        return None
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(raw)):
        return ("DATE_FORMAT", f"review_date={raw!r} is not YYYY-MM-DD")
    try:
        y, m, d = (int(x) for x in str(raw).split("-"))
        parsed = date(y, m, d)  # raises on 2026-02-31, etc.
    except ValueError:
        return ("DATE_FORMAT", f"review_date={raw!r} is not a real calendar date")
    if parsed > date.today():
        return ("FUTURE_DATE", f"review_date={raw} is in the future")
    return None

def _verified_flag_consistent(row):
    if row.get("verified") is True and not row.get("verification_token"):
        return ("CROSS_FIELD", "verified=True but no verification_token")
    return None

def _counts_agree(row):
    scraped = row.get("review_count_scraped")
    shown = row.get("review_count_displayed")
    if isinstance(scraped, int) and isinstance(shown, int) and shown >= 0:
        if scraped > shown * 2 + 10:
            return ("REFERENCE", f"scraped {scraped} >> displayed {shown}")
    return None

The language check is the crudest one, and I'm leaving it crude on purpose: a tiny word-frequency hint, not a real language detector. It exists to catch the "country says US, text is obviously German" case and nothing fancier:

_LANG_HINTS = {
    "de": (" der ", " und ", " nicht ", " ich ", " sehr "),
    "en": (" the ", " and ", " not ", " very ", " with "),
    # fr, es, ... add as needed
}
_COUNTRY_LANG = {"US": "en", "GB": "en", "DE": "de"}

def _language_matches_country(row):
    country, text = row.get("country"), row.get("text")
    if not country or not text:
        return None
    expected = _COUNTRY_LANG.get(country)
    if not expected:
        return None  # no opinion about this country; stay quiet
    t = f" {text.lower()} "
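    scores = {lang: sum(t.count(h) for h in hints)
              for lang, hints in _LANG_HINTS.items()}
    best = max(scores, key=scores.get)
    # Flag only when another language strictly out-scores the expected one;
    # a tie (or no hits at all) stays quiet instead of depending on dict order.
    if scores[best] > scores.get(expected, 0):
        return ("LANG_COUNTRY", f"country={country} but text looks like '{best}'")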
    return None

And the gate that runs them all:

RULES = (_rating_in_range, _date_not_in_future,
         _verified_flag_consistent, _counts_agree, _language_matches_country)

def sanity_violations(row):
    out = []
    for rule in RULES:
        result = rule(row)
        if result is not None:
            out.append([result[0], result[1]])
    return out

That's it. An empty list means the row is plausible by these rules. Hold that thought — it's doing a lot of work, and I'll come back to how much it isn't doing.

Running it

I built one GOOD row (everything sane) and one BAD row where every field has the correct type and parses as valid JSON, but every value is something the model should never have returned. Then I ran the file. No edits to the output — this is what python3 field_sanity.py printed on Python 3.13, stdlib only:

GOOD: CLEAN []
BAD:  [
  [
    "RANGE",
    "rating=7 outside [1,5]"
  ],
  [
    "FUTURE_DATE",
    "review_date=2027-01-01 is in the future"
  ],
  [
    "CROSS_FIELD",
    "verified=True but no verification_token"
  ],
  [
    "REFERENCE",
    "scraped 500 >> displayed 40"
  ],
  [
    "LANG_COUNTRY",
    "country=US but text looks like 'de'"
  ]
]

The good row comes back clean. The bad row, with its five plausible-looking, schema-passing fields, comes back as a list of five specific value violations. Each one is a thing you can log, alert on, quarantine, or refuse to write. None of them would have been caught by checking that the keys exist and the types are right.

You wire it in at exactly the spot Paul SANTUS called the boundary: after the model produces the object, before it touches your database.

import json

# model_output, quarantine() and db stand in for your own pipeline objects.
row = json.loads(model_output)        # parse; your schema/shape check runs here
problems = sanity_violations(row)     # value check happens here
if problems:
    quarantine(row, problems)         # do NOT write a confident lie to prod
else:
    db.insert(row)

Validation at the boundary, as he put it, doesn't lose any data. A rejected row goes to a quarantine table with its violation list attached, not into the void. You decide later whether it was a real edge case or a hallucination. What you don't do is let it become a [fact].
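One way the quarantine table could look, as a minimal sketch using stdlib `sqlite3`. The table name, column names, and both helper functions are assumptions made for illustration, and this `quarantine` takes an explicit connection argument that the wiring snippet above does not have.

```python
import json
import sqlite3

def open_quarantine(path=":memory:"):
    """Create (if needed) a hypothetical quarantine table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quarantine ("
        " id INTEGER PRIMARY KEY,"
        " row_json TEXT NOT NULL,"
        " violations_json TEXT NOT NULL,"
        " quarantined_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn

def quarantine(conn, row, problems):
    """Store the rejected row together with its violation list; nothing is dropped."""
    conn.execute(
        "INSERT INTO quarantine (row_json, violations_json) VALUES (?, ?)",
        (json.dumps(row, ensure_ascii=False), json.dumps(problems)),
    )
    conn.commit()

conn = open_quarantine()
quarantine(conn, {"rating": 7}, [["RANGE", "rating=7 outside [1,5]"]])
stored = conn.execute(
    "SELECT row_json, violations_json FROM quarantine"
).fetchone()
print(stored)
```

Keeping the raw row and the violations side by side is what lets you decide later whether a rejection was a real edge case or a hallucination.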

What this does NOT catch (read this before you trust it)

Now the honest part, because a gate that you over-trust is worse than no gate.

This catches rule violations. It does not catch a plausible lie that lives inside the allowed range. If the true rating was 2 and the model returned 4, my check sees 4 ∈ [1,5] and waves it through, smiling. That's the genuinely scary failure — confidently wrong, perfectly in-bounds — and a range check is useless against it. There's no cheap stdlib trick for that one. You're into reference data, re-extraction with a second model, or sampling-and-human-review territory, and all of those cost real money.
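For the sampling-and-human-review option, a minimal sketch of how rows that pass the gate could be picked for a manual audit. The `in_audit_sample` name, the 2% default rate, and the `review_id` field are all hypothetical; the only idea here is a stable, hash-based sample, and it still costs reviewer time.

```python
import hashlib
import json

def in_audit_sample(row, rate=0.02):
    """Deterministically pick about `rate` of rows for human review.

    Hashing the canonical JSON means the same row always gets the same
    decision, so re-running the pipeline does not reshuffle the sample.
    """
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    # First 8 bytes as an integer in [0, 2**64), scaled to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Hypothetical rows that all passed the value gate.
rows = [{"review_id": i, "rating": 4} for i in range(10_000)]
sampled = [r for r in rows if in_audit_sample(r)]
print(len(sampled))  # about 2% of the rows
```

A sample like this does not stop an in-range lie from being written; it only gives you a way to estimate how often one gets through.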

The rules are hand-written and domain-specific. "rating ∈ [1,5]" is a review-scraper rule. It means nothing for a price field or a stock count. You will write your own rules per domain, and you'll get some of the thresholds wrong on the first pass. My scraped > shown * 2 + 10 is a guess at "wildly disagrees"; tune it on your data or it'll false-positive on legitimately fast-growing pages.
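To see what that threshold actually tolerates, here is a small sketch that pulls the two constants out as parameters. The `counts_disagree` name and the `factor` and `slack` parameters are introduced here for illustration; the defaults reproduce the `scraped > shown * 2 + 10` rule.

```python
def counts_disagree(scraped, shown, factor=2, slack=10):
    """True when scraped exceeds the tolerated ceiling of shown * factor + slack."""
    return scraped > shown * factor + slack

# With the defaults, a page displaying 40 reviews tolerates up to 90 scraped.
print(counts_disagree(500, 40))          # True: 500 > 90
print(counts_disagree(90, 40))           # False: 90 is not greater than 90

# A small page that shows 3 reviews but yields 17 trips the default ceiling of 16...
print(counts_disagree(17, 3))            # True: 17 > 16

# ...and passes with a looser, hypothetical slack of 25 (ceiling 31).
print(counts_disagree(17, 3, slack=25))  # False: 17 <= 31
```

The additive slack dominates on small pages and the multiplier dominates on large ones, so both are worth checking against your own data.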

The language check is a toy. Five stopwords per language. It'll catch German-where-you-expected-English and it'll completely miss anything subtler. I left it crude because a heavier dependency wasn't worth it for the one case it needs to handle — but don't mistake it for language detection. If you need real detection, pull in a library and accept the weight.

It's per-field. It has no idea whether the review as a whole makes sense. A row can pass every rule and still be a fabricated review of a product that doesn't exist. Per-field plausibility is a floor, not a ceiling.

So: this is a cheap, high-value first line of defense that turns a class of silent corruption into a loud, logged rejection. It is not a truth oracle. Anyone who tells you a stdlib validator makes LLM extraction "safe" is selling you the same false confidence the model gave you with rating: 7.

The shift that actually matters

The mental move here is small and it changed how I ship extraction. HTTP 200 means the request succeeded. Valid JSON means the response is well-formed. Neither one means the data is true. Those are three separate claims, and I used to collapse them into one green light.

A model that has to fill a required field will fill it. The structure you added for safety is the same structure that pressures it to invent. So you put one more cheap gate at the boundary, you check values and not just shapes, and you stop letting confident-and-wrong rows become permanent records. It won't catch everything. It'll catch the embarrassing ones, the impossible-rating-7 ones, before they're in your database with everyone downstream treating them as fact.

The hard question I'm still chewing on: how do you catch the in-range lie — the 4 that should be 2 — without paying for a second extraction or a human in the loop? I don't have a cheap answer. If you've found one that holds up in production, I genuinely want to read it.

Follow for the numbers from the next batch of production runs — and tell me the worst plausible-but-wrong field a scraper ever handed you. I read every comment.

Written with AI assistance; all code was run locally on Python 3.13 (stdlib only) and the output above is the real, unedited result. The production figures (962 / 2,190 runs) are HTTP-scraping context, not a controlled LLM benchmark, and the failure rows are constructed examples of classes I've cleaned up after — not a datestamped incident log.