DEV Community

Sanjay Chauhan

Your scraper says 200 OK. I measured how often it's lying.

You write a scraper. It hits a URL, gets back 200 OK, you check resp.status_code, it is 200, so you call save(resp) and move on. The pipeline runs nightly. Everything is green. You trust it, because the whole point of a status code is that it tells you what happened.

Three days later a downstream report looks subtly wrong. A column is empty, or a count is off. You start the long walk back to find out which page quietly handed you something other than the article you asked for. It was a login wall. Or a JavaScript app-shell with no content rendered yet. Or a soft-404 dressed up as a real page. The status code said success, your code believed it, and the junk got stored as data.

In 2026 a 200 OK is not ground truth. The body behind it can just as easily be an anti-bot challenge page, a login wall, a soft-404, or an empty JavaScript shell that never rendered. Retry logic keyed on the status code never notices the difference, so the junk is stored as if it were data.

I wanted to see this happen on real, named sites rather than argue it in the abstract, so I went and measured it.

What I found

I took three popular Python fetchers (requests, curl_cffi, scrapling), pointed them at a mix of control sites and protected ones, and ran 3 requests for each fetcher-site pair. Then came the part that matters: I captured each raw body and independently labeled what every fetcher actually got back by reading the stored bytes, not by trusting the status line and not by trusting veriscrape, the library introduced later in this post. Only after labeling did I compare. Every result was stable across all 3 requests.

A "silent failure" here means one specific thing: a 2xx response whose body is junk (a login wall, a JS app-shell, an empty page) that gets reported as success with no signal that anything is wrong. The cleanest, least-disputable case:

  • discord.com/app and web.telegram.org return 200 with an empty JavaScript app-shell: a mount point and a wall of scripts, zero server-rendered content. Every status-code-only fetcher (requests, curl_cffi, scrapling) stores that husk as a successful page. The HTML loads, the content does not.

This is a category-wide, structural blind spot, not a knock on any one tool. Any fetcher that decides success from the status line stores that 200 as good data, because a 200 with a skeleton in the body is, by every status-code measure, a success.
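To make the app-shell case concrete, here is a minimal sketch of how little visible text such a body carries. It uses only the standard library. The sample HTML and the `visible_text` helper are illustrative assumptions: they are not the captured Discord or Telegram bodies, and they are not veriscrape's detector.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring anything inside tags that carry no page content."""

    # noscript is skipped too: its fallback message is not the content you asked for.
    SKIP = {"script", "style", "noscript", "template", "head"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical app-shell: a mount point and scripts, nothing server-rendered.
SHELL = """<!doctype html>
<html><head><title>App</title>
<script src="/assets/app.js"></script></head>
<body><div id="app-mount"></div>
<noscript>You need to enable JavaScript to run this app.</noscript>
<script>window.__CONFIG__ = {"api": "/api"};</script>
</body></html>"""

# Hypothetical server-rendered page for contrast.
ARTICLE = """<!doctype html>
<html><head><title>Post</title></head>
<body><h1>Hello</h1><p>Server-rendered paragraph text.</p></body></html>"""

print(repr(visible_text(SHELL)))    # ''
print(repr(visible_text(ARTICLE)))  # 'Hello Server-rendered paragraph text.'
```

Both bodies would arrive with a 200. Only a look at the body tells them apart.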

The independent labeling earned its keep in an unflattering way, and I am keeping that on the record because it is the whole point. An earlier draft of this writeup reported one cell as a competitor's (scrapling) "silent failure" on g2.com. Re-labeling from the captured body showed that was wrong: it was a veriscrape false positive. The real, content-rich G2 homepage had come back (the anti-bot let the fetch through), and veriscrape had mislabeled it as a login wall. I fixed the detector (that homepage now classifies OK) and retracted the claim. That is the thesis turned on its author: the tool exists to flag silently-wrong data, and the discipline has to apply to its own output first. If it cannot survive that, it has no business judging anyone else's fetch.

Why retry logic cannot see it

Here is the shape of almost every fetch-and-store loop I have ever written or reviewed:

resp = fetcher.get(url)
if resp.status_code == 200:
    save(resp)        # looks fine, ships it
else:
    retry(url)        # fires only on a non-200 status, in practice 4xx / 5xx

The branch that matters never runs. A login wall is served with 200. A JS shell is served with 200. A DataDome gate can be served with 200. The status code is doing exactly what it is defined to do (the HTTP transaction succeeded), and your code is reading a meaning into it that was never there. So the if is true, save() runs, and the corruption is now in your store. There is no error, no exception, no log line.

You cannot fix this with more retries, because retrying a 200 login wall just gives you the same 200 login wall, stably, 3 of 3. The only way to catch it is to look at the body, the headers, and the cookies, and decide what the response is.
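As a rough sketch of what "look at the body and decide" changes in the loop above, the save can be gated on a classifier instead of on the status line. Everything here is hypothetical: the `Response` stand-in, the `Store`, and especially `naive_login_wall_check`, which keys on a single marker and is far too crude for real use. The point is only the shape of the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Response:
    """Stand-in for whatever your fetcher returns (hypothetical shape)."""
    url: str
    status_code: int
    headers: Dict[str, str]
    text: str


# A classifier maps a response to a short label; only "unflagged" gets stored.
Classifier = Callable[[Response], str]


def naive_login_wall_check(resp: Response) -> str:
    """Toy classifier (assumption): a password input in a 200 body means a login wall."""
    if resp.status_code != 200:
        return "http_error"
    if 'type="password"' in resp.text.lower():
        return "login_wall"
    return "unflagged"


@dataclass
class Store:
    saved: List[str] = field(default_factory=list)
    quarantined: Dict[str, str] = field(default_factory=dict)


def fetch_and_store(resp: Response, classify: Classifier, store: Store) -> str:
    label = classify(resp)
    if label == "unflagged":
        store.saved.append(resp.url)
    else:
        # Record the URL with its reason so the failure is loud, not silent.
        store.quarantined[resp.url] = label
    return label
```

With this shape, a 200 login wall lands in `quarantined` with a reason attached instead of in `saved`.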

The fix: get the bytes plus a verdict

That decision is what I built. It is a library called veriscrape: a verified-fetch primitive that returns the bytes plus a portable, deterministic trust verdict, so the moment your data is silently wrong you have a signal at the fetch layer instead of a wrong report three days later.

pip install veriscrape

import veriscrape

r = veriscrape.get("https://discord.com/app")

r.verdict      # 'EMPTY_SHELL'
r.cause        # 'js_app_shell'  (or 'datadome', 'login_wall', 'cloudflare_challenge', ...)
r.confidence   # 0.0 to 1.0
r.evidence     # the exact markers matched, for audit
r.ok           # True ONLY when r.verdict is OK

get() is a drop-in for requests.get. It fetches with curl_cffi (browser-like TLS, so you are not flagged as a bot on a TLS signal alone), then runs the deterministic classifier over the response. The verdict is one of:

OK  BLOCKED  CHALLENGE  HONEYPOT  SOFT_404  LOGIN_WALL  EMPTY_SHELL  UNVERIFIED

The taxonomy is the whole point. Instead of one boolean status_code == 200, you get a named reason for what the response actually is.
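One way a pipeline might use a named reason is a dispatch table, sketched below. The eight verdict names come from the list above. The action names and the `action_for` helper are assumptions for illustration and are not part of veriscrape.

```python
from enum import Enum


class Verdict(str, Enum):
    OK = "OK"
    BLOCKED = "BLOCKED"
    CHALLENGE = "CHALLENGE"
    HONEYPOT = "HONEYPOT"
    SOFT_404 = "SOFT_404"
    LOGIN_WALL = "LOGIN_WALL"
    EMPTY_SHELL = "EMPTY_SHELL"
    UNVERIFIED = "UNVERIFIED"


# Hypothetical pipeline policy: each named reason gets its own handling.
ACTIONS = {
    Verdict.OK: "store",
    Verdict.BLOCKED: "back_off",
    Verdict.CHALLENGE: "back_off",
    Verdict.HONEYPOT: "drop",
    Verdict.SOFT_404: "mark_missing",
    Verdict.LOGIN_WALL: "needs_auth",
    Verdict.EMPTY_SHELL: "render_with_browser",
    Verdict.UNVERIFIED: "quarantine",
}


def action_for(verdict: str) -> str:
    """Map a verdict string to an action; unknown strings are quarantined, never stored."""
    try:
        return ACTIONS[Verdict(verdict)]
    except ValueError:
        return "quarantine"
```

The design choice worth copying is the fallback: anything the table does not recognize is quarantined rather than stored.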

If you already have a fetch stack, you do not have to replace it. Classify what you already pulled, without re-fetching:

from veriscrape.adapters import from_requests, from_response
# from_requests(resp) for a requests.Response
# from_response(...) for httpx, Playwright, or any stack

There is also a Scrapy middleware (VeriscrapeMiddleware) and a CLI (veriscrape check <url>, exit code 0 for OK or UNVERIFIED, 1 on a problem) for pipelines and CI.

How it works under the hood

It is deterministic. No LLM. The verdict is computed from status, headers, cookies, and body, which means it is reproducible and auditable: r.evidence shows you exactly which markers matched, so you can argue with any verdict.

The core rule I keep coming back to is the two-key rule. A vendor fingerprint alone is not a verdict. Server: cloudflare, a cf-ray header, a _px cookie, an x-kpsdk-* header: all of these show up on perfectly normal allowed pages too. If you treat vendor presence as "challenge," you will flag half the internet. So a real verdict needs two keys: the vendor gate and a challenge-or-block-specific marker on a genuine mitigation response. One key without the other is not a verdict.
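The two-key rule can be sketched for a single vendor as follows. This is not veriscrape's detector. The markers are assumptions chosen for illustration: `Server: cloudflare` or a `cf-ray` header as the vendor key, and a `cf-mitigated: challenge` header or a "Just a moment..." title as the challenge key.

```python
from typing import Mapping, Optional


def _lower_headers(headers: Mapping[str, str]) -> dict:
    """Normalize header names and values so lookups are case-insensitive."""
    return {k.lower(): v.lower() for k, v in headers.items()}


def vendor_key(headers: Mapping[str, str]) -> bool:
    """Key 1: the response passed through the vendor at all."""
    h = _lower_headers(headers)
    return h.get("server") == "cloudflare" or "cf-ray" in h


def challenge_key(headers: Mapping[str, str], body: str) -> bool:
    """Key 2: a marker expected only on a mitigation response (assumed markers)."""
    h = _lower_headers(headers)
    if h.get("cf-mitigated") == "challenge":
        return True
    return "<title>just a moment...</title>" in body.lower()


def classify_cloudflare(headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return 'CHALLENGE' only when both keys turn; otherwise give no verdict."""
    if vendor_key(headers) and challenge_key(headers, body):
        return "CHALLENGE"
    return None  # one key without the other is not a verdict


# A normal page served through the vendor: vendor key only, so no verdict.
allowed = classify_cloudflare(
    {"Server": "cloudflare", "CF-RAY": "abc123-LHR"},
    "<html><head><title>Docs</title></head><body><p>Welcome.</p></body></html>",
)
print(allowed)  # None
```

Returning `None` rather than a negative verdict keeps this detector from speaking when it has only half the evidence.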

Coverage today is 14 negative detectors plus the affirmative OK detector. The negatives are 7 anti-bot vendors (Cloudflare, DataDome, Akamai, PerimeterX/HUMAN, Kasada, Imperva/Incapsula, F5 BIG-IP ASM), 3 CAPTCHA gates (reCAPTCHA, Turnstile, hCaptcha), and honeypot, login-wall, soft-404, and empty-shell. The affirmative OK detector is the one that ships a green light: it is the only path to r.ok being True, and it is deliberately the hardest to earn (more on that below).

Every detector ships allowed-page fixtures: real pages from the same vendors that are not challenges. The test suite fails if any of those fixtures trips a verdict. The whole product is the claim "I will not lie to you," so the false-positive gate is the part I care about most.

The honest caveat

Read this part before you adopt anything.

veriscrape prefers abstaining to guessing, and the affirmative OK verdict that ships today is built around that. get() returns a positive OK only for a 200 whose body is a real document (it has a <title>) with substantial server-rendered visible text, the inverse of an empty shell. Anything short, ambiguous, or disqualified comes back UNVERIFIED, not OK, and r.ok is True only on that affirmative OK.

What that means in practice: a padded soft-404, a paywall teaser, a geo, maintenance, or age-gate page, or a suspended or error page served as a 200 does not get blessed. Each of these stays UNVERIFIED. The detector keys on affirmative evidence (real document, substantial server-rendered text) and disqualifies long-but-bad pages, because length alone is not proof of content.

That is on purpose. The design rule is abstain over guess: I would rather return UNVERIFIED than emit a confident-but-wrong OK, because a confident-but-wrong success is the exact failure this whole thing exists to prevent. It is the failure mode that costs you three days, and I am not going to reproduce it inside the tool meant to catch it. UNVERIFIED is a real verdict, and r.ok is False for it. If that tradeoff does not fit your pipeline, that is fair, and now you know it up front.
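A minimal sketch of the abstain-over-guess shape, assuming a made-up threshold and made-up disqualifier phrases. veriscrape's actual rules and numbers are not stated in this post, so treat `MIN_VISIBLE_CHARS`, `DISQUALIFIERS`, and `affirmative_verdict` as hypothetical. The sketch models only the OK and UNVERIFIED outcomes, not the negative detectors.

```python
from html.parser import HTMLParser

MIN_VISIBLE_CHARS = 500  # assumed threshold for "substantial" text
DISQUALIFIERS = ("page not found", "subscribe to continue", "under maintenance")  # assumed


class _Doc(HTMLParser):
    """Collect the <title> and the visible text outside script-like tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip = 0
        self._in_title = False
        self.title = ""
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())


def affirmative_verdict(status: int, body: str) -> str:
    """Return 'OK' only on affirmative evidence; abstain with 'UNVERIFIED' otherwise."""
    if status != 200:
        return "UNVERIFIED"  # simplification: this sketch has no negative verdicts
    doc = _Doc()
    doc.feed(body)
    doc.close()
    visible = " ".join(doc.text)
    if not doc.title.strip():
        return "UNVERIFIED"  # not a real document
    if len(visible) < MIN_VISIBLE_CHARS:
        return "UNVERIFIED"  # too short to count as content
    if any(marker in visible.lower() for marker in DISQUALIFIERS):
        return "UNVERIFIED"  # long but bad: length alone is not proof
    return "OK"
```

Every early return abstains. There is exactly one path to OK, and it is the last line.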

Try it, and please try to break it

pip install veriscrape

Reproduce it yourself:

uv run --extra benchmark python -m benchmark.run

What I most want is for you to break it. If you find a false positive (a normal page that gets flagged) or a false negative (junk that comes back UNVERIFIED when a detector should have caught it), open an issue with the URL. The detectors are pure functions of (status, headers, body), so a captured response is enough to reproduce and add as a fixture, and the evidence dict will tell us both exactly which marker tripped. That is the conversation I want to have.
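If you want to attach a captured response to an issue, something like the sketch below would preserve the (status, headers, body) triple byte for byte. The JSON layout and the `save_fixture` / `load_fixture` names are assumptions, not the project's own fixture format, so check the repository before relying on them.

```python
import base64
import json
from pathlib import Path
from typing import Dict, Tuple


def save_fixture(path: Path, status: int, headers: Dict[str, str], body: bytes) -> None:
    """Write one captured response to a JSON file (hypothetical layout)."""
    record = {
        "status": status,
        # A plain dict drops repeated headers such as multiple Set-Cookie lines.
        "headers": headers,
        # Base64 keeps the raw bytes intact regardless of the page's encoding.
        "body_b64": base64.b64encode(body).decode("ascii"),
    }
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")


def load_fixture(path: Path) -> Tuple[int, Dict[str, str], bytes]:
    """Read a fixture back as the same (status, headers, body) triple."""
    record = json.loads(path.read_text(encoding="utf-8"))
    return record["status"], record["headers"], base64.b64decode(record["body_b64"])
```

Storing the body as bytes rather than decoded text matters here, because a detector that reads the body should see exactly what the server sent.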


I am Sanjay Chauhan. I build reliability and data-integrity primitives for data pipelines. veriscrape is open source under Apache-2.0: https://github.com/san64777/veriscrape . Reach me at san64777@gmail.com.
