Ismail Haddou

Posted on Jun 5

Your Scraper Returns 200 OK and Lies. Here's How to Catch It.

#ai #cybersecurity #dataengineering #webscraping

For a decade, scraping defense was about blocking. Status codes, IP bans, captchas. You knew when you were blocked because your scraper threw errors.

That mental model is now broken. In 2026, the dominant anti-scraping technique is no longer to block. It is to deceive. Cloudflare's AI Labyrinth, in production since 2025, detects suspected crawlers and serves them realistic AI-generated content on a 200 OK response. Same theme, same DOM structure, same field types as the real site. The only thing fake is the data.

Your scraper does not know it has been targeted. Your pipeline does not know it is loading lies. Your downstream model trains on a partially fabricated corpus.

Here is the technical breakdown of the problem and what to actually build to defend against it.

Why Every Existing Check Passes

Run through your standard scraping observability stack and ask which check catches Labyrinth content.

HTTP status code monitor: 200 OK
Response time anomaly detector: response time is normal, content is pre-generated and cached
Captcha challenge detector: no captcha served
Schema validator: fields present, correctly typed
Field completeness rate: 100 percent
Record count vs daily baseline: stable

None of these catches it. The stack was designed to detect HTTP-level or structural failures. Labyrinth is a content-level attack, so it passes every one of those checks.

The conceptual fix is to stop conflating "request succeeded" with "data is true." Those are now distinct verification problems.

A Trust Layer Architecture

The architecture I deploy with clients separates collection from promotion. Scraped records do not flow directly to the data warehouse or search index. They land in a staging table where trust checks run before promotion.

[Scraper] -> [Raw Staging] -> [Trust Layer] -> [Warehouse]
                                   |
                                   +--> [Quarantine]

The trust layer runs five classes of checks. You do not need all five on every record. Sample intelligently.
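A minimal sketch of the staging-to-promotion flow, assuming each trust check is a callable that takes a record and returns a verdict string. The names `promote_batch` and `TrustResult`, and the set of blocking verdicts, are illustrative choices for this sketch, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable

# Verdicts that block promotion. These labels are assumptions for the sketch.
BLOCKING = {"disagree", "ungrounded", "inconsistent"}

@dataclass
class TrustResult:
    promoted: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)

def promote_batch(staged: Iterable[dict], checks: list[Callable[[dict], str]]) -> TrustResult:
    """Route each staged record to promotion or quarantine."""
    result = TrustResult()
    for record in staged:
        verdicts = [check(record) for check in checks]
        failed = [v for v in verdicts if v in BLOCKING]
        if failed:
            # Keep the failing verdicts next to the record for later review.
            result.quarantined.append((record, failed))
        else:
            result.promoted.append(record)
    return result
```

Nothing reaches the warehouse unless every check that ran came back non-blocking. Verdicts such as "unverifiable" pass through here; whether they should is a policy decision per field.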

Layer 1: Cross-Source Validation

For any field that drives business outcomes, scrape it from two independent sources whose anti-bot systems are unlikely to be coordinated. Compare.

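def cross_source_check(record, alt_source_fetcher, tolerance=0.02):
    """Returns (verdict, score, alt_value).

    score is the agreement confidence for "verified" and the relative
    difference for a numeric "disagree".
    """
    alt = alt_source_fetcher(record.entity_id)
    if alt is None:
        return ("unverifiable", 0.0, None)

    if record.field_type == "numeric":
        # Relative difference; the floor avoids division by zero.
        diff = abs(record.value - alt.value) / max(abs(record.value), 1e-9)
        if diff <= tolerance:
            return ("verified", 1.0 - diff, alt.value)
        return ("disagree", diff, alt.value)

    if record.field_type == "string":
        # Case- and whitespace-insensitive exact match.
        if record.value.strip().lower() == alt.value.strip().lower():
            return ("verified", 1.0, alt.value)
        return ("disagree", 0.0, alt.value)

    # Unknown field types cannot be compared; do not fall through to None.
    return ("unverifiable", 0.0, alt.value)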

This is the single highest-leverage check you can run. It catches most Labyrinth content immediately because fabricated values do not coincide with anything observable elsewhere on the open web.

Layer 2: Entity Grounding

Maintain a registry of stable entities with known values. Company HQ addresses, ISBN to title mappings, product UPCs, executive names at top 500 companies. Anything that changes slowly and that you can verify independently.

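def normalize(value):
    """Assumed minimal normalizer: lowercase and collapse whitespace.
    Swap in field-specific rules (addresses, ISBNs) as needed."""
    return " ".join(str(value).lower().split())

class EntityRegistry:
    def __init__(self, store):
        self.store = store  # k/v of canonical values

    def check(self, entity_type, entity_id, observed_value):
        canonical = self.store.get(f"{entity_type}:{entity_id}")
        if canonical is None:
            return "no_ground_truth"
        if normalize(canonical) == normalize(observed_value):
            return "grounded"
        return "ungrounded"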

When a scrape returns a value for a grounded entity that disagrees with the registry, the source is suspect. Requeue with a different fingerprint and IP class. If three independent attempts disagree, mark the source as compromised for this batch.
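The requeue rule can be sketched as below. It assumes a hypothetical `fetch_with_new_identity(entity_id, attempt)` callable that performs one scrape with a fresh fingerprint and IP class, and a `registry_check(entity_id, observed)` callable that returns the same verdict strings as the registry above. Both names, and the return labels, are illustrative.

```python
def verify_source(entity_id, fetch_with_new_identity, registry_check, max_attempts=3):
    """Retry a grounded entity with a fresh identity on each attempt.

    Returns "ok" as soon as one attempt matches the registry,
    "unverifiable" if the registry has no ground truth for the entity,
    and "compromised" if every attempt comes back ungrounded.
    """
    for attempt in range(max_attempts):
        observed = fetch_with_new_identity(entity_id, attempt)
        verdict = registry_check(entity_id, observed)
        if verdict == "grounded":
            return "ok"
        if verdict == "no_ground_truth":
            # Nothing to compare against, so retrying cannot help.
            return "unverifiable"
    return "compromised"
```

A "compromised" result applies to the current batch only, so the source gets re-tested on the next run rather than being blacklisted permanently.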

Layer 3: Distributional Anomaly Detection

LLM-generated content has statistical fingerprints. Values cluster in unnatural ranges. Vocabulary skews toward common tokens. Dates land on plausible-looking round values instead of following the natural distribution of the source.

A KL divergence check against a rolling baseline catches this at the batch level.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms; eps keeps empty bins out of log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def batch_drift_check(current_batch, baseline_distribution, threshold=0.15):
    """Detect drift in a numeric field's distribution vs the baseline."""
    # Reuse the baseline's bin edges so the two histograms are comparable.
    # np.histogram ignores values that fall outside the edge range.
    hist, _ = np.histogram(current_batch, bins=baseline_distribution["edges"])
    drift = kl_divergence(hist, baseline_distribution["counts"])
    return drift, drift > threshold

Set the threshold per field based on observed historical variance. Tune by replaying a month of known-good data and choosing the 99th percentile drift as the alert threshold.
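A minimal sketch of that tuning step, assuming you have already replayed the known-good month through `batch_drift_check` and collected one drift score per batch. The function name `calibrate_threshold` and the nearest-rank percentile method are choices made for this sketch.

```python
import math

def calibrate_threshold(known_good_drifts, percentile=99.0):
    """Nearest-rank percentile of drift scores from known-good batches."""
    if not known_good_drifts:
        raise ValueError("need at least one known-good drift score")
    ordered = sorted(known_good_drifts)
    # Nearest-rank: the smallest score with at least `percentile` percent
    # of all scores at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]
```

With only a few dozen batches the 99th percentile is simply the largest drift seen, so expect to revisit the threshold as more known-good history accumulates.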

Layer 4: Temporal Consistency

The same URL, scraped twice within a short window from different sessions, should return the same values for stable fields. Labyrinth content is often regenerated per session.

import time

def temporal_consistency(url, scrape_fn, fields_to_check, new_session, gap_seconds=120):
    """Two reads of the same URL with different sessions, compare stable fields.

    new_session is a caller-supplied factory that returns a fresh session
    (new cookies, fingerprint, and ideally IP) on every call.
    """
    first = scrape_fn(url, session=new_session())
    time.sleep(gap_seconds)
    second = scrape_fn(url, session=new_session())
    inconsistencies = []
    for f in fields_to_check:
        if first.get(f) != second.get(f):
            inconsistencies.append((f, first.get(f), second.get(f)))
    return inconsistencies

Run this on a small sample, around one percent of crawl volume. Inconsistencies on stable fields are a strong signal of being inside a deception system.

Layer 5: Hidden Link Trap Detection

Labyrinth specifically injects invisible links into served pages. If you follow discovered links and aggregate content from them, you are being led deeper.

Track URL provenance in your crawl frontier:

import random

class CrawlURL:
    def __init__(self, url, provenance):
        self.url = url
        # provenance: "seed", "user_visible_nav", or "discovered_link"
        self.provenance = provenance

def should_crawl(crawl_url, discovered_link_quarantine_rate=0.95):
    # Seeds and visible navigation are always crawled; discovered links
    # are dropped at random with the configured probability.
    if crawl_url.provenance == "discovered_link":
        if random.random() < discovered_link_quarantine_rate:
            return False
    return True

In practice you want a much higher rejection rate for discovered links than for explicit navigation. If you must follow them, run cross-source validation on every record returned from those URLs.
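One way to assign the provenance label at parse time is sketched below, using only the standard library's `html.parser`. The heuristics are assumptions: a link counts as hidden if it carries the `hidden` attribute or an inline `display:none` / `visibility:hidden` style, and as visible navigation if it sits inside a `<nav>` element. Links hidden through external CSS or a hidden ancestor are not caught; that needs a rendering browser.

```python
from html.parser import HTMLParser

# Inline-style fragments treated as "invisible". Heuristic assumption only.
HIDDEN_STYLE_MARKERS = ("display:none", "visibility:hidden")

class LinkClassifier(HTMLParser):
    """Split a page's links into visible (with provenance) and hidden."""

    def __init__(self):
        super().__init__()
        self.visible = []   # (href, provenance) pairs
        self.hidden = []    # hrefs a human visitor would not see
        self._nav_depth = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "nav":
            self._nav_depth += 1
            return
        if tag != "a" or not attrs.get("href"):
            return
        style = (attrs.get("style") or "").replace(" ", "").lower()
        if "hidden" in attrs or any(m in style for m in HIDDEN_STYLE_MARKERS):
            self.hidden.append(attrs["href"])
            return
        provenance = "user_visible_nav" if self._nav_depth > 0 else "discovered_link"
        self.visible.append((attrs["href"], provenance))

    def handle_endtag(self, tag):
        if tag == "nav" and self._nav_depth > 0:
            self._nav_depth -= 1

def classify_links(html):
    """Return (visible, hidden) link lists for one HTML document."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.hidden
```

Links in the hidden list should never enter the frontier. The visible pairs can be wrapped in `CrawlURL` and passed to `should_crawl` as usual.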

Sampling Strategy

Running all five layers on every record will not be economical. A realistic production sampling strategy:

Cross-source validation: random 10 percent, plus 100 percent of high-value entities
Entity grounding: 100 percent for any record touching a known grounded entity
Distributional anomaly: per-batch, runs against the full batch but compares aggregate to baseline
Temporal consistency: random 1 percent at collection time
Discovered-link rejection: configured per source, default 95 percent rejection unless validated
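A sketch of how the per-record rates above could be wired up. It swaps pure random sampling for deterministic hash-based sampling, so a record gets the same decision on every rerun, which makes quarantine decisions reproducible. The names `SAMPLE_RATES`, `in_sample`, and `checks_for` are hypothetical, and the batch-level drift check and link rejection are left out because they do not run per record.

```python
import hashlib

# Rates taken from the list above; the dictionary layout is an assumption.
SAMPLE_RATES = {"cross_source": 0.10, "temporal": 0.01}

def in_sample(record_id: str, check_name: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of all records for a check."""
    digest = hashlib.sha256(f"{check_name}:{record_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < rate * 2**64

def checks_for(record_id: str, high_value: bool, grounded: bool) -> list[str]:
    """Return the per-record checks to run for this record."""
    checks = []
    if high_value or in_sample(record_id, "cross_source", SAMPLE_RATES["cross_source"]):
        checks.append("cross_source")
    if grounded:
        checks.append("entity_grounding")
    if in_sample(record_id, "temporal", SAMPLE_RATES["temporal"]):
        checks.append("temporal")
    return checks
```

Including the check name in the hash keeps the cross-source and temporal samples independent, so the same one percent of records is not always the one that gets both checks.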

This adds something on the order of 5 to 8 percent to total pipeline cost on the deployments I have built. The catch rate on actually-fabricated content runs around 95 percent within the first week of tuning.

Things That Will Bite You

A few notes worth flagging.

Do not assume response time gives you a signal. Pre-generated Labyrinth content responds at normal latency because it is cached.

Do not retry from the same fingerprint expecting different results. The session is flagged. Rotate fingerprint, headers, and IP class together if you want a different output.

Do not use bypass tools that were popular in 2024. Cloudscraper, Cfscrape, and a long list of others no longer work against current versions. Stealth browser automation with residential proxies is still viable but the cost per request is rising and the success rate is falling. The economic leverage in 2026 is on verification, not on collection.

Audit historical data. If you have a scraping operation that ran through 2025, there is a non-zero chance your existing data is partially poisoned. Sample, validate against ground truth, and quarantine before using historical data for training or fine-tuning.
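A rough sketch of that audit, assuming historical records can be reduced to `(entity_key, value)` pairs and that you hold a dictionary of independently verified values. The function name and the default sample size are illustrative, and exact equality stands in for whatever comparison the field actually needs.

```python
import random

def audit_sample(records, ground_truth, sample_size=500, seed=0):
    """Estimate the share of historical records that contradict ground truth.

    records: list of (entity_key, value) pairs from the historical dataset.
    ground_truth: dict mapping entity_key to its independently verified value.
    Returns (mismatch_rate, checked_count, mismatched_records).
    """
    # Only records with a known-true value can be audited.
    checkable = [r for r in records if r[0] in ground_truth]
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    sample = rng.sample(checkable, min(sample_size, len(checkable)))
    mismatched = [r for r in sample if ground_truth[r[0]] != r[1]]
    rate = len(mismatched) / len(sample) if sample else 0.0
    return rate, len(sample), mismatched
```

Run it per source and per time window. A mismatch rate that jumps for one source in one window points to when that source started serving fabricated content.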

Bottom Line

The bottleneck in production scraping is no longer access. It is verification. The teams shipping clean data are the ones who treat scraped records the way a security team treats untrusted user input: validate, cross-reference, quarantine before promoting.

If you are hitting this in production and want a second set of eyes, feel free to DM me. Happy to dig in.

DEV Community