MH Habib

Posted on Jun 30

How Content Invariance Field (CIF) Solves the Volatile Website Problem

#machinelearning #architecture #monitoring #buildinpublic

Modern websites are probabilistic runtime surfaces, not deterministic documents.

A pricing page served to two different visitors produces observably different HTML, React hydration nonces, A/B test bucket assignments, session-specific analytics payloads, and rotating testimonials. The same logical page, two different DOM trees. This makes web change detection surprisingly hard: the problem is not detecting that something changed, but determining whether it was a meaningful business event or ephemeral rendering noise.

I spent the better part of a year working on this problem. This post covers the approach that worked — a per-URL Content Invariance Field (CIF) that learns which parts of a page are stable and uses entropy-weighted hashing to suppress everything else.

Why Traditional Approaches Are Open-Loop

The standard approach to web change detection follows a pattern I call strip-and-hash:

Define a global list of "noisy" elements and attributes
Strip them from the DOM
Hash the cleaned DOM
Compare hashes between crawls

This is an open-loop system; it applies a fixed preprocessing pipeline and makes a binary decision on each observation pair, without learning from the page's history. Every new noise source requires updating the stripping rules.
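To make the pattern concrete, here is a minimal sketch of strip-and-hash. The rule list, the function names, and the choice to run regexes over raw HTML (rather than a parsed DOM) are all assumptions for illustration, not a description of any particular tool.

```typescript
import { createHash } from "node:crypto";

// Hypothetical global noise rules; real deployments accumulate many more.
const NOISE_PATTERNS: RegExp[] = [
  /\sdata-session-[\w-]*="[^"]*"/g, // session-specific attributes
  /\snonce="[^"]*"/g, // script nonces
  /<script\b[^>]*>[\s\S]*?<\/script>/gi, // inline scripts and analytics payloads
];

// Steps 1-3: apply the fixed rule list, then hash what is left.
function stripAndHash(html: string): string {
  let cleaned = html;
  for (const pattern of NOISE_PATTERNS) {
    cleaned = cleaned.replace(pattern, "");
  }
  return createHash("sha256").update(cleaned).digest("hex");
}

// Step 4: a binary decision on one pair of crawls, with no memory of earlier ones.
function hasChanged(previousHtml: string, currentHtml: string): boolean {
  return stripAndHash(previousHtml) !== stripAndHash(currentHtml);
}
```

A nonce that matches a rule is stripped and ignored, but any attribute the rule list does not know about (for example a rotating data-v-… marker) flips the hash until someone writes a new rule.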

I've seen teams maintain 15+ regex patterns for hydration nonces alone (__next-, sc-, data-session-, etc.), plus separate filters for timestamps, visitor counters, A/B test parameters, and analytics payloads. Each new frontend framework or CMS plugin a target site adopts introduces new noise patterns that must be manually identified and stripped.

The deeper issue is that the approach assumes noise is enumerable. It's not. The set of possible rendering artifacts is unbounded because the set of frontend tools and frameworks is unbounded.

A Closed-Loop Alternative: Per-URL Invariance Learning

If noise is not enumerable, perhaps it is learnable. Each URL has its own stability profile; some sections change every visit, others never change. Instead of trying to predict what is volatile globally, we can observe what is stable locally.

The core idea is a Content Invariance Field: a per-URL probabilistic model that assigns an invariance weight w ∈ [0.0, 1.0] to each tracked section of the page.

Weight	Meaning	Observed behavior
1.0	Invariant	Content hash identical across all observations
0.5–0.9	Partially stable	Changes occasionally, usually reverts
0.0–0.4	Volatile	Different content on most visits
0.0	Chaotic	Never the same twice

Rules do not assign these weights. They emerge from observation.

Training Phase

For each observation of a URL, the page is parsed into semantic sections (hero, features, navigation, footer, testimonials, etc.). Each section gets an anchor selector and its content hash recorded:

{
  "anchorSelector": "pricing:nth-of-type(1)",
  "observedCount": 5,
  "invarianceWeight": 0.92,
  "entropy": 0.08,
  "contentHistory": [
    { "hash": "a1b2c3", "observedAt": "2026-06-27T10:00:00Z" },
    { "hash": "a1b2c3", "observedAt": "2026-06-27T14:00:00Z" },
    { "hash": "a1b2c3", "observedAt": "2026-06-28T10:00:00Z" }
  ]
}

Content history is a sliding window of the last 20 observations per node. This bounds memory usage while keeping the model responsive to behavioral shifts (a section that was stable for months and suddenly becomes volatile will reflect the change within 20 observations).
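The node record and its sliding window could be typed roughly as follows. The interface names and the recordObservation helper are assumptions; only the field names and the window size of 20 come from the description above.

```typescript
interface ContentObservation {
  hash: string;
  observedAt: string; // ISO 8601 timestamp
}

interface CifNode {
  anchorSelector: string;
  observedCount: number;
  invarianceWeight: number;
  entropy: number;
  contentHistory: ContentObservation[];
}

const HISTORY_WINDOW = 20;

// Append one observation and keep only the newest HISTORY_WINDOW entries.
// Returns a new node object rather than mutating the input.
function recordObservation(node: CifNode, hash: string, observedAt: string): CifNode {
  const contentHistory = [...node.contentHistory, { hash, observedAt }].slice(-HISTORY_WINDOW);
  return { ...node, observedCount: node.observedCount + 1, contentHistory };
}
```

Note that observedCount keeps growing while contentHistory stays bounded, so the two can legitimately differ once a node has been seen more than 20 times.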

Entropy as a Stability Signal

After each observation, Shannon entropy is computed per node from the empirical state distribution:

H(e) = -Σ p(s) · log₂ p(s)

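This is normalized to [0, 1] by dividing by log₂(min(distinct_states, 20)). A node with a single observed state has entropy 0 by definition, since the divisor would otherwise be log₂(1) = 0. The entropy value maps directly to a detection behavior: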

Entropy	Classification	Detection treatment
0.0	Invariant	Full weight — always included in comparison hash
0.0–0.3	Stable	Full weight — always included
0.3–0.7	Cyclic	Included with content-truncated hash (structural check only)
0.7–1.0	Volatile	Excluded from stable hash entirely
1.0	Chaotic	Only structural presence tracked

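The entropy threshold of 0.3 is not arbitrary: it corresponds to the point where a node has visited at least two distinct states with non-negligible probability. For a two-state node, one deviating observation in a 20-observation window gives H ≈ 0.29 (just under the threshold), while two deviations give H ≈ 0.47. At the other extreme, a two-state node (e.g., a nav bar that toggles between two states) that spends equal time in each state has a normalized entropy of exactly 1.0. The thresholds were tuned against a corpus of ~200 real competitor URLs.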
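The entropy computation and the table lookup can be sketched as below. The function names are assumptions, and so is the boundary handling: the table does not say which side of 0.3 and 0.7 is inclusive, so this sketch treats both as upper-inclusive.

```typescript
type Treatment = "invariant" | "stable" | "cyclic" | "volatile" | "chaotic";

// Normalized Shannon entropy of the hashes in one node's history window.
function normalizedEntropy(hashes: string[]): number {
  const counts = new Map<string, number>();
  for (const h of hashes) counts.set(h, (counts.get(h) ?? 0) + 1);
  if (counts.size <= 1) return 0; // log2(1) = 0, so a single state is defined as entropy 0
  let bits = 0;
  counts.forEach((count) => {
    const p = count / hashes.length;
    bits -= p * Math.log2(p);
  });
  // Clamp to 1 to absorb floating-point rounding.
  return Math.min(1, bits / Math.log2(Math.min(counts.size, 20)));
}

// Map entropy to a treatment. Inclusive upper bounds are an assumption.
function classify(entropy: number): Treatment {
  const EPS = 1e-9;
  if (entropy <= EPS) return "invariant";
  if (entropy <= 0.3) return "stable";
  if (entropy <= 0.7) return "cyclic";
  if (entropy < 1 - EPS) return "volatile";
  return "chaotic";
}
```

With a full 20-entry window, a node that deviated once classifies as stable, one that deviated twice as cyclic, and one that splits evenly between two states reaches 1.0.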

Weighted Hash Construction

The comparison hash that drives change detection is:

pageHash = Σ(invarianceWeight_i × contentHash_i) / Σ(invarianceWeight_i)

A pricing table that has never changed (weight 1.0) dominates the hash. A live visitor counter that changes every visit (weight ~0.05) barely contributes. A single hydration nonce cannot flip this hash — it was never included because its parent section has high entropy and was excluded.

The formula above is conceptual, since cryptographic digests cannot be meaningfully averaged. The actual implementation takes a different route for practical reasons: the 30 highest-weight section hashes are sorted by descending weight and concatenated. The principle is the same: the stable sections determine the hash, and the volatile sections are invisible to it.
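A minimal sketch of the sort-and-concatenate variant follows. Several details are assumptions: the WeightedSection shape, the "|" separator, hashing the concatenation with SHA-256, the tie-break on selector, and excluding sections with entropy above 0.7. The content-truncated hash for cyclic sections is left out for brevity.

```typescript
import { createHash } from "node:crypto";

interface WeightedSection {
  anchorSelector: string;
  invarianceWeight: number;
  entropy: number;
  contentHash: string;
}

const MAX_HASHED_SECTIONS = 30;
const EXCLUDE_ABOVE_ENTROPY = 0.7;

function globalStableHash(sections: WeightedSection[]): string {
  const stable = sections
    .filter((s) => s.entropy <= EXCLUDE_ABOVE_ENTROPY) // volatile sections never enter the hash
    .sort(
      (a, b) =>
        b.invarianceWeight - a.invarianceWeight ||
        a.anchorSelector.localeCompare(b.anchorSelector), // deterministic order on ties
    )
    .slice(0, MAX_HASHED_SECTIONS);
  const joined = stable.map((s) => `${s.anchorSelector}=${s.contentHash}`).join("|");
  return createHash("sha256").update(joined).digest("hex");
}
```

Because the ordering depends on the weights, this sketch assumes both sides of a comparison are computed from the same stored weights; otherwise a weight update alone could reorder the sections and flip the hash.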

Edge Cases That Matter

The core algorithm is straightforward. The engineering complexity is in the edge cases.

Cold Start

Before a URL has accumulated enough observations, the invariance field has no information. Using CIF during this period means suppressing everything, which is wrong for the most important detections (the first real change on a newly-monitored URL).

The solution is a training phase fallthrough: the CIF hash gate only activates once trainedOn >= 2 observations and at least one node has a learned stable hash. Before that, the system uses traditional comparison:

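const MIN_TRAINED_CRAWLS = 2;

// Minimal snapshot shape assumed for this excerpt; only the fields read below.
interface CifFieldSnapshot {
  trainedOn: number;
  nodes: unknown[];
  globalStableHash: string | null;
}

export function cifContentUnchanged(
  currentGlobalStableHash: string,
  field: CifFieldSnapshot | null,
): boolean {
  if (
    !field ||
    field.trainedOn < MIN_TRAINED_CRAWLS ||
    field.nodes.length === 0 ||
    !field.globalStableHash
  ) {
    return false; // not trained yet: let the crawl through to the fallback comparison
  }
  return currentGlobalStableHash === field.globalStableHash;
}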

This is conservative by design: better to let noise through during training than to miss a real change.

Strategic Section Override

Not all page sections are equal. A pricing table changing on crawl 2 before CIF is trained is a critical signal that must not be suppressed. Strategic section types bypass entropy-based exclusion entirely:

const STRATEGIC_SECTION_TYPES = new Set([
  "hero", "cta", "pricing", "features",
  "enterprise", "announcement",
]);

If a strategic section has entropy < 1.0 (i.e., it's not completely chaotic), it stays in the stable hash regardless of what the CIF would otherwise do. This acts as a safety valve. The CIF learns over time, but business-critical signals never go dark during the learning window.
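The override can be expressed as a single inclusion predicate. This is a sketch under stated assumptions: the function name is hypothetical, and the 0.7 cutoff for non-strategic sections is taken from the entropy table rather than from the implementation.

```typescript
const STRATEGIC_TYPES = new Set([
  "hero", "cta", "pricing", "features",
  "enterprise", "announcement",
]);

const NON_STRATEGIC_ENTROPY_LIMIT = 0.7;

// Decide whether a section's content participates in the stable hash.
function includeInStableHash(sectionType: string, entropy: number): boolean {
  if (STRATEGIC_TYPES.has(sectionType)) {
    // Strategic sections stay in unless they are completely chaotic.
    return entropy < 1.0;
  }
  return entropy <= NON_STRATEGIC_ENTROPY_LIMIT;
}
```

For example, a pricing section at entropy 0.9 stays in the hash, while a testimonials section at the same entropy is excluded.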

Chrome Flapping (Navigation/Footer)

Navigation and footer sections present a distinct problem. A LinkedIn nav bar toggles between "Get the app, Sign in, Join now" and "Sign in, Join now" on almost every crawl. This is cyclic — it rotates through a known set of states — but it's not random.

A generic CIF would classify this as medium-entropy (~0.5–0.7) and partially suppress it. But partial suppression creates edge cases: a nav toggle that occasionally flips the truncated hash can still trigger false positives.

The specific solution used here: chrome sections (navigation, footer) use a state-frequency map instead of raw content. When a chrome node toggles to a state seen ≥2 times in its history, and the node has accumulated ≥4 observations, the hash collapses to a deterministic token:

`chrome-cyclic:${section.anchorSelector}`

A novel chrome state after the training window still produces a content token, so a genuine navigation restructure is not missed. Chrome is never a strategic signal, so this only removes known-cyclic churn.
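A sketch of the token selection, using the two thresholds quoted above. The function name and signature are assumptions, and the sketch assumes historyHashes holds prior observations only (the current crawl is not yet included).

```typescript
const CHROME_TYPES = new Set(["navigation", "footer"]);
const MIN_STATE_SIGHTINGS = 2;
const MIN_CHROME_OBSERVATIONS = 4;

// Returns the token that represents this section in the stable hash.
function chromeToken(
  sectionType: string,
  anchorSelector: string,
  currentHash: string,
  historyHashes: string[],
): string {
  if (!CHROME_TYPES.has(sectionType)) return currentHash;
  const sightings = historyHashes.filter((h) => h === currentHash).length;
  const knownCyclic =
    historyHashes.length >= MIN_CHROME_OBSERVATIONS && sightings >= MIN_STATE_SIGHTINGS;
  // Known states collapse to one token; novel states keep their content hash.
  return knownCyclic ? `chrome-cyclic:${anchorSelector}` : currentHash;
}
```

Every known state of a given nav bar maps to the same token, so toggling among them cannot change the page hash, while a state never seen before still surfaces as a different value.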

Node Disappearance

What happens when a high-invariance section vanishes between crawls? A pricing table that was present in every observation for six months has suddenly gone missing — should that be treated as volatility or a real change?

The implementation retains nodes with invarianceWeight >= 0.6 even when they are not observed in the current crawl. A missing high-invariance node is flagged as a structural change. This handles cases like:

A competitor removing a pricing tier
A page redesign that relocates sections
Temporary rendering failures (which produce their own signal)
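The retention rule amounts to a set difference filtered by weight. In this sketch the TrackedNode shape and the function name are assumptions; the 0.6 threshold is the one stated above.

```typescript
interface TrackedNode {
  anchorSelector: string;
  invarianceWeight: number;
}

const RETAIN_WEIGHT = 0.6;

// High-invariance nodes from the stored field that did not appear in the current crawl.
// Each one would be reported as a structural change.
function missingStableNodes(
  field: TrackedNode[],
  observedSelectors: Set<string>,
): TrackedNode[] {
  return field.filter(
    (node) =>
      node.invarianceWeight >= RETAIN_WEIGHT &&
      !observedSelectors.has(node.anchorSelector),
  );
}
```

A missing low-weight node (say, a rotating testimonial) is not reported, since its absence is indistinguishable from its usual volatility.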

Continuous Adaptation

The invariance weight update uses an exponentially weighted moving average (EWMA) with a change penalty:

const alpha = 0.35;
const rawStability = computeStabilityFromHistory(contentHistory);
const changePenalty = changed ? 0.15 : 0;
const invarianceWeight = Math.max(
  0.05,
  existingNode.invarianceWeight * (1 - alpha) +
    rawStability * alpha -
    changePenalty,
);

The EWMA parameter alpha = 0.35 means each new observation contributes 35% of the running estimate, so the influence of earlier estimates decays by a factor of 0.65 per crawl (to roughly 12% after five crawls). This is aggressive enough to detect behavioral shifts within 3–5 observations, but smooth enough to avoid reacting to single outlier observations. The computeStabilityFromHistory function returns the ratio of observations matching the most common state: a node that has been the same 19 out of 20 times gets stability 0.95.

When a previously stable node changes, the changePenalty of 0.15 drops its weight immediately. If it reverts next crawl, the weight recovers. If it stays in the new state, the EWMA slowly converges on the new stability. A node that flips between two states with equal frequency will settle at a low weight (~0.3–0.5) and, under the normalization described earlier, an entropy close to 1.0.
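For readers who want to trace the numbers, here is a self-contained version of the update. The body of computeStabilityFromHistory is reconstructed from its one-sentence description, and updateWeight is a hypothetical wrapper around the snippet above, so treat both as a sketch.

```typescript
const ALPHA = 0.35;
const CHANGE_PENALTY = 0.15;
const MIN_WEIGHT = 0.05;

// Share of observations in the window that match the most common state.
function computeStabilityFromHistory(hashes: string[]): number {
  if (hashes.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const h of hashes) counts.set(h, (counts.get(h) ?? 0) + 1);
  let mostCommon = 0;
  counts.forEach((count) => {
    if (count > mostCommon) mostCommon = count;
  });
  return mostCommon / hashes.length;
}

// One EWMA step with the change penalty and the 0.05 floor.
function updateWeight(previousWeight: number, hashes: string[], changed: boolean): number {
  const rawStability = computeStabilityFromHistory(hashes);
  const penalty = changed ? CHANGE_PENALTY : 0;
  return Math.max(
    MIN_WEIGHT,
    previousWeight * (1 - ALPHA) + rawStability * ALPHA - penalty,
  );
}
```

As a worked case: a node at weight 1.0 whose window now holds 19 matching hashes and one new one has rawStability 0.95, so a single change moves it to 1.0 × 0.65 + 0.95 × 0.35 − 0.15 = 0.8325.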

The key property: no code needs to be deployed for new noise patterns. An A/B testing framework that rotates class names every visit is automatically suppressed after 3–5 observations. The behavior emerges from observation.

A Note on the Entropy-Action Mapping

The entropy thresholds (0.0, 0.3, 0.7, 1.0) were determined empirically, and they are likely specific to the types of pages in this dataset — primarily SaaS marketing sites, documentation, and pricing pages. Sites with different characteristics (news portals, social media feeds, e-commerce with dynamic inventory) would likely require different thresholds.

The general approach of mapping continuous entropy to discrete actions is sound, but the specific thresholds should be validated against your own corpus. The mathematical framework — content history → empirical entropy → invariance weight → weighted hash — is independent of the threshold values.

How It Performs on Real Pages

The approach was tested against a set of competitor URLs that were producing persistent false positives under strip-and-hash.

Stripe's pricing page was a particularly instructive case. It serves different A/B test variants ("control" vs "treatment") that change copy, layout, and even CTA wording with every visit. Before CIF training, every single crawl was flagged as a change, because the pageHash comparison had no way to distinguish between a different A/B variant and a real pricing change.

After three training crawls:

The globalStableHash became byte-identical across control and treatment variants
A change in any invariant section (price value, feature list) immediately flipped the hash
The volatile sections — A/B copy, different CTAs — were automatically excluded because their entropy exceeded 0.7

This validated the core hypothesis: stability is a learnable property per URL, and once learned, it serves as a reliable detection baseline.

Linear's changelog page had a different noise profile — nav highlight states and footer links varied between crawls while the actual changelog entries were stable. The CIF assigned high invariance weight to the main content and low weight to the chrome, reducing the false positive rate from approximately 60% to under 2%.

Vercel's pricing page used a dynamic currency selector that changed pricing display based on geo-detection. Crawls from different proxies produced different currency formatting. The CIF learned that the pricing section had moderate entropy (0.4–0.6) due to the formatting variations while the actual plan structure was stable. With the weighted hash, changes to the pricing structure remained detectable while currency-formatting noise was suppressed.

Limitations

CIF has specific failure modes worth stating explicitly.

Cold start vulnerability. The first 2–3 observations use fallback comparison. A competitor that launches a major change on observation 2 is handled by the fallback path, which may be less accurate. Archetype-based transfer learning (bootstrapping a new URL's invariance field from similar pages) is a potential mitigation but adds complexity.
Synchronized multi-node changes. A full site redesign changes every tracked node simultaneously. The CIF sees this as maximum-entropy noise: all previously stable nodes are now changing. The EWMA weight update absorbs this gradually rather than immediately. Coordinated simultaneous change across many high-invariance nodes is a separate signal that should be treated differently from independent volatility, but the current implementation does not distinguish them.
Re-identification drift. The anchor selector strategy — currently "type:nth-of-type(N)" — is fragile. A page restructure that changes section ordering can cause re-identification failures. Perceptual hashing or multi-anchor re-identification (CSS selector + visual signature + text anchor + spatial position) would be more robust but adds significant extraction overhead.
Sliding window lag. A section that was stable for 20 observations and suddenly becomes volatile takes 20 observations to fully age out of the history window. EWMA mitigates this (the change penalty drops weight immediately), but the entropy estimate lags by the window size.

Implementation Notes

The full implementation is approximately 700 lines of TypeScript running in a Node.js service. The work per observation is O(n) in the number of tracked sections, because each node's update touches at most the 20 entries in its history window. Storage is a JSON document on the URL record (~2–10 KB per URL depending on node count and history depth).

The performance-critical path is the hash comparison, which is a single equality check between two short SHA-256 digest strings. This runs before any expensive page analysis (screenshot rendering, semantic extraction, AI inference) and gates all downstream work. For URLs in a trained state where no invariant section has changed, the total detection cost is effectively zero.

The implementation stores a sliding window of 20 observations per node and tracks at most 50 nodes per URL. If the serialized JSON exceeds 900 KB, history is trimmed to the last 10 entries per node. In practice, most URLs track 15–30 sections with negligible storage cost.

The Insight

The fundamental shift is from enumerating noise to measuring stability.

A strip-and-hash system maintains a growing list of things to ignore. Each new noise source requires identification, rule-writing, deployment, and testing. The rule set grows monotonically and never converges.

An invariance-learning system maintains a model of what is stable. Each new observation improves the model. Noise is not identified — it is automatically suppressed because it is not stable. The system converges on a per-URL stability profile and adapts when that profile changes.

The entropy-weighted hash is the mechanism that makes this practical. Instead of asking "did any part of the page change?" — which is almost always yes on a modern SPA — it asks "did any stable part of the page change?" which is the question we actually need answered.

I work on a competitive intelligence project called IntelDif, where this approach was developed and deployed. The problem of distinguishing meaningful business changes from rendering noise turned out to be harder than we expected, and CIF was the solution that consistently worked.

Website: inteldif.com
Ask us any questions on this at LinkedIn

DEV Community