When you publish content to influence AI search answers (ChatGPT, Perplexity, Google AI Overviews), the hard question is not "did it rank" — it's "did the engine cite your URL". We built a small attribution layer to answer that.
Three-level matching
An AI answer cites sources by URL. We match each cited source against our published assets at three levels:
- L1 (exact URL): normalize both URLs (drop scheme, www., trailing slash) and compare.
- L2 (domain): same registrable domain, different path. Weaker, but still ours.
- L3 (fingerprint): no URL overlap, but the cited text is a near-copy. We compute a token-based simhash and accept matches under a Hamming-distance threshold.
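The two URL-based levels can be sketched roughly as follows. This is a minimal illustration, not our production code: the function names (`normalize_url`, `registrable_domain`, `match_level`) are hypothetical, and the registrable-domain step naively keeps the last two host labels, which is wrong for suffixes like `co.uk`. A real version would consult the Public Suffix List.

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Drop scheme, leading www., and trailing slash (also drops port and fragment)."""
    # Prefix scheme-less input with // so urlsplit still sees a host.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""  # hostname is already lowercased
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{path}{query}"


def registrable_domain(url: str) -> str:
    """ASSUMPTION: naive 'last two labels' rule, not a Public Suffix List lookup."""
    host = normalize_url(url).split("/", 1)[0].split("?", 1)[0]
    return ".".join(host.split(".")[-2:])


def match_level(cited_url: str, published_urls: Iterable[str]) -> Optional[str]:
    """Return 'L1' for an exact URL match, 'L2' for a domain match, else None."""
    published = list(published_urls)
    if normalize_url(cited_url) in {normalize_url(u) for u in published}:
        return "L1"
    if registrable_domain(cited_url) in {registrable_domain(u) for u in published}:
        return "L2"
    return None  # the fingerprint level would be tried next
```

With `published = ["https://www.example.com/blog/post/"]`, a citation of `http://example.com/blog/post` lands at L1, and `https://blog.example.com/other` falls through to L2.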
The simhash has one non-obvious requirement: it must be deterministic across processes. Python's built-in hash() salts str and bytes hashes per process (unless PYTHONHASHSEED is fixed), so a fingerprint stored today won't match one computed tomorrow from the same text. We hash tokens with blake2b instead.
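Here is a minimal sketch of a token-based simhash built on blake2b. The 64-bit width, the `\w+` tokenizer, and the default threshold of 3 are assumptions picked for illustration, and the function names are hypothetical. Only the blake2b choice and the "under a threshold" rule come from the description above.

```python
import hashlib
import re


def token_hash(token: str, bits: int = 64) -> int:
    """Stable per-token hash: blake2b output does not vary between processes."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, bits: int = 64) -> int:
    """Each token votes +1 or -1 on every bit; the sign of the tally sets the bit."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):  # ASSUMPTION: simple word tokenizer
        h = token_hash(token, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, tally in enumerate(counts):
        if tally > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_copy(cited_text: str, asset_text: str, threshold: int = 3) -> bool:
    """ASSUMPTION: threshold of 3 bits is a placeholder, not a tuned value."""
    return hamming_distance(simhash(cited_text), simhash(asset_text)) < threshold
```

Because the per-token hash depends only on the token bytes, a fingerprint written to storage by one process can be compared against one computed later by another.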
Attribution windows
A citation appearing right after you publish is not necessarily caused by you. Engines re-crawl on different cadences — Perplexity within days, Google AI Overviews over weeks, ChatGPT not on a short horizon at all. We only attribute a lift when the observation falls inside the engine-specific window, and we subtract a control group's lift so background movement doesn't get counted.
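The window check and control subtraction might look something like the sketch below. The window lengths are invented placeholders (the text above only says "days" and "weeks"), the engine keys and `attributed_lift` are hypothetical names, and clamping the result at zero is an extra assumption in keeping with the conservative intent.

```python
from datetime import date, timedelta
from typing import Dict, Optional

# ASSUMPTION: placeholder window lengths, not measured re-crawl cadences.
ATTRIBUTION_WINDOWS: Dict[str, Optional[timedelta]] = {
    "perplexity": timedelta(days=7),
    "google_ai_overviews": timedelta(days=42),
    "chatgpt": None,  # no short-horizon window, so nothing is attributed
}


def attributed_lift(
    engine: str,
    published: date,
    observed: date,
    treated_lift: float,
    control_lift: float,
) -> float:
    """Return the lift we are willing to claim, or 0.0 if it cannot be defended."""
    window = ATTRIBUTION_WINDOWS.get(engine)
    if window is None:
        return 0.0  # unknown engine or no usable window
    elapsed = observed - published
    if elapsed < timedelta(0) or elapsed > window:
        return 0.0  # observed before publishing or after the window closed
    # Subtract background movement; ASSUMPTION: never report a negative lift.
    return max(0.0, treated_lift - control_lift)
```

For example, a Perplexity observation three days after publishing with a treated lift of 0.5 and a control lift of 0.25 would be credited 0.25, while the same numbers observed a month later would be credited nothing.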
It's a deliberately conservative model. We'd rather under-claim than report a citation we can't defend.