Measuring whether AI search engines actually cite you

#geo #ai #seo #webdev

When you publish content to influence AI search answers (ChatGPT, Perplexity, Google AI Overviews), the hard question is not "did it rank?" but "did the engine cite your URL?". We built a small attribution layer to answer that.

Three-level matching

An AI answer cites sources by URL. We match each cited source against our published assets at three levels:

L1 — exact URL: normalize both URLs (drop scheme, www., trailing slash) and compare.
L2 — domain: same registrable domain, different path. Weaker, but still ours.
L3 — fingerprint: no URL overlap, but the cited text is a near-copy. We compute a token-based simhash and accept matches under a Hamming-distance threshold.
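
The two URL-based levels can be sketched in a few lines. This is a minimal illustration, not our production code: the function names are hypothetical, and `naive_registrable_domain` simply keeps the last two host labels, which is wrong for suffixes like `co.uk` (a real implementation would consult the Public Suffix List).

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit


def _split(url: str):
    # urlsplit only fills in the host when it sees "//", so add it for
    # scheme-less input such as "example.com/page".
    if "://" not in url:
        url = "//" + url.lstrip("/")
    return urlsplit(url)


def host_of(url: str) -> str:
    """Lowercased host with a leading 'www.' removed."""
    host = (_split(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def normalize_url(url: str) -> str:
    """Drop the scheme, 'www.', and any trailing slash."""
    parts = _split(url)
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{host_of(url)}{path}{query}"


def naive_registrable_domain(host: str) -> str:
    # ASSUMPTION: last two labels only; breaks on multi-part public suffixes.
    return ".".join(host.split(".")[-2:])


def match_level(cited_url: str, asset_urls: Iterable[str]) -> Optional[str]:
    """Return 'L1' for an exact URL match, 'L2' for a domain match, else None."""
    assets = list(asset_urls)
    cited = normalize_url(cited_url)
    if any(cited == normalize_url(a) for a in assets):
        return "L1"
    cited_domain = naive_registrable_domain(host_of(cited_url))
    if cited_domain and any(
        cited_domain == naive_registrable_domain(host_of(a)) for a in assets
    ):
        return "L2"
    return None
```

A citation that returns None here is the case that falls through to the fingerprint level.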

The simhash has one non-obvious requirement: it must be deterministic across processes. Python's built-in hash() is salted per process for str and bytes (unless PYTHONHASHSEED is fixed), so a fingerprint stored today won't match one computed tomorrow from the same text. We hash tokens with blake2b instead.
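
Here is a rough sketch of a token-based simhash built on blake2b. The 64-bit width, the `\w+` tokenizer, and the `max_distance=3` threshold are illustrative assumptions, not the values we ship.

```python
import hashlib
import re


def token_hash(token: str, bits: int = 64) -> int:
    """Stable per-token hash: same value in every process, unlike hash()."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=bits // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, bits: int = 64) -> int:
    """Each token votes +1/-1 on every bit; the sign of the total sets the bit."""
    counts = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        h = token_hash(token, bits)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_copy(cited_text: str, asset_text: str, max_distance: int = 3) -> bool:
    # ASSUMPTION: the threshold of 3 bits is a placeholder; tune it on real data.
    return hamming_distance(simhash(cited_text), simhash(asset_text)) <= max_distance
```

Because the token hash depends only on the token's bytes, a fingerprint written to a database by one worker stays comparable with one computed later by another.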

Attribution windows

A citation that appears right after you publish was not necessarily caused by your publishing. Engines re-crawl on different cadences: Perplexity within days, Google AI Overviews over weeks, and ChatGPT not on any short horizon. We only attribute a lift when the observation falls inside the engine-specific window, and we subtract a control group's lift so background movement doesn't get counted.
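
A sketch of the window check and control subtraction might look like the following. The window lengths are placeholder assumptions chosen only to reflect the "days" versus "weeks" ordering above, `None` stands for "no short-horizon attribution", and clamping negative results to zero is our illustrative choice for this example.

```python
from datetime import date, timedelta
from typing import Optional

# ASSUMPTION: illustrative window lengths in days, not measured values.
WINDOW_DAYS: dict[str, Optional[int]] = {
    "perplexity": 7,
    "google_ai_overviews": 28,
    "chatgpt": None,  # never attributed on a short horizon
}


def in_window(engine: str, published: date, observed: date) -> bool:
    """True if the observation falls inside the engine-specific window."""
    days = WINDOW_DAYS.get(engine)
    if days is None:
        return False
    return published <= observed <= published + timedelta(days=days)


def attributed_lift(
    engine: str,
    published: date,
    observed: date,
    treated_lift: float,
    control_lift: float,
) -> float:
    """Lift credited to the publish, net of the control group's movement."""
    if not in_window(engine, published, observed):
        return 0.0
    # Under-claim by design: never report a negative or un-netted lift.
    return max(0.0, treated_lift - control_lift)
```

With these placeholder windows, an observation 19 days after publishing would count for Google AI Overviews but not for Perplexity.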

It's a deliberately conservative model. We'd rather under-claim than report a citation we can't defend.