DEV Community

Code Pocket

Posted on • Originally published at redditquora.com

Measuring whether AI search engines actually cite you

When you publish content to influence AI search answers (ChatGPT, Perplexity, Google AI Overviews), the hard question is not "did it rank" — it's "did the engine cite your URL". We built a small attribution layer to answer that.

Three-level matching

An AI answer cites sources by URL. We match each cited source against our published assets at three levels:

  • L1 (exact URL): normalize both URLs (drop the scheme, a leading www., and any trailing slash) and compare.
  • L2 (domain): same registrable domain, different path. Weaker, but still ours.
  • L3 (fingerprint): no URL overlap, but the cited text is a near-copy of one of our assets. We compute a token-based simhash of both texts and accept matches under a Hamming-distance threshold.
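
A minimal sketch of the two URL-based levels, to make the normalization concrete. The function names are ours for illustration, and `naive_registrable_domain` is a deliberate simplification: it keeps the last two host labels, which is wrong for suffixes like `co.uk`. A real version would consult the Public Suffix List.

```python
from typing import Optional
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Drop the scheme, a leading 'www.', and any trailing slash."""
    # urlsplit only recognizes the host if a scheme or '//' is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = parts.hostname or ""  # .hostname is already lowercased
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def naive_registrable_domain(url: str) -> str:
    """Assumption: approximate the registrable domain as the last two labels."""
    host = normalize_url(url).split("/", 1)[0].split("?", 1)[0]
    return ".".join(host.split(".")[-2:])


def match_level(cited: str, published: str) -> Optional[str]:
    """Return 'L1' for an exact URL match, 'L2' for a domain match, else None."""
    if normalize_url(cited) == normalize_url(published):
        return "L1"
    if naive_registrable_domain(cited) == naive_registrable_domain(published):
        return "L2"
    return None  # fall through to the fingerprint check
```

Whether to keep or drop the query string during normalization is a judgment call; this sketch keeps it, so tracking parameters would need to be stripped separately.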

The simhash has one non-obvious requirement: it must be deterministic across processes. Python's built-in hash() is salted per process for strings (unless PYTHONHASHSEED is fixed), so a fingerprint stored today won't match one computed tomorrow from the same text. We hash tokens with blake2b instead.
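
Here is one way such a fingerprint could look, as a sketch rather than our exact code. The 64-bit width, the `\w+` tokenizer, and the default threshold of 3 bits are illustrative assumptions; only the use of blake2b for token hashing comes from the description above.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token; identical in every process."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str) -> int:
    """Token-based simhash: each token votes +1 or -1 on every bit position."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        h = token_hash(token)
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_copy(a: str, b: str, max_distance: int = 3) -> bool:
    """Assumed threshold: treat texts within max_distance bits as near-copies."""
    return hamming_distance(simhash(a), simhash(b)) <= max_distance
```

Because `token_hash` depends only on the token bytes, a fingerprint written to storage can be compared against one computed later in a different process.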

Attribution windows

A citation appearing right after you publish is not necessarily caused by you. Engines re-crawl on different cadences: Perplexity within days, Google AI Overviews over weeks, and ChatGPT on no short horizon at all. We only attribute a lift when the observation falls inside the engine-specific window, and we subtract a control group's lift over the same period so background movement doesn't get counted.
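
The rule can be sketched in a few lines. Everything numeric here is a placeholder: the text above only says "days", "weeks", and "no short horizon", so the day ranges in `WINDOW_DAYS`, the engine keys, and the `attributed_lift` signature are hypothetical.

```python
from datetime import date
from typing import Dict, Optional, Tuple

# Hypothetical windows as (earliest_day, latest_day) after publication.
# None means we never attribute a short-horizon lift for that engine.
WINDOW_DAYS: Dict[str, Optional[Tuple[int, int]]] = {
    "perplexity": (1, 14),
    "google_ai_overviews": (7, 56),
    "chatgpt": None,
}


def attributed_lift(
    engine: str,
    published_on: date,
    observed_on: date,
    treated_lift: float,
    control_lift: float,
) -> float:
    """Lift credited to our content: zero outside the window, else treated minus control."""
    window = WINDOW_DAYS.get(engine)
    if window is None:
        return 0.0  # unknown engine or no attributable window
    earliest, latest = window
    age_days = (observed_on - published_on).days
    if not earliest <= age_days <= latest:
        return 0.0  # too early or too late to defend
    return treated_lift - control_lift
```

For example, a Perplexity observation three days after publishing with a treated lift of 0.5 and a control lift of 0.25 would be credited 0.25, while the same numbers for ChatGPT would be credited nothing.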

It's a deliberately conservative model. We'd rather under-claim than report a citation we can't defend.
