Last quarter I sat down with a junior on our team and watched her build a citation tracker in a spreadsheet for the third time that month. The fourth time, I stopped pretending we had a system. We had a habit. Habits and systems are not the same thing.
What we had was a growing pile of screenshots from Perplexity, Google's AI Overviews, ChatGPT, and Gemini, plus a Notion page where each of us had been informally rating the citation quality of pieces we'd written or co-written for B2B SaaS clients. Some of us called it "good" or "weak." One person used a 1-5 scale. Another used colored dots. The tracker was the symptom. The problem was that we had no shared vocabulary for what a "good citation" actually was, and so every retrospective ended in a polite shrug.
Over six weeks in Q4 2025 we ran what eventually became our baseline: 40 prompts, four engines (Perplexity, Google AIO, ChatGPT with web on, Gemini), five repetitions per prompt-engine combination. That's 800 prompt-runs. The point wasn't to win citations. The point was to figure out what to call them when we got them. Here is what we found, and what we'd do differently if we ran it again.
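For scale, the run grid was nothing fancier than a cross product. A minimal sketch of the shape of it, with placeholder prompt and engine labels rather than our actual test set:

```python
from itertools import product

# 40 prompts x 4 engines x 5 reps = 800 prompt-runs.
# Prompt and engine labels here are placeholders, not our real test set.
PROMPTS = [f"prompt_{i:02d}" for i in range(40)]
ENGINES = ["perplexity", "google_aio", "chatgpt_web", "gemini"]
REPS = range(5)

runs = [
    {"prompt": p, "engine": e, "rep": r}
    for p, e, r in product(PROMPTS, ENGINES, REPS)
]
assert len(runs) == 800  # 40 * 4 * 5
```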
Why we ran the study in the first place
The trigger was a single client conversation in early Q3 2025. The client asked, point blank, "are we doing well on AI search?" The honest answer was that I didn't know how to answer the question precisely. I could say "yes, we've seen citations" or "no, we're not in the top results," but I couldn't tell them how often, on what kinds of queries, with what consistency, against what baseline. That gap was the embarrassing part. We were charging for GEO work and didn't have a measurement instrument we trusted.
We went away from that meeting, looked at the tools available off the shelf, and concluded that the existing AI-search rank-tracking products were either too shallow (one-shot queries) or too opaque (proprietary scoring with no exposed methodology) to underwrite the kind of answer we wanted to give. So we built the methodology ourselves, knowing it would be slow and partial and likely embarrassing in retrospect. That client is still a client. The methodology has evolved. The original question is now answerable in a way it wasn't.
Why a tier system at all
The first draft of the framework had three buckets: cited, not cited, and "sort of mentioned." That collapsed almost immediately. A citation that's a hyperlinked source under a direct answer is not the same artifact as a passing mention in a paragraph that an AI engine generated from training data and didn't link out from. We needed to distinguish at least four things: whether the source was linked, whether the answer paraphrased our claim, whether the brand entity was named, and whether the user would plausibly click through.
After two passes we landed on A through E:
- A-tier: linked primary citation, our specific claim is paraphrased, entity is named in answer body.
- B-tier: linked citation, claim is paraphrased, entity not named (anonymous source).
- C-tier: unlinked mention in the answer text, but no source attribution in the citation rail.
- D-tier: appears only as a footnote-style URL in the "sources" rail with no semantic pull-through.
- E-tier: indexed but not surfaced to the user (you can find it via "show all sources" but it's invisible by default).
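For readers who think in code, here's a minimal sketch of how those definitions collapse into a coding rule. The field names are illustrative; in practice we coded by hand from screenshots, not with a script.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """What a coder records for one prompt-run (field names are illustrative)."""
    linked: bool              # our URL appears as a linked source
    claim_paraphrased: bool   # the answer paraphrases our specific claim
    entity_named: bool        # our brand is named in the answer body
    visible_by_default: bool  # surfaced without expanding "show all sources"

def tier(obs: Observation) -> str:
    """Map one observation to an A-E tier, mirroring the definitions above."""
    if obs.linked and obs.claim_paraphrased:
        return "A" if obs.entity_named else "B"
    if obs.entity_named and not obs.linked:
        return "C"   # unlinked mention in the answer text
    if obs.linked and obs.visible_by_default:
        return "D"   # sources-rail URL only, no semantic pull-through
    if obs.linked:
        return "E"   # indexed but hidden behind "show all sources"
    return "none"    # no citation for our content on this run
```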
Across the 800 runs, 23% landed in A or B. 45% sat in D or E. The middle (C) was small, about 11%, which surprised us; we'd expected a wider plateau. The remaining ~21% of runs returned no citation for our content at all.
What the 23% number actually means
I want to be careful. The 23% is a portfolio number across our test set, not a per-engine number, and not a per-client number. In our testing, Perplexity tier-A rates ran noticeably higher than Gemini's; ChatGPT (web on) sat between them; Google AIO behaved most like a confidence-weighted SEO ranker, with strong D/E presence and rare A-tier breakthroughs.
Small-n caveats apply. Forty prompts is not a representative sample of any client's actual demand curve. The 23% is the headline, not the answer. The answer is the variance: across five reps of the same prompt on the same engine in the same week, we saw tier shifts on 31% of prompt-engine pairs. In other words, a single audit run can be directionally wrong on roughly a third of the data purely from run-to-run variance.
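Once the runs are coded, the shift rate is cheap to check. A minimal sketch, assuming a hypothetical CSV with prompt, engine, rep, and tier columns; this is not our actual pipeline:

```python
import pandas as pd

# Hypothetical: coded runs in a flat file, one row per run.
runs = pd.read_csv("tier_coded_runs.csv")  # columns: prompt, engine, rep, tier

# A prompt-engine pair is "unstable" if its five reps don't all land in the same tier.
stability = runs.groupby(["prompt", "engine"])["tier"].nunique()
shift_rate = (stability > 1).mean()
print(f"Tier shifted within reps on {shift_rate:.0%} of prompt-engine pairs")
```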
One thing we got wrong on the first pass
We initially coded "entity named in body but no link" as B-tier. Two months in, we noticed that those mentions showed almost no correlation with downstream session starts in the client's analytics. We moved them to C. The lesson is that being linked, not being named, is doing the heavy lifting. The agency I work with had quietly assumed brand mention was the prize; it isn't, at least not yet. Revising the framework mid-study was uncomfortable. We re-coded 174 records. It was the right call.
The per-engine breakdown
When we sliced the 23% A+B rate by engine, the variation was wider than the headline suggests. Perplexity returned A or B tier on about 31% of its runs in our test set. ChatGPT with web on sat around 24%. Gemini was 19%. Google AIO was 15%, with most of its surface concentrated in D and E. The aggregate is a portfolio average; if you only care about one engine, the portfolio number is the wrong number to plan against.
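The per-engine slice is mechanical once the runs are coded. Again, the CSV and column names below are assumptions for illustration, not our production setup:

```python
import pandas as pd

# Same hypothetical file as above: one row per run with prompt, engine, rep, tier.
runs = pd.read_csv("tier_coded_runs.csv")
runs["a_or_b"] = runs["tier"].isin(["A", "B"])

# Per-engine A+B rate; the portfolio average hides exactly this spread.
print(runs.groupby("engine")["a_or_b"].mean().sort_values(ascending=False).round(2))
```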
We also broke the 23% out by prompt category. Prompts that asked for comparative statements ("X vs Y") cited better than prompts that asked for definitional statements ("what is X"). Prompts that referenced a specific product or vendor by name cited better than category-level prompts. Prompts about recent events (within the trailing 90 days) cited better on Perplexity and worse on Google AIO. Most of these patterns are intuitive in retrospect; we hadn't predicted any of them in advance.
The point of mentioning these is not to suggest you should optimize for the categories that cite better. It's that any single headline number — including ours — is hiding a structure underneath it, and the structure is where the decisions actually live.
The coding fatigue problem
A boring methodological note that I want to write down because it bit us. Tier coding 800 records by hand is fatigue-prone work. We tried to do it in long sessions and our inter-rater reliability dropped noticeably after about the 60th record in a sitting. We've since switched to coding in 45-minute blocks with two coders comparing notes at the end of each block. Reliability improved. Throughput stayed roughly constant, because the rework rate dropped.
If you're running a study like this, the fatigue effect is real. We logged a 12% discrepancy rate between coders when sessions ran past 90 minutes, versus about 4% in shorter sessions on the same content. The data underneath your tier rate is only as good as the coding discipline that produced it.
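If you double-code even a sample, the session-length effect is easy to check. A minimal sketch, assuming a hypothetical file where each record carries both coders' tier assignments and the length of the session it was coded in:

```python
import pandas as pd

# Hypothetical double-coded sample: columns tier_coder_a, tier_coder_b, session_minutes.
sample = pd.read_csv("double_coded_sample.csv")
sample["disagree"] = sample["tier_coder_a"] != sample["tier_coder_b"]
sample["long_session"] = sample["session_minutes"] > 90

# Discrepancy rate split by session length (the 12% vs ~4% comparison above).
print(sample.groupby("long_session")["disagree"].mean().round(3))
```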
What we'd change next time
Five reps was the minimum that gave us stable tier assignments. Three reps lied to us repeatedly in the first pilot. If you're doing this on your own content, please run five. We'd also pre-register the prompt list before looking at any results, because we caught ourselves rewriting prompts that "didn't work" and that's exactly how to fool yourself.
We'd also pre-register the tier definitions. We didn't, and we ended up re-coding 174 records (mentioned above) when we revised the framework. Pre-registration would have forced us to argue the definitions before we knew the answer. That would have been slower up front and faster overall.
We're now running a sequel study, 60 prompts, same engines, with prompt phrasing held constant from the start and a second coder doing blind tier assignment for inter-rater reliability. I don't expect the 23% number to hold; it might be lower once we control for prompt drift. We'll publish either way, including if the answer is embarrassing.
If you're starting your own tier framework, the question I'd ask first isn't "how do I score this." It's "what would I have to see to change my mind about what counts as a good citation?" If you can't answer that in a sentence, the framework is going to drift the moment you stare at the data. Ours did.