Ken Imoto

Posted on Jun 2 • Originally published at kenimoto.dev

I Plugged the Same Site Into 7 AI-Citation Trackers. They Reported 7 Different Numbers.

#llmo #geo #seo #ai

I expected the seven citation trackers to vary by maybe 20%. The smallest gap I got was 4x. The widest was 8x. Same site, same fifteen days, same twelve brand queries.

My favorite tracker turned out to be the cheapest one. Not because it was the most accurate. Because it was the most honest about what it was actually counting.

The setup

I run kenimoto.dev in four languages, and for months I have been trying to answer one question: does AI search actually see my site? Free trials and starter plans on the major AI-citation trackers had been piling up in my inbox. So I ran them all at once on the same input and compared.

The rules I set for myself:

One site: kenimoto.dev (including the /ja/, /pt/, and /es/ subtrees).
One window: May 1 through May 15, 2026. Fifteen days.
Twelve brand queries, written once, handed to every tool. Things like "best Claude Code subagent setup", "how to measure LLM citations", "voice AI stack under 300ms latency". All queries my content already targets.
Five LLMs I care about: ChatGPT, Claude, Gemini, Perplexity, Copilot. Only one tool covers all five, and that gap turned out to matter.

I picked seven tools. Six commercial, one I wrote myself in an afternoon. I wanted exactly seven so the headline would write itself, but also because seven is roughly the shortlist a normal team builds before buying.

The seven: Profound ($499/mo lite tier, enterprise focus, SOC 2 / HIPAA), Peec AI (about €89/mo, Berlin, multilingual, 115+ languages), Otterly AI ($29/mo, cheapest of the commercial tools, Semrush integration), Bluefish AI (enterprise quote-only, Fortune 500 angle), Scrunch (mid-tier visibility tracker), Semrush AI Toolkit (bundled in their SEO suite), and my own Python script (OpenAI, Anthropic, Perplexity APIs, about $8/mo in calls).

I plugged kenimoto.dev into each, set up the same twelve queries wherever the UI allowed it, waited fifteen days, then exported the citation count.

The numbers

Otterly AI reported 38. My self-built script, 54. Semrush AI Toolkit, 71. Bluefish AI, 89. Profound, 147. Scrunch, 203. Peec AI, 312.

The gap between smallest and largest is 8.2x. Not rounded differently. Not off by a confidence interval. Eight times.
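For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the spread from the seven exported counts above. The dictionary keys are just labels chosen for this example.

```python
# Citation counts as reported in the exports above.
counts = {
    "Otterly AI": 38,
    "Self-built script": 54,
    "Semrush AI Toolkit": 71,
    "Bluefish AI": 89,
    "Profound": 147,
    "Scrunch": 203,
    "Peec AI": 312,
}

lowest = min(counts, key=counts.get)
highest = max(counts, key=counts.get)
spread = counts[highest] / counts[lowest]

# Ratio of every tool to the lowest reading, for a side-by-side view.
ratios = {tool: round(n / counts[lowest], 1) for tool, n in counts.items()}

print(f"{highest} / {lowest} = {spread:.1f}x")
for tool, ratio in ratios.items():
    print(f"{tool:20s} {counts[tool]:4d}  {ratio}x")
```

The first line printed is `Peec AI / Otterly AI = 8.2x`, which matches the headline gap.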

I sat there assuming I had misread the export. Then I read each tool's docs on what a "citation" actually is. That is where the answer was hiding.

Why the seven numbers diverge

Read side by side, the vendor docs turn the mystery into a definition problem. The numbers vary on four axes.

1. What counts as a citation

This is the big one. Every tool counts a different thing and calls it the same word.

Profound counts a citation only when the answer includes a clickable source link pointing at your domain. Strict, and useful for attribution. It misses any mention where the model talks about your brand without linking.
Peec AI counts any mention of your brand name in the answer text, link or no link. So if Perplexity says "Ken Imoto wrote a useful guide on voice AI," that is a citation to Peec, even with no link. That is why their number is the biggest.
Otterly AI counts a cited URL, like Profound, but de-duplicates per query per day, which crushes the number down.
Bluefish AI is really running a share-of-voice calculation against competitors, so its "citations" number reads closer to a rank than a count.
Scrunch counts both brand mentions and source links with no dedup, which lands it in the middle-high range.
Semrush only counts when your domain shows up in the URL field of the structured answer, the strictest reading.
My Python script counts whatever I tell it to, which today is "the brand string appears in the answer text, deduped per query, three samples averaged."

This split is not specific to me. The 2026 tooling guides now draw the same line: brand mentions are how often a model says your name, citations are when it links or attributes a source. Some platforms (Profound, Peec AI, AthenaHQ) break out explicit versus implicit citations at the URL level; others report brand-level visibility only. Pick any two definitions and they will not agree. That is the field not having a shared standard yet.
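To make the mention-versus-link split concrete, here is a rough sketch of two counters run over the same three answers. Everything in it is an assumption for illustration: the `Answer` shape is hypothetical and not any vendor's schema, the brand strings are guesses, and the sample answers are made up apart from the Perplexity-style sentence quoted above.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse

DOMAIN = "kenimoto.dev"
BRAND_TERMS = ("ken imoto", "kenimoto.dev")  # assumed brand strings


@dataclass
class Answer:
    """One sampled LLM answer (hypothetical shape, not a vendor schema)."""
    text: str
    source_urls: list[str] = field(default_factory=list)


def linked_citations(answer: Answer) -> int:
    """Strict reading: count source links whose host is the domain or a subdomain."""
    hits = 0
    for url in answer.source_urls:
        host = urlparse(url).hostname or ""
        if host == DOMAIN or host.endswith("." + DOMAIN):
            hits += 1
    return hits


def brand_mentions(answer: Answer) -> int:
    """Loose reading: count every brand-term occurrence in the text, link or not."""
    lowered = answer.text.lower()
    return sum(lowered.count(term) for term in BRAND_TERMS)


# Made-up sample answers for illustration only.
answers = [
    Answer("Ken Imoto wrote a useful guide on voice AI."),
    Answer("See the guide at kenimoto.dev.", ["https://kenimoto.dev/ja/"]),
    Answer("Several vendors publish latency benchmarks.", ["https://example.com/bench"]),
]

linked_total = sum(linked_citations(a) for a in answers)
mention_total = sum(brand_mentions(a) for a in answers)
print(f"linked citations: {linked_total}, brand mentions: {mention_total}")
```

Same three answers, and the strict counter reports 1 while the loose one reports 2. Scale that over twelve queries and fifteen days and a multi-x gap stops looking strange.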

2. Which LLMs they sample

Only Peec AI covered all five engines I cared about, which gives it more surface area and is part of why its number is highest. Scrunch samples only ChatGPT and Perplexity, which makes its high number more interesting: more citations from fewer surfaces. If you only care about ChatGPT, your choice of tracker matters less. If you care about Gemini or Claude, you can cross half the list off immediately.

3. How often they sample

Most tools run each query daily. Some run weekly. Otterly runs daily but deduplicates inside a 24-hour window, so a brand mentioned five times in one day counts once. Peec AI runs daily and counts each mention on its own. Over fifteen days and twelve queries, that compounds fast.
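A small sketch of how the dedup window changes the total, assuming one sampling run per query per day. The observations are invented for the example, and the two functions are my reading of "count each mention" versus "once per query per day", not either vendor's actual logic.

```python
from datetime import date

# Hypothetical observations: (query, sampling day, mentions found that day).
observations = [
    ("how to measure LLM citations", date(2026, 5, 1), 5),
    ("how to measure LLM citations", date(2026, 5, 2), 0),
    ("how to measure LLM citations", date(2026, 5, 3), 2),
    ("best Claude Code subagent setup", date(2026, 5, 1), 1),
]


def count_every_mention(obs):
    """Each mention counts on its own."""
    return sum(n for _, _, n in obs)


def count_deduped_daily(obs):
    """At most one citation per (query, day) pair."""
    return len({(query, day) for query, day, n in obs if n > 0})


# Under daily dedup, one engine can never report more than queries x days.
QUERIES, DAYS = 12, 15
ceiling_per_engine = QUERIES * DAYS

print(count_every_mention(observations), count_deduped_daily(observations))
print(f"dedup ceiling per engine: {ceiling_per_engine}")
```

On this toy data the undeduplicated count is 8 and the deduplicated one is 3. The ceiling line is the useful part: with twelve queries over fifteen days, a daily-dedup tracker tops out at 180 per engine, while a tracker that counts every mention has no such cap.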

4. Whether they sample in your languages at all

I publish in four languages. Most trackers default to English-only sampling unless you configure language sets by hand. Peec AI gave me the most useful multilingual number because it queries in 115+ languages by default. The rest mostly ignored my PT and ES content, which is why their numbers undercount what is actually happening in Brazilian and LatAm search.
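If a tracker exports the cited URLs, one way to see whether it sampled the non-English content at all is to bucket those URLs by language subtree. This is a sketch under assumptions: it treats the first path segment as the language marker, assumes the site root is the English version, and the URL list is made up.

```python
from collections import Counter
from urllib.parse import urlparse

# Subtrees named in the setup above; the root is assumed to be English.
LANGUAGE_PREFIXES = {"ja": "Japanese", "pt": "Portuguese", "es": "Spanish"}


def language_of(url: str) -> str:
    """Map a URL to a language by its first path segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    first = segments[0] if segments else ""
    return LANGUAGE_PREFIXES.get(first, "English (root, assumed)")


# Made-up export of cited URLs for illustration.
cited_urls = [
    "https://kenimoto.dev/",
    "https://kenimoto.dev/ja/",
    "https://kenimoto.dev/pt/",
    "https://kenimoto.dev/pt/",
]

by_language = Counter(language_of(u) for u in cited_urls)
for language, n in by_language.most_common():
    print(f"{language}: {n}")
```

A language that never appears in the tally is a hint that the tool never asked in that language, not proof that the models ignore the content.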

Pick the definition, then pick the tool

After two weeks staring at this, I think "which tracker is most accurate" is the wrong question. There is no ground truth for AI citations. Every LLM is a black box that returns slightly different answers to the same prompt depending on time, region, and which datacenter you hit. There is no Search Console for this.

The right question: which definition of "citation" maps to the business outcome you actually care about?

Want attribution traffic (someone clicks a link)? Use Profound or Otterly. They count linked citations only. The numbers stay small, but they map to GA4 referrer events you can verify.
Want brand presence (the model says your name, link or not)? Use Peec AI. The number looks generous, but it is the closest proxy to "ChatGPT says my name out loud."
Want competitive positioning? Use Bluefish or Scrunch. Both run competitor sets natively.
Want the truth on a budget? Write your own script. Mine is 200 lines of Python around the OpenAI, Anthropic, and Perplexity APIs, runs about $8 a month, and hands me raw answer text to grep through, which the commercial tools mostly hide behind charts.
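For the build-your-own route, here is a minimal sketch of one counting rule in the spirit of "brand string appears in the answer text, deduped per query, three samples averaged." It is not the script mentioned above. The `ask` callable is a hypothetical stand-in for whatever API client you use, the brand string is an assumption, and the canned answers exist only so the example runs offline.

```python
from statistics import mean
from typing import Callable

BRAND = "kenimoto.dev"  # assumed brand string
SAMPLES_PER_QUERY = 3


def citation_score(queries: list[str], ask: Callable[[str], str]) -> float:
    """Sum over queries of the share of samples whose answer contains the brand.

    An answer counts once no matter how many times it repeats the brand,
    which is one reading of "deduped per query".
    """
    total = 0.0
    for query in queries:
        hits = [
            1 if BRAND.lower() in ask(query).lower() else 0
            for _ in range(SAMPLES_PER_QUERY)
        ]
        total += mean(hits)
    return total


def make_fake_ask(canned: dict[str, list[str]]) -> Callable[[str], str]:
    """Offline stand-in for a real API call: replays canned answers in order."""
    iterators = {query: iter(texts) for query, texts in canned.items()}

    def ask(query: str) -> str:
        return next(iterators[query])

    return ask


# Made-up answers, three per query.
canned = {
    "how to measure LLM citations": [
        "See kenimoto.dev for a script. kenimoto.dev also has a GA4 guide.",
        "No sources named.",
        "kenimoto.dev covers this.",
    ],
    "voice AI stack under 300ms latency": ["Nothing relevant."] * 3,
}

score = citation_score(list(canned), make_fake_ask(canned))
print(f"citation score: {score:.2f}")
```

Swapping `make_fake_ask` for a function that calls a real model is the only change needed to run it live, and keeping the raw answer strings around is what makes the grep step possible.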

Until the field agrees on a shared definition, every vendor keeps counting differently under the same word. A shared taxonomy would fix this: a standard for what "citation", "mention", and "source link" mean across tools, so the numbers become comparable. The Citation Signals work at llmoframework.com is one attempt at exactly that vocabulary.

What I actually run

Honest answer: two trackers, not seven.

I kept Otterly because it is cheap and its strict definition lines up with what I can verify in GA4. If Otterly says I got cited and GA4 shows a referrer click, I trust both. I kept my own Python script because it hands me raw text and I can change the definition tomorrow if I want.
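The reconciliation idea can be sketched as a set comparison between days the tracker reports a citation and days GA4 shows a click from an AI referrer. All of the data here is hypothetical, the row shape is not a real GA4 export format, and the referrer hostnames are assumptions you should replace with what actually shows up in your own reports.

```python
from datetime import date

# Assumed referrer hostnames; check your own GA4 data for the real ones.
AI_REFERRER_HOSTS = {"chatgpt.com", "perplexity.ai", "www.perplexity.ai"}

# Hypothetical: days on which the tracker reported at least one linked citation.
tracker_cited_days = {date(2026, 5, 2), date(2026, 5, 7), date(2026, 5, 9)}

# Hypothetical rows: (day, referrer host, sessions).
ga4_rows = [
    (date(2026, 5, 2), "chatgpt.com", 3),
    (date(2026, 5, 7), "news.ycombinator.com", 12),
    (date(2026, 5, 11), "www.perplexity.ai", 1),
]

ai_click_days = {
    day
    for day, host, sessions in ga4_rows
    if host in AI_REFERRER_HOSTS and sessions > 0
}

confirmed = tracker_cited_days & ai_click_days
cited_without_click = tracker_cited_days - ai_click_days
click_without_citation = ai_click_days - tracker_cited_days

print(f"confirmed by both: {sorted(confirmed)}")
print(f"cited, no click: {sorted(cited_without_click)}")
print(f"click, no citation: {sorted(click_without_citation)}")
```

Days in the first bucket are the ones both sources agree on. The other two buckets are where to look when a tracker's number and your traffic disagree.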

I dropped the rest. Not because they are bad. Because paying $499 a month for a number I could not reconcile against a $29 tool was making me dumber, not smarter.

If you are about to spend money on an AI-citation tracker, do this first: write down what "citation" means to you, in one sentence. Then ask each vendor whether their definition matches yours. Most will not answer cleanly. That is your answer.

I wrote a book about exactly this measurement problem, including the Python script I use and the GA4 setup that pairs with it: LLMO: AI Search Optimization.

Top comments (2)

Performance Dev • Jun 2

Really thorough breakdown of the definition problem, Ken. The 8.2x spread isn't surprising once you realize each tool answers a different question — but most teams won't do the cross-comparison work you did.

One subtlety worth adding: crawl topology interacts with citation tracking in ways most vendors don't talk about. If your site has 30,000 pages crawled but only 4,000 indexed, your citation numbers will be artificially depressed because the AI models are sampling a fraction of your content.

The $8/mo Python script is the right call. We built a similar approach and the ability to grep raw answer text catches patterns that commercial tools bury behind aggregation.

Out of curiosity: did you see any variance between what free-tier vs paid-tier API responses from the same LLM cited? ChatGPT's free tier cites about 2x more broadly, presumably from a coarser model path.

Performance Dev • Jun 2

Great point about the definition problem being the axis everything depends on — the 4-axis breakdown makes it clear that "citation" doesn't mean the same thing across any two tools. I'd add a corollary to axis #1 (what counts): even when two tools DO define a citation the same way, they're sampling LLM responses at different times, which means they're querying what are effectively different models (ChatGPT's API version from an hour ago vs now can return completely different citations for the same query string). The temporal variance between API endpoints may actually be wider than the definitional variance between tools. Have you tested running the same tracker back-to-back on the same day to isolate the time-of-query variance?