I've been building a tool that monitors how brands show up in ChatGPT, Gemini, Grok, and Perplexity answers. Most "GEO playbooks" I read assume the four LLMs behave roughly the same — same internet, same authoritative sources, optimize once, ship everywhere.
Wanted to test that assumption with actual data. So I sampled the same 5 prompts across all four models, four times each over 18 hours, and pulled every cited URL.
700 citations later, the four models barely overlap. The intersection of all four top-10 lists is exactly one domain (visible.seranking.com). For top-10 territory — the most important sources — each LLM is reading a fundamentally different internet.
Here's what I found, the methodology, and what it means for anyone trying to optimize for LLM visibility.
Methodology
The setup is intentionally simple so it's reproducible:
Stack:
- Headless Chromium via patchright (Playwright fork with stealth patches) running on EC2 spot instances
- Each (model × prompt × run) opens a fresh ephemeral userDataDir, types the prompt into the public web UI, and captures the response text and citations
- Lambda fans out per-prompt SQS messages with rate-limit-aware routing across 2 IPs
- Raw responses written to S3, parsed citations stored in Postgres
- Aggregations run via a one-off Python script over 4 cron windows (every 6h)
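For concreteness, here's roughly what one (model × prompt × run) capture looks like. This is a trimmed sketch, not the production worker: the selectors are placeholders that differ per LLM, and I'm relying on patchright's Python API mirroring Playwright's sync API.

import tempfile
from patchright.sync_api import sync_playwright

def capture_run(url: str, prompt: str) -> dict:
    # fresh ephemeral profile per run so no session state leaks between samples
    with tempfile.TemporaryDirectory() as profile, sync_playwright() as p:
        ctx = p.chromium.launch_persistent_context(profile, headless=True)
        page = ctx.new_page()
        page.goto(url)
        page.fill("textarea", prompt)                           # placeholder selector
        page.keyboard.press("Enter")
        page.wait_for_selector("a.citation", timeout=120_000)   # placeholder selector
        cites = [a.get_attribute("href") for a in page.query_selector_all("a.citation")]
        text = page.inner_text("main")
        ctx.close()
        return {"text": text, "citations": cites}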
Sample:
- 1 category: AI brand visibility / GEO tooling
- 5 prompts varying the category question (e.g., "Which platforms help me track my brand mentions in ChatGPT, Gemini, and Perplexity?")
- 4 LLMs: chatgpt-web, gemini-web, grok-web, perplexity-web
- 4 cron runs over 18 hours
- Result: 69 successful (model × prompt × run) records out of 80 attempts, yielding 700 cited URLs
Important: the four LLM endpoints are the public web interfaces (chatgpt.com, gemini.google.com, etc.) — not the API. The web UIs use grounding by default; APIs typically don't. Real users interact with the web UI, so that's what I measure.
Finding 1: Gemini's citations are nearly random run-to-run
Jaccard similarity (intersection over union) of cited-domain sets across paired runs of the same prompt:
grok-web 0.49 ← most stable (small sample n=4)
chatgpt-web 0.32 ← ~1/3 overlap between runs
perplexity-web 0.32 ← same
gemini-web 0.10 ← ~90% of cited domains change between runs
Jaccard 0.10 means the cited set you observe today has only 10% overlap with the set you'd observe re-running the same prompt tomorrow.
That's not "noisy." That's nearly random.
If you're building tools or dashboards that surface "Gemini cited X" as a single point estimate, you're misleading users. The right unit is cited in N of M samples. For a stable Gemini reading on any prompt, sample 3-4 times.
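If you want the same stability number for your own prompts, it's a few lines of stdlib Python. A sketch, where runs is a list of cited-domain sets from repeated runs of one (model, prompt) pair:

from collections import Counter
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a or b) else 0.0

pairwise = [jaccard(a, b) for a, b in combinations(runs, 2)]  # averaged -> table above

# the unit dashboards should actually report: cited in N of M samples
n_of_m = Counter(d for run in runs for d in run)  # domain -> N, where M = len(runs)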
Possible causes (none confirmed from external sampling):
- A/B testing of grounding sources to mitigate over-reliance on a single source
- Heavy freshness weighting → source pool changes hourly
- Multi-source diverse sampling that rotates which category gets emphasized
- User-segment-specific routing (different model variants by traffic segment)
Finding 2: Each LLM has a citation personality
Top 10 most-cited domains per LLM. The lists are almost disjoint:
ChatGPT — tier-1 publications + recognizable SaaS blogs:
techradar.com 12
frase.io 11
visible.seranking.com 9
llmclicks.ai 7 ← competitor
riffanalytics.ai 6
visiblie.com 6
otterly.ai 5 ← competitor
sitepoint.com 4
getpassionfruit.com 4
llmrefs.com 4
Gemini — SEO agencies and indie creators:
siftly.ai 18 ← competitor
nightwatch.io 15 ← agency blog
kime.ai 10
digitalapplied.com 8
visible.seranking.com 6
frictionai.co 6
nicklafferty.com 5 ← personal blog
genixly.io 5
reddit.com 4
ziptie.dev 4
Grok — universal sources + topic-blind weirdness:
cookiepedia.co.uk 9 ← cookie compliance???
onetrust.com 9 ← cookie compliance???
visible.seranking.com 8
reddit.com 6
ziptie.dev 4
digitalapplied.com 3
therankmasters.com 3
siftly.ai 3
amplitude.com 3
evertune.ai 3
The cookiepedia + onetrust spike is the strangest single observation in the dataset. For prompts about "best AI brand monitoring tool," both are completely off-topic. My read: Grok's web search appears to grab pages it visited during the search session — including cookie consent banners — and treat them as content sources. Grok's grounding has the lowest signal-to-noise ratio of the four.
Perplexity — competitor product pages, directly:
visible.seranking.com 12
reddit.com 7
llmclicks.ai 7 ← competitor
nicklafferty.com 7
amplitude.com 6
evertune.ai 6 ← competitor
superlines.io 6 ← competitor
brainz.digital 6
therankmasters.com 5
aiclicks.io 5 ← competitor
Perplexity disproportionately cites the homepages of competitors directly. Useful as competitive intel: Perplexity cites whoever has discoverable, well-structured product pages.
Top-10 overlap across all four models: exactly 1 domain. Each LLM is, for top-10 territory, a different internet.
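That overlap number is a short check to reproduce. A sketch, assuming by_model maps each model name to its flat list of cited domains:

from collections import Counter

def top10(cited: list) -> set:
    return {d for d, _ in Counter(cited).most_common(10)}

overlap = set.intersection(*(top10(v) for v in by_model.values()))
# -> {'visible.seranking.com'} on this sample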
Finding 3: 10 domains every LLM cites
Drop the "top 10" filter and look at the long tail. Exactly 10 domains were cited by all four LLMs in the sample:
reddit.com forum
visible.seranking.com SEO suite
amplitude.com analytics platform
conductor.com enterprise SEO
zapier.com automation
siftly.ai direct competitor
scrunch.com direct competitor
aiclicks.io direct competitor
bluefishai.com AI niche
ziptie.dev dev blog
A single content placement on any one of these reaches all 4 LLMs simultaneously. If you're bandwidth-constrained, this is the list to attack first.
Reddit alone produced 21 distinct citation events across the 80-attempt sample: roughly one in four runs cited Reddit. Most categories I've looked at have Reddit at the top of the universal-N list.
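The universal list and the Reddit count both fall out of the raw records the same way. A sketch, assuming records holds one (model, prompt, run, cited_domain_set) tuple per successful attempt:

from collections import defaultdict

domains_by_model = defaultdict(set)
for model, _prompt, _run, domains in records:
    domains_by_model[model] |= domains

universal = set.intersection(*domains_by_model.values())     # the 10 domains above
reddit_events = sum("reddit.com" in d for *_, d in records)  # 21 on this sample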
Finding 4: Recommendation ≠ citation
This is the finding that changed how I think about GEO measurement.
Two distinct events happen in any LLM answer:
- Mention — the brand name appears in the recommendation text
- Citation — the brand's domain appears in the cited sources
One can happen without the other. I measured the conditional probabilities for 25+ named brands in the sample.
# Simplified version of the calculation. Assumes a pandas DataFrame df with one
# row per (brand, model, prompt, run) and boolean columns brand_cited / brand_mentioned.
both = (df["brand_cited"] & df["brand_mentioned"]).sum()
P_cited_given_mentioned = both / df["brand_mentioned"].sum()
P_mentioned_given_cited = both / df["brand_cited"].sum()
Results:
Model P(cited|mentioned) P(mentioned|cited)
chatgpt-web 47% 75%
perplexity-web 40% 45%
grok-web 37% 82%
gemini-web 33% 63%
Across all four models, only 33-47% of recommended brands are also cited. The remaining 53-67% were pulled from training data alone: the LLM remembered the brand without grounding the recommendation in any current web source.
The most striking example: a brand named Peec was mentioned 8 / 7 / 9 / 9 times by the four models across the same 5 prompts. Citations of peec.com? Zero, across all four LLMs.
Peec wasn't outreaching. Peec wasn't getting backlinks. The LLMs just know Peec exists in this category.
This points at something most GEO tools and playbooks gloss over: durable presence in LLM training data is doing real work that fresh outreach can't replicate. Wikipedia entries, Crunchbase profiles, Hacker News threads from 2023-2024, conference talk recordings: none of these show up as citations, but they materially shape recommendations.
If you're allocating 100% of your GEO budget to fresh citation outreach, you're missing roughly half of the recommendation surface.
Finding 5: Perplexity cites broadly, recommends narrowly
The inverse direction (P(mentioned | cited) above) splits the four into two camps:
- Grok (82%) and ChatGPT (75%): citing ≈ endorsing. If they cite you, they almost always recommend you.
- Perplexity (45%): cites broadly to support specific sentences, but only recommends a narrow subset of those sources as brand endorsements.
A single Perplexity citation is a weaker brand signal than a single ChatGPT citation. To move Perplexity's recommendation needle, you need citations plus explicit "best in category" positioning in the cited content.
What I'd actually do with this data
If I were optimizing for LLM visibility in this category, here's the prioritized plan:
- Reddit engagement (highest leverage, low cost) — find the 5 most-cited threads in your niche, comment substantively. Don't brigade or pitch your product. Authentic engagement compounds across all 4 LLMs.
- Get on the universal-10 — pitch a guest article to amplitude.com/blog, conductor.com/learn, or zapier.com/blog. A single placement on one of these domains reaches all 4 LLMs at once.
- Build training-data presence — Wikipedia entry (3-6 months), Crunchbase profile (2 hours), Hacker News post (instant), conference talk recording. Aim for the Peec pattern.
- Optimize your own product page — schema markup (sketch after this list), comparison-friendly headings, and category-clear meta descriptions. This matters most for Perplexity but helps all four.
- Stop chasing single-LLM gains — pick your two priority LLMs, plan to their personality, accept that the other two need different content.
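On the product-page point: schema markup just means machine-readable JSON-LD in your page head. A minimal sketch with placeholder values; the right type and fields depend on your product:

import json

# illustrative schema.org JSON-LD for a product page (all values are placeholders)
schema = {
    "@context": "https://schema.org",
    "@type": "SoftwareApplication",
    "name": "YourProduct",
    "applicationCategory": "BusinessApplication",
    "description": "Tracks brand mentions and citations across ChatGPT, Gemini, Grok, and Perplexity.",
}
print(f'<script type="application/ld+json">{json.dumps(schema)}</script>')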
Limitations
To stay honest:
- One category, one project. The universal-N list and per-LLM personalities probably differ across verticals.
- Small Grok sample (n=9 vs 20 for others). The cookiepedia observation is suggestive, not definitive.
- Scraper-dependent. Each LLM's DOM changes, so citation extraction is a moving target.
- No domain-authority weighting. A 4-of-4 cite from a small blog is genuinely lower-leverage than from a tier-1 publication.
What's next
I'm planning to publish similar analyses on different verticals as I collect them. Want this kind of breakdown on your category? Email me at yibo@aiattention.ai with your category and competitors — I'll run 5 prompts × 4 LLMs × 4 runs on your vertical and send back the universal sources, citation gaps vs your competitors, and per-LLM tactics. No charge.
If you found this useful, follow me on dev.to or check out aiattention.ai. The product is in MVP — feedback welcome.
— Yibo