
I scraped 700 LLM citations to figure out how AI actually ranks websites

I've been building a tool that monitors how brands show up in ChatGPT, Gemini, Grok, and Perplexity answers. Most "GEO playbooks" I read assume the four LLMs behave roughly the same — same internet, same authoritative sources, optimize once, ship everywhere.

I wanted to test that assumption with actual data. So I sampled the same 5 prompts across all four models, four times each over 18 hours, and pulled every cited URL.

700 citations later, the four models barely overlap. The intersection of all four top-10 lists is exactly one domain (visible.seranking.com). For top-10 territory — the most important sources — each LLM is reading a fundamentally different internet.

Here's what I found, the methodology, and what it means for anyone trying to optimize for LLM visibility.


Methodology

The setup is intentionally simple so it's reproducible:

Stack:

  • Headless Chromium via patchright (Playwright fork with stealth patches) running on EC2 spot instances
  • Each (model × prompt × run) opens a fresh ephemeral userDataDir, types the prompt into the public web UI, and captures the response text and citations (capture step sketched below)
  • Lambda fans out per-prompt SQS messages with rate-limit aware routing across 2 IPs
  • Raw responses written to S3, parsed citations stored in Postgres
  • Aggregations run via a one-off Python script over 4 cron windows (every 6h)
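
To make the capture step concrete, here's a minimal sketch of one (model × prompt × run) capture, assuming patchright's sync API (it mirrors Playwright's). The selectors and the fixed wait are hypothetical stand-ins; each web UI needs its own:

```python
from patchright.sync_api import sync_playwright

def capture_run(chat_url: str, prompt: str, user_data_dir: str) -> dict:
    """One (model × prompt × run): fresh profile in, answer text + cited links out."""
    with sync_playwright() as p:
        # Fresh ephemeral profile per run, so no session state leaks between runs
        ctx = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        page = ctx.new_page()
        page.goto(chat_url)
        page.fill("textarea", prompt)     # real selectors differ per LLM UI
        page.keyboard.press("Enter")
        page.wait_for_timeout(30_000)     # crude; production code watches the DOM settle
        answer = page.inner_text("main")
        citations = [a.get_attribute("href")
                     for a in page.query_selector_all("main a[href^='http']")]
        ctx.close()
    return {"response": answer, "citations": citations}
```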

Sample:

  • 1 category: AI brand visibility / GEO tooling
  • 5 prompts varying the category question (e.g., "Which platforms help me track my brand mentions in ChatGPT, Gemini, and Perplexity?")
  • 4 LLMs: chatgpt-web, gemini-web, grok-web, perplexity-web
  • 4 cron runs over 18 hours
  • Result: 69 successful (model × prompt × run) records out of 80 attempts (4 models × 5 prompts × 4 runs), yielding 700 cited URLs

Important: the four LLM endpoints are the public web interfaces (chatgpt.com, gemini.google.com, etc.) — not the API. The web UIs use grounding by default; APIs typically don't. Real users interact with the web UI, so that's what I measure.


Finding 1: Gemini's citations are nearly random run-to-run

Jaccard similarity of cited-domain sets across paired runs of the same prompt:

```
grok-web        0.49   ← most stable (small sample n=4)
chatgpt-web     0.32   ← ~1/3 overlap between runs
perplexity-web  0.32   ← same
gemini-web      0.10   ← ~90% of cited domains change between runs
```

Jaccard 0.10 means the cited set you observe today has only ~10% overlap (intersection over union) with the set you'd observe re-running the same prompt tomorrow.

That's not "noisy." That's nearly random.
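
For reference, the metric itself: Jaccard similarity is intersection size over union size of two cited-domain sets. A minimal sketch with made-up run data:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two cited-domain sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

run_1 = {"siftly.ai", "nightwatch.io", "kime.ai", "reddit.com"}
run_2 = {"siftly.ai", "digitalapplied.com", "ziptie.dev", "genixly.io"}
print(jaccard(run_1, run_2))  # 1 shared of 7 total, ~0.14: Gemini territory
```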

If you're building tools or dashboards that surface "Gemini cited X" as a single point estimate, you're misleading users. The right unit is cited in N of M samples. For a stable Gemini reading on any prompt, sample 3-4 times.
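
The "cited in N of M samples" unit is just a counter over repeated runs. A sketch with toy data:

```python
from collections import Counter

# Cited-domain sets from M repeated runs of the same prompt (toy data)
runs = [
    {"siftly.ai", "reddit.com"},
    {"siftly.ai", "kime.ai"},
    {"siftly.ai"},
    {"reddit.com", "ziptie.dev"},
]
counts = Counter(domain for run in runs for domain in run)
for domain, n in counts.most_common():
    print(f"{domain}: cited in {n} of {len(runs)} samples")  # siftly.ai: 3 of 4, ...
```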

Possible causes (none confirmed from external sampling):

  • A/B testing of grounding sources to mitigate over-reliance on a single source
  • Heavy freshness weighting → source pool changes hourly
  • Multi-source diverse sampling that rotates which category gets emphasized
  • User-segment-specific routing (different model variants by traffic segment)

Finding 2: Each LLM has a citation personality

Top 10 most-cited domains per LLM. The lists are almost disjoint:

ChatGPT — tier-1 publications + recognizable SaaS blogs:

```
techradar.com           12
frase.io                11
visible.seranking.com    9
llmclicks.ai             7    competitor
riffanalytics.ai         6
visiblie.com             6
otterly.ai               5    competitor
sitepoint.com            4
getpassionfruit.com      4
llmrefs.com              4
```

Gemini — SEO agencies and indie creators:

```
siftly.ai               18    competitor
nightwatch.io           15    agency blog
kime.ai                 10
digitalapplied.com       8
visible.seranking.com    6
frictionai.co            6
nicklafferty.com         5    personal blog
genixly.io               5
reddit.com               4
ziptie.dev               4
```

Grok — universal sources + topic-blind weirdness:

```
cookiepedia.co.uk        9    cookie compliance???
onetrust.com             9    cookie compliance???
visible.seranking.com    8
reddit.com               6
ziptie.dev               4
digitalapplied.com       3
therankmasters.com       3
siftly.ai                3
amplitude.com            3
evertune.ai              3
```

The cookiepedia + onetrust spike is the strangest single observation in the dataset. For prompts about "best AI brand monitoring tool," both are completely off-topic. My read: Grok's web search appears to grab pages it visited during the search session — including cookie consent banners — and treat them as content sources. Grok's grounding has the lowest signal-to-noise ratio of the four.

Perplexity — competitor product pages, directly:

```
visible.seranking.com   12
reddit.com               7
llmclicks.ai             7    competitor
nicklafferty.com         7
amplitude.com            6
evertune.ai              6    competitor
superlines.io            6    competitor
brainz.digital           6
therankmasters.com       5
aiclicks.io              5    competitor
```

Perplexity disproportionately cites the homepages of competitors directly. Useful as competitive intel: Perplexity cites whoever has discoverable, well-structured product pages.

Top-10 overlap across all four models: exactly 1 domain. Each LLM is, for top-10 territory, a different internet.
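
That overlap can be checked directly from the four lists above:

```python
# Top-10 cited domains per model, copied from the tables above
top10 = {
    "chatgpt-web": {"techradar.com", "frase.io", "visible.seranking.com",
                    "llmclicks.ai", "riffanalytics.ai", "visiblie.com",
                    "otterly.ai", "sitepoint.com", "getpassionfruit.com",
                    "llmrefs.com"},
    "gemini-web": {"siftly.ai", "nightwatch.io", "kime.ai", "digitalapplied.com",
                   "visible.seranking.com", "frictionai.co", "nicklafferty.com",
                   "genixly.io", "reddit.com", "ziptie.dev"},
    "grok-web": {"cookiepedia.co.uk", "onetrust.com", "visible.seranking.com",
                 "reddit.com", "ziptie.dev", "digitalapplied.com",
                 "therankmasters.com", "siftly.ai", "amplitude.com",
                 "evertune.ai"},
    "perplexity-web": {"visible.seranking.com", "reddit.com", "llmclicks.ai",
                       "nicklafferty.com", "amplitude.com", "evertune.ai",
                       "superlines.io", "brainz.digital", "therankmasters.com",
                       "aiclicks.io"},
}
print(set.intersection(*top10.values()))  # {'visible.seranking.com'}
```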


Finding 3: 10 domains every LLM cites

Drop the "top 10" filter and look at the long tail. Exactly 10 domains were cited by all four LLMs in the sample:

```
reddit.com                forum
visible.seranking.com     SEO suite
amplitude.com             analytics platform
conductor.com             enterprise SEO
zapier.com                automation
siftly.ai                 direct competitor
scrunch.com               direct competitor
aiclicks.io               direct competitor
bluefishai.com            AI niche
ziptie.dev                dev blog
```

A single content placement on any one of these reaches all 4 LLMs simultaneously. If you're bandwidth-constrained, this is the list to attack first.

Reddit alone produced 21 distinct citation events across the 80-attempt sample, roughly one Reddit cite for every four runs attempted. Most categories I've looked at have Reddit at the top of the universal-N list.


Finding 4: Recommendation ≠ citation

This is the finding that changed how I think about GEO measurement.

Two distinct events happen in any LLM answer:

  • Mention — the brand name appears in the recommendation text
  • Citation — the brand's domain appears in the cited sources

These are distinct events that don't have to co-occur. I measured the conditional probabilities for 25+ named brands in the sample.

```python
# Simplified version of the calculation. brand_cited and brand_mentioned are
# boolean Series with one entry per (model, prompt, run, brand).
P_cited_given_mentioned = (brand_cited & brand_mentioned).sum() / brand_mentioned.sum()
P_mentioned_given_cited = (brand_cited & brand_mentioned).sum() / brand_cited.sum()
```
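
Spelled out as a self-contained example with toy data (the DataFrame layout is my assumption: one row per (model, prompt, run, brand)):

```python
import pandas as pd

# Toy stand-in for the real table: one row per (model, prompt, run, brand)
df = pd.DataFrame({
    "brand_mentioned": [True, True, True, False, True, False],
    "brand_cited":     [True, False, True, True, False, False],
})
both = (df.brand_cited & df.brand_mentioned).sum()
print(both / df.brand_mentioned.sum())  # P(cited|mentioned) = 2/4 = 0.50
print(both / df.brand_cited.sum())      # P(mentioned|cited) = 2/3 ≈ 0.67
```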

Results:

```
Model           P(cited|mentioned)   P(mentioned|cited)
chatgpt-web     47%                  75%
perplexity-web  40%                  45%
grok-web        37%                  82%
gemini-web      33%                  63%
```

Across all four models, only 33-47% of recommended brands are also cited. The remaining 53-67% were pulled from training data alone: the LLM remembered the brand without grounding the recommendation in any current web source.

The most striking example: a brand named Peec was mentioned 8 / 7 / 9 / 9 times by the four models across the same 5 prompts. Citations of peec.com? Zero, across all four LLMs.

Peec wasn't outreaching. Peec wasn't getting backlinks. The LLMs just know Peec exists in this category.

This implies something most GEO tools and playbooks gloss over: durable presence in LLM training data is doing real work that fresh outreach can't replicate. Wikipedia entries, Crunchbase profiles, Hacker News threads from 2023-2024, conference talk recordings: none of these show up as citations, but they materially shape recommendations.

If you're allocating 100% of your GEO budget to fresh citation outreach, you're missing roughly half of the recommendation surface.


Finding 5: Perplexity cites broadly, recommends narrowly

The inverse direction (P(mentioned | cited) above) splits the four into two camps:

  • Grok (82%) and ChatGPT (75%): citing ≈ endorsing. If they cite you, they almost always recommend you.
  • Perplexity (45%): cites broadly to support specific sentences, but only recommends a narrow subset of those sources as brand endorsements.

A single Perplexity citation is a weaker brand signal than a single ChatGPT citation. To move Perplexity's recommendation needle, you need citations plus explicit "best in category" positioning in the cited content.


What I'd actually do with this data

If I were optimizing for LLM visibility in this category, here's the prioritized plan:

  1. Reddit engagement (highest leverage, low cost) — find the 5 most-cited threads in your niche, comment substantively. Don't brigade or pitch your product. Authentic engagement compounds across all 4 LLMs.
  2. Get on the universal-10 — pitch a guest article to amplitude.com/blog, conductor.com/learn, or zapier.com/blog. A single placement reaches all 4 LLMs at once.
  3. Build training-data presence — Wikipedia entry (3-6 months), Crunchbase profile (2 hours), Hacker News post (instant), conference talk recording. Aim for the Peec pattern.
  4. Optimize your own product page — schema markup, comparison-friendly headings, category-clear meta. This matters most for Perplexity but helps all four.
  5. Stop chasing single-LLM gains — pick your two priority LLMs, plan to their personality, accept that the other two need different content.

Limitations

To stay honest:

  • One category, one project. The universal-N list and per-LLM personalities probably differ across verticals.
  • Small Grok sample (n=9 vs 20 for others). The cookiepedia observation is suggestive, not definitive.
  • Scraper-dependent. Each LLM's DOM changes, so citation extraction is a moving target.
  • No domain-authority weighting. A 4-of-4 cite from a small blog is genuinely lower-leverage than from a tier-1 publication.

What's next

I'm planning to publish similar analyses on different verticals as I collect them. Want this kind of breakdown on your category? Email me at yibo@aiattention.ai with your category and competitors — I'll run 5 prompts × 4 LLMs × 4 runs on your vertical and send back the universal sources, citation gaps vs your competitors, and per-LLM tactics. No charge.

If you found this useful, follow me on dev.to or check out aiattention.ai. The product is at the MVP stage; feedback welcome.

— Yibo
