
I scraped 700 LLM citations to figure out how AI actually ranks websites

I've been building a tool that monitors how brands show up in ChatGPT, Gemini, Grok, and Perplexity answers. Most "GEO playbooks" I read assume the four LLMs behave roughly the same — same internet, same authoritative sources, optimize once, ship everywhere.

I wanted to test that assumption with actual data. So I sampled the same 5 prompts across all four models, four times each over 18 hours, and pulled every cited URL.

700 citations later, the four models barely overlap. The intersection of all four top-10 lists is exactly one domain (visible.seranking.com). For top-10 territory — the most important sources — each LLM is reading a fundamentally different internet.

Here's what I found, the methodology, and what it means for anyone trying to optimize for LLM visibility.


Methodology

The setup is intentionally simple so it's reproducible:

Stack:

  • Headless Chromium via patchright (Playwright fork with stealth patches) running on EC2 spot instances
  • Each (model × prompt × run) opens a fresh ephemeral userDataDir, types the prompt into the public web UI, and captures the response text and citations (capture step sketched below)
  • Lambda fans out per-prompt SQS messages with rate-limit aware routing across 2 IPs
  • Raw responses written to S3, parsed citations stored in Postgres
  • Aggregations run via a one-off Python script over 4 cron windows (every 6h)
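
To make the capture step concrete, here's a minimal sketch of one (model × prompt × run) capture, assuming patchright's sync API (it mirrors Playwright's). The selectors and the fixed wait are hypothetical stand-ins; each web UI needs its own:

```python
from patchright.sync_api import sync_playwright

def capture_run(chat_url: str, prompt: str, user_data_dir: str) -> dict:
    """One (model × prompt × run): fresh profile in, answer text + cited links out."""
    with sync_playwright() as p:
        # Fresh ephemeral profile per run, so no session state leaks between runs
        ctx = p.chromium.launch_persistent_context(user_data_dir, headless=True)
        page = ctx.new_page()
        page.goto(chat_url)
        page.fill("textarea", prompt)     # real selectors differ per LLM UI
        page.keyboard.press("Enter")
        page.wait_for_timeout(30_000)     # crude; production code watches the DOM settle
        answer = page.inner_text("main")
        citations = [a.get_attribute("href")
                     for a in page.query_selector_all("main a[href^='http']")]
        ctx.close()
    return {"response": answer, "citations": citations}
```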

Sample:

  • 1 category: AI brand visibility / GEO tooling
  • 5 prompts varying the category question (e.g., "Which platforms help me track my brand mentions in ChatGPT, Gemini, and Perplexity?")
  • 4 LLMs: chatgpt-web, gemini-web, grok-web, perplexity-web
  • 4 cron runs over 18 hours
  • Result: 69 successful (model × prompt × run) records out of 80 attempts (4 models × 5 prompts × 4 runs), yielding 700 cited URLs

Important: the four LLM endpoints are the public web interfaces (chatgpt.com, gemini.google.com, etc.) — not the API. The web UIs use grounding by default; APIs typically don't. Real users interact with the web UI, so that's what I measure.


Finding 1: Gemini's citations are nearly random run-to-run

Jaccard similarity of cited-domain sets across paired runs of the same prompt:

```
grok-web        0.49   ← most stable (small sample n=4)
chatgpt-web     0.32   ← ~1/3 overlap between runs
perplexity-web  0.32   ← same
gemini-web      0.10   ← ~90% of cited domains change between runs
```

Jaccard 0.10 means the cited set you observe today has only ~10% overlap (intersection over union) with the set you'd observe re-running the same prompt tomorrow.

That's not "noisy." That's nearly random.
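
For reference, the metric itself: Jaccard similarity is intersection size over union size of two cited-domain sets. A minimal sketch with made-up run data:

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two cited-domain sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

run_1 = {"siftly.ai", "nightwatch.io", "kime.ai", "reddit.com"}
run_2 = {"siftly.ai", "digitalapplied.com", "ziptie.dev", "genixly.io"}
print(jaccard(run_1, run_2))  # 1 shared of 7 total, ~0.14: Gemini territory
```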

If you're building tools or dashboards that surface "Gemini cited X" as a single point estimate, you're misleading users. The right unit is cited in N of M samples. For a stable Gemini reading on any prompt, sample 3-4 times.
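
The "cited in N of M samples" unit is just a counter over repeated runs. A sketch with toy data:

```python
from collections import Counter

# Cited-domain sets from M repeated runs of the same prompt (toy data)
runs = [
    {"siftly.ai", "reddit.com"},
    {"siftly.ai", "kime.ai"},
    {"siftly.ai"},
    {"reddit.com", "ziptie.dev"},
]
counts = Counter(domain for run in runs for domain in run)
for domain, n in counts.most_common():
    print(f"{domain}: cited in {n} of {len(runs)} samples")  # siftly.ai: 3 of 4, ...
```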

Possible causes (none confirmed from external sampling):

  • A/B testing of grounding sources to mitigate over-reliance on a single source
  • Heavy freshness weighting → source pool changes hourly
  • Multi-source diverse sampling that rotates which category gets emphasized
  • User-segment-specific routing (different model variants by traffic segment)

Finding 2: Each LLM has a citation personality

Top 10 most-cited domains per LLM. The lists are almost disjoint:

ChatGPT — tier-1 publications + recognizable SaaS blogs:

```
techradar.com           12
frase.io                11
visible.seranking.com    9
llmclicks.ai             7    competitor
riffanalytics.ai         6
visiblie.com             6
otterly.ai               5    competitor
sitepoint.com            4
getpassionfruit.com      4
llmrefs.com              4
```

Gemini — SEO agencies and indie creators:

```
siftly.ai               18    competitor
nightwatch.io           15    agency blog
kime.ai                 10
digitalapplied.com       8
visible.seranking.com    6
frictionai.co            6
nicklafferty.com         5    personal blog
genixly.io               5
reddit.com               4
ziptie.dev               4
```

Grok — universal sources + topic-blind weirdness:

```
cookiepedia.co.uk        9    cookie compliance???
onetrust.com             9    cookie compliance???
visible.seranking.com    8
reddit.com               6
ziptie.dev               4
digitalapplied.com       3
therankmasters.com       3
siftly.ai                3
amplitude.com            3
evertune.ai              3
```

The cookiepedia + onetrust spike is the strangest single observation in the dataset. For prompts about "best AI brand monitoring tool," both are completely off-topic. My read: Grok's web search appears to grab pages it visited during the search session — including cookie consent banners — and treat them as content sources. Grok's grounding has the lowest signal-to-noise ratio of the four.

Perplexity — competitor product pages, directly:

```
visible.seranking.com   12
reddit.com               7
llmclicks.ai             7    competitor
nicklafferty.com         7
amplitude.com            6
evertune.ai              6    competitor
superlines.io            6    competitor
brainz.digital           6
therankmasters.com       5
aiclicks.io              5    competitor
```

Perplexity disproportionately cites the homepages of competitors directly. Useful as competitive intel: Perplexity cites whoever has discoverable, well-structured product pages.

Top-10 overlap across all four models: exactly 1 domain. Each LLM is, for top-10 territory, a different internet.
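
That overlap can be checked directly from the four lists above:

```python
# Top-10 cited domains per model, copied from the tables above
top10 = {
    "chatgpt-web": {"techradar.com", "frase.io", "visible.seranking.com",
                    "llmclicks.ai", "riffanalytics.ai", "visiblie.com",
                    "otterly.ai", "sitepoint.com", "getpassionfruit.com",
                    "llmrefs.com"},
    "gemini-web": {"siftly.ai", "nightwatch.io", "kime.ai", "digitalapplied.com",
                   "visible.seranking.com", "frictionai.co", "nicklafferty.com",
                   "genixly.io", "reddit.com", "ziptie.dev"},
    "grok-web": {"cookiepedia.co.uk", "onetrust.com", "visible.seranking.com",
                 "reddit.com", "ziptie.dev", "digitalapplied.com",
                 "therankmasters.com", "siftly.ai", "amplitude.com",
                 "evertune.ai"},
    "perplexity-web": {"visible.seranking.com", "reddit.com", "llmclicks.ai",
                       "nicklafferty.com", "amplitude.com", "evertune.ai",
                       "superlines.io", "brainz.digital", "therankmasters.com",
                       "aiclicks.io"},
}
print(set.intersection(*top10.values()))  # {'visible.seranking.com'}
```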


Finding 3: 10 domains every LLM cites

Drop the "top 10" filter and look at the long tail. Exactly 10 domains were cited by all four LLMs in the sample:

```
reddit.com                forum
visible.seranking.com     SEO suite
amplitude.com             analytics platform
conductor.com             enterprise SEO
zapier.com                automation
siftly.ai                 direct competitor
scrunch.com               direct competitor
aiclicks.io               direct competitor
bluefishai.com            AI niche
ziptie.dev                dev blog
```

A single content placement on any one of these reaches all 4 LLMs simultaneously. If you're bandwidth-constrained, this is the list to attack first.

Reddit alone produced 21 distinct citation events across the 80-attempt sample, roughly one Reddit cite for every four runs attempted. Most categories I've looked at have Reddit at the top of the universal-N list.


Finding 4: Recommendation ≠ citation

This is the finding that changed how I think about GEO measurement.

Two distinct events happen in any LLM answer:

  • Mention — the brand name appears in the recommendation text
  • Citation — the brand's domain appears in the cited sources

These are distinct events that don't have to co-occur. I measured the conditional probabilities for 25+ named brands in the sample.

```python
# Simplified version of the calculation. brand_cited and brand_mentioned are
# boolean Series with one entry per (model, prompt, run, brand).
P_cited_given_mentioned = (brand_cited & brand_mentioned).sum() / brand_mentioned.sum()
P_mentioned_given_cited = (brand_cited & brand_mentioned).sum() / brand_cited.sum()
```
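
Spelled out as a self-contained example with toy data (the DataFrame layout is my assumption: one row per (model, prompt, run, brand)):

```python
import pandas as pd

# Toy stand-in for the real table: one row per (model, prompt, run, brand)
df = pd.DataFrame({
    "brand_mentioned": [True, True, True, False, True, False],
    "brand_cited":     [True, False, True, True, False, False],
})
both = (df.brand_cited & df.brand_mentioned).sum()
print(both / df.brand_mentioned.sum())  # P(cited|mentioned) = 2/4 = 0.50
print(both / df.brand_cited.sum())      # P(mentioned|cited) = 2/3 ≈ 0.67
```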

Results:

```
Model           P(cited|mentioned)   P(mentioned|cited)
chatgpt-web     47%                  75%
perplexity-web  40%                  45%
grok-web        37%                  82%
gemini-web      33%                  63%
```

Across all four models, only 33-47% of recommended brands are also cited. The remaining 53-67% were pulled from training data alone: the LLM remembered the brand without grounding the recommendation in any current web source.

The most striking example: a brand named Peec was mentioned 8 / 7 / 9 / 9 times by the four models across the same 5 prompts. Citations of peec.com? Zero, across all four LLMs.

Peec wasn't outreaching. Peec wasn't getting backlinks. The LLMs just know Peec exists in this category.

This implies something most GEO tools and playbooks gloss over: durable presence in LLM training data is doing real work that fresh outreach can't replicate. Wikipedia entries, Crunchbase profiles, Hacker News threads from 2023-2024, conference talk recordings: none of these show up as citations, but they materially shape recommendations.

If you're allocating 100% of your GEO budget to fresh citation outreach, you're missing roughly half of the recommendation surface.


Finding 5: Perplexity cites broadly, recommends narrowly

The inverse direction (P(mentioned | cited) above) splits the four into two camps:

  • Grok (82%) and ChatGPT (75%): citing ≈ endorsing. If they cite you, they almost always recommend you.
  • Perplexity (45%): cites broadly to support specific sentences, but only recommends a narrow subset of those sources as brand endorsements.

A single Perplexity citation is a weaker brand signal than a single ChatGPT citation. To move Perplexity's recommendation needle, you need citations plus explicit "best in category" positioning in the cited content.


What I'd actually do with this data

If I were optimizing for LLM visibility in this category, here's the prioritized plan:

  1. Reddit engagement (highest leverage, low cost) — find the 5 most-cited threads in your niche, comment substantively. Don't brigade or pitch your product. Authentic engagement compounds across all 4 LLMs.
  2. Get on the universal-10 — pitch a guest article to amplitude.com/blog, conductor.com/learn, or zapier.com/blog. A single placement reaches all 4 LLMs at once.
  3. Build training-data presence — Wikipedia entry (3-6 months), Crunchbase profile (2 hours), Hacker News post (instant), conference talk recording. Aim for the Peec pattern.
  4. Optimize your own product page — schema markup, comparison-friendly headings, category-clear meta. This matters most for Perplexity but helps all four.
  5. Stop chasing single-LLM gains — pick your two priority LLMs, plan to their personality, accept that the other two need different content.

Limitations

To stay honest:

  • One category, one project. The universal-N list and per-LLM personalities probably differ across verticals.
  • Small Grok sample (n=9 vs 20 for others). The cookiepedia observation is suggestive, not definitive.
  • Scraper-dependent. Each LLM's DOM changes, so citation extraction is a moving target.
  • No domain-authority weighting. A 4-of-4 cite from a small blog is genuinely lower-leverage than from a tier-1 publication.

What's next

I'm planning to publish similar analyses on different verticals as I collect them. Want this kind of breakdown on your category? Email me at yibo@aiattention.ai with your category and competitors — I'll run 5 prompts × 4 LLMs × 4 runs on your vertical and send back the universal sources, citation gaps vs your competitors, and per-LLM tactics. No charge.

If you found this useful, follow me on dev.to or check out aiattention.ai. The product is at the MVP stage; feedback welcome.

— Yibo
