Chudi Nnorukam

Posted on • Originally published at chudi.dev

What Actually Predicts Whether AI Cites Your Website (Data from 7 Site Audits)


Domain authority does not predict whether AI will cite your website. I audited 7 websites for AI citability, and the results challenge nearly everything the SEO industry assumes about AI search visibility.

Ahrefs (DA 92) was cited by AI only 5% of the time despite 100% visibility. A brand-new site with DA under 10 achieved a 15% citation rate. Sites with millions of daily visitors failed basic infrastructure checks. The factors that actually predicted citations had nothing to do with backlinks or traffic.

Here is what the data showed.

TL;DR

AI citability is whether AI answer engines include your URL as a source, not just mention your brand.

  • Domain authority has zero correlation with AI citation rates
  • Ahrefs (DA 92) is 100% AI-visible but only 5% cited
  • citability.dev (DA under 10) achieved 15% citation rate, outperforming DA 90+ sites
  • Reddit, Medium, and X all failed basic AI infrastructure checks
  • The three strongest predictors: answer-first content, dateModified schema, original data
  • Only 12% of URLs cited by LLMs appear in Google's top 10 results

The Audit: 7 Sites, 3 AI Platforms, 10 Infrastructure Checks

I used the AI Visibility Readiness (AVR) framework to run infrastructure audits on 7 websites. Each site was checked for 10 signals that AI crawlers use to discover and parse content: robots.txt, sitemap.xml, answer-first content, content freshness, structured data (JSON-LD), meta descriptions, canonical URLs, HTTPS, heading hierarchy, and social sharing readiness.

Then I queried ChatGPT, Perplexity, and Claude with questions each site should be able to answer. I tracked two metrics:

  • AI Visibility: Does the AI mention the brand when asked?
  • AI Citability: Does the AI include a URL from the site as a cited source?
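
These two metrics reduce to a simple tally over tracked queries. A minimal sketch of the scoring (the result structure and field names are my own illustration, not part of the AVR framework):

```python
# Tally AI visibility and citability from tracked query results.
# Each result records whether the answer mentioned the brand and
# whether it cited a URL from the site. (Illustrative structure only.)

def score_site(results):
    """Return (visibility %, citability %) for a list of query results."""
    total = len(results)
    mentioned = sum(1 for r in results if r["mentioned"])
    cited = sum(1 for r in results if r["cited_url"])
    return (100 * mentioned / total, 100 * cited / total)

results = [
    {"query": "best keyword research tools", "mentioned": True, "cited_url": False},
    {"query": "how to check backlinks",      "mentioned": True, "cited_url": True},
    {"query": "what is domain rating",       "mentioned": True, "cited_url": False},
    {"query": "seo audit checklist",         "mentioned": True, "cited_url": False},
]

visibility, citability = score_site(results)
print(f"Visibility: {visibility:.0f}%  Citability: {citability:.0f}%")
# With this sample: 100% visible, 25% cited
```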

The Results

Site             Domain Authority   AI Infrastructure   AI Visibility   AI Citability
ahrefs.com       92                 Foundation-ready    100%            5%
semrush.com      91                 Foundation-ready    Partial         Partial
chudi.dev        28                 Foundation-strong   25%             0%
citability.dev   Under 10           Foundation-strong   44%             15%
reddit.com       97                 Not ready           Untested        Untested
medium.com       95                 Not ready           Untested        Untested
x.com            96                 Not ready           Untested        Untested

The three highest-DA sites (Reddit 97, X 96, Medium 95) all failed basic infrastructure readiness. They are missing structured data, answer-first content, or proper AI crawler permissions. These sites get cited constantly by AI, but not because of their infrastructure. They get cited because AI training data includes their content at massive scale.

The most striking result: citability.dev, a site with DA under 10 and fewer than 100 backlinks, achieved a 15% citation rate. That is 3x higher than Ahrefs (DA 92). The difference is not authority. The difference is original benchmark data and answer-first content structure.

For everyone else, infrastructure is the gate.

Does High Domain Authority Mean AI Will Cite You?

No. The data is clear: DA has zero predictive power for AI citations.

Ahrefs has a DA of 92, one of the highest in the SEO industry. Every AI platform recognizes the brand instantly. Ask ChatGPT "what is Ahrefs?" and you get a detailed, accurate answer. That is 100% AI visibility.

But ask ChatGPT "what tools should I use for keyword research?" and Ahrefs gets mentioned but rarely linked. The AI knows the brand exists. It does not need to cite the source. That is the visibility-citation gap, and it exists because AI systems already have the information internalized from training data.

Citation happens when AI needs your content as a source for a specific claim. That requires your content to be structured in a way the AI can extract and attribute.

What Infrastructure Do AI Crawlers Actually Need?

The 10-check audit revealed a clear pattern. Sites that passed 8+ infrastructure checks had measurably higher visibility scores. Sites that failed basic checks were invisible regardless of their authority.

The Baseline Signals

robots.txt and sitemap.xml are table stakes. Every site in the audit had these, but the content of each matters. Reddit's robots.txt blocks several AI crawlers. Medium's sitemap is auto-generated but does not include all content pages. Simply having the files is not enough.
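
You can verify what a robots.txt file actually permits with Python's standard-library parser. A minimal sketch, using a made-up robots.txt in the spirit of the policies described above (GPTBot is OpenAI's crawler user-agent):

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt that admits general crawlers but blocks one AI
# crawler, similar to the Reddit policy described above (illustrative).
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False: blocked
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True: allowed
```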

HTTPS and canonical URLs are similarly baseline. Every audited site passed these. They are necessary but not differentiating.

The Differentiating Signals

Three signals separated the visible sites from the invisible ones:

Answer-first content. Pages that led with a direct answer in the first 100 words scored dramatically higher on AI extractability. This matches research showing AI systems extract the first clear, unqualified statement they find on a page. Generic marketing copy, hero images, and navigation-heavy layouts all push the answer down, making it harder for AI to extract.

Structured data (JSON-LD). Sites with Article, FAQPage, and HowTo schema gave AI systems explicit context about content purpose and structure. The chudi.dev audit showed 9 schema types across pages, including TechArticle with dateModified, FAQPage with 5+ questions per article, and Person schema with expertise signals. This machine-readable layer is what lets AI systems understand your content without parsing ambiguous HTML.
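
A minimal TechArticle JSON-LD payload of the kind found in the chudi.dev audit can be generated like this. The property names follow schema.org; the values are placeholders, not taken from the audited pages:

```python
import json
from datetime import date

# Build a minimal TechArticle JSON-LD payload with a dateModified signal.
# Property names follow schema.org; the values here are placeholders.
article_schema = {
    "@context": "https://schema.org",
    "@type": "TechArticle",
    "headline": "What Actually Predicts Whether AI Cites Your Website",
    "datePublished": "2025-01-15",
    "dateModified": date.today().isoformat(),
    "author": {"@type": "Person", "name": "Chudi Nnorukam"},
}

# Embed in the page head as:
# <script type="application/ld+json"> ... </script>
print(json.dumps(article_schema, indent=2))
```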

Content freshness. Pages with dateModified in their schema received 1.8x more AI citations than pages without, according to Semrush research. This aligns with another finding: 95% of ChatGPT citations come from recently published or updated content. Stale content without date signals gets deprioritized.

Which Sites Get Cited vs Just Mentioned?

The gap between being mentioned and being cited is the central problem in AI visibility.

Platform-Specific Citation Behavior

Each AI platform has different citation preferences:

  • Perplexity cites approximately 6.6 sources per answer and heavily indexes Reddit (46.7% of its top cited sources)
  • ChatGPT cites only about 2.6 sources per answer and shows strong Wikipedia preference (7.8% of all citations)
  • Google Gemini cites about 6.1 sources per answer with 76% overlap with Google's traditional top 10

This means the optimization strategy differs by platform. Perplexity rewards breadth of presence across forums and communities. ChatGPT rewards being on established reference sources. Google AI Overviews still correlates heavily with traditional SEO rankings.

The 12% Divergence

Only 12% of URLs cited by LLMs appear in Google's top 10 search results for the same queries. This is the statistic that should reframe how you think about AI search: ranking on Google and getting cited by AI are largely separate problems.

The exception is Google AI Overviews, which shows 76% overlap with traditional rankings. ChatGPT and Perplexity, by contrast, operate on fundamentally different source-selection algorithms.

The Three Factors That Actually Predict AI Citations

Based on the audit data and corroborating research, three factors had the strongest predictive power:

1. Answer-First Content Structure

Pages where the direct answer appears in the first 100 words get extracted more often. This means:

  • Lead with the answer, not the question
  • Keep opening paragraphs to 25-40 words
  • Use clear, factual statements without qualifying language
  • Structure H2 headings as questions the reader would ask AI

The qualifying language point is critical. Phrases like "it depends," "in many cases," or "it can be argued" signal uncertainty. AI systems prefer definitive statements they can extract as answers.
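
A quick heuristic can flag qualifying language in a page's opening. This is my own illustrative check built on the phrases mentioned above, not an official detector:

```python
import re

# Heuristic: does the page lead with a direct, unqualified answer?
# The hedge list below is an illustration of the qualifying language
# discussed above, not an exhaustive or official set.
HEDGES = ["it depends", "in many cases", "it can be argued"]

def answer_first_issues(text, window=100):
    """Flag hedging phrases within the first `window` words of a page."""
    opening = " ".join(re.findall(r"\S+", text)[:window]).lower()
    return [h for h in HEDGES if h in opening]

lead = "It depends on your goals, but in many cases several tools work."
print(answer_first_issues(lead))   # ['it depends', 'in many cases']
print(answer_first_issues("Ahrefs is a backlink analysis tool."))  # []
```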

2. dateModified Schema with Substantive Updates

The 1.8x citation lift from dateModified schema is real, but only when paired with actual content updates. Google penalizes fake freshness signals, meaning you cannot just bump the date without changing anything. The safe approach:

  • Update content quarterly with new data and statistics
  • Add at least 100 words of substantive new content per refresh
  • Reference current-year sources and data points
  • Only update dateModified when the refresh is genuine

3. Inline Statistics and Original Data

Pages with inline statistics get 40%+ more AI citations. This makes sense: AI systems need claims they can attribute, and specific numbers are the easiest claims to attribute to a source.

Original data is even more powerful. If your page contains data that does not exist elsewhere, AI has no choice but to cite you when referencing it. This is why I publish audit results and benchmark data publicly. The comparison table at the top of this article is data that exists nowhere else.

What This Means for Your Site

The path from invisible to cited is not about building more backlinks or increasing your DA. It is about making your content technically extractable by AI systems.

The checklist is short:

  1. Check your infrastructure. Run a free scan to verify the 10 baseline signals.
  2. Restructure your content. Lead with answers. Use question-based headings. Add FAQ and HowTo schema.
  3. Publish original data. Give AI systems something they can only get from you.
  4. Keep content fresh. Update quarterly with substantive changes and current statistics.
  5. Test across platforms. Query ChatGPT, Perplexity, and Claude with questions your site should answer. Track citation rates over time.
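
Step 1 can be spot-checked offline on a saved HTML page with the standard library. A minimal sketch covering three of the ten signals (this is not the AVR framework's scanner, and the sample page is illustrative):

```python
from html.parser import HTMLParser

# Spot-check three of the ten baseline signals on a saved HTML page:
# meta description, canonical URL, and JSON-LD structured data.
class SignalChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.signals = {"meta_description": False, "canonical": False, "json_ld": False}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") == "description":
            self.signals["meta_description"] = True
        if tag == "link" and a.get("rel") == "canonical":
            self.signals["canonical"] = True
        if tag == "script" and a.get("type") == "application/ld+json":
            self.signals["json_ld"] = True

page = """<html><head>
<meta name="description" content="Answer-first article on AI citability.">
<link rel="canonical" href="https://example.com/post">
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body></body></html>"""

checker = SignalChecker()
checker.feed(page)
print(checker.signals)  # all three signals found on this sample page
```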

The sites that get cited in 2026 will not be the ones with the highest DA. They will be the ones whose content is structured so AI systems can extract, trust, and attribute it.
