Shinobis IA

Posted on • Originally published at shinobis.com
How to Structure Content So AI Models Actually Cite It (Based on a 602-Prompt Study)

A research study analyzed 602 prompts across ChatGPT, Gemini, and Perplexity, tracking 21,000 citations to find out what makes an AI model cite one source over another.

The answer isn't backlinks. It isn't domain authority. It's content structure.

I took those findings and applied every one of them to my blog. Here's the data, what I changed, and the implementation I use.

The five strategies ranked by citation influence

Not all content strategies perform equally when AI models decide what to cite. These are the results from the study, sorted by measured influence:

  1. Numerical data: +61.55% citation influence
  2. Clear definitions at the start of sections: +57.33%
  3. Structured comparisons: +55.28%
  4. How-to steps: +41.20%
  5. Q&A format: -5.74%

That last one is important. Q&A format, the strategy every SEO guide recommends for featured snippets, has a negative influence on AI citations. LLMs don't want questions and answers; they want direct statements they can extract without needing the question for context.

What this looks like in code

Every post on my blog now follows a specific HTML structure based on these findings. No framework. No CMS. Just PHP generating semantic HTML.

Definitions first. Every H2 section opens with a direct definition or statement. Not a question. Not a transition.

```html
<h2>Content negotiation for AI agents</h2>
<p>Content negotiation is the process where a server returns
different response formats based on what the client requests
in its Accept header. When an AI agent sends
Accept: text/markdown, the server can return clean Markdown
instead of full HTML.</p>
```
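The definition above also describes behavior the server can implement. As a minimal sketch in plain PHP: the `negotiate()` helper and its naive substring check are my illustrative assumptions, not the blog's actual code, and a real implementation would parse `Accept` q-values properly.

```php
<?php
// negotiate() is an illustrative helper, not the blog's actual code.
// A full implementation would parse Accept q-values instead of
// substring-matching the header.
function negotiate(string $acceptHeader): string
{
    return str_contains($acceptHeader, 'text/markdown') ? 'markdown' : 'html';
}

// An AI agent asking for Markdown gets Markdown; a browser gets HTML.
var_dump(negotiate('text/markdown, text/html;q=0.8')); // string(8) "markdown"
var_dump(negotiate('text/html,application/xhtml+xml')); // string(4) "html"
```

The caller would then set `Content-Type: text/markdown` or `text/html` accordingly before emitting the body.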

Not this:

```html
<h2>What is content negotiation?</h2>
<p>Have you ever wondered how servers know what format to
send back? Let's explore this fascinating topic...</p>
```

The first version is citable. The second requires the question for context and has zero information density in the first sentence.

Numerical data inline, not in tables. AI models extract statements, not table cells.

```html
<p>The Cloudflare Agent Readiness test checks 10 standards.
My blog scored 50 out of 100. Four of those standards apply
to blogs. I implemented all four in under an hour.</p>
```

This gives the model a complete, self-contained statement with specific numbers it can cite directly.

Structured comparisons with explicit criteria.

```html
<h2>Google SEO vs GEO: what actually changed</h2>
<p>Google's AI search uses the same ranking systems as
traditional search. RAG and query fan-out pull from the
existing index. For Google, SEO and GEO are identical.
But ChatGPT, Claude and Perplexity do not use Google's
index. They run their own crawlers with different citation
criteria. Optimizing for Google does not guarantee citations
from other AI models.</p>
```

The comparison is explicit, names the parties, states the difference, and gives a conclusion. A model can extract any sentence and it stands on its own.

The JSON-LD layer

Beyond HTML structure, I generate a JSON-LD Knowledge Graph on every page. This is the @graph structure that tells AI models what the content is about before they read a single paragraph.

```json
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "BlogPosting",
      "headline": "602 Prompts, 21,000 Citations",
      "abstract": "Analysis of what makes AI models cite content...",
      "about": [
        { "@type": "Thing", "name": "Generative Engine Optimization" },
        { "@type": "Thing", "name": "AI citation analysis" }
      ],
      "citation": [
        { "@type": "CreativeWork", "name": "Source study name" }
      ]
    }
  ]
}
```

The abstract field is what LLMs read first. The about array tells them the topic without parsing the full article. The citation array signals that the content references real sources.
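On a plain-PHP stack, a block like this can be generated with a few lines of `json_encode`. A minimal sketch, assuming a hypothetical `$post` array: the `buildJsonLd()` helper and its field names are my illustration, not the blog's actual implementation.

```php
<?php
// buildJsonLd() is an illustrative helper; the $post field names
// (title, abstract, topics, sources) are assumptions.
function buildJsonLd(array $post): string
{
    $graph = [
        '@context' => 'https://schema.org',
        '@graph'   => [[
            '@type'    => 'BlogPosting',
            'headline' => $post['title'],
            'abstract' => $post['abstract'],
            // Map plain topic strings to schema.org Thing objects.
            'about'    => array_map(
                fn(string $topic) => ['@type' => 'Thing', 'name' => $topic],
                $post['topics']
            ),
            // Map referenced sources to CreativeWork objects.
            'citation' => array_map(
                fn(string $source) => ['@type' => 'CreativeWork', 'name' => $source],
                $post['sources']
            ),
        ]],
    ];
    return json_encode($graph, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT);
}

// Emit the script tag in the page <head>.
echo '<script type="application/ld+json">' . "\n"
    . buildJsonLd([
        'title'    => '602 Prompts, 21,000 Citations',
        'abstract' => 'Analysis of what makes AI models cite content...',
        'topics'   => ['Generative Engine Optimization', 'AI citation analysis'],
        'sources'  => ['Source study name'],
    ])
    . "\n</script>";
```

Building the array once and serializing it keeps the JSON-LD in sync with the page content, since both come from the same `$post` data.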

Google says don't obsess over structured data. For Google's system, that's fair. But the study shows semantic alignment is the strongest predictor of citation (r=0.43). JSON-LD contributes directly to that alignment.

The number that changed my perspective

One ChatGPT citation is worth 4.6x more than a Google click.

A single mention in a ChatGPT response drives more engaged traffic than ranking on page one for the same query. The visitors stay longer, read more pages, and convert at higher rates. They arrive with context because the AI already explained why your content matters.

Optimal content length

The study data shows a clear range: 1,000 to 3,000 words. Shorter and there isn't enough substance for a model to cite. Longer and the model has to work harder to extract the relevant section.

This doesn't mean padding content. It means covering a topic with enough depth that an AI model can find a specific, citable statement for a specific query.

What I run this on

Vanilla PHP 8.2, MariaDB, Apache on shared hosting with cPanel. No Laravel. No WordPress. No Node. Every optimization described here runs on the most boring stack imaginable.

The full analysis, with the study's raw data and my implementation results, is in the original post.

Full breakdown → https://shinobis.com/en/602-prompts-21000-citations-how-ai-models-choose-what-to-cite

What content structure are you using on your site? Have you noticed any AI citation patterns in your analytics?
