Nikhil Goyal
How to Make Your Website Easier for ChatGPT and Perplexity to Cite

I've spent the last year building an AI visibility tool, and in the process I've had to figure out what actually makes AI search engines cite one website over another.

Most of what I assumed turned out to be wrong. Here's what I've learned.


The basics: AI search is a different game

If you're already ranking well on Google, you might assume AI engines will find you too. The data says otherwise.

Ahrefs found that around 80% of URLs cited by ChatGPT, Perplexity, Copilot, and Google AI Mode don't rank in Google's top 100 for the original query. These engines aren't just repackaging Google results — they're making independent decisions about what to cite.

What seems to matter most:

  • Content structure and extractability — can the AI pull a clean answer from your page?
  • Freshness — BrightEdge research shows pages updated within the last 60 days are 1.9x more likely to appear in AI answers
  • Structured data — sites implementing schema markup and FAQ blocks saw a 44% increase in AI citations
  • Depth over keyword density — comprehensive coverage of a topic beats keyword stuffing

Traditional SEO signals like backlink count and domain rating still matter, but they're not sufficient on their own.


What I found in my server logs

When I started checking server logs for AI crawler activity, a few things stood out.

First, check whether you're even allowing AI crawlers in:

curl https://yoursite.com/robots.txt | grep -i "gptbot\|claudebot\|perplexitybot\|anthropic\|chatgpt"

A surprising number of sites — including ones that want AI visibility — have blanket blocks on these bots. Sometimes it's an overzealous security plugin, sometimes it's a robots.txt that hasn't been revisited since 2023.
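Beyond a quick grep, Python's standard library can evaluate the actual robots.txt rules per bot, including wildcard and `Allow` directives. A minimal sketch — the robots.txt content and URL below are placeholders, not from any real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot explicitly, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

def check_ai_access(robots_txt: str, url: str = "https://yoursite.com/") -> dict:
    """Return {bot_name: allowed} for each AI crawler user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in AI_BOTS}

print(check_ai_access(ROBOTS_TXT))
# GPTBot is blocked by its own group; the others fall under the wildcard rule.
```

In practice you'd fetch the live robots.txt first (e.g. with `urllib.request`) and pass its text in.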

Second, AI crawlers behave differently from Googlebot. They tend to hit fewer pages but spend more time parsing each one. They care a lot about whether the content is directly accessible in the HTML versus buried behind client-side JavaScript rendering.

If your site is a heavy SPA with most content rendered client-side, AI crawlers may be seeing an empty shell.
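One quick way to test this is to check whether your key phrases exist in the raw HTML source at all, before any JavaScript runs — roughly what a non-rendering crawler sees. A rough sketch (the sample SPA shell and phrase are made up for illustration):

```python
import re

def content_in_raw_html(html: str, key_phrases: list[str]) -> dict:
    """Check which key phrases appear in the raw (unrendered) HTML.

    Crawlers that don't execute JavaScript only see this source, so
    phrases missing here are likely invisible to them."""
    # Strip script/style bodies so text inside JS bundles doesn't count.
    visible = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                     flags=re.DOTALL | re.IGNORECASE)
    return {phrase: phrase.lower() in visible.lower() for phrase in key_phrases}

# Hypothetical SPA shell: the answer only exists inside a JS bundle.
spa_html = """
<html><body><div id="root"></div>
<script>render("Emergency plumbing costs $150-$500");</script>
</body></html>
"""
print(content_in_raw_html(spa_html, ["Emergency plumbing costs"]))
```

If the phrases come back `False` on your production pages, server-side rendering or prerendering is the usual fix.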


The content structure that gets cited

After analyzing which pages on client sites get cited versus which get ignored, a clear pattern emerged. It comes down to answer-first content architecture.

Here's what I mean:

<!-- This gets ignored by AI engines -->
<p>When it comes to understanding the complexities of emergency 
plumbing services, there are many factors that homeowners should 
consider before making a decision about which provider to call...</p>

<!-- This gets cited -->
<h2>How much does emergency plumbing cost?</h2>
<p>Emergency plumbing typically costs $150–$500 for common issues 
like burst pipes or severe leaks. After-hours calls usually add 
a $75–$150 surcharge.</p>

This isn't surprising when you look at the data: a Wix study found that 44.2% of all LLM citations come from the first 30% of a page's text. If your answer is in paragraph 7, the AI has already moved on.

The pattern that works best is what I've been calling "answer packs" — structured content blocks with a specific format:

## [Question in natural language]

[Direct answer in 2-4 sentences]

**Key details:**
- Specific fact or data point
- Another relevant detail
- Context that helps the reader decide

*Last updated: March 2026*

The "last updated" line matters. AI engines have a measurable recency bias — one study found that artificially refreshing publication dates alone can shift AI ranking positions by up to 95 places.


Schema markup: the low-hanging fruit most devs skip

If you're a developer reading this, structured data is probably the highest-ROI thing you can implement today. Websites with author schema are 3x more likely to appear in AI answers.

Here's a minimal FAQPage implementation:

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What should I do if a pipe bursts?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Immediately shut off the main water valve, then call an emergency plumber. While waiting, open faucets to drain remaining water and move valuables away from the affected area."
    }
  }]
}

And if you're running a service business or local operation, LocalBusiness schema is equally important:

{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "name": "Your Business Name",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Austin",
    "addressRegion": "TX"
  },
  "telephone": "+1-512-555-0100",
  "priceRange": "$$",
  "openingHoursSpecification": {
    "@type": "OpeningHoursSpecification",
    "dayOfWeek": ["Monday","Tuesday","Wednesday","Thursday","Friday"],
    "opens": "08:00",
    "closes": "18:00"
  }
}

The combination of semantic HTML (<article>, <section>, proper heading hierarchy) plus JSON-LD schema gives AI crawlers a machine-readable map of your content. Without it, they're guessing.
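If you're generating pages from a CMS or database, it's easy to emit the FAQPage block programmatically rather than hand-writing JSON. A small sketch — `faq_jsonld` is my own helper name, not a library function:

```python
import json

def faq_jsonld(qa_pairs: list[tuple[str, str]]) -> str:
    """Build a FAQPage JSON-LD string from (question, answer) pairs,
    ready to embed in a <script type="application/ld+json"> tag."""
    payload = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in qa_pairs
        ],
    }
    return json.dumps(payload, indent=2)

print(faq_jsonld([
    ("What should I do if a pipe bursts?",
     "Shut off the main water valve, then call an emergency plumber."),
]))
```

Generating it this way also means your visible FAQ content and your schema can never drift out of sync, which is exactly what you want.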


The HTML quality checklist

Here's the quick checklist I run on every page I want AI engines to cite:

Structure:

  • [ ] Uses semantic HTML5 elements (<article>, <section>, <main>)
  • [ ] Proper heading hierarchy (h1 → h2 → h3, no skipped levels)
  • [ ] Key content is in the HTML source, not only rendered via JS

Content:

  • [ ] Page leads with a direct answer to the primary query
  • [ ] Specific data points (prices, timelines, specs) are in plain text, not images
  • [ ] Content is updated within the last 60 days

Machine readability:

  • [ ] JSON-LD schema on the page (FAQPage, LocalBusiness, Product, HowTo — whatever fits)
  • [ ] Author information present (name, credentials, schema)
  • [ ] robots.txt allows GPTBot, ClaudeBot, PerplexityBot

Meta:

  • [ ] Clean, descriptive URLs (not /page?id=4827)
  • [ ] OpenGraph and meta description present
  • [ ] Sitemap includes the page and is submitted

None of this is exotic. It's mostly just good web development hygiene. But it's surprising how many production sites fail 3-4 of these checks.
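Some of these checks are easy to automate. For example, the heading-hierarchy check can be done with the standard-library HTML parser — a minimal sketch, not a full audit tool:

```python
from html.parser import HTMLParser

class HeadingChecker(HTMLParser):
    """Collect heading levels and flag skipped levels (e.g. h1 -> h3)."""

    def __init__(self):
        super().__init__()
        self.levels: list[int] = []
        self.skips: list[tuple[int, int]] = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            level = int(tag[1])
            if self.levels and level > self.levels[-1] + 1:
                self.skips.append((self.levels[-1], level))
            self.levels.append(level)

checker = HeadingChecker()
checker.feed("<h1>Title</h1><h3>Oops, skipped h2</h3>")
print(checker.skips)  # [(1, 3)]
```

Wire a handful of these into CI and the checklist stops being a manual chore.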


Some counterintuitive findings

A few things that went against my assumptions:

Question-style headings underperform. I expected "How much does X cost?" headings to get cited more, but research from multiple sources shows straightforward headings actually get more citations than question-format ones (4.3 avg citations vs 3.4).

FAQ sections don't always help. In one study, pages with dedicated FAQ sections showed slightly fewer citations than pages without them — but this likely reflects that FAQs tend to appear on simpler support pages with less depth overall. The format works; it's the content quality that matters.

Different AI engines prefer different content types. ChatGPT disproportionately cites product and service pages directly. Perplexity leans toward listicles and comparison articles. Google AI Overviews pull from whatever it has already indexed highly. There's no single format that wins everywhere.

Brand mentions correlate more strongly with AI visibility than backlinks. Ahrefs data shows brand mention correlation at r = 0.664, higher than traditional link signals. Being talked about on Reddit, forums, and review sites seems to matter more for AI citation than having a strong backlink profile.


Measuring whether any of this works

The tricky part: an estimated 25-35% of AI-influenced traffic is misattributed in standard analytics setups.

What I do:

  1. Manual prompting — every week, I run 20-30 relevant queries through ChatGPT, Perplexity, and Google AI. I note whether my pages are cited, who else is cited, and what format the cited content uses. Low-tech, high-signal.

  2. GA4 referral sources — filter for chatgpt.com, perplexity.ai, gemini.google.com. The numbers will be small but growing fast.

  3. Server logs — grep for GPTBot, ClaudeBot, PerplexityBot user agents. Track which pages they're hitting and how often.

  4. Correlation tracking — watch for direct traffic spikes that line up with when your site starts appearing in AI answers. This catches the unattributed portion.

The manual prompting step sounds tedious, but it's by far the most useful. You'll learn more in 30 minutes of querying AI engines than from any dashboard.
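For step 3, a small script gets you further than a raw grep because it groups hits by bot and page. A sketch assuming a common-log-style format where the user agent is the last quoted field — adjust the regex to your server's actual log format:

```python
import re
from collections import Counter

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

# Matches the request path and the final quoted field (the user agent)
# in a common-log-format-style line. Your log format may differ.
LOG_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+)[^"]*".*"(?P<ua>[^"]*)"\s*$')

def ai_crawler_hits(log_lines: list[str]) -> Counter:
    """Count (bot, path) hits for known AI crawler user agents."""
    hits = Counter()
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        for bot in AI_AGENTS:
            if bot in match.group("ua"):
                hits[(bot, match.group("path"))] += 1
    return hits

# Made-up sample lines for illustration.
sample = [
    '1.2.3.4 - - [01/Mar/2026] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 GPTBot/1.0"',
    '5.6.7.8 - - [01/Mar/2026] "GET /blog HTTP/1.1" 200 1024 "-" "PerplexityBot/1.0"',
]
print(ai_crawler_hits(sample))
```

Run weekly and diff the counters — which pages AI crawlers revisit is a decent proxy for which pages they consider citable.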


The opportunity most people are missing

One data point that's been stuck in my head: according to an upGrowth report, technology and SaaS companies already see 18-25% of their traffic from AI referrals, but local service businesses sit at just 3-7%.

That gap is enormous. And it exists mostly because local businesses haven't structured their content for AI readability yet. The first mover advantage in local AI search is wide open.


What I'm still figuring out

I don't want to pretend this is all solved. Here's what's still genuinely hard:

  • Volatility is real. AI Overview content changes roughly 70% of the time for the same query. Only about 30% of brands remain visible in back-to-back AI responses. Consistency is hard to achieve.
  • Attribution is messy. Even with careful tracking, connecting AI citations to actual conversions requires a lot of inference.
  • Platform fragmentation. Optimizing for ChatGPT, Perplexity, and Google AI simultaneously sometimes requires conflicting approaches. There's no universal playbook yet.
  • The goalposts move. AI engines update their models and citation patterns regularly. What works today might not work in 3 months.

Wrapping up

We've been testing all of this while building PageX, and the biggest lesson so far is that AI systems reward clarity and structure far more than traditional SEO signals. Clean HTML, direct answers at the top of the page, fresh content, and proper schema — it's not glamorous, but it works.

The other lesson: this stuff compounds. Sites that start optimizing for AI readability now are building citation history that'll be hard for competitors to catch up on later. AI engines learn which sources consistently give reliable, extractable answers, and they keep going back.

Curious whether others here are seeing the same patterns in their logs, referrals, or citation data. If you've run the robots.txt check and found something surprising, or if you've noticed AI referral traffic showing up in your analytics, I'd be interested to hear what you're seeing.
