Make your site citable by AI: a technical GEO checklist (with code)

#ai #llm #tutorial #webdev

For 25 years we optimized for one thing: ranking in Google's ten blue links. That game is changing fast. A growing share of search now happens inside an AI answer — ChatGPT, Perplexity, Claude, Google's AI Overviews — where the user never sees a results page. They see a synthesized answer, sometimes with a handful of cited sources.

If your site isn't in that handful, you're invisible. Optimizing for it has a name now: Generative Engine Optimization (GEO).

The good news for developers: a lot of GEO is just good engineering. Below is a practical checklist with the code that matters. No marketing fluff.

1. Let the AI crawlers in (robots.txt)

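AI engines use their own crawlers, separate from Googlebot. If you block them, you can't be cited. A robots.txt with no matching rule allows crawling by default, so the usual failure is an accidental block: a security plugin or a copy-pasted robots.txt that denies unknown user agents. Explicit Allow groups make your intent visible and take precedence over a broader wildcard Disallow for the agents they name.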

Here are the ones worth knowing:

# OpenAI (ChatGPT)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic (Claude)
User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

# Google (control token for Gemini training and grounding; the crawling itself is done by Googlebot)
User-agent: Google-Extended
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# Common Crawl (training data for many models)
User-agent: CCBot
Allow: /

Two things to decide consciously:

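GPTBot, Google-Extended and Applebot-Extended mainly govern whether your content is used to train models. Google-Extended and Applebot-Extended are robots.txt control tokens rather than separate crawlers, so they won't appear as user agents in your logs. Some publishers block these for IP reasons. That's a business decision, not a default.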
OAI-SearchBot, PerplexityBot, ChatGPT-User are about live retrieval and citation. If you want to be cited in answers, you almost certainly want these allowed.

Don't guess. Check what you're actually serving:

curl -s https://yourdomain.com/robots.txt | grep -iE "gptbot|perplexity|claude|google-extended"
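The grep above tells you which agents are mentioned, not whether a given URL is actually allowed for them. A minimal sketch of that second check, using Python's standard-library robots.txt parser. The agent list, the `check_ai_access` helper and the sample file are illustrative assumptions, and the stdlib parser applies rules in file order, which can differ from how a particular crawler resolves conflicting Allow and Disallow lines:

```python
from urllib.robotparser import RobotFileParser

# Assumption: the retrieval-oriented agents named in this article; edit to taste.
AI_AGENTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]


def check_ai_access(robots_txt, url, agents=AI_AGENTS):
    """Return {agent: True/False} saying whether each agent may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical robots.txt: GPTBot is blocked outright, everyone else only from /private/.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    report = check_ai_access(SAMPLE, "https://yourdomain.com/blog/post")
    for agent, allowed in report.items():
        print(f"{agent:15} {'allowed' if allowed else 'BLOCKED'}")
```

To run it against a live site, fetch the robots.txt text yourself (for example with the curl command above) and pass it in as `robots_txt`.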

2. Give models a structured map (schema.org JSON-LD)

LLMs parse messy HTML, but structured data removes ambiguity about what a page is. For an article, the minimum useful block:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Make your site citable by AI",
  "author": { "@type": "Person", "name": "Your Name" },
  "datePublished": "2026-06-03",
  "dateModified": "2026-06-03",
  "publisher": {
    "@type": "Organization",
    "name": "Your Company"
  }
}
</script>

If your page answers questions, FAQPage schema is worth adding: it maps cleanly to the question-and-answer shape that LLMs return.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is Generative Engine Optimization?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "GEO is the practice of structuring content so AI engines can find, understand, and cite it in generated answers."
    }
  }]
}
</script>

Validate it before you ship — invalid JSON-LD is silently ignored:

npx structured-data-testing-tool --url https://yourdomain.com/page
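If you want a check you can drop into CI without a Node dependency, here is a rough sketch that pulls every JSON-LD block out of a page's HTML, confirms it parses, and flags a few missing fields. The `EXPECTED_FIELDS` list is my own minimal assumption, not schema.org's or any search engine's requirements, and the helper names are hypothetical:

```python
import json
from html.parser import HTMLParser

# Assumption: a deliberately small set of fields to insist on, not an official rule set.
EXPECTED_FIELDS = {
    "Article": ["headline", "author", "datePublished"],
    "FAQPage": ["mainEntity"],
}


class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def audit_jsonld(html):
    """Return a list of human-readable problems; an empty list means none were found."""
    extractor = JsonLdExtractor()
    extractor.feed(html)
    if not extractor.blocks:
        return ["no JSON-LD blocks found"]
    problems = []
    for index, raw in enumerate(extractor.blocks):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"block {index}: invalid JSON ({exc.msg})")
            continue
        # A block may hold a single object or a list of objects.
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            item_type = item.get("@type")
            if not isinstance(item_type, str):
                continue
            for field in EXPECTED_FIELDS.get(item_type, []):
                if not item.get(field):
                    problems.append(f"block {index}: {item_type} is missing '{field}'")
    return problems
```

Feed it the HTML you actually serve, for example the output of `curl -s https://yourdomain.com/page`, and fail the build if the returned list is non-empty.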

3. Try llms.txt (emerging, low cost)

llms.txt is a proposed convention: a single Markdown file at your root that gives models a clean, link-rich map of your most important content — without the nav, ads, and scripts they'd otherwise wade through.

# Your Product

> One-sentence description of what you do.

## Docs
- [Getting started](https://yourdomain.com/docs/start): how to set up
- [API reference](https://yourdomain.com/docs/api): endpoints and auth

## Key articles
- [What is GEO](https://yourdomain.com/learn/geo): definitions and examples

Save it as /llms.txt. It's not yet a hard ranking signal and not every engine consumes it, but it's cheap to add and aligned with where things are heading. Treat it as an investment, not a guarantee.
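Hand-maintained files drift, so it can help to generate llms.txt from whatever already knows your key pages (a sitemap, a CMS export, a config file). A small sketch that produces the layout shown above; the `build_llms_txt` name and the input shape are assumptions for illustration, not part of the llms.txt proposal:

```python
def build_llms_txt(title, summary, sections):
    """Render an llms.txt body.

    `sections` maps a section heading to a list of (link text, URL, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}"]
        for text, url, note in links:
            lines.append(f"- [{text}]({url}): {note}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_llms_txt(
        "Your Product",
        "One-sentence description of what you do.",
        {
            "Docs": [
                ("Getting started", "https://yourdomain.com/docs/start", "how to set up"),
                ("API reference", "https://yourdomain.com/docs/api", "endpoints and auth"),
            ],
            "Key articles": [
                ("What is GEO", "https://yourdomain.com/learn/geo", "definitions and examples"),
            ],
        },
    )
    print(body, end="")
```

Write the result to the file your web server serves at /llms.txt as part of your build or deploy step.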

4. Write for extraction, not just for humans

This is where most "SEO content" fails for GEO. Models favor content from which they can lift a clean, self-contained answer. Concretely:

Front-load the answer. State the conclusion in the first sentence under each heading, then explain. Don't bury it after three paragraphs of throat-clearing.
Use real headings as questions. "## How do AI crawlers find my site?" beats "## Discoverability".
Keep facts dense and specific. "PerplexityBot respects robots.txt" is citable. "Our solution leverages cutting-edge synergy" is not.
Use semantic HTML. <article>, <section>, <table> — not a soup of <div>s. Tables in particular get extracted well.
Cite your own sources. Pages that link to primary data tend to come across as more trustworthy, to readers and to models alike.

A quick self-check: copy a section of your page and ask an LLM "answer X using only this text." If it can't produce a clean answer, neither can ChatGPT when a user asks.
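Two of the points above, question-style headings and front-loaded answers, can be roughly pre-screened before you bother an LLM. A crude sketch for Markdown sources: it lists each heading, whether it ends in a question mark, and the first sentence under it so you can eyeball whether that sentence is the answer. The `audit_sections` helper is hypothetical, and it will miscount if a code block in your file contains lines that start with "##":

```python
import re

# Matches Markdown headings from level 2 to 6, e.g. "## How do AI crawlers find my site?"
HEADING = re.compile(r"^(#{2,6})\s+(.*\S)\s*$")


def audit_sections(markdown):
    """Return (heading, is_question, first_sentence) for each section."""
    sections = []  # each entry: [heading text, list of non-empty body lines]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append([match.group(2), []])
        elif sections and line.strip():
            sections[-1][1].append(line.strip())

    report = []
    for heading, body in sections:
        text = " ".join(body)
        # Split after the first sentence-ending punctuation followed by whitespace.
        first_sentence = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0] if text else ""
        report.append((heading, heading.endswith("?"), first_sentence))
    return report


if __name__ == "__main__":
    sample = (
        "## How do AI crawlers find my site?\n"
        "They follow links and sitemaps. Then they fetch each page.\n\n"
        "## Discoverability\n"
        "In today's landscape, many things matter.\n"
    )
    for heading, is_question, first in audit_sections(sample):
        flag = "ok " if is_question else "fix"
        print(f"[{flag}] {heading} -> {first}")
```

This only narrows down where to look. Whether the first sentence really answers the heading is still a judgment call, which is what the LLM self-check is for.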

5. Measure it

You can't improve what you don't measure. Track two things:

Crawl access — are the AI bots actually hitting your pages? Grep your server logs:

   grep -iE "gptbot|perplexitybot|claudebot|oai-searchbot" access.log | wc -l

Citations — are you showing up in answers? Run a fixed set of prompts your customers would ask across ChatGPT, Perplexity, Claude and Gemini on a schedule, and log whether your domain appears.
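Both measurements reduce to small pure functions you can run from cron. The sketch below counts access-log lines per crawler (a per-bot version of the grep above) and checks whether an answer's text links to your domain. It assumes your log lines include the user-agent string and that you have already captured each engine's answer as plain text with URLs in it. How you obtain those answers is up to you and not shown here, and the function names are illustrative:

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Same crawler tokens as the grep above.
BOT_PATTERN = re.compile(r"gptbot|perplexitybot|claudebot|oai-searchbot", re.IGNORECASE)

# Rough URL matcher: stops at whitespace, quotes and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")


def count_bot_hits(log_lines):
    """Count log lines per AI crawler token, case-insensitively."""
    hits = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[match.group(0).lower()] += 1
    return hits


def domain_cited(answer_text, domain):
    """True if any URL in the answer points at `domain` or one of its subdomains."""
    domain = domain.lower()
    for url in URL_PATTERN.findall(answer_text):
        # Drop punctuation that commonly trails a URL in prose.
        host = (urlparse(url.rstrip(".,;:")).hostname or "").lower()
        if host == domain or host.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    sample_log = [
        '1.2.3.4 - - [03/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
        '5.6.7.8 - - [03/Jun/2026:10:01:00 +0000] "GET /docs HTTP/1.1" 200 900 "-" "PerplexityBot/1.0"',
        '9.9.9.9 - - [03/Jun/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 Firefox/126.0"',
    ]
    print(dict(count_bot_hits(sample_log)))
    print(domain_cited("Sources: https://yourdomain.com/learn/geo.", "yourdomain.com"))
```

Log one row per prompt, engine and date with the `domain_cited` result, and the trend over time is your citation metric. Keep in mind that user-agent strings can be spoofed, so treat the log counts as an indication rather than proof.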

If you'd rather not build the scoring harness yourself, GEO-Score.online runs an automated audit of a page across 22 GEO metrics (crawler access, structured data, answer-completeness, factual density and more) and tells you exactly what's blocking citation. There's also a free robots.txt generator for AI crawlers if you just want step 1 sorted quickly.

TL;DR checklist

[ ] robots.txt explicitly allows the retrieval crawlers you care about
[ ] Article + FAQPage JSON-LD on key pages, validated
[ ] /llms.txt published
[ ] Answers front-loaded, headings phrased as questions, semantic HTML
[ ] Crawl access + citations measured on a schedule

SEO isn't dead — but it's no longer the whole game. The sites that get cited by AI in 2026 are the ones that made themselves easy to cite. Most of that is engineering you already know how to do.

What's worked for you? Have you seen AI crawlers in your logs yet? Curious what others are finding.