Daniel Igel

Posted on Jun 24

How to audit your site for AI search readiness (step-by-step with CiteReady)

#ai #tooling #tutorial #webdev

Your site ranks well — but AI search engines like ChatGPT, Perplexity, and Google's AI Overviews still won't cite it. Classic SEO tools measure rankings; they don't tell you whether GPTBot can crawl your page or whether your content is structured enough for a model to lift a clean passage.

Here's how to audit and fix that, category by category.

Step 1: Run your first audit (free, no signup)

Paste your URL into https://citeready.sprytools.com or call the API directly:

curl -X POST https://citeready-api.sprytools.com/v1/audit \
  -H "content-type: application/json" \
  -d '{"url":"https://yoursite.com"}'
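If you'd rather call the API from a script, here is a minimal Python sketch using only the standard library. The endpoint and payload are the ones from the curl call above; the function names and the 30-second timeout are my own assumptions, not part of the CiteReady docs.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
AUDIT_ENDPOINT = "https://citeready-api.sprytools.com/v1/audit"


def build_audit_request(page_url: str) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    body = json.dumps({"url": page_url}).encode("utf-8")
    return urllib.request.Request(
        AUDIT_ENDPOINT,
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def run_audit(page_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body. The timeout is an assumed value."""
    request = build_audit_request(page_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Call run_audit("https://yoursite.com") to get the response as a dict. Keeping the request builder separate makes it easy to inspect what will be sent before spending one of your daily audits.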

You get a score from 0 to 100, a letter grade (A–F), and a per-category breakdown with a pass/warn/fail status for each category. The response looks like this:

{
  "score": 54,
  "grade": "D",
  "categories": {
    "ai_crawlers": { "status": "fail", "score": 0, "recommendation": "Unblock PerplexityBot, GPTBot in robots.txt" },
    "llms_txt": { "status": "fail", "score": 0, "recommendation": "Add llms.txt to improve LLM discoverability" },
    "structured_data": { "status": "warn", "score": 10 },
    "citability": { "status": "pass", "score": 24 },
    "technical": { "status": "pass", "score": 20 }
  }
}
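To triage a response like this in a script, you can pull out the categories that didn't pass and sort the failures to the top. This is a small sketch that assumes the response shape shown above; the helper name is made up for illustration.

```python
# Sample response copied from the example above (recommendations omitted).
sample = {
    "score": 54,
    "grade": "D",
    "categories": {
        "ai_crawlers": {"status": "fail", "score": 0},
        "llms_txt": {"status": "fail", "score": 0},
        "structured_data": {"status": "warn", "score": 10},
        "citability": {"status": "pass", "score": 24},
        "technical": {"status": "pass", "score": 20},
    },
}


def categories_to_fix(audit: dict) -> list[tuple[str, str, int]]:
    """Return (name, status, score) for each non-passing category, fails first."""
    priority = {"fail": 0, "warn": 1}
    rows = [
        (name, category["status"], category["score"])
        for name, category in audit["categories"].items()
        if category["status"] != "pass"
    ]
    # Sort by status first, then by lowest score within the same status.
    return sorted(rows, key=lambda row: (priority.get(row[1], 2), row[2]))


for name, status, score in categories_to_fix(sample):
    print(f"{status:4} {name} ({score})")
```

For the sample above this lists ai_crawlers and llms_txt as fails, followed by structured_data as a warn.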

That score of 54 is surprisingly common. Let's walk through each category.

Step 2: Understand the five categories

AI Crawler access

AI systems need to crawl your pages before they can cite them. The problem is that many sites block them accidentally. A common robots.txt pattern:

User-agent: *
Disallow: /

This blocks every crawler, including GPTBot (OpenAI's training crawler), OAI-SearchBot (OpenAI's separate crawler for ChatGPT search), PerplexityBot, ClaudeBot, and Google-Extended (Google's control token for Gemini, not for AI Overviews, which rely on regular Googlebot). Blocks that name only these agents produce no warnings in Google Search Console, because Search Console reports on Googlebot.

Even a partial block causes problems. Disallow: /blog/ hides your best content from AI citation. Disallow: /api/ is usually fine, but rules match by path prefix, so Disallow: /api without the trailing slash also blocks content pages such as /api-guides/.
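You can check a robots.txt locally before running an audit. The sketch below uses Python's built-in urllib.robotparser, which is one parser among many, so treat the result as a strong hint rather than a guarantee of how each crawler behaves. The helper name and the example paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot", "Google-Extended"]


def blocked_agents(robots_txt: str, page_url: str) -> list[str]:
    """Return the AI user agents that this robots.txt text forbids from fetching page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, page_url)]


# A blanket block, and a partial block written without a trailing slash.
blanket = "User-agent: *\nDisallow: /\n"
partial = "User-agent: *\nDisallow: /api\n"

print(blocked_agents(blanket, "https://yoursite.com/blog/post"))
print(blocked_agents(partial, "https://yoursite.com/api-guides/intro"))
print(blocked_agents(partial, "https://yoursite.com/blog/post"))
```

The first two calls report all five agents as blocked: the blanket rule catches everything, and Disallow: /api is a prefix match that also catches /api-guides/. The third call returns an empty list.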

llms.txt

llms.txt is a plain Markdown file at yourdomain.com/llms.txt. It tells language models what your site contains and where your important content is — similar in spirit to robots.txt, but written for models to read, not crawlers to obey.

It's optional, but it's a cheap signal. A three-line file can move you from fail to pass on this category.

JSON-LD structured data

Schema.org markup gives AI systems a machine-readable description of your content. Without it, a model has to infer your page's topic, author, and date from prose. With it, that context is unambiguous.

The most relevant types for AI citation: Article, WebPage, FAQPage, HowTo, WebApplication. A noindex page with perfect JSON-LD still scores zero — structured data amplifies citability, it doesn't create it.
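To see which schema.org types a page already declares, you can extract its JSON-LD blocks with the standard library. This is a rough sketch, not how CiteReady inspects pages; the class and function names are hypothetical, and it only reads top-level objects.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the parsed contents of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped, not repaired


def jsonld_types(html: str) -> list[str]:
    """Return the @type of each top-level JSON-LD object found in the HTML."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    return [block.get("@type", "?") for block in collector.blocks if isinstance(block, dict)]
```

An empty list means the page declares no parseable JSON-LD, which is worth fixing before tuning anything else in this category.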

Content citability

This is the hardest category to improve quickly because it depends on how you write. AI engines prefer self-contained passages: sentences that still make sense when quoted without the surrounding paragraphs.

Low-citability: "As mentioned earlier, the tool integrates with all major platforms seamlessly."

High-citability: "CiteReady checks five signals: AI crawler access, llms.txt, JSON-LD, passage citability, and technical hygiene. Each category returns a pass/warn/fail with a specific fix."

The second version can be lifted from the article and quoted accurately. The first can't be cited without context and doesn't contain a quotable claim.

Technical signals

A few checks that either block AI content extraction entirely or reduce confidence:

HTTPS — most AI crawlers skip HTTP-only pages
nosnippet meta tag — this explicitly prevents excerpt extraction; if it's set, your citability score is zero regardless of content quality
Canonical tags — the canonical should point to the URL you want cited; a chain of redirects adds ambiguity
Sitemap — a sitemap at /sitemap.xml helps crawlers discover content; the robots.txt Sitemap: directive points them to it
lang attribute — language identification helps models pick the right passages for a given query
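Several of the checks above can be approximated from a page's HTML alone. The sketch below covers HTTPS, the lang attribute, the canonical link, and nosnippet; it is an assumption-laden stand-in with hypothetical names, not CiteReady's implementation, and it leaves out the sitemap check.

```python
from html.parser import HTMLParser


class TechnicalSignals(HTMLParser):
    """Record the lang attribute, canonical URL, and nosnippet directive of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.lang = None
        self.canonical = None
        self.nosnippet = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "html":
            self.lang = attr.get("lang")
        elif tag == "link" and "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = attr.get("href")
        elif tag == "meta" and (attr.get("name") or "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            directives = [d.strip().lower() for d in (attr.get("content") or "").split(",")]
            self.nosnippet = self.nosnippet or "nosnippet" in directives


def check_page(html: str, page_url: str) -> dict:
    """Summarize the technical signals for one page."""
    signals = TechnicalSignals()
    signals.feed(html)
    signals.close()
    return {
        "https": page_url.startswith("https://"),
        "lang": signals.lang,
        "canonical": signals.canonical,
        "nosnippet": signals.nosnippet,
    }
```

A result with nosnippet set to True, or with a canonical that differs from the URL you want cited, is the first thing to investigate.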

Step 3: Close the gaps

Fix AI crawler blocks first

If the audit returns fail on AI crawlers, this is your highest-leverage fix. Add explicit allow rules before any blocking rules:

# Allow AI crawlers
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

# Your existing rules below
User-agent: *
Disallow: /private/
Disallow: /admin/

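If you have a wildcard Disallow: /, you don't need to remove it to let these crawlers in. Under RFC 9309, a crawler obeys only the most specific User-agent group that matches it, so the named groups above override the wildcard wherever they sit in the file. Listing them first is just easier to read. Older or nonstandard parsers may behave differently, so re-run the audit after the change.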
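You can sanity-check how a named group interacts with a wildcard block using Python's urllib.robotparser. This sketch deliberately puts the wildcard first; in this parser, which follows the group-selection rule from RFC 9309, the named group still applies to its crawler. SomeOtherBot is a made-up agent name standing in for any crawler without its own group.

```python
from urllib.robotparser import RobotFileParser

# Wildcard block listed first, named group listed after it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

page = "https://yoursite.com/blog/post"
gptbot_allowed = parser.can_fetch("GPTBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)  # hypothetical agent name

print(f"GPTBot allowed: {gptbot_allowed}")
print(f"SomeOtherBot allowed: {other_allowed}")
```

GPTBot is allowed because it matches its own group, while the agent without a group falls back to the wildcard and is blocked.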

Remove nosnippet

If you have <meta name="robots" content="nosnippet"> anywhere, remove it unless you have a specific reason for it (legal content you don't want excerpted, for example). For any engine that honors it, this one tag blocks passage extraction for the whole page.

Add JSON-LD

A minimal Article schema takes 10 minutes and typically improves your score by 10–15 points:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your page title",
  "description": "One-sentence summary",
  "author": {
    "@type": "Person",
    "name": "Your Name"
  },
  "datePublished": "2026-06-23",
  "url": "https://yoursite.com/this-page"
}
</script>

For tool or SaaS pages, use WebApplication or SoftwareApplication instead of Article.
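If your pages come from a template or a build script, you can generate the tag instead of hand-writing it. Here is a minimal sketch that mirrors the Article example above; the function name and parameters are assumptions, so adapt them to your own metadata.

```python
import json


def article_jsonld(headline: str, description: str, author: str,
                   date_published: str, url: str) -> str:
    """Render a minimal schema.org Article as a JSON-LD <script> tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "description": description,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "url": url,
    }
    # Escape "</" so a value containing "</script>" cannot close the tag early.
    # "\/" is a valid JSON escape for "/", so the data round-trips unchanged.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


tag = article_jsonld(
    "Your page title",
    "One-sentence summary",
    "Your Name",
    "2026-06-23",
    "https://yoursite.com/this-page",
)
print(tag)
```

Using json.dumps rather than string concatenation keeps quotes and special characters in titles from producing invalid JSON.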

Create llms.txt

Drop a file at yourdomain.com/llms.txt:

# Site Name

> One-sentence description of what this site is and who it's for.

## Key pages

- [Home](https://example.com/)
- [Docs](https://example.com/docs/)
- [About](https://example.com/about/)

That's it. The description line (the > blockquote) is the most important part — write it as a factual summary, not a marketing tagline.
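If your list of key pages lives in code or a CMS, a few lines of Python can render the file in the layout shown above. The function name and signature are hypothetical, and the example values are the placeholders from the sample.

```python
def render_llms_txt(site_name: str, summary: str, pages: list[tuple[str, str]]) -> str:
    """Render an llms.txt body: title, blockquote summary, then a list of key pages."""
    lines = [f"# {site_name}", "", f"> {summary}", "", "## Key pages", ""]
    lines += [f"- [{title}]({url})" for title, url in pages]
    return "\n".join(lines) + "\n"


llms_txt = render_llms_txt(
    "Site Name",
    "One-sentence description of what this site is and who it's for.",
    [
        ("Home", "https://example.com/"),
        ("Docs", "https://example.com/docs/"),
        ("About", "https://example.com/about/"),
    ],
)
print(llms_txt)
```

Write the result to llms.txt in your site root as part of the build, so the file stays in sync when pages are added or renamed.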

Improve prose citability (longer-term)

The structural fixes above move most sites from the 40–60 range to 65–80. Getting above 80 requires improving the prose itself:

Open each section with the conclusion, not the setup
Keep paragraphs to 3–4 sentences with a single claim each
Replace vague language with specific numbers, names, and outcomes
Avoid forward/backward references ("as mentioned above", "we'll cover this later")
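Two of these rules are mechanical enough to check with a script. The sketch below is a rough heuristic of my own, not CiteReady's citability scoring: it flags cross-reference phrases and paragraphs longer than four sentences. The phrase list and the sentence splitter are assumptions and will miss plenty of cases.

```python
import re

# Assumed phrase list; extend it with whatever your own writing tends to use.
REFERENCE_PHRASES = ["as mentioned", "as noted above", "see above", "we'll cover this later"]
MAX_SENTENCES = 4


def lint_paragraph(paragraph: str) -> list[str]:
    """Return human-readable warnings for one paragraph (rough heuristics only)."""
    warnings = []
    lowered = paragraph.lower()
    for phrase in REFERENCE_PHRASES:
        if phrase in lowered:
            warnings.append(f'cross-reference: "{phrase}"')
    # Naive split on ., ! or ? followed by whitespace; abbreviations will miscount.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if len(sentences) > MAX_SENTENCES:
        warnings.append(f"{len(sentences)} sentences (limit {MAX_SENTENCES})")
    return warnings


low = "As mentioned earlier, the tool integrates with all major platforms seamlessly."
high = (
    "CiteReady checks five signals: AI crawler access, llms.txt, JSON-LD, passage "
    "citability, and technical hygiene. Each category returns a pass/warn/fail with "
    "a specific fix."
)
print(lint_paragraph(low))
print(lint_paragraph(high))
```

The low-citability example from earlier gets one cross-reference warning and the high-citability one gets none. Vague language and buried conclusions still need a human read.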

Run the audit again after each change to measure the effect. The per-category breakdown tells you which fix had the most impact.
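If you save each response, comparing two runs takes a few lines. This sketch assumes the response shape from Step 1; the "after" numbers below are invented purely to exercise the function and are not real CiteReady output.

```python
def category_deltas(before: dict, after: dict) -> list[tuple[str, int]]:
    """Return (category, score change) pairs, largest gain first."""
    deltas = [
        (name, after["categories"][name]["score"] - category["score"])
        for name, category in before["categories"].items()
        if name in after["categories"]
    ]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# "before" uses three categories from the sample response in Step 1.
before = {"categories": {
    "ai_crawlers": {"score": 0},
    "llms_txt": {"score": 0},
    "structured_data": {"score": 10},
}}
# Made-up values for illustration only.
after = {"categories": {
    "ai_crawlers": {"score": 20},
    "llms_txt": {"score": 5},
    "structured_data": {"score": 10},
}}

for name, change in category_deltas(before, after):
    print(f"{name}: {change:+d}")
```

Changing one thing per run keeps the deltas attributable to a single fix.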

What to expect after fixing

A typical site that fixes crawler blocks + removes nosnippet + adds basic JSON-LD goes from ~50 to ~75 in one afternoon. The citability score moves more slowly because it depends on content rewriting, but it's the signal with the highest ceiling.

Free tier: 3 audits/day, no signup → https://citeready.sprytools.com

What's the most surprising thing the audit flags on your site? I'm curious whether structured data gaps or crawler blocks are more common in the dev.to audience.
