Google sends you a link and lets you click. ChatGPT, Perplexity, and Gemini read your site for you and decide whether to mention it at all. That distinction changes everything about technical SEO.
I spent the last few months reverse-engineering how AI search engines discover, evaluate, and cite websites. The result is a list of 11 concrete technical signals you can audit today. No fluff — just the checks that matter and the fixes that work.
How AI Search Discovery Actually Works
Traditional search engines crawl your pages, index keywords, and rank links. AI search engines do something fundamentally different:
- They crawl with their own bots (GPTBot, PerplexityBot, ClaudeBot), separate from Googlebot; Google-Extended is a robots.txt control token rather than a distinct crawler, governing whether Google-crawled content can be used for Gemini
- They compress your content into embeddings stored in vector databases
- They retrieve relevant chunks at query time via RAG (Retrieval-Augmented Generation)
- They synthesize answers and may or may not cite the source
If your site blocks these bots, lacks structure, or is too slow to crawl efficiently, you're invisible to AI search — even if you rank #1 on Google.
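To make the retrieval step concrete, here's a toy sketch of RAG-style chunk retrieval. It uses bag-of-words counts and cosine similarity in place of real neural embeddings, so it illustrates the mechanics only, not what any production engine actually runs:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' — real engines use dense neural vectors."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query — the 'R' in RAG."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Our product exports reports as PDF and CSV.",
    "The company was founded in 2019 in Berlin.",
]
print(retrieve("how do I export a csv report", chunks))
```

The takeaway: your pages are retrieved chunk by chunk, so each chunk has to make sense on its own.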
Here are the 11 signals that determine your visibility.
Signal 1: robots.txt AI Crawler Access
What it is: Your robots.txt file controls which bots can crawl your site. Many default configurations inadvertently block AI crawlers.
Why it matters: If GPTBot or PerplexityBot is blocked, your content will never enter their index. Period.
The fix:
# robots.txt — Allow AI crawlers explicitly
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Applebot-Extended
Allow: /
Check yours right now: visit yoursite.com/robots.txt and look for Disallow rules targeting these user agents.
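You can also script the check with Python's standard urllib.robotparser. This sketch (the function name and sample rules are mine, not a standard tool) reports which AI crawlers may fetch a URL:

```python
from urllib import robotparser

# Sample robots.txt that allows GPTBot but blocks PerplexityBot
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Disallow: /private/
"""

AI_BOTS = ["GPTBot", "ChatGPT-User", "PerplexityBot", "Google-Extended", "ClaudeBot"]

def check_ai_access(robots_txt: str, url: str = "https://yoursite.com/") -> dict:
    """Return {bot: True/False} for whether each AI crawler may fetch the URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_BOTS}

print(check_ai_access(ROBOTS_TXT))
```

Bots without their own group (here, ChatGPT-User and ClaudeBot) fall through to the `User-agent: *` rules, which is exactly how a default-deny wildcard can silently block AI crawlers.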
Signal 2: llms.txt
What it is: A proposed standard (similar to robots.txt) that provides AI systems with a structured summary of your site's content, purpose, and key pages.
Why it matters: It gives LLMs a machine-readable map of what your site offers, improving content retrieval accuracy.
The fix: Create /llms.txt at your root domain:
# YourSite
> Brief description of what your site does.
## Key Pages
- [Product Overview](https://yoursite.com/product): Main product page
- [Documentation](https://yoursite.com/docs): Technical docs
- [Pricing](https://yoursite.com/pricing): Plans and pricing
## Topics Covered
- Topic A
- Topic B
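If you already keep page metadata in code, you can generate the file instead of hand-writing it. A minimal sketch (the helper name build_llms_txt is mine):

```python
def build_llms_txt(name: str, description: str, pages: list, topics: list) -> str:
    """Render an llms.txt document from site metadata.

    pages: list of (title, url, summary) tuples.
    """
    lines = [f"# {name}", "", f"> {description}", "", "## Key Pages"]
    lines += [f"- [{title}]({url}): {summary}" for title, url, summary in pages]
    lines += ["", "## Topics Covered"]
    lines += [f"- {topic}" for topic in topics]
    return "\n".join(lines) + "\n"

print(build_llms_txt(
    "YourSite",
    "Brief description of what your site does.",
    [("Documentation", "https://yoursite.com/docs", "Technical docs")],
    ["Topic A", "Topic B"],
))
```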
Signal 3: Structured Data / Schema Markup
What it is: JSON-LD markup that explicitly tells machines what your content represents — articles, products, FAQs, organizations, etc.
Why it matters: AI engines use schema markup to extract facts with high confidence. Pages with structured data are more likely to be cited because the AI can verify the information type.
The fix: Add JSON-LD to your pages:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your Article Title",
  "author": {
    "@type": "Person",
    "name": "Your Name"
  },
  "datePublished": "2026-03-19",
  "description": "A concise description of the article content."
}
</script>
At minimum, implement Organization, WebSite, Article, and FAQPage schemas.
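If your pages are generated, serializing the JSON-LD is safer than hand-writing it inside templates, since json.dumps handles quoting and escaping for you. A minimal sketch for the Article case (the helper name is mine):

```python
import json

def article_jsonld(headline: str, author: str, date_published: str,
                   description: str) -> str:
    """Serialize Article schema as a JSON-LD <script> tag."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,
        "description": description,
    }
    return f'<script type="application/ld+json">{json.dumps(data)}</script>'

print(article_jsonld("Your Article Title", "Your Name", "2026-03-19",
                     "A concise description of the article content."))
```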
Signal 4: Content Structure (Headers and Hierarchy)
What it is: Clean H1 → H2 → H3 hierarchy with descriptive headings that match common query patterns.
Why it matters: RAG systems chunk content by sections. Well-structured headers produce better chunks, which means better retrieval and more accurate citations.
The fix:
- One H1 per page (your main topic)
- H2s for major subtopics, phrased as questions when possible
- H3s for supporting details
- Keep sections self-contained — each section should make sense in isolation
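You can audit the hierarchy mechanically with Python's built-in html.parser. This sketch (class name mine) flags skipped levels and duplicate H1s:

```python
from html.parser import HTMLParser

class HeadingAudit(HTMLParser):
    """Collect heading levels and flag jumps (e.g. H1 -> H3) and extra H1s."""
    def __init__(self):
        super().__init__()
        self.levels = []
        self.issues = []

    def handle_starttag(self, tag, attrs):
        if tag in {"h1", "h2", "h3", "h4", "h5", "h6"}:
            level = int(tag[1])
            # A heading may only go one level deeper than its predecessor
            if self.levels and level > self.levels[-1] + 1:
                self.issues.append(f"jump from h{self.levels[-1]} to h{level}")
            if level == 1 and 1 in self.levels:
                self.issues.append("multiple h1 tags")
            self.levels.append(level)

audit = HeadingAudit()
audit.feed("<h1>Topic</h1><h3>Detail</h3>")
print(audit.issues)
```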
Signal 5: FAQ Markup
What it is: FAQPage schema that pairs questions with answers in a machine-readable format.
Why it matters: AI search engines love Q&A pairs. They map directly to user queries and are easy to extract and cite.
The fix:
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is Generative Engine Optimization?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "GEO is the practice of optimizing websites to be discovered and cited by AI-powered search engines like ChatGPT, Perplexity, and Gemini."
      }
    }
  ]
}
</script>
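When you have many Q&A pairs, generating the markup is less error-prone than hand-editing nested JSON. A minimal sketch (the helper name is mine):

```python
import json

def faq_jsonld(pairs: list) -> str:
    """Build FAQPage JSON-LD from a list of (question, answer) pairs."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }, indent=2)

print(faq_jsonld([("What is GEO?", "Generative Engine Optimization.")]))
```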
Signal 6: Citation Readiness
What it is: Whether your content contains clear, quotable statements with supporting evidence — statistics, named sources, specific claims.
Why it matters: AI engines need confidence to cite. Content that reads like a Wikipedia article (neutral, factual, well-sourced) gets cited more than vague marketing copy.
The fix:
- Include specific numbers and statistics
- Attribute claims to named sources
- Write definitive statements rather than hedging language
- Add "According to..." attributions where possible
Signal 7: Page Speed
What it is: How fast your page loads and responds, measured by Core Web Vitals (LCP, INP, CLS — INP replaced FID as the responsiveness metric in 2024).
Why it matters: AI crawlers have time budgets. Slow pages get partially crawled or skipped entirely. GPTBot's crawl timeout isn't publicly documented, but in my testing it appears to favor fast-loading pages.
The fix:
- Target LCP under 2.5 seconds
- Minimize JavaScript blocking
- Use CDN for static assets
- Compress images (WebP/AVIF)
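Google publishes "good" thresholds for each Core Web Vital, so a pass/fail check fits in a few lines (the function name is mine; values are the documented 75th-percentile thresholds):

```python
# Google's published "good" thresholds, measured at the 75th percentile
THRESHOLDS = {"lcp_s": 2.5, "inp_ms": 200, "cls": 0.1}

def passes_cwv(lcp_s: float, inp_ms: float, cls: float) -> dict:
    """Return pass/fail per Core Web Vital for one page's field data."""
    return {
        "LCP": lcp_s <= THRESHOLDS["lcp_s"],
        "INP": inp_ms <= THRESHOLDS["inp_ms"],
        "CLS": cls <= THRESHOLDS["cls"],
    }

print(passes_cwv(2.1, 180, 0.05))
```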
Signal 8: Meta Descriptions
What it is: The <meta name="description"> tag that summarizes your page content.
Why it matters: AI engines use meta descriptions as a quick signal for page relevance during retrieval. A clear, keyword-rich description improves the chance of being selected from the index.
The fix:
- Write unique descriptions for every page
- Include your primary topic and key terms
- Keep it under 160 characters
- Make it a factual summary, not a sales pitch
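Checking presence and length is easy to automate with the standard library. A sketch (class and function names are mine):

```python
from html.parser import HTMLParser

class MetaDescriptionFinder(HTMLParser):
    """Pull the content of <meta name="description"> out of an HTML page."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")

def audit_description(html: str) -> str:
    """Classify a page's meta description as missing, too long, or ok."""
    finder = MetaDescriptionFinder()
    finder.feed(html)
    if finder.description is None:
        return "missing"
    if len(finder.description) > 160:
        return "too long"
    return "ok"

print(audit_description('<meta name="description" content="Plans and pricing.">'))
```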
Signal 9: Canonical URLs
What it is: The <link rel="canonical"> tag that tells search engines which version of a page is the "official" one.
Why it matters: Duplicate content confuses AI indexing. If the same content exists at multiple URLs without canonical tags, AI engines may index the wrong version or skip it entirely.
The fix:
<link rel="canonical" href="https://yoursite.com/your-page" />
Every page should have a self-referencing canonical URL. Always use absolute URLs with HTTPS.
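The same html.parser approach can verify the canonical tag is present, absolute, HTTPS, and self-referencing. A sketch (names are mine):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class CanonicalFinder(HTMLParser):
    """Pull the href of <link rel="canonical"> out of an HTML page."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            self.href = attrs.get("href")

def audit_canonical(html: str, page_url: str) -> str:
    """Check the canonical URL is an absolute HTTPS self-reference."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.href:
        return "missing"
    parsed = urlparse(finder.href)
    if parsed.scheme != "https" or not parsed.netloc:
        return "not absolute https"
    return "ok" if finder.href == page_url else "points elsewhere"

print(audit_canonical('<link rel="canonical" href="https://yoursite.com/your-page" />',
                      "https://yoursite.com/your-page"))
```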
Signal 10: Sitemap Accessibility
What it is: A properly formatted sitemap.xml that lists all your important pages.
Why it matters: AI crawlers use sitemaps to discover content efficiently. A missing or malformed sitemap means the crawler has to follow links to find pages — and it might miss important ones.
The fix:
- Ensure /sitemap.xml returns a valid XML sitemap
- Reference it in your robots.txt: Sitemap: https://yoursite.com/sitemap.xml
- Include <lastmod> dates so crawlers prioritize fresh content
- Remove 404/redirected URLs from the sitemap
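Validity and lastmod coverage can be checked with the standard library's XML parser. A sketch against an inline sample (the function name is mine):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yoursite.com/</loc><lastmod>2026-03-01</lastmod></url>
  <url><loc>https://yoursite.com/pricing</loc></url>
</urlset>"""

def parse_sitemap(xml_text: str) -> list:
    """Return (loc, lastmod) pairs; lastmod is None when the tag is absent."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries

print(parse_sitemap(SITEMAP))
```

Entries with a None lastmod are the ones crawlers can't prioritize by freshness.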
Signal 11: HTTPS
What it is: Whether your site uses TLS encryption (HTTPS vs HTTP).
Why it matters: AI engines deprioritize or skip insecure sites. HTTPS is a baseline trust signal — without it, your content is less likely to be cited.
The fix: Get a free TLS certificate from Let's Encrypt and redirect all HTTP traffic to HTTPS. In 2026, there's no excuse for running HTTP.
How to Check All 11 Signals at Once
You can manually audit each of these, but it's tedious. I built GEOScore AI (free, no signup required) to check all 11 signals automatically. Enter your URL, and it returns a score with specific issues flagged and fix instructions.
Full disclosure: I built this tool. But I built it because I needed it myself — I was manually checking these signals across dozens of sites and wanted to automate the process. It's completely free to use.
The tool also includes:
- AI Robots.txt Generator — generates an AI-crawler-friendly robots.txt
- AI Crawler Access Checker — tests which AI bots can actually reach your site
The Priority Order
If you're starting from scratch, here's the order I'd tackle these:
1. robots.txt — unblock AI crawlers (5 minutes)
2. HTTPS — non-negotiable baseline (varies)
3. Sitemap — make sure it exists and is valid (10 minutes)
4. Content structure — clean up your heading hierarchy (ongoing)
5. Meta descriptions — write unique ones for key pages (30 minutes)
6. Structured data — add Organization + Article schemas (1 hour)
7. FAQ markup — add to your most important pages (1 hour)
8. llms.txt — create and deploy (15 minutes)
9. Canonical URLs — audit and fix (30 minutes)
10. Page speed — optimize Core Web Vitals (ongoing)
11. Citation readiness — rewrite content to be more quotable (ongoing)
Conclusion
AI search is not replacing Google — it's creating a parallel discovery channel. The sites that show up in both traditional and AI search results will have a significant traffic advantage.
The good news: most of these fixes are one-time technical changes. The bad news: your competitors are already implementing them. Run a scan on your site today and see where you stand.
Have questions about GEO or AI search optimization? Drop them in the comments — happy to dig into specific implementation details.