Gyubin Kim

Posted on Jun 6

Is Your Site Blocking the AI Crawlers? (How to Check & Fix It, 2026)

#seo #ai #webdev #devops

Series: Getting Cited by AI — Post #4. Post #1: How to Get Cited by ChatGPT → · Post #2: Why AI Recommends Your Competitor → · Post #3: Copy-Paste Schema Templates →

The last three posts assumed one thing that isn't always true: that AI assistants can actually read your site in the first place. You can have perfect schema, clean facts, and answer-shaped copy — and still be invisible, because the crawler that feeds the model never got past your front door.

This post is the part almost nobody checks: AI crawler access. It's the plumbing under everything else. Five minutes here can be the difference between "my schema isn't working" and "the bot was blocked the whole time."

One honest caveat up front: allowing AI crawlers doesn't guarantee you get cited, and blocking them isn't always wrong (some businesses deliberately opt out). There's no fixed timeline for pickup either. What this post does is make sure the choice is yours — not an accidental default from your site builder or an SEO plugin you installed two years ago.

How AI actually sees your site

There are two different ways a model can end up quoting you, and they have different gatekeepers:

Training / index crawl: bots like GPTBot (OpenAI), ClaudeBot (Anthropic), and Amazonbot crawl the open web ahead of time. Google-Extended works differently: it is a robots.txt token, not a separate crawler, and it controls whether pages Google already crawls can be used for Gemini. If your robots.txt blocks any of these, your pages may never enter the corpus those models draw on.
Live retrieval ("answer engines"): when someone asks ChatGPT, Perplexity, or Copilot a question right now, a separate fetcher (e.g. OAI-SearchBot, PerplexityBot, ChatGPT-User) hits live pages to ground the answer. This is where most local-business citations come from in 2026, and it has its own set of user-agents.

The trap: people block one and assume they've allowed the other, or an SEO/security plugin quietly blocks all of them. You have to handle both.
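
One way to check both groups at once is to run your robots.txt through Python's standard-library parser. This is a minimal sketch under some assumptions: the two bot lists are copied from this post and will drift as vendors rename crawlers, `yourdomain.com` is a placeholder, and `blocked_bots` is a hypothetical helper name. It only evaluates robots.txt rules, so it says nothing about firewall blocks, and real crawlers may resolve conflicting rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# Bot names are taken from this post; vendors may rename or add crawlers.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Amazonbot"]
RETRIEVAL_BOTS = ["OAI-SearchBot", "PerplexityBot", "ChatGPT-User"]


def blocked_bots(robots_txt: str, bots: list[str],
                 url: str = "https://yourdomain.com/") -> list[str]:
    """Return the bots from `bots` that robots_txt forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, url)]


# Example file: blocks one training bot and allows everyone else.
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print("training blocked:", blocked_bots(sample, TRAINING_BOTS))
print("retrieval blocked:", blocked_bots(sample, RETRIEVAL_BOTS))
```

For the sample file this reports GPTBot as the only blocked training bot and no blocked retrieval bots. To check your own site, paste the text of your real robots.txt in place of `sample`.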

The 5-minute access check

Do these in order. No developer required.

1. Read your robots.txt. Go to https://yourdomain.com/robots.txt in a browser. Look for any line like Disallow: / paired with an AI user-agent, or a blanket block. Common offenders:

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /

The second one blocks everything, AI crawlers included. If you see either and you want AI visibility, that's your culprit.

2. Check for a firewall / WAF block. Cloudflare, Wordfence, and similar tools have "block AI bots / block AI scrapers" toggles that became default-on for many accounts in 2024–2025. These block at the network layer, so your robots.txt can say "allowed" while the request still gets a 403. Log into your CDN/security dashboard and look for an "AI Scrapers & Crawlers" or "Bot Fight Mode" setting.

3. Confirm the page renders without JavaScript. Many fetchers grab the raw HTML and don't run your JavaScript. If your prices, hours, or FAQ only appear after a script runs, the bot sees a blank shell. Quick test: right-click → "View Page Source" (not Inspect). If your key facts aren't in that raw HTML, they may be invisible to retrieval bots.

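4. Check meta tags. Look in your page's <head> for <meta name="robots" content="noindex"> or noai / noimageai directives. noindex tells compliant crawlers to keep the page out of their index; noai and noimageai are unofficial opt-out signals that only some crawlers honor. If you want AI visibility, remove all three.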

5. Test a live fetch. Ask ChatGPT or Perplexity directly: "What can you tell me about [your business name] in [city] from their website?" If it can't find anything or pulls only from a directory (Yelp, Google) and never your own site, that's a signal your site isn't being reached or read.
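
Steps 3 and 4 can be approximated with a short script run against the text you get from View Page Source. This is a rough sketch, not a crawler emulator: `RawHtmlAudit` and `audit_raw_html` are hypothetical names, the sample page is made up, and it assumes a fetcher that ignores JavaScript entirely.

```python
from html.parser import HTMLParser

# Directives this post flags in step 4.
BLOCKING = {"noindex", "noai", "noimageai"}


class RawHtmlAudit(HTMLParser):
    """Collects robots meta directives and text that exists without JavaScript."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []
        self.text_parts: list[str] = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "robots":
                content = attr.get("content") or ""
                self.directives += [
                    d.strip().lower() for d in content.split(",") if d.strip()
                ]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def audit_raw_html(html: str, facts: list[str]) -> dict[str, list[str]]:
    """Report blocking meta directives and facts absent from the raw HTML text."""
    parser = RawHtmlAudit()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so line breaks in the markup do not hide a match.
    text = " ".join(" ".join(parser.text_parts).split()).lower()
    return {
        "blocking_directives": [d for d in parser.directives if d in BLOCKING],
        "missing_facts": [
            f for f in facts if " ".join(f.split()).lower() not in text
        ],
    }


# Made-up page: hours are in the HTML, the price is injected by a script.
sample = """
<html><head><meta name="robots" content="noindex, noai"></head>
<body><h1>Acme Roofing</h1><p>Open Mon-Sat 7am-6pm</p>
<script>document.body.append("Repairs from $400")</script>
</body></html>
"""
print(audit_raw_html(sample, ["Open Mon-Sat 7am-6pm", "Repairs from $400"]))
```

On the sample, the audit flags noindex and noai as blocking directives and lists the price as missing, because that text only exists inside a script. The match is a plain substring test, so a fact split across several tags can show up as a false "missing".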

The fix: an explicit "allow" robots.txt

If you've decided you want AI visibility, make it explicit rather than relying on defaults. Here's a clean, permissive robots.txt that welcomes the major AI crawlers while keeping normal SEO intact:

# Allow standard search engines
User-agent: *
Allow: /

# Explicitly welcome AI crawlers (training + retrieval)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Replace the sitemap URL with your real one. Most site builders (Squarespace, Wix, Shopify, WordPress) either generate robots.txt automatically or let you edit it under SEO/crawling settings — search your builder's help for "edit robots.txt."

Note: user-agent names change as vendors add new bots. The list above covers the major ones in 2026, but the principle — Allow: / for the crawlers you want — is what lasts. Don't over-engineer it.
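
Because the bot names will change, one option is to keep them in a single list and generate the file from it. The sketch below is an assumption-laden illustration rather than a required workflow: `render_robots_txt` and `all_allowed` are hypothetical helpers, the list mirrors the robots.txt above, and `yourdomain.com` is a placeholder. The second function reads the generated text back with Python's standard-library parser as a sanity check.

```python
from urllib.robotparser import RobotFileParser

# The one list to edit when a vendor adds or renames a crawler.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "PerplexityBot", "Google-Extended",
]


def render_robots_txt(bots: list[str], sitemap_url: str) -> str:
    """Build a permissive robots.txt with one Allow group per listed bot."""
    lines = ["# Allow standard search engines", "User-agent: *", "Allow: /", ""]
    lines.append("# Explicitly welcome AI crawlers (training + retrieval)")
    for bot in bots:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


def all_allowed(robots_txt: str, bots: list[str],
                url: str = "https://yourdomain.com/") -> bool:
    """Parse the text back and confirm every listed bot may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return all(parser.can_fetch(bot, url) for bot in bots)


robots_txt = render_robots_txt(AI_BOTS, "https://yourdomain.com/sitemap.xml")
print(robots_txt)
print("all AI bots allowed:", all_allowed(robots_txt, AI_BOTS))
```

If your site builder only offers a text box for robots.txt, the printed output is what you would paste in.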

Bonus: the llms.txt file (optional, low-effort, forward-looking)

A newer, emerging convention is llms.txt — a plain-text file at https://yourdomain.com/llms.txt that gives AI models a clean, curated summary of your site and links to your most important pages. Think of it as a "table of contents for machines."

It is not an official standard and not guaranteed to be read by any model today — so treat it as a low-cost bet, not a fix. But it's a single small file, it can't hurt, and a few retrieval systems have started honoring it. A minimal version:

# Acme Roofing — Phoenix, AZ

> Family-owned roofing contractor serving the Phoenix metro since 1998.
> Free written estimates. Repairs and full replacements.

## Key pages
- [Services & pricing](https://yourdomain.com/services): repairs $400–$1,200; replacements from ~$9,000
- [Service area](https://yourdomain.com/areas): Phoenix, Scottsdale, Tempe, Mesa
- [Reviews](https://yourdomain.com/reviews): 4.8 average across 120+ Google reviews
- [Contact](https://yourdomain.com/contact): (555) 010-1234, open Mon–Sat 7am–6pm

Same honesty rule as everywhere in this series: only put in facts that are true and match your Google profile.
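
If you already keep your business facts in one place, you could generate llms.txt from them so the file never drifts from your other listings. This is only a sketch of the layout used in the example above: `Page` and `render_llms_txt` are hypothetical names, the Acme Roofing values are the made-up ones from this post (trimmed to two pages), and since llms.txt is unofficial, no particular layout is guaranteed to be read.

```python
from dataclasses import dataclass


@dataclass
class Page:
    title: str
    url: str
    summary: str


def render_llms_txt(heading: str, blurb: list[str], pages: list[Page]) -> str:
    """Render a heading, a quoted blurb, and a 'Key pages' link list."""
    lines = [f"# {heading}", ""]
    lines += [f"> {sentence}" for sentence in blurb]
    lines += ["", "## Key pages"]
    lines += [f"- [{p.title}]({p.url}): {p.summary}" for p in pages]
    return "\n".join(lines) + "\n"


llms_txt = render_llms_txt(
    "Acme Roofing, Phoenix, AZ",
    ["Family-owned roofing contractor serving the Phoenix metro since 1998."],
    [
        Page("Service area", "https://yourdomain.com/areas",
             "Phoenix, Scottsdale, Tempe, Mesa"),
        Page("Reviews", "https://yourdomain.com/reviews",
             "4.8 average across 120+ Google reviews"),
    ],
)
print(llms_txt)
```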

Priority order (do these in this sequence)

1. Unblock first. If a robots.txt rule or WAF toggle is blocking AI bots, nothing else matters until that's fixed.
2. Render facts in raw HTML. Make sure prices, hours, and answers appear in View Source, not just after JavaScript.
3. Add the explicit allow list so the choice is deliberate and survives plugin updates.
4. Add schema (see Post #3) so the now-readable facts are labeled.
5. llms.txt last as a cheap, optional extra.

Access before labeling before optimizing. Skip step 1 and the rest is wasted effort.

If you'd rather not crawl through your CDN settings and robots files yourself, I run a free AI-visibility snapshot: I check exactly which AI crawlers can and can't reach your site, whether your facts render for them, and what's blocking citation — then send you the specific fixes. No cost, no pitch. If it's useful and you want it done for you, we can talk from there. Reach out: faithpath25 (sales) — ask for the snapshot.

FAQ

Should every business allow AI crawlers?
No. If your content is your product (paywalled media, proprietary databases), you may want to block training bots. For a local service business that wants to be found and recommended, allowing them is almost always the right call.

If I allow GPTBot, will ChatGPT start citing me right away?
No — there's no guarantee and no fixed timeline. Allowing access is necessary, not sufficient. Whether and when you get picked up still depends on your schema, reviews, and how quotable your pages are.

I allowed the bots in robots.txt but still get blocked — why?
Almost always a firewall/WAF (Cloudflare, Wordfence) that rejects the request at the network layer, regardless of what robots.txt says. Check your security dashboard for an "AI bots" or "Bot Fight" toggle.
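
A rough way to spot a User-Agent-based block is to request the same page twice, once looking like a browser and once looking like a bot, and compare the status codes. Treat this as a hint only. The sketch makes several assumptions: `BROWSER_UA` and `BOT_UA` are simplified placeholders rather than the vendors' real header strings, the function names are hypothetical, and the list of "refused" status codes is a guess. Firewalls that verify crawler IP addresses may also reject a spoofed bot that they would let through for the real one, so a refusal here is not proof that the genuine crawler is blocked.

```python
import urllib.error
import urllib.request

# Simplified placeholder strings, NOT the vendors' full User-Agent headers.
BROWSER_UA = "Mozilla/5.0"
BOT_UA = "GPTBot"


def status_for(url: str, user_agent: str, opener=urllib.request.urlopen) -> int:
    """Return the HTTP status the server sends back for this User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx, so read the code here


def looks_blocked(browser_status: int, bot_status: int) -> bool:
    """True when a browser-like request succeeds but the bot-like one is refused."""
    return browser_status < 400 and bot_status in (401, 403, 429, 503)


def check(url: str, opener=urllib.request.urlopen) -> bool:
    """Compare a browser-like and a bot-like request to the same URL."""
    return looks_blocked(
        status_for(url, BROWSER_UA, opener),
        status_for(url, BOT_UA, opener),
    )
```

Calling `check("https://yourdomain.com/")` sends two real requests to that address. The `opener` parameter exists so the logic can be exercised without touching the network.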

Is llms.txt required?
No. It's an emerging, unofficial convention. It's a cheap optional extra — do the robots.txt and schema work first.

Will blocking AI bots hurt my normal Google ranking?
Blocking AI-specific bots (GPTBot, Google-Extended) does not affect classic Google Search ranking, which uses Googlebot. But a blanket Disallow: / for User-agent: * blocks everything, including Googlebot — that one will hurt you badly.

Two ways to act on this:

🔎 Free, no-strings: send your site URL to faithpath25@gmail.com with the subject "GEO snapshot" — I'll send back a 1-page read of exactly what AI assistants can and can't currently see on your site, plus the specific fixes. Free pilot, wherever you operate; if it's useful, a short review is all I ask.

🧰 Do it yourself: the copy-paste schema kits, checklists, and the full GEO audit live at SprintLanding → (includes a free starter). Prices in USD; Gumroad converts to your local currency at checkout.

DEV Community
