
Your Website Might Be Invisible to ChatGPT and Gemini Right Now, and Your Hosting Provider Is the Reason Why

Originally published on The Searchless Journal

Your SEO dashboard looks clean. Google Search Console shows no issues. Your pages are indexed. Traffic is stable. Everything appears to be working.

Except your site might be completely invisible to ChatGPT, Claude, and Gemini, and you would never know it from any standard SEO audit.

On May 6, 2026, Search Engine Land published an investigation by Will Scott that exposed a problem most site owners did not know existed. Managed WordPress hosting providers, WP Engine specifically in this case, are blocking AI crawlers at the server level. Not through robots.txt. Not through a plugin. Not through any setting the customer can see or control. The block is buried inside the hosting platform's own infrastructure, in a layer that sits between the CDN and WordPress itself.

If your content is being rejected before AI crawlers can even read it, every other GEO tactic you deploy is wasted. This is not a content quality problem. It is an infrastructure access problem, and it is the most common invisible failure mode in AI visibility today.

What the investigation found

Scott's team at Search Influence noticed something odd in their AI citation monitoring data. Their site, searchinfluence.com, showed strong presence in Google AI Mode (37.8%) and respectable showings in Copilot (22.2%) and Google Gemini (16.3%). ChatGPT cited them 9.6% of the time. Perplexity came in at 7.8%.

Claude: 0.0%. Meta AI: 0.0%.

Every crawler reads the same site. Content quality, topical authority, and relevance are identical across all platforms. The only variable that explains a gap from 37.8% to zero is whether each crawler is actually allowed to reach the content in the first place.

Scott opened the server logs. Seven days of Cloudflare data from April 4-10, 2026 revealed 29,099 bot requests, with 65.8% coming from AI bots. The breakdown told the story clearly:

  • Amazonbot: 51% of requests rate-limited (HTTP 429)
  • ClaudeBot: 29% rate-limited
  • GPTBot: 29% rate-limited
  • Bytespider: 61% blocked entirely (HTTP 403/5xx)
  • ChatGPT-User: 0% rate-limited
  • PerplexityBot: 0% rate-limited

The pattern was deliberate. Training crawlers, the bots that pull entire sites in large bursts, were being throttled or blocked. User-facing crawlers, the ones that make human-paced requests during live queries, were allowed through.

This is not random. This is a policy.

The detective work that uncovered the real culprit

What makes this problem so dangerous is that it hides behind normal-looking infrastructure. Scott's team spent hours eliminating suspects before finding the real cause.

They checked their WordPress security plugin, Solid Security, which has a built-in bot user agent blocklist. Toggled it off. Ran a 24-hour before/after comparison. No change.

They audited 24,538 firewall log entries over 30 days. Every single one was a brute-force lockout on wp-login.php. Zero entries for any AI crawler. Rules were empty. IP management was clean.

They checked Sucuri, their cloud WAF subscription. It turned out the subscription existed but was never activated. DNS resolved to Cloudflare, not Sucuri. Sucuri was never in the request path.

They examined Cloudflare itself, which had originally been dismissed because cache status showed "dynamic" or "bypass." Going back with the right filter, Security > Analytics > Events filtered by the ClaudeBot user agent over 24 hours, turned up zero security events. Cloudflare took no action on ClaudeBot, yet 608 ClaudeBot requests returned HTTP 429 in that same window.

Every layer the site owner could see was clean. The block was coming from somewhere else entirely.

The breakthrough came with a direct reproduction test. The team ran 60 curl requests with a ClaudeBot user agent against three different URL paths. All 60 returned HTTP 429. Control tests with a browser user agent returned 200. Same paths with a Googlebot user agent also returned 200.

The block was unambiguously based on the user agent string. And the smoking gun was in the response headers: x-powered-by: WP Engine.

The hosting provider was blocking AI crawlers at its own infrastructure layer, a layer no customer can access, configure, or even see.
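
A minimal sketch of that kind of reproduction test, for anyone who wants to run it against their own site, might look like the following. The URL, request count, and one-second pause are placeholders, and the bare "ClaudeBot" token is a stand-in for the full user agent string; substring-based blocks typically match on the token, but that is an assumption worth verifying against your own logs.

URL="https://yoursite.com/some-page/"
UA="ClaudeBot"

# Repeat requests with the bot user agent and tally the status codes returned.
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" -A "$UA" "$URL"
  sleep 1
done | sort | uniq -c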

The invisible architecture of the block

Once Scott's team knew where to look, they mapped the full bot-by-bot fingerprint:

  • ClaudeBot: 60/60 returned 429 (blocked)
  • GPTBot: 8/10 returned 429, 2/10 served a cached 200 (blocked)
  • Amazonbot: 10/10 returned 429 (blocked)
  • Bytespider: 10/10 returned 520 (blocked)
  • anthropic-ai (older UA): 10/10 returned 200 (not blocked)
  • CCBot (Common Crawl): 10/10 returned 200 (not blocked)

Two things stand out immediately.

The blocklist is dated. It targets the known AI training crawler set as of mid-2024. The older anthropic-ai user agent is allowed through. CCBot, which feeds data into Common Crawl and from there into multiple LLM training pipelines, is also allowed. If the goal is to prevent training data extraction, the fence has holes large enough to drive a truck through.

Cached responses serve through the block. WP Engine's edge cache returns cached pages to blocked bots without issue. Only cache-miss requests, the ones that need to hit the origin server, get the 429 treatment. This explains the mixed data perfectly: in a 24-hour window, 1,054 ClaudeBot requests returned 200 (cache hits) and 608 returned 429 (cache misses). Same bot, same site, two different outcomes depending on whether the page happened to be cached.

This means the block is not even fully effective at its own stated purpose. It is a partial, inconsistent filter that makes your AI visibility data noisy and unreliable without actually preventing determined scrapers from accessing your content.
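
This cache behavior also matters when you test your own site: a 200 may only mean you hit a cached copy, not that the origin would have let the bot through. One rough way to probe the origin path is to append a unique query string so the request is less likely to be served from cache; whether that actually forces a cache miss depends on how the host keys its cache, so treat the sketch below as an assumption to verify, not a guarantee.

# Use a unique query string to reduce the chance of a cache hit, then compare
# a bot user agent against a browser user agent on the same URL.
# Whether this bypasses the edge cache depends on the host's cache configuration.
BUST="nocache=$(date +%s)"
curl -sI -A "ClaudeBot" "https://yoursite.com/some-page/?$BUST" | head -n 1
curl -sI -A "Mozilla/5.0" "https://yoursite.com/some-page/?$BUST" | head -n 1
# If present, cache-related response headers (names vary by host) show whether
# the response came from cache or the origin.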

Why standard audits cannot catch this

This is the part that should concern every SEO team and site owner reading this. The block is specifically designed to be invisible to the tools and processes most teams use.

It returns HTTP 429, not 403. A 403 "Forbidden" response can get a site flagged by search engines as having site-wide failures. A 429 "Too Many Requests" is safer from an SEO perspective but reads as a rate limit in every WAF analytics tool on the market. Investigators end up chasing rate-limit configurations at the wrong layer.

It fires below WordPress plugins. Wordfence, Sucuri, Solid Security, and similar tools all log at the WordPress application layer. The hosting provider's block fires at the platform edge, before the request ever reaches WordPress. Plugin logs show nothing.

It fires below the customer's own Cloudflare zone. WP Engine runs its own Cloudflare-backed bot management at the hosting edge. That is a separate Cloudflare instance sitting behind whatever CDN configuration the customer has set up. Events triggered there do not appear in the customer's Cloudflare dashboard.

The hosting provider's own documentation states that further information about their firewall "cannot be provided, as this can compromise its secure integrity." Customers are explicitly denied visibility into which bots are being blocked and how.

Scott reached a live WP Engine support agent after navigating through several rounds of canned responses. The agent confirmed three key points. The platform enforces "platform-wide rate limiting on certain high-impact bots" that "can't be selectively disabled per bot." Customer-facing web rules "do not override" the platform-level infrastructure rules. And the internal documentation acknowledges that blocking bots like Amazonbot "can impact their crawling and indexing."

That last point is critical. The hosting provider knows this affects SEO and AI visibility. They have documented it internally. But the customer is not told.

WordPress powers 43% of the web

The scale of this problem is difficult to overstate. WordPress powers approximately 43% of all websites globally as of early 2026, according to W3Techs data. The managed WordPress hosting segment, which includes providers like WP Engine, Kinsta, Flywheel, Pagely, and others, serves a significant portion of high-traffic, commercially important sites. These are exactly the sites that have the most to gain from AI visibility.

If even a subset of managed WordPress hosts is applying similar policies, millions of sites could be silently invisible to AI platforms. And because the block is invisible to standard audits, most of them will never know.

Not every host takes this approach. Kinsta's CTO stated publicly in March 2026 that they "will not block at the platform level" and will not bill customers for bot bandwidth. The divergence between hosting providers on this issue means that two identical WordPress sites, running the same plugins and content, could have radically different AI visibility profiles based solely on who hosts them.

This is the infrastructure equivalent of building a beautiful store and then discovering the landlord locked the front door without telling you.

How to check if your site is affected

The good news is that detecting this problem is straightforward, even if fixing it is harder.

The curl test. Run a request against your own site using an AI crawler user agent:

curl -I -A "ClaudeBot" https://yoursite.com/some-page/

If you get a 429 or 403 response, try the same URL with a regular browser user agent:

curl -I -A "Mozilla/5.0" https://yoursite.com/some-page/

If the browser agent returns 200 and the ClaudeBot agent returns 429, you have a server-level block in place. Test with GPTBot, Amazonbot, and PerplexityBot user agents as well to map the full scope.
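
To run that comparison across several user agents in one pass, a small loop works; the URL is a placeholder and the list of user agent tokens is illustrative, not exhaustive:

URL="https://yoursite.com/some-page/"
# Print the HTTP status each user agent receives for the same page.
for UA in "ClaudeBot" "GPTBot" "Amazonbot" "PerplexityBot" "Bytespider" "Mozilla/5.0"; do
  CODE=$(curl -s -o /dev/null -w "%{http_code}" -I -A "$UA" "$URL")
  printf "%-15s %s\n" "$UA" "$CODE"
done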

Check server access logs. Look for AI crawler user agents in your raw server logs. Filter for ClaudeBot, GPTBot, ChatGPT-User, PerplexityBot, Amazonbot, and Bytespider. Compare the HTTP status codes. A high proportion of 429s relative to 200s for specific bots is a red flag.
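
If you have shell access to raw logs, a rough first pass can be done with grep and awk. The log path and field positions below are assumptions based on a common combined log format (status code in the ninth field); adjust them for your server.

# Count status codes per AI crawler user agent in a combined-format access log.
LOG=/var/log/nginx/access.log
for BOT in ClaudeBot GPTBot ChatGPT-User PerplexityBot Amazonbot Bytespider; do
  echo "== $BOT =="
  grep "$BOT" "$LOG" | awk '{print $9}' | sort | uniq -c | sort -rn
done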

Check your hosting provider's documentation. Look for any mention of bot management, rate limiting, or crawler policies. If they mention Cloudflare-powered bot management or platform-wide rate limiting without providing per-site controls, that is a warning sign.

Use an AI visibility audit. Tools designed specifically for GEO analysis can detect patterns of missing citations that correlate with crawler access issues. If you have strong AI visibility monitoring in place, a platform-specific zero presence, like Claude showing 0% while Google AI Mode shows 30%+, is a signal worth investigating.

Our AI visibility audit methodology includes crawler access checks as a foundational step. If the crawlers cannot reach your content, nothing else in the audit matters. You can also run a free check through the AI Visibility Audit tool.

What to do if you are blocked

The fix depends on where the block originates. If it is a robots.txt issue, you can resolve it yourself by editing the file. If it is a WordPress plugin, you can adjust settings or switch plugins.

If it is a hosting-level block, you have fewer options.

Contact your hosting provider directly. Ask specifically whether they enforce platform-level bot rate limiting or blocking. Reference the Search Engine Land investigation. Ask whether AI crawler user agents are affected and whether you can opt out.

Consider switching hosts. If your current provider cannot or will not disable platform-level AI bot blocking, moving to a host that explicitly allows AI crawlers is the most reliable fix. Kinsta has publicly committed to not blocking at the platform level. Other hosts may follow as awareness of this issue grows.

Add a secondary CDN layer. Some teams have worked around hosting-level blocks by routing traffic through their own Cloudflare or CDN configuration that sits in front of the host. This adds complexity and cost but can give you control over bot access policies.

Monitor continuously. This is not a one-time check. Hosting providers can update their bot policies at any time without notifying customers. Set up regular crawler access tests as part of your GEO monitoring routine. Understanding how to get cited by AI platforms starts with confirming the crawlers can actually reach you.
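
A lightweight way to automate the check, sketched under the assumption that you can run cron jobs and send mail from a server; the URL, schedule, recipient, and alert mechanism are all placeholders to swap for your own monitoring stack:

#!/usr/bin/env bash
# check-ai-crawler-access.sh - alert when an AI crawler user agent stops getting 200s.
# Example crontab entry (weekly, Monday 07:00):
#   0 7 * * 1 /usr/local/bin/check-ai-crawler-access.sh
URL="https://yoursite.com/some-page/"
for UA in ClaudeBot GPTBot Amazonbot PerplexityBot; do
  CODE=$(curl -s -o /dev/null -w "%{http_code}" -I -A "$UA" "$URL")
  if [ "$CODE" != "200" ]; then
    echo "$UA received HTTP $CODE from $URL" | mail -s "AI crawler access alert" you@example.com
  fi
done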

The broader context: who controls access to your content

This issue sits at the intersection of several emerging debates about AI crawler access, content ownership, and infrastructure control.

Google is currently testing Web Bot Auth, a cryptographic protocol designed to help websites verify that bot traffic is legitimate. The IETF webbotauth working group has published milestones targeting standards-track specifications for bot authentication by April 2026 and best current practice operational documents by August 2026. Cloudflare has already published integration documentation for Web Bot Auth verification.

The goal of Web Bot Auth is to distinguish between authentic crawlers (real GPTBot, real ClaudeBot) and impersonators. This matters because Scott's investigation found that roughly 100% of ClaudeBot traffic to their site came from a Microsoft Azure IP, not Anthropic's published AWS ranges. A significant portion of "AI crawler" traffic is actually scrapers wearing borrowed user agent strings.
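
If you want to sanity-check whether the "ClaudeBot" traffic in your own logs really comes from the vendor, one rough approach is to pull the source IPs behind that user agent and look up who owns them. The log path and field positions are assumptions, and vendors publish their official crawler IP ranges in different places, so treat this as a starting point rather than authoritative verification:

# List the top source IPs that claimed a ClaudeBot user agent, then look up the
# owning network with whois. Log path and field positions are assumptions.
LOG=/var/log/nginx/access.log
grep "ClaudeBot" "$LOG" | awk '{print $1}' | sort | uniq -c | sort -rn | head -n 10 |
while read -r COUNT IP; do
  ORG=$(whois "$IP" | grep -iE 'orgname|org-name|netname' | head -n 1)
  echo "$COUNT requests from $IP -> $ORG"
done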

But Web Bot Auth does not solve the managed hosting problem. A hosting provider that blocks verified, authenticated AI crawlers at the platform level is making a business decision, not a security decision. The infrastructure for distinguishing legitimate bots from imposters is being built. The question is whether hosting providers will use it to allow legitimate crawlers or to enforce blanket blocks more efficiently.

There is also a fundamental tension in the hosting economics. AI training crawlers consume enormous amounts of bandwidth relative to the value they return. Cloudflare's Q1 2026 data shows ClaudeBot makes 20,583 crawl requests for every referral it sends back. GPTBot: 1,255 to 1. Google: 5 to 1. Hosting providers absorb that bandwidth cost, and for sites on fixed-traffic plans, AI crawler traffic can push them into overage charges.

WP Engine reported mitigating 75 billion bot requests in their 2025 Year in Review. From the hosting provider's perspective, blocking aggressive AI training crawlers is a cost-saving measure. From the site owner's perspective, it is a silent tax on their AI visibility.

The bottom line

If you run a WordPress site on managed hosting, you need to check whether AI crawlers can actually access your content. This is not optional. It is the most fundamental prerequisite for any GEO strategy.

Your robots.txt might be perfectly configured. Your content might be excellent. Your structured data might be flawless. None of it matters if your hosting provider is returning HTTP 429 to every AI crawler that knocks on the door.

Run the curl test. Check your logs. Ask your host. And if they are blocking, decide whether the bandwidth savings are worth being invisible to the fastest-growing source of referral traffic on the internet.

Because right now, the citation set is hardening. The sources that AI platforms learn to trust now will have compounding advantages later. Treating AI crawler access as optional in 2026 is the same calculation as treating Googlebot access as optional in 2005. It worked for a while, until it did not.


Run a free AI visibility audit and find out if your hosting is blocking AI crawlers. Start your audit at searchless.ai



FAQ

Is this only a WP Engine problem?
The Search Engine Land investigation specifically documents WP Engine, but the detection methodology applies to any managed WordPress host. Kinsta has publicly stated they do not block AI crawlers at the platform level. Other hosts have not publicly disclosed their policies. The only way to know is to test your own site.

Can I fix this by editing my robots.txt?
No. The block described in this investigation operates at the server infrastructure level, before the request reaches WordPress or your robots.txt. A perfectly configured robots.txt will not override a hosting-level block.

Should I allow all AI crawlers?
That depends on your business goals. If AI citations drive brand awareness, traffic, and leads, blocking crawlers costs you visibility. If you are concerned about content being used for training without compensation, blocking may be intentional. The problem is not the decision itself but the fact that the decision is being made for you without your knowledge.

How often should I check crawler access?
At minimum, quarterly. Hosting providers can update bot policies at any time. If you are actively investing in GEO and AI visibility, monthly checks using the curl test described above are prudent. Automated AI visibility monitoring can flag sudden drops in platform-specific citations that may indicate a new block.


Learn more about how AI platforms find, evaluate, and cite content in our AI Visibility guide. Explore the guide
