You published a 2,000-word guide last month. Took you a full day to research and write. It ranks on Google, drives leads to your business, and brings in real readers.
Now imagine an AI model reading that entire guide, absorbing every word, and serving up a polished summary the next time someone asks a related question. The user gets the answer. You get nothing: no click, no visit, no credit.
That's not a hypothetical. It's happening at scale, right now, on WordPress sites exactly like yours.
The good news? There are technically enforced ways to stop it. Here's what actually works, and what doesn't.
First, Understand Who's at the Door
AI companies deploy automated bots to crawl the public web and collect text for training their large language models. The mechanics work exactly like a search engine crawler, except the destination is different. Instead of building an index that sends people to your site, AI training bots harvest your content to build models that answer questions without sending anyone anywhere.
Many of these bots are transparent about what they are. OpenAI uses GPTBot for model training, OAI-SearchBot for search, and ChatGPT-User for direct user requests. Anthropic has ClaudeBot. Google has Google-Extended. Perplexity has PerplexityBot. ByteDance has Bytespider. The list keeps growing.
The reaction from website owners has been swift:
- 🚫 ~5.6 million websites now block GPTBot, up from 3.3 million in July 2025 (nearly +70%)
- 🚫 ClaudeBot is blocked on ~5.8 million websites, up from 3.2 million in the same period

(Source: The Register, December 2025)
The reason is straightforward: search crawlers send traffic back to your site in exchange for content. AI training crawlers use that same content to answer questions directly inside their own apps, with far less traffic returned to you.
Why Most Site Owners Start with the Wrong Tool
The first instinct is robots.txt β add a few Disallow lines and call it done. Every SEO plugin makes this easy, and yes, you should do it.
But here's the reality: robots.txt has never been a technical barrier. It's a convention.
The Robots Exclusion Protocol is purely advisory and relies on the web robot's own compliance; it cannot enforce anything. This was confirmed legally in Ziff Davis v. OpenAI (S.D.N.Y. 2025), where the court ruled that robots.txt doesn't qualify as a "technological measure that effectively controls access" to copyrighted works. It's more like a sign than a lock.
The compliance gap is measurable. According to a Tollbit industry report:
- Q4 2024: 3.3% of AI bot requests ignored robots.txt
- Q2 2025: 13.26% of AI bot requests ignored robots.txt
Even Cloudflare's own documentation is direct about it:
"Respecting robots.txt is voluntary... Some crawler operators may disregard your robots.txt preferences and crawl your content regardless."
robots.txt is worth doing. But it should be the floor, not the ceiling.
What Actually Enforces the Block
The only reliable protection is blocking AI bots at the technical level, before they ever reach your content. That means firewall-level enforcement based on the bot's user-agent string.
Here's the logic: most legitimate AI crawlers identify themselves openly in their user-agent header. GPTBot says it's GPTBot. ClaudeBot says it's ClaudeBot. A properly configured firewall can intercept those requests and deny them before WordPress even loads β no content served, no server resources wasted, no data harvested.
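That matching logic can be sketched in a few lines. This is an illustrative example only, not WP Ghost's actual implementation; the bot list is a sample and a real firewall would check many more signatures:

```python
# Sketch of firewall-style user-agent blocking (illustrative, not WP Ghost's code).
AI_BOT_SIGNATURES = [
    "GPTBot", "ClaudeBot", "CCBot", "Google-Extended",
    "PerplexityBot", "Bytespider",
]

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent header matches a known AI crawler."""
    ua = user_agent.lower()
    return any(sig.lower() in ua for sig in AI_BOT_SIGNATURES)

# A self-identifying crawler is caught; a normal browser is not.
print(is_blocked("Mozilla/5.0; compatible; GPTBot/1.2; +https://openai.com/gptbot"))  # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/133.0"))          # False
```

The key point: because the check runs on the raw request header, a match can be denied before any application code (WordPress included) executes.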
How WP Ghost Handles This
WP Ghost is primarily known as a WordPress hack-prevention plugin: it hides your /wp-admin path, runs an 8G firewall, and makes your site harder for automated scanners to fingerprint. But the same firewall infrastructure that blocks malicious bots also handles AI crawlers through its Block by User Agent feature.
Here's the path:
WP Ghost > Firewall > Blacklist > Block User Agents
When you add a bot's user-agent string there, two things happen simultaneously:
- The request is blocked at the firewall level
- A Disallow rule is automatically added to your robots.txt
Hard block + policy signal. One action.
What's in the Free vs. Premium Version?
Free version: includes Block by User Agent. You can paste in AI bot strings manually and get real enforcement immediately. No upgrade required.
The full 2026 AI bot list from WP Ghost's official documentation includes:
AI2Bot, Amazonbot, AnthropicBot, anthropic-ai, Applebot-Extended,
Bytespider, CCBot, ChatGPT-Operator, ChatGPT-User, Claude-Code,
ClaudeBot, cohere-ai, DeepSeekBot, DuckAssistBot, Google-Extended,
GPTBot, GrokBot, img2dataset, Meta-externalagent, MistralAI-User,
OAI-SearchBot, PerplexityBot, Perplexity-User, YouBot...
Free users can paste this list in manually and get the full block.
Premium version: adds a dedicated AI Copyright Protection feature that loads and applies the complete, maintained list automatically. New crawlers are added with each plugin release. If you don't want to track a moving target yourself, this is where the value is.
⚠️ Important: When WP Ghost's firewall is active, legitimate search engine bots (Googlebot, Bingbot, Yandex) are automatically whitelisted. Your SEO stays completely intact.
Other Options Worth Knowing
Your SEO Plugin (Yoast, Rank Math, etc.)
Every major SEO plugin lets you edit your robots.txt. Do this first β it's free and takes five minutes. OpenAI's own documentation tells publishers to disallow GPTBot from sites they want excluded from AI training. At minimum, add rules for:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /
```
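You can sanity-check rules like these the way a compliant crawler would read them, using Python's standard-library robots.txt parser. A minimal sketch (the rules are fed in directly rather than fetched from a live site):

```python
from urllib import robotparser

# Parse a robots.txt snippet the way a compliant crawler would.
rules = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "/guide/"))     # False: GPTBot is disallowed everywhere
print(rp.can_fetch("Googlebot", "/guide/"))  # True: other crawlers are unaffected
```

Note that this only tells you what a bot that *honors* robots.txt will do; it says nothing about the ones that ignore it.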
Understand what this is and isn't: a policy signal for compliant bots, not a technical barrier for non-compliant ones.
Cloudflare Bot Management
Cloudflare sits in front of your server at the infrastructure level and can block bots before traffic ever hits your hosting. In 2025, they launched a managed robots.txt service and a Pay per Crawl feature for publishers. Powerful, but it requires technical familiarity and sits behind paid tiers. Overkill for most individual WordPress site owners starting from scratch.
Content Gating (MemberPress, Paid Memberships Pro)
The most technically airtight method: content behind a login wall can't be scraped by bots that can't authenticate. The obvious trade-off is that it removes that content from public search indexing entirely. Not the right fit for blogs or content marketing sites that depend on organic traffic.
The Realistic Comparison
| Approach | Technically Enforced | Bot List Maintained | SEO Impact | Effort |
|---|---|---|---|---|
| WP Ghost Free (manual) | Yes | Manual | None | Low |
| WP Ghost Premium (auto) | Yes | Auto-updated | None | Minimal |
| robots.txt only | No (advisory) | Manual | None | Very low |
| Cloudflare Bot Management | Yes | Managed | None | High |
| Content gating | Yes | N/A | Removes from search | Medium |
What to Do Right Now, in Order
Step 1 (5 minutes, free)
Open your SEO plugin and add Disallow rules for the major AI bots in your robots.txt. Handles the compliant ones immediately.
Step 2 (10 minutes, free)
If you have WP Ghost installed (free version is enough), go to:
WP Ghost > Firewall > Blacklist > Block User Agents
Paste in the AI bot user-agent list from WP Ghost's documentation. You now have firewall-level enforcement + automatic robots.txt coverage. No upgrade needed.
Step 3: Check your logs
Search your server access logs for strings like GPTBot or ClaudeBot. Most site owners are surprised by how frequently these appear, and how long they've been visiting.
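A quick way to tally those hits is a small script over your access log. This is a sketch: the bot names are a sample, the log lines below are made-up examples, and in practice you would feed it your real log file:

```python
import re
from collections import Counter

# Sample pattern; extend with any crawler user-agents you want to track.
BOT_PATTERN = re.compile(r"GPTBot|ClaudeBot|CCBot|PerplexityBot|Bytespider")

def count_ai_bot_hits(log_lines):
    """Count requests per AI crawler found in access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = BOT_PATTERN.search(line)
        if match:
            hits[match.group(0)] += 1
    return hits

# In practice: with open("/var/log/nginx/access.log") as f: count_ai_bot_hits(f)
sample = [
    '66.249.66.1 - - [01/Jan/2026] "GET / HTTP/1.1" 200 "-" "Googlebot/2.1"',
    '20.15.240.64 - - [01/Jan/2026] "GET /guide HTTP/1.1" 200 "-" "GPTBot/1.2"',
    '20.15.240.65 - - [01/Jan/2026] "GET /blog HTTP/1.1" 200 "-" "GPTBot/1.2"',
]
print(count_ai_bot_hits(sample))  # Counter({'GPTBot': 2})
```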
Step 4 (optional)
If maintaining the AI bot list yourself feels like a recurring task (and it will grow as new AI companies launch crawlers), WP Ghost Premium's AI Copyright Protection feature keeps the list current automatically with every plugin release.
One Thing No Tool Can Fully Solve
Newer AI browsers and developer tools are increasingly indistinguishable from regular human traffic in server logs. That's an evolving challenge no plugin has a complete answer for today.
Legal frameworks around AI scraping and copyright are still developing, with multiple lawsuits active in 2025β2026 involving OpenAI, Anthropic, Perplexity, and others.
What you can control today is making your site a harder target than the next one. robots.txt is the "no trespassing" sign. A firewall-level block is the locked door. Both matter, but only one of them actually keeps anyone out.
For WordPress site owners, that locked door is available right now, in the free version of a plugin that also handles the rest of your site's security at the same time.
Have you checked your server logs for AI crawler activity? Most site owners are surprised by what's already in there.