You spent months building your website, writing original content, and growing your audience. Then you check your server logs and discover dozens of AI bots crawling your pages every day — consuming bandwidth, scraping your content, and giving nothing back.
Why AI Crawlers Are a Growing Problem
Since 2024, the number of AI-powered crawlers has exploded. Companies training large language models send bots like GPTBot, ClaudeBot, Bytespider, and dozens of others to index web content at scale. Unlike Googlebot, which sends you traffic in return, most AI crawlers take your content without any direct benefit to you. For small site owners and indie developers, this means higher hosting bills, slower page loads for real users, and content being used without consent.
The traditional robots.txt file was designed for a simpler era. It relies on bots voluntarily obeying your rules — and many AI crawlers simply ignore it.
Step 1: Identify Which Bots Are Hitting Your Site
Before blocking anything, you need to know what you're dealing with. Check your server access logs for common AI bot user agents:
- GPTBot (OpenAI)
- ClaudeBot (Anthropic)
- Bytespider (ByteDance)
- CCBot (Common Crawl)
- Google-Extended (Google AI training; a robots.txt control token rather than a separate crawler, so it won't appear in your logs)
- FacebookBot (Meta AI)
On either Apache or Nginx, run: grep -ciE 'gptbot|claudebot|bytespider|ccbot' access.log (point it at your server's access log; -c counts the matching lines). You might be surprised by the volume.
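For a per-bot breakdown, a small loop over the known names works. A sketch, with a here-doc sample standing in for your real log (set LOG to the actual file path; the IPs and user-agent strings below are illustrative):

```shell
# Per-bot request counts. In real use: LOG=/path/to/access.log
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
203.0.113.7 - - [10/May/2025] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
198.51.100.2 - - [10/May/2025] "GET /post HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
203.0.113.7 - - [10/May/2025] "GET /feed HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
EOF

# Case-insensitive count for each bot name
for bot in GPTBot ClaudeBot Bytespider CCBot FacebookBot; do
  printf '%-12s %s\n' "$bot" "$(grep -ci "$bot" "$LOG")"
done
```

A breakdown like this tells you which specific blocks are worth the effort, rather than one aggregate number.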
Step 2: Update Your robots.txt (But Don't Stop There)
Add disallow rules for each crawler you identified in Step 1, one blank-line-separated group per bot:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: FacebookBot
Disallow: /
This is a starting point, but it has two major weaknesses: new bots appear constantly, and not all crawlers respect robots.txt. You need server-level enforcement too.
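Because the file tends to drift out of sync with the list of bots you care about, a small check script helps. A sketch that writes a sample robots.txt and flags missing entries (in real use, point ROBOTS at your deployed file instead):

```shell
# Flag bots that have no User-agent entry in robots.txt.
ROBOTS=$(mktemp)   # sample for illustration; in real use: ROBOTS=/path/to/robots.txt
cat > "$ROBOTS" <<'EOF'
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
EOF

for bot in GPTBot ClaudeBot Bytespider CCBot; do
  if grep -qi "^User-agent: *$bot" "$ROBOTS"; then
    echo "$bot: listed"
  else
    echo "$bot: MISSING"
  fi
done
```

Run it whenever you add a new bot to your server-level rules, so robots.txt and the enforcement layer stay consistent.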
Step 3: Block at the Server Level
For Nginx, add user-agent checks in your server block:
if ($http_user_agent ~* "(GPTBot|ClaudeBot|Bytespider|CCBot)") {
    return 403;
}
For Apache, use .htaccess:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Bytespider|CCBot) [NC]
RewriteRule .* - [F,L]
This is more reliable than robots.txt alone, but you still need to maintain and update these rules manually as new crawlers emerge.
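As the list grows, a long regex inside an if becomes awkward to maintain. On Nginx, one more scalable pattern is a map in the http block; this is a sketch, and the variable name $block_ai is just a placeholder:

```nginx
# Goes in the http block: sets $block_ai to 1 for any matching user agent.
map $http_user_agent $block_ai {
    default        0;
    ~*gptbot       1;
    ~*claudebot    1;
    ~*bytespider   1;
    ~*ccbot        1;
}

# Then, in each server block that should enforce the ban:
# if ($block_ai) { return 403; }
```

Adding a new crawler then means adding one line to the map, and the same variable can gate every site on the server.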
Step 4: Consider Rate Limiting
Some bots disguise their user agent. Rate limiting suspicious traffic patterns catches what user-agent blocking misses. Tools like fail2ban or Cloudflare's rate limiting rules can help, though they require careful configuration to avoid blocking legitimate users.
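As a rough sketch on Nginx, the built-in limit_req module can cap per-IP request rates; the zone name and the 60 requests/minute threshold below are illustrative assumptions, not recommendations:

```nginx
# http block: track clients by IP in a 10 MB zone, 60 requests/minute each
limit_req_zone $binary_remote_addr zone=perip:10m rate=60r/m;

# inside a server or location block: allow short bursts, reject the
# overflow with 429 Too Many Requests instead of the default 503
limit_req zone=perip burst=20 nodelay;
limit_req_status 429;
```

Tune the rate against your real traffic first; a threshold that catches aggressive scrapers should still be comfortably above what a human reader ever generates.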
A Simpler Approach
If maintaining blocklists and server configs sounds like more work than you want, tools like CrawlShield offer a managed solution. It keeps an updated database of AI crawler signatures and handles blocking automatically, which can save time if you're running multiple sites or don't want to monitor new bots yourself. At $9.99, it's one option worth evaluating alongside the manual approach.
Keep Monitoring
Whichever method you choose, blocking AI bots from crawling your website isn't a set-and-forget task. New crawlers appear regularly, and some rotate user agents to avoid detection. Set up a monthly log review to catch anything that slips through, and consider automated alerting for unusual traffic spikes.
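For that monthly review, a one-liner that ranks user agents by request count makes unfamiliar crawlers easy to spot. A sketch for the common combined log format, where the user agent is the last double-quoted field (the here-doc stands in for your real log file):

```shell
# Rank user agents by hit count; replace the here-doc with < /path/to/access.log
summary=$(awk -F'"' '{n[$6]++} END {for (ua in n) print n[ua], ua}' <<'EOF' | sort -rn
203.0.113.7 - - [01/Jun/2025] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
198.51.100.2 - - [01/Jun/2025] "GET /about HTTP/1.1" 200 410 "-" "Mozilla/5.0"
203.0.113.7 - - [01/Jun/2025] "GET /blog HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"
EOF
)
echo "$summary" | head -10
```

Any high-volume agent near the top that you don't recognize is a candidate for the blocklists in Steps 2 and 3.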