
Alan West


Blocking AI Crawlers vs. Letting Them In: A Practical Defense Guide

Someone on Reddit recently shared that Meta's AI crawler hit their site 7.9 million times in 30 days — burning through 900+ GB of bandwidth before they even noticed. If that doesn't make you want to immediately check your server logs, I don't know what will.

I spent last weekend auditing three of my own sites after seeing that post. Turns out, I had a similar (though less dramatic) problem. That rabbit hole led me to completely rethink how I handle bot traffic, monitoring, and analytics. Here's what I learned comparing different approaches to detecting, measuring, and blocking aggressive AI crawlers.

Why This Matters Now

AI companies need training data, and your website is an all-you-can-eat buffet. Meta's crawler (Meta-ExternalAgent), OpenAI's GPTBot, Anthropic's ClaudeBot, and dozens of others are hammering sites at rates that would make a DDoS look polite.

The problem isn't just philosophical. It's practical:

  • Bandwidth costs money. 900+ GB of crawler traffic on a small site is absurd.
  • Server performance degrades. Your actual human visitors get slower page loads.
  • Most people don't notice until the hosting bill arrives or the site goes down.

The first step is actually seeing the problem. And that's where your choice of analytics and monitoring tooling matters a lot.

Traditional Analytics vs. Privacy-Focused Analytics for Bot Detection

Here's the thing — if you're running Google Analytics, you probably won't see crawler traffic at all. GA runs client-side JavaScript, and bots typically don't execute JS. Your dashboard looks fine while your server is getting pummeled.

This is where server-side or privacy-focused analytics tools actually shine for a different reason than privacy: they can surface traffic patterns that JS-only tools miss entirely.
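To see what a JS-only dashboard misses, go straight to the access log. A quick sketch, assuming nginx's default combined log format, where the user agent is the sixth quote-delimited field:

```shell
# Top user agents by request count from an nginx access log
awk -F'"' '{print $6}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20
```

If a crawler name tops this list while your analytics dashboard shows modest traffic, you've found the gap.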

Umami (Self-Hosted, Open Source)

Umami is my current pick for most projects. It's open source, you self-host it, and it gives you a clean dashboard without any cookie banners.

<!-- Umami tracking script: lightweight, no cookies -->
<!-- Add this to your <head> and you're done -->
<!-- Note: newer Umami releases serve the tracker at /script.js rather than /umami.js -->
<script
  async
  defer
  data-website-id="your-website-id"
  src="https://your-umami-instance.com/umami.js"
></script>

What I like about Umami for this use case:

  • GDPR compliant out of the box — no cookies, no personal data collection
  • Self-hosted means you own the data — nobody else is training models on your analytics
  • Lightweight — the tracking script is under 2KB
  • Simple dashboard that actually shows you what matters

The downside: Umami alone won't show you bot traffic either, since it's still JS-based. You need to pair it with server log analysis. But having clean human-traffic data makes it easy to spot the delta when you compare against raw server logs.
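Computing that delta is simpler than it sounds. A rough sketch, assuming the combined nginx log format and a few known crawler tokens:

```shell
# What fraction of raw requests came from known AI crawlers?
LOG=/var/log/nginx/access.log
total=$(wc -l < "$LOG")
bots=$(grep -ciE 'GPTBot|ClaudeBot|CCBot|Meta-ExternalAgent|Bytespider' "$LOG")
echo "$bots of $total requests matched known AI crawler user agents"
```

Compare that bot count against the visitor count Umami reports for the same window and the gap becomes obvious.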

Plausible (Hosted or Self-Hosted)

Plausible is similar in philosophy but offers a hosted option if you don't want to manage infrastructure. It's also open source and GDPR compliant without cookies.

<!-- Plausible — even simpler setup -->
<script
  defer
  data-domain="yourdomain.com"
  src="https://plausible.io/js/script.js"
></script>

Plausible's hosted plan starts at $9/month. If you self-host, it's free. The dashboard is arguably even cleaner than Umami's, and they've got a solid API for pulling data programmatically.
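Pulling a 30-day aggregate looks roughly like this, a sketch based on the shape of Plausible's v1 stats API (the API key and domain are placeholders):

```shell
# Aggregate visitors and pageviews for the last 30 days
curl -s -H "Authorization: Bearer $PLAUSIBLE_API_KEY" \
  "https://plausible.io/api/v1/stats/aggregate?site_id=yourdomain.com&period=30d&metrics=visitors,pageviews"
```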

Fathom (Hosted Only)

Fathom is the premium option. It's not open source and not self-hostable, but it's rock solid and has excellent uptime. Starts at $15/month.

The real comparison comes down to this:

Feature                        Umami              Plausible     Fathom
Self-hosted option             Yes                Yes           No
Open source                    Yes                Yes           No
GDPR compliant (no cookies)    Yes                Yes           Yes
Free tier                      Self-host          Self-host     No
Hosted pricing                 N/A (self-host)    From $9/mo    From $15/mo
API access                     Yes                Yes           Yes
Bot filtering                  Basic              Basic         Basic

None of these will catch aggressive server-side crawlers on their own. But they give you the clean baseline of real human traffic that you need to identify the problem.

Actually Blocking the Crawlers: robots.txt vs. Firewall Rules

Now for the part that actually stops the bleeding. You've got two main approaches, and honestly, you should use both.

Approach 1: robots.txt (The Polite Ask)

# robots.txt — asking nicely
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Let regular search engines through
User-agent: Googlebot
Allow: /

User-agent: bingbot
Allow: /

The problem? robots.txt is a suggestion, not a wall. Some crawlers respect it. Some don't. Meta's crawler reportedly does honor robots.txt — but by the time you add the rule, the damage might already be done.
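Since it's only a suggestion, at least confirm the suggestion is actually being served. A quick check (yourdomain.com is a placeholder):

```shell
# Print each Disallow rule along with the user agent it applies to
curl -s https://yourdomain.com/robots.txt | grep -B1 'Disallow: /'
```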

Approach 2: Firewall/Server-Level Blocking (The Actual Wall)

This is what actually works. Here's an nginx example:

# /etc/nginx/conf.d/block-ai-crawlers.conf
# Block known AI training crawlers by user agent
map $http_user_agent $is_ai_crawler {
    default 0;
    ~*Meta-ExternalAgent 1;
    ~*GPTBot 1;
    ~*ClaudeBot 1;
    ~*CCBot 1;
    ~*Google-Extended 1;
    ~*Bytespider 1;       # TikTok/ByteDance
    ~*Amazonbot 1;
    ~*anthropic-ai 1;
    ~*Applebot-Extended 1;
}

server {
    # ... your existing config ...

    if ($is_ai_crawler) {
        return 403;  # Or 429 if you're feeling diplomatic
    }
}
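After reloading nginx, you can verify the rule fires by spoofing a blocked user agent (yourdomain.com is a placeholder):

```shell
# Should print 403 once the block is live
curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' https://yourdomain.com/

# A normal browser UA should still get 200 (assuming your homepage returns 200)
curl -s -o /dev/null -w '%{http_code}\n' https://yourdomain.com/
```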

For Apache users:

# .htaccess — block AI crawlers
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Meta-ExternalAgent|GPTBot|ClaudeBot|CCBot|Google-Extended|Bytespider) [NC]
RewriteRule .* - [F,L]

If you're on Cloudflare, you can set up a WAF rule to challenge or block these user agents without touching your server config at all.
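A custom rule expression along these lines should do it, sketched in Cloudflare's rules language (verify the agent list against your own logs):

```
(http.user_agent contains "Meta-ExternalAgent") or
(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider")
```

Set the rule's action to Block, or Managed Challenge if you'd rather not hard-refuse.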

The Monitoring Setup I Actually Use

Here's what ended up working for me across my sites:

  1. Umami for clean human analytics (self-hosted on a $5 VPS)
  2. GoAccess for real-time server log analysis — this is where you actually see the crawlers
  3. nginx rate limiting as a safety net for any bot that gets too aggressive
  4. robots.txt as the first polite line of defense
  5. Firewall rules for the crawlers that don't listen
# Quick GoAccess command to see top user agents from your logs
# This is how I spotted the problem in the first place
goaccess /var/log/nginx/access.log --log-format=COMBINED -o report.html

The GoAccess report immediately showed me that bot traffic was 40x my human traffic. Once you see that ratio, you can't unsee it.
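The rate limiting in item 3 is just two nginx directives. A minimal sketch (the zone name, size, and rates are placeholders to tune for your traffic):

```nginx
# /etc/nginx/conf.d/rate-limit.conf (included into the http context)
# Track clients by IP in a 10 MB zone; allow 5 requests/second sustained
limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

server {
    # ... your existing config ...

    location / {
        # Absorb bursts of up to 10 extra requests, then reject.
        # Add "limit_req_status 429;" to send 429 instead of the default 503.
        limit_req zone=perip burst=10 nodelay;
    }
}
```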

My Recommendation

If you run any public website, do these three things today:

  • Check your access logs. Grep for Meta-ExternalAgent, GPTBot, and CCBot. You might be surprised.
  • Set up both robots.txt and server-level blocking. Belt and suspenders.
  • Switch to privacy-focused analytics like Umami or Plausible so you have a clean baseline of real traffic.
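For the first step, a concrete version of that grep (adjust the log path for your server):

```shell
# Per-crawler request counts from your access log
for bot in Meta-ExternalAgent GPTBot CCBot; do
  printf '%-20s %s\n' "$bot" "$(grep -c "$bot" /var/log/nginx/access.log)"
done
```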

The Reddit post that started this conversation showed 7.9 million requests in 30 days from a single crawler. That's roughly 3 requests per second, 24/7. On a small site, that's not just rude — it's potentially site-breaking.
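That per-second figure is easy to sanity-check:

```shell
# 7.9M requests spread evenly over 30 days, in requests per second
awk 'BEGIN { printf "%.1f req/s\n", 7900000 / (30 * 24 * 3600) }'
# prints "3.0 req/s"
```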

The good news is that blocking these crawlers takes about 15 minutes. The bad news is that the list of AI crawlers keeps growing, so you'll want to revisit your blocklist periodically. I keep a bookmark to the Dark Visitors project, which maintains a solid list of known AI crawlers and their user agent strings.

Don't wait for a 900 GB bandwidth bill to figure this out. Go check your logs. Right now. I'll wait.
