DEV Community

MaxxMini


Your robots.txt Won't Save You: What Actually Works Against AI Scrapers

AI bots now account for nearly 40% of all web traffic. If you think robots.txt is protecting your content, think again.

The Problem: robots.txt Is Just a Suggestion

Here's the uncomfortable truth: robots.txt is a voluntary protocol. Legitimate crawlers like Googlebot respect it. AI scrapers? Most don't.

# Your robots.txt
User-agent: GPTBot
Disallow: /

# Reality: GPTBot might respect this.
# The other 200+ AI scrapers? Nope.

I ran a honeypot experiment on my own sites. Within 48 hours:

  • 73% of AI bot requests completely ignored robots.txt
  • Bots spoofed legitimate User-Agent strings
  • Some rotated IPs every few requests

What Actually Works

After weeks of testing, here's what moved the needle:

1. Rate Limiting by Behavior, Not User-Agent

User-Agent strings are trivially spoofed. Instead, detect bot behavior:

# Nginx: rate-limit aggressive crawlers
# (limit_req_zone goes in the http {} block; the location goes in your server {})
limit_req_zone $binary_remote_addr zone=antibotzone:10m rate=10r/m;

location / {
    limit_req zone=antibotzone burst=5 nodelay;
}

Real users don't request 50 pages in 60 seconds. Bots do.
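The same idea works at the application layer if you can't touch the web server config. Here's a minimal sliding-window sketch using that 50-requests-in-60-seconds rule of thumb; the in-process dict is an assumption for illustration — in production you'd back this with Redis or similar so it survives restarts and works across workers:

```python
import time
from collections import defaultdict, deque

WINDOW = 60   # seconds
LIMIT = 50    # max requests per window per IP

_requests = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return False once an IP exceeds LIMIT requests within WINDOW seconds."""
    now = time.monotonic() if now is None else now
    q = _requests[ip]
    # Evict timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```

Call allow(request.remote_addr) before serving; the 51st request inside a minute gets refused, while an IP that backs off is automatically forgiven as its timestamps age out.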

2. JavaScript Challenge Layer

Most AI scrapers don't execute JavaScript. A simple challenge blocks 80%+ of them:

<script>
  // Set a cookie that proves JS execution
  document.cookie = "js_check=" + btoa(Date.now()) + ";path=/;max-age=3600";
</script>

Then validate server-side:

# Python/Flask example
from flask import request, render_template

@app.before_request
def check_js():
    if not request.cookies.get('js_check'):
        # No proof of JS execution - serve the challenge page
        # (captcha.html runs the snippet above, sets the cookie, and reloads)
        return render_template('captcha.html'), 403
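Checking mere presence of the cookie is weak — a scraper can replay a captured value forever. Since the cookie is btoa(Date.now()), you can decode it and reject stale timestamps too. A sketch of that stricter check (MAX_AGE_MS mirroring the cookie's max-age is an assumption you'd tune):

```python
import base64
import time

MAX_AGE_MS = 3600 * 1000  # mirrors the cookie's max-age=3600

def js_check_valid(cookie_value, now_ms=None):
    """Decode a btoa(Date.now()) cookie and reject stale or malformed values."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    try:
        issued_ms = int(base64.b64decode(cookie_value).decode())
    except ValueError:
        return False  # not a base64-encoded timestamp
    age = now_ms - issued_ms
    return 0 <= age <= MAX_AGE_MS  # reject future-dated and expired cookies
```

A replayed cookie goes stale within the hour, forcing the client back through the JS challenge.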

3. Honeypot Traps

Create invisible links that only bots follow:

<a href="/trap" style="position:absolute;left:-9999px;opacity:0" 
   aria-hidden="true" tabindex="-1">
  Definitely not a trap
</a>

Add Disallow: /trap to your robots.txt first, so well-behaved crawlers like Googlebot never follow the link. Then any IP that hits /trap gets auto-blocked:

from flask import request

@app.route('/trap')
def honeypot():
    ip = request.remote_addr
    block_ip(ip, duration=86400)  # block for 24h (your own helper)
    log_bot_attempt(ip, request.headers)  # ditto
    return '', 204
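block_ip above is left as an exercise. A minimal in-memory version could look like this — note it's single-process only, so anything real should use a shared store or push the block down to the firewall:

```python
import time

_blocked = {}  # ip -> expiry timestamp (epoch seconds)

def block_ip(ip, duration=86400, now=None):
    """Mark an IP as blocked until now + duration."""
    now = time.time() if now is None else now
    _blocked[ip] = now + duration

def is_blocked(ip, now=None):
    """Check (and lazily expire) a block entry for this IP."""
    now = time.time() if now is None else now
    expiry = _blocked.get(ip)
    if expiry is None:
        return False
    if now >= expiry:
        del _blocked[ip]  # block expired, forget the IP
        return False
    return True
```

Wire is_blocked() into the same before_request hook as the JS check so blocked IPs are refused everywhere, not just at /trap.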

4. Dynamic Content Fingerprinting

Embed invisible fingerprints in your content. When scraped content appears elsewhere, you can prove ownership:


// Inject zero-width characters as a fingerprint
function fingerprint(text, siteId) {
  const binary = siteId.toString(2).padStart(16, '0');
  return text.split('').map((char, i) => {
    if (i < binary.length) {
      // '\u200B' (zero-width space) = 1, '\u200C' (zero-width non-joiner) = 0
      return char + (binary[i] === '1' ? '\u200B' : '\u200C');
    }
    return char;
  }).join('');
}
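To actually prove ownership you need the matching decoder. This sketch assumes the encoder pairs each of the first 16 characters with U+200B for a 1 bit and U+200C for a 0 bit (one plausible reading of the snippet above); the Python mirror of the encoder is included so you can round-trip a sample:

```python
ZW1, ZW0 = "\u200b", "\u200c"  # zero-width space = 1, zero-width non-joiner = 0

def fingerprint(text, site_id):
    """Python mirror of the JS encoder: interleave 16 bits after the first 16 chars."""
    binary = format(site_id, "016b")
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < len(binary):
            out.append(ZW1 if binary[i] == "1" else ZW0)
    return "".join(out)

def extract_site_id(text):
    """Recover the site ID from scraped text by reading the zero-width bits back."""
    bits = ["1" if ch == ZW1 else "0" for ch in text if ch in (ZW1, ZW0)]
    if len(bits) < 16:
        return None  # not enough embedded bits to recover an ID
    return int("".join(bits[:16]), 2)
```

Caveats: the text needs at least 16 characters to carry a full ID, and any pipeline that normalizes Unicode or strips zero-width characters will destroy the watermark — so treat this as one signal among several, not proof on its own.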

Top comments (2)

MaxxMini

Thanks for sharing your experience! Rate limiting + Cloudflare rules is a solid combo — basically the 80/20 of bot defense.

For the honeypot approach, the key is making the hidden link look valuable (like /api/v2/data or /admin/export) so crawlers can't resist. Once they hit it, you have a clean signal to block the entire IP range.

One thing I'd add: check your server logs for User-Agent strings first. You'll be surprised how many bots don't even bother to disguise themselves — easy wins before setting up anything complex.

Bhavin Sheth

This is very real. I run a tools website and saw the same thing — robots.txt was respected by Google, but many unknown bots kept hitting tool pages aggressively.

What helped me most was simple rate limiting and Cloudflare firewall rules. Traffic dropped instantly without affecting real users.

The honeypot idea is smart — I haven’t tried that yet, but after reading this, I’m definitely going to implement it.