DEV Community

MaxxMini


Your robots.txt Won't Save You: What Actually Works Against AI Scrapers

AI bots now account for nearly 40% of all web traffic. If you think robots.txt is protecting your content, think again.

The Problem: robots.txt Is Just a Suggestion

Here's the uncomfortable truth: robots.txt is a voluntary protocol. Legitimate crawlers like Googlebot respect it. AI scrapers? Most don't.

# Your robots.txt
User-agent: GPTBot
Disallow: /

# Reality: GPTBot might respect this.
# The other 200+ AI scrapers? Nope.

I ran a honeypot experiment on my own sites. Within 48 hours:

  • 73% of AI bot requests completely ignored robots.txt
  • Bots spoofed legitimate User-Agent strings
  • Some rotated IPs every few requests

What Actually Works

After weeks of testing, here's what moved the needle:

1. Rate Limiting by Behavior, Not User-Agent

User-Agent strings are trivially spoofed. Instead, detect bot behavior:

# Nginx: rate-limit aggressive crawlers
# (limit_req_zone goes in the http {} block; the location goes in your server {})
limit_req_zone $binary_remote_addr zone=antibotzone:10m rate=10r/m;

location / {
    limit_req zone=antibotzone burst=5 nodelay;
}

Real users don't request 50 pages in 60 seconds. Bots do.
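The same idea works at the application layer if you can't touch the web server config. Here's a minimal sliding-window sketch using that 50-requests-in-60-seconds rule of thumb; the in-process dict is an assumption for illustration — in production you'd back this with Redis or similar so it survives restarts and works across workers:

```python
import time
from collections import defaultdict, deque

WINDOW = 60   # seconds
LIMIT = 50    # max requests per window per IP

_requests = defaultdict(deque)  # ip -> timestamps of recent requests

def allow(ip, now=None):
    """Return False once an IP exceeds LIMIT requests within WINDOW seconds."""
    now = time.monotonic() if now is None else now
    q = _requests[ip]
    # Evict timestamps that have fallen out of the window
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= LIMIT:
        return False
    q.append(now)
    return True
```

Call allow(request.remote_addr) before serving; the 51st request inside a minute gets refused, while an IP that backs off is automatically forgiven as its timestamps age out.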

2. JavaScript Challenge Layer

Most AI scrapers don't execute JavaScript. A simple challenge blocks 80%+ of them:

<script>
  // Set a cookie that proves JS execution
  document.cookie = "js_check=" + btoa(Date.now()) + ";path=/;max-age=3600";
</script>

Then validate server-side:

# Python/Flask example
from flask import request, render_template

@app.before_request
def check_js():
    if not request.cookies.get('js_check'):
        # No proof of JS execution - serve the challenge page
        # (captcha.html runs the snippet above, sets the cookie, and reloads)
        return render_template('captcha.html'), 403
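Checking mere presence of the cookie is weak — a scraper can replay a captured value forever. Since the cookie is btoa(Date.now()), you can decode it and reject stale timestamps too. A sketch of that stricter check (MAX_AGE_MS mirroring the cookie's max-age is an assumption you'd tune):

```python
import base64
import time

MAX_AGE_MS = 3600 * 1000  # mirrors the cookie's max-age=3600

def js_check_valid(cookie_value, now_ms=None):
    """Decode a btoa(Date.now()) cookie and reject stale or malformed values."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    try:
        issued_ms = int(base64.b64decode(cookie_value).decode())
    except ValueError:
        return False  # not a base64-encoded timestamp
    age = now_ms - issued_ms
    return 0 <= age <= MAX_AGE_MS  # reject future-dated and expired cookies
```

A replayed cookie goes stale within the hour, forcing the client back through the JS challenge.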

3. Honeypot Traps

Create invisible links that only bots follow:

<a href="/trap" style="position:absolute;left:-9999px;opacity:0" 
   aria-hidden="true" tabindex="-1">
  Definitely not a trap
</a>

Add Disallow: /trap to your robots.txt first, so well-behaved crawlers like Googlebot never follow the link. Then any IP that hits /trap gets auto-blocked:

from flask import request

@app.route('/trap')
def honeypot():
    ip = request.remote_addr
    block_ip(ip, duration=86400)  # block for 24h (your own helper)
    log_bot_attempt(ip, request.headers)  # ditto
    return '', 204
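block_ip above is left as an exercise. A minimal in-memory version could look like this — note it's single-process only, so anything real should use a shared store or push the block down to the firewall:

```python
import time

_blocked = {}  # ip -> expiry timestamp (epoch seconds)

def block_ip(ip, duration=86400, now=None):
    """Mark an IP as blocked until now + duration."""
    now = time.time() if now is None else now
    _blocked[ip] = now + duration

def is_blocked(ip, now=None):
    """Check (and lazily expire) a block entry for this IP."""
    now = time.time() if now is None else now
    expiry = _blocked.get(ip)
    if expiry is None:
        return False
    if now >= expiry:
        del _blocked[ip]  # block expired, forget the IP
        return False
    return True
```

Wire is_blocked() into the same before_request hook as the JS check so blocked IPs are refused everywhere, not just at /trap.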

4. Dynamic Content Fingerprinting

Embed invisible fingerprints in your content. When scraped content appears elsewhere, you can prove ownership:


// Inject zero-width characters as a fingerprint
function fingerprint(text, siteId) {
  const binary = siteId.toString(2).padStart(16, '0');
  return text.split('').map((char, i) => {
    if (i < binary.length) {
      // '\u200B' (zero-width space) = 1, '\u200C' (zero-width non-joiner) = 0
      return char + (binary[i] === '1' ? '\u200B' : '\u200C');
    }
    return char;
  }).join('');
}
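To actually prove ownership you need the matching decoder. This sketch assumes the encoder pairs each of the first 16 characters with U+200B for a 1 bit and U+200C for a 0 bit (one plausible reading of the snippet above); the Python mirror of the encoder is included so you can round-trip a sample:

```python
ZW1, ZW0 = "\u200b", "\u200c"  # zero-width space = 1, zero-width non-joiner = 0

def fingerprint(text, site_id):
    """Python mirror of the JS encoder: interleave 16 bits after the first 16 chars."""
    binary = format(site_id, "016b")
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < len(binary):
            out.append(ZW1 if binary[i] == "1" else ZW0)
    return "".join(out)

def extract_site_id(text):
    """Recover the site ID from scraped text by reading the zero-width bits back."""
    bits = ["1" if ch == ZW1 else "0" for ch in text if ch in (ZW1, ZW0)]
    if len(bits) < 16:
        return None  # not enough embedded bits to recover an ID
    return int("".join(bits[:16]), 2)
```

Caveats: the text needs at least 16 characters to carry a full ID, and any pipeline that normalizes Unicode or strips zero-width characters will destroy the watermark — so treat this as one signal among several, not proof on its own.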

Top comments (2)

MaxxMini

Thanks for sharing your experience! Rate limiting + Cloudflare rules is a solid combo — basically the 80/20 of bot defense.

For the honeypot approach, the key is making the hidden link look valuable (like /api/v2/data or /admin/export) so crawlers can't resist. Once they hit it, you have a clean signal to block the entire IP range.

One thing I'd add: check your server logs for User-Agent strings first. You'll be surprised how many bots don't even bother to disguise themselves — easy wins before setting up anything complex.

Bhavin Sheth

This is very real. I run a tools website and saw the same thing — robots.txt was respected by Google, but many unknown bots kept hitting tool pages aggressively.

What helped me most was simple rate limiting and Cloudflare firewall rules. Traffic dropped instantly without affecting real users.

The honeypot idea is smart — I haven’t tried that yet, but after reading this, I’m definitely going to implement it.