robots.txt was never designed for this.
It was built in 1994 as a gentleman's agreement. A way for site owners to say "please don't crawl this." And for a long time, it worked fine.
Then AI happened.
GPTBot. ClaudeBot. CCBot. Bytespider. These bots are hitting every website on the internet, around the clock, feeding content into models worth billions. And a lot of them treat robots.txt as optional.
So I started building Alovia Shield.
The problem with user-agent blocking
The obvious first approach is to block by user-agent. It works — until it doesn't. User-agent spoofing is trivial: any bot can just pretend to be Googlebot and walk right through.
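A minimal sketch of what that approach looks like, and why it fails. The bot names come from the post; the blocklist logic and the `is_blocked` helper are illustrative, not Alovia Shield's code:

```python
# User-agent blocklist: trivially bypassed by spoofing the header.
BLOCKED_AGENTS = ["GPTBot", "ClaudeBot", "CCBot", "Bytespider"]

def is_blocked(user_agent: str) -> bool:
    """Block a request if its User-Agent header names a known AI crawler."""
    return any(bot.lower() in user_agent.lower() for bot in BLOCKED_AGENTS)

# An honest crawler identifies itself and gets blocked:
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.0)"))     # True

# The same crawler with a spoofed header walks right through:
print(is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # False
```

The User-Agent header is entirely client-controlled, which is the whole problem: the check only catches bots that choose to be caught.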
I needed something that holds up against that.
What I built instead
Behavioral fingerprinting. Instead of trusting what a bot claims to be, Alovia Shield looks at how it behaves — request patterns, timing, header anomalies, and a few other signals I'm still tuning.
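To make that concrete, here's a sketch of behavioral scoring under assumed signals — the actual signals and weights in Alovia Shield aren't public, so the thresholds and the `suspicion_score` function below are illustrative only:

```python
import statistics

def suspicion_score(timestamps: list[float], headers: dict[str, str]) -> float:
    """Score a client on how it behaves rather than what it claims to be.

    timestamps: epoch seconds of the client's recent requests
    headers: the latest request's headers

    Signals and weights here are hypothetical, not Alovia Shield's.
    """
    score = 0.0

    # Signal 1: machine-regular timing. Humans browse in bursts;
    # many bots fire at near-constant intervals.
    if len(timestamps) >= 3:
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        mean = statistics.mean(gaps)
        if mean > 0 and statistics.pstdev(gaps) / mean < 0.1:
            score += 0.5

    # Signal 2: header anomalies. Real browsers send these headers;
    # a "browser" arriving without them is suspect.
    if "Accept-Language" not in headers:
        score += 0.3
    if "Accept-Encoding" not in headers:
        score += 0.2

    return score  # e.g. challenge or block anything >= 0.5

# Metronome-regular requests with bare headers score high:
print(suspicion_score([0, 1, 2, 3, 4], {}))  # 1.0

# Bursty timing with normal browser headers scores low:
print(suspicion_score([0, 0.2, 5, 5.1],
                      {"Accept-Language": "en", "Accept-Encoding": "gzip"}))  # 0.0
```

The point of the approach: spoofing a header string is free, but faking human-like timing and a full browser header set across thousands of requests is expensive.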
The test suite runs 40 tests across 4 levels: LOW, MEDIUM, HARD, EXPERT. It currently passes 92% of them. Two false positives so far, both fixed by whitelisting legitimate IPs that were getting caught.
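For the false-positive case, there's a sturdier check than a static IP whitelist: Google documents validating Googlebot by reverse DNS plus a forward-confirm lookup. A sketch of that check (whether Alovia Shield does it this way is my assumption, not the post's):

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname: str) -> bool:
    """Pure check: does a reverse-DNS name belong to Google's crawler domains?"""
    return hostname.endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Google's documented validation: reverse DNS on the IP, check the
    domain, then forward-resolve the name back to the same IP.
    (Makes live DNS lookups — run against real traffic, not in unit tests.)
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname_is_google(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False

print(hostname_is_google("crawl-66-249-66-1.googlebot.com"))  # True
print(hostname_is_google("fake-googlebot.example.com"))       # False
```

The forward-confirm step matters: anyone can set a reverse-DNS record claiming to be Google, but only Google can make `googlebot.com` resolve back to their IP.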
What's next
Still pre-launch. Working on the watermarking layer next — so even if a crawler gets through, the content is traceable back to the source.
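The post doesn't describe the watermarking scheme, so as a sketch of what "traceable back to the source" can mean: one common text-watermarking technique hides a per-visitor ID in zero-width characters interleaved with the served content. Everything below is a hypothetical illustration, not Alovia Shield's implementation:

```python
# Zero-width space / zero-width non-joiner: invisible when rendered.
ZW0, ZW1 = "\u200b", "\u200c"

def embed(text: str, visitor_id: int, bits: int = 16) -> str:
    """Hide visitor_id as invisible bits right after the first word."""
    payload = "".join(ZW1 if (visitor_id >> i) & 1 else ZW0
                      for i in range(bits))
    head, sep, tail = text.partition(" ")
    return head + payload + sep + tail

def extract(text: str, bits: int = 16) -> int:
    """Recover the ID from scraped text, if the invisible bits survived."""
    found = [c for c in text if c in (ZW0, ZW1)][:bits]
    return sum(1 << i for i, c in enumerate(found) if c == ZW1)

marked = embed("Original article text here", visitor_id=0xBEEF)
print(marked == "Original article text here")  # False — but looks identical
print(extract(marked) == 0xBEEF)               # True — leak traces to one visitor
```

The weakness, of course, is that zero-width characters are easy to strip once a scraper knows to look for them, which is why real watermarking schemes layer several redundant channels.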
If you've dealt with aggressive AI crawlers on your own projects, I'd love to hear how you handled it.
→ aloviaai.com