AI bots are crawling the web at unprecedented scale. GPTBot, ClaudeBot, Googlebot, and dozens of others visit millions of sites daily. Most site owners have no idea which bots visit, how often, or what they do.
We built a detection system to find out. Here's how it works.
Layer 1: User-Agent Detection
The simplest approach: match user-agent strings against known bot signatures. We maintain a database of 30+ AI bot user-agents, including GPTBot, ClaudeBot, CCBot, Bytespider, and PetalBot. This catches roughly 80% of known bots.
The signatures are checked in Next.js middleware on every request, adding less than 1ms latency. Simple but effective.
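The matching itself can be sketched as a small lookup function. The signature list below is illustrative, not our full 30+ entry database, and the function names are our own for this example:

```typescript
// Layer 1 sketch: match the User-Agent header against known AI-bot
// signatures. Illustrative subset of the signature database.
const BOT_SIGNATURES: Record<string, RegExp> = {
  GPTBot: /GPTBot/i,
  ClaudeBot: /ClaudeBot/i,
  CCBot: /CCBot/i,
  Bytespider: /Bytespider/i,
  PetalBot: /PetalBot/i,
  Googlebot: /Googlebot/i,
};

function detectBot(userAgent: string | null): string | null {
  if (!userAgent) return null;
  for (const [name, pattern] of Object.entries(BOT_SIGNATURES)) {
    if (pattern.test(userAgent)) return name;
  }
  return null;
}
```

In middleware this would be called as `detectBot(request.headers.get("user-agent"))` on each request; a handful of regex tests is what keeps the added latency under 1ms.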
Layer 2: Behavioral Fingerprinting
Some bots disguise their user-agent. We detect these through behavior:
- Request timing — bots are more regular than humans
- Header patterns — bots often omit Accept-Language
- TLS fingerprints — JA3/JA4 fingerprinting reveals bot clients
- Navigation patterns — bots don't scroll, hover, or generate mouse events
We track page transitions to build a crawl graph per visitor.
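Two of these signals, missing Accept-Language and overly regular timing, lend themselves to a simple score. The sketch below is a minimal illustration; the signal weights, the 0.1 coefficient-of-variation cutoff, and the `RequestSignals` shape are assumptions for this example, not our production scoring:

```typescript
// Layer 2 sketch: score a visitor by how bot-like their headers and
// request timing look. Weights and thresholds are illustrative.
interface RequestSignals {
  acceptLanguage: string | null; // bots often omit this header
  interRequestGapsMs: number[];  // gaps between this visitor's requests
}

function botScore(signals: RequestSignals): number {
  let score = 0;
  if (!signals.acceptLanguage) score += 0.4;

  // Very regular timing (low variance relative to the mean) suggests a bot.
  const gaps = signals.interRequestGapsMs;
  if (gaps.length >= 3) {
    const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length;
    const variance = gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length;
    const cv = mean > 0 ? Math.sqrt(variance) / mean : 0; // coefficient of variation
    if (cv < 0.1) score += 0.6;
  }
  return Math.min(score, 1);
}
```

A crawler requesting a page almost exactly once per second with no Accept-Language header scores near 1; a human clicking around at irregular intervals scores near 0.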
Layer 3: Capability Testing
The most interesting layer. We serve progressively harder challenges:
- Can the bot follow JavaScript-rendered links?
- Can it fill out a form?
- Can it parse structured data?
- Can it read a crypto wallet address?
Each test reveals different capability tiers — from basic crawlers to fully autonomous AI agents.
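One way to map challenge results to tiers is a simple ordered classification. The tier names and the ordering below are illustrative assumptions, not our production taxonomy:

```typescript
// Layer 3 sketch: classify a visitor into a capability tier from which
// challenges it passed. Tier names and ordering are illustrative.
interface ChallengeResults {
  followedJsLink: boolean;       // followed a JavaScript-rendered link
  submittedForm: boolean;        // filled out and submitted a form
  parsedStructuredData: boolean; // acted on embedded structured data
  readWalletAddress: boolean;    // extracted a crypto wallet address
}

type Tier = "basic-crawler" | "renderer" | "autonomous-agent";

function classifyTier(r: ChallengeResults): Tier {
  // Interaction (forms, wallet reads) implies the highest tier.
  if (r.submittedForm || r.readWalletAddress) return "autonomous-agent";
  // Rendering JS or consuming structured data beats plain HTML fetching.
  if (r.followedJsLink || r.parsedStructuredData) return "renderer";
  return "basic-crawler";
}
```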
Architecture
The system runs as Next.js middleware on Vercel Edge. Bot detection happens at the edge with zero cold start. Detections are logged to Supabase in the background using event.waitUntil() so they don't block the response. A daily cron aggregates per-bot statistics and funnel metrics.
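The background-logging pattern looks roughly like this. The sketch below types the edge event minimally rather than importing Next.js types, and the Supabase URL, API key, and table name are placeholders:

```typescript
// Architecture sketch: log a detection to Supabase via event.waitUntil(),
// so the write runs after the response is sent and never blocks it.
type EdgeEvent = { waitUntil(p: Promise<unknown>): void };

function logDetection(
  event: EdgeEvent,
  detection: { bot: string; path: string }
): Response {
  event.waitUntil(
    // Placeholder project URL, table, and key, for illustration only.
    fetch("https://example.supabase.co/rest/v1/bot_detections", {
      method: "POST",
      headers: { "Content-Type": "application/json", apikey: "anon-key" },
      body: JSON.stringify({ ...detection, ts: new Date().toISOString() }),
    }).catch(() => {}) // never let a logging failure surface to the visitor
  );
  return new Response("ok");
}
```

In real middleware the `event` parameter is the `NextFetchEvent` passed to the middleware function; the key point is that the response returns immediately while the insert completes in the background.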
What We Found
After running this on global-chat.io for several weeks:
- 10 unique AI bots visit regularly
- Googlebot is the most frequent (2-3x daily)
- GPTBot and ClaudeBot visit within hours of content changes
- Most bots only crawl 1-2 pages per visit — crawl depth is surprisingly shallow
- Schema.org structured data correlates with more frequent re-crawls
- None of the crawlers have passed our form interaction test yet
Key Takeaway
If you want AI bots to find your content, focus on structured data (Schema.org JSON-LD), comprehensive sitemaps, and the IndexNow protocol. These signals matter more than raw content volume.
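For the structured-data piece, a minimal Schema.org payload of the kind that correlated with more frequent re-crawls looks like this. The field values below are illustrative placeholders:

```typescript
// Minimal Schema.org Article payload as JSON-LD. Values are placeholders.
const jsonLd = {
  "@context": "https://schema.org",
  "@type": "Article",
  headline: "How We Detect AI Bots",
  author: { "@type": "Organization", name: "global-chat.io" },
};

// In a page this would be embedded as:
//   <script type="application/ld+json">{JSON.stringify(jsonLd)}</script>
const serialized = JSON.stringify(jsonLd);
```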
Full writeup with more details: How We Detect AI Bots