Joseph Hernandez

How I tracked which AI bots actually crawl my site

I launched a new domain two weeks ago and wanted to know which AI bots were actually showing up — not theoretically, but in my CloudFront logs. So I built a small tracker that parses access logs from S3 and reports hits per bot per URL.

After 5 days, here's what the data shows.

The setup

The site is easerva.com — static HTML on S3 + CloudFront, zero JavaScript, JSON-LD on every page, sitemap submitted to GSC and Bing Webmaster Tools, IndexNow integrated.

I enabled CloudFront standard logging (free, writes gzipped logs to S3 every few minutes), then wrote a script that filters by user-agent string for the bots that matter: Googlebot, Bingbot, OAI-SearchBot, ChatGPT-User, GPTBot, PerplexityBot, ClaudeBot, Claude-User, Applebot.
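The script itself is small enough to sketch. This is a minimal, hypothetical version of the matching logic, not my exact code: it assumes CloudFront's legacy tab-separated standard-log format (0-indexed fields: URI at 7, status at 8, user-agent at 10) and the fact that CloudFront URL-encodes the user-agent, so `%20` needs decoding before matching. The bot list and function names are illustrative.

```python
import gzip
from collections import Counter
from urllib.parse import unquote

# Substrings to match in the decoded user-agent string.
BOTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "ChatGPT-User",
        "GPTBot", "PerplexityBot", "ClaudeBot", "Claude-User", "Applebot"]

def read_log(path):
    """Yield lines from one gzipped CloudFront log file."""
    with gzip.open(path, "rt") as f:
        yield from f

def tally(lines):
    """Count (bot, uri) hits and per-bot 4xx/5xx totals from
    CloudFront standard-log lines (tab-separated, 0-indexed fields)."""
    hits, errors = Counter(), Counter()
    for line in lines:
        if line.startswith("#"):  # skip #Version / #Fields header lines
            continue
        fields = line.rstrip("\n").split("\t")
        uri, status, ua = fields[7], fields[8], unquote(fields[10])
        for bot in BOTS:
            if bot in ua:
                hits[(bot, uri)] += 1
                if status.startswith(("4", "5")):
                    errors[bot] += 1
                break
    return hits, errors
```

From there it's a `Counter` rollup per bot and per URL; pulling the `.gz` files down from the logging bucket first (e.g. with `aws s3 sync`) keeps the script itself free of AWS dependencies.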

The 5-day results

Bot                Type                        Hits   URLs   Errors
Bingbot            Search crawler                16      8        3
OAI-SearchBot      Persistent index crawler      28      2        0
ChatGPT-User       Live fetch agent               0      0        0
PerplexityBot      Persistent index crawler       0      0        0
Googlebot          Search crawler                10      4        0
ClaudeBot          Persistent index crawler      80      2        0
Claude-User        Live fetch agent               0      0        0

Three things jumped out

ClaudeBot is hungry. 80 hits in 5 days, all on /robots.txt and /sitemap.xml. No content fetches yet. This is normal early-stage discovery — crawlers poll permissions before allocating crawl budget — but the volume surprised me. 40 robots.txt fetches is far more than either Googlebot or Bingbot made.

Bingbot is the canary. Only 16 hits, but unlike Claude and OpenAI it followed through to actual content. It also surfaced a real bug for me: 3 of those hits were 403 errors on URLs I hadn't actually published. My IndexNow code was generating URLs from a template pattern instead of from real S3 objects, so it was advertising pages that didn't exist. CloudFront returned 403 (S3's default for missing objects with restrictive bucket policies) instead of 404. I fixed both — added a CloudFront custom error response to rewrite 403 → 404, and refactored IndexNow to derive submitted URLs from the sitemap.
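The IndexNow fix boils down to one rule: only submit URLs the sitemap actually lists, so the two can never disagree. A hedged sketch of that refactor — the endpoint is the public IndexNow API, the host/key values are placeholders, and this is not my exact code:

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(xml_text):
    """Extract <loc> values from a sitemap so IndexNow only
    advertises pages that actually exist."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def submit_indexnow(host, key, urls):
    """POST the URL list to the shared IndexNow endpoint.
    'key' is the site's IndexNow key (served as a key file on the host)."""
    body = json.dumps({"host": host, "key": key, "urlList": urls}).encode()
    req = urllib.request.Request(
        "https://api.indexnow.org/indexnow", data=body,
        headers={"Content-Type": "application/json; charset=utf-8"})
    return urllib.request.urlopen(req)
```

Deriving the submission list from the sitemap (rather than a template pattern) means a missing S3 object can only ever be a sitemap bug, which the search consoles will flag anyway.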

Live-fetch agents are silent. Zero hits from ChatGPT-User or Claude-User. Makes sense — these only fire when a user asks the AI a question that requires real-time browsing, and a brand-new domain isn't relevant to any query yet. Worth noting: as of December 2025, OpenAI's docs explicitly state ChatGPT-User does NOT respect robots.txt, since user-initiated fetches are treated as proxy human browsing.

What I'm operating on

  • Persistent crawlers (OAI-SearchBot, ClaudeBot, PerplexityBot) build indexes. Live-fetch agents (ChatGPT-User, Claude-User) fetch on demand. Different timing patterns, different optimization implications. Track them separately.
  • Don't read into early-stage silence. Discovery → robots.txt polling → sitemap fetch → content crawl is a multi-week process for new domains. Repeated robots.txt fetches are a good sign.
  • Bingbot surfaces bugs early because it follows through to content URLs faster than the AI-native crawlers. Watch its error column.

Setting up the same tracking on AWS

  1. Create an S3 bucket with BucketOwnerPreferred ownership and an ACL grant for CloudFront's log delivery canonical user
  2. Enable Standard Logging on your CloudFront distribution, point at the bucket
  3. Wait ~30 minutes, hit your site, confirm .gz files appear
  4. Parse logs: fields are tab-separated; counting from 0, the user-agent is field 10 and the URI is field 7
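Step 4 is quick to sanity-check before writing any real tooling. A minimal sketch, assuming the tab-separated format above (field indices count from 0; the sample line is fabricated for illustration):

```python
from urllib.parse import unquote

def uri_and_agent(line):
    """Pull the URI (field 7) and decoded user-agent (field 10)
    from one CloudFront standard-log data line."""
    fields = line.rstrip("\n").split("\t")
    return fields[7], unquote(fields[10])
```

Piping every non-`#` line through this and feeding the pairs into `collections.Counter` reproduces the hits-per-bot-per-URL table above.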

Standard logging is free. Real-time via Kinesis costs money and isn't needed at low traffic.

Source for my tracker is on GitHub if you want to fork it instead of writing your own.

What I'm watching next

The transition from robots.txt polling to actual content crawling — when ClaudeBot and OAI-SearchBot start fetching /providers/... URLs instead of just /robots.txt. That's the signal the site has moved from "discovered" to "being indexed." I'll post a 30-day follow-up.

If you're tracking AI bot patterns on your own site, I'd love to hear what you're seeing.
