<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joseph Hernandez</title>
    <description>The latest articles on DEV Community by Joseph Hernandez (@joseph_easerva).</description>
    <link>https://dev.to/joseph_easerva</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898056%2F66ca512c-897a-4159-8b6c-47f3655b85ea.png</url>
      <title>DEV Community: Joseph Hernandez</title>
      <link>https://dev.to/joseph_easerva</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joseph_easerva"/>
    <language>en</language>
    <item>
      <title>How I tracked which AI bots actually crawl my site</title>
      <dc:creator>Joseph Hernandez</dc:creator>
      <pubDate>Sat, 25 Apr 2026 21:55:44 +0000</pubDate>
      <link>https://dev.to/joseph_easerva/how-i-tracked-which-ai-bots-actually-crawl-my-site-33fg</link>
      <guid>https://dev.to/joseph_easerva/how-i-tracked-which-ai-bots-actually-crawl-my-site-33fg</guid>
      <description>&lt;p&gt;I launched a new domain two weeks ago and wanted to know which AI bots were actually showing up — not theoretically, but in my CloudFront logs. So I built a small tracker that parses access logs from S3 and reports hits per bot per URL.&lt;/p&gt;

&lt;p&gt;After 5 days, here's what the data shows.&lt;/p&gt;

&lt;h2&gt;The setup&lt;/h2&gt;

&lt;p&gt;The site is &lt;a href="https://easerva.com" rel="noopener noreferrer"&gt;easerva.com&lt;/a&gt; — static HTML on S3 + CloudFront, zero JavaScript, JSON-LD on every page, sitemap submitted to GSC and Bing Webmaster Tools, IndexNow integrated.&lt;/p&gt;

&lt;p&gt;I enabled CloudFront standard logging (free, writes gzipped logs to S3 every few minutes), then wrote a script that filters by user-agent string for the bots that matter: &lt;code&gt;Googlebot&lt;/code&gt;, &lt;code&gt;Bingbot&lt;/code&gt;, &lt;code&gt;OAI-SearchBot&lt;/code&gt;, &lt;code&gt;ChatGPT-User&lt;/code&gt;, &lt;code&gt;GPTBot&lt;/code&gt;, &lt;code&gt;PerplexityBot&lt;/code&gt;, &lt;code&gt;ClaudeBot&lt;/code&gt;, &lt;code&gt;Claude-User&lt;/code&gt;, &lt;code&gt;Applebot&lt;/code&gt;.&lt;/p&gt;
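&lt;p&gt;For reference, here's a minimal sketch of that filter, assuming the gzipped logs have already been synced to a local directory (paths and the per-URL aggregation shape are illustrative; my real tracker also tracks the error column):&lt;/p&gt;

```python
import glob
import gzip
from collections import defaultdict

# Bot tokens to look for in the cs(User-Agent) field. CloudFront URL-encodes
# the user agent (spaces become %20), but these tokens contain no spaces,
# so substring matching works on the encoded string as-is.
BOTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "ChatGPT-User", "GPTBot",
        "PerplexityBot", "ClaudeBot", "Claude-User", "Applebot"]

def classify(user_agent):
    """Return the first known bot token found in a user-agent string, else None."""
    for bot in BOTS:
        if bot in user_agent:
            return bot
    return None

def tally(log_dir):
    """Count hits per (bot, URI) across gzipped CloudFront standard logs."""
    hits = defaultdict(int)
    for path in sorted(glob.glob(f"{log_dir}/*.gz")):
        with gzip.open(path, "rt") as fh:
            for line in fh:
                if line.startswith("#"):  # skip #Version / #Fields header lines
                    continue
                fields = line.rstrip("\n").split("\t")
                bot = classify(fields[10])       # cs(User-Agent), 0-indexed
                if bot:
                    hits[(bot, fields[7])] += 1  # cs-uri-stem
    return hits
```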

&lt;h2&gt;The 5-day results&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bot                Type                        Hits   URLs   Errors
Bingbot            Search crawler                16      8        3
OAI-SearchBot      Persistent index crawler      28      2        0
ChatGPT-User       Live fetch agent               0      0        0
PerplexityBot      Persistent index crawler       0      0        0
Googlebot          Search crawler                10      4        0
ClaudeBot          Persistent index crawler      80      2        0
Claude-User        Live fetch agent               0      0        0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Three things jumped out&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ClaudeBot is hungry.&lt;/strong&gt; 80 hits in 5 days, all on &lt;code&gt;/robots.txt&lt;/code&gt; and &lt;code&gt;/sitemap.xml&lt;/code&gt;. No content fetches yet. This is normal early-stage discovery — crawlers poll permissions before allocating crawl budget — but the volume surprised me: 40 robots.txt fetches is more than Googlebot and Bingbot made across all URLs combined.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bingbot is the canary.&lt;/strong&gt; Only 16 hits, but unlike Claude and OpenAI it followed through to actual content. It also surfaced a real bug for me: 3 of those hits were 403 errors on URLs I hadn't actually published. My IndexNow code was generating URLs from a template pattern instead of from real S3 objects, so it was advertising pages that didn't exist. CloudFront returned 403 (S3's default for missing objects with restrictive bucket policies) instead of 404. I fixed both — added a CloudFront custom error response to rewrite 403 → 404, and refactored IndexNow to derive submitted URLs from the sitemap.&lt;/p&gt;
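&lt;p&gt;The IndexNow fix is worth sketching: instead of expanding a URL template, derive the submission list from the sitemap, so you can only ever advertise pages that exist. A rough outline (the host and key are placeholders, and actually POSTing the payload to an IndexNow endpoint is left out):&lt;/p&gt;

```python
import json
import xml.etree.ElementTree as ET

# Sitemaps live in this XML namespace; ElementTree needs it spelled out.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def urls_from_sitemap(sitemap_xml):
    """Extract every loc URL from sitemap XML text."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

def indexnow_payload(host, key, urls):
    """Build the JSON body for a batch IndexNow submission."""
    return json.dumps({"host": host, "key": key, "urlList": urls})
```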

&lt;p&gt;&lt;strong&gt;Live-fetch agents are silent.&lt;/strong&gt; Zero hits from &lt;code&gt;ChatGPT-User&lt;/code&gt; or &lt;code&gt;Claude-User&lt;/code&gt;. Makes sense — these only fire when a user asks the AI a question that requires real-time browsing, and a brand-new domain isn't relevant to any query yet. Worth noting: as of December 2025, OpenAI's docs explicitly state ChatGPT-User does NOT respect robots.txt, since user-initiated fetches are treated as proxy human browsing.&lt;/p&gt;

&lt;h2&gt;What I'm operating on&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent crawlers (&lt;code&gt;OAI-SearchBot&lt;/code&gt;, &lt;code&gt;ClaudeBot&lt;/code&gt;, &lt;code&gt;PerplexityBot&lt;/code&gt;) build indexes. Live-fetch agents (&lt;code&gt;ChatGPT-User&lt;/code&gt;, &lt;code&gt;Claude-User&lt;/code&gt;) fetch on demand.&lt;/strong&gt; Different timing patterns, different optimization implications. Track them separately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't read into early-stage silence.&lt;/strong&gt; Discovery → robots.txt polling → sitemap fetch → content crawl is a multi-week process for new domains. Repeated robots.txt fetches are a &lt;em&gt;good&lt;/em&gt; sign.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bingbot surfaces bugs early&lt;/strong&gt; because it follows through to content URLs faster than the AI-native crawlers. Watch its error column.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Setting up the same tracking on AWS&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create an S3 bucket with &lt;code&gt;BucketOwnerPreferred&lt;/code&gt; ownership and an ACL grant for CloudFront's log delivery canonical user&lt;/li&gt;
&lt;li&gt;Enable Standard Logging on your CloudFront distribution, point at the bucket&lt;/li&gt;
&lt;li&gt;Wait ~30 minutes, hit your site, confirm &lt;code&gt;.gz&lt;/code&gt; files appear&lt;/li&gt;
&lt;li&gt;Parse the logs: fields are tab-separated and 0-indexed once split — the URI is field 7 and the user-agent is field 10 (skip header lines starting with &lt;code&gt;#&lt;/code&gt;)&lt;/li&gt;
&lt;/ol&gt;
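&lt;p&gt;Step 4 in code, with the field positions counted from 0 after splitting on tabs. The decode step matters because CloudFront URL-encodes the user-agent field:&lt;/p&gt;

```python
from urllib.parse import unquote

# CloudFront standard log fields of interest, 0-indexed after split("\t"):
#   7   cs-uri-stem     the requested path
#   8   sc-status       HTTP status (where the 403s showed up)
#   10  cs(User-Agent)  URL-encoded user agent
def parse_line(line):
    fields = line.rstrip("\n").split("\t")
    return {
        "uri": fields[7],
        "status": fields[8],
        "user_agent": unquote(fields[10]),  # decode %20 and friends
    }
```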

&lt;p&gt;Standard logging is free. Real-time via Kinesis costs money and isn't needed at low traffic.&lt;/p&gt;

&lt;p&gt;Source for my tracker is on &lt;a href="https://github.com/YOUR-USERNAME/easerva-bot-tracker" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; if you want to fork it instead of writing your own.&lt;/p&gt;

&lt;h2&gt;What I'm watching next&lt;/h2&gt;

&lt;p&gt;The transition from robots.txt polling to actual content crawling — when ClaudeBot and OAI-SearchBot start fetching &lt;code&gt;/providers/...&lt;/code&gt; URLs instead of just &lt;code&gt;/robots.txt&lt;/code&gt;. That's the signal the site has moved from "discovered" to "being indexed." I'll post a 30-day follow-up.&lt;/p&gt;

&lt;p&gt;If you're tracking AI bot patterns on your own site, I'd love to hear what you're seeing.&lt;/p&gt;

</description>
      <category>seo</category>
      <category>ai</category>
      <category>webdev</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
