
Yuki Nakazawa

Logging Googlebot Crawls for Free with Cloudflare Workers + D1

Introduction

When doing SEO work, there are times when you need to investigate whether Googlebot is properly crawling your pages.

Google Search Console has a crawl stats feature, but the sample URLs it surfaces are limited to 1,000 entries. For tracking the crawl status of specific pages over time, it falls a bit short.

Server access logs are the ideal solution for this kind of investigation.

I use the setup described in this article on LeapRows, a browser-based CSV tool I built on Vercel.

On a self-managed VPS or on-premise server, Googlebot access is automatically recorded in Nginx or Apache logs.

However, with serverless PaaS platforms like Vercel, there's no server management interface — which means no direct access to access logs.

This is where Cloudflare comes in. By routing your domain's DNS through Cloudflare, you can intercept requests with a Cloudflare Worker before they ever reach Vercel.

[Standard Vercel setup]
Googlebot → Vercel → Response (no logs)

[With Cloudflare]
Googlebot → Cloudflare Worker (logs recorded here) → Vercel → Response

By saving the logs captured by the Worker into Cloudflare's D1 (a SQLite-based database), you can collect Googlebot crawl logs without touching the Vercel side at all — and it runs entirely within the free tier.

This article walks through the setup step by step.

What you can collect

  • Crawl timing per URL (when each page was crawled)
  • Status code monitoring (detecting 4xx/5xx crawl errors)
  • Cache hit rate (DYNAMIC vs HIT)
  • Bot type breakdown (InspectionTool vs Googlebot)
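
Once rows accumulate in D1, each of these metrics is a small aggregation away. As a rough sketch in plain JavaScript (the sample rows are hypothetical, shaped like the crawl_logs table created in Step 1):

```javascript
// `rows` stands in for the result of SELECT * FROM crawl_logs
const rows = [
  { ts: "2024-05-01T02:10:00Z", url: "/pricing", status: 200, cache: "HIT",     bot_type: "googlebot" },
  { ts: "2024-05-02T11:30:00Z", url: "/pricing", status: 200, cache: "DYNAMIC", bot_type: "googlebot" },
  { ts: "2024-05-02T12:00:00Z", url: "/old",     status: 404, cache: "DYNAMIC", bot_type: "inspection-tool" },
];

// Crawl timing per URL: latest crawl timestamp for each path
const lastCrawl = {};
for (const r of rows) {
  if (!lastCrawl[r.url] || r.ts > lastCrawl[r.url]) lastCrawl[r.url] = r.ts;
}

// Status code monitoring: 4xx/5xx rows
const errors = rows.filter((r) => r.status >= 400);

// Cache hit rate
const hitRate = rows.filter((r) => r.cache === "HIT").length / rows.length;

// Bot type breakdown
const byBot = {};
for (const r of rows) byBot[r.bot_type] = (byBot[r.bot_type] || 0) + 1;

console.log(lastCrawl["/pricing"]); // "2024-05-02T11:30:00Z"
console.log(errors.length);         // 1
console.log(hitRate.toFixed(2));    // "0.33"
console.log(byBot);                 // { googlebot: 2, "inspection-tool": 1 }
```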

Prerequisites

  • Your domain is managed through Cloudflare
  • Node.js and the Wrangler CLI are available
  • Estimated time: ~30 minutes

Architecture Overview

Architecture diagram showing Googlebot sending a request to a Cloudflare Worker, which intercepts and detects the bot, forwards the request to Vercel, and asynchronously saves the crawl log to a D1 (SQLite) database using ctx.waitUntil.

The Worker intercepts every incoming request and writes crawl data to D1.

ctx.waitUntil is used to handle log saving asynchronously, so the response to Googlebot is never delayed.
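
To make that concrete, here is a minimal mock (plain JavaScript, not real Worker code) showing that the handler returns its response while the waitUntil promise is still pending; the real runtime keeps the Worker alive until the collected promises settle:

```javascript
// Mock of ctx.waitUntil: it just collects background promises
const pending = [];
const ctx = { waitUntil: (p) => pending.push(p) };

let logWritten = false;
let finishLog; // resolved later, simulating a slow D1 write
const logPromise = new Promise((resolve) => { finishLog = resolve; })
  .then(() => { logWritten = true; });

function handle(ctx) {
  ctx.waitUntil(logPromise); // schedule the write, but don't await it
  return { status: 200 };    // respond immediately
}

const response = handle(ctx);
// The response exists before the log write has completed:
console.log(response.status, logWritten); // 200 false
finishLog(); // the write settles on the microtask queue, after this synchronous code
```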


Step 0: Install the Wrangler CLI

Install the Wrangler CLI to manage Cloudflare from your terminal. Once installed, log in to your account.

npm install -g wrangler
wrangler login

Step 1: Create the D1 Database

Create a D1 database on Cloudflare.

wrangler d1 create googlebot-logs

The output will include a database_id — make a note of it.

✅ Successfully created DB 'googlebot-logs'
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  ← copy this

Next, create the table definition file and apply it to D1.

Note: without the --remote flag, the command runs against your local D1 instance instead of the remote one — don't forget it.

# Create schema.sql
cat > schema.sql << 'EOF'
CREATE TABLE IF NOT EXISTS crawl_logs (
  id             INTEGER PRIMARY KEY AUTOINCREMENT,
  ts             TEXT NOT NULL,
  url            TEXT NOT NULL,
  method         TEXT,
  status         INTEGER,
  ua             TEXT,
  ip             TEXT,
  country        TEXT,
  cache          TEXT,
  referer        TEXT,
  bot_type       TEXT,
  content_length INTEGER
);

CREATE INDEX IF NOT EXISTS idx_ts  ON crawl_logs(ts);
CREATE INDEX IF NOT EXISTS idx_url ON crawl_logs(url);
EOF

# Apply to D1
wrangler d1 execute googlebot-logs --file=schema.sql --remote

Step 2: Create the Worker

Create a Worker project locally.

mkdir googlebot-logger && cd googlebot-logger
npm init -y

Create wrangler.toml with the following content.

name = "googlebot-logger"
main = "src/index.js"
compatibility_date = "2024-01-01"

# Domain configuration
[[routes]]
pattern = "yourdomain.com/*"  # enter your domain
zone_name = "yourdomain.com"  # enter your domain

# D1 binding
[[d1_databases]]
binding = "DB"
database_name = "googlebot-logs"
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # ID from Step 1

Next, create src/index.js. Since we only want to track page-level crawls, static resource files under /_next/ (JS, CSS, etc.) are excluded from logging.

export default {
  async fetch(request, env, ctx) {
    // 1. Forward the request to the origin first
    const response = await fetch(request);

    // 2. Check the User-Agent
    const ua = request.headers.get("User-Agent") || "";
    const botType = detectGoogleBot(ua);

    // 3. If Googlebot, save the log asynchronously without delaying the response
    if (botType) {
      const logResponse = response.clone(); // clone before returning
      ctx.waitUntil(saveLog(env.DB, request, logResponse, ua, botType));
    }

    return response;
  },
};

// Identify the type of Googlebot
function detectGoogleBot(ua) {
  if (/Googlebot-Image/i.test(ua))       return "googlebot-image";
  if (/Googlebot-Video/i.test(ua))       return "googlebot-video";
  if (/Googlebot-News/i.test(ua))        return "googlebot-news";
  if (/AdsBot-Google/i.test(ua))         return "adsbot";
  if (/Google-InspectionTool/i.test(ua)) return "inspection-tool";
  if (/Googlebot/i.test(ua))             return "googlebot";
  return null; // not Googlebot
}

// Save log to D1
async function saveLog(db, request, response, ua, botType) {
  const url  = new URL(request.url);
  const path = url.pathname;
  const cf   = request.cf || {};

  // Exclude static resource files — page URLs only
  if (
    path.startsWith('/_next/') ||
    path.startsWith('/_vercel/') ||
    path.startsWith('/static/') ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  ) {
    return;
  }

  // If Content-Length is absent, read the body to measure size
  let contentLength = parseInt(response.headers.get('Content-Length') || '0', 10);
  if (!contentLength) {
    const cloned = response.clone();
    const buf = await cloned.arrayBuffer();
    contentLength = buf.byteLength;
  }

  try {
    await db.prepare(`
      INSERT INTO crawl_logs (ts, url, method, status, ua, ip, country, cache, referer, bot_type, content_length)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    `).bind(
      new Date().toISOString(),
      path + url.search,
      request.method,
      response.status,
      ua,
      request.headers.get('CF-Connecting-IP') || '',
      cf.country || '',
      response.headers.get('CF-Cache-Status') || '',
      request.headers.get('Referer') || '',
      botType,
      contentLength
    ).run();
  } catch (e) {
    // Log failures should never affect site availability
    console.error('Log save failed:', e.message);
  }
}
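
The exclusion filter is easy to sanity-check in isolation. The predicate below mirrors the check inside saveLog; the sample paths are illustrative:

```javascript
// Same static-asset check as in saveLog, extracted for a quick sanity test
function isStaticAsset(path) {
  return (
    path.startsWith("/_next/") ||
    path.startsWith("/_vercel/") ||
    path.startsWith("/static/") ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  );
}

console.log(isStaticAsset("/_next/static/chunks/main.js")); // true  — skipped
console.log(isStaticAsset("/favicon.ico"));                 // true  — skipped
console.log(isStaticAsset("/blog/hello-world"));            // false — logged
console.log(isStaticAsset("/sitemap.xml"));                 // false — logged
```

Note that /sitemap.xml passes through the filter, which is useful: sitemap fetches by Googlebot are worth tracking.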

Step 3: Cloudflare DNS Configuration

Configure Cloudflare to route traffic through the Worker.

Verify SSL/TLS encryption mode

Go to SSL/TLS → Overview in the Cloudflare dashboard and confirm the encryption mode is set to Full.

Leaving it on Flexible and then enabling the proxy can cause an HTTPS redirect loop that takes your site down — worth checking first.

Enable proxy on your DNS record

Go to DNS → Records, find the A record for your domain, and click Edit.

Cloudflare DNS Records page showing the A record for leaprows.com and its Proxy status.

Enable the Proxy status toggle and save. The icon will turn into an orange cloud, which means requests will now flow through the Worker.

Cloudflare DNS record edit form with the Proxy status toggle switched on.


Step 4: Deploy

With Cloudflare configured, deploy the Worker from your local project.

wrangler deploy

That's everything needed to start collecting logs.


Step 5: Verify

To confirm logs are being recorded, run a live test from Google Search Console → URL Inspection → Test Live URL.

Google Search Console URL Inspection tool running a live test.

Search Console's live test uses the Google-InspectionTool User-Agent, so in our setup it will be recorded with bot_type = inspection-tool.

After the test completes, check D1 with the following command:

wrangler d1 execute googlebot-logs --remote --command="SELECT * FROM crawl_logs ORDER BY ts DESC LIMIT 5"

If you see a row with inspection-tool in the bot_type column, everything is working correctly.

Terminal output of a wrangler d1 execute command showing crawl log records in D1.


Free Tier

At roughly 500 bytes per record, the 5 GB free tier holds approximately 10 million records. For an indie SaaS or personal site, you're unlikely to come close to the limit.
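
The back-of-envelope math, assuming the ~500-byte-per-row estimate:

```javascript
// How many ~500-byte rows fit in the 5 GB D1 free tier
const bytesPerRecord = 500;          // rough estimate per crawl_logs row
const freeTierBytes = 5 * 1024 ** 3; // 5 GB
const capacity = Math.floor(freeTierBytes / bytesPerRecord);
console.log(capacity); // 10737418 — roughly 10 million records
```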

Service            Free tier
Workers            100,000 requests / day
D1 rows written    100,000 rows / day
D1 storage         5 GB (total across all databases)

If you'd like to keep things tidy, you can add a cron job to automatically delete old logs:

# Append to wrangler.toml
[triggers]
crons = ["0 0 * * 0"]  # every Sunday at 00:00 UTC
// In src/index.js: define the scheduled handler and export it alongside fetch
async function scheduled(event, env, ctx) {
  await env.DB.prepare(`
    DELETE FROM crawl_logs
    WHERE ts < datetime('now', '-90 days')
  `).run();
}

export default {
  async fetch(request, env, ctx) {
    // ... existing fetch handler code ...
  },
  scheduled,
};
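
One reason the DELETE's string comparison works: ts is stored as ISO-8601 text, and ISO-8601 timestamps sort lexicographically in chronological order, so comparing strings is the same as comparing times. A quick check (the cutoff computation here mirrors what SQLite does server-side):

```javascript
// ISO-8601 strings compare lexicographically in chronological order,
// so `ts < cutoff` on TEXT columns behaves like a time comparison.
const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000).toISOString();
const oldLog = "2020-01-15T08:00:00.000Z";
const newLog = new Date().toISOString();
console.log(oldLog < cutoff); // true  — would be deleted
console.log(newLog < cutoff); // false — retained
```

Strictly speaking, SQLite's datetime('now', ...) emits 'YYYY-MM-DD HH:MM:SS' (a space instead of 'T') while the Worker stores toISOString() values; the two formats agree on the date prefix, so the only imprecision is within the boundary day itself, which is harmless for a retention job.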

Conclusion

Serverless PaaS platforms like Vercel don't expose server access logs, but by using Cloudflare as a DNS proxy you can collect Googlebot crawl logs without any changes to your server-side code.

The D1 free tier is more than generous enough for small to mid-sized sites, making this essentially free to run.

As a next step, you could join this data with Google Search Console exports to analyze the relationship between crawl frequency and indexing status.
