
Yuki Nakazawa

Logging Googlebot Crawls for Free with Cloudflare Workers + D1

Introduction

When doing SEO work, there are times when you need to investigate whether Googlebot is properly crawling your pages.

Google Search Console has a crawl stats feature, but the sample URLs it surfaces are limited to 1,000 entries. For tracking the crawl status of specific pages over time, it falls a bit short.

Server access logs are the ideal solution for this kind of investigation.

I use the setup described in this article on LeapRows, a browser-based CSV tool I built on Vercel.

On a self-managed VPS or on-premise server, Googlebot access is automatically recorded in Nginx or Apache logs.

However, with serverless PaaS platforms like Vercel, there's no server management interface — which means no direct access to access logs.

This is where Cloudflare comes in. By routing your domain's DNS through Cloudflare, you can intercept requests with a Cloudflare Worker before they ever reach Vercel.

[Standard Vercel setup]
Googlebot → Vercel → Response (no logs)

[With Cloudflare]
Googlebot → Cloudflare Worker (logs recorded here) → Vercel → Response

By saving the logs captured by the Worker into Cloudflare's D1 (a SQLite-based database), you can collect Googlebot crawl logs without touching the Vercel side at all — and it runs entirely within the free tier.

This article walks through the setup step by step.

What you can collect

  • Crawl timing per URL (when each page was crawled)
  • Status code monitoring (detecting 4xx/5xx crawl errors)
  • Cache hit rate (DYNAMIC vs HIT)
  • Bot type breakdown (InspectionTool vs Googlebot)
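
Once rows accumulate in D1, each of these metrics is a small aggregation away. As a rough sketch in plain JavaScript (the sample rows are hypothetical, shaped like the crawl_logs table created in Step 1):

```javascript
// `rows` stands in for the result of SELECT * FROM crawl_logs
const rows = [
  { ts: "2024-05-01T02:10:00Z", url: "/pricing", status: 200, cache: "HIT",     bot_type: "googlebot" },
  { ts: "2024-05-02T11:30:00Z", url: "/pricing", status: 200, cache: "DYNAMIC", bot_type: "googlebot" },
  { ts: "2024-05-02T12:00:00Z", url: "/old",     status: 404, cache: "DYNAMIC", bot_type: "inspection-tool" },
];

// Crawl timing per URL: latest crawl timestamp for each path
const lastCrawl = {};
for (const r of rows) {
  if (!lastCrawl[r.url] || r.ts > lastCrawl[r.url]) lastCrawl[r.url] = r.ts;
}

// Status code monitoring: 4xx/5xx rows
const errors = rows.filter((r) => r.status >= 400);

// Cache hit rate
const hitRate = rows.filter((r) => r.cache === "HIT").length / rows.length;

// Bot type breakdown
const byBot = {};
for (const r of rows) byBot[r.bot_type] = (byBot[r.bot_type] || 0) + 1;

console.log(lastCrawl["/pricing"]); // "2024-05-02T11:30:00Z"
console.log(errors.length);         // 1
console.log(hitRate.toFixed(2));    // "0.33"
console.log(byBot);                 // { googlebot: 2, "inspection-tool": 1 }
```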

Prerequisites

  • Your domain is managed through Cloudflare
  • Node.js and the Wrangler CLI are available
  • Estimated time: ~30 minutes

Architecture Overview

Architecture diagram showing Googlebot sending a request to a Cloudflare Worker, which intercepts and detects the bot, forwards the request to Vercel, and asynchronously saves the crawl log to a D1 (SQLite) database using ctx.waitUntil.

The Worker intercepts every incoming request and writes crawl data to D1.

ctx.waitUntil is used to handle log saving asynchronously, so the response to Googlebot is never delayed.
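
To make that concrete, here is a minimal mock (plain JavaScript, not real Worker code) showing that the handler returns its response while the waitUntil promise is still pending; the real runtime keeps the Worker alive until the collected promises settle:

```javascript
// Mock of ctx.waitUntil: it just collects background promises
const pending = [];
const ctx = { waitUntil: (p) => pending.push(p) };

let logWritten = false;
let finishLog; // resolved later, simulating a slow D1 write
const logPromise = new Promise((resolve) => { finishLog = resolve; })
  .then(() => { logWritten = true; });

function handle(ctx) {
  ctx.waitUntil(logPromise); // schedule the write, but don't await it
  return { status: 200 };    // respond immediately
}

const response = handle(ctx);
// The response exists before the log write has completed:
console.log(response.status, logWritten); // 200 false
finishLog(); // the write settles on the microtask queue, after this synchronous code
```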


Step 0: Install the Wrangler CLI

Install the Wrangler CLI to manage Cloudflare from your terminal. Once installed, log in to your account.

npm install -g wrangler
wrangler login

Step 1: Create the D1 Database

Create a D1 database on Cloudflare.

wrangler d1 create googlebot-logs

The output will include a database_id — make a note of it.

✅ Successfully created DB 'googlebot-logs'
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  ← copy this

Next, create the table definition file and apply it to D1.

Note: without the --remote flag, the command runs against your local D1 instance instead of the remote one — don't forget it.

# Create schema.sql
cat > schema.sql << 'EOF'
CREATE TABLE IF NOT EXISTS crawl_logs (
  id             INTEGER PRIMARY KEY AUTOINCREMENT,
  ts             TEXT NOT NULL,
  url            TEXT NOT NULL,
  method         TEXT,
  status         INTEGER,
  ua             TEXT,
  ip             TEXT,
  country        TEXT,
  cache          TEXT,
  referer        TEXT,
  bot_type       TEXT,
  content_length INTEGER
);

CREATE INDEX IF NOT EXISTS idx_ts  ON crawl_logs(ts);
CREATE INDEX IF NOT EXISTS idx_url ON crawl_logs(url);
EOF

# Apply to D1
wrangler d1 execute googlebot-logs --file=schema.sql --remote

Step 2: Create the Worker

Create a Worker project locally.

mkdir googlebot-logger && cd googlebot-logger
npm init -y

Create wrangler.toml with the following content.

name = "googlebot-logger"
main = "src/index.js"
compatibility_date = "2024-01-01"

# Domain configuration
[[routes]]
pattern = "yourdomain.com/*"  # enter your domain
zone_name = "yourdomain.com"  # enter your domain

# D1 binding
[[d1_databases]]
binding = "DB"
database_name = "googlebot-logs"
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # ID from Step 1

Next, create src/index.js. Since we only want to track page-level crawls, static resource files under /_next/ (JS, CSS, etc.) are excluded from logging.

export default {
  async fetch(request, env, ctx) {
    // 1. Forward the request to the origin first
    const response = await fetch(request);

    // 2. Check the User-Agent
    const ua = request.headers.get("User-Agent") || "";
    const botType = detectGoogleBot(ua);

    // 3. If Googlebot, save the log asynchronously without delaying the response
    if (botType) {
      const logResponse = response.clone(); // clone before returning
      ctx.waitUntil(saveLog(env.DB, request, logResponse, ua, botType));
    }

    return response;
  },
};

// Identify the type of Googlebot
function detectGoogleBot(ua) {
  if (/Googlebot-Image/i.test(ua))       return "googlebot-image";
  if (/Googlebot-Video/i.test(ua))       return "googlebot-video";
  if (/Googlebot-News/i.test(ua))        return "googlebot-news";
  if (/AdsBot-Google/i.test(ua))         return "adsbot";
  if (/Google-InspectionTool/i.test(ua)) return "inspection-tool";
  if (/Googlebot/i.test(ua))             return "googlebot";
  return null; // not Googlebot
}

// Save log to D1
async function saveLog(db, request, response, ua, botType) {
  const url  = new URL(request.url);
  const path = url.pathname;
  const cf   = request.cf || {};

  // Exclude static resource files — page URLs only
  if (
    path.startsWith('/_next/') ||
    path.startsWith('/_vercel/') ||
    path.startsWith('/static/') ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  ) {
    return;
  }

  // If Content-Length is absent, read the body to measure size
  let contentLength = parseInt(response.headers.get('Content-Length') || '0', 10);
  if (!contentLength) {
    const cloned = response.clone();
    const buf = await cloned.arrayBuffer();
    contentLength = buf.byteLength;
  }

  try {
    await db.prepare(`
      INSERT INTO crawl_logs (ts, url, method, status, ua, ip, country, cache, referer, bot_type, content_length)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    `).bind(
      new Date().toISOString(),
      path + url.search,
      request.method,
      response.status,
      ua,
      request.headers.get('CF-Connecting-IP') || '',
      cf.country || '',
      response.headers.get('CF-Cache-Status') || '',
      request.headers.get('Referer') || '',
      botType,
      contentLength
    ).run();
  } catch (e) {
    // Log failures should never affect site availability
    console.error('Log save failed:', e.message);
  }
}
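
The exclusion filter is easy to sanity-check in isolation. The predicate below mirrors the check inside saveLog; the sample paths are illustrative:

```javascript
// Same static-asset check as in saveLog, extracted for a quick sanity test
function isStaticAsset(path) {
  return (
    path.startsWith("/_next/") ||
    path.startsWith("/_vercel/") ||
    path.startsWith("/static/") ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  );
}

console.log(isStaticAsset("/_next/static/chunks/main.js")); // true  — skipped
console.log(isStaticAsset("/favicon.ico"));                 // true  — skipped
console.log(isStaticAsset("/blog/hello-world"));            // false — logged
console.log(isStaticAsset("/sitemap.xml"));                 // false — logged
```

Note that /sitemap.xml passes through the filter, which is useful: sitemap fetches by Googlebot are worth tracking.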

Step 3: Cloudflare DNS Configuration

Configure Cloudflare to route traffic through the Worker.

Verify SSL/TLS encryption mode

Go to SSL/TLS → Overview in the Cloudflare dashboard and confirm the encryption mode is set to Full.

Leaving it on Flexible and then enabling the proxy can cause an HTTPS redirect loop that takes your site down — worth checking first.

Enable proxy on your DNS record

Go to DNS → Records, find the A record for your domain, and click Edit.

Cloudflare DNS Records page showing the A record for leaprows.com and its Proxy status.

Enable the Proxy status toggle and save. The icon will turn into an orange cloud, which means requests will now flow through the Worker.

Cloudflare DNS record edit form with the Proxy status toggle switched on.


Step 4: Deploy

With Cloudflare configured, deploy the Worker from your local project.

wrangler deploy

That's everything needed to start collecting logs.


Step 5: Verify

To confirm logs are being recorded, run a live test from Google Search Console → URL Inspection → Test Live URL.

Google Search Console URL Inspection tool running a live test.

Search Console's live test uses the Google-InspectionTool User-Agent, so in our setup it will be recorded with bot_type = inspection-tool.

After the test completes, check D1 with the following command:

wrangler d1 execute googlebot-logs --remote --command="SELECT * FROM crawl_logs ORDER BY ts DESC LIMIT 5"

If you see a row with inspection-tool in the bot_type column, everything is working correctly.

Terminal output of a wrangler d1 execute command showing crawl log records in D1.


Free Tier

At roughly 500 bytes per record, the 5 GB free tier holds approximately 10 million records. For an indie SaaS or personal site, you're unlikely to come close to the limit.
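
The back-of-envelope math, assuming the ~500-byte-per-row estimate:

```javascript
// How many ~500-byte rows fit in the 5 GB D1 free tier
const bytesPerRecord = 500;          // rough estimate per crawl_logs row
const freeTierBytes = 5 * 1024 ** 3; // 5 GB
const capacity = Math.floor(freeTierBytes / bytesPerRecord);
console.log(capacity); // 10737418 — roughly 10 million records
```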

Service            Free tier
Workers            100,000 requests / day
D1 rows written    100,000 rows / day
D1 storage         5 GB (total across all databases)

If you'd like to keep things tidy, you can add a cron job to automatically delete old logs:

# Append to wrangler.toml
[triggers]
crons = ["0 0 * * 0"]  # every Sunday at 00:00 UTC
// In src/index.js: define the scheduled handler and export it alongside fetch
async function scheduled(event, env, ctx) {
  await env.DB.prepare(`
    DELETE FROM crawl_logs
    WHERE ts < datetime('now', '-90 days')
  `).run();
}

export default {
  async fetch(request, env, ctx) {
    // ... existing fetch handler code ...
  },
  scheduled,
};
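
One reason the DELETE's string comparison works: ts is stored as ISO-8601 text, and ISO-8601 timestamps sort lexicographically in chronological order, so comparing strings is the same as comparing times. A quick check (the cutoff computation here mirrors what SQLite does server-side):

```javascript
// ISO-8601 strings compare lexicographically in chronological order,
// so `ts < cutoff` on TEXT columns behaves like a time comparison.
const cutoff = new Date(Date.now() - 90 * 24 * 60 * 60 * 1000).toISOString();
const oldLog = "2020-01-15T08:00:00.000Z";
const newLog = new Date().toISOString();
console.log(oldLog < cutoff); // true  — would be deleted
console.log(newLog < cutoff); // false — retained
```

Strictly speaking, SQLite's datetime('now', ...) emits 'YYYY-MM-DD HH:MM:SS' (a space instead of 'T') while the Worker stores toISOString() values; the two formats agree on the date prefix, so the only imprecision is within the boundary day itself, which is harmless for a retention job.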

Conclusion

Serverless PaaS platforms like Vercel don't expose server access logs, but by using Cloudflare as a DNS proxy you can collect Googlebot crawl logs without any changes to your server-side code.

The D1 free tier is more than generous enough for small to mid-sized sites, making this essentially free to run.

As a next step, you could join this data with Google Search Console exports to analyze the relationship between crawl frequency and indexing status.
