Introduction
When doing SEO work, there are times when you need to investigate whether Googlebot is properly crawling your pages.
Google Search Console has a crawl stats feature, but the sample URLs it surfaces are limited to 1,000 entries. For tracking the crawl status of specific pages over time, it falls a bit short.
Server access logs are the ideal solution for this kind of investigation.
I use this setup on LeapRows, a browser-based CSV tool I built on Vercel.
On a self-managed VPS or on-premise server, Googlebot access is automatically recorded in Nginx or Apache logs.
However, with serverless PaaS platforms like Vercel, there's no server management interface — which means no direct access to access logs.
This is where Cloudflare comes in. By routing your domain's DNS through Cloudflare, you can intercept requests with a Cloudflare Worker before they ever reach Vercel.
[Standard Vercel setup]
Googlebot → Vercel → Response (no logs)
[With Cloudflare]
Googlebot → Cloudflare Worker (logs recorded here) → Vercel → Response
By saving the logs captured by the Worker into Cloudflare's D1 (a SQLite-based database), you can collect Googlebot crawl logs without touching the Vercel side at all — and it runs entirely within the free tier.
This article walks through the setup step by step.
What you can collect
- Crawl timing per URL (when each page was crawled)
- Status code monitoring (detecting 4xx/5xx crawl errors)
- Cache hit rate (DYNAMIC vs HIT)
- Bot type breakdown (InspectionTool vs Googlebot)
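As a rough sketch of how these metrics fall out of the logged rows, the helper below summarizes an array of row objects that use the same column names as the crawl_logs table created in Step 1. The function name summarizeCrawl and the sample rows are hypothetical and exist only for illustration.

```javascript
// Hypothetical helper: summarize rows shaped like the crawl_logs table.
function summarizeCrawl(rows) {
  const summary = { total: rows.length, errors: 0, cacheHits: 0, byBot: {}, lastCrawl: {} };
  for (const row of rows) {
    if (row.status >= 400) summary.errors += 1;      // 4xx/5xx crawl errors
    if (row.cache === "HIT") summary.cacheHits += 1; // CF-Cache-Status value
    summary.byBot[row.bot_type] = (summary.byBot[row.bot_type] || 0) + 1;
    // ISO 8601 UTC timestamps of the same shape compare correctly as strings
    if (!summary.lastCrawl[row.url] || row.ts > summary.lastCrawl[row.url]) {
      summary.lastCrawl[row.url] = row.ts;
    }
  }
  summary.cacheHitRate = rows.length ? summary.cacheHits / rows.length : 0;
  return summary;
}

// Made-up sample rows, not real crawl data
const sampleRows = [
  { ts: "2025-01-01T00:00:00.000Z", url: "/", status: 200, cache: "HIT", bot_type: "googlebot" },
  { ts: "2025-01-02T00:00:00.000Z", url: "/", status: 200, cache: "DYNAMIC", bot_type: "googlebot" },
  { ts: "2025-01-02T06:00:00.000Z", url: "/pricing", status: 404, cache: "DYNAMIC", bot_type: "inspection-tool" },
];

console.log(summarizeCrawl(sampleRows));
```

In practice the same aggregation could be done in SQL against D1; the JavaScript version is only meant to show which columns feed which metric.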
Prerequisites
- Your domain is managed through Cloudflare
- Node.js and the Wrangler CLI are available
- Estimated time: ~30 minutes
Architecture Overview
The Worker intercepts every incoming request and writes crawl data to D1.
ctx.waitUntil is used to handle log saving asynchronously, so the response to Googlebot is never delayed.
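To make the ordering concrete, here is a small simulation that runs in plain Node.js. FakeContext is a hypothetical stand-in for the context object the Workers runtime provides, and slowLogWrite stands in for the D1 insert; neither is a Cloudflare API.

```javascript
// Stand-in for the execution context that the Workers runtime passes to fetch().
// FakeContext is a hypothetical test double, not a Cloudflare API.
class FakeContext {
  constructor() {
    this.pending = [];
  }
  waitUntil(promise) {
    this.pending.push(promise);
  }
}

// Simulates a slow log write (stands in for the D1 INSERT).
async function slowLogWrite(events) {
  await new Promise((resolve) => setTimeout(resolve, 50));
  events.push("log saved");
}

// Mirrors the handler's shape: schedule the log write, then return immediately.
async function handle(ctx, events) {
  ctx.waitUntil(slowLogWrite(events)); // scheduled, not awaited
  events.push("response returned");
  return "response";
}

async function demo() {
  const ctx = new FakeContext();
  const events = [];
  await handle(ctx, events);
  await Promise.all(ctx.pending); // the real runtime waits for these after responding
  return events;
}

demo().then((events) => console.log(events));
// → [ 'response returned', 'log saved' ]
```

The handler finishes before the simulated write does, which is the property the Worker relies on.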
Step 0: Install the Wrangler CLI
Install the Wrangler CLI to manage Cloudflare from your terminal. Once installed, log in to your account.
npm install -g wrangler
wrangler login
Step 1: Create the D1 Database
Create a D1 database on Cloudflare.
wrangler d1 create googlebot-logs
The output will include a database_id — make a note of it.
✅ Successfully created DB 'googlebot-logs'
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" ← copy this
Next, create the table definition file and apply it to D1.
Note: without the --remote flag, the command runs against your local D1 instance instead of the remote one — don't forget it.
# Create schema.sql
cat > schema.sql << 'EOF'
CREATE TABLE IF NOT EXISTS crawl_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
ts TEXT NOT NULL,
url TEXT NOT NULL,
method TEXT,
status INTEGER,
ua TEXT,
ip TEXT,
country TEXT,
cache TEXT,
referer TEXT,
bot_type TEXT,
content_length INTEGER
);
CREATE INDEX IF NOT EXISTS idx_ts ON crawl_logs(ts);
CREATE INDEX IF NOT EXISTS idx_url ON crawl_logs(url);
EOF
# Apply to D1
wrangler d1 execute googlebot-logs --file=schema.sql --remote
Step 2: Create the Worker
Create a Worker project locally.
mkdir googlebot-logger && cd googlebot-logger
npm init -y
Create wrangler.toml with the following content.
name = "googlebot-logger"
main = "src/index.js"
compatibility_date = "2024-01-01"
# Domain configuration
[[routes]]
pattern = "yourdomain.com/*" # enter your domain
zone_name = "yourdomain.com" # enter your domain
# D1 binding
[[d1_databases]]
binding = "DB"
database_name = "googlebot-logs"
database_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" # ID from Step 1
Next, create src/index.js. Since we only want to track page-level crawls, static resource files under /_next/ (JS, CSS, etc.) are excluded from logging.
export default {
  async fetch(request, env, ctx) {
    // 1. Forward the request to the origin first
    const response = await fetch(request);
    // 2. Check the User-Agent
    const ua = request.headers.get("User-Agent") || "";
    const botType = detectGoogleBot(ua);
    // 3. If Googlebot, save the log asynchronously without delaying the response
    if (botType) {
      const logResponse = response.clone(); // clone before returning
      ctx.waitUntil(saveLog(env.DB, request, logResponse, ua, botType));
    }
    return response;
  },
};
// Identify the type of Googlebot (specific crawlers are checked before the generic match)
function detectGoogleBot(ua) {
  if (/Googlebot-Image/i.test(ua)) return "googlebot-image";
  if (/Googlebot-Video/i.test(ua)) return "googlebot-video";
  if (/Googlebot-News/i.test(ua)) return "googlebot-news";
  if (/AdsBot-Google/i.test(ua)) return "adsbot";
  if (/Google-InspectionTool/i.test(ua)) return "inspection-tool";
  if (/Googlebot/i.test(ua)) return "googlebot";
  return null; // not Googlebot
}
// Save log to D1
async function saveLog(db, request, response, ua, botType) {
  const url = new URL(request.url);
  const path = url.pathname;
  const cf = request.cf || {};
  // Exclude static resource files; log page URLs only
  if (
    path.startsWith('/_next/') ||
    path.startsWith('/_vercel/') ||
    path.startsWith('/static/') ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  ) {
    return;
  }
  let contentLength = parseInt(response.headers.get('Content-Length') || '0', 10);
  try {
    // If Content-Length is absent, read the body to measure size.
    // `response` is already the clone made in fetch(), so it can be consumed directly.
    if (!contentLength) {
      const buf = await response.arrayBuffer();
      contentLength = buf.byteLength;
    }
    await db.prepare(`
      INSERT INTO crawl_logs (ts, url, method, status, ua, ip, country, cache, referer, bot_type, content_length)
      VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    `).bind(
      new Date().toISOString(),
      path + url.search,
      request.method,
      response.status,
      ua,
      request.headers.get('CF-Connecting-IP') || '',
      cf.country || '',
      response.headers.get('CF-Cache-Status') || '',
      request.headers.get('Referer') || '',
      botType,
      contentLength
    ).run();
  } catch (e) {
    // Log failures should never affect site availability
    console.error('Log save failed:', e.message);
  }
}
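The static-file exclusion in saveLog can be checked in isolation before deploying. The sketch below extracts the same condition into a standalone predicate; the name isPageRequest and the sample paths are hypothetical.

```javascript
// Same exclusion rule as in saveLog, extracted into a standalone predicate.
// The name isPageRequest is hypothetical.
function isPageRequest(path) {
  return !(
    path.startsWith("/_next/") ||
    path.startsWith("/_vercel/") ||
    path.startsWith("/static/") ||
    /\.(js|css|ico|png|jpg|jpeg|svg|webp|woff|woff2|map|wasm)$/.test(path)
  );
}

// Illustrative paths only
const samplePaths = ["/", "/blog/post-1", "/_next/static/chunks/main.js", "/favicon.ico", "/sitemap.xml"];
for (const p of samplePaths) {
  console.log(p, isPageRequest(p));
}
```

Note that /sitemap.xml (and /robots.txt) still count as page requests under this rule, and that the extension match is case-sensitive, so a path ending in .PNG would be logged.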
Step 3: Cloudflare DNS Configuration
Configure Cloudflare to route traffic through the Worker.
Verify SSL/TLS encryption mode
Go to SSL/TLS → Overview in the Cloudflare dashboard and confirm the encryption mode is set to Full.
Leaving it on Flexible and then enabling the proxy can cause an HTTPS redirect loop that takes your site down — worth checking first.
Enable proxy on your DNS record
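Go to DNS → Records, find the record that points your domain at Vercel (typically an A record for the apex domain, or a CNAME for a subdomain such as www), and click Edit.
Enable the Proxy status toggle and save. The icon will turn into an orange cloud, which means requests now pass through Cloudflare, where the Worker route can intercept them once the Worker is deployed.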
Step 4: Deploy
With Cloudflare configured, deploy the Worker from your local project.
wrangler deploy
That's everything needed to start collecting logs.
Step 5: Verify
To confirm logs are being recorded, run a live test from Google Search Console → URL Inspection → Test Live URL.
Search Console's live test uses the Google-InspectionTool User-Agent, so in our setup it will be recorded with bot_type = inspection-tool.
After the test completes, check D1 with the following command:
wrangler d1 execute googlebot-logs --remote --command="SELECT * FROM crawl_logs ORDER BY ts DESC LIMIT 5"
If you see a row with inspection-tool in the bot_type column, everything is working correctly.
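To see why the live test is classified as inspection-tool rather than googlebot, the detection function can be exercised locally. The snippet repeats detectGoogleBot from src/index.js so that it runs standalone; the User-Agent strings are examples in the style Google documents and may not match the exact strings you see in your logs.

```javascript
// Copy of detectGoogleBot from src/index.js so the snippet runs on its own.
function detectGoogleBot(ua) {
  if (/Googlebot-Image/i.test(ua)) return "googlebot-image";
  if (/Googlebot-Video/i.test(ua)) return "googlebot-video";
  if (/Googlebot-News/i.test(ua)) return "googlebot-news";
  if (/AdsBot-Google/i.test(ua)) return "adsbot";
  if (/Google-InspectionTool/i.test(ua)) return "inspection-tool";
  if (/Googlebot/i.test(ua)) return "googlebot";
  return null;
}

// Example User-Agent strings (illustrative)
const sampleAgents = [
  "Mozilla/5.0 (compatible; Google-InspectionTool/1.0;)",
  "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  "Googlebot-Image/1.0",
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
];

for (const ua of sampleAgents) {
  console.log(detectGoogleBot(ua), "<-", ua);
}
```

The order of the checks matters: "Googlebot-Image/1.0" also matches the generic /Googlebot/ pattern, so the specific crawlers have to be tested first.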
Free Tier
At roughly 500 bytes per record, the 5 GB free tier holds approximately 10 million records. For an indie SaaS or personal site, you're unlikely to come close to the limit.
| Service | Free tier |
|---|---|
| Workers | 100,000 requests / day |
| D1 rows written | 100,000 rows / day |
| D1 storage | 5 GB (total across all databases) |
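The capacity estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch: the 500-byte figure is the rough per-record estimate from above, 5 GB is treated as decimal gigabytes, and index overhead is ignored.

```javascript
// Rough inputs taken from the figures above (estimates, not measurements)
const BYTES_PER_RECORD = 500;
const STORAGE_LIMIT_BYTES = 5 * 1000 ** 3; // 5 GB, decimal
const ROWS_WRITTEN_PER_DAY = 100000;       // D1 free-tier daily write limit

// How many records fit in the storage allowance
const maxRecords = STORAGE_LIMIT_BYTES / BYTES_PER_RECORD;

// How long it would take to fill storage when writing at the daily limit
const daysToFill = maxRecords / ROWS_WRITTEN_PER_DAY;

console.log(maxRecords); // 10000000
console.log(daysToFill); // 100
```

Even a site that hit the daily write limit every single day would need about 100 days to fill the storage allowance, so a retention window shorter than that keeps the table within the free tier.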
If you'd like to keep things tidy, you can add a cron job to automatically delete old logs:
# Append to wrangler.toml
[triggers]
crons = ["0 0 * * sun"] # runs every Sunday at 00:00 UTC (Cloudflare numbers weekdays 1-7, so the name avoids ambiguity)
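// Update src/index.js: define scheduled() and add it to the existing default export
async function scheduled(event, env, ctx) {
  // ts is stored in toISOString() format (e.g. 2025-01-01T00:00:00.000Z),
  // so build the cutoff in the same format for a like-for-like string comparison
  await env.DB.prepare(`
    DELETE FROM crawl_logs
    WHERE ts < strftime('%Y-%m-%dT%H:%M:%fZ', 'now', '-90 days')
  `).run();
}
export default {
  async fetch(request, env, ctx) {
    // ... existing fetch handler code ...
  },
  scheduled,
};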
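Because ts is written with toISOString(), the retention cutoff has to be compared in that same format. An alternative sketch is to compute the cutoff in JavaScript and bind it as a query parameter, which keeps the date logic out of SQL. The helper name cutoffIso is hypothetical.

```javascript
// Hypothetical helper: ISO 8601 timestamp `days` days before `now`.
function cutoffIso(days, now = new Date()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(now.getTime() - days * msPerDay).toISOString();
}

// Fixed reference time so the example is reproducible
const referenceNow = new Date("2025-04-01T00:00:00.000Z");
const cutoff = cutoffIso(90, referenceNow);
console.log(cutoff); // 2025-01-01T00:00:00.000Z

// Timestamps in this format sort lexicographically, so a plain string comparison works
console.log("2024-12-31T23:59:59.999Z" < cutoff); // true  (older than 90 days, would be deleted)
console.log("2025-01-01T00:00:00.001Z" < cutoff); // false (inside the window, kept)

// Inside the scheduled handler this could be used as:
//   await env.DB.prepare("DELETE FROM crawl_logs WHERE ts < ?").bind(cutoffIso(90)).run();
```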
Conclusion
Serverless PaaS platforms like Vercel don't expose server access logs, but by proxying your domain through Cloudflare and logging from a Worker, you can collect Googlebot crawl logs without any changes to your server-side code.
The D1 free tier is more than generous enough for small to mid-sized sites, making this essentially free to run.
As a next step, you could join this data with Google Search Console exports to analyze the relationship between crawl frequency and indexing status.




