137Foundry

Posted on May 21

How to Filter and Analyze Googlebot Requests from Server Logs with Python

#seo #webdev #productivity

Web server access logs are one of the richest data sources for technical SEO analysis, and Python makes processing them straightforward at any scale. This guide builds a complete Googlebot log analysis script step by step -- from parsing raw Combined Log Format entries to producing a status code report and top-crawled URL list. The full implementation is under 120 lines and requires only the Python standard library plus pandas for the optional DataFrame analysis.

The Log Format

Apache and Nginx both default to Combined Log Format. Each entry follows this structure:

IP - - [timestamp] "METHOD /path HTTP/version" status_code bytes "referer" "user_agent"

A real Googlebot entry looks like:

66.249.72.3 - - [21/May/2026:08:14:33 +0000] "GET /blog/seo-guide/ HTTP/1.1" 200 15423 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"

The fields we need for SEO analysis: IP address (field 1), URL path (in the quoted request, field 7), status code (field 9), and user agent (last quoted field). If you need response times, you must add them to your server's log format directive first -- they are not included in the default Combined Log Format.
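To make the field positions concrete, here is a minimal sketch that pulls the four fields out of the sample entry by whitespace position. The helper name `quick_fields` is hypothetical and the approach is for illustration only: positional splitting breaks on malformed request lines, which is why Step 1 uses a regular expression instead.

```python
# Sample entry copied from the example above
SAMPLE = ('66.249.72.3 - - [21/May/2026:08:14:33 +0000] '
          '"GET /blog/seo-guide/ HTTP/1.1" 200 15423 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1)"')

def quick_fields(line):
    """Pull the four SEO fields by position (hypothetical helper, illustration only)."""
    fields = line.split()  # whitespace-separated, so "field 1" is fields[0]
    return {
        'ip': fields[0],          # field 1
        'path': fields[6],        # field 7, inside the quoted request
        'status': fields[8],      # field 9
        # The user agent is the last quoted field on the line
        'user_agent': line.rsplit('"', 2)[1],
    }

print(quick_fields(SAMPLE))
```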

Step 1: Parse the Log File

Use a regular expression to extract the structured fields from each log line. The Python re module handles this cleanly:

import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    match = LOG_PATTERN.match(line)
    if match:
        return match.groupdict()
    return None

def parse_log_file(filepath):
    entries = []
    with open(filepath, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            parsed = parse_log_line(line.strip())
            if parsed:
                entries.append(parsed)
    return entries

Unparseable lines are normal -- log rotation markers, partial writes, and non-request entries do not match the pattern. The errors='replace' parameter prevents encoding failures on non-UTF-8 log entries, which can appear when legacy systems write log files in platform-specific encodings.
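The parser returns every field as a string. If you need typed values, for example to group requests by hour, a conversion step along these lines may help. This is a sketch, not part of the script above: `normalize_entry` is a hypothetical helper, and it assumes the English month abbreviations that Apache and Nginx write by default. The later steps compare `status` as a string, so apply this only where typed values are actually needed.

```python
from datetime import datetime

def normalize_entry(entry):
    """Return a copy of a parsed entry with typed timestamp, status, and bytes.

    Hypothetical helper; assumes the dict shape produced by parse_log_line.
    """
    typed = dict(entry)
    # Combined Log Format timestamp, e.g. 21/May/2026:08:14:33 +0000
    # (%b is locale-dependent; this assumes English month names)
    typed['timestamp'] = datetime.strptime(
        entry['timestamp'], '%d/%b/%Y:%H:%M:%S %z'
    )
    typed['status'] = int(entry['status'])
    # The bytes field is "-" when no response body was sent
    typed['bytes'] = 0 if entry['bytes'] == '-' else int(entry['bytes'])
    return typed
```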

Step 2: Filter for Googlebot

Keep entries whose user agent contains "Googlebot", and exclude the image, video, and ads crawler variants if you want only the main web crawler:

def filter_googlebot(entries, crawler_type='main'):
    """
    crawler_type: 'main' for Googlebot web crawler,
                  'all' for all Google crawler variants
    """
    if crawler_type == 'main':
        return [
            e for e in entries
            if 'Googlebot/' in e['user_agent']
            and 'Googlebot-Image' not in e['user_agent']
            and 'Googlebot-Video' not in e['user_agent']
            and 'AdsBot-Google' not in e['user_agent']
        ]
    else:
        return [e for e in entries if 'Google' in e['user_agent']]

Note: this filtering does not verify that the request genuinely came from Google. For production analysis, verify each IP either with a reverse DNS lookup followed by a confirming forward lookup (the socket module handles both) or by matching it against Google's published crawler IP ranges. The Screaming Frog Log File Analyser does this automatically. For quick diagnostic work, user agent filtering alone is usually sufficient, but impostor bots do spoof the Googlebot string.
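A minimal sketch of DNS-based verification with the socket module follows. The function name `is_verified_googlebot`, the injectable lookup parameters, and the hostname suffix list are assumptions made for illustration; check the suffixes against Google's current documentation before relying on them.

```python
import socket

# Assumed hostname suffixes for Google crawlers; confirm against Google's docs
GOOGLE_HOST_SUFFIXES = ('.googlebot.com', '.google.com', '.googleusercontent.com')

def _reverse_lookup(ip):
    # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
    return socket.gethostbyaddr(ip)[0]

def _forward_lookup(hostname):
    # getaddrinfo returns tuples whose last item is the socket address
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}

def is_verified_googlebot(ip, reverse_lookup=_reverse_lookup,
                          forward_lookup=_forward_lookup):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup.

    The lookup functions are parameters so the check can be tested offline.
    IPv6 addresses are compared as plain strings, which may need normalizing.
    """
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_HOST_SUFFIXES):
            return False
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses
        return False
```

DNS lookups are slow, so in practice you would run this once per distinct IP and cache the result rather than calling it for every log entry.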

Step 3: Status Code Distribution

Calculate how Googlebot's requests distribute across HTTP status codes:

from collections import Counter

def status_code_distribution(googlebot_entries):
    status_counts = Counter(e['status'] for e in googlebot_entries)
    total = sum(status_counts.values())

    print(f"Total Googlebot requests: {total}")
    print("\nStatus code distribution:")
    for status, count in sorted(status_counts.items()):
        pct = (count / total) * 100
        print(f"  {status}: {count:,} ({pct:.1f}%)")

    return status_counts

A healthy site shows the majority of requests as 200. High percentages of 301/302 indicate redirect debt: links or sitemap entries that still point at old URLs. Any significant 404 percentage is actionable crawl waste. 5xx responses during Googlebot crawl windows are crawl-blocking errors that can surface as server errors in Search Console's page indexing (formerly coverage) report.
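When many individual codes appear, grouping them by class makes these signals easier to read. A small sketch, assuming entries shaped like the parser output with `status` as a three-digit string; `status_class_summary` is a hypothetical helper:

```python
from collections import Counter

def status_class_summary(googlebot_entries):
    """Return the percentage of requests per status class (2xx, 3xx, 4xx, 5xx)."""
    # The first digit of the status code identifies its class
    classes = Counter(e['status'][0] + 'xx' for e in googlebot_entries)
    total = sum(classes.values())
    if not total:
        return {}
    return {
        cls: round(count / total * 100, 1)
        for cls, count in sorted(classes.items())
    }
```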


Step 4: Top Crawled URLs and 404 Analysis

Identify the most frequently crawled URLs and isolate the ones returning 404:

def top_crawled_urls(googlebot_entries, n=50):
    url_counts = Counter(e['path'] for e in googlebot_entries)
    print(f"\nTop {n} most crawled URLs:")
    for url, count in url_counts.most_common(n):
        print(f"  {count:5d}  {url}")
    return url_counts

def top_404_urls(googlebot_entries, n=50):
    not_found = [e for e in googlebot_entries if e['status'] == '404']
    url_counts = Counter(e['path'] for e in not_found)

    print(f"\nTop {n} 404 URLs crawled by Googlebot:")
    for url, count in url_counts.most_common(n):
        print(f"  {count:5d}  {url}")

    return url_counts

The 404 URL list is an immediate action list. Sort by frequency and work from the top. The highest-frequency 404s represent the most crawl budget waste per fix applied. Each URL on that list should be traced to its source: an internal link, an external link, or a stale XML sitemap entry.
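One of those sources, the stale sitemap entry, can be checked mechanically. The sketch below compares 404 paths against the `<loc>` entries of a standard XML sitemap. `stale_sitemap_paths` is a hypothetical helper; it assumes a plain urlset sitemap (not a sitemap index) whose URLs are on the same host as the log.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def stale_sitemap_paths(sitemap_xml, not_found_paths):
    """Return the 404 paths that are still listed in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    sitemap_paths = set()
    for loc in root.findall('.//sm:loc', SITEMAP_NS):
        if not loc.text:
            continue
        parsed = urlparse(loc.text.strip())
        # Logs record path plus query string, so rebuild the same form
        path = parsed.path or '/'
        if parsed.query:
            path += '?' + parsed.query
        sitemap_paths.add(path)
    return sorted(set(not_found_paths) & sitemap_paths)
```

The second argument can be the keys of the Counter returned by top_404_urls.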

Step 5: Parameterized URL Detection

Flag URLs with query parameters that may be generating crawl waste:

from urllib.parse import urlparse, parse_qs

def parameterized_url_analysis(googlebot_entries, n=20):
    """Identify query-parameterized URLs consuming crawl budget."""
    param_entries = [
        e for e in googlebot_entries
        if '?' in e['path']
    ]

    if not param_entries:
        print("No parameterized URLs found in Googlebot requests.")
        return

    from collections import defaultdict
    param_patterns = defaultdict(list)

    for entry in param_entries:
        parsed = urlparse(entry['path'])
        params = parse_qs(parsed.query)
        param_keys = tuple(sorted(params.keys()))
        param_patterns[param_keys].append(entry['path'])

    print(f"\nQuery parameter patterns in Googlebot requests:")
    sorted_patterns = sorted(param_patterns.items(), key=lambda x: len(x[1]), reverse=True)

    # Honor the n argument instead of a hard-coded limit of 20
    for param_keys, urls in sorted_patterns[:n]:
        key_str = ', '.join(sorted(param_keys)) if param_keys else '(empty)'
        print(f"  [{key_str}] - {len(urls)} requests")

    return param_patterns

Parameter combinations appearing frequently indicate faceted navigation, filter pages, or session IDs being treated as distinct URLs by Googlebot. Each unique combination of parameter values is a separate URL that Googlebot crawls on its own, even if the content is nearly identical to the canonical page.
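To see which pages generate the most variants, you can count distinct parameterized URLs per base path. A sketch assuming the same entry dicts as above; `variants_per_base_path` is a hypothetical helper:

```python
from collections import defaultdict
from urllib.parse import urlparse

def variants_per_base_path(googlebot_entries, n=10):
    """Rank base paths by how many distinct parameterized URLs were crawled."""
    variants = defaultdict(set)
    for e in googlebot_entries:
        if '?' in e['path']:
            # Strip the query string to get the base path
            variants[urlparse(e['path']).path].add(e['path'])
    ranked = sorted(variants.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(base, len(urls)) for base, urls in ranked[:n]]
```

A base path with hundreds of variants is a likely candidate for canonical tags or robots.txt rules.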

Step 6: Putting It Together

A complete analysis run:

import sys

def analyze_logs(log_filepath):
    print(f"Analyzing: {log_filepath}\n")

    entries = parse_log_file(log_filepath)
    print(f"Total log entries: {len(entries):,}")

    googlebot = filter_googlebot(entries)
    print(f"Googlebot requests: {len(googlebot):,}")

    if not googlebot:
        print("No Googlebot entries found. Check log format or file path.")
        return

    status_code_distribution(googlebot)
    top_crawled_urls(googlebot, n=25)
    top_404_urls(googlebot, n=25)
    parameterized_url_analysis(googlebot)

if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else '/var/log/nginx/access.log'
    analyze_logs(log_file)

Run with: python googlebot_analysis.py /path/to/access.log

Scaling to Large Log Files

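For large sites, access logs can be several gigabytes per day. Reading line by line avoids loading the raw file into memory at once, but parse_log_file still collects every parsed entry in a list, so memory use grows with the file. For files larger than RAM, filter and aggregate while streaming instead of building the full list. For multi-month analysis, use the pandas library for more efficient aggregation after parsing: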

import pandas as pd

def load_to_dataframe(entries):
    df = pd.DataFrame(entries)
    df['status'] = df['status'].astype(int)
    return df
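If even the parsed entries do not fit in memory, one option is to aggregate while streaming rather than building a list first. The sketch below is an assumption-laden variant of the earlier steps, not a drop-in replacement: `LINE_RE` is a trimmed version of the Step 1 pattern, `stream_googlebot_counts` is a hypothetical name, and the filter is a simple user agent substring check.

```python
import re
from collections import Counter

# Trimmed pattern: captures only the fields the aggregates below need
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)

def stream_googlebot_counts(lines):
    """Count statuses and 404 paths for Googlebot without storing entries.

    `lines` can be any iterable of log lines, including an open file object.
    """
    status_counts = Counter()
    not_found_counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or 'Googlebot/' not in match['user_agent']:
            continue
        status_counts[match['status']] += 1
        if match['status'] == '404':
            not_found_counts[match['path']] += 1
    return status_counts, not_found_counts
```

Pass an open file, for example one opened with encoding='utf-8' and errors='replace' as in Step 1, and only the counters stay in memory. Memory still grows with the number of distinct 404 paths.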

With a DataFrame, you can run cross-month comparisons by loading multiple log files and concatenating the results. Comparing this month's 404 URL list against last month's shows whether crawl waste is improving or growing. The pandas groupby and value_counts methods handle the aggregations cleanly, and the result can be exported to CSV for inclusion in audit reports.
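A sketch of the month-over-month 404 comparison, assuming two DataFrames built with load_to_dataframe (so `status` is an integer column and `path` is present). `compare_404s` is a hypothetical helper.

```python
import pandas as pd

def compare_404s(df_prev, df_curr):
    """Compare per-URL 404 counts between two periods."""
    prev = df_prev.loc[df_prev['status'] == 404, 'path'].value_counts()
    curr = df_curr.loc[df_curr['status'] == 404, 'path'].value_counts()
    # Align the two counts on URL; URLs missing from one period get 0
    comparison = pd.concat(
        [prev.rename('previous'), curr.rename('current')], axis=1
    ).fillna(0).astype(int)
    comparison['change'] = comparison['current'] - comparison['previous']
    # Largest increases first
    return comparison.sort_values('change', ascending=False)
```

The result can be written out with comparison.to_csv('404_comparison.csv') for an audit report; the file name here is only an example.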

From Code to Insight

The outputs of this analysis -- status code distribution, top 404 URLs, parameterized URL volumes -- translate directly into SEO action items. The code is a starting point; the interpretation and remediation work is where the analysis creates value.

The most actionable finding on sites running this analysis for the first time: a portion of Googlebot's requests return 404 for URLs that have not been linked internally in months. These persist because external links, stale XML sitemaps, and Googlebot's historical URL graph continue to surface them. Tracing each high-frequency 404 back to its source identifies where to apply the fix: update the internal link, clean the sitemap entry, or add a 301 redirect to the current destination. The Combined Log Format referer field helps here, but Googlebot's own requests usually carry an empty referer ("-", as in the sample entry above), so check the referer values on other visitors' requests to the same URLs. The log data shows which URLs to prioritize; the referrer data shows where the wasted requests originate.
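A sketch of that referer lookup, run against all parsed entries rather than only the Googlebot subset so that referers sent by browsers are included. `referers_for_paths` is a hypothetical helper; it assumes the dict shape produced by parse_log_line.

```python
from collections import Counter, defaultdict

def referers_for_paths(entries, paths):
    """Collect non-empty referers for 404 requests to the given paths."""
    wanted = set(paths)
    referers = defaultdict(Counter)
    for e in entries:
        if (e['path'] in wanted
                and e['status'] == '404'
                and e['referer'] not in ('', '-')):  # "-" means no referer sent
            referers[e['path']][e['referer']] += 1
    # Most frequent referer first for each path
    return {path: counts.most_common() for path, counts in referers.items()}
```

Paths that return no referers at all point toward sitemaps or Googlebot's historical URL graph as the likely source.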

For the broader context on why server log analysis matters for SEO and how to interpret what you find, see How to Use Server Logs for SEO: Uncovering Crawl Issues Your Analytics Miss. The technical SEO services at 137Foundry include this type of analysis as a standard part of crawlability audits for larger sites.
