137Foundry

Posted on Jul 1

How to Write a Python Script That Finds Cannibalized Queries in a Search Console Export

#python #seo #tutorial

If you manage SEO for a site with more than a few hundred pages, running the cannibalization checks manually in Google Search Console gets old fast. The good news is that Search Console exports its Performance data in a format that pandas will happily chew through, and about 30 lines of Python will surface every cannibalized query on your site sorted by impact.

This walkthrough builds that script from scratch, explains what each check is doing, and produces a CSV you can hand to a content strategist.

What you need before you start

Two exports from Search Console. Open the Performance report, set the date range to the last 90 days, and export twice: once grouped by query, and once grouped by both page and query. The Search Console interface has an "export" button in the top right that produces CSV or Google Sheets.

Save both exports as CSVs. The rest of the walkthrough assumes they are named queries.csv and page_query.csv in the same directory as your script.

Setting up

import pandas as pd

queries = pd.read_csv('queries.csv')
pq = pd.read_csv('page_query.csv')

print(queries.head())
print(pq.head())

The queries.csv has one row per query with total clicks, impressions, CTR, and position. The page_query.csv has one row per unique page-query pair. The pair-level data is what surfaces cannibalization; the query-level data is used for scoring which cases matter most.

If pandas is not installed, pip install pandas handles it. The pandas documentation is worth bookmarking; the DataFrame operations below are all covered in the getting-started section.

Finding pages competing for the same query

counts = pq.groupby('Query')['Top pages'].nunique().reset_index()
counts.columns = ['Query', 'UniquePages']
cannibalized = counts[counts['UniquePages'] > 1]
print(f"Queries with more than one page: {len(cannibalized)}")

This is the base cannibalization signal: any query where more than one page received impressions is a candidate. On a typical site with a few hundred pages, this will surface hundreds or thousands of candidate queries. Most of them will be low-signal.

Note: the exact column names in the Search Console export can differ slightly depending on your locale. If your export uses "Top pages" instead of the query-page pair name, adjust accordingly. A quick print(pq.columns.tolist()) will show the actual column names.

Filtering to cases that actually matter

threshold_impressions = 100
significant_pairs = pq[pq['Impressions'] >= threshold_impressions]
significant_counts = significant_pairs.groupby('Query')['Top pages'].nunique().reset_index()
significant_counts.columns = ['Query', 'UniquePages']
worth_fixing = significant_counts[significant_counts['UniquePages'] > 1]
print(f"Queries with more than one page and 100+ impressions per page: {len(worth_fixing)}")

The threshold matters. A query where two pages each got three impressions in the last 90 days is not really cannibalization; it is noise. A query where two pages each got 500+ impressions and one is ranking at position 8 while the other ranks at position 14 is real cannibalization worth an hour of work to fix.

Adjust the threshold_impressions value based on your site's traffic. For a small site, 20 or 30 might be the right threshold. For a large site, 500 might be more appropriate.

Scoring by impact

enriched = worth_fixing.merge(
    queries[['Query', 'Impressions', 'Clicks', 'Position']],
    on='Query',
    how='left'
)
enriched = enriched.sort_values('Impressions', ascending=False)
print(enriched.head(20))

Sorted by total impressions, the top 20 queries in this list are the ones most worth fixing. High-impression cannibalization is expensive; low-impression cannibalization is cosmetic.

You can refine the scoring further by weighting the CTR. A high-impression query with a compressed CTR (say, below 2 percent at position 8) is doubly worth fixing because you are leaking clicks in addition to leaking rankings.

enriched['CTR'] = enriched['Clicks'] / enriched['Impressions']
enriched['Impact'] = enriched['Impressions'] * (0.05 - enriched['CTR']).clip(lower=0)
enriched = enriched.sort_values('Impact', ascending=False)

The impact score here uses a rough heuristic: multiply impressions by the "missing" CTR (assuming 5 percent is a reasonable target CTR for the average position). Queries with lots of impressions and CTR much lower than that heuristic bubble to the top.

Attaching the competing pages to each query

def get_competing_pages(query, pq_df, min_impressions=100):
    matches = pq_df[(pq_df['Query'] == query) & (pq_df['Impressions'] >= min_impressions)]
    return list(matches['Top pages'])

enriched['CompetingPages'] = enriched['Query'].apply(
    lambda q: get_competing_pages(q, pq)
)

Now each row has the list of URLs competing for that query. This is what the content strategist actually needs to make consolidation decisions.

Exporting the actionable report

report_cols = ['Query', 'Impressions', 'Clicks', 'CTR', 'Position', 'UniquePages', 'CompetingPages', 'Impact']
enriched[report_cols].to_csv('cannibalization_report.csv', index=False)
print("Report saved to cannibalization_report.csv")

Open the CSV in Google Sheets or Excel. Sort by Impact descending. The top 10 to 20 rows are the cases worth investigating this week.

A worked example

Suppose the top row of your output looks like this:

Query: "how to fix keyword cannibalization"
Impressions: 3,200
Clicks: 42
CTR: 1.3%
Position: 11.4
UniquePages: 3
CompetingPages: [blog/keyword-cannibalization-guide, blog/what-is-keyword-cannibalization, services/technical-seo-audit]

That is expensive cannibalization. Three of your pages are competing for a moderately high-volume query. The overall CTR is well below what any of them would earn if they were ranking alone. The services page is probably where the query should actually land (commercial intent). The two blog posts are the ones to consolidate or differentiate.

The specific decision framework for cases like this is covered in more depth in the free SEO resources by 137Foundry, including the three-part decision (redirect, canonical, or rewrite) for each case.

Automating the ongoing check

The above script runs against a one-time export. For ongoing monitoring, the Google Search Console API exposes the same data programmatically. Wrap the script above in an authenticated API call, schedule it to run weekly with cron, and email the diff (new cannibalization since last week) to whoever owns SEO on your team.

The Search Analytics for Sheets add-on is another lower-friction option if setting up API authentication is more work than you want to do. It exports the same data into Google Sheets on a schedule, which you can then pull into any script or dashboard.

The behavioral point

None of this matters if the output does not turn into consolidation decisions. Running the script once and producing a beautiful CSV that nobody acts on is not a win. The workflow that produces actual results is: run the script weekly, review the top 5 rows, decide the consolidation for each (redirect, canonical, or rewrite), implement, and re-run the following week to see whether the cannibalization has resolved.

The compounding effect over six months is dramatic. A site that resolves 5 cannibalization cases per week is a different site from one that keeps 100 cases in a backlog forever.

The one thing to instrument

Once the pipeline works, instrument it. Track how many cannibalization cases were open at the start of each month, how many were resolved during the month, and how many new ones appeared. If the "new cases per month" number is not trending down, your pre-publish process is not catching the problem at the source. The point of the ongoing script is to expose the underlying rate at which cannibalization is being created, not just to fix the accumulated debt.

That single metric is the difference between an SEO team that is winning against cannibalization and one that is running in place.

DEV Community