DEV Community

Jay Shah

Combining Google Search Console, GA4, and DataForSEO in One Python Pipeline

If you're running SEO for a content site, your data lives in three places: Google Search Console has your rankings and impressions, GA4 has your traffic and engagement, and a third-party tool like DataForSEO has competitor positions and keyword difficulty. Checking each dashboard separately wastes time and misses the connections between them.

I built a Python pipeline that pulls from all three APIs, merges the data into a single DataFrame, and scores every keyword by opportunity. This runs on Shatranj Live, a chess analytics platform where I publish 40+ SEO-optimized articles. Here's how it works.

Architecture Overview

┌──────────────────┐    ┌──────────────┐    ┌───────────────┐
│  Google Search   │    │    GA4       │    │  DataForSEO   │
│    Console       │    │  (via API)   │    │   (via API)   │
│                  │    │              │    │               │
│  - Rankings      │    │  - Sessions  │    │  - SERP data  │
│  - Impressions   │    │  - Bounce    │    │  - KD scores  │
│  - CTR           │    │  - Duration  │    │  - Competitor │
│  - Clicks        │    │  - Events    │    │    positions  │
└────────┬─────────┘    └──────┬───────┘    └───────┬───────┘
         │                     │                     │
         └─────────┬───────────┘                     │
                   │                                 │
            ┌──────▼──────┐                          │
            │   Merge on  │◄─────────────────────────┘
            │  keyword /  │
            │    URL      │
            └──────┬──────┘
                   │
            ┌──────▼──────┐
            │ Opportunity │
            │   Scorer    │
            └──────┬──────┘
                   │
            ┌──────▼──────┐
            │  Prioritized│
            │   Actions   │
            └─────────────┘

Setting Up the Three APIs

Google Search Console

GSC gives you what no other tool can: your actual ranking data straight from Google.

from datetime import datetime, timedelta

from google.oauth2 import service_account
from googleapiclient.discovery import build

class GoogleSearchConsole:
    def __init__(self, credentials_path: str, site_url: str):
        creds = service_account.Credentials.from_service_account_file(
            credentials_path,
            scopes=['https://www.googleapis.com/auth/webmasters.readonly']
        )
        self.service = build('searchconsole', 'v1', credentials=creds)
        self.site_url = site_url

    def get_keyword_positions(self, days: int = 90, limit: int = 500) -> list:
        end = datetime.now().strftime('%Y-%m-%d')
        start = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

        response = self.service.searchanalytics().query(
            siteUrl=self.site_url,
            body={
                'startDate': start,
                'endDate': end,
                'dimensions': ['query', 'page'],
                'rowLimit': limit,
                'dimensionFilterGroups': [{
                    'filters': [{
                        'dimension': 'country',
                        'expression': 'ind'  # ISO 3166-1 alpha-3; 'ind' = India
                    }]
                }]
            }
        ).execute()

        rows = []
        for row in response.get('rows', []):
            rows.append({
                'query': row['keys'][0],
                'page': row['keys'][1],
                'clicks': row['clicks'],
                'impressions': row['impressions'],
                'ctr': row['ctr'],
                'position': row['position'],
            })
        return rows

The key fields: query (what users searched), position (your average rank), ctr (click-through rate), and impressions (how often you appeared).
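One caveat: a single query returns at most `rowLimit` rows, and the API caps each request at 25,000. To pull a larger property completely, page through with `startRow`. Here's a sketch of a generic pager — `fetch_all_rows` is my own helper, not part of the client above; you'd pass it a closure that runs the `searchanalytics().query()` call with the given offset:

```python
def fetch_all_rows(query_page, page_size=25_000, max_rows=100_000):
    """Page through a Search Console query using startRow.

    query_page(start_row) should execute the query with 'startRow': start_row
    and 'rowLimit': page_size in the body, returning the list under
    response['rows'] (empty or short when the data is exhausted).
    """
    rows = []
    start = 0
    while start < max_rows:
        page = query_page(start)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short page means there's no more data
        start += page_size
    return rows
```

In practice you'd call it as `fetch_all_rows(lambda s: run_gsc_query(start_row=s))`, keeping the query body identical across pages so the offsets line up.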

GA4

GA4 tells you what happens after the click. I use the GA4 Data API (not the old Universal Analytics API):

from datetime import datetime, timedelta

from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    RunReportRequest, DateRange, Dimension, Metric
)

class GoogleAnalytics:
    def __init__(self, credentials_path: str, property_id: str):
        self.client = BetaAnalyticsDataClient.from_service_account_json(
            credentials_path
        )
        self.property_id = f'properties/{property_id}'

    def get_page_metrics(self, days: int = 90) -> list:
        end = datetime.now().strftime('%Y-%m-%d')
        start = (datetime.now() - timedelta(days=days)).strftime('%Y-%m-%d')

        request = RunReportRequest(
            property=self.property_id,
            date_ranges=[DateRange(start_date=start, end_date=end)],
            dimensions=[Dimension(name='pagePath')],
            metrics=[
                Metric(name='sessions'),
                Metric(name='averageSessionDuration'),
                Metric(name='bounceRate'),
                Metric(name='engagedSessions'),
                Metric(name='screenPageViews'),
            ],
            limit=500,
        )

        response = self.client.run_report(request)
        rows = []
        for row in response.rows:
            rows.append({
                'page': row.dimension_values[0].value,
                'sessions': int(row.metric_values[0].value),
                'avg_duration': float(row.metric_values[1].value),
                'bounce_rate': float(row.metric_values[2].value),
                'engaged_sessions': int(row.metric_values[3].value),
                'pageviews': int(row.metric_values[4].value),
            })
        return rows

DataForSEO

DataForSEO fills the gap: keyword difficulty, search volume, and competitor SERP positions. Their API is REST-based and straightforward:

import requests
import base64

class DataForSEO:
    def __init__(self, login: str, password: str):
        cred = base64.b64encode(f'{login}:{password}'.encode()).decode()
        self.headers = {
            'Authorization': f'Basic {cred}',
            'Content-Type': 'application/json'
        }
        self.base_url = 'https://api.dataforseo.com/v3'

    def get_keyword_data(self, keywords: list, location: int = 2356) -> list:
        """Get search volume and difficulty for keywords.
        location 2356 = India, 2840 = United States"""

        payload = [{
            'keywords': keywords[:700],  # API limit per request
            'location_code': location,
            'language_code': 'en',
        }]

        resp = requests.post(
            f'{self.base_url}/keywords_data/google_ads/search_volume/live',
            headers=self.headers,
            json=payload
        )
        resp.raise_for_status()

        results = []
        for task in resp.json().get('tasks', []):
            for item in task.get('result', []):
                results.append({
                    'keyword': item['keyword'],
                    # these fields can come back null, so coerce to 0
                    'volume': item.get('search_volume') or 0,
                    'competition': item.get('competition') or 0,
                    'cpc': item.get('cpc') or 0,
                    # not part of the search_volume response; stays 0 unless
                    # you also call a keyword-difficulty endpoint
                    'kd': item.get('keyword_difficulty') or 0,
                })
        return results

    def get_serp_competitors(self, keyword: str, location: int = 2356) -> list:
        """Get top 10 SERP results for a keyword."""

        payload = [{
            'keyword': keyword,
            'location_code': location,
            'language_code': 'en',
            'depth': 10,
        }]

        resp = requests.post(
            f'{self.base_url}/serp/google/organic/live/regular',
            headers=self.headers,
            json=payload
        )
        resp.raise_for_status()

        competitors = []
        for task in resp.json().get('tasks', []):
            for result in task.get('result', []):
                for item in result.get('items', []):
                    if item['type'] == 'organic':
                        competitors.append({
                            'position': item['rank_absolute'],
                            'url': item['url'],
                            'domain': item['domain'],
                            'title': item['title'],
                        })
        return competitors

Merging the Data

This is where it gets useful. Each API uses different identifiers: GSC uses query + page URL, GA4 uses page path, DataForSEO uses keyword. The merge logic normalizes these:

import pandas as pd

class DataAggregator:
    def __init__(self, gsc, ga4, dataforseo):
        self.gsc = gsc
        self.ga4 = ga4
        self.dataforseo = dataforseo

    def build_unified_report(self, days: int = 90) -> pd.DataFrame:
        # Pull from all three sources
        gsc_data = pd.DataFrame(self.gsc.get_keyword_positions(days=days))
        ga4_data = pd.DataFrame(self.ga4.get_page_metrics(days=days))

        # Normalize full GSC URLs down to paths so they join with GA4's
        # pagePath (urlparse also drops query strings, matching GA4's behavior)
        from urllib.parse import urlparse
        gsc_data['page_path'] = gsc_data['page'].apply(
            lambda u: urlparse(u).path or '/'
        )
        ga4_data['page_path'] = ga4_data['page']

        # Merge GSC + GA4 on page path
        merged = gsc_data.merge(
            ga4_data,
            on='page_path',
            how='left',
            suffixes=('_gsc', '_ga4')
        )

        # Enrich top keywords with DataForSEO data
        top_keywords = merged.nlargest(100, 'impressions')['query'].unique().tolist()
        kd_data = pd.DataFrame(
            self.dataforseo.get_keyword_data(top_keywords)
        )

        # Merge keyword difficulty
        final = merged.merge(
            kd_data,
            left_on='query',
            right_on='keyword',
            how='left'
        )

        return final
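Left joins fail silently: when GSC paths don't line up with GA4's `pagePath` (trailing slashes, query strings, redirects), you get NaNs instead of errors. Pandas' merge `indicator` makes the match rate visible — the toy frames here stand in for `gsc_data` and `ga4_data`:

```python
import pandas as pd

# toy stand-ins for gsc_data and ga4_data from the pipeline
gsc = pd.DataFrame({'page_path': ['/a', '/b', '/c/'], 'clicks': [10, 5, 2]})
ga4 = pd.DataFrame({'page_path': ['/a', '/b'], 'sessions': [100, 40]})

checked = gsc.merge(ga4, on='page_path', how='left', indicator=True)
match_rate = (checked['_merge'] == 'both').mean()
unmatched = checked.loc[checked['_merge'] == 'left_only', 'page_path'].tolist()
print(f'{match_rate:.0%} matched; unmatched paths: {unmatched}')
```

If the match rate drops unexpectedly after a site change, the unmatched list usually points straight at the normalization bug.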

The Opportunity Scorer

Raw merged data is useful, but ranking keywords by opportunity is where the pipeline pays for itself. I score each keyword on 8 weighted factors; four of them (volume, position, CTR gap, competition) are computed below, and the rest fall back to a neutral 50:

def score_opportunity(row: dict) -> float:
    """Score a keyword opportunity from 0-100."""
    weights = {
        'volume': 0.25,
        'position': 0.20,
        'intent': 0.20,
        'competition': 0.15,
        'cluster': 0.10,
        'ctr_gap': 0.05,
        'freshness': 0.05,
        'trend': 0.05,
    }

    scores = {}

    # Volume score: higher volume = higher score
    vol = row.get('volume', 0)
    scores['volume'] = min(vol / 5000, 1.0) * 100

    # Position score: positions 4-15 are the sweet spot
    # (close enough to reach top 3, but not already there)
    pos = row.get('position', 100)
    if 4 <= pos <= 10:
        scores['position'] = 90
    elif 11 <= pos <= 15:
        scores['position'] = 70
    elif 1 <= pos <= 3:
        scores['position'] = 30  # already ranking well
    else:
        scores['position'] = max(0, 50 - (pos - 15) * 2)

    # CTR gap: expected CTR for position vs actual CTR
    expected_ctr = {1: 0.28, 2: 0.15, 3: 0.11, 4: 0.08,
                    5: 0.06, 6: 0.05, 7: 0.04, 8: 0.03}
    exp = expected_ctr.get(int(pos), 0.02)
    actual = row.get('ctr', 0)
    # clamp at 0 so pages over-performing their expected CTR don't go negative
    scores['ctr_gap'] = max(0.0, min((exp - actual) / max(exp, 0.01), 1.0)) * 100

    # Competition: lower KD = higher opportunity
    kd = row.get('kd', 50)
    scores['competition'] = max(0, 100 - kd)

    # Weighted total; factors without an explicit score default to a neutral 50
    total = sum(scores.get(k, 50) * v for k, v in weights.items())
    return round(total, 1)
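The `intent` factor above has weight but no scoring logic, so it sits at the neutral 50. If you want to fill it in, a crude cue-matching heuristic is enough to start — this is my own sketch, not part of the original scorer, and the cue lists are assumptions you'd tune for your niche:

```python
def score_intent(query: str) -> float:
    """Crude intent score: commercial cues high, informational cues low.

    Heuristic sketch only -- real intent classification needs SERP features
    or a trained model; this just nudges the weighted total.
    """
    q = query.lower()
    commercial = ('best', 'buy', 'vs', 'review', 'price')
    informational = ('how to', 'what is', 'meaning', 'definition')
    if any(cue in q for cue in commercial):
        return 80.0
    if any(cue in q for cue in informational):
        return 40.0
    return 50.0  # no cue matched: same neutral default as the scorer
```

Plug it in with `scores['intent'] = score_intent(row.get('query', ''))` inside `score_opportunity` and the 0.20 intent weight starts doing work.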

Putting It Together

Here's the full script I run weekly:

#!/usr/bin/env python3
"""Weekly SEO opportunity report."""

from data_aggregator import DataAggregator
from google_search_console import GoogleSearchConsole
from google_analytics import GoogleAnalytics
from dataforseo import DataForSEO
from opportunity_scorer import score_opportunity  # the scorer defined above
from dotenv import load_dotenv
import os

load_dotenv('config/.env')

gsc = GoogleSearchConsole(
    credentials_path='credentials/ga4-credentials.json',
    site_url='sc-domain:shatranj.live'
)
ga4 = GoogleAnalytics(
    credentials_path='credentials/ga4-credentials.json',
    property_id=os.environ['GA4_PROPERTY_ID']
)
dforseo = DataForSEO(
    login=os.environ['DATAFORSEO_LOGIN'],
    password=os.environ['DATAFORSEO_PASSWORD']
)

# Build unified report
aggregator = DataAggregator(gsc, ga4, dforseo)
report = aggregator.build_unified_report(days=90)

# Score opportunities
report['opportunity_score'] = report.apply(score_opportunity, axis=1)

# Top 20 opportunities
top = report.nlargest(20, 'opportunity_score')
print(top[['query', 'position', 'clicks', 'volume', 'kd', 'opportunity_score']])

Sample output from a recent run on Shatranj Live:

                     query  position  clicks  volume  kd  opportunity_score
candidates tournament 2026       4.2     312    2400  45               82.3
fide rating system               6.1     189    1900  52               78.6
chess rating categories          8.3      94    1600  38               76.1
how to get fide rating          11.2      67    1100  28               74.8
elo rating chess                12.4      45    3200  61               71.2

The "sweet spot" keywords are positions 4-15 with decent volume and low-to-medium KD. These are the ones where a content refresh or new article can push you into the top 3.
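Filtering the report down to that sweet spot is one `DataFrame.query` call. The column names match the pipeline's output; the volume and KD thresholds here are my own cutoffs, not fixed rules:

```python
import pandas as pd

# illustrative rows using the pipeline's column names
report = pd.DataFrame({
    'query': ['a', 'b', 'c', 'd'],
    'position': [2.1, 6.3, 12.0, 25.0],
    'volume': [5000, 1900, 1100, 900],
    'kd': [70, 52, 28, 15],
})

# positions 4-15, enough volume, low-to-medium difficulty
sweet = report.query('4 <= position <= 15 and volume >= 500 and kd <= 60')
print(sweet['query'].tolist())  # ['b', 'c']
```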

What I Learned Running This

GSC is the ground truth. DataForSEO's volume estimates are directional, not precise. GSC tells you exactly how many impressions and clicks you actually got. Always prioritize GSC data when the two conflict.

GA4's engagement metrics catch content problems. If a page ranks well (GSC position 5) but has an 85% bounce rate (GA4), the content isn't matching search intent. That's a rewrite opportunity, not a link-building one.
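That rank-well-but-bounce pattern is easy to flag directly in the merged frame. The cutoffs below (top 10 by position, bounce rate over 0.7) are my own; the columns come from `build_unified_report`:

```python
import pandas as pd

# sample merged GSC+GA4 rows; real ones come from build_unified_report
df = pd.DataFrame({
    'page': ['/elo-rating', '/fide-guide', '/openings'],
    'position': [5.0, 4.2, 18.0],
    'bounce_rate': [0.85, 0.40, 0.90],
})

# ranks well but bounces hard -> likely an intent mismatch worth a rewrite
rewrite_candidates = df[(df['position'] <= 10) & (df['bounce_rate'] > 0.7)]
print(rewrite_candidates['page'].tolist())  # ['/elo-rating']
```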

DataForSEO's value is competitor intelligence. Knowing that position 1 for "candidates tournament 2026" is held by Chess.com with a 5,000-word guide tells you what you need to beat. Pull their SERP data, analyze the gap, and write accordingly.
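The `get_serp_competitors` output aggregates nicely across keywords: counting which domains occupy the most top-10 slots tells you who you're really up against. The rows below are made up but match the method's output shape:

```python
from collections import Counter

# shape matches get_serp_competitors(); sample data is illustrative
serp_rows = [
    {'position': 1, 'domain': 'chess.com'},
    {'position': 2, 'domain': 'lichess.org'},
    {'position': 3, 'domain': 'chess.com'},
    {'position': 4, 'domain': 'fide.com'},
]

# tally top-10 slots held by each domain
domain_counts = Counter(row['domain'] for row in serp_rows)
print(domain_counts.most_common(2))
```

Run this over the SERPs for your top 20 opportunity keywords and the dominant one or two competitors fall out immediately.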

Run it weekly, not daily. Rankings fluctuate daily. Weekly aggregation smooths the noise and shows real trends. I run this every Monday morning and feed the output into my content planning for the week.

Setup Checklist

  1. GSC: Enable Search Console API in Google Cloud Console, create a service account, add it as a user in GSC
  2. GA4: Enable Analytics Data API, use the same service account, grant Viewer access to your GA4 property
  3. DataForSEO: Sign up at dataforseo.com, get API credentials (they have a pay-as-you-go model)
  4. Dependencies: pip install google-api-python-client google-analytics-data pandas python-dotenv requests
  5. Credentials: Store in .env file, never commit to git
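For reference, a config/.env matching the variable names used in the script above would look like this (placeholder values — and add config/.env to your .gitignore):

```
GA4_PROPERTY_ID=123456789
DATAFORSEO_LOGIN=you@example.com
DATAFORSEO_PASSWORD=your-api-password
```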

The full pipeline runs in about 30 seconds for 500 keywords. The insights it surfaces in that time would take hours of manual dashboard-hopping.


I use this pipeline for Shatranj Live, where we publish SEO-optimized chess content. Track live Candidates Tournament 2026 standings and explore player ratings.
