Devil Scrapes

Posted on May 31

Google AI Overview Tracker: 8-selector battery + citation drift telemetry

#webscraping #python #apify #seo

Quick answer: Google publishes no API for AI Overview citations. The only way to get the data programmatically is to render Google SERPs in a real browser and parse the citation carousel client-side. The Google AI Overview Citation Tracker does exactly that — one Pydantic-validated row per (query × cited source) at $5.50 per 1,000 rows, with selector-drift telemetry so you know when Google rotates its markup before your dashboard goes dark.

Answer Engine Optimization has a measurement problem no major SEO platform has solved. Ahrefs, Semrush, and Sistrix track your domain's SERP rank, but AI Overview appears above position 1 for roughly 30% of informational queries in 2026, and its citations are drawn from a different pool than your normal rankings. You can rank position 1 and still be invisible in AI Overview while a competitor's 2022 blog post gets cited six times — with no backlink your SEO tools can detect. That gap, structured per-query citation data you can query against a competitor list, is what this Actor closes.

What is Google AI Overview? 🔎

Google AI Overview is the AI-generated summary block at the top of Google's search results for informational queries. It rolled out broadly in the US in May 2024 and expanded globally through 2025 — Google's generative-AI answer inside the SERP, its response to Perplexity and ChatGPT Search. For a query like "what causes inflation", it renders a 3-5 sentence synthesis with a carousel of 4-8 cited sources below it.

The citations are the commercially interesting part. Those cited domains get free brand impressions, click-throughs, and authority signals that traditional SEO tools never surface. The shift is large enough that some publishers have watched informational-query traffic fall 20-40% even while their SERP rank held steady.

Does Google AI Overview have an API? 📡

No. As of 2026, Google publishes no official API, export endpoint, or structured feed for AI Overview citations. The only programmatic surface is what the browser renders client-side. Google's Search Central documentation covers AEO best practices but provides no access to citation data. To collect it at scale you render real Google SERPs in a real browser and parse the output yourself — the entire reason this Actor exists instead of a three-line API call.

What the data looks like

Every citation in an AI Overview carousel produces one flat, typed row:

{
  "query": "what causes inflation",
  "country": "us",
  "language": "en",
  "ai_overview_appeared": true,
  "ai_overview_text_excerpt": "Inflation is caused by a combination of demand-pull factors, cost-push factors...",
  "citation_position": 1,
  "source_domain": "imf.org",
  "source_url": "https://www.imf.org/en/Publications/fandd/issues/Series/Back-to-Basics/Inflation",
  "source_title": "Inflation: Prices on the Rise",
  "selector_used": "div[aria-label=\"AI Overview\"]",
  "scraped_at": "2026-05-16T20:50:00.000Z"
}

When AI Overview did not appear for a query, the Actor still emits a row — ai_overview_appeared: false, all citation fields null. That absence is itself a valid AEO signal: you need to know which queries don't trigger AI Overview today, because that changes.

Eleven fields total, validated through Pydantic v2 ResultRow.model_validate before writing. Drop it straight into BigQuery, Sheets, or a pandas pivot — no positional-array wrangling on your side.
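To make the row contract concrete, here is a minimal stdlib sketch of the same shape check. It is a hand-rolled stand-in, not the Actor's Pydantic model: `check_row` and the two constants are hypothetical names, and the choice of which four fields count as "citation fields" is an assumption.

```python
# The eleven field names from the sample row above.
EXPECTED_FIELDS = {
    "query", "country", "language", "ai_overview_appeared",
    "ai_overview_text_excerpt", "citation_position", "source_domain",
    "source_url", "source_title", "selector_used", "scraped_at",
}
# Assumption: these are the fields that are null on a marker row.
CITATION_FIELDS = ("citation_position", "source_domain", "source_url", "source_title")


def check_row(row: dict) -> str:
    """Return 'citation' or 'marker'; raise ValueError on a malformed row."""
    missing = EXPECTED_FIELDS - row.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if row["ai_overview_appeared"]:
        return "citation"
    # Marker row: AI Overview did not appear, so no citation data is expected.
    if any(row[field] is not None for field in CITATION_FIELDS):
        raise ValueError("marker row carries citation data")
    return "marker"
```

Running a check like this on your side of the pipeline is cheap insurance before a warehouse load, even though the Actor already validates rows before writing them.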

The naive approach (and why it falls apart) 🔩

The mental model most people start with: open DevTools, find whatever request the SERP makes, replay it in Python. Three failure modes kill that before the first result lands.

1. Google hard-blocks datacenter IPs. Our recon showed the sorry/index reCAPTCHA interstitial appearing within one second for direct-IP requests, regardless of fingerprint quality. Proxy is load-bearing, not optional. We thread Apify residential proxies, rotate the session ID per query (Apify's session_id regex requires ^[\w._~]+$ — no hyphens), and fall back to BUYPROXIES94952 when residential is unavailable on your plan.

2. AI Overview lazy-renders client-side. The carousel appears 5-7 seconds after domcontentloaded via a separate async render pass — a tool that scrapes the raw HTML response gets nothing, because the container does not exist in the initial DOM. We render with Camoufox (the Firefox fork with anti-detection patches our org mandates per ADR-0002) and wait a configurable 4-15 seconds for the overlay to settle before probing.

3. Google rotates the AI Overview markup. This kills scrapers quietly. Since launch in May 2024, the container's identifying attributes have changed at least three times. A scraper that hardcodes div[aria-label="AI Overview"] works until Google A/B-tests a new attribute, then silently returns zero citations.

We absorb all three. We rotate browser fingerprints through Camoufox's Firefox TLS and navigator stack, and on 408 / 429 / 5xx or a CAPTCHA intercept we rotate the proxy session and retry once before emitting a marker row. We back off when Google rate-limits, and surface partial success with a clear Actor.set_status_message — we never silently return an empty dataset. The selector_used field makes drift detection a single SQL query, which brings us to the most interesting part of this build.
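The session rotation and retry-once behavior can be sketched roughly as follows. This is an illustration under stated assumptions, not the Actor's source: `fetch` is a hypothetical callable that loads one SERP through a proxy session and returns `(status, captcha_seen, html)`, and the helper names are made up for this example.

```python
import re
import uuid
from typing import Callable, Optional, Tuple

# Session ID pattern quoted above: word characters, dot, underscore, tilde. No hyphens.
SESSION_ID_RE = re.compile(r"^[\w._~]+$")


def new_session_id(prefix: str = "aio") -> str:
    """Build a fresh session ID; uuid4().hex contains no hyphens."""
    session_id = f"{prefix}_{uuid.uuid4().hex}"
    if not SESSION_ID_RE.fullmatch(session_id):
        raise ValueError(f"invalid session id: {session_id}")
    return session_id


def is_retryable(status: int, captcha_seen: bool) -> bool:
    """408, 429, any 5xx, or a CAPTCHA intercept triggers a retry."""
    return captcha_seen or status in (408, 429) or 500 <= status <= 599


def fetch_with_one_retry(
    fetch: Callable[[str], Tuple[int, bool, Optional[str]]],
) -> Optional[str]:
    """Try once, then once more on a fresh session.

    Returns the HTML, or None so the caller can emit a marker row.
    """
    for _attempt in range(2):
        status, captcha_seen, html = fetch(new_session_id())
        if not is_retryable(status, captcha_seen):
            return html
    return None
```

Generating a new ID per attempt is what forces the proxy to hand out a different exit, so the retry does not hit the same blocked IP.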

The Actor: 8-selector fall-through battery 🎯

I packaged this as an Apify Actor: Google AI Overview Citation Tracker. The selector battery is the load-bearing decision — eight selectors probed in priority order, first hit wins:

Priority	Selector	Origin
1	`div[aria-label="AI Overview"]`	Current canonical (2026)
2	`div[data-attrid="AI Overview"]`	2025 rotation
3	`div[data-attrid="wa:/description"]`	Historical knowledge-panel reuse
4	`div[jsname][data-rl="ai_overview"]`	2025 rotation
5	`div[data-async-context*="ai_overview"]`	Async-loaded variant
6	`div#m-x-content`	Mobile SGE legacy id
7	Any `h1/h2/h3` whose text starts `AI Overview`	Last-resort text fallback
8	`div` containing `h2` with text `AI Overview`	Last-resort structural fallback

Every row records which selector fired. When selector 1 stops winning and selector 4 starts, open an issue — we'll add the new attribute to the battery.
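The fall-through itself is a short loop. The sketch below covers only the six CSS selectors from the table; the text-based fallbacks (7 and 8) need text matching and are left out. `query_all` is a hypothetical callable standing in for whatever your browser driver uses to run a CSS selector and return the matching elements.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

# Selectors 1-6 from the table above, in priority order.
CSS_BATTERY = [
    'div[aria-label="AI Overview"]',
    'div[data-attrid="AI Overview"]',
    'div[data-attrid="wa:/description"]',
    'div[jsname][data-rl="ai_overview"]',
    'div[data-async-context*="ai_overview"]',
    "div#m-x-content",
]


def first_hit(
    query_all: Callable[[str], Sequence[Any]],
    battery: Sequence[str] = CSS_BATTERY,
) -> Tuple[Optional[str], Optional[Any]]:
    """Probe selectors in order; return (selector, element) for the first match."""
    for selector in battery:
        matches = query_all(selector)
        if matches:
            # First hit wins; the selector string is what lands in selector_used.
            return selector, matches[0]
    return None, None
```

Returning the selector alongside the element is the whole point: the caller can write it to `selector_used` without any extra bookkeeping.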

Run it from the Apify Console or programmatically:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("DevilScrapes/ai-overview-citations").call(
    run_input={
        "queries": [
            "best CRM for startups 2026",
            "what causes inflation",
            "how to reduce churn rate",
        ],
        "country": "us",
        "language": "en",
        "maxQueries": 25,
        "waitMsAfterLoad": 8000,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": ["RESIDENTIAL"],
        },
    }
)

# Marker rows (ai_overview_appeared: false) carry null citation fields, so skip them here.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item["ai_overview_appeared"]:
        print(item["source_domain"], item["citation_position"], item["query"])

Input accepts 1-50 queries per run, plus country (ISO-3166 alpha-2 → gl=) and language (ISO-639-1 → hl=) for locale targeting. waitMsAfterLoad (default 8000ms) controls how long the Actor waits after domcontentloaded before probing — raise it to 12000-15000ms for slow proxy exits.
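For a sense of how those inputs map onto a request, here is a rough sketch. The `https://www.google.com/search` base URL and both function names are assumptions for illustration; the 1-50 query limit and the 4-15 second wait window come from the text above.

```python
from typing import Sequence
from urllib.parse import urlencode


def validate_input(queries: Sequence[str], wait_ms: int = 8000) -> None:
    """Enforce the documented limits: 1-50 queries, 4-15 s wait after load."""
    if not 1 <= len(queries) <= 50:
        raise ValueError("queries must contain between 1 and 50 items")
    if not 4000 <= wait_ms <= 15000:
        raise ValueError("waitMsAfterLoad must be between 4000 and 15000 ms")


def build_search_url(query: str, country: str = "us", language: str = "en") -> str:
    """Map country to gl= and language to hl=, as described above."""
    params = {"q": query, "gl": country.lower(), "hl": language.lower()}
    # Assumption: the standard Google search endpoint.
    return "https://www.google.com/search?" + urlencode(params)
```

For example, `build_search_url("what causes inflation")` yields a URL ending in `q=what+causes+inflation&gl=us&hl=en`.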

Use cases

AEO dashboard. Schedule a weekly run for your 50 highest-priority informational queries. Chart source_domain share-of-citation over time alongside ai_overview_appeared rate, and catch when a competitor first appears in the carousel for a query where you rank position 1. A 50-query run yields roughly 95 rows — about $0.53 per run.
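The two dashboard metrics fall out of the rows with a few lines of stdlib Python. This is a sketch, assuming `rows` is a list of dicts shaped like the sample row; `citation_share` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def citation_share(rows: Iterable[dict]) -> Tuple[float, Dict[str, float]]:
    """Return (AI Overview trigger rate, share of citations per domain)."""
    rows = list(rows)
    queries = {r["query"] for r in rows}
    triggered = {r["query"] for r in rows if r["ai_overview_appeared"]}
    # Count only citation rows; marker rows have no source_domain.
    domains = Counter(
        r["source_domain"]
        for r in rows
        if r["ai_overview_appeared"] and r["source_domain"]
    )
    total = sum(domains.values())
    share = {domain: n / total for domain, n in domains.items()} if total else {}
    rate = len(triggered) / len(queries) if queries else 0.0
    return rate, share
```

Store the output of each weekly run with its date and you have the time series for both charts.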

Competitive citation gap analysis. Run the 20 queries you want to rank for and map which domains Google currently cites for them. That list is your outreach shortlist — a mention from a site Google already trusts beats generic link-building.

Brand monitoring. Run your core product-category queries weekly and alert when your domain drops out of the citation set — or when a direct competitor appears. Most brands have no instrumentation here.

Localized AEO comparison. Run identical query lists with country=us vs country=gb. Citations for "best mortgage rates" differ sharply between US and UK — different markets entirely.

Pricing — exact numbers 💰

Pay-per-event: actor-start is $0.05 once per run, result-row is $0.005 per row written (citation hit or no-AI-Overview marker). Beyond the start fee, you pay only for rows that land in your dataset.

Scenario	Rows	Cost
10-query spot check (~30% hit rate, ~4 citations/hit)	~19	~$0.15
50-query weekly AEO audit	~95	~$0.53
500-query category sweep	~950	~$4.80
1,000-row dataset (effective rate)	1,000	~$5.50

The $5.50/1,000 effective rate sits above commodity SERP scrapers because this citation data is essentially unavailable elsewhere at this granularity. Ahrefs and Semrush are beginning to ship AEO modules at $300-1,500/month — and they only track your own domain. Apify's $5 free trial credit covers roughly 900-950 rows, no credit card needed.
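The table is easy to reproduce from the two event prices. A small estimator, using the ~30% hit rate and ~4 citations per hit from the first row as default assumptions (function names are illustrative):

```python
ACTOR_START_USD = 0.05   # charged once per run
RESULT_ROW_USD = 0.005   # charged per row written, citation or marker


def estimate_rows(queries: int, hit_rate: float = 0.30, citations_per_hit: float = 4) -> float:
    """Citation rows for queries that trigger, plus one marker row for each that does not."""
    hits = queries * hit_rate
    return hits * citations_per_hit + (queries - hits)


def run_cost(rows: float, runs: int = 1) -> float:
    """Total cost in USD for the given number of rows spread over `runs` runs."""
    return runs * ACTOR_START_USD + rows * RESULT_ROW_USD
```

With these defaults, 10 queries give 19 rows and $0.145, and 50 queries give 95 rows and $0.525, matching the first two table rows. At 95 rows per run the start fee spreads out to roughly $5.53 per 1,000 rows, in line with the ~$5.50 effective rate.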

The technically interesting part: why we record which selector fired

The selector_used field is operational telemetry, shipped deliberately. Google has rotated the AI Overview container's attributes multiple times since launch in 2024, and each rotation silently kills scrapers that hardcode one selector: the parser falls through to empty and the dataset looks fine until someone notices the citation count has dropped to zero. Recording which of the 8 selectors matched on every row turns that into a dead-simple query against your own dataset:

-- Daily hit count per selector; a change in the top selector signals markup drift.
SELECT selector_used, COUNT(*) AS hits, DATE(scraped_at) AS day
FROM your_aeo_dataset
GROUP BY 1, 3
ORDER BY 3 DESC, 2 DESC;

When the distribution shifts — selector 1 dropping, selector 4 climbing — you get a 24-48 hour warning before coverage degrades. The alternative is waking up to a week of empty citation data and no idea why.
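If you would rather alert from a script than eyeball a query result, the same check works in plain Python. A sketch, assuming rows shaped like the sample row with ISO-8601 `scraped_at` strings; both function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List


def dominant_selector_by_day(rows: Iterable[dict]) -> Dict[str, str]:
    """Map each day (YYYY-MM-DD) to the selector that matched most often."""
    per_day = defaultdict(Counter)
    for row in rows:
        if row.get("selector_used"):
            # The first ten characters of an ISO-8601 timestamp are the date.
            per_day[row["scraped_at"][:10]][row["selector_used"]] += 1
    return {day: counts.most_common(1)[0][0] for day, counts in sorted(per_day.items())}


def drift_days(rows: Iterable[dict]) -> List[str]:
    """Days on which the dominant selector differs from the previous observed day."""
    dominant = dominant_selector_by_day(rows)
    days = list(dominant)
    return [day for prev, day in zip(days, days[1:]) if dominant[prev] != dominant[day]]
```

A non-empty result from `drift_days` is the cue to look at the distribution and report the shift.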

Limitations (the honest list) 🚧

AI Overview triggers on roughly 30% of queries. Transactional, navigational, and trademark-heavy queries mostly produce ai_overview_appeared=false marker rows; informational queries (what is, how to, best X 2026) have the best trigger rate. Marker rows are charged at the same per-row rate — the absence is a valid data point.
v0.1 is English-tuned. The text-based fallback selectors (7 and 8) match the literal string AI Overview, so non-English locales may produce false negatives on that path. The CSS battery (selectors 1-6) is locale-agnostic.
Apify Proxy is required, not optional. Google hard-blocks Apify datacenter IPs; without a proxy configured, the Actor fails fast at startup with a clear status message. FREE tier gets BUYPROXIES94952 (5 datacenter IPs, higher CAPTCHA rate); paid plans with RESIDENTIAL get substantially cleaner runs.
Mobile SERP is out of scope for v0.1. Mobile AI Overview has a different DOM structure; a mobile-variant Actor is planned.
The Actor records citation URLs but does not follow them. For destination page content, pair it with a downstream HTTP scraper.

FAQ

Is scraping Google Search results legal?
The data returned is public: the same content anyone sees in a browser. This Actor reads only what Google renders in the public SERP, at a paced rate with per-query session isolation, and collects no personal data. In hiQ Labs v. LinkedIn (9th Circuit, 2022), the court held that scraping publicly accessible data likely does not violate the CFAA. That ruling does not settle contract or terms-of-service claims, and legality still varies by jurisdiction and use case, so review Google's Terms of Service and your local regulations for your situation.

Can I export the data to Sheets, CSV, or a data warehouse?
Yes. The Apify Console downloads CSV / Excel / JSON directly from the dataset view. You can also webhook the dataset on ACTOR.RUN.SUCCEEDED into Make, Zapier, or n8n, or pull it via the Apify API using the datasetId from the run response.

Is there an official Google API for AI Overview citations?
No. As of 2026, Google provides no API or structured export for AI Overview citation data. Google Search Central documents general AEO guidance but no programmatic citation access. This Actor is the practical alternative.

Why emit a row even when AI Overview didn't appear?
Because the absence is meaningful AEO data. Run the same query set weekly and you want the ai_overview_appeared rate over time — when a query transitions from non-triggering to triggering, that's the moment a citation opportunity opens. Marker rows make the transition visible, charged at the same $0.005 per-row rate.
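Spotting that transition is a comparison between two runs. A minimal sketch, assuming you keep last week's rows around; `newly_triggering` is a hypothetical helper, not part of the Actor.

```python
from typing import Dict, Iterable, List


def _triggered(rows: Iterable[dict]) -> Dict[str, bool]:
    """Map each query to True if any of its rows reports an AI Overview."""
    status: Dict[str, bool] = {}
    for row in rows:
        query = row["query"]
        status[query] = status.get(query, False) or bool(row["ai_overview_appeared"])
    return status


def newly_triggering(previous_rows: Iterable[dict], current_rows: Iterable[dict]) -> List[str]:
    """Queries that had only a marker row last run and show an AI Overview now."""
    before = _triggered(previous_rows)
    now = _triggered(current_rows)
    # Queries absent from the previous run are skipped: there is no baseline for them.
    return sorted(q for q, hit in now.items() if hit and before.get(q) is False)
```

Without marker rows, a query that did not trigger last week would simply be missing from `before`, and the transition would be invisible.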

Try it

The Actor is on the Apify Store: apify.com/DevilScrapes/ai-overview-citations.

Free $5 trial credit, no credit card. Run it on your 10 most important informational queries and you'll have the citation breakdown in your dataset within minutes. Find a selector miss, a locale that doesn't work, or a field you wish it returned? Drop it in the comments — real reported drift is exactly what I build the next selector battery from.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈
