I pulled a 20-row sample of Google Ads Transparency Center to see whether the dataset is rich enough to support outbound prospecting, ICP enrichment, account research and territory planning, or whether it is the kind of feed you have to enrich heavily before it becomes useful. Short answer: richer than I expected. Long answer below.
What is in the sample
The sample comes from the Google Ads Transparency Center Scraper (Competitor Ads, Impressions & Spend), which extracts the Google ads an advertiser is running across Search, Display, Shopping, and YouTube. Each record has the following fields:
- adId
- advertiserId
- advertiserName
- advertiserDomain
- format
- surface
- imageUrl
- imageWidth
- imageHeight
- imageHtml
- iframeUrl
- previewUrl
- variationCount
- firstShown
- lastShown
- variantUrls
- targetingCategory
- impressionsRange
- impressionsRegions
- spendRange
- firstShownDetailed
- lastShownDetailed
- payer
- detailFormatCode
- searchedDomain
- searchedAdvertiser
- searchedRegions
- searchedFormat
- scrapedAt
- advertiserTotalAdsMin
- advertiserTotalAdsMax
The fields divide into three groups: identifiers (stable across re-scrapes), descriptive content (the actual signal you want), and metadata (timestamps, source URLs, scrape provenance). For most analytical workflows you only really touch the middle group, but the identifiers matter the moment you start joining across runs.
Two example records
Here are two rows from the sample, trimmed slightly so they fit:
```json
{
  "adId": "CR17484233965576388609",
  "advertiserId": "AR16735076323512287233",
  "advertiserName": "Nike, Inc.",
  "advertiserDomain": "nike.com",
  "format": "IMAGE",
  "surface": "SEARCH",
  "imageUrl": "https://tpc.googlesyndication.com/archive/simgad/17926873754417759183",
  "imageWidth": 380,
  "imageHeight": 199,
  "imageHtml": "<img src=\"https://tpc.googlesyndication.com/archive/simgad/17926873754417759183\" height=\"199\" width=\"380\">"
}
```

```json
{
  "adId": "CR02684696164518854657",
  "advertiserId": "AR16832577870747402241",
  "advertiserName": "NIKE GLOBAL TRADING B.V. SINGAPORE BRANCH",
  "advertiserDomain": "nike.com",
  "format": "DISPLAY",
  "surface": "SHOPPING",
  "imageUrl": null,
  "imageWidth": null,
  "imageHeight": null,
  "imageHtml": null
}
```
Even without aggregation the cardinality is interesting: the two rows share a domain (nike.com) but carry different advertiser IDs and different legal entities, so domain is the key you want for brand-level rollups. The descriptive fields vary widely across rows, which means a 20-row sample is enough for meaningful exploratory analysis but probably not enough for any production-grade modelling -- you would want at least an order of magnitude more.
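To make that concrete, here is a minimal pandas sketch over the two rows above (trimmed to the fields shown), confirming that one domain maps to two advertiser IDs:

```python
import pandas as pd

# The two records from the sample above, trimmed to a few fields.
records = [
    {
        "adId": "CR17484233965576388609",
        "advertiserId": "AR16735076323512287233",
        "advertiserName": "Nike, Inc.",
        "advertiserDomain": "nike.com",
        "format": "IMAGE",
        "surface": "SEARCH",
    },
    {
        "adId": "CR02684696164518854657",
        "advertiserId": "AR16832577870747402241",
        "advertiserName": "NIKE GLOBAL TRADING B.V. SINGAPORE BRANCH",
        "advertiserDomain": "nike.com",
        "format": "DISPLAY",
        "surface": "SHOPPING",
    },
]

df = pd.DataFrame(records)
# One domain, two distinct advertiser entities -- group by
# advertiserDomain to roll legal entities up to a single brand.
print(df.groupby("advertiserDomain")["advertiserId"].nunique())
```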
What I would do with the data
A non-exhaustive list of analyses this dataset directly supports:
- Frequency analysis on the categorical columns to spot dominant clusters and long-tail outliers.
- Time-series breakdowns using the timestamp fields to see daily, weekly and seasonal patterns.
- Text analysis on the free-form fields -- topic modelling, keyword extraction, sentiment if the content warrants it.
- Cross-joins with external reference data to produce something more valuable than either input alone -- outbound prospecting, ICP enrichment, account research and territory planning typically need a second-source enrichment step anyway.
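The first item on that list is a one-liner in pandas. A sketch over a hypothetical slice of the sample (the values here are illustrative, not from the real 20 rows), counting the categorical columns to surface dominant clusters and the long tail:

```python
import pandas as pd

# Hypothetical slice; column names match the schema above.
df = pd.DataFrame({
    "format":  ["IMAGE", "IMAGE", "DISPLAY", "TEXT", "IMAGE"],
    "surface": ["SEARCH", "SEARCH", "SHOPPING", "SEARCH", "YOUTUBE"],
})

# Frequency analysis: dominant values sort to the top, long-tail
# outliers to the bottom; dropna=False keeps missing values visible.
for col in ["format", "surface"]:
    print(df[col].value_counts(dropna=False))
```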
Quirks I noticed
A few practical observations from poking at the rows:
- Some optional fields are missing rather than null. Normalise on load.
- Long-form text occasionally contains newlines and the odd unicode quirk; clean before tokenising.
- Identifier-like fields are strings; do not let your warehouse coerce them to int.
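The first and third quirks are cheap to handle in one normalisation pass at load time. A sketch (the `EXPECTED_FIELDS` subset here is illustrative, not the full schema):

```python
# Subset of the schema, for illustration; extend with the full field list.
EXPECTED_FIELDS = [
    "adId", "advertiserId", "advertiserName", "advertiserDomain",
    "format", "surface", "imageUrl", "imageWidth", "imageHeight",
]
ID_FIELDS = {"adId", "advertiserId"}

def normalise(record: dict) -> dict:
    """Normalise one raw record: keys absent from the input become None
    (missing -> null), and identifier fields are forced to str so the
    warehouse never coerces them to int."""
    out = {field: record.get(field) for field in EXPECTED_FIELDS}
    for field in ID_FIELDS:
        if out[field] is not None:
            out[field] = str(out[field])
    return out

# A record missing most keys still comes out with the full shape.
row = normalise({"adId": "CR17484233965576388609", "imageWidth": 380})
```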
How I would shape it for downstream use
If I were dropping this dataset into a warehouse the rough plan would be: stage the raw JSON unchanged in a landing zone partitioned by scrape date, then create a curated view that casts the identifier fields to strings, parses the timestamps as native DATE/TIMESTAMP types, splits any compound columns, and trims long-form text. Keeping that two-layer structure means you can replay history without re-scraping, and you can iterate on the curated schema without losing fidelity.
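The curated-layer casts look roughly like this in pandas; I am assuming the `firstShown`/`lastShown` values arrive as ISO-8601 date strings, which you should verify against a real pull:

```python
import pandas as pd

# Raw landing-zone rows (illustrative values, assumed ISO-8601 dates).
raw = pd.DataFrame({
    "adId": ["CR17484233965576388609"],
    "firstShown": ["2024-01-15"],
    "lastShown": ["2024-06-30"],
})

# Curated view: identifiers stay strings, timestamps become native types.
curated = raw.assign(
    adId=raw["adId"].astype("string"),
    firstShown=pd.to_datetime(raw["firstShown"]),
    lastShown=pd.to_datetime(raw["lastShown"]),
)
print(curated.dtypes)
```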
For analytical queries the curated view is what you point dashboards and notebooks at. Common patterns I would pre-build as additional models: a daily-rollup view aggregating numeric columns by the most useful categorical breakdown, a recency view filtered to the last N days for "what is new" dashboards, and a delta view that diffs the latest snapshot against yesterday so you can surface additions and removals cheaply.
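The delta view is the least obvious of the three, so here is a sketch of the diff logic using `adId` as the join key (the snapshots below are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical snapshots keyed by adId, which is stable across re-scrapes.
yesterday = pd.DataFrame({"adId": ["CR1", "CR2", "CR3"]})
today = pd.DataFrame({"adId": ["CR2", "CR3", "CR4"]})

# An outer merge with indicator=True labels each row as left_only
# (new today), right_only (gone since yesterday), or both (unchanged).
merged = today.merge(yesterday, on="adId", how="outer", indicator=True)
added = merged.loc[merged["_merge"] == "left_only", "adId"]
removed = merged.loc[merged["_merge"] == "right_only", "adId"]
print(added.tolist(), removed.tolist())
```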
Bottom line
For a sample pull it is more than enough to validate the use-case fit. If the analytical questions you want to answer are reasonable on a 20-row sample, the full dataset will comfortably answer them. The next step is a longer-horizon pull -- a week or two of recurring snapshots -- which lets you stop treating each row as a one-off and start treating the dataset as a feed with its own dynamics.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/google-ads-transparency-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.