I pulled a 20-row sample of Google Ads Transparency Center to see whether the dataset is rich enough to support outbound prospecting, ICP enrichment, account research and territory planning, or whether it is the kind of feed you have to enrich heavily before it becomes useful. Short answer: richer than I expected. Long answer below.
What is in the sample
The sample comes from the Google Ads Transparency Center Scraper (Competitor Ads, Impressions & Spend), which extracts the Google ads an advertiser is running across Search, Display, Shopping, and YouTube. Each record has the following fields:
- adId
- advertiserId
- advertiserName
- advertiserDomain
- format
- surface
- imageUrl
- imageWidth
- imageHeight
- imageHtml
- iframeUrl
- previewUrl
- variationCount
- firstShown
- lastShown
- variantUrls
- targetingCategory
- impressionsRange
- impressionsRegions
- spendRange
- firstShownDetailed
- lastShownDetailed
- payer
- detailFormatCode
- searchedDomain
- searchedAdvertiser
- searchedRegions
- searchedFormat
- scrapedAt
- advertiserTotalAdsMin
- advertiserTotalAdsMax
The fields divide into three groups: identifiers (stable across re-scrapes), descriptive content (the actual signal you want), and metadata (timestamps, source URLs, scrape provenance). For most analytical workflows you only really touch the middle group, but the identifiers matter the moment you start joining across runs.
Two example records
Here are two rows from the sample, trimmed slightly so they fit:
```json
{
  "adId": "CR17484233965576388609",
  "advertiserId": "AR16735076323512287233",
  "advertiserName": "Nike, Inc.",
  "advertiserDomain": "nike.com",
  "format": "IMAGE",
  "surface": "SEARCH",
  "imageUrl": "https://tpc.googlesyndication.com/archive/simgad/17926873754417759183",
  "imageWidth": 380,
  "imageHeight": 199,
  "imageHtml": "<img src=\"https://tpc.googlesyndication.com/archive/simgad/17926873754417759183\" height=\"199\" width=\"380\">"
}
```

```json
{
  "adId": "CR02684696164518854657",
  "advertiserId": "AR16832577870747402241",
  "advertiserName": "NIKE GLOBAL TRADING B.V. SINGAPORE BRANCH",
  "advertiserDomain": "nike.com",
  "format": "DISPLAY",
  "surface": "SHOPPING",
  "imageUrl": null,
  "imageWidth": null,
  "imageHeight": null,
  "imageHtml": null
}
```
Even without aggregation the cardinality is interesting: the two rows share a domain (nike.com) but carry different advertiser IDs and different legal entities, so domain is the key you want for brand-level rollups. The descriptive fields vary widely across rows, which means a 20-row sample is enough for meaningful exploratory analysis but probably not enough for any production-grade modelling -- you would want at least an order of magnitude more.
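To make that concrete, here is a minimal pandas sketch over the two rows above (trimmed to the fields shown), confirming that one domain maps to two advertiser IDs:

```python
import pandas as pd

# The two records from the sample above, trimmed to a few fields.
records = [
    {
        "adId": "CR17484233965576388609",
        "advertiserId": "AR16735076323512287233",
        "advertiserName": "Nike, Inc.",
        "advertiserDomain": "nike.com",
        "format": "IMAGE",
        "surface": "SEARCH",
    },
    {
        "adId": "CR02684696164518854657",
        "advertiserId": "AR16832577870747402241",
        "advertiserName": "NIKE GLOBAL TRADING B.V. SINGAPORE BRANCH",
        "advertiserDomain": "nike.com",
        "format": "DISPLAY",
        "surface": "SHOPPING",
    },
]

df = pd.DataFrame(records)
# One domain, two distinct advertiser entities -- group by
# advertiserDomain to roll legal entities up to a single brand.
print(df.groupby("advertiserDomain")["advertiserId"].nunique())
```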
What I would do with the data
A non-exhaustive list of analyses this dataset directly supports:
- Frequency analysis on the categorical columns to spot dominant clusters and long-tail outliers.
- Time-series breakdowns using the timestamp fields to see daily, weekly and seasonal patterns.
- Text analysis on the free-form fields -- topic modelling, keyword extraction, sentiment if the content warrants it.
- Cross-joins with external reference data to produce something more valuable than either input alone -- outbound prospecting, ICP enrichment, account research and territory planning typically need a second-source enrichment step anyway.
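The first item on that list is a one-liner in pandas. A sketch over a hypothetical slice of the sample (the values here are illustrative, not from the real 20 rows), counting the categorical columns to surface dominant clusters and the long tail:

```python
import pandas as pd

# Hypothetical slice; column names match the schema above.
df = pd.DataFrame({
    "format":  ["IMAGE", "IMAGE", "DISPLAY", "TEXT", "IMAGE"],
    "surface": ["SEARCH", "SEARCH", "SHOPPING", "SEARCH", "YOUTUBE"],
})

# Frequency analysis: dominant values sort to the top, long-tail
# outliers to the bottom; dropna=False keeps missing values visible.
for col in ["format", "surface"]:
    print(df[col].value_counts(dropna=False))
```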
Quirks I noticed
A few practical observations from poking at the rows:
- Some optional fields are missing rather than null. Normalise on load.
- Long-form text occasionally contains newlines and the odd unicode quirk; clean before tokenising.
- Identifier-like fields are strings; do not let your warehouse coerce them to int.
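The first and third quirks are cheap to handle in one normalisation pass at load time. A sketch (the `EXPECTED_FIELDS` subset here is illustrative, not the full schema):

```python
# Subset of the schema, for illustration; extend with the full field list.
EXPECTED_FIELDS = [
    "adId", "advertiserId", "advertiserName", "advertiserDomain",
    "format", "surface", "imageUrl", "imageWidth", "imageHeight",
]
ID_FIELDS = {"adId", "advertiserId"}

def normalise(record: dict) -> dict:
    """Normalise one raw record: keys absent from the input become None
    (missing -> null), and identifier fields are forced to str so the
    warehouse never coerces them to int."""
    out = {field: record.get(field) for field in EXPECTED_FIELDS}
    for field in ID_FIELDS:
        if out[field] is not None:
            out[field] = str(out[field])
    return out

# A record missing most keys still comes out with the full shape.
row = normalise({"adId": "CR17484233965576388609", "imageWidth": 380})
```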
How I would shape it for downstream use
If I were dropping this dataset into a warehouse the rough plan would be: stage the raw JSON unchanged in a landing zone partitioned by scrape date, then create a curated view that casts the identifier fields to strings, parses the timestamps as native DATE/TIMESTAMP types, splits any compound columns, and trims long-form text. Keeping that two-layer structure means you can replay history without re-scraping, and you can iterate on the curated schema without losing fidelity.
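The curated-layer casts look roughly like this in pandas; I am assuming the `firstShown`/`lastShown` values arrive as ISO-8601 date strings, which you should verify against a real pull:

```python
import pandas as pd

# Raw landing-zone rows (illustrative values, assumed ISO-8601 dates).
raw = pd.DataFrame({
    "adId": ["CR17484233965576388609"],
    "firstShown": ["2024-01-15"],
    "lastShown": ["2024-06-30"],
})

# Curated view: identifiers stay strings, timestamps become native types.
curated = raw.assign(
    adId=raw["adId"].astype("string"),
    firstShown=pd.to_datetime(raw["firstShown"]),
    lastShown=pd.to_datetime(raw["lastShown"]),
)
print(curated.dtypes)
```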
For analytical queries the curated view is what you point dashboards and notebooks at. Common patterns I would pre-build as additional models: a daily-rollup view aggregating numeric columns by the most useful categorical breakdown, a recency view filtered to the last N days for "what is new" dashboards, and a delta view that diffs the latest snapshot against yesterday so you can surface additions and removals cheaply.
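The delta view is the least obvious of the three, so here is a sketch of the diff logic using `adId` as the join key (the snapshots below are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical snapshots keyed by adId, which is stable across re-scrapes.
yesterday = pd.DataFrame({"adId": ["CR1", "CR2", "CR3"]})
today = pd.DataFrame({"adId": ["CR2", "CR3", "CR4"]})

# An outer merge with indicator=True labels each row as left_only
# (new today), right_only (gone since yesterday), or both (unchanged).
merged = today.merge(yesterday, on="adId", how="outer", indicator=True)
added = merged.loc[merged["_merge"] == "left_only", "adId"]
removed = merged.loc[merged["_merge"] == "right_only", "adId"]
print(added.tolist(), removed.tolist())
```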
Bottom line
For a sample pull it is more than enough to validate the use-case fit. If the analytical questions you want to answer are reasonable on a 20-row sample, the full dataset will comfortably answer them. The next step is a longer-horizon pull -- a week or two of recurring snapshots -- which lets you stop treating each row as a one-off and start treating the dataset as a feed with its own dynamics.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/google-ads-transparency-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.