I had a short window this week to evaluate Website Contact as a data source. Here is the condensed write-up of what the data looks like, what surprised me, and the bits of infrastructure that paid off.
The source
The source is Website Contact Scraper (Email, Phone & Social Media Extractor), an actor that extracts emails, phone numbers, and LinkedIn, Instagram, Twitter/X, Facebook, and YouTube links from any website automatically. The relevant questions for any new source are always the same: is the markup stable, is pagination sensible, and how aggressively does it rate-limit? For this one, all three answers are "good enough that you can build on it" -- which is honestly more than I can say for a lot of supposedly easy targets.
The schema
What you get back per record:
- url -- the scraped page URL
- rootDomain -- the root domain
- pageType -- the page type (e.g. Home, Contact/About)
- pageTitle -- the page title
- metaDescription -- the meta description
- emails -- email addresses found on the page
- phones -- phone numbers found on the page
- socials -- social profile links (LinkedIn, Twitter/X, Instagram, Facebook, YouTube)
- scrapedAt -- the scrape timestamp (UTC ISO-8601)
Nothing exotic, which is exactly what you want from a feed. Flat records, predictable keys, types you can guess from the names.
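If you want those guesses written down, here is one way to type a record in Python -- a sketch inferred from the two sample rows in the next section, not a published schema:

```python
# Record shape inferred from the sample rows -- a guess, not an official
# schema. total=False because optional fields are omitted rather than
# emitted as null (see Gotchas below).
from typing import Optional, TypedDict

class SocialLinks(TypedDict, total=False):
    linkedin: Optional[str]
    twitter: Optional[str]
    instagram: Optional[str]
    facebook: Optional[str]
    youtube: Optional[str]

class ContactRecord(TypedDict, total=False):
    url: str
    rootDomain: str
    pageType: str              # e.g. "Home", "Contact/About"
    pageTitle: str
    metaDescription: str
    emails: list[str]
    phones: list[str]
    socials: SocialLinks
    scrapedAt: str             # UTC ISO-8601, e.g. "2026-05-15T10:51:58.385Z"
```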
Real rows
Two records from a sample run, trimmed to keep the inevitable wall of text manageable:
{
  "url": "https://apify.com",
  "rootDomain": "apify.com",
  "pageType": "Home",
  "pageTitle": "Apify: Full-stack web scraping and data extraction platform",
  "metaDescription": "Cloud platform for web scraping, browser automation, AI agents, and data for AI. Use 30,000+ ready-made tools, code templates, or order a...",
  "emails": [],
  "phones": [],
  "socials": {
    "linkedin": "http://linkedin.com/company/apify/",
    "twitter": "https://x.com/apify",
    "instagram": null,
    "facebook": null,
    "youtube": "https://www.youtube.com/apify"
  },
  "scrapedAt": "2026-05-15T10:51:58.385Z"
}
{
  "url": "https://apify.com/contact",
  "rootDomain": "apify.com",
  "pageType": "Contact/About",
  "pageTitle": "Contact us · Apify",
  "metaDescription": "Contact details for Apify, including address, support information, and social media channels.",
  "emails": [
    "hello@apify.com"
  ],
  "phones": [],
  "socials": {
    "linkedin": "https://www.linkedin.com/company/apify/",
    "twitter": "https://x.com/apify",
    "instagram": null,
    "facebook": null,
    "youtube": "https://www.youtube.com/apify"
  },
  "scrapedAt": "2026-05-15T10:51:58.818Z"
}
Gotchas
A few things I would not have known without actually pulling data (the loader sketch after the list handles all five):
- Optional fields disappear instead of being null. Not the end of the world, but it means every loader needs to be tolerant of missing keys.
- Long-form text fields contain control characters. Newlines, tabs, the occasional rogue carriage return. Strip them at load time unless you actively want them.
- Timestamps are UTC ISO-8601, which is great, but it does mean any local-time dashboard needs an explicit conversion.
- Some numeric fields are emitted as strings. Cast on load.
- Re-scraping with overlapping windows creates duplicates. Dedup on the natural ID.
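A minimal loader sketch covering the gotchas above. Field names come from the sample rows; the function names and the control-character policy are my own choices, not part of the actor's API:

```python
# Tolerant loader for Website Contact records -- a sketch, assuming the
# schema shown in the sample rows above.
import re
from datetime import datetime, timezone

CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")  # newlines, tabs, rogue \r

def clean_text(value):
    """Strip control characters from long-form text fields."""
    if isinstance(value, str):
        return CONTROL_CHARS.sub(" ", value).strip()
    return value

def parse_ts(value):
    """Parse the UTC ISO-8601 timestamp; convert to local time downstream."""
    if not value:
        return None
    return datetime.fromisoformat(value.replace("Z", "+00:00")).astimezone(timezone.utc)

def load_record(raw: dict) -> dict:
    """Normalise one raw record, tolerating missing keys."""
    return {
        "url": raw.get("url"),                 # keys vanish instead of being null
        "rootDomain": raw.get("rootDomain"),
        "pageType": raw.get("pageType"),
        "pageTitle": clean_text(raw.get("pageTitle")),
        "metaDescription": clean_text(raw.get("metaDescription")),
        "emails": raw.get("emails") or [],
        "phones": raw.get("phones") or [],     # cast numeric-looking strings here if needed
        "socials": raw.get("socials") or {},
        "scrapedAt": parse_ts(raw.get("scrapedAt")),
    }

def dedupe(records: list[dict]) -> list[dict]:
    """Drop duplicates from overlapping scrape windows, keyed on url."""
    seen, out = set(), []
    for rec in records:
        if rec["url"] not in seen:
            seen.add(rec["url"])
            out.append(rec)
    return out
```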
What I would build next
A few directions this dataset would support nicely:
- A daily snapshot pipeline that lands raw JSON into object storage, then materialises a curated table for dashboards.
- A change-detection layer that computes row-level diffs between consecutive scrapes -- great for surfacing new and removed records.
- A text-extraction layer over the long-form content fields, feeding into search or topic modelling.
- A small validation suite that runs after every scrape: row count above a floor, key fields present in 100% of rows, timestamp parses cleanly. Cheap to write, catches schema drift in minutes instead of weeks; see the sketch after this list.
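For the validation suite specifically, a minimal sketch -- MIN_ROWS and REQUIRED_KEYS are illustrative thresholds I picked, not values from the source:

```python
# Post-scrape validation sketch: cheap checks that flag schema drift early.
from datetime import datetime

MIN_ROWS = 50                                   # illustrative floor for a healthy run
REQUIRED_KEYS = ("url", "rootDomain", "scrapedAt")

def _parses(ts: str) -> bool:
    """True if ts is a parseable ISO-8601 timestamp."""
    try:
        datetime.fromisoformat(ts.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False

def validate(records: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means the run passes."""
    failures = []
    if len(records) < MIN_ROWS:
        failures.append(f"row count {len(records)} below floor {MIN_ROWS}")
    for key in REQUIRED_KEYS:
        missing = sum(1 for r in records if not r.get(key))
        if missing:
            failures.append(f"{missing} rows missing '{key}'")
    bad_ts = [r.get("url") for r in records if not _parses(str(r.get("scrapedAt", "")))]
    if bad_ts:
        failures.append(f"{len(bad_ts)} rows with unparseable scrapedAt, e.g. {bad_ts[0]}")
    return failures
```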
Cost considerations
Worth thinking about before you commit. The dominant cost on a recurring feed is not the per-record extraction price -- it is the maintenance time when the upstream source changes. A solid heuristic: budget half a day per source per quarter for maintenance work, and twice that for sources with active anti-bot defences. If that maintenance budget is too steep for the value the dataset provides, the project is not a fit.
The other cost worth modelling is storage. Raw JSON partitioned by date is cheap if you compress it -- a few cents per gigabyte per month on most clouds -- but it stops being cheap if you forget about retention. Set a lifecycle policy that ages anything older than your useful replay window into a colder tier, and revisit the policy every few months.
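To make the lifecycle point concrete, here is a sketch of such a rule via boto3, assuming the raw JSON lands in S3; the bucket name, prefix, and 90-day window are all hypothetical:

```python
# Age raw JSON past the replay window into a colder tier -- a sketch with
# hypothetical bucket/prefix; pick the window to match how far back you
# actually replay.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-raw-snapshots",                       # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-raw-json",
                "Filter": {"Prefix": "website-contact/raw/"},
                "Status": "Enabled",
                # After the 90-day replay window, transition to Glacier
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```

The other clouds have equivalent knobs; the point is that this is one-time setup you revisit on the same cadence as the retention review.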
Bottom line
For an afternoon's evaluation work this was time well spent. The dataset is structurally clean, the scraper handled rate-limits without me having to think about it, and the records are rich enough to start asking real questions immediately. If the upstream source stays stable for a quarter -- which is the realistic horizon for most public sources -- the cost-benefit of integrating this feed is firmly positive.
For live, customizable extractions of this data, the actor that produced the dataset shown above is published on the Apify Store: logiover/website-contact-scraper. It supports JSON, CSV and Excel exports and runs on a schedule.