NexGenData

Posted on Jun 26 • Originally published at thenextgennexus.com

How to Extract Contact Information from Company Websites

#marketing #api #webscraping #opensource

Every outbound motion — SaaS BDR sequences, agency lead-gen delivery, founder-led cold email, recruiter sourcing — eventually hits the same wall: the contact data you can buy is stale, the contact data you need is buried three clicks deep on an About page, and aggregators charge per-credit for fields the company itself published last week. This article is about pulling those contacts directly off company websites at scale, the surfaces where named contacts actually live, and the Apify actors that turn the job into a configurable pipeline.

For directory-sourced contact data — Crunchbase, YC, business registries — see our companion piece on extracting contact information for lead generation workflows. This guide is the website-first counterpart.

1. The Problem: Modern Company Sites Are Built to Hide Contacts

Ten years ago you could regex an entire company website for mailto: links and walk away with a usable list. That era is over. Marketing teams figured out exposed emails get scraped by spam bots, so the modern company site does three things specifically to defeat naive extraction:

Webform gatekeeping. The Contact page is a HubSpot or Marketo form. There is no email address on the page. The form posts to a routing rule marketing ops owns. Cold outreach has nowhere to land.
Email obfuscation. Addresses render as inline SVG, base64-encoded images, or assembled at runtime from a JavaScript array. A static HTML parser sees nothing.
Catch-all routing. Even when visible, addresses are info@ or hello@ — the equivalent of mailing "Occupant." Reply rates on catch-alls run 80–90% lower than named contacts.
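
To make the gap concrete, here is a minimal Python sketch of the old mailto: approach. The sample HTML fragments and the function name are illustrative assumptions, not taken from any real site or from the NexGenData actors. It finds the address in a static link and finds nothing when the same address is assembled by JavaScript at runtime.

```python
import re
from urllib.parse import unquote

# Matches href="mailto:..." and captures the address up to any ?subject= query.
MAILTO_RE = re.compile(r'href=["\']mailto:([^"\'?]+)', re.IGNORECASE)


def naive_mailto_extract(html: str) -> list[str]:
    """Return addresses from static mailto: links, de-duplicated in page order."""
    found: list[str] = []
    for raw in MAILTO_RE.findall(html):
        address = unquote(raw).strip().lower()
        if address and address not in found:
            found.append(address)
    return found


# Hypothetical page fragments for illustration only.
static_page = '<a href="mailto:Jane.Doe@example.com?subject=Hi">Email Jane</a>'
js_page = (
    '<script>var p = ["jane.doe", "example.com"];'
    'document.write(p.join("@"));</script>'
)

print(naive_mailto_extract(static_page))  # ['jane.doe@example.com']
print(naive_mailto_extract(js_page))      # []
```

The second result is the point: without rendering the page, a static parser has no address to match.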

The SaaS response — Apollo, ZoomInfo, Cognism, Lusha, Clearbit — is to license aggregated data and resell per seat or per credit. That works until you do the math on a 50,000-account TAM at $40K+/year for a database whose freshness you cannot audit. The alternative: extract from the source, render the JavaScript, walk the About / Team / Leadership / Press pages where named contacts live. That is what a purpose-built website email extractor does.

2. Why This Data Matters — and Who Actually Needs It

Website-extracted contact data is workhorse fuel for a specific set of GTM motions:

SDRs hitting 200 accounts/week. Apollo gives 40% coverage of your target list; the remaining 60% has to come from the company's own About page.
Agencies enriching 10,000-company client lists overnight. Manual research is a non-starter; aggregator credits would eat the margin.
Recruiters cold-emailing engineering teams. InMail caps at a few hundred/month and gets ignored. The roster on /team is gold — names, titles, sometimes GitHub handles.
Founders running their own outbound. Pre-Series A you have no SDR — just 300 dream-fit customers and a deadline.
Sales-ops teams building enrichment pipelines. Website extraction is one stage of a nightly pipeline SDRs file requests against.
B2B prospecting tool builders. The extraction layer is a commodity you want to call as an API, not maintain.

3. What Can Be Extracted From a Company Website

Not all pages yield contact data of equal quality. The table below maps the surfaces, what they contain, how reliable extraction is, and the NexGenData actor purpose-built for each.

Page Type	What's There	Reliability	Best Extractor
/contact	Catch-all email, phone, office address; often a webform with no exposed address	Low — catch-alls dominate	Contact Info Scraper
/about	Founder names, occasionally founder emails or a press contact	Medium	Website Email Extractor
/team, /people	Named individuals with titles, headshots, and (on smaller firms) direct emails	High for sub-200-employee firms	Website Email Extractor
/leadership	C-suite and VPs by name, titles, sometimes LinkedIn links	High for names, low for direct emails	Company Enrichment
Footer	Office addresses, support emails, legal contacts, social handles	Medium	Contact Info Scraper
/press, /newsroom	Named press contact with direct email and phone — highest-quality contact on the site	Very high when present	Website Email Extractor
/careers	Recruiter name, hiring manager occasionally, talent@ address	Medium	LinkedIn Jobs Scraper
Blog bylines	Author names, sometimes bios with email or LinkedIn	Medium — useful for content outreach	Website Email Extractor

Practical implication: a one-page crawl of the homepage yields catch-alls and not much else. A recursive crawl prioritizing /team, /leadership, and /press yields 3–5× more named contacts at modest additional compute cost.
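
One way to picture the prioritization is a small ranking function over discovered URLs. This is a sketch under assumptions: the pattern list and its ordering are hypothetical choices for illustration, and the actors may rank pages differently.

```python
from urllib.parse import urlparse

# Hypothetical priority order: earlier patterns are crawled first.
PRIORITY_PATTERNS = ("/team", "/people", "/leadership", "/press", "/newsroom", "/about")


def crawl_priority(url: str) -> int:
    """Lower number means crawl sooner; unmatched URLs share the lowest priority."""
    path = urlparse(url).path.lower()
    for rank, pattern in enumerate(PRIORITY_PATTERNS):
        if pattern in path:
            return rank
    return len(PRIORITY_PATTERNS)


def order_for_crawl(urls: list[str]) -> list[str]:
    """Stable sort, so URLs with equal priority keep their discovery order."""
    return sorted(urls, key=crawl_priority)


discovered = [
    "https://example.com/",
    "https://example.com/contact",
    "https://example.com/about-us",
    "https://example.com/team",
]
print(order_for_crawl(discovered))
```

With a per-domain page cap, an ordering like this decides which pages are fetched before the budget runs out.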

4. Example Workflow: An SDR Enrichment Pipeline for 500 SaaS Companies

This is the workflow in production at a mid-market SaaS BDR org running account-based outbound. The team owns 500 SaaS companies fitting ICP (50–500 employees, US-based, Series B+, using a competing CRM). They need named contacts in Ops, RevOps, and SalesOps seats.

Step 1 — Ingest the domain list. CSV with one domain column, assembled from Crunchbase exports, Sales Navigator searches, and inbound MQLs that never converted. Upload as JSON to the actor input or reference via S3 URL.
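
A minimal sketch of the CSV-to-JSON step follows. The column name "domain" and the "startUrls" input key are assumptions made for illustration; check the actor's input schema for the field names it actually expects.

```python
import csv
import io
import json


def domains_to_start_urls(csv_text: str, column: str = "domain") -> list[dict]:
    """Normalize a one-column domain CSV into de-duplicated https:// start URLs."""
    seen: set[str] = set()
    start_urls: list[dict] = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        domain = (row.get(column) or "").strip().lower()
        # Drop any scheme and trailing slash so duplicates collapse.
        domain = domain.removeprefix("https://").removeprefix("http://").rstrip("/")
        if not domain or domain in seen:
            continue
        seen.add(domain)
        start_urls.append({"url": f"https://{domain}"})
    return start_urls


sample_csv = "domain\nExample.com\nhttps://acme.io/\nexample.com\n"
# "startUrls" is a hypothetical key name for the actor input.
actor_input = {"startUrls": domains_to_start_urls(sample_csv)}
print(json.dumps(actor_input, indent=2))
```

De-duplicating before the run matters because lists merged from several sources tend to repeat domains with different casing or schemes.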

Step 2 — Run the Website Email Extractor across all 500 domains. Crawl up to 30 pages per domain with priority on URL patterns matching /team, /about, /leadership, /people, /press. Filter social media and CDN paths. Max crawl depth 2. Runtime: 45–90 minutes at default concurrency. Yield: 8–15 emails per domain, of which 3–6 are named contacts rather than catch-alls.
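
The crawl limits in this step can be sketched as a single link filter. The host list, asset suffixes, and function name below are hypothetical; only the 30-page cap and depth of 2 come from the step above.

```python
from urllib.parse import urlparse

MAX_DEPTH = 2               # from the step above
MAX_PAGES_PER_DOMAIN = 30   # from the step above
# Hypothetical lists for illustration; a real run would tune these.
SOCIAL_HOSTS = {"linkedin.com", "twitter.com", "x.com", "facebook.com",
                "instagram.com", "youtube.com"}
ASSET_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".css", ".js",
                  ".pdf", ".woff2")


def should_follow(url: str, root_domain: str, depth: int, pages_seen: int) -> bool:
    """Decide whether a discovered link is worth fetching for this domain."""
    if depth > MAX_DEPTH or pages_seen >= MAX_PAGES_PER_DOMAIN:
        return False
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower().removeprefix("www.")
    if host in SOCIAL_HOSTS:
        return False
    # Stay on the target domain or one of its subdomains.
    if host != root_domain and not host.endswith("." + root_domain):
        return False
    # Skip static assets, which is where CDN-style paths usually point.
    return not parsed.path.lower().endswith(ASSET_SUFFIXES)


print(should_follow("https://www.example.com/team", "example.com", 1, 3))
print(should_follow("https://www.linkedin.com/company/example", "example.com", 1, 3))
```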

Step 3 — Enrich with company metadata. Pipe through the Company Enrichment Tool to append technographics (CRM in use, MarTech stack), firmographics (employees, funding stage, HQ), and inferred industry. This is what lets you score against ICP — a 50-employee Salesforce shop is a different sequence from a 400-employee HubSpot shop.

Step 4 — Verify, score, push to CRM. Run emails through the Bulk Email Validator to drop addresses that would bounce. Apply ICP scoring, push top-tier rows to Salesforce or HubSpot via webhook, and route second-tier rows to the SDR research queue. Wall-clock time from raw CSV to enriched, scored, CRM-loaded records: under three hours.
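
The scoring and routing logic might look like the sketch below. The ICP criteria mirror the ones described for this team, but the field names, the set of competing CRMs, and the tier thresholds are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical reference sets for illustration.
SERIES_B_PLUS = {"series b", "series c", "series d", "series e"}
COMPETING_CRMS = {"salesforce", "hubspot"}


@dataclass
class Account:
    domain: str
    employees: int
    country: str
    funding_stage: str
    crm: str
    named_contacts: int  # verified, non-catch-all contacts found


def icp_score(account: Account) -> int:
    """Count how many of the four ICP criteria the account meets (0 to 4)."""
    checks = (
        50 <= account.employees <= 500,
        account.country.upper() == "US",
        account.funding_stage.lower() in SERIES_B_PLUS,
        account.crm.lower() in COMPETING_CRMS,
    )
    return sum(checks)


def route(account: Account) -> str:
    """Top tier goes to the CRM, second tier to SDR research, the rest is dropped."""
    score = icp_score(account)
    if score == 4 and account.named_contacts > 0:
        return "crm"
    if score >= 2:
        return "sdr_research"
    return "drop"


print(route(Account("a.example", 120, "US", "Series C", "HubSpot", 4)))
```

Keeping the score as a plain count makes it easy to explain to SDRs why an account landed in a given queue.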

5. Use Cases Beyond the SDR Pipeline

SDR pre-meeting research. Five minutes before discovery, extract the prospect's VP of Sales, RevOps lead, and CFO.
M&A target outreach. PE associates building roll-up lists need every operating-company site parsed for leadership. Aggregators miss founder-owned businesses.
Recruiter cold email to engineering. Extract the IC roster from /team, append GitHub handles, personalize at scale.
Agency client list enrichment. Take client TAM, return it with named contacts and verified emails. Margin lives in the automation.
Vendor due diligence and KYB. Procurement teams need named finance, security, and legal contacts — not the RFP rep.
Partnership and BD outreach. Surface the partner's BD lead, head of ecosystem, and product owner before pitching.
Founder-led cold email. Pre-Series-A founders running 300-account outbound personally — extraction replaces three weeks of LinkedIn scrolling.
Journalist source-finding. Reporters need press contacts and named executives across 100+ companies in a single day.
OSINT and competitive intelligence. Combine /careers extraction with the LinkedIn Jobs Scraper to infer roadmap and headcount growth.
Local-business outreach. Agencies pitching dentists, law firms, or manufacturers pair extraction with the Google Maps Lead Scraper.

6. Run the Website Email Extractor on Apify

The fastest path from "I have a list of domains" to "I have a list of named contacts and verified emails" is the actor itself. It accepts CSV or JSON arrays of starting URLs, crawls to configurable depth with priority on contact-bearing page patterns, and returns a normalized dataset you can export as JSON, CSV, Excel, or push to a webhook.

→ Launch the Website Email Extractor on Apify

Pay-per-event pricing means you only pay for successful extractions. Free tier covers the first few hundred domains for evaluation. For ongoing pipelines, browse the full NexGenData actor catalog.

7. Related Actors and Cross-Functional Tools

Contact Info Scraper — general-purpose extractor for emails, phones, and social handles from any URL.
Company Enrichment Tool — domain-to-firmographics resolver for employee count, funding, technographics.
B2B Leads Finder — Apollo alternative for when website extraction yields nothing and you need aggregator fallback.
Lead List Enricher — input CSV of domains, output emails, phones, socials appended.
Google Maps Lead Scraper — for SMB and local-business outbound where the website may not exist.
LinkedIn Jobs Scraper — source decision-makers by reading hiring signals. Three RevOps roles posted is a buying signal worth a sequence.

For directory-sourced contact discovery (Crunchbase, YC, SEC, ACRA), see contact extraction for lead generation workflows. For pairing contacts with funding-event intent data, see startup funding data for investors, recruiters, and sales. The full lead generation data tools category covers adjacent workflows.

8. Frequently Asked Questions

Is extracting contact information from public company websites GDPR-compliant?

Publishing a contact on a public company website does not take it outside data-protection law. Generic role addresses such as info@ are generally not personal data, but GDPR and CCPA still apply when the address belongs to an identifiable person. The defensible playbook: rely on legitimate interest under GDPR Article 6(1)(f), document the balancing test, honor opt-out requests within 30 days, and never scrape behind login walls. Pair scraping with email verification before sending.

What is the difference between scraping a company website and using LinkedIn or Apollo?

Apollo, ZoomInfo, and Cognism are aggregated databases — they license and merge dozens of sources, then resell per-seat or per-credit. Scraping a company site directly gets data straight from the source, fresher and free of aggregator staleness, but only what the company chose to publish. Mature outbound teams run a hybrid: a website extractor catches named contacts on About / Team / Press that aggregators miss, and a paid database fills gaps.

How do I avoid generic info@ or contact@ catch-all emails?

Configure the Website Email Extractor to flag role-based local-parts (info, hello, sales, contact, admin). Run extracted addresses through the Email Verification Tool — catch-all domains return ambiguous SMTP responses you can route separately. Prioritize emails from /team, /about, /leadership, and press releases over /contact, which is where catch-alls cluster.
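
If you want to apply the same flagging yourself after export, a minimal sketch is below. The local-part list is the one named above; the function names and the plus-tag handling are illustrative assumptions, not the actor's implementation.

```python
# Role-based local-parts listed in the answer above.
ROLE_LOCAL_PARTS = {"info", "hello", "sales", "contact", "admin"}


def is_role_based(email: str) -> bool:
    """True when the part before @ is a generic role name, ignoring +tags."""
    local = email.split("@", 1)[0].lower()
    local = local.split("+", 1)[0]
    return local in ROLE_LOCAL_PARTS


def split_contacts(emails: list[str]) -> tuple[list[str], list[str]]:
    """Return (named, role_based) lists, preserving input order."""
    named = [e for e in emails if not is_role_based(e)]
    role_based = [e for e in emails if is_role_based(e)]
    return named, role_based


extracted = ["info@example.com", "jane.doe@example.com", "Sales+eu@example.com"]
named, role_based = split_contacts(extracted)
print(named)       # ['jane.doe@example.com']
print(role_based)  # ['info@example.com', 'Sales+eu@example.com']
```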

Can I verify the emails I extract before sending?

Yes, and you should. Run every address through a bulk verifier doing MX lookup, SMTP handshake, and disposable-domain detection. Mailbox providers such as Gmail and Outlook start penalizing sender reputation once bounce rates climb above roughly 2% (you can monitor this in Gmail Postmaster Tools and Microsoft SNDS), and unverified scraped lists routinely bounce at 15–30%. The NexGenData Bulk Email Validator handles this at fractions of a cent per address.
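
The arithmetic behind that threshold is simple enough to sketch. The list size and the number of bad addresses a verifier removes are hypothetical figures chosen for illustration; only the 2% level and the 15% lower bound come from the answer above.

```python
BOUNCE_THRESHOLD = 0.02  # the 2% level cited above


def bounce_rate(sent: int, bounced: int) -> float:
    """Fraction of sent messages that bounced."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return bounced / sent


def within_threshold(sent: int, bounced: int,
                     threshold: float = BOUNCE_THRESHOLD) -> bool:
    return bounce_rate(sent, bounced) <= threshold


# Hypothetical list of 1,000 scraped addresses, 15% of them undeliverable.
total, bad = 1000, 150
print(within_threshold(total, bad))  # False: 15% is far above 2%

# Assume a verifier removes 145 of the 150 bad addresses before sending.
removed = 145
print(within_threshold(total - removed, bad - removed))  # True: 5 of 855
```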

Do these actors bypass Cloudflare and anti-bot protection?

Apify's infrastructure includes residential proxy rotation, browser fingerprint randomization, and headless Chromium rendering for JavaScript-heavy sites. For most company marketing sites — which want to be crawled by Google — extraction works on the first attempt. Hardened sites may require the Scraping Browser. None of the listed actors bypass authentication walls; they only read publicly-served HTML.

What is the cost per 1,000 company websites?

On NexGenData pay-per-event pricing, extracting contact data from 1,000 typical company websites runs in the low-single-digit-dollar range — orders of magnitude cheaper than Apollo or ZoomInfo per-credit pricing. Cost scales with page depth: crawling just the homepage is cheapest; recursively walking About + Team + Leadership + Press yields 3–5× more named contacts.

Can I bulk-process an existing account list?

That is the primary workflow. The Website Email Extractor accepts an array of starting URLs — feed it 500, 5,000, or 50,000 domains as CSV and it parallelizes across Apify's compute. The Lead List Enricher is purpose-built for this: input domains, output emails, phones, socials appended. Both push to webhook endpoints for direct CRM or warehouse ingestion.
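
For very large lists you may still prefer to submit domains in batches so a failed run is cheap to retry. That is a design choice on the caller's side, not a requirement stated for the actor, and the batch size below is a hypothetical value.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Hypothetical example: 50,000 placeholder domains in batches of 5,000.
domains = [f"company{i}.example" for i in range(50_000)]
batches = list(batched(domains, 5_000))
print(len(batches))      # 10
print(len(batches[-1]))  # 5000
```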

How does this compare to Hunter.io or Clearbit?

Hunter and Clearbit are SaaS layers on top of public web data — they crawl, normalize, and charge per lookup. The NexGenData actors give you the underlying crawl-and-extract capability at infrastructure cost rather than retail SaaS pricing, with the trade-off that you handle orchestration. For teams enriching more than ~5,000 domains per month, running actors directly wins on math.
