
Vhub Systems

Stop Manually Researching Prospects — Automate CRM Contact Enrichment With a Web Scraper

Your CRM is full of company names and domains with no contact data. Manually researching each one takes 5–10 minutes per prospect. At 100 prospects, that's a full day of work every week.

Here's how to automate it.

The enrichment gap

Sales teams typically capture a company name and domain from an inbound form or LinkedIn search — then stop there. The result: a CRM with 2,000 entries that look like:

Company: Acme Corp
Website: acmecorp.com
Email: [empty]
Phone: [empty]
Owner: [empty]

To actually reach someone, a rep has to manually visit the site, find the contact page, copy the email, and paste it into the CRM. Multiply by 500 prospects and you've wasted a week.

What automated enrichment looks like

A contact enrichment pipeline has three stages:

  1. Extract — Pull all publicly visible contact info from company websites
  2. Deduplicate + Validate — Remove duplicate emails, flag invalid formats
  3. Sync — Push enriched data back into your CRM
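Stage 2 is often the easiest to hand-roll. A minimal sketch (the regex and helper name are illustrative, not part of any particular library):

```python
import re

# Simple format check; real validation (MX lookup, deliverability) comes later
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

def clean_emails(emails):
    """Lowercase, deduplicate (order-preserving), and drop invalid formats."""
    seen = set()
    valid = []
    for e in emails:
        e = e.strip().lower()
        if e in seen or not EMAIL_RE.match(e):
            continue
        seen.add(e)
        valid.append(e)
    return valid

print(clean_emails(["Hello@Acme.com", "hello@acme.com", "not-an-email", "sales@acme.com"]))
# → ['hello@acme.com', 'sales@acme.com']
```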

The extraction step is where most teams get stuck. Here's the simplest way to do it.

The contact scraper

The Contact Info Scraper crawls any list of URLs and returns structured contact data:

{
  "url": "https://acmecorp.com",
  "emails": ["hello@acmecorp.com", "sales@acmecorp.com"],
  "phones": ["+1-415-555-0100"],
  "linkedIn": "https://linkedin.com/company/acme-corp",
  "twitter": "https://twitter.com/acmecorp",
  "facebook": "https://facebook.com/acmecorp"
}

It crawls up to 10 pages per domain by default (enough to hit /contact, /about, /team, and the footer), extracts all contact patterns using regex and semantic detection, and returns structured JSON.
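The regex side of that extraction is simpler than it sounds. This is an illustrative sketch only — the actor's real extraction adds semantic detection and obfuscation handling on top of patterns like these:

```python
import re

# Core patterns: one for emails, one for loosely formatted phone numbers
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")

html = """
<footer>
  Reach us at <a href="mailto:hello@acmecorp.com">hello@acmecorp.com</a>
  or call +1-415-555-0100.
</footer>
"""

emails = sorted(set(EMAIL_PATTERN.findall(html)))  # set() dedupes href + text
phones = PHONE_PATTERN.findall(html)
print(emails, phones)
# → ['hello@acmecorp.com'] ['+1-415-555-0100']
```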

Building the enrichment pipeline

Step 1: Export domains from CRM

Pull all records with an empty email field:

# HubSpot example
import hubspot

client = hubspot.Client.create(access_token="YOUR_TOKEN")

# Get contacts whose email field is empty
contacts = client.crm.contacts.search_api.do_search({
    "filterGroups": [{
        "filters": [{"propertyName": "email", "operator": "NOT_HAS_PROPERTY"}]
    }],
    "properties": ["company", "website"]
})

# Map each website domain to its contact ID so enriched data
# can be written back to the right record in Step 4
domain_to_contact_id = {}
for c in contacts.results:
    website = c.properties.get('website')
    if website:
        domain = website.replace('https://', '').replace('http://', '').rstrip('/')
        domain_to_contact_id[domain] = c.id

domains = list(domain_to_contact_id)
print(f"Enriching {len(domains)} records")

Step 2: Run the contact scraper

import requests

API_TOKEN = "your_apify_token"
ACTOR = "lanky_quantifier~contact-info-scraper"

run = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/runs",
    params={"token": API_TOKEN},
    json={
        "startUrls": [{"url": d if d.startswith("http") else f"https://{d}"} for d in domains],
        "maxDepth": 2,
        "maxPagesPerCrawl": 10
    }
).json()["data"]

print(f"Run ID: {run['id']} — started")

Step 3: Wait and retrieve

import time

run_id = run['id']

while True:
    r = requests.get(
        f"https://api.apify.com/v2/acts/{ACTOR}/runs/{run_id}",
        params={"token": API_TOKEN}
    ).json()["data"]

    if r["status"] in ("SUCCEEDED", "FAILED"):
        print(f"Done: {r['status']}")
        break
    time.sleep(10)

# Get items
items = requests.get(
    f"https://api.apify.com/v2/acts/{ACTOR}/runs/{run_id}/dataset/items",
    params={"token": API_TOKEN}
).json()

Step 4: Push back to CRM

for item in items:
    domain = item['url'].replace('https://', '').replace('http://', '').rstrip('/')
    emails = item.get('emails', [])
    phones = item.get('phones', [])

    if not emails:
        continue

    # Update the matching HubSpot contact
    # (assumes domain_to_contact_id maps each domain to its CRM contact ID,
    # built while exporting in Step 1)
    client.crm.contacts.basic_api.update(
        contact_id=domain_to_contact_id[domain],
        simple_public_object_input={
            "properties": {
                "email": emails[0],
                "phone": phones[0] if phones else "",
                "hs_linkedin_company_page": item.get('linkedIn', '')
            }
        }
    )
    print(f"Enriched: {domain} → {emails[0]}")
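One gotcha in Step 4: the `url` the scraper returns may not match the CRM's `website` value character for character (scheme, a `www.` prefix, a trailing path). A small normalizer makes the dictionary lookup reliable; a sketch using only the standard library:

```python
from urllib.parse import urlparse

def normalize_domain(url):
    """Reduce a URL or bare domain to a comparable lowercase hostname."""
    if "//" not in url:
        url = "https://" + url  # urlparse needs a scheme to find the netloc
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

print(normalize_domain("https://www.acmecorp.com/contact"))  # → acmecorp.com
print(normalize_domain("acmecorp.com"))                      # → acmecorp.com
```

Apply it to both sides — the keys of `domain_to_contact_id` and the scraped `url` — before looking up the contact ID.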

Real numbers

In a test against 500 B2B SaaS company domains:

  • 432 domains returned at least one email address (86%)
  • 218 domains returned a phone number (44%)
  • 380 domains returned a LinkedIn company page (76%)
  • Average run time: ~4 minutes for 500 URLs

The 14% that returned nothing were mostly enterprise sites with no public contact info (large banks, government agencies) or sites behind aggressive bot detection (Cloudflare Enterprise).

Scheduling for ongoing enrichment

Run this weekly on new CRM entries:

# n8n workflow trigger: every Monday at 9am
# 1. Query CRM for contacts added in last 7 days without email
# 2. Run contact scraper on their domains
# 3. Push enriched data back
# 4. Send Slack alert with enrichment summary

Or use Apify's built-in scheduler to run the actor on a recurring basis against a continuously updated URL list.
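With the built-in scheduler, that amounts to creating a schedule that runs the actor on a cron expression. A sketch of the payload — the field names are assumed from Apify's Schedules API, so verify them against the current API reference before relying on them:

```python
# Assumed payload shape for POST https://api.apify.com/v2/schedules;
# check Apify's API reference before relying on exact field names.
schedule = {
    "name": "weekly-contact-enrichment",
    "cronExpression": "0 9 * * 1",  # every Monday at 9am
    "isEnabled": True,
    "actions": [{
        "type": "RUN_ACTOR",
        "actorId": "lanky_quantifier~contact-info-scraper",
    }],
}

# To create it (not executed here):
# requests.post("https://api.apify.com/v2/schedules",
#               params={"token": API_TOKEN}, json=schedule)
print(schedule["cronExpression"])
```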

What to do with the data

Once you have email addresses:

  1. Validate before sending — use Hunter.io or NeverBounce API to verify deliverability
  2. Segment by contact type — hello@ = generic, cto@ = executive, support@ = ops
  3. Personalize — check LinkedIn URL for company updates before outreach
  4. Sequence — push into your outreach tool (Apollo, Lemlist, Instantly) with warm-up enabled
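The segmentation step (point 2) is a one-liner on the local part of the address. A hypothetical helper — the category sets are assumptions you should tune to your own lists:

```python
# Route scraped emails by their local part: executive > personal > generic > ops
EXECUTIVE = {"ceo", "cto", "cfo", "coo", "founder"}
OPS = {"support", "admin", "billing", "ops"}
GENERIC = {"hello", "info", "contact", "sales", "team"}

def segment(email):
    local = email.split("@", 1)[0].lower()
    if local in EXECUTIVE:
        return "executive"
    if local in OPS:
        return "ops"
    if local in GENERIC:
        return "generic"
    return "personal"  # likely a named individual, usually the best target

print(segment("cto@acmecorp.com"))    # → executive
print(segment("hello@acmecorp.com"))  # → generic
```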

Automate the whole thing

The extraction step is solved. If you want the full pipeline — CRM pull → scrape → validate → sequence — that's what the AI Lead Gen Kit ($49) covers. Two complete n8n workflows, documented and import-ready.

Actor link: Contact Info Scraper on Apify — 831 runs, pay-per-result pricing.
