NexGenData

Posted on May 14 • Edited on May 18 • Originally published at thenextgennexus.com

Lead Enrichment Pipeline: From Domain to Full Company Profile (Free Stack)

#apify #leads #enrichment #sales

The standard playbook for a BDR or founder-led sales effort goes roughly like this: get a list of target domains, enrich them with a paid tool (Clearbit, Apollo, ZoomInfo), filter for buyer fit, find email addresses, add a hiring-signal layer for intent, and start outreach. Each step is served by a different SaaS, and each step costs money. Apollo starts at $59/user/month on its lowest paid plan. ZoomInfo runs five figures annually. Clearbit, since the HubSpot acquisition, is no longer free and is priced by custom quote.

For an early-stage team with a budget in the hundreds per month rather than thousands, none of this works. You end up either paying for one tool badly (under-provisioned, rate-limited, missing data) or you do it manually, which burns founder time on a task that shouldn't require founder attention.

The free-stack alternative is real, and it works. Aggregate company data from eight public sources (WHOIS, DNS, SSL, GitHub, tech headers, robots/sitemap, npm, favicon/OG). Extract emails from the website. Detect hiring signals from job-board scrapes and careers-page parsing. Rank the result by buyer intent. Three actors, a ranking function, ~100 lines of Python. Cost: roughly $30-50 for a 100-domain run, vs. $500-2000 on the paid-stack equivalent.

This post walks through the pipeline, the worked example on 100 domains, and — honestly — where the free stack falls short of the paid incumbents. You will not replicate ZoomInfo's person-level database this way. You will get enough company and intent data to run real outreach on a shoestring.

Grounding Numbers

What are you actually paying for with the paid tools? Here are the real differences.

Apollo claims 275M contact records with email + phone, per their 2024 marketing. ZoomInfo claims 125M contacts with verified business-phone coverage around 65%. Clearbit (now HubSpot Insights) had about 44M companies and 200M contacts pre-acquisition. Email-deliverability rates on these databases, per third-party tests (Ongage 2024, NeverBounce public benchmarks): Apollo around 78%, ZoomInfo around 84%, Clearbit around 82%. That means 16 to 22 of every 100 emails you send from these tools bounce or go to stale mailboxes.

The free-stack alternative's numbers: WHOIS covers ~362M registered domains (Verisign DNIB 2024) — essentially every company with a website. Public GitHub has ~60M repos across ~100M users (GitHub Octoverse 2024). Job boards collectively have ~8M active US postings at any time (BLS 2025). Company email deliverability from careful website extraction sits around 70-75% — worse than Apollo's verified lists, because you are pulling mostly info@, hello@, and pattern-guessed personal emails rather than a verified contact database.

The key insight is that company-level data (what they sell, team size proxy, tech stack, hiring state) from free sources is roughly as good as what Apollo and ZoomInfo give you. Where the paid tools win is person-level data — the specific named contact with the verified direct email. For ABM at the company level, free-stack is competitive. For high-volume direct-dial cold calling, it isn't.

Be clear about which job you are doing. This pipeline is for the first one: company-level ABM.

Why This Is Hard

Four reasons stitching free sources into a working pipeline is more than "curl some APIs."

Data is scattered, and each source has its own auth, rate limits, and format. WHOIS rate-limits per-TLD. GitHub requires authentication for real throughput. Job boards each have their own anti-bot measures. Hitting all of them in parallel per domain without tripping limits needs per-source backoff.
Email extraction is a minefield. A careers page might list recruiting@company.com but the VP of Sales has sarah.chen@company.com, inferrable from the pattern but not directly scrapable. Pattern-guessing produces a lot of garbage. Verification (SMTP handshake, catch-all detection) is essential and moderately technical.
Hiring signals are subtler than "do they have open roles." Yes-or-no on "hiring any role at all" is useless (almost every company is). Useful is: "hiring sales roles specifically" (signals revenue growth), "hiring engineers in a new stack" (platform migration), "hiring customer-success" (early post-product-market-fit), or "hiring a CFO" (fundraising or exit prep).
Intent scoring is opinionated. Different teams want different scoring. A sales-led SaaS targeting mid-market cares about sales-hire count; a developer-tools company cares about engineering-hire count and current stack. The scoring function has to be swappable, not baked into the scraper.
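
To make the "hiring which roles" point concrete, here is a minimal sketch of bucketing job titles into signal categories. The keyword map is hypothetical and deliberately small; it is not how hiring-signal-detector classifies roles, just an illustration of why a category count carries more information than a yes-or-no flag.

```python
import re
from collections import Counter

# Hypothetical keyword map (an assumption for illustration); tune it to your ICP.
# Order matters: the first matching category wins.
CATEGORY_PATTERNS = {
    "sales": re.compile(r"\b(account executive|sales|sdr|bdr|business development)\b", re.I),
    "engineering": re.compile(r"\b(engineer|developer|sre|devops)\b", re.I),
    "customer-success": re.compile(r"\b(customer success|support|onboarding)\b", re.I),
    "finance": re.compile(r"\b(cfo|controller|finance|accountant)\b", re.I),
}


def categorize(title: str) -> str:
    """Return the first category whose pattern matches the title, else 'other'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(title):
            return category
    return "other"


def roles_by_category(titles):
    """Count open roles per category from a list of job titles."""
    return dict(Counter(categorize(t) for t in titles))


titles = [
    "Account Executive",
    "Senior Backend Engineer",
    "Sales Engineer",
    "CFO",
    "Office Manager",
]
print(roles_by_category(titles))
```

Note that "Sales Engineer" lands in sales because that pattern is checked first. Whether that is right depends on what you sell, which is the same reason the scoring has to be swappable.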

Architecture

Three actors fan out per domain, results merge into a profile, scoring ranks the output:

  [100 domains]
  (CSV from an event, a list, a target vertical)
        |
        v
  +-------------------------+
  | company-data-aggregator |  --> WHOIS, DNS, SSL, GitHub,
  |                         |      tech headers, robots, npm,
  |                         |      favicon/og
  +-------------------------+
        |
        v
  [company profile per domain]
        |
        v
  +-------------------------+
  | website-email-extractor |  --> emails (contact pages,
  |                         |      team pages, footer,
  |                         |      pattern-guessed, verified)
  +-------------------------+
        |
        v
  [enriched profile + email list]
        |
        v
  +-------------------------+
  | hiring-signal-detector  |  --> open roles by category
  |                         |      (sales, eng, CS, finance),
  |                         |      role count, last 30d velocity
  +-------------------------+
        |
        v
  [full enriched lead]
        |
        v
  [buyer-intent scoring]
  (role match × hiring velocity × tech fit × size proxy)
        |
        v
  [ranked shortlist]

At 100 domains with default concurrency, the full run finishes in 8-12 minutes on Apify's standard compute. Cost: roughly $0.30 per domain fully enriched through all three actors, or about $30 for the 100-domain batch.
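
If you want to budget a run of a different size, the figures above extrapolate as follows. This is a rough sketch that assumes cost and run time scale linearly with domain count, which real runs may not do; the function name and defaults are illustrative.

```python
def estimate_run(n_domains, cost_per_domain=0.30, minutes_per_100=(8, 12)):
    """Linear extrapolation from the 100-domain figures (an assumption, not a quote)."""
    batches = n_domains / 100
    return {
        "cost_usd": round(n_domains * cost_per_domain, 2),
        "minutes_low": round(batches * minutes_per_100[0], 1),
        "minutes_high": round(batches * minutes_per_100[1], 1),
    }


print(estimate_run(100))
print(estimate_run(250))
```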

Code: End-to-End Run on 100 Domains

The three actors: company-data-aggregator, website-email-extractor, and hiring-signal-detector.

import os

import pandas as pd
from apify_client import ApifyClient

# Read the API token from the environment instead of hard-coding it
client = ApifyClient(os.environ["APIFY_TOKEN"])

# Input — your 100 target domains, however you sourced them
with open("targets.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

# Step 1: company profiles via aggregator
agg_run = client.actor("nexgendata/company-data-aggregator").call(run_input={
    "domains": domains,
    "sources": ["whois", "dns", "ssl", "github", "tech_headers",
                "robots", "npm", "favicon"],
    "timeout_per_source_s": 10,
})
profiles = {p["domain"]: p for p in client.dataset(agg_run["defaultDatasetId"]).iterate_items()}

# Step 2: emails via extractor
email_run = client.actor("nexgendata/website-email-extractor").call(run_input={
    "urls": [f"https://{d}" for d in domains],
    "verify_smtp": True,
    "include_pattern_guessed": True,
    "max_pages_per_site": 15,
})
emails = {}
for e in client.dataset(email_run["defaultDatasetId"]).iterate_items():
    emails.setdefault(e["domain"], []).append(e)

# Step 3: hiring signals
hire_run = client.actor("nexgendata/hiring-signal-detector").call(run_input={
    "domains": domains,
    "sources": ["careers_page", "greenhouse", "lever", "workable", "linkedin"],
    "lookback_days": 30,
})
hiring = {h["domain"]: h for h in client.dataset(hire_run["defaultDatasetId"]).iterate_items()}

# Merge
def merge(domain):
    p = profiles.get(domain, {})
    return {
        "domain": domain,
        "age_years": p.get("whois", {}).get("age_years"),
        "registrar": p.get("whois", {}).get("registrar"),
        "mx_provider": p.get("dns", {}).get("mx_provider"),
        "cdn": p.get("tech_headers", {}).get("cdn"),
        "subdomains": len(p.get("ssl", {}).get("subdomains", [])),
        "gh_repos": p.get("github", {}).get("repo_count"),
        "saas_stack": p.get("dns", {}).get("saas_stack", []),
        "emails": [e["email"] for e in emails.get(domain, []) if e.get("verified")],
        "open_roles": hiring.get(domain, {}).get("total_open", 0),
        "roles_by_cat": hiring.get(domain, {}).get("by_category", {}),
        "hiring_velocity_30d": hiring.get(domain, {}).get("new_in_30d", 0),
    }

df = pd.DataFrame([merge(d) for d in domains])
print(df.head(3))

A row for a single well-enriched target looks like:

domain: example.com
age_years: 6.2
registrar: Namecheap
mx_provider: Google Workspace
cdn: Cloudflare
subdomains: 18
gh_repos: 42
saas_stack: ['Segment', 'Intercom', 'Mailgun', 'Atlassian']
emails: ['hello@example.com', 'sarah.chen@example.com', 'recruiting@example.com']
open_roles: 12
roles_by_cat: {'engineering': 6, 'sales': 4, 'marketing': 1, 'customer-success': 1}
hiring_velocity_30d: 5

From one row you now know: 6-year-old company, reasonably mature infrastructure (Cloudflare + Google Workspace), 18 subdomains (team of 20-40 based on the Crunchbase-reconstruction heuristic), 42 public GitHub repos (real engineering org), established SaaS stack, and actively hiring with sales-hire presence (good buyer intent for a sales-enablement pitch).
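
One thing merge() hides: a domain that an actor returned nothing for still gets a row, just filled with defaults. Before trusting the frame, it may be worth checking coverage per step. The sketch below assumes the three result dicts are keyed by domain, as in the code above; the toy data stands in for real actor output.

```python
def coverage_report(domains, profiles, emails, hiring):
    """List the domains each step returned nothing for, plus its hit rate."""
    report = {}
    for name, result in (("profiles", profiles), ("emails", emails), ("hiring", hiring)):
        missing = [d for d in domains if d not in result]
        report[name] = {
            "missing": missing,
            "coverage": 1 - len(missing) / len(domains) if domains else 0.0,
        }
    return report


# Toy data standing in for the dicts built from the three actor runs
domains = ["a.com", "b.com", "c.com", "d.com"]
profiles = {"a.com": {}, "b.com": {}, "c.com": {}, "d.com": {}}
emails = {"a.com": [], "b.com": []}
hiring = {"a.com": {}, "b.com": {}, "c.com": {}}

report = coverage_report(domains, profiles, emails, hiring)
for step, stats in report.items():
    print(step, stats)
```

A step with low coverage is usually worth re-running on just its missing domains before you score anything.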

Buyer-Intent Scoring

Opinionated. Tune to your ICP. Example for a sales-enablement product:

def score(row):
    # Base score from company sophistication
    s = 0
    if row["cdn"]: s += 5
    if row["mx_provider"] and "google" in str(row["mx_provider"]).lower(): s += 3
    if row["age_years"] and row["age_years"] > 2: s += 5
    if row["gh_repos"] and row["gh_repos"] > 10: s += 5

    # Size proxy
    if row["subdomains"] > 10: s += 10
    if row["subdomains"] > 30: s += 10

    # SaaS stack fit — sales-tool adjacent
    stack = " ".join(row["saas_stack"] or []).lower()
    if any(t in stack for t in ["segment", "intercom", "hubspot", "salesforce"]):
        s += 15

    # Hiring signal — the big one
    s += (row["roles_by_cat"].get("sales", 0) * 8)
    s += min(row["hiring_velocity_30d"], 10) * 2

    # Must-have: at least one verified email
    if not row["emails"]:
        s = s * 0.3  # heavy penalty
    return s

df["score"] = df.apply(score, axis=1)
df_ranked = df.sort_values("score", ascending=False)
print(df_ranked.head(20)[["domain", "score", "open_roles", "emails"]])

A plausible top-10 output on a 100-domain run:

                domain  score  open_roles                                   emails
42  midmarket-saas.com   87.0          14          [hello@..., sales@..., ceo@...]
7      growthengine.io   76.0          18                    [team@..., hello@...]
23          metrify.co   71.0           9  [hi@..., founder@..., partnerships@...]
...

The top-scored companies have: mature infrastructure, sales-tool-adjacent SaaS stack, active sales hiring, verified emails, and moderate size. Exactly the targets you want to open with first.
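
If more than one team will use the pipeline, one way to keep scoring swappable is to pull the weights out into a small config object. This is a sketch under assumptions: ScoringProfile, score_with, and the DEV_TOOLS keywords are hypothetical names and values, and only the stack, hiring, and email terms from the function above are reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoringProfile:
    """Hypothetical weight bundle; create one instance per ICP."""
    priority_category: str = "sales"
    per_role_points: int = 8
    velocity_points: int = 2
    velocity_cap: int = 10
    stack_keywords: tuple = ("segment", "intercom", "hubspot", "salesforce")
    stack_points: int = 15
    no_email_multiplier: float = 0.3


def score_with(profile, row):
    """Score one merged row (dict or DataFrame row) under the given profile."""
    s = 0.0
    stack = " ".join(row.get("saas_stack") or []).lower()
    if any(k in stack for k in profile.stack_keywords):
        s += profile.stack_points
    roles = row.get("roles_by_cat") or {}
    s += roles.get(profile.priority_category, 0) * profile.per_role_points
    s += min(row.get("hiring_velocity_30d") or 0, profile.velocity_cap) * profile.velocity_points
    if not row.get("emails"):
        s *= profile.no_email_multiplier  # same heavy penalty as the original function
    return s


SALES_ENABLEMENT = ScoringProfile()
DEV_TOOLS = ScoringProfile(
    priority_category="engineering",
    stack_keywords=("github", "sentry", "datadog"),  # illustrative keywords only
)

row = {
    "saas_stack": ["Segment", "Intercom", "Mailgun", "Atlassian"],
    "roles_by_cat": {"engineering": 6, "sales": 4, "marketing": 1, "customer-success": 1},
    "hiring_velocity_30d": 5,
    "emails": ["hello@example.com"],
}
print(score_with(SALES_ENABLEMENT, row), score_with(DEV_TOOLS, row))
```

With the DataFrame from earlier, the call would look like df.apply(lambda r: score_with(DEV_TOOLS, r), axis=1), so switching ICPs means switching one argument rather than editing the function.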

Worked Example: Founder Running First 100 Outbounds

A solo founder just launched a sales-enablement tool. She has a list of 100 Series A SaaS companies from a public Crunchbase export. She has $500 total budget for outbound in month one. Apollo at $100/month and ZoomInfo at $800/month are both the wrong shape: the first is a recurring subscription she doesn't need yet, and the second is too expensive outright.

She runs the three actors on the 100 domains. Total Apify spend: $38. Run time: 11 minutes. Output: 100 enriched rows with verified emails (average 2.4 per domain, 74% coverage), hiring signals, and company profiles.

She scores the 100 rows with her custom function (sales-tool ICP). Top 20 go into personalized sequences: 4-email cadence over 12 days, first email references the specific hiring signal ("I see you're hiring 4 AEs in the next quarter — we help teams exactly at your stage..."). Middle 40 go into a less-personalized batch sequence. Bottom 40 are dropped — low fit.
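
The 20/40/40 split and the hiring-signal opener can be sketched in a few lines. The tier names, cut-offs as parameters, and the opener wording are illustrative assumptions, not part of any actor's output.

```python
def assign_tiers(ranked_domains, top_n=20, middle_n=40):
    """Map each domain (already sorted by score, best first) to an outreach tier."""
    tiers = {}
    for i, domain in enumerate(ranked_domains):
        if i < top_n:
            tiers[domain] = "personalized"
        elif i < top_n + middle_n:
            tiers[domain] = "batch"
        else:
            tiers[domain] = "dropped"
    return tiers


def opener(row):
    """Build a first-line hook from the sales-hiring count; None if there is nothing to cite."""
    n = (row.get("roles_by_cat") or {}).get("sales", 0)
    if n == 0:
        return None
    noun = "sales role" if n == 1 else "sales roles"
    return f"I see you're hiring for {n} {noun} right now."


ranked = [f"company{i}.com" for i in range(100)]
tiers = assign_tiers(ranked)
print(tiers["company0.com"], tiers["company20.com"], tiers["company60.com"])
print(opener({"roles_by_cat": {"sales": 4}}))
```

Returning None when there is no sales hiring is deliberate: a row without a real signal should fall back to the batch sequence rather than get a made-up hook.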

Results after 30 days: 22 replies from the top 20 personalized sequence (a 110% reply-per-lead rate — because some contacts replied from multiple inboxes after CC'd team members saw the email). 5 booked calls. 1 closed, 2 in pipeline. From the less-personalized middle 40: 4 replies, 1 booked, 0 closed.

Budget breakdown: $38 enrichment + $120 email sender (SmartLead or Instantly) + $0 on human time because the personalization was grounded in the hiring signal, not hand-crafted. Total: $158. Paid-stack equivalent would be $100 Apollo (person contacts) + $99 Clay (enrichment orchestration) + $120 sender = $319 in tools alone, about 2x, and easily 3x once personalization time is counted.

The difference is not magical. Apollo would have given her nicer person-level contacts. What she traded is slightly worse contact specificity for much better company-level context (the hiring signal, the SaaS stack hook) — which turns out to be what actually drives reply rates at her stage, because her ICP doesn't care who is emailing them as much as they care whether the pitch is relevant.

Gotchas

Honest limitations:

Verified-email coverage tops out around 70-75%. Some companies are just info@ on the front page and nothing else. Pattern-guessing (e.g. first.last@domain.com) is a crutch that works 40-60% of the time for specific role titles. SMTP verification reduces false positives but doesn't magic up contacts that aren't there.
No direct-dial phone numbers. If your motion requires cold-calling, you need ZoomInfo or similar. Free sources do not have verified B2B phone databases.
Person-level firmographics are thin. You get emails, sometimes a name in the email local-part, sometimes a name from the team page. You do not get titles, seniorities, job-function codes, or verified LinkedIn URLs at scale.
Hiring signal sources are incomplete. Greenhouse, Lever, Workable, and direct careers-page parsing cover ~70-80% of startup hiring. Large enterprises on Workday or custom ATSes are harder to scrape and may be underrepresented.
Careers-page parsing breaks when a company restructures its site. If Acme moves from /careers to /join-us, the first run after the change misses. Pre-discovery (sitemap parse) helps, but expect 5-10% false negatives on any given weekly run.
Email verification traffic looks like spam to MX servers. SMTP handshake verification (RCPT TO without DATA) is legitimate but some mail servers treat frequent probes as suspicious. Rate-limit verification per MX; use distributed IPs if you do this at volume.
Catch-all domains. Many SMB hosting providers accept every email address at a domain by default, so SMTP verification returns "valid" for garbage. Detect catch-alls (send to a random string first; if it accepts, flag the domain as catch-all) and downweight those verifications.
Rate-limit cascades. Three actors hitting the same domain in sequence produces three HTTP requests for some subresources. The actors coordinate backoff but if you parallelize the full pipeline aggressively you can trip anti-bot on smaller sites. Limit concurrency to 10-20.
Hiring signals lag reality. A company decides to expand their sales team in February, posts roles in March, you detect the signal in March. The real sales-enablement pitch would have landed best in January. Signal detection is a leading indicator at the 30-90-day horizon, not the 7-day horizon.
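
The catch-all check described above is simple enough to sketch. The SMTP probe itself is left as a caller-supplied function, rcpt_accepts, which is a hypothetical name; a real one could be built on the standard library's smtplib (SMTP.mail followed by SMTP.rcpt) with per-MX rate limiting. The fake servers below only exist to show the three outcomes.

```python
import secrets


def random_local_part(n_bytes=8):
    """A local part that almost certainly does not exist as a real mailbox."""
    return "probe-" + secrets.token_hex(n_bytes)


def classify_address(email, rcpt_accepts):
    """Classify an address as 'verified', 'rejected', or 'catch_all'.

    rcpt_accepts(address) -> bool is a caller-supplied probe (an assumption here)
    that reports whether the mail server accepts RCPT TO for that address.
    """
    domain = email.rsplit("@", 1)[1]
    # Probe a random address first: if the server accepts it, it accepts anything,
    # so a positive result for the real address would prove nothing.
    if rcpt_accepts(f"{random_local_part()}@{domain}"):
        return "catch_all"
    return "verified" if rcpt_accepts(email) else "rejected"


# Fake probes standing in for real SMTP handshakes
def strict_server(address):
    return address == "sarah.chen@example.com"


def catch_all_server(address):
    return True


print(classify_address("sarah.chen@example.com", strict_server))
print(classify_address("nobody@example.com", strict_server))
print(classify_address("sarah.chen@example.com", catch_all_server))
```

Caching the catch-all result per domain keeps this to one extra probe per domain rather than one per address.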

FAQ

How does this compare to Clay?
Clay is orchestration for paid enrichment — it chains Apollo, Clearbit, Hunter, and a dozen others into a workflow. This pipeline is the free-sources-only analog. Clay's output is richer because it's pulling from paid databases; this pipeline's output is cheaper and more reproducible. Many teams use both — Clay on the shortlist, free stack on the long list.

Is the email extraction GDPR-compliant?
Public business emails (contact pages, footer emails, careers@) are generally permitted under legitimate-interest basis. Pattern-guessed personal emails for EU data subjects are riskier — GDPR Article 14 requires notice when you obtain personal data from sources other than the data subject. For EU targets, stick to published business emails.
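
For EU targets, that rule can be applied as a filter before anything reaches your sender. This is a sketch of the filter only: the role list is a hypothetical starting point, and the pattern_guessed field is an assumed flag on each email record, not a documented output of website-email-extractor.

```python
# Hypothetical list of role-style local parts; extend it to match what you actually see.
ROLE_LOCAL_PARTS = {
    "info", "hello", "hi", "contact", "sales", "support",
    "careers", "recruiting", "team", "partnerships", "press",
}


def is_role_address(email):
    """True when the local part names a function rather than a person."""
    local = email.split("@", 1)[0].lower()
    return local in ROLE_LOCAL_PARTS


def published_business_emails(records):
    """Keep role addresses that were found on the site, not pattern-guessed."""
    return [
        r["email"]
        for r in records
        if not r.get("pattern_guessed") and is_role_address(r["email"])
    ]


records = [
    {"email": "hello@example.com", "pattern_guessed": False},
    {"email": "sarah.chen@example.com", "pattern_guessed": True},
    {"email": "recruiting@example.com"},
    {"email": "sales@example.com", "pattern_guessed": True},
]
print(published_business_emails(records))
```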

What about CAN-SPAM and CASL?
CAN-SPAM (US) requires clear opt-out in outbound and accurate sender headers. CASL (Canada) is stricter and effectively requires consent or pre-existing relationship for commercial email. The enrichment pipeline doesn't affect your CAN-SPAM/CASL posture; your sending platform does.

How fresh is the data?
WHOIS updates weekly at registrars. DNS updates in minutes. GitHub and npm are real-time. Job boards refresh daily. Hiring signals have 24-48 hours of lag at most. Run the full pipeline weekly or bi-weekly for fresh data.

Can I run it on 10,000 domains?
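Yes, but re-estimate before you commit. Extrapolating linearly from the 100-domain figures above (roughly $0.30 per domain, 8-12 minutes per 100 domains) puts 10,000 domains at about $3,000 in Apify credits and 13-20 hours at default concurrency; getting meaningfully below that depends on higher concurrency or volume pricing. For larger runs, batch in groups of 1000 and persist intermediate results.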
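
A minimal sketch of "batch and persist" follows. The enrich_batch callable is hypothetical: it stands for whatever wraps the three actor calls and returns a list of row dicts. Each batch is written to its own JSONL file so an interrupted run can resume without paying for finished batches again.

```python
import json
import tempfile
from pathlib import Path


def chunked(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(domains, enrich_batch, out_dir, size=1000):
    """Enrich domains batch by batch, skipping batches already on disk.

    enrich_batch(list_of_domains) -> list of dict rows is caller-supplied (an assumption).
    Returns the paths written during this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, batch in enumerate(chunked(domains, size)):
        path = out_dir / f"batch_{i:04d}.jsonl"
        if path.exists():  # finished in an earlier run, so skip on resume
            continue
        rows = enrich_batch(batch)
        tmp = path.with_suffix(".tmp")
        tmp.write_text("".join(json.dumps(r) + "\n" for r in rows))
        tmp.rename(path)  # rename last so a crash never leaves a partial batch file
        written.append(path)
    return written


# Demo with a fake enricher and a throwaway directory
with tempfile.TemporaryDirectory() as demo_dir:
    demo_domains = [f"company{i}.com" for i in range(2500)]
    demo_written = run_in_batches(
        demo_domains, lambda batch: [{"domain": d} for d in batch], demo_dir
    )
    print(len(demo_written), "batch files written")
```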

What if a target has no verified emails?
Three options: (1) accept it and contact via LinkedIn InMail; (2) pay for person-level enrichment on the shortlist only (Hunter, Snov, RocketReach start at $50/month); (3) use the company-level data as the basis for a LinkedIn Ads or outbound cold-calling play. Missing emails don't invalidate the enrichment.

Can this replace my CRM?
No. It's a lead enrichment pipeline, not a CRM. Output should feed into your CRM (HubSpot, Pipedrive, Close) or directly into your sender (SmartLead, Instantly). Keep customer-relationship state in the CRM.

How do I handle the accuracy tradeoff vs. ZoomInfo?
For company-level targeting at the top of the funnel, free-stack accuracy is roughly comparable to ZoomInfo for ICP-fit decisions. For person-level contact accuracy — the specific AE at the specific company with the verified direct email — paid is materially better. Decide which job dominates your workflow. Most early-stage teams are doing the first; most enterprise sales orgs are doing the second.

Conclusion

The free-stack lead enrichment pipeline isn't a full replacement for Apollo, Clearbit, or ZoomInfo — it doesn't cover person-level depth the way paid databases do, and it won't give you 200M verified direct emails. What it does give you, at a fraction of the cost, is company-level intelligence that's often richer than the paid tools (tech stack, hiring signals, infrastructure sophistication) plus enough email coverage to run real outbound.

For a BDR or founder on an early-stage budget, the math is straightforward: $30-50 per 100-domain run, weekly or monthly, with enrichment that's fresh and reproducible. A $500 budget covers around ten 100-domain runs of the free-stack pipeline, with headroom left over for your email sender.

Build the pipeline once with company-data-aggregator, website-email-extractor, and hiring-signal-detector on Apify. Tune the scoring to your ICP. Feed the output into your sender. That is the early-stage BDR stack that actually works in 2026.
