How to Build a LinkedIn Lead Generation Pipeline with Python and Apify
B2B outreach lives and dies by lead quality. Cold-emailing generic lists burns your domain reputation and wastes everyone's time. What you actually need is a pipeline that finds the right people at the right companies — and keeps that data fresh.
In this guide, I'll walk you through building a LinkedIn lead generation pipeline using Python and Apify. We'll extract job postings (which reveal hiring intent, budget, and tech stack), enrich the data, and output a clean CSV ready for outreach.
Why LinkedIn Job Postings Are a Gold Mine for B2B
Most people scrape LinkedIn profiles. That's the wrong target for lead gen. Here's why job postings are better:
- Hiring intent = budget. A company posting 5 Python developer roles has money to spend.
- Job descriptions reveal pain points. "Looking for someone to build our data pipeline" = they need data infrastructure.
- Contact info is easier to find. You know the company name and department — that's enough to find the hiring manager.
- Less competition. Everyone scrapes profiles. Few people systematically mine job postings.
Architecture Overview
```
LinkedIn Jobs → Apify Actor → Raw JSON → Python Pipeline → Enriched CSV
                                              ↓
                                     Filtering & Scoring
                                              ↓
                                     CRM / Outreach Tool
```
The pipeline has three stages:
- Extract: Pull job postings matching your criteria
- Transform: Clean, deduplicate, and score leads
- Load: Output to CSV or push to your CRM
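The three stages above can be sketched as one generic entry point. This is a minimal sketch with placeholder lambdas standing in for the real extract, transform, and load steps built later in this guide:

```python
from typing import Callable

def run_pipeline(extract: Callable[[], list],
                 transform: Callable[[list], list],
                 load: Callable[[list], None]) -> int:
    """Generic three-stage pipeline. Returns the number of records loaded."""
    raw = extract()          # e.g. pull job postings from Apify
    leads = transform(raw)   # clean, deduplicate, score
    load(leads)              # write a CSV or push to a CRM
    return len(leads)

# Toy run with placeholder stages
records = [{"company": "Acme"}, {"company": "Acme"}, {"company": "Globex"}]
loaded = []
n = run_pipeline(
    extract=lambda: records,
    # Dedupe by converting each dict to a hashable tuple of its items
    transform=lambda rows: [dict(t) for t in {tuple(r.items()) for r in rows}],
    load=loaded.extend,
)
print(n)  # 2
```

Keeping the stages this loosely coupled makes it easy to swap the load step from CSV to a CRM push later without touching extraction or scoring.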
Step 1: Set Up the Apify Actor
First, you'll need an Apify account. The LinkedIn Jobs Scraper actor handles the extraction — it manages proxies, rate limiting, and pagination automatically.
```python
from apify_client import ApifyClient
import os

# Initialize the Apify client
client = ApifyClient(os.environ.get("APIFY_TOKEN"))

# Configure the actor input
run_input = {
    "searchQueries": [
        "Python developer",
        "data engineer",
        "DevOps engineer",
    ],
    "location": "United States",
    "maxResults": 500,
    "proxy": {
        "useApifyProxy": True,
        "apifyProxyGroups": ["RESIDENTIAL"],
    },
}

# Run the actor and wait for it to finish
run = client.actor("cryptosignals/linkedin-jobs-scraper").call(run_input=run_input)
print(f"Actor finished. Dataset ID: {run['defaultDatasetId']}")
```
This pulls up to 500 job postings matching your search queries. The actor handles LinkedIn's anti-bot measures — rotating proxies, request timing, and session management.
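One caveat: `.call()` returns the run object even when the run fails, so it's worth checking the status before reading the dataset. A small guard (the field names follow Apify's run object; adjust if your client version differs):

```python
def dataset_id_from_run(run: dict) -> str:
    """Return the dataset ID from a finished Apify run, or raise on failure."""
    if run.get("status") != "SUCCEEDED":
        raise RuntimeError(f"Actor run ended with status {run.get('status')!r}")
    return run["defaultDatasetId"]

# Example with a dict shaped like Apify's run response
run = {"status": "SUCCEEDED", "defaultDatasetId": "abc123"}
print(dataset_id_from_run(run))  # abc123
```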
Step 2: Extract and Clean the Data
```python
import pandas as pd
from datetime import datetime

# Fetch results from the dataset
dataset = client.dataset(run["defaultDatasetId"])
items = list(dataset.iterate_items())
print(f"Raw results: {len(items)} job postings")

# Convert to a DataFrame for easier manipulation
df = pd.DataFrame(items)

# Keep only the fields we need
columns_to_keep = [
    "title", "company", "location", "description",
    "postedAt", "applicationsCount", "salary", "url",
]
df = df[[c for c in columns_to_keep if c in df.columns]]

# Remove duplicates (same company + same title)
df = df.drop_duplicates(subset=["company", "title"])

# Parse dates and sort newest first
df["postedAt"] = pd.to_datetime(df["postedAt"], errors="coerce")
df = df.sort_values("postedAt", ascending=False)

print(f"After cleaning: {len(df)} unique postings")
```
Step 3: Score and Filter Leads
Not all job postings are equal. Score them based on your ideal customer profile:
```python
def score_lead(row):
    score = 0
    desc = str(row.get("description", "")).lower()
    title = str(row.get("title", "")).lower()

    # High-value signals
    if any(kw in desc for kw in ["data pipeline", "etl", "scraping", "automation"]):
        score += 30
    if any(kw in desc for kw in ["series a", "series b", "funded", "growing"]):
        score += 20
    if any(kw in title for kw in ["senior", "lead", "head of", "director"]):
        score += 15

    # Budget signal (crude heuristic: the salary string mentions a $150k figure)
    if "150" in str(row.get("salary") or ""):
        score += 10

    # Freshness bonus (guard against NaT from unparseable dates)
    posted = row.get("postedAt")
    if pd.notna(posted):
        days_old = (datetime.now() - posted).days
        if days_old <= 7:
            score += 15
        elif days_old <= 14:
            score += 5

    return score

df["lead_score"] = df.apply(score_lead, axis=1)
df = df.sort_values("lead_score", ascending=False)

# Filter: only keep scores above a threshold
high_quality = df[df["lead_score"] >= 25]
print(f"High-quality leads: {len(high_quality)}")
```
Step 4: Enrich with Company Data
For each company, you want to find the decision-maker. Here's a simple enrichment step:
```python
def extract_company_domain(company_name):
    """Guess the company domain from the name (rough heuristic; verify before emailing)."""
    clean = company_name.lower().strip()
    clean = clean.replace(" ", "").replace(",", "").replace(".", "")
    return f"{clean}.com"

def build_outreach_record(row):
    return {
        "company": row["company"],
        "job_title": row["title"],
        "location": row.get("location", ""),
        "job_url": row.get("url", ""),
        "lead_score": row["lead_score"],
        "domain_guess": extract_company_domain(row["company"]),
        "posted": row.get("postedAt", ""),
        "outreach_angle": generate_angle(row),
    }

def generate_angle(row):
    desc = str(row.get("description", "")).lower()
    if "data pipeline" in desc or "etl" in desc:
        return "They're building data infrastructure — offer pipeline consulting"
    if "scraping" in desc or "web data" in desc:
        return "They need web data — offer scraping services"
    if "automation" in desc:
        return "They're automating — offer integration help"
    return "General outreach — reference their job posting"

# Build the outreach list
outreach = [build_outreach_record(row) for _, row in high_quality.iterrows()]
outreach_df = pd.DataFrame(outreach)
```
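To go from a domain guess to an actual contact, a common next step is generating candidate email patterns for the hiring manager and checking them against an email-verification service. This sketch only builds the guesses; verification is left to whichever service you use:

```python
def candidate_emails(first: str, last: str, domain: str) -> list[str]:
    """Generate common corporate email patterns for a person at a domain."""
    f, l = first.lower(), last.lower()
    patterns = [
        f"{f}.{l}",    # jane.doe
        f"{f}{l}",     # janedoe
        f"{f[0]}{l}",  # jdoe
        f"{f}",        # jane
        f"{f}_{l}",    # jane_doe
    ]
    return [f"{p}@{domain}" for p in patterns]

print(candidate_emails("Jane", "Doe", "acmecorp.com")[0])  # jane.doe@acmecorp.com
```

Never blast all five guesses; bounces hurt your domain reputation, which defeats the point of the pipeline.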
Step 5: Export and Automate
```python
# Save to CSV
output_file = f"leads_{datetime.now().strftime('%Y%m%d')}.csv"
outreach_df.to_csv(output_file, index=False)
print(f"Saved {len(outreach_df)} leads to {output_file}")
```
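If you'd rather push leads straight into a CRM than hand off a CSV, most tools accept a JSON webhook. A sketch of the load step, where the endpoint URL and field names are placeholders for whatever your CRM expects:

```python
import json

def build_crm_payload(lead: dict) -> dict:
    """Map one outreach record onto a generic CRM contact shape."""
    return {
        "company_name": lead["company"],
        "note": f"Hiring: {lead['job_title']} ({lead['job_url']})",
        "score": lead["lead_score"],
    }

lead = {
    "company": "Acme",
    "job_title": "Data Engineer",
    "job_url": "https://example.com/job/1",
    "lead_score": 40,
}
payload = build_crm_payload(lead)
print(json.dumps(payload))

# Then POST it (the endpoint below is a placeholder for your CRM's webhook):
# requests.post("https://your-crm.example/webhooks/leads", json=payload, timeout=10)
```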
Automating the Pipeline
Set this up as a daily cron job or use Apify's scheduling:
```python
# Schedule the actor to run daily at 6 AM UTC.
# Note: schedule actions reference the actor's internal ID rather than its
# "user/name" handle, so resolve it first.
actor_id = client.actor("cryptosignals/linkedin-jobs-scraper").get()["id"]

schedule = client.schedules().create(
    name="daily-linkedin-leads",
    cron_expression="0 6 * * *",
    is_enabled=True,
    is_exclusive=False,
    actions=[{
        "type": "RUN_ACTOR",
        "actorId": actor_id,
        "runInput": run_input,
    }],
)
```
Results You Can Expect
Running this pipeline daily for a week typically yields:
- 2,000-3,000 raw job postings
- 300-500 deduplicated, scored leads
- 50-100 high-quality leads (score ≥ 25)
The key advantage is freshness. You're reaching out to companies while they're actively looking to solve the problem you can help with.
Common Pitfalls
- Don't scrape too aggressively. The Apify actor handles rate limiting, but if you're running custom code, respect LinkedIn's limits.
- Deduplicate across runs. Companies repost jobs. Keep a SQLite database of seen postings.
- Score before outreach. A pipeline without scoring is just spam with extra steps.
- Personalize. Reference the specific job posting in your outreach. "I saw you're hiring a data engineer" beats "Dear Sir/Madam."
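The cross-run deduplication pitfall above takes only a few lines with SQLite, keyed on company + title. A minimal sketch using an in-memory database; point it at a file for real runs:

```python
import sqlite3

def filter_new_postings(conn, postings):
    """Return only postings not seen in previous runs, and remember them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (company TEXT, title TEXT, "
        "PRIMARY KEY (company, title))"
    )
    fresh = []
    for p in postings:
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen VALUES (?, ?)", (p["company"], p["title"])
        )
        if cur.rowcount:  # 1 if the row was actually inserted (i.e. unseen)
            fresh.append(p)
    conn.commit()
    return fresh

conn = sqlite3.connect(":memory:")  # use e.g. "seen_postings.db" in production
batch = [{"company": "Acme", "title": "Data Engineer"},
         {"company": "Acme", "title": "Data Engineer"}]
print(len(filter_new_postings(conn, batch)))  # 1
```

Run your scored DataFrame through this before export and reposted jobs stop cluttering your outreach list.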
Wrapping Up
LinkedIn job postings are an underused signal for B2B lead generation. By combining Apify's scraping infrastructure with a Python scoring pipeline, you can build a system that surfaces high-intent leads daily — without manual prospecting.
The code above is a starting point. Customize the scoring function for your ICP, add more enrichment sources, and connect it to your outreach tool of choice.
Ready to try it? The LinkedIn Jobs Scraper actor is available on Apify Store. It handles proxies, pagination, and anti-bot detection out of the box. Just configure your search queries and hit run.
Need custom scraping? We build it for you.
If this guide helped but you need scraping at scale or a custom solution:
👉 Get a custom web scraper built in 48h → (from $99, pay with crypto)
Or use our ready-made Apify actors: cryptosignals on Apify Store