agenthustler

Posted on • Originally published at web-data-labs.com

How to Build a LinkedIn Talent Pipeline Scraper in 2026 (Without LinkedIn's API)

I spent the last two months helping a friend's recruiting agency move off a $4,000/month sourcing tool. The pitch was simple: they wanted to pull a few thousand LinkedIn profiles a week based on job titles, enrich them, score them, and feed the top matches into their ATS. LinkedIn's official API, as anyone who has tried it knows, is basically a locked door unless you're a Fortune 500 partner. So we went the scraper route — and it worked better than either of us expected.

Here's how I built it, what I learned, and the Python code you can steal.

Why not the official API?

LinkedIn's partner API (Talent Solutions) is gated behind sales calls, contracts, and minimums you don't want to see. For a small agency or a solo recruiter, it's not an option. The Sign In With LinkedIn OAuth endpoints only give you basic profile info for the user who logged in — not search, not lookups, not bulk data.

Everyone I know who scales LinkedIn sourcing does one of two things:

  1. Pays a SaaS that scrapes on their behalf and wraps it in a UI.
  2. Runs their own scraper.

Option 2 is cheaper, more flexible, and the data is yours to do with as you please.

The stack

I used Apify's LinkedIn Profile Scraper as the data layer. It handles the proxy rotation, fingerprinting, and retry logic that you really don't want to maintain yourself. I built the pipeline in plain Python — no framework, just requests, sqlite3, and a couple of CSV dumps.

Pricing was the other reason I went with this actor: $0.005 per result. For 5,000 profiles a week that's $25, which is absolutely nothing compared to the $4k/month the agency was paying before.

Step 1 — Pull the profiles

Let's say we want Senior Python Engineers in Berlin. I keep my seed list in a plain text file: one profile URL per line. You can generate that seed list from a LinkedIn search URL using the same actor, but for this article I'll assume you already have URLs.

```python
import os
import requests

APIFY_TOKEN = os.environ["APIFY_TOKEN"]
ACTOR_ID = "cryptosignals~linkedin-profile-scraper"

def scrape_profiles(urls: list[str]) -> list[dict]:
    run_input = {
        "profileUrls": urls,
        "includeSkills": True,
        "includeExperience": True,
    }
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR_ID}/run-sync-get-dataset-items",
        params={"token": APIFY_TOKEN},
        json=run_input,
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()
```

`run-sync-get-dataset-items` blocks until the actor finishes and returns the dataset in one shot. For small batches that's the easiest pattern. For 5,000+ profiles, switch to the async runs endpoint and poll.
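The async pattern looks roughly like this. It's a sketch against Apify's v2 REST API as I understand it (start a run, poll `actor-runs/{id}` until the status is terminal, then read the run's default dataset); the terminal status names are taken from Apify's run lifecycle, so double-check them against the current docs:

```python
import os
import time
import requests

APIFY_TOKEN = os.environ.get("APIFY_TOKEN", "")
ACTOR_ID = "cryptosignals~linkedin-profile-scraper"

# Statuses Apify treats as final for a run.
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def run_finished(status: str) -> bool:
    return status in TERMINAL

def scrape_async(urls: list[str], poll_secs: int = 30) -> list[dict]:
    # Start the actor run without waiting for it to complete.
    start = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR_ID}/runs",
        params={"token": APIFY_TOKEN},
        json={"profileUrls": urls},
        timeout=60,
    )
    start.raise_for_status()
    run = start.json()["data"]

    # Poll until the run reaches a terminal status.
    while not run_finished(run["status"]):
        time.sleep(poll_secs)
        run = requests.get(
            f"https://api.apify.com/v2/actor-runs/{run['id']}",
            params={"token": APIFY_TOKEN},
            timeout=60,
        ).json()["data"]

    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"Actor run ended with status {run['status']}")

    # Fetch everything the run wrote to its default dataset.
    items = requests.get(
        f"https://api.apify.com/v2/datasets/{run['defaultDatasetId']}/items",
        params={"token": APIFY_TOKEN},
        timeout=300,
    )
    items.raise_for_status()
    return items.json()
```

For very large runs you'd also want to page through the dataset with `offset`/`limit` instead of one big GET.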

Step 2 — Store and deduplicate

Recruiters re-run the same searches every week. You do not want to re-scrape profiles you already have fresh data on, both for cost and politeness.

```python
import sqlite3
from datetime import datetime, timedelta

DB = sqlite3.connect("talent.db")
DB.execute("""
CREATE TABLE IF NOT EXISTS profiles (
    url TEXT PRIMARY KEY,
    name TEXT,
    headline TEXT,
    location TEXT,
    skills TEXT,
    last_seen TEXT
)
""")

def needs_refresh(url: str, max_age_days: int = 14) -> bool:
    row = DB.execute(
        "SELECT last_seen FROM profiles WHERE url = ?", (url,)
    ).fetchone()
    if not row:
        return True
    last = datetime.fromisoformat(row[0])
    return datetime.utcnow() - last > timedelta(days=max_age_days)

def upsert(profile: dict) -> None:
    DB.execute("""
    INSERT INTO profiles(url, name, headline, location, skills, last_seen)
    VALUES (?, ?, ?, ?, ?, ?)
    ON CONFLICT(url) DO UPDATE SET
        name=excluded.name,
        headline=excluded.headline,
        location=excluded.location,
        skills=excluded.skills,
        last_seen=excluded.last_seen
    """, (
        profile["url"],
        profile.get("fullName"),
        profile.get("headline"),
        profile.get("location"),
        ",".join(profile.get("skills", [])),
        datetime.utcnow().isoformat(),
    ))
    DB.commit()
```

A 14-day TTL works well in practice. People don't update their headline every week.

Step 3 — Score candidates

This is where a pipeline stops being a scraper and starts being a tool. The agency cared about three signals: years in role, relevance of past companies, and whether the person was open to contract work (mentioned in the headline or about section).

```python
# Companies whose alumni score highly — replace with your own list.
TARGET_COMPANIES = {"Example Corp", "Another Co"}

def score(profile: dict) -> int:
    points = 0
    headline = (profile.get("headline") or "").lower()
    if "senior" in headline or "staff" in headline:
        points += 2
    if "open to" in headline or "contract" in headline:
        points += 3

    for exp in profile.get("experience", []):
        title = (exp.get("title") or "").lower()
        if "python" in title:
            points += 2
        if exp.get("company") in TARGET_COMPANIES:
            points += 5
    return points
```

Tune TARGET_COMPANIES to your industry. The agency keeps a list of ~60 companies whose alumni they love to source from; hitting one is a huge signal.

Step 4 — Ship to the ATS

Once you've ranked profiles, push the top N into wherever your team actually works. For the agency that meant a CSV drop into a shared folder, but a webhook into Greenhouse or Airtable works just as well.

```python
import csv

def export_top(profiles: list[dict], n: int = 50, path: str = "top_candidates.csv") -> None:
    # Rank with score() from step 3 and keep the top N.
    ranked = sorted(profiles, key=score, reverse=True)[:n]
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["url", "name", "headline", "location", "score"])
        for p in ranked:
            w.writerow([p.get("url"), p.get("fullName"), p.get("headline"),
                        p.get("location"), score(p)])
```
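If you'd rather push into Airtable than drop a CSV, the payload side is simple to build. This is a sketch assuming Airtable's documented create-records shape (`{"records": [{"fields": {...}}]}`); the field names are illustrative and need to match your base:

```python
def to_airtable_records(profiles: list[dict]) -> dict:
    # Airtable's create-records endpoint expects {"records": [{"fields": {...}}]}.
    # Field names here are placeholders — rename to match your table's columns.
    return {
        "records": [
            {
                "fields": {
                    "URL": p.get("url"),
                    "Name": p.get("fullName"),
                    "Headline": p.get("headline"),
                    "Location": p.get("location"),
                }
            }
            for p in profiles
        ]
    }
```

POST that with a Bearer token to `https://api.airtable.com/v0/{base_id}/{table_name}` — and note Airtable caps create requests at 10 records each, so chunk before sending.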

What I'd do differently

Two things bit me in the first month:

  • Don't scrape the same profile twice in one day. Even with proxy rotation, you're wasting money. The dedupe check above exists for a reason.
  • Rate yourself, not just the scraper. The actor handles its side. You should still cap your runs — I schedule one batch of 500 profiles every few hours rather than 5,000 in one shot. Smoother results, easier debugging.
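The batching schedule in that second point is one trivial helper plus a cron entry — a minimal sketch:

```python
def chunks(urls: list[str], size: int = 500) -> list[list[str]]:
    # Split the weekly seed list into scraper-sized batches;
    # run one batch per scheduled invocation (e.g. every few hours via cron).
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Each scheduled run pops the next batch, filters it through the dedupe check, and hands the remainder to the scraper.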

Total cost

At $0.005/result, the agency's weekly 5,000-profile refresh costs $25. Add a few dollars for compute. Compared to the SaaS they cancelled, it pays for a nice dinner every week and then some.

If you want to skip the code and just try the actor, it's here: LinkedIn Profile Scraper on Apify. The input schema is documented and the free tier lets you run a few hundred profiles before you put a card down.

Happy sourcing.
