Vhub Systems
How to Build a Remote Jobs Aggregator: Scraping LinkedIn, WWR, and Remotive

Remote job boards post thousands of listings daily across dozens of platforms. Manually checking LinkedIn, Remote.co, We Work Remotely, Remotive, and AngelList is a full-time job. Here's how to automate the aggregation.

Why build a remote jobs aggregator?

The same job is often posted on 3-7 different boards with different salary ranges, different application deadlines, and slightly different requirements. An aggregator lets you:

  • Deduplicate across boards (same role, same company)
  • Set custom alerts (Python developer, $120K+, async-first)
  • Track which companies are growing (consistent hiring = healthy)
  • Build lead lists (companies hiring = companies with budget)

The architecture

Job boards → Scrapers → Deduplication → Database → Alerts/API

Each board needs a separate scraper since they all use different HTML structures and anti-bot approaches.
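Because each scraper returns slightly different field names, it helps to normalize everything into one record shape before deduplication. Here's a minimal sketch; the `JobPosting` fields and the `normalize` helper are my own choices, not a standard:

```python
from dataclasses import dataclass, asdict

@dataclass
class JobPosting:
    title: str
    company: str
    url: str
    source: str          # which board the posting came from
    location: str = "Remote"
    salary: str = ""
    posted: str = ""

def normalize(raw: dict, source: str) -> JobPosting:
    # Map whatever keys an individual scraper produced onto the shared shape
    return JobPosting(
        title=(raw.get("title") or "").strip(),
        company=(raw.get("company") or "").strip(),
        url=raw.get("url") or raw.get("link", ""),
        source=source,
        salary=raw.get("salary") or "",
        posted=raw.get("posted") or raw.get("published", ""),
    )

job = normalize({"title": " Backend Dev ", "company": "Acme", "url": "https://example.com/j/1"}, "wwr")
```

With one shape in place, the dedup and filter stages never need to know which board a record came from.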

Board-by-board approach

LinkedIn Jobs (largest volume)

LinkedIn restricts anonymous access, but its guest job-search endpoint returns results without a login:

import requests
from bs4 import BeautifulSoup

def scrape_linkedin_jobs(keywords: str, location: str = "Remote", count: int = 25) -> list:
    url = "https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search"
    params = {
        "keywords": keywords,
        "location": location,
        "start": 0,
        "count": count,
        "f_WT": "2",  # Remote only
    }
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/122.0.0.0",
    }

    response = requests.get(url, params=params, headers=headers, timeout=15)
    if response.status_code != 200:
        return []

    soup = BeautifulSoup(response.text, "html.parser")

    def text_of(card, selector):
        # Python has no ?. operator, so guard against missing elements explicitly
        el = card.select_one(selector)
        return el.get_text(strip=True) if el else None

    jobs = []
    for card in soup.select(".job-search-card"):
        link = card.select_one("a")
        time_el = card.select_one("time")
        jobs.append({
            "title": text_of(card, ".job-search-card__title"),
            "company": text_of(card, ".job-search-card__company-name"),
            "location": text_of(card, ".job-search-card__location"),
            "posted": time_el.get("datetime") if time_el else None,
            "url": link.get("href") if link else None,
        })
    return jobs

jobs = scrape_linkedin_jobs("python developer")
print(f"Found {len(jobs)} LinkedIn jobs")
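The guest endpoint pages through results via the `start` parameter (roughly 25 cards per page, though LinkedIn can change this at any time). A pure helper that builds one params dict per page keeps the pagination logic testable; loop over these with a `time.sleep` between requests to avoid rate limiting:

```python
def linkedin_pages(keywords: str, total: int, page_size: int = 25) -> list:
    """Build one params dict per page of the guest search endpoint."""
    return [
        {
            "keywords": keywords,
            "location": "Remote",
            "f_WT": "2",
            "start": start,
            "count": page_size,
        }
        for start in range(0, total, page_size)
    ]

pages = linkedin_pages("python developer", total=75)
# three pages: start = 0, 25, 50
```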

We Work Remotely (cleaner HTML, no auth needed)

import feedparser

def scrape_wwr(category: str = "programming") -> list:
    url = f"https://weworkremotely.com/categories/remote-{category}-jobs.rss"
    feed = feedparser.parse(url)

    return [
        {
            "title": entry.title,
            "company": entry.get("company", ""),
            "url": entry.link,
            "published": entry.get("published", ""),
            "description": entry.get("summary", "")[:500],
        }
        for entry in feed.entries
    ]

wwr_jobs = scrape_wwr("programming")
print(f"Found {len(wwr_jobs)} WWR jobs")

Remotive (has an official free API)

def get_remotive_jobs(category: str = "software-dev") -> list:
    response = requests.get(
        "https://remotive.com/api/remote-jobs",
        params={"category": category, "limit": 50},
        timeout=15,
    )

    if response.status_code == 200:
        return response.json().get("jobs", [])
    return []
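Remotive's response uses its own field names (`company_name`, `publication_date`), so the raw records won't line up with what the other scrapers produce. A small mapper flattens them into the same dict keys; the field names below follow Remotive's public API response as of writing and may change:

```python
def normalize_remotive(job: dict) -> dict:
    # Field names per Remotive's public API response (subject to change)
    return {
        "title": job.get("title", ""),
        "company": job.get("company_name", ""),
        "url": job.get("url", ""),
        "posted": job.get("publication_date", ""),
        "salary": job.get("salary", ""),
    }

sample = {
    "title": "Backend Engineer",
    "company_name": "Acme",
    "url": "https://remotive.com/remote-jobs/software-dev/1",
    "publication_date": "2024-01-01T00:00:00",
}
normalized = normalize_remotive(sample)
```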

AngelList/Wellfound (requires auth)

AngelList's job data requires a session. Use Playwright:

from playwright.async_api import async_playwright
import asyncio

async def scrape_wellfound_jobs(keywords: str) -> list:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        await page.goto(f"https://wellfound.com/jobs?q={keywords}&remote=true")
        await page.wait_for_selector("[data-test='JobListItem']", timeout=10000)

        jobs = await page.evaluate("""
            () => Array.from(document.querySelectorAll('[data-test="JobListItem"]')).map(el => ({
                title: el.querySelector('[data-test="job-title"]')?.innerText,
                company: el.querySelector('[data-test="company-name"]')?.innerText,
                salary: el.querySelector('[data-test="salary"]')?.innerText,
            }))
        """)

        await browser.close()
        return jobs

# The coroutine needs an event loop to run:
wellfound_jobs = asyncio.run(scrape_wellfound_jobs("python"))

Deduplication logic

The same role appears on multiple boards. Deduplicate by company + title similarity:

from difflib import SequenceMatcher

def deduplicate_jobs(jobs: list) -> list:
    unique = []

    for job in jobs:
        is_duplicate = False
        key = f"{job.get('company','').lower()} {job.get('title','').lower()}"

        for existing in unique:
            existing_key = f"{existing.get('company','').lower()} {existing.get('title','').lower()}"
            similarity = SequenceMatcher(None, key, existing_key).ratio()

            if similarity > 0.85:  # 85% similar = same job
                is_duplicate = True
                break

        if not is_duplicate:
            unique.append(job)

    return unique

# Aggregate and deduplicate
all_jobs = (
    scrape_linkedin_jobs("python developer") +
    scrape_wwr("programming") +
    get_remotive_jobs("software-dev")
)

unique_jobs = deduplicate_jobs(all_jobs)
print(f"Total: {len(all_jobs)} | After dedup: {len(unique_jobs)}")
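Comparing every pair with `SequenceMatcher` is O(n²) across the whole result set, which gets slow once you aggregate thousands of listings. One common speedup, sketched below, is blocking: group by company first, so fuzzy comparison only runs against titles already kept for that same company. The trade-off is that it misses duplicates where the company name itself differs between boards ("Acme" vs "Acme Inc."):

```python
from collections import defaultdict
from difflib import SequenceMatcher

def deduplicate_by_company(jobs: list) -> list:
    seen_titles = defaultdict(list)   # company -> titles already kept
    unique = []
    for job in jobs:
        company = job.get("company", "").lower().strip()
        title = job.get("title", "").lower().strip()
        # Only compare against titles from the same company
        if any(SequenceMatcher(None, title, kept).ratio() > 0.85
               for kept in seen_titles[company]):
            continue
        seen_titles[company].append(title)
        unique.append(job)
    return unique

sample = [
    {"company": "Acme", "title": "Senior Python Developer"},
    {"company": "Acme", "title": "Senior Python  Developer"},  # near-duplicate, dropped
    {"company": "Beta", "title": "Senior Python Developer"},   # different company, kept
]
result = deduplicate_by_company(sample)
```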

Scheduling and alerts

Run this pipeline on schedule (n8n, cron, or Apify schedule) and filter for your criteria:

import re

def filter_jobs(jobs: list, filters: dict) -> list:
    results = []
    for job in jobs:
        title = job.get("title", "").lower()
        salary_text = job.get("salary", "") or ""

        # Keyword filter
        if filters.get("keywords"):
            if not any(kw.lower() in title for kw in filters["keywords"]):
                continue

        # Salary filter (rough): handle both "$120,000" and "$120K" formats
        if filters.get("min_salary"):
            matches = re.findall(r"\$(\d+)\s*(k?)", salary_text.replace(",", ""), re.IGNORECASE)
            if matches:
                amounts = [int(num) * (1000 if suffix.lower() == "k" else 1)
                           for num, suffix in matches]
                if max(amounts) < filters["min_salary"]:
                    continue

        results.append(job)

    return results

target_jobs = filter_jobs(unique_jobs, {
    "keywords": ["python", "backend", "api"],
    "min_salary": 100000
})

# Send alerts
for job in target_jobs:
    send_slack_notification(job)  # or email, Telegram, etc.
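`send_slack_notification` is left as a stub above. A minimal version posts to a Slack incoming webhook; the webhook URL below is a placeholder you generate in your own Slack workspace, and the message format is just one reasonable choice:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def format_job_alert(job: dict) -> str:
    # Compose a one-line summary, skipping fields a board didn't provide
    parts = [job.get("title"), job.get("company"), job.get("salary")]
    line = " | ".join(p for p in parts if p)
    return f"New remote job: {line}\n{job.get('url', '')}".strip()

def send_slack_notification(job: dict) -> bool:
    payload = {"text": format_job_alert(job)}
    resp = requests.post(SLACK_WEBHOOK_URL, json=payload, timeout=10)
    return resp.status_code == 200
```

Incoming webhooks accept a plain `{"text": "..."}` payload, so there's no extra auth step beyond keeping the URL secret.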

The pre-built option

The Remote Jobs Aggregator on Apify covers LinkedIn, WWR, Remotive, and Wellfound in one run. Input your keywords and salary filters, get deduplicated results as JSON or webhook push.

116+ production runs. Pay-per-result pricing.

Using this for lead generation

Beyond job hunting, remote job data is powerful for sales:

  • Companies hiring remotely at scale = have budget + culture fit for tools
  • 5+ backend engineer openings in 3 months = platform rebuild in progress
  • "DevOps" + "Kubernetes" + "security" = enterprise security concerns

Filter job postings by your ICP, then use the Contact Info Scraper to get decision-maker contacts from those company websites.

n8n AI Automation Pack ($39) — 5 production-ready workflows

Skip the setup

Apify Scrapers Bundle — $29 one-time

Includes the Remote Jobs Aggregator and 34 other production scrapers.
