agenthustler

How to Extract LinkedIn Company Data with Python

LinkedIn Data: The Developer's Goldmine

LinkedIn has 1 billion+ members and millions of company profiles. For developers building recruiting tools, sales intelligence platforms, or market research systems, LinkedIn data is incredibly valuable — company info, job postings, employee counts, and growth trends.

This guide covers practical approaches to extracting LinkedIn company data using Python.

The LinkedIn Data Landscape

LinkedIn provides several data access paths:

| Method | Pros | Cons |
|---|---|---|
| Official API | Reliable, sanctioned | Very limited scope, requires app review |
| Scraping | Full access to public data | Anti-bot protection, ToS concerns |
| Data providers | Clean, structured | Expensive |
| Job board APIs | Good for listings | Limited company data |

Approach 1: LinkedIn's Official API

LinkedIn's API is highly restricted: both the Marketing API and the Consumer API require LinkedIn app approval. For basic company data, a lookup looks like this:

import requests

def get_company_via_api(company_id, access_token):
    url = f'https://api.linkedin.com/v2/organizations/{company_id}'
    headers = {
        'Authorization': f'Bearer {access_token}',
        'X-Restli-Protocol-Version': '2.0.0'
    }

    response = requests.get(url, headers=headers)

    if response.status_code == 200:
        data = response.json()
        return {
            'name': data.get('localizedName'),
            'description': data.get('localizedDescription'),
            'website': data.get('websiteUrl'),
            'specialties': data.get('localizedSpecialties'),
            'employee_count': data.get('staffCount'),
        }
    return None

The catch: getting API access for anything beyond basic profile data requires LinkedIn partner program approval, which can take months.
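Even for approved apps, the endpoint above needs a bearer token first. LinkedIn uses standard OAuth 2.0; here is a minimal sketch of building the authorization-code exchange request (the token endpoint URL is LinkedIn's documented one, but treat the exact scopes your app needs for organization data as an assumption to verify against their docs):

```python
import urllib.parse

LINKEDIN_TOKEN_URL = 'https://www.linkedin.com/oauth/v2/accessToken'

def build_token_request(client_id, client_secret, auth_code, redirect_uri):
    """Return the form-encoded body for LinkedIn's OAuth 2.0 token exchange.

    POST this body to LINKEDIN_TOKEN_URL with
    Content-Type: application/x-www-form-urlencoded.
    """
    return urllib.parse.urlencode({
        'grant_type': 'authorization_code',
        'code': auth_code,
        'client_id': client_id,
        'client_secret': client_secret,
        'redirect_uri': redirect_uri,
    })
```

The JSON response contains an `access_token` field, which is what you pass as the Bearer token in `get_company_via_api`.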

Approach 2: Public Profile Scraping

LinkedIn public company pages are accessible without login. Here's how to extract data using requests with proper proxy support via ScrapeOps:

import requests
from bs4 import BeautifulSoup
import json
import time
import urllib.parse

SCRAPEOPS_API_KEY = 'your_api_key'

def scrape_linkedin_company(company_slug):
    url = f'https://www.linkedin.com/company/{company_slug}/'

    # Use ScrapeOps proxy for reliable access; URL-encode the target URL
    proxy_url = (
        'https://proxy.scrapeops.io/v1/'
        f'?api_key={SCRAPEOPS_API_KEY}'
        f'&url={urllib.parse.quote_plus(url)}&render_js=true'
    )

    response = requests.get(proxy_url, timeout=60)

    if response.status_code != 200:
        print(f'Failed: {response.status_code}')
        return None

    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract structured data from JSON-LD
    scripts = soup.find_all('script', type='application/ld+json')
    for script in scripts:
        try:
            data = json.loads(script.string)
            if data.get('@type') == 'Organization':
                return {
                    'name': data.get('name'),
                    'description': data.get('description'),
                    'url': data.get('url'),
                    'employee_count': data.get('numberOfEmployees', {}).get('value'),
                    'industry': data.get('industry'),
                    'location': data.get('address', {}).get('addressLocality'),
                    'logo': data.get('logo'),
                }
        except (json.JSONDecodeError, TypeError):  # script.string may be None
            continue

    return None
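The JSON-LD parsing step is worth sanity-checking offline before pointing it at live pages. A stdlib-only sketch against a canned snippet (the sample HTML below is fabricated, but real company pages embed the same `application/ld+json` structure):

```python
import json
import re

SAMPLE_HTML = '''
<html><head>
<script type="application/ld+json">
{"@type": "Organization", "name": "Acme Corp",
 "numberOfEmployees": {"value": 250}}
</script>
</head><body></body></html>
'''

def extract_org_jsonld(html):
    """Pull the first Organization JSON-LD object out of raw HTML."""
    for blob in re.findall(
        r'<script type="application/ld\+json">(.*?)</script>',
        html, re.DOTALL,
    ):
        try:
            data = json.loads(blob)
        except json.JSONDecodeError:
            continue
        if data.get('@type') == 'Organization':
            return data
    return None

org = extract_org_jsonld(SAMPLE_HTML)
```

This mirrors what the BeautifulSoup version does, which makes it easy to unit-test the extraction logic without network access.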

Approach 3: Scraping Job Listings

LinkedIn job listings contain rich company data and are more accessible than profiles:

def scrape_linkedin_jobs(keywords, location='United States', max_results=100):
    jobs = []
    start = 0

    while len(jobs) < max_results:
        query = urllib.parse.urlencode(
            {'keywords': keywords, 'location': location, 'start': start}
        )
        url = (
            'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/'
            f'search?{query}'
        )

        # URL-encode the target so its query string survives the proxy call
        proxy_url = (
            'https://proxy.scrapeops.io/v1/'
            f'?api_key={SCRAPEOPS_API_KEY}&url={urllib.parse.quote_plus(url)}'
        )
        response = requests.get(proxy_url, timeout=30)

        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.select('.base-card')

        if not cards:
            break

        for card in cards:
            title = card.select_one('.base-search-card__title')
            company = card.select_one('.base-search-card__subtitle')
            location_el = card.select_one('.job-search-card__location')
            link = card.select_one('a.base-card__full-link')
            date = card.select_one('time')

            jobs.append({
                'title': title.get_text(strip=True) if title else None,
                'company': company.get_text(strip=True) if company else None,
                'location': location_el.get_text(strip=True) if location_el else None,
                'url': link.get('href') if link else None,
                'posted_date': date.get('datetime') if date else None,
            })

        start += 25
        time.sleep(2)

    return jobs[:max_results]

# Example: Find all Python developer jobs
jobs = scrape_linkedin_jobs('python developer', 'New York')
print(f'Found {len(jobs)} jobs')
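The guest search endpoint sometimes repeats cards across pages, so it's worth deduplicating before storing. A small helper (a hypothetical utility, not part of any LinkedIn API) keyed on URL with a title+company fallback:

```python
def dedupe_jobs(jobs):
    """Drop duplicate job records, keyed by URL (falls back to title+company)."""
    seen = set()
    unique = []
    for job in jobs:
        key = job.get('url') or (job.get('title'), job.get('company'))
        if key in seen:
            continue
        seen.add(key)
        unique.append(job)
    return unique
```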

Extracting Company Details from Job Posts

def enrich_company_data(job_url):
    proxy_url = (
        'https://proxy.scrapeops.io/v1/'
        f'?api_key={SCRAPEOPS_API_KEY}'
        f'&url={urllib.parse.quote_plus(job_url)}&render_js=true'
    )
    response = requests.get(proxy_url, timeout=60)

    if response.status_code != 200:
        return None

    soup = BeautifulSoup(response.text, 'html.parser')

    company_info = {}

    # Job description often contains company details
    description = soup.select_one('.show-more-less-html__markup')
    if description:
        company_info['job_description'] = description.get_text(strip=True)

    # Company size, industry from sidebar
    criteria = soup.select('.description__job-criteria-item')
    for item in criteria:
        label = item.select_one('.description__job-criteria-subheader')
        value = item.select_one('.description__job-criteria-text')
        if label and value:
            company_info[label.get_text(strip=True)] = value.get_text(strip=True)

    return company_info
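The sidebar labels ("Seniority level", "Employment type", and so on) make awkward dict keys downstream. A quick normalizer helps; note the label strings are whatever the page currently renders, so treat them as subject to change:

```python
def normalize_criteria(company_info):
    """Rewrite scraped criteria labels as snake_case keys."""
    normalized = {}
    for key, value in company_info.items():
        normalized[key.strip().lower().replace(' ', '_')] = value
    return normalized
```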

Building a Company Database

import pandas as pd

def build_company_database(keywords_list, locations):
    companies = {}

    for keywords in keywords_list:
        for location in locations:
            jobs = scrape_linkedin_jobs(keywords, location, max_results=50)

            for job in jobs:
                company_name = job['company']
                if not company_name:
                    continue  # skip cards where the company wasn't parsed
                if company_name not in companies:
                    companies[company_name] = {
                        'name': company_name,
                        'job_count': 0,
                        'locations': set(),
                        'roles': [],
                    }

                companies[company_name]['job_count'] += 1
                companies[company_name]['locations'].add(job['location'])
                companies[company_name]['roles'].append(job['title'])

    # Convert to DataFrame
    records = []
    for company in companies.values():
        company['locations'] = list(company['locations'])
        company['top_roles'] = list(set(company['roles']))[:5]
        del company['roles']
        records.append(company)

    return pd.DataFrame(records).sort_values('job_count', ascending=False)
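The counting logic inside `build_company_database` is easy to exercise offline with fabricated job records before pointing it at live scrapes. A lighter-weight sibling using `collections.Counter`:

```python
from collections import Counter

def top_hiring_companies(jobs, n=3):
    """Count postings per company from a list of job dicts."""
    counts = Counter(
        job['company'] for job in jobs if job.get('company')
    )
    return counts.most_common(n)

# Fabricated records, purely for an offline check
sample = [
    {'company': 'Acme', 'title': 'Dev'},
    {'company': 'Acme', 'title': 'SRE'},
    {'company': 'Globex', 'title': 'Dev'},
]
```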

The Managed Solution

Building LinkedIn scrapers that work reliably is challenging — LinkedIn updates its anti-bot measures frequently. For production use, the LinkedIn Jobs Scraper on Apify handles proxy rotation, rate limiting, and data extraction automatically.

Tips for Reliable LinkedIn Scraping

  1. Always use proxies — LinkedIn blocks datacenter IPs aggressively. ScrapeOps provides reliable proxy aggregation.
  2. Respect rate limits — keep requests under 1 per second.
  3. Target public data — stick to public profiles and job listings.
  4. Use JSON-LD first — structured data is more reliable than HTML parsing.
  5. Handle failures — LinkedIn returns various error codes; implement retry logic.
  6. Cache aggressively — company data doesn't change hourly.
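For tip 5, retry logic can be as simple as exponential backoff around the request call. A generic sketch (the delays are illustrative, and the injectable `sleep` parameter is just there to make the helper testable):

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure, back off exponentially before retrying."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # exhausted all attempts
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

# Demo: a request that fails twice before succeeding
attempts = []
def flaky_request():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError('blocked')
    return 'ok'

result = with_retries(flaky_request, sleep=lambda s: None)
```

Wrap any of the `requests.get` calls above in a zero-argument lambda and pass it to `with_retries`.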

Conclusion

LinkedIn company data extraction in 2026 works best through a combination of public page scraping and job listing analysis. Use proper proxy infrastructure like ScrapeOps to handle anti-bot measures, and consider the LinkedIn Jobs Scraper on Apify for production workloads. Focus on public data, respect rate limits, and build incrementally.
