# LinkedIn Data: The Developer's Goldmine
LinkedIn has more than a billion members and millions of company profiles. For developers building recruiting tools, sales intelligence platforms, or market research systems, that data is extremely valuable: company info, job postings, employee counts, and growth trends.

This guide covers practical approaches to extracting LinkedIn company data with Python.
## The LinkedIn Data Landscape

LinkedIn offers several data access paths:
| Method | Pros | Cons |
|---|---|---|
| Official API | Reliable, sanctioned | Very limited scope, requires app review |
| Scraping | Full access to public data | Anti-bot protection, ToS concerns |
| Data providers | Clean, structured | Expensive |
| Job board APIs | Good for listings | Limited company data |
## Approach 1: LinkedIn's Official API

LinkedIn's API is highly restricted. The Marketing API and Consumer API both require LinkedIn app approval. For basic company data, a call looks like this:
```python
import requests

def get_company_via_api(company_id, access_token):
    url = f'https://api.linkedin.com/v2/organizations/{company_id}'
    headers = {
        'Authorization': f'Bearer {access_token}',
        'X-Restli-Protocol-Version': '2.0.0',
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        data = response.json()
        return {
            'name': data.get('localizedName'),
            'description': data.get('localizedDescription'),
            'website': data.get('websiteUrl'),
            'specialties': data.get('localizedSpecialties'),
            'employee_count': data.get('staffCount'),
        }
    return None
```
The catch: getting API access for anything beyond basic profile data requires LinkedIn partner program approval, which can take months.
## Approach 2: Public Profile Scraping

LinkedIn company pages are often reachable without logging in, though LinkedIn frequently redirects anonymous visitors to an auth wall. Here's how to extract data using requests with proxy support via ScrapeOps:
```python
import requests
from bs4 import BeautifulSoup
import json
import time
from urllib.parse import quote

SCRAPEOPS_API_KEY = 'your_api_key'

def scrape_linkedin_company(company_slug):
    url = f'https://www.linkedin.com/company/{company_slug}/'
    # Use the ScrapeOps proxy for reliable access; the target URL must be encoded
    proxy_url = (
        f'https://proxy.scrapeops.io/v1/?api_key={SCRAPEOPS_API_KEY}'
        f'&url={quote(url, safe="")}&render_js=true'
    )
    response = requests.get(proxy_url, timeout=60)
    if response.status_code != 200:
        print(f'Failed: {response.status_code}')
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    # Prefer structured data from JSON-LD over brittle CSS selectors
    for script in soup.find_all('script', type='application/ld+json'):
        try:
            data = json.loads(script.string)
        except (json.JSONDecodeError, TypeError):
            continue
        if data.get('@type') == 'Organization':
            employees = data.get('numberOfEmployees') or {}
            address = data.get('address') or {}
            return {
                'name': data.get('name'),
                'description': data.get('description'),
                'url': data.get('url'),
                'employee_count': employees.get('value'),
                'industry': data.get('industry'),
                'location': address.get('addressLocality'),
                'logo': data.get('logo'),
            }
    return None
```
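The JSON-LD parsing step is easy to exercise offline. Here's a minimal, stdlib-only sketch of the same extraction (a regex stands in for BeautifulSoup, which is usually adequate for well-formed `<script>` tags):

```python
import json
import re

JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)

def extract_org_jsonld(html):
    """Return the first JSON-LD Organization object in `html`, or None."""
    for match in JSONLD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get('@type') == 'Organization':
            return data
    return None
```

Feeding it a saved page lets you iterate on the field mapping without burning proxy credits.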
## Approach 3: Scraping Job Listings

LinkedIn job listings contain rich company data and are more accessible than company pages:
```python
from urllib.parse import quote

def scrape_linkedin_jobs(keywords, location='United States', max_results=100):
    jobs = []
    start = 0
    while len(jobs) < max_results:
        url = (
            'https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search'
            f'?keywords={quote(keywords)}&location={quote(location)}&start={start}'
        )
        proxy_url = (
            f'https://proxy.scrapeops.io/v1/?api_key={SCRAPEOPS_API_KEY}'
            f'&url={quote(url, safe="")}'
        )
        response = requests.get(proxy_url, timeout=30)
        if response.status_code != 200:
            break
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.select('.base-card')
        if not cards:
            break
        for card in cards:
            title = card.select_one('.base-search-card__title')
            company = card.select_one('.base-search-card__subtitle')
            location_el = card.select_one('.job-search-card__location')
            link = card.select_one('a.base-card__full-link')
            date = card.select_one('time')
            jobs.append({
                'title': title.get_text(strip=True) if title else None,
                'company': company.get_text(strip=True) if company else None,
                'location': location_el.get_text(strip=True) if location_el else None,
                'url': link.get('href') if link else None,
                'posted_date': date.get('datetime') if date else None,
            })
        start += 25  # the guest endpoint paginates in steps of 25
        time.sleep(2)  # stay well under LinkedIn's rate limits
    return jobs[:max_results]

# Example: find Python developer jobs in New York
jobs = scrape_linkedin_jobs('python developer', 'New York')
print(f'Found {len(jobs)} jobs')
```
## Extracting Company Details from Job Posts
```python
def enrich_company_data(job_url):
    proxy_url = (
        f'https://proxy.scrapeops.io/v1/?api_key={SCRAPEOPS_API_KEY}'
        f'&url={quote(job_url, safe="")}&render_js=true'
    )
    response = requests.get(proxy_url, timeout=60)
    if response.status_code != 200:
        return None
    soup = BeautifulSoup(response.text, 'html.parser')
    company_info = {}
    # The job description often contains company details
    description = soup.select_one('.show-more-less-html__markup')
    if description:
        company_info['job_description'] = description.get_text(strip=True)
    # Seniority level, employment type, and industry from the criteria sidebar
    for item in soup.select('.description__job-criteria-item'):
        label = item.select_one('.description__job-criteria-subheader')
        value = item.select_one('.description__job-criteria-text')
        if label and value:
            company_info[label.get_text(strip=True)] = value.get_text(strip=True)
    return company_info
```
## Building a Company Database
```python
import pandas as pd

def build_company_database(keywords_list, locations):
    companies = {}
    for keywords in keywords_list:
        for location in locations:
            jobs = scrape_linkedin_jobs(keywords, location, max_results=50)
            for job in jobs:
                company_name = job['company']
                if not company_name:
                    continue  # skip cards that failed to parse
                if company_name not in companies:
                    companies[company_name] = {
                        'name': company_name,
                        'job_count': 0,
                        'locations': set(),
                        'roles': [],
                    }
                companies[company_name]['job_count'] += 1
                companies[company_name]['locations'].add(job['location'])
                companies[company_name]['roles'].append(job['title'])
    # Convert to a DataFrame, keeping a deterministic sample of roles
    records = []
    for company in companies.values():
        company['locations'] = sorted(loc for loc in company['locations'] if loc)
        company['top_roles'] = list(dict.fromkeys(company['roles']))[:5]
        del company['roles']
        records.append(company)
    return pd.DataFrame(records).sort_values('job_count', ascending=False)
```
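One caveat with keying on raw company names: the same employer often appears as "Acme", "Acme Inc.", and "ACME" across listings, which splits its job counts. A rough normalization helper can bucket these together before aggregation; `normalize_company_name` below is a hypothetical sketch, and the suffix list should be tuned to your data:

```python
import re

# Common legal suffixes to strip (extend as needed for your dataset)
_SUFFIX_RE = re.compile(r'\b(inc|llc|ltd|corp|corporation|co)\b')

def normalize_company_name(name):
    """Lowercase, strip punctuation and legal suffixes, collapse whitespace."""
    if not name:
        return None
    cleaned = re.sub(r'[.,]', '', name.lower())
    cleaned = _SUFFIX_RE.sub('', cleaned)
    return re.sub(r'\s+', ' ', cleaned).strip()
```

Use the normalized form as the dictionary key and keep the most common raw spelling as the display name.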
## The Managed Solution

Building LinkedIn scrapers that work reliably is challenging: LinkedIn updates its anti-bot measures frequently. For production use, the LinkedIn Jobs Scraper on Apify handles proxy rotation, rate limiting, and data extraction automatically.
## Tips for Reliable LinkedIn Scraping
- Always use proxies — LinkedIn blocks datacenter IPs aggressively. ScrapeOps provides reliable proxy aggregation.
- Respect rate limits — Keep requests under 1 per second
- Target public data — Stick to public profiles and job listings
- Use JSON-LD first — Structured data is more reliable than HTML parsing
- Handle failures — LinkedIn returns various error codes; implement retry logic
- Cache aggressively — Company data doesn't change hourly
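The retry tip can be sketched as a small wrapper. `fetch_with_retry` below is a hypothetical helper, not part of any library; it retries any zero-argument fetch callable with exponential backoff plus jitter:

```python
import random
import time

def fetch_with_retry(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch() until it returns a non-None result or attempts run out.

    `fetch` is any zero-argument callable that returns a parsed result,
    or None on failure (blocked request, bad status code, parse error).
    """
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter to avoid a fixed request rhythm
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
    return None
```

For example, `fetch_with_retry(lambda: scrape_linkedin_company('openai'))` would retry a blocked company-page request a few times before giving up.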
## Conclusion

LinkedIn company data extraction in 2026 works best through a combination of public page scraping and job listing analysis. Use proper proxy infrastructure like ScrapeOps to handle anti-bot measures, and consider the LinkedIn Jobs Scraper on Apify for production workloads. Focus on public data, respect rate limits, and build incrementally.