DEV Community

agenthustler

LinkedIn Company Scraper: Extract Business Profiles and Employee Data

Introduction: The LinkedIn Data Goldmine

LinkedIn is the world's largest professional network, with over 1 billion members and 67 million company pages. For anyone in sales, recruiting, market research, or competitive intelligence, LinkedIn company data is incredibly valuable — employee counts, industry classifications, recent activity, funding information, and organizational structure.

But LinkedIn's official API is heavily restricted. The Marketing API requires partner approval, and even then, the data you can access is limited. If you need comprehensive company profile data at scale, scraping public LinkedIn pages is often the only practical option.

In this guide, I'll cover how to extract business profiles and employee data from LinkedIn's public company pages, with practical code examples and a production-ready approach using Apify.


What Data Can You Extract from LinkedIn Company Pages?

LinkedIn company pages are surprisingly data-rich. Here's what they can expose; not every field appears on every page, and some are visible only when you're logged in:

Company Overview

  • Company name and tagline
  • Industry classification
  • Company size (employee range: 1-10, 11-50, 51-200, etc.)
  • Headquarters location
  • Founded year
  • Company type (Public, Private, Nonprofit, etc.)
  • Website URL
  • Specialties (keywords the company lists)
  • About/Description text

Employee Insights

  • Total employee count on LinkedIn
  • Employee distribution by function (Engineering, Sales, Marketing, etc.)
  • Employee growth rate (visible on some pages)
  • Notable employees (people with high follower counts)
  • New hires vs. departures trend

Activity Data

  • Recent posts by the company page
  • Post engagement (likes, comments, shares)
  • Posting frequency
  • Content themes (what topics they post about)

Jobs

  • Active job listings count
  • Job locations and types
  • Growth signals (lots of hiring = growing)

Understanding LinkedIn's Page Structure

LinkedIn company pages follow a consistent URL pattern:

https://www.linkedin.com/company/{company-slug}/
https://www.linkedin.com/company/{company-slug}/about/
https://www.linkedin.com/company/{company-slug}/people/
https://www.linkedin.com/company/{company-slug}/posts/
https://www.linkedin.com/company/{company-slug}/jobs/

The main page shows an overview, while sub-pages provide detailed sections. For scraping, the /about/ page is the most data-dense — it contains structured fields in a consistent format.
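
To make the URL pattern concrete, here is a minimal sketch of a helper that derives the sub-page URLs for one company. The function name `companyPageUrls` is hypothetical, and it assumes the slug is always the path segment right after `company`, as in the pattern above.

```javascript
// Hypothetical helper: derive the sub-page URLs for one company.
// Accepts a bare slug ("acme-corp") or any linkedin.com/company/... URL.
function companyPageUrls(slugOrUrl) {
    let slug = slugOrUrl.trim();

    if (/^https?:\/\//i.test(slug)) {
        // Path looks like /company/{slug}/about/, so take the segment after "company"
        const segments = new URL(slug).pathname.split('/').filter(Boolean);
        const idx = segments.indexOf('company');
        if (idx === -1 || !segments[idx + 1]) {
            throw new Error(`Not a LinkedIn company URL: ${slugOrUrl}`);
        }
        // Decode first so an already-encoded slug is not encoded twice below
        slug = decodeURIComponent(segments[idx + 1]);
    }

    const base = `https://www.linkedin.com/company/${encodeURIComponent(slug)}`;
    return {
        overview: `${base}/`,
        about: `${base}/about/`,
        people: `${base}/people/`,
        posts: `${base}/posts/`,
        jobs: `${base}/jobs/`,
    };
}
```

Query strings and trailing sub-pages in the input are dropped, so a copied `/posts/?trk=...` link still resolves to the same set of URLs.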

The Public vs. Authenticated View

LinkedIn shows different data depending on whether you're logged in:

Data Point               | Public View   | Logged-in View
Company name/description | Yes           | Yes
Employee count           | Approximate   | Exact
Industry                 | Yes           | Yes
Employee list            | No            | Yes (limited)
Follower count           | Yes           | Yes
Recent posts             | Limited (3-5) | Full feed
Job listings             | Count only    | Full listings

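For this guide, we'll focus on public data extraction, which doesn't require authentication and avoids the extra risk of scraping through a logged-in account. LinkedIn's ToS still prohibits scraping; see the legal section below.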


Building a LinkedIn Company Scraper

LinkedIn's public pages are server-side rendered and include structured data in JSON-LD and microdata formats, so a plain HTML parser is enough. Here's how to extract it:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxConcurrency: 2, // LinkedIn is aggressive about rate limiting

    preNavigationHooks: [
        async ({ request }) => {
            // Essential: mimic a real browser
            request.headers = {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept-Encoding': 'gzip, deflate, br',
                'Cache-Control': 'no-cache',
            };

            // Random delay: 3-8 seconds between requests
            const delay = 3000 + Math.random() * 5000;
            await new Promise(r => setTimeout(r, delay));
        },
    ],

    async requestHandler({ request, $, body }) {
        const companyData = await extractCompanyData($, body, request.url);
        await Dataset.pushData(companyData);
    },
});

Extracting Structured Data

LinkedIn embeds JSON-LD structured data in company pages. This is the most reliable extraction method:

async function extractCompanyData($, body, url) {
    // Method 1: JSON-LD (most reliable)
    const jsonLd = extractJsonLd($);

    // Method 2: Meta tags (fallback)
    const metaData = extractMetaTags($);

    // Method 3: Page content parsing (for data not in structured formats)
    const pageData = extractFromPage($);

    return {
        ...mergeData(jsonLd, metaData, pageData),
        // Set after the spread so the JSON-LD `url` field cannot overwrite the scraped page URL
        url,
        scrapedAt: new Date().toISOString(),
    };
}

function extractJsonLd($) {
    const scripts = $('script[type="application/ld+json"]');
    const results = {};

    scripts.each((_, el) => {
        try {
            const data = JSON.parse($(el).html());

            if (data['@type'] === 'Organization') {
                results.name = data.name;
                results.description = data.description;
                results.url = data.url;
                results.logo = data.logo?.url || data.logo;
                results.foundingDate = data.foundingDate;
                results.numberOfEmployees = data.numberOfEmployees?.value;

                if (data.address) {
                    results.headquarters = {
                        street: data.address.streetAddress,
                        city: data.address.addressLocality,
                        state: data.address.addressRegion,
                        country: data.address.addressCountry,
                    };
                }

                if (data.member) {
                    results.memberCount = Array.isArray(data.member) 
                        ? data.member.length 
                        : null;
                }
            }
        } catch (e) {
            // Invalid JSON-LD, skip
        }
    });

    return results;
}

function extractMetaTags($) {
    return {
        title: $('meta[property="og:title"]').attr('content'),
        description: $('meta[property="og:description"]').attr('content'),
        image: $('meta[property="og:image"]').attr('content'),
        type: $('meta[property="og:type"]').attr('content'),
        twitterTitle: $('meta[name="twitter:title"]').attr('content'),
        twitterDescription: $('meta[name="twitter:description"]').attr('content'),
    };
}
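
The `mergeData` helper called in `extractCompanyData` isn't shown in this guide. One possible sketch follows, assuming earlier arguments take priority (JSON-LD first, then meta tags, then parsed page content) and that null, blank, or empty-array values should never win over real data. That priority rule is an assumption, not something LinkedIn's markup dictates.

```javascript
// Sketch of mergeData: combine partial records, earlier arguments win.
// A later source only fills fields that are still missing or empty.
function mergeData(...sources) {
    const isEmpty = (v) =>
        v == null ||
        (typeof v === 'string' && v.trim() === '') ||
        (Array.isArray(v) && v.length === 0);

    const merged = {};
    for (const source of sources) {
        for (const [key, value] of Object.entries(source || {})) {
            if (isEmpty(merged[key]) && !isEmpty(value)) {
                merged[key] = value;
            }
        }
    }
    return merged;
}
```

With this rule, a `description` that is blank in JSON-LD falls back to the Open Graph description, while a `name` found in JSON-LD is never replaced by a later source.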

Parsing the Company About Page

The /about/ page contains structured fields that aren't always in JSON-LD:

function extractFromPage($) {
    const data = {};

    // Company details are in definition list format
    $('dl.overflow-hidden').each((_, dl) => {
        const terms = $(dl).find('dt');
        const definitions = $(dl).find('dd');

        terms.each((i, dt) => {
            const label = $(dt).text().trim().toLowerCase();
            const value = $(definitions[i]).text().trim();

            switch (label) {
                case 'website':
                    data.website = value;
                    break;
                case 'industry':
                    data.industry = value;
                    break;
                case 'company size':
                    data.companySizeRange = value;
                    data.employeeCount = parseEmployeeCount(value);
                    break;
                case 'headquarters':
                    data.headquartersText = value;
                    break;
                case 'type':
                    data.companyType = value;
                    break;
                case 'founded':
                    data.founded = parseInt(value) || null;
                    break;
                case 'specialties':
                    data.specialties = value.split(',').map(s => s.trim());
                    break;
            }
        });
    });

    // Extract follower count
    const followerText = $('[data-test-id="about-us__followers"]').text();
    const followerMatch = followerText.match(/([\d,]+)/);
    data.followers = followerMatch 
        ? parseInt(followerMatch[1].replace(/,/g, '')) 
        : null;

    // Extract the about text
    data.about = $('section.about p').text().trim() || null;

    return data;
}

function parseEmployeeCount(sizeText) {
    // "10,001+ employees" → { min: 10001, max: null }
    // "51-200 employees" → { min: 51, max: 200 }
    // "1,001-5,000 employees" → { min: 1001, max: 5000 }

    const plusMatch = sizeText.match(/([\d,]+)\+/);
    if (plusMatch) {
        return { min: parseInt(plusMatch[1].replace(/,/g, '')), max: null };
    }

    const rangeMatch = sizeText.match(/([\d,]+)\s*-\s*([\d,]+)/);
    if (rangeMatch) {
        return {
            min: parseInt(rangeMatch[1].replace(/,/g, '')),
            max: parseInt(rangeMatch[2].replace(/,/g, '')),
        };
    }

    return null;
}

Extracting Employee Distribution Data

LinkedIn's company pages show employee distribution by department. This data is valuable for understanding organizational structure:

async function extractEmployeeInsights($) {
    const insights = {
        byFunction: [],
        growth: null,
        totalOnLinkedIn: null,
    };

    // Employee count on LinkedIn
    const countText = $('[data-test-id="about-us__employees-on-linkedin"]').text();
    const countMatch = countText.match(/([\d,]+)/);
    insights.totalOnLinkedIn = countMatch 
        ? parseInt(countMatch[1].replace(/,/g, '')) 
        : null;

    // Department breakdown
    $('[data-test-id="employee-distribution"] li').each((_, li) => {
        const functionName = $(li).find('.function-name').text().trim();
        const percentage = $(li).find('.function-percentage').text().trim();
        const count = $(li).find('.function-count').text().trim();

        if (functionName) {
            insights.byFunction.push({
                department: functionName,
                percentage: parseFloat(percentage) || null,
                estimatedCount: parseInt(count.replace(/[^\d]/g, '')) || null,
            });
        }
    });

    return insights;
}

This gives you data like:

{
    "byFunction": [
        { "department": "Engineering", "percentage": 35, "estimatedCount": 4200 },
        { "department": "Sales", "percentage": 18, "estimatedCount": 2160 },
        { "department": "Marketing", "percentage": 12, "estimatedCount": 1440 },
        { "department": "Operations", "percentage": 10, "estimatedCount": 1200 },
        { "department": "Human Resources", "percentage": 8, "estimatedCount": 960 }
    ],
    "totalOnLinkedIn": 12000
}
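
In the sample above, each `estimatedCount` equals the percentage applied to `totalOnLinkedIn` (35% of 12,000 is 4,200). If a page exposes percentages but no counts, you could fill the gap the same way. This is a sketch; `fillEstimatedCounts` is a hypothetical helper, and the result is only an estimate, not a figure LinkedIn reports.

```javascript
// Hypothetical helper: fill in estimatedCount as percentage × total
// for rows where a percentage was found but no count.
function fillEstimatedCounts(insights) {
    const total = insights.totalOnLinkedIn;
    return {
        ...insights,
        byFunction: insights.byFunction.map((row) => {
            // Keep scraped counts, and skip rows that lack the inputs for an estimate
            if (row.estimatedCount != null || row.percentage == null || total == null) {
                return row;
            }
            return {
                ...row,
                estimatedCount: Math.round((total * row.percentage) / 100),
            };
        }),
    };
}
```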

Extracting Recent Company Posts

Company posts reveal marketing strategy, product launches, and company culture:

async function extractCompanyPosts($) {
    const posts = [];

    $('div[data-test-id="update-card"]').each((_, card) => {
        const post = {
            text: $(card).find('.update-text').text().trim(),
            timestamp: $(card).find('time').attr('datetime') || 
                       $(card).find('.update-date').text().trim(),
            likes: parseEngagement($(card).find('.likes-count').text()),
            comments: parseEngagement($(card).find('.comments-count').text()),
            shares: parseEngagement($(card).find('.shares-count').text()),
            hasImage: $(card).find('img.update-image').length > 0,
            hasVideo: $(card).find('video').length > 0,
            hasArticle: $(card).find('.article-card').length > 0,
        };

        // Total engagement across reactions (a rate would also need the follower count)
        const totalEngagement = (post.likes || 0) + 
                               (post.comments || 0) + 
                               (post.shares || 0);
        post.totalEngagement = totalEngagement;

        posts.push(post);
    });

    return posts;
}

function parseEngagement(text) {
    if (!text) return 0;
    // Drop thousands separators so "1,234 likes" parses as 1234, not 1
    text = text.trim().toLowerCase().replace(/,/g, '');

    // Handle "1.2K", "3.5M" formats. The \b stops the "m" in "5 members"
    // from being read as a millions suffix.
    const multipliers = { k: 1000, m: 1000000 };
    const match = text.match(/(\d+(?:\.\d+)?)\s*(?:([km])\b)?/);

    if (match) {
        const num = parseFloat(match[1]);
        const mult = multipliers[match[2]] || 1;
        return Math.round(num * mult);
    }

    // No digits found at all
    return 0;
}
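
The post extractor stores a raw engagement total. To compare companies of different sizes you would normally divide by audience size. Below is a minimal sketch of that calculation; `engagementRate` is a hypothetical helper, and using the page's follower count as the denominator is an assumption (impressions are not available in the public view).

```javascript
// Hypothetical helper: engagement per follower, as a percentage.
function engagementRate(post, followers) {
    // Without a positive follower count the rate is undefined
    if (!followers || followers <= 0) return null;

    const total = (post.likes || 0) + (post.comments || 0) + (post.shares || 0);
    // Round to two decimal places
    return Math.round((total / followers) * 10000) / 100;
}
```

For example, a post with 120 likes, 20 comments, and 10 shares on a page with 5,000 followers has a total engagement of 150, which is 3% of the follower base.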

Using Apify for Production LinkedIn Scraping

LinkedIn is one of the most challenging sites to scrape at scale. Here's why Apify is the right tool:

Challenge 1: Aggressive Rate Limiting

LinkedIn can block an IP after as few as a few dozen requests. Apify's residential proxy pool rotates IPs automatically:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
    countryCode: 'US',
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxConcurrency: 2,
    maxRequestRetries: 5,

    sessionPoolOptions: {
        maxPoolSize: 20,
        sessionOptions: {
            maxUsageCount: 5, // Retire session after 5 uses
        },
    },
});

Challenge 2: Authentication Walls

Some LinkedIn data requires being logged in. For public company data, you can often get what you need from the unauthenticated view by targeting specific endpoints and structured data.
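
When LinkedIn decides to wall off a request, the response is often not the page you asked for, so it helps to detect that before parsing. The sketch below is based on commonly reported behavior rather than anything documented by LinkedIn: a non-standard 999 status, a 429, or a redirect to a login-style path. The function name and the path list are assumptions you should adjust to what you actually observe.

```javascript
// Assumed login-wall path prefixes; verify against real responses.
const WALL_PATH_PREFIXES = ['/authwall', '/login', '/uas/login', '/checkpoint'];

// Hypothetical helper: true if the response looks like a block or login wall.
function looksBlocked(statusCode, finalUrl) {
    // 999 is a non-standard status LinkedIn is widely reported to return
    // for denied requests; 429 is the standard "too many requests".
    if (statusCode === 999 || statusCode === 429) return true;

    let pathname;
    try {
        pathname = new URL(finalUrl).pathname;
    } catch {
        return false; // unparseable URL: make no claim either way
    }
    return WALL_PATH_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}
```

Inside a request handler you could throw an error when this returns true so the request is retried through a different session and proxy. Which property holds the post-redirect URL depends on your crawler version, so check that before wiring it in.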

Ready-Made Solution on Apify Store

If you don't want to build from scratch, there are ready-made LinkedIn company scrapers on the Apify Store. For example, you can find scrapers that accept a list of company URLs or search queries and return structured company profile data — no coding required.

These actors handle proxy rotation, anti-bot detection, and data normalization out of the box. You just provide the input (company names or URLs) and get clean JSON output.


Building a Complete Company Intelligence Pipeline

Here's how to combine everything into a useful data pipeline:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

// getInput() resolves to null when the run has no input, so fall back to {}
const input = (await Actor.getInput()) ?? {};
const {
    companyUrls = [],
    companyNames = [],
    includeEmployeeInsights = true,
    includePosts = true,
    maxPostsPerCompany = 10,
} = input;

// Build URLs from company names
const startUrls = [
    ...companyUrls.map(url => ({
        url: url.endsWith('/about/') ? url : `${url.replace(/\/$/, '')}/about/`,
        userData: { label: 'ABOUT' },
    })),
    ...companyNames.map(name => ({
        url: `https://www.linkedin.com/company/${name.toLowerCase().replace(/\s+/g, '-')}/about/`,
        userData: { label: 'ABOUT', originalName: name },
    })),
];

const proxyConfig = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration: proxyConfig,
    maxConcurrency: 2,

    async requestHandler({ request, $, body, enqueueLinks }) {
        const { label } = request.userData;

        if (label === 'ABOUT') {
            const companyData = await extractCompanyData($, body, request.url);

            // Enqueue sub-pages if requested
            const baseUrl = request.url.replace(/\/about\/?$/, '');

            if (includeEmployeeInsights) {
                await enqueueLinks({
                    urls: [`${baseUrl}/people/`],
                    userData: { 
                        label: 'PEOPLE', 
                        companyName: companyData.name 
                    },
                });
            }

            if (includePosts) {
                await enqueueLinks({
                    urls: [`${baseUrl}/posts/`],
                    userData: { 
                        label: 'POSTS', 
                        companyName: companyData.name 
                    },
                });
            }

            await Dataset.pushData({
                ...companyData,
                dataType: 'company_profile',
            });
        }

        if (label === 'PEOPLE') {
            const insights = await extractEmployeeInsights($);
            await Dataset.pushData({
                companyName: request.userData.companyName,
                ...insights,
                dataType: 'employee_insights',
            });
        }

        if (label === 'POSTS') {
            const posts = await extractCompanyPosts($);
            await Dataset.pushData({
                companyName: request.userData.companyName,
                posts: posts.slice(0, maxPostsPerCompany),
                dataType: 'company_posts',
            });
        }
    },
});

await crawler.run(startUrls);
await Actor.exit();
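
The pipeline writes three kinds of rows to the dataset (`company_profile`, `employee_insights`, `company_posts`), while the analysis code later in this guide reads everything from a single company object. A join step is needed in between. Here is one possible sketch; `joinCompanyRecords` is a hypothetical helper, and it assumes the company name is a reliable join key, which breaks down if two companies share a name or a profile row has no name.

```javascript
// Hypothetical helper: fold the three row types into one record per company.
function joinCompanyRecords(items) {
    const byName = new Map();
    const entryFor = (name) => {
        if (!byName.has(name)) byName.set(name, { name });
        return byName.get(name);
    };

    for (const item of items) {
        // Strip the routing fields; everything else is payload
        const { dataType, companyName, ...rest } = item;
        const key = dataType === 'company_profile' ? item.name : companyName;
        if (!key) continue; // cannot be joined without a name

        if (dataType === 'company_posts') {
            entryFor(key).posts = rest.posts || [];
        } else {
            Object.assign(entryFor(key), rest);
        }
    }
    return [...byName.values()];
}
```

Passing the company URL through `userData` and joining on that instead would be more robust than joining on the name.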

Practical Use Cases for LinkedIn Company Data

1. Sales Prospecting and Lead Generation

Filter companies by industry, size, and location to build targeted prospect lists:

function qualifyLead(company) {
    const score = {
        total: 0,
        factors: [],
    };

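    // Company size scoring. Open-ended ranges ("10,001+") have max: null,
    // and null <= 500 is true in JavaScript, so require a real upper bound.
    const { min, max } = company.employeeCount || {};
    if (min >= 50 && max != null && max <= 500) {
        score.total += 30;
        score.factors.push('Ideal company size (50-500)');
    }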

    // Industry fit
    const targetIndustries = ['Technology', 'Software', 'SaaS', 'Information Technology'];
    if (targetIndustries.some(ind => company.industry?.includes(ind))) {
        score.total += 25;
        score.factors.push('Target industry match');
    }

    // Growth signals
    if (company.recentJobCount > 10) {
        score.total += 20;
        score.factors.push('Actively hiring (growth signal)');
    }

    // Engagement level
    if (company.followers > 5000) {
        score.total += 15;
        score.factors.push('Strong LinkedIn presence');
    }

    // Has website (can research further)
    if (company.website) {
        score.total += 10;
        score.factors.push('Website available for research');
    }

    return { ...company, leadScore: score.total, scoringFactors: score.factors };
}
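
Once every company carries a `leadScore` (the maximum under the weights above is 100), turning the scores into a prospect list is a filter and a sort. A minimal sketch, where `rankLeads` and the default cutoff of 50 are assumptions to tune for your own funnel:

```javascript
// Hypothetical helper: keep companies at or above a cutoff, best first.
function rankLeads(scoredCompanies, minScore = 50) {
    return scoredCompanies
        // filter() returns a new array, so the sort below does not mutate the input
        .filter((company) => (company.leadScore || 0) >= minScore)
        .sort((a, b) => b.leadScore - a.leadScore);
}
```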

2. Competitive Intelligence Dashboard

Track competitors' growth, hiring patterns, and content strategy:

async function buildCompetitorReport(companies) {
    return companies.map(company => ({
        name: company.name,

        // Growth metrics
        employeeCount: company.totalOnLinkedIn,
        hiringIntensity: company.activeJobCount,
        // Largest departments by share of headcount (copy first: sort() mutates in place)
        topHiringDepartments: [...(company.byFunction || [])]
            .sort((a, b) => (b.percentage || 0) - (a.percentage || 0))
            .slice(0, 3)
            .map(f => f.department),

        // Content metrics
        postFrequency: calculatePostFrequency(company.posts),
        avgEngagement: calculateAvgEngagement(company.posts),
        topContentThemes: extractThemes(company.posts),

        // Company details
        industry: company.industry,
        size: company.companySizeRange,
        specialties: company.specialties,
    }));
}

function calculatePostFrequency(posts) {
    if (!posts || posts.length < 2) return null;

    const dates = posts
        .map(p => new Date(p.timestamp))
        .filter(d => !isNaN(d))
        .sort((a, b) => b - a);

    if (dates.length < 2) return null;

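    const daySpan = (dates[0] - dates[dates.length - 1]) / (1000 * 60 * 60 * 24);
    // All posts share one timestamp: a rate would divide by zero
    if (daySpan <= 0) return null;

    return {
        // Count only posts with a parseable date, matching the span they cover
        postsPerWeek: Math.round((dates.length / daySpan) * 7 * 10) / 10,
        period: `${dates.length} posts over ${Math.round(daySpan)} days`,
    };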
}
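
The report also calls `calculateAvgEngagement` and `extractThemes`, which aren't defined in this guide. The sketches below are one simple way to fill them in, not the only one. They assume each post has the `text` and `totalEngagement` fields produced by `extractCompanyPosts`, and the theme extraction is a naive word count with a small, hand-picked stop-word list (English only).

```javascript
// Sketch: mean total engagement per post, rounded to a whole number.
function calculateAvgEngagement(posts) {
    if (!posts || posts.length === 0) return null;
    const total = posts.reduce((sum, p) => sum + (p.totalEngagement || 0), 0);
    return Math.round(total / posts.length);
}

// Assumed stop-word list; extend it for real use.
const STOP_WORDS = new Set([
    'the', 'and', 'for', 'with', 'our', 'your', 'that', 'this', 'are',
    'from', 'have', 'has', 'was', 'will', 'you', 'about', 'into', 'more',
]);

// Sketch: most frequent words (3+ letters) across all post texts.
function extractThemes(posts, limit = 5) {
    const counts = new Map();
    for (const post of posts || []) {
        const words = (post.text || '').toLowerCase().match(/[a-z][a-z'-]{2,}/g) || [];
        for (const word of words) {
            if (STOP_WORDS.has(word)) continue;
            counts.set(word, (counts.get(word) || 0) + 1);
        }
    }
    return [...counts.entries()]
        .sort(([, a], [, b]) => b - a)
        .slice(0, limit)
        .map(([word]) => word);
}
```

A word count will surface terms like "hiring" or "launch" but cannot group synonyms or phrases; treat the output as a rough signal.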

3. Market Research and Industry Analysis

Aggregate company data across an industry to spot trends:

function analyzeIndustry(companies) {
    const stats = {
        totalCompanies: companies.length,

        // Size distribution. Open-ended ranges ("10,001+") have max: null, so
        // require a real upper bound instead of defaulting a missing max to 0.
        sizeDistribution: {
            startup: companies.filter(c => {
                const max = c.employeeCount?.max;
                return max != null && max <= 50;
            }).length,
            smb: companies.filter(c => {
                const { min, max } = c.employeeCount || {};
                return min >= 51 && max != null && max <= 500;
            }).length,
            enterprise: companies.filter(c => (c.employeeCount?.min || 0) > 500).length,
        },

        // Common specialties
        topSpecialties: getTopItems(
            companies.flatMap(c => c.specialties || []),
            20
        ),

        // Geographic distribution
        topLocations: getTopItems(
            companies.map(c => c.headquartersText).filter(Boolean),
            10
        ),

        // Hiring intensity
        averageJobCount: average(
            companies.map(c => c.activeJobCount).filter(n => n != null)
        ),
    };

    return stats;
}

function getTopItems(items, limit = 10) {
    const counts = {};
    items.forEach(item => {
        const normalized = item.toLowerCase().trim();
        counts[normalized] = (counts[normalized] || 0) + 1;
    });

    return Object.entries(counts)
        .sort(([, a], [, b]) => b - a)
        .slice(0, limit)
        .map(([item, count]) => ({ item, count }));
}
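
`analyzeIndustry` relies on an `average` helper that isn't shown. A minimal sketch follows, assuming non-numeric entries should be ignored and an empty list should yield `null` rather than `NaN`:

```javascript
// Sketch: arithmetic mean rounded to one decimal place, or null if no numbers.
function average(numbers) {
    const valid = numbers.filter((n) => Number.isFinite(n));
    if (valid.length === 0) return null;

    const sum = valid.reduce((acc, n) => acc + n, 0);
    return Math.round((sum / valid.length) * 10) / 10;
}
```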

Handling Edge Cases and Data Quality

LinkedIn company data isn't always clean. Here's how to handle common issues:

function cleanCompanyData(raw) {
    const cleaned = { ...raw };

    // Normalize company names (remove "| LinkedIn" suffix)
    if (cleaned.name) {
        cleaned.name = cleaned.name.replace(/\s*\|\s*LinkedIn\s*$/i, '').trim();
    }

    // Validate URLs
    if (cleaned.website) {
        try {
            new URL(cleaned.website.startsWith('http') 
                ? cleaned.website 
                : `https://${cleaned.website}`);
        } catch {
            cleaned.website = null;
            cleaned.websiteInvalid = raw.website;
        }
    }

    // Normalize industry names
    const industryAliases = {
        'information technology & services': 'Information Technology',
        'computer software': 'Software Development',
        'internet': 'Technology',
    };
    if (cleaned.industry) {
        cleaned.industry = industryAliases[cleaned.industry.toLowerCase()] 
            || cleaned.industry;
    }

    // Flag data completeness
    const requiredFields = ['name', 'industry', 'employeeCount', 'headquarters'];
    const presentFields = requiredFields.filter(f => cleaned[f] != null);
    cleaned.dataCompleteness = Math.round(
        (presentFields.length / requiredFields.length) * 100
    );

    return cleaned;
}
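
Duplicates are another common quality issue: the same company can enter the dataset through a URL, a guessed slug, or a differently cased link. One way to handle it, sketched below, is to key each record by the slug in its LinkedIn URL and keep the most complete copy. The helper names are hypothetical, and the fallback to a lowercased name when no URL is usable is an assumption.

```javascript
// Hypothetical helper: reduce a company URL to its lowercase slug, or null.
function normalizeCompanyUrl(url) {
    const { pathname } = new URL(url); // throws on missing or invalid URLs
    const segments = pathname.toLowerCase().split('/').filter(Boolean);
    const idx = segments.indexOf('company');
    return idx !== -1 && segments[idx + 1] ? segments[idx + 1] : null;
}

// Hypothetical helper: keep one record per company, preferring higher dataCompleteness.
function dedupeCompanies(companies) {
    const best = new Map();
    for (const company of companies) {
        let key;
        try {
            key = normalizeCompanyUrl(company.url);
        } catch {
            key = null;
        }
        // Fall back to the name when the URL gives no usable slug
        key = key || `name:${(company.name || '').toLowerCase().trim()}`;

        const existing = best.get(key);
        if (!existing || (company.dataCompleteness || 0) > (existing.dataCompleteness || 0)) {
            best.set(key, company);
        }
    }
    return [...best.values()];
}
```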

Legal and Ethical Considerations

LinkedIn scraping exists in a complex legal landscape:

  • hiQ v. LinkedIn (2022): The Ninth Circuit held that scraping publicly available LinkedIn profiles likely doesn't violate the CFAA. The ruling is narrow and jurisdiction-specific, and in later proceedings the district court found that hiQ had breached LinkedIn's User Agreement.
  • LinkedIn's ToS: Explicitly prohibits scraping. While ToS violations aren't criminal, they can result in account restrictions or civil action.
  • GDPR/CCPA: Employee data may be considered personal data under privacy regulations. Ensure your use case has a legitimate basis.
  • Best practices:
    • Only scrape publicly available data (no login required)
    • Respect robots.txt directives
    • Implement rate limiting to avoid impacting LinkedIn's servers
    • Don't store personal employee data without a legitimate purpose
    • Provide opt-out mechanisms if you're building a public tool

Output and Integration

Once you've collected company data, here's how to make it useful:

// Export enriched company profiles
const enrichedCompanies = companies.map(company => ({
    // Core identity
    name: company.name,
    linkedinUrl: company.url,
    website: company.website,

    // Classification
    industry: company.industry,
    companyType: company.companyType,
    specialties: company.specialties,

    // Size and growth
    employeeRange: company.companySizeRange,
    linkedinEmployees: company.totalOnLinkedIn,
    activeJobs: company.activeJobCount,
    isHiring: (company.activeJobCount || 0) > 0,

    // Engagement
    followers: company.followers,
    recentPostCount: company.posts?.length || 0,
    avgPostEngagement: company.posts?.length
        ? Math.round(company.posts.reduce((sum, p) => sum + p.totalEngagement, 0) / company.posts.length)
        : null,

    // Metadata
    dataCompleteness: company.dataCompleteness,
    scrapedAt: company.scrapedAt,
}));

// Access via Apify API
const datasetUrl = `https://api.apify.com/v2/datasets/${datasetId}/items`;
// Formats: ?format=json, ?format=csv, ?format=xlsx
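
If you pull results from several runs, building the dataset URL by hand gets error-prone. Below is a small sketch of a URL builder; `datasetItemsUrl` is a hypothetical helper, and it covers only the `format` and `limit` query parameters. Authentication for private datasets is left out and is something to check in the Apify API reference.

```javascript
// Hypothetical helper: build a dataset items URL with an export format.
function datasetItemsUrl(datasetId, { format = 'json', limit } = {}) {
    const url = new URL(
        `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`,
    );
    url.searchParams.set('format', format);
    if (limit != null) url.searchParams.set('limit', String(limit));
    return url.toString();
}
```

For example, `datasetItemsUrl('abc123', { format: 'csv', limit: 100 })` produces a link you can hand to a spreadsheet import or a scheduled download job.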

Conclusion

LinkedIn company scraping is one of the most valuable data extraction tasks in the B2B space. The combination of company profiles, employee insights, and activity data gives you a comprehensive view of any business — useful for sales prospecting, competitive intelligence, market research, and investment analysis.

The key challenges are LinkedIn's aggressive anti-scraping measures and the legal nuances around professional data. Apify's infrastructure, with residential proxies and session management, addresses much of the technical side, while focusing on publicly available company-level data (not individual profiles) lowers the legal and ethical risk.

Check out the Apify Store for ready-to-use LinkedIn company scrapers, or build your own using the patterns in this guide. For job-specific LinkedIn data, scrapers like the LinkedIn Jobs Scraper can extract public job listings with structured salary and location data.

The professional data landscape is evolving rapidly — LinkedIn continues to tighten access while the demand for business intelligence data only grows. Building a robust, ethical scraping pipeline now positions you well for the future.


Questions about LinkedIn company scraping? Share your use case in the comments — I'd love to hear what you're building with this data.
