agenthustler

Posted on Apr 9 • Edited on Apr 19

LinkedIn Company Scraper: Extract Business Profiles and Employee Data

#webdev #javascript #programming #webscraping

Introduction: The LinkedIn Data Goldmine

LinkedIn is the world's largest professional network, with over 1 billion members and 67 million company pages. For anyone in sales, recruiting, market research, or competitive intelligence, LinkedIn company data is incredibly valuable — employee counts, industry classifications, recent activity, funding information, and organizational structure.

But LinkedIn's official API is heavily restricted. The Marketing API requires partner approval, and even then, the data you can access is limited. If you need comprehensive company profile data at scale, scraping public LinkedIn pages is often the only practical option.

In this guide, I'll cover how to extract business profiles and employee data from LinkedIn's public company pages, with practical code examples and a production-ready approach using Apify.

What Data Can You Extract from LinkedIn Company Pages?

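LinkedIn company pages are surprisingly data-rich. Here's what a company page can expose; some of it, such as employee lists and exact employee counts, only appears when you're logged in (see the public vs. authenticated comparison below):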

Company Overview

Company name and tagline
Industry classification
Company size (employee range: 1-10, 11-50, 51-200, etc.)
Headquarters location
Founded year
Company type (Public, Private, Nonprofit, etc.)
Website URL
Specialties (keywords the company lists)
About/Description text

Employee Insights

Total employee count on LinkedIn
Employee distribution by function (Engineering, Sales, Marketing, etc.)
Employee growth rate (visible on some pages)
Notable employees (people with high follower counts)
New hires vs. departures trend

Activity Data

Recent posts by the company page
Post engagement (likes, comments, shares)
Posting frequency
Content themes (what topics they post about)

Jobs

Active job listings count
Job locations and types
Growth signals (lots of hiring = growing)

Understanding LinkedIn's Page Structure

LinkedIn company pages follow a consistent URL pattern:

https://www.linkedin.com/company/{company-slug}/
https://www.linkedin.com/company/{company-slug}/about/
https://www.linkedin.com/company/{company-slug}/people/
https://www.linkedin.com/company/{company-slug}/posts/
https://www.linkedin.com/company/{company-slug}/jobs/

The main page shows an overview, while sub-pages provide detailed sections. For scraping, the /about/ page is the most data-dense — it contains structured fields in a consistent format.
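As a small illustration of the URL pattern above, the sketch below builds the sub-page URLs from either a bare slug or a full company URL. The names `COMPANY_SECTIONS` and `buildCompanyUrls` are hypothetical helpers introduced here, not part of any LinkedIn or Apify API.

```javascript
// Hypothetical helper: the section names mirror the URL pattern listed above.
const COMPANY_SECTIONS = ['about', 'people', 'posts', 'jobs'];

function buildCompanyUrls(slugOrUrl) {
    // Accept either a bare slug ("acme-corp") or a full company URL
    const match = slugOrUrl.match(/linkedin\.com\/company\/([^/?#]+)/i);
    const slug = match ? match[1] : encodeURIComponent(slugOrUrl.trim());
    const base = `https://www.linkedin.com/company/${slug}/`;

    const urls = { overview: base };
    for (const section of COMPANY_SECTIONS) {
        urls[section] = `${base}${section}/`;
    }
    return urls;
}
```

Normalizing to a slug first also makes it easier to deduplicate input lists where the same company appears under several sub-page URLs.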

The Public vs. Authenticated View

LinkedIn shows different data depending on whether you're logged in:

Data Point	Public View	Logged-in View
Company name/description	Yes	Yes
Employee count	Approximate	Exact
Industry	Yes	Yes
Employee list	No	Yes (limited)
Follower count	Yes	Yes
Recent posts	Limited (3-5)	Full feed
Job listings	Count only	Full listings

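For this guide, we'll focus on public data extraction, which doesn't require authentication and avoids putting a LinkedIn account at risk. Note that LinkedIn's ToS still prohibits scraping, as covered in the legal section below.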

Building a LinkedIn Company Scraper

LinkedIn's public pages are server-side rendered and include structured data in JSON-LD and microdata formats, so most fields can be read straight from the returned HTML. The fetching implementation itself is proprietary and isn't reproduced here; a ready-made Apify actor is covered in the Apify section below. The sections that follow walk through the extraction logic.
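Since the fetching layer isn't published, here is a minimal, dependency-free sketch of the general idea. It is not the author's implementation. It assumes Node 18+ (for the global `fetch`), and the names `extractJsonLdBlocks` and `fetchCompanyHtml` and the request header are illustrative choices. The later snippets in this guide take a `$` argument, which is presumably a Cheerio-style handle over the same HTML.

```javascript
// Hypothetical helper: pull every JSON-LD payload out of raw HTML
// with a regular expression, without a DOM parser.
function extractJsonLdBlocks(html) {
    const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
    const blocks = [];
    for (const match of html.matchAll(pattern)) {
        try {
            blocks.push(JSON.parse(match[1]));
        } catch {
            // Malformed JSON-LD: skip this block and keep going
        }
    }
    return blocks;
}

// Hypothetical fetch wrapper. The Accept-Language header is an assumption,
// used here only to request English field labels.
async function fetchCompanyHtml(url) {
    const response = await fetch(url, {
        headers: { 'Accept-Language': 'en-US,en;q=0.9' },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.text();
}
```

A regex is enough for locating `<script>` payloads, but parsing the visible page content is better done with a real HTML parser.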

Extracting Structured Data

LinkedIn embeds JSON-LD structured data in company pages. This is the most reliable extraction method:

async function extractCompanyData($, body, url) {
    // Method 1: JSON-LD (most reliable)
    const jsonLd = extractJsonLd($);

    // Method 2: Meta tags (fallback)
    const metaData = extractMetaTags($);

    // Method 3: Page content parsing (for data not in structured formats)
    const pageData = extractFromPage($);

    return {
        url,
        ...mergeData(jsonLd, metaData, pageData),
        scrapedAt: new Date().toISOString(),
    };
}

function extractJsonLd($) {
    const scripts = $('script[type="application/ld+json"]');
    const results = {};

    scripts.each((_, el) => {
        try {
            const data = JSON.parse($(el).html());

            if (data['@type'] === 'Organization') {
                results.name = data.name;
                results.description = data.description;
                results.url = data.url;
                results.logo = data.logo?.url || data.logo;
                results.foundingDate = data.foundingDate;
                results.numberOfEmployees = data.numberOfEmployees?.value;

                if (data.address) {
                    results.headquarters = {
                        street: data.address.streetAddress,
                        city: data.address.addressLocality,
                        state: data.address.addressRegion,
                        country: data.address.addressCountry,
                    };
                }

                if (data.member) {
                    results.memberCount = Array.isArray(data.member) 
                        ? data.member.length 
                        : null;
                }
            }
        } catch (e) {
            // Invalid JSON-LD, skip
        }
    });

    return results;
}

function extractMetaTags($) {
    return {
        title: $('meta[property="og:title"]').attr('content'),
        description: $('meta[property="og:description"]').attr('content'),
        image: $('meta[property="og:image"]').attr('content'),
        type: $('meta[property="og:type"]').attr('content'),
        twitterTitle: $('meta[name="twitter:title"]').attr('content'),
        twitterDescription: $('meta[name="twitter:description"]').attr('content'),
    };
}
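`extractCompanyData` calls a `mergeData` helper that isn't shown. One plausible version, assuming earlier sources should win and empty values should never overwrite real ones, might look like this (the precedence rule is an assumption, not something the original states):

```javascript
// Hypothetical mergeData: the first source that provides a non-empty value
// for a key wins, so call it as mergeData(jsonLd, metaData, pageData) to
// prefer JSON-LD, then meta tags, then parsed page content.
function mergeData(...sources) {
    const merged = {};
    for (const source of sources) {
        for (const [key, value] of Object.entries(source || {})) {
            const isEmpty = value == null || value === '';
            if (!isEmpty && merged[key] == null) {
                merged[key] = value;
            }
        }
    }
    return merged;
}
```

Skipping empty values matters because the meta-tag and page extractors return `undefined` or `null` for anything they can't find.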

Parsing the Company About Page

The /about/ page contains structured fields that aren't always in JSON-LD:

function extractFromPage($) {
    const data = {};

    // Company details are in definition list format
    $('dl.overflow-hidden').each((_, dl) => {
        const terms = $(dl).find('dt');
        const definitions = $(dl).find('dd');

        terms.each((i, dt) => {
            const label = $(dt).text().trim().toLowerCase();
            const value = $(definitions[i]).text().trim();

            switch (label) {
                case 'website':
                    data.website = value;
                    break;
                case 'industry':
                    data.industry = value;
                    break;
                case 'company size':
                    data.companySizeRange = value;
                    data.employeeCount = parseEmployeeCount(value);
                    break;
                case 'headquarters':
                    data.headquartersText = value;
                    break;
                case 'type':
                    data.companyType = value;
                    break;
                case 'founded':
                    data.founded = parseInt(value, 10) || null;
                    break;
                case 'specialties':
                    data.specialties = value.split(',').map(s => s.trim());
                    break;
            }
        });
    });

    // Extract follower count
    const followerText = $('[data-test-id="about-us__followers"]').text();
    const followerMatch = followerText.match(/([\d,]+)/);
    data.followers = followerMatch 
        ? parseInt(followerMatch[1].replace(/,/g, '')) 
        : null;

    // Extract the about text
    data.about = $('section.about p').text().trim() || null;

    return data;
}

function parseEmployeeCount(sizeText) {
    // "10,001+ employees" → { min: 10001, max: null }
    // "51-200 employees" → { min: 51, max: 200 }
    // "1,001-5,000 employees" → { min: 1001, max: 5000 }

    const plusMatch = sizeText.match(/([\d,]+)\+/);
    if (plusMatch) {
        return { min: parseInt(plusMatch[1].replace(/,/g, '')), max: null };
    }

    const rangeMatch = sizeText.match(/([\d,]+)\s*-\s*([\d,]+)/);
    if (rangeMatch) {
        return {
            min: parseInt(rangeMatch[1].replace(/,/g, '')),
            max: parseInt(rangeMatch[2].replace(/,/g, '')),
        };
    }

    return null;
}

Extracting Employee Distribution Data

LinkedIn's company pages show employee distribution by department. This data is valuable for understanding organizational structure:

async function extractEmployeeInsights($) {
    const insights = {
        byFunction: [],
        growth: null,
        totalOnLinkedIn: null,
    };

    // Employee count on LinkedIn
    const countText = $('[data-test-id="about-us__employees-on-linkedin"]').text();
    const countMatch = countText.match(/([\d,]+)/);
    insights.totalOnLinkedIn = countMatch 
        ? parseInt(countMatch[1].replace(/,/g, '')) 
        : null;

    // Department breakdown
    $('[data-test-id="employee-distribution"] li').each((_, li) => {
        const functionName = $(li).find('.function-name').text().trim();
        const percentage = $(li).find('.function-percentage').text().trim();
        const count = $(li).find('.function-count').text().trim();

        if (functionName) {
            insights.byFunction.push({
                department: functionName,
                percentage: parseFloat(percentage) || null,
                estimatedCount: parseInt(count.replace(/[^\d]/g, '')) || null,
            });
        }
    });

    return insights;
}

This gives you data like:

{
    "byFunction": [
        { "department": "Engineering", "percentage": 35, "estimatedCount": 4200 },
        { "department": "Sales", "percentage": 18, "estimatedCount": 2160 },
        { "department": "Marketing", "percentage": 12, "estimatedCount": 1440 },
        { "department": "Operations", "percentage": 10, "estimatedCount": 1200 },
        { "department": "Human Resources", "percentage": 8, "estimatedCount": 960 }
    ],
    "totalOnLinkedIn": 12000
}

Extracting Recent Company Posts

Company posts reveal marketing strategy, product launches, and company culture:

async function extractCompanyPosts($) {
    const posts = [];

    $('div[data-test-id="update-card"]').each((_, card) => {
        const post = {
            text: $(card).find('.update-text').text().trim(),
            timestamp: $(card).find('time').attr('datetime') || 
                       $(card).find('.update-date').text().trim(),
            likes: parseEngagement($(card).find('.likes-count').text()),
            comments: parseEngagement($(card).find('.comments-count').text()),
            shares: parseEngagement($(card).find('.shares-count').text()),
            hasImage: $(card).find('img.update-image').length > 0,
            hasVideo: $(card).find('video').length > 0,
            hasArticle: $(card).find('.article-card').length > 0,
        };

        // Sum raw engagement (a total, not a rate: no follower or impression count is used)
        const totalEngagement = (post.likes || 0) + 
                               (post.comments || 0) + 
                               (post.shares || 0);
        post.totalEngagement = totalEngagement;

        posts.push(post);
    });

    return posts;
}

function parseEngagement(text) {
    if (!text) return 0;
    // Drop thousands separators so "1,234 likes" parses as 1234, not 1
    const normalized = text.trim().toLowerCase().replace(/,/g, '');

    // Handle plain counts plus "1.2K" and "3.5M" formats. The suffix must
    // directly follow the number, so "5 more" is not read as 5 million.
    const multipliers = { k: 1000, m: 1000000 };
    const match = normalized.match(/(\d+(?:\.\d+)?)([km])?\b/);

    if (!match) return 0;

    const num = parseFloat(match[1]);
    const mult = multipliers[match[2]] || 1;
    return Math.round(num * mult);
}

Using Apify for Production LinkedIn Scraping

LinkedIn is one of the most challenging sites to scrape at scale. Here's why Apify is the right tool:

Challenge 1: Aggressive Rate Limiting

LinkedIn can rate-limit or block an IP address after a relatively small number of requests. Apify's residential proxy pool rotates IPs automatically. The proxy configuration used in the actor is proprietary and isn't shown here.
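Proxy rotation aside, a scraper also needs to back off when it is being throttled. The sketch below is a generic exponential-backoff wrapper, not the actor's actual logic. `fetchWithBackoff`, its options, and the choice of status codes treated as "blocked" (429, plus 999, which LinkedIn is commonly reported to return for denied requests) are assumptions for illustration.

```javascript
const sleep = (ms) => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical wrapper: requestFn is any async function that resolves to an
// object with a numeric `status`, for example () => fetch(url).
async function fetchWithBackoff(
    requestFn,
    { maxRetries = 4, baseDelayMs = 1000, maxDelayMs = 30000 } = {}
) {
    for (let attempt = 0; ; attempt++) {
        const response = await requestFn();

        // Assumption: treat 429 and 999 as "slow down" signals
        const blocked = response.status === 429 || response.status === 999;
        if (!blocked) return response;

        if (attempt >= maxRetries) {
            throw new Error(
                `Still blocked after ${maxRetries} retries (status ${response.status})`
            );
        }

        // Double the wait on each attempt, cap it, and add up to 25% jitter
        const delay = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
        await sleep(delay + Math.random() * delay * 0.25);
    }
}
```

Passing the request as a function keeps the wrapper independent of the HTTP client, so the same logic works with `fetch`, a proxy-aware client, or a test double.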

Challenge 2: Authentication Walls

Some LinkedIn data requires being logged in. For public company data, you can often get what you need from the unauthenticated view by targeting specific endpoints and structured data.

Ready-Made Solution on Apify Store

If you don't want to build from scratch, there are ready-made LinkedIn company scrapers on the Apify Store. For example, you can find scrapers that accept a list of company URLs or search queries and return structured company profile data — no coding required.

These actors handle proxy rotation, anti-bot detection, and data normalization out of the box. You just provide the input (company names or URLs) and get clean JSON output.

Building a Complete Company Intelligence Pipeline

The pieces above combine into a pipeline: fetch each company page, extract and merge the structured data, clean it, then score or aggregate the results. The production implementation is proprietary and isn't shown here; the ready-made Apify actor packages these steps.
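To make the flow concrete, here is a minimal orchestration sketch, not the author's pipeline. `runPipeline` is a hypothetical name, and the `clean` and `score` steps are injected so that functions like `cleanCompanyData` and `qualifyLead` from this guide could be passed in. It assumes each record carries a `url` field and that `score` adds a numeric `leadScore`.

```javascript
// Hypothetical orchestrator: deduplicate by URL, clean, score, filter, sort.
function runPipeline(rawCompanies, { clean, score, minScore = 0 }) {
    const seen = new Set();
    const results = [];

    for (const raw of rawCompanies) {
        // Normalize the URL so trailing slashes and case don't create duplicates
        const key = (raw.url || '').replace(/\/+$/, '').toLowerCase();
        if (!key || seen.has(key)) continue;
        seen.add(key);

        const scored = score(clean(raw));
        if (scored.leadScore >= minScore) {
            results.push(scored);
        }
    }

    // Highest-scoring companies first
    return results.sort((a, b) => b.leadScore - a.leadScore);
}
```

Usage would look something like `runPipeline(scraped, { clean: cleanCompanyData, score: qualifyLead, minScore: 40 })`, where the threshold is whatever cut-off suits your prospect list.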

Practical Use Cases for LinkedIn Company Data

1. Sales Prospecting and Lead Generation

Filter companies by industry, size, and location to build targeted prospect lists:

function qualifyLead(company) {
    const score = {
        total: 0,
        factors: [],
    };

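    // Company size scoring (max is null for open-ended ranges such as "10,001+",
    // and null <= 500 is true in JavaScript, so check for it explicitly)
    const { min, max } = company.employeeCount || {};
    if (min >= 50 && max != null && max <= 500) {
        score.total += 30;
        score.factors.push('Ideal company size (50-500)');
    }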

    // Industry fit
    const targetIndustries = ['Technology', 'Software', 'SaaS', 'Information Technology'];
    if (targetIndustries.some(ind => company.industry?.includes(ind))) {
        score.total += 25;
        score.factors.push('Target industry match');
    }

    // Growth signals (activeJobCount matches the field name used in the
    // reports and exports elsewhere in this guide)
    if (company.activeJobCount > 10) {
        score.total += 20;
        score.factors.push('Actively hiring (growth signal)');
    }

    // Engagement level
    if (company.followers > 5000) {
        score.total += 15;
        score.factors.push('Strong LinkedIn presence');
    }

    // Has website (can research further)
    if (company.website) {
        score.total += 10;
        score.factors.push('Website available for research');
    }

    return { ...company, leadScore: score.total, scoringFactors: score.factors };
}

2. Competitive Intelligence Dashboard

Track competitors' growth, hiring patterns, and content strategy:

async function buildCompetitorReport(companies) {
    return companies.map(company => ({
        name: company.name,

        // Growth metrics
        employeeCount: company.totalOnLinkedIn,
        hiringIntensity: company.activeJobCount,
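        // Largest departments by share of headcount
        topHiringDepartments: company.byFunction
            ?.slice() // copy first: sort() would otherwise reorder the source array
            .sort((a, b) => (b.percentage || 0) - (a.percentage || 0))
            .slice(0, 3)
            .map(f => f.department),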

        // Content metrics
        postFrequency: calculatePostFrequency(company.posts),
        avgEngagement: calculateAvgEngagement(company.posts),
        topContentThemes: extractThemes(company.posts),

        // Company details
        industry: company.industry,
        size: company.companySizeRange,
        specialties: company.specialties,
    }));
}

function calculatePostFrequency(posts) {
    if (!posts || posts.length < 2) return null;

    const dates = posts
        .map(p => new Date(p.timestamp))
        .filter(d => !isNaN(d))
        .sort((a, b) => b - a);

    if (dates.length < 2) return null;

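    const daySpan = (dates[0] - dates[dates.length - 1]) / (1000 * 60 * 60 * 24);
    // All posts share one timestamp: avoid dividing by zero
    if (daySpan === 0) return null;

    return {
        // Count only posts with a parseable timestamp, matching the span measured above
        postsPerWeek: Math.round((dates.length / daySpan) * 7 * 10) / 10,
        period: `${dates.length} posts over ${Math.round(daySpan)} days`,
    };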
}
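`buildCompetitorReport` also relies on `calculateAvgEngagement` and `extractThemes`, which aren't defined in this guide. The versions below are illustrative stand-ins under simple assumptions: engagement is averaged from each post's `totalEngagement`, and "themes" are just the most frequent non-trivial words. The stop-word list is a small hypothetical sample, not a complete one.

```javascript
// Hypothetical helper: mean totalEngagement per post, rounded
function calculateAvgEngagement(posts) {
    if (!posts || posts.length === 0) return null;
    const total = posts.reduce((sum, p) => sum + (p.totalEngagement || 0), 0);
    return Math.round(total / posts.length);
}

// Assumption: a tiny stop-word list, just enough for the example
const STOP_WORDS = new Set([
    'the', 'and', 'for', 'our', 'with', 'that', 'this', 'are', 'you',
    'your', 'from', 'have', 'has', 'was', 'will', 'about', 'more',
]);

// Hypothetical helper: naive keyword frequency as a proxy for content themes
function extractThemes(posts, limit = 5) {
    const counts = new Map();
    for (const post of posts || []) {
        const words = (post.text || '').toLowerCase().match(/[a-z][a-z'-]{2,}/g) || [];
        // Count each word once per post so one long post cannot dominate
        for (const word of new Set(words)) {
            if (STOP_WORDS.has(word)) continue;
            counts.set(word, (counts.get(word) || 0) + 1);
        }
    }
    return [...counts.entries()]
        .sort(([, a], [, b]) => b - a)
        .slice(0, limit)
        .map(([theme, postCount]) => ({ theme, postCount }));
}
```

Word counting is a crude signal. For anything beyond a quick overview, a proper keyword-extraction or topic-modeling step would be a better fit.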

3. Market Research and Industry Analysis

Aggregate company data across an industry to spot trends:

function analyzeIndustry(companies) {
    const stats = {
        totalCompanies: companies.length,

        // Size distribution. Companies with an unknown size are not counted, and
        // open-ended ranges such as "10,001+" (max === null) count only as enterprise.
        sizeDistribution: {
            startup: companies.filter(c =>
                c.employeeCount?.max != null && c.employeeCount.max <= 50
            ).length,
            smb: companies.filter(c => {
                const { min, max } = c.employeeCount || {};
                return min >= 51 && max != null && max <= 500;
            }).length,
            enterprise: companies.filter(c => (c.employeeCount?.min || 0) > 500).length,
        },

        // Common specialties
        topSpecialties: getTopItems(
            companies.flatMap(c => c.specialties || []),
            20
        ),

        // Geographic distribution
        topLocations: getTopItems(
            companies.map(c => c.headquartersText).filter(Boolean),
            10
        ),

        // Hiring intensity
        averageJobCount: average(
            companies.map(c => c.activeJobCount).filter(n => n != null)
        ),
    };

    return stats;
}

function getTopItems(items, limit = 10) {
    const counts = {};
    items.forEach(item => {
        const normalized = item.toLowerCase().trim();
        counts[normalized] = (counts[normalized] || 0) + 1;
    });

    return Object.entries(counts)
        .sort(([, a], [, b]) => b - a)
        .slice(0, limit)
        .map(([item, count]) => ({ item, count }));
}
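`analyzeIndustry` calls an `average` helper that isn't defined above. A minimal version, assuming non-numeric values should be ignored and the result rounded to one decimal place, could be:

```javascript
// Hypothetical helper: mean of the finite numbers in a list, or null if none
function average(values) {
    const numbers = values.filter(v => typeof v === 'number' && Number.isFinite(v));
    if (numbers.length === 0) return null;

    const sum = numbers.reduce((total, n) => total + n, 0);
    return Math.round((sum / numbers.length) * 10) / 10;
}
```

Returning `null` for an empty list, rather than `NaN`, keeps the stats object safe to serialize as JSON.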

Handling Edge Cases and Data Quality

LinkedIn company data isn't always clean. Here's how to handle common issues:

function cleanCompanyData(raw) {
    const cleaned = { ...raw };

    // Normalize company names (remove "| LinkedIn" suffix)
    if (cleaned.name) {
        cleaned.name = cleaned.name.replace(/\s*\|\s*LinkedIn\s*$/i, '').trim();
    }

    // Validate URLs
    if (cleaned.website) {
        try {
            new URL(cleaned.website.startsWith('http') 
                ? cleaned.website 
                : `https://${cleaned.website}`);
        } catch {
            cleaned.website = null;
            cleaned.websiteInvalid = raw.website;
        }
    }

    // Normalize industry names
    const industryAliases = {
        'information technology & services': 'Information Technology',
        'computer software': 'Software Development',
        'internet': 'Technology',
    };
    if (cleaned.industry) {
        cleaned.industry = industryAliases[cleaned.industry.toLowerCase()] 
            || cleaned.industry;
    }

    // Flag data completeness
    const requiredFields = ['name', 'industry', 'employeeCount', 'headquarters'];
    const presentFields = requiredFields.filter(f => cleaned[f] != null);
    cleaned.dataCompleteness = Math.round(
        (presentFields.length / requiredFields.length) * 100
    );

    return cleaned;
}

Legal and Ethical Considerations

LinkedIn scraping exists in a complex legal landscape:

hiQ v. LinkedIn (2022): The Ninth Circuit held that scraping publicly available LinkedIn profiles likely doesn't violate the Computer Fraud and Abuse Act (CFAA). However, the ruling came at the preliminary-injunction stage, so it is narrow and jurisdiction-specific.
LinkedIn's ToS: Explicitly prohibits scraping. While ToS violations aren't criminal, they can result in account restrictions or civil action.
GDPR/CCPA: Employee data may be considered personal data under privacy regulations. Ensure your use case has a legitimate basis.
Best practices:
- Only scrape publicly available data (no login required)
- Respect robots.txt directives
- Implement rate limiting to avoid impacting LinkedIn's servers
- Don't store personal employee data without a legitimate purpose
- Provide opt-out mechanisms if you're building a public tool
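To illustrate the robots.txt point above, here is a deliberately simplified checker. It is a sketch, not a complete implementation of the Robots Exclusion Protocol: it uses plain prefix matching, ignores wildcard patterns (`*`, `$`), and `isPathAllowed` is a hypothetical name. For production use, a maintained robots.txt parser is a safer choice.

```javascript
// Simplified robots.txt check: prefix matching only, no wildcard support.
function isPathAllowed(robotsTxt, userAgent, path) {
    const groups = []; // [{ agents: [...], rules: [{ allow, prefix }] }]
    let current = null;
    let lastWasAgent = false;

    for (const rawLine of robotsTxt.split(/\r?\n/)) {
        const line = rawLine.replace(/#.*/, '').trim();
        const sep = line.indexOf(':');
        if (sep === -1) continue;

        const field = line.slice(0, sep).trim().toLowerCase();
        const value = line.slice(sep + 1).trim();

        if (field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules
            if (!lastWasAgent) {
                current = { agents: [], rules: [] };
                groups.push(current);
            }
            current.agents.push(value.toLowerCase());
            lastWasAgent = true;
            continue;
        }

        lastWasAgent = false;
        // An empty Disallow value means "nothing is disallowed", so skip it
        if ((field === 'allow' || field === 'disallow') && current && value) {
            current.rules.push({ allow: field === 'allow', prefix: value });
        }
    }

    // Prefer a group naming this agent; otherwise fall back to the "*" group
    const ua = userAgent.toLowerCase();
    const group =
        groups.find(g => g.agents.some(a => a && a !== '*' && ua.includes(a))) ||
        groups.find(g => g.agents.includes('*'));
    if (!group) return true;

    // The longest matching prefix wins; on a tie, Allow wins
    let best = null;
    for (const rule of group.rules) {
        if (!path.startsWith(rule.prefix)) continue;
        const longer = !best || rule.prefix.length > best.prefix.length;
        const tieAllow = best && rule.prefix.length === best.prefix.length && rule.allow;
        if (longer || tieAllow) best = rule;
    }
    return best ? best.allow : true;
}
```

Running a check like this before each request, alongside rate limiting, keeps the scraper's behavior aligned with the best practices listed above.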

Output and Integration

Once you've collected company data, here's how to make it useful:

// Export enriched company profiles
const enrichedCompanies = companies.map(company => ({
    // Core identity
    name: company.name,
    linkedinUrl: company.url,
    website: company.website,

    // Classification
    industry: company.industry,
    companyType: company.companyType,
    specialties: company.specialties,

    // Size and growth
    employeeRange: company.companySizeRange,
    linkedinEmployees: company.totalOnLinkedIn,
    activeJobs: company.activeJobCount,
    isHiring: (company.activeJobCount || 0) > 0,

    // Engagement
    followers: company.followers,
    recentPostCount: company.posts?.length || 0,
    avgPostEngagement: company.posts?.length
        ? Math.round(company.posts.reduce((sum, p) => sum + p.totalEngagement, 0) / company.posts.length)
        : null,

    // Metadata
    dataCompleteness: company.dataCompleteness,
    scrapedAt: company.scrapedAt,
}));

// Access via Apify API
const datasetUrl = `https://api.apify.com/v2/datasets/${datasetId}/items`;
// Formats: ?format=json, ?format=csv, ?format=xlsx
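The dataset URL above can be built with query parameters rather than string concatenation. The sketch below assumes the `format`, `limit`, `offset`, and `clean` query parameters of Apify's dataset items endpoint; check the current Apify API reference before relying on them. `datasetItemsUrl` is a hypothetical helper name.

```javascript
// Hypothetical helper: build a dataset items URL with optional paging
function datasetItemsUrl(datasetId, { format = 'json', limit, offset, clean } = {}) {
    const url = new URL(
        `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`
    );
    url.searchParams.set('format', format);
    if (limit != null) url.searchParams.set('limit', String(limit));
    if (offset != null) url.searchParams.set('offset', String(offset));
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}
```

For example, `datasetItemsUrl(datasetId, { format: 'csv', limit: 100 })` yields a URL for a CSV export of the first 100 items.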

Conclusion

LinkedIn company scraping is one of the most valuable data extraction tasks in the B2B space. The combination of company profiles, employee insights, and activity data gives you a comprehensive view of any business — useful for sales prospecting, competitive intelligence, market research, and investment analysis.

The key challenges are LinkedIn's aggressive anti-scraping measures and the legal nuances around professional data. Apify's infrastructure, with residential proxies and session management, addresses most of the technical challenges, while focusing on publicly available company-level data (not individual profiles) reduces the ethical and privacy risk.

Check out the Apify Store for ready-to-use LinkedIn company scrapers, or build your own using the patterns in this guide. For job-specific LinkedIn data, scrapers like the LinkedIn Jobs Scraper can extract public job listings with structured salary and location data.

The professional data landscape is evolving rapidly — LinkedIn continues to tighten access while the demand for business intelligence data only grows. Building a robust, ethical scraping pipeline now positions you well for the future.

Questions about LinkedIn company scraping? Share your use case in the comments — I'd love to hear what you're building with this data.
