agenthustler
Indeed Scraping: Extract Job Listings, Salaries and Company Data

Introduction: Why Scrape Indeed?

Indeed is the world's largest job aggregator, with over 300 million unique visitors per month. Whether you're building a job market analytics dashboard, tracking salary trends across industries, or feeding data into a recruitment pipeline, Indeed holds a goldmine of structured employment data.

But here's the thing — Indeed doesn't offer a public API for job listings. If you want programmatic access to job titles, salary ranges, company profiles, and location data at scale, you need to scrape it.

In this guide, I'll walk you through everything you need to know about extracting data from Indeed: the site's structure, what data you can pull, practical code examples, and how to scale your scraper using Apify's cloud infrastructure.


Understanding Indeed's Page Structure

Before writing a single line of code, you need to understand how Indeed organizes its data. Indeed has several key page types:

1. Search Results Pages

The search results page (indeed.com/jobs?q=...&l=...) is your primary entry point. Each result card contains:

  • Job title (linked to the full posting)
  • Company name (linked to the company page)
  • Location (city, state, remote indicator)
  • Salary snippet (when available — roughly 40% of listings)
  • Posted date (relative, like "3 days ago")
  • Job snippet (first ~160 characters of the description)

Indeed paginates results using the start parameter, incrementing by 10 (e.g., start=0, start=10, start=20).
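Since pagination is driven entirely by the start parameter stepping in increments of 10, you can precompute result-page URLs up front instead of chasing "next" links. A minimal sketch (buildSearchUrls is a helper name introduced here, not something from the site or any library):

```javascript
// Build the first `pages` search-results URLs for a query,
// stepping the `start` parameter by 10 per page (Indeed's page size).
function buildSearchUrls(query, location, pages) {
    const urls = [];
    for (let i = 0; i < pages; i++) {
        const params = new URLSearchParams({
            q: query,
            l: location,
            start: String(i * 10),
        });
        urls.push(`https://www.indeed.com/jobs?${params}`);
    }
    return urls;
}
```

Feeding these URLs to the crawler as a batch also makes retries simpler, since each page is an independent request.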

2. Job Detail Pages

Clicking into a listing (indeed.com/viewjob?jk=...) gives you the full posting:

  • Complete job description (HTML formatted)
  • Full salary range (if disclosed)
  • Benefits list
  • Job type (full-time, part-time, contract)
  • Experience level
  • Company rating on Indeed

3. Company Pages

Indeed's company pages (indeed.com/cmp/Company-Name) aggregate:

  • Overall rating and review count
  • Salary data by role
  • Photos and culture information
  • All active job listings for that company

Setting Up Your Scraping Environment

Let's start with a basic Node.js setup using Crawlee, the open-source web scraping library that powers Apify actors:

// package.json dependencies
{
  "dependencies": {
    "crawlee": "^3.8.0",
    "cheerio": "^1.0.0"
  }
}
Then set up a crawler skeleton that routes requests by page type:
import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 100,
    maxConcurrency: 5,

    async requestHandler({ request, $, enqueueLinks }) {
        const { label } = request.userData;

        if (label === 'SEARCH') {
            await handleSearchPage($, enqueueLinks);
        } else if (label === 'DETAIL') {
            await handleDetailPage($, request);
        }
    },
});

The key insight here is using labels to route different page types to different handlers. Indeed's search pages and detail pages have completely different DOM structures, so you need separate parsing logic for each.


Extracting Job Listings from Search Results

Here's how to parse Indeed's search results page:

async function handleSearchPage($, enqueueLinks) {
    const jobs = [];

    // Each result card renders as div.job_seen_beacon; the data-jk
    // job key sits on an enclosing element, hence .closest() below
    $('div.job_seen_beacon').each((index, element) => {
        const card = $(element);

        const job = {
            title: card.find('h2.jobTitle span').text().trim(),
            company: card.find('[data-testid="company-name"]').text().trim(),
            location: card.find('[data-testid="text-location"]').text().trim(),
            salary: card.find('.salary-snippet-container').text().trim() || null,
            posted: card.find('.date').text().trim(),
            snippet: card.find('.job-snippet').text().trim(),
            jobKey: card.closest('[data-jk]').attr('data-jk'),
            url: `https://www.indeed.com/viewjob?jk=${card.closest('[data-jk]').attr('data-jk')}`,
        };

        jobs.push(job);
    });

    // Save partial results immediately
    await Dataset.pushData(jobs);

    // Enqueue detail pages for full descriptions
    for (const job of jobs) {
        if (job.jobKey) {
            await enqueueLinks({
                urls: [job.url],
                userData: { label: 'DETAIL', jobKey: job.jobKey },
            });
        }
    }

    // Handle pagination - find the next page link
    const nextPage = $('a[data-testid="pagination-page-next"]').attr('href');
    if (nextPage) {
        await enqueueLinks({
            urls: [`https://www.indeed.com${nextPage}`],
            userData: { label: 'SEARCH' },
        });
    }
}

Pro Tips for Search Page Parsing

  1. Not all listings show salary. Always handle the null case. Indeed only displays salary when the employer provides it or when Indeed estimates it.
  2. Remote jobs have special markers. Look for "Remote" or "Hybrid remote" in the location field.
  3. Sponsored listings appear first. They have a "Sponsored" badge — you may want to flag or filter these.
  4. Indeed limits pagination to ~1000 results. For broad searches, split by location or use date filters to get complete coverage.
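Tips 2 and 3 can be folded into a small post-processing step. This is a sketch only: the field names match the extractor above, and the assumption that the "Sponsored" badge text lands in the posted field is layout-dependent and worth verifying against live markup:

```javascript
// Tag a scraped job with remote/sponsored flags derived from its text fields.
function annotateJob(job) {
    return {
        ...job,
        isRemote: /\b(remote|hybrid)\b/i.test(job.location || ''),
        isSponsored: /sponsored/i.test(job.posted || ''),
    };
}
```

Flagging rather than dropping sponsored listings keeps the option open to analyze them separately later.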

Extracting Salary Data

Salary extraction is one of the most valuable parts of Indeed scraping. Here's how to handle the various formats:

function parseSalary(salaryText) {
    if (!salaryText) return null;

    // Remove whitespace and normalize
    const text = salaryText.replace(/\s+/g, ' ').trim();

    // Patterns Indeed uses:
    // "$50,000 - $70,000 a year"
    // "$25 - $35 an hour"
    // "From $60,000 a year"
    // "Up to $100,000 a year"
    // "$80,000 a year" (single value)

    const rangeMatch = text.match(
        /\$([\d,]+(?:\.\d{2})?)\s*[-–]\s*\$([\d,]+(?:\.\d{2})?)\s*(an?\s+\w+)/i
    );

    if (rangeMatch) {
        return {
            min: parseFloat(rangeMatch[1].replace(/,/g, '')),
            max: parseFloat(rangeMatch[2].replace(/,/g, '')),
            period: rangeMatch[3].trim().toLowerCase(),
            type: 'range',
        };
    }

    const singleMatch = text.match(
        /(from|up to)?\s*\$([\d,]+(?:\.\d{2})?)\s*(an?\s+\w+)/i
    );

    if (singleMatch) {
        const value = parseFloat(singleMatch[2].replace(/,/g, ''));
        return {
            min: singleMatch[1]?.toLowerCase() === 'up to' ? null : value,
            max: singleMatch[1]?.toLowerCase() === 'from' ? null : value,
            period: singleMatch[3].trim().toLowerCase(),
            type: singleMatch[1] ? 'bounded' : 'exact',
        };
    }

    return { raw: text, type: 'unparsed' };
}

// Normalize all salaries to annual for comparison
function normalizeToAnnual(salary) {
    if (!salary || salary.type === 'unparsed') return null;

    const multipliers = {
        'an hour': 2080,    // 40hrs * 52 weeks
        'a hour': 2080,
        'a day': 260,       // 5 days * 52 weeks
        'a week': 52,
        'a month': 12,
        'a year': 1,
    };

    const mult = multipliers[salary.period] || 1;

    return {
        minAnnual: salary.min ? salary.min * mult : null,
        maxAnnual: salary.max ? salary.max * mult : null,
    };
}

This salary parser handles the common formats Indeed displays; anything it can't match is preserved as raw text for manual review. The normalization function lets you compare hourly, weekly, and annual salaries on the same scale, which is essential for any salary analytics project.
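Once normalized, it's often handy to collapse each listing to a single number for charting. Here's a small helper over the shape normalizeToAnnual returns (annualMidpoint is my own name, not part of the parser above):

```javascript
// Single annual figure from a normalized salary record
// ({ minAnnual, maxAnnual }): the midpoint of a range, or
// whichever bound exists for "From"/"Up to" listings.
function annualMidpoint(norm) {
    if (!norm) return null;
    const { minAnnual, maxAnnual } = norm;
    if (minAnnual != null && maxAnnual != null) {
        return (minAnnual + maxAnnual) / 2;
    }
    return minAnnual ?? maxAnnual ?? null;
}
```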


Extracting Company Data

Indeed's company pages are rich data sources. Here's how to extract company profiles:

async function handleCompanyPage($, request) {
    const company = {
        name: $('div[data-testid="company-name"]').text().trim(),
        rating: parseFloat($('[data-testid="rating-value"]').text()) || null,
        reviewCount: parseInt(
            $('[data-testid="review-count"]').text().replace(/[^\d]/g, '')
        ) || 0,
        industry: $('[data-testid="company-industry"]').text().trim() || null,
        size: $('[data-testid="company-size"]').text().trim() || null,
        founded: $('[data-testid="company-founded"]').text().trim() || null,
        revenue: $('[data-testid="company-revenue"]').text().trim() || null,
        headquarters: $('[data-testid="company-headquarters"]').text().trim() || null,
        description: $('[data-testid="company-description"]').text().trim() || null,
        url: request.url,
    };

    // Extract salary data by role
    const salariesByRole = [];
    $('[data-testid="salary-row"]').each((_, el) => {
        salariesByRole.push({
            role: $(el).find('.salary-role').text().trim(),
            averageSalary: $(el).find('.salary-average').text().trim(),
            salaryRange: $(el).find('.salary-range').text().trim(),
            dataPoints: parseInt($(el).find('.salary-count').text().replace(/[^\d]/g, '')) || 0,
        });
    });
    company.salariesByRole = salariesByRole;

    await Dataset.pushData(company);
}

Scaling with Apify

Running a scraper locally is fine for testing, but for production workloads — monitoring thousands of job listings daily, tracking salary trends across markets — you need cloud infrastructure. That's where Apify comes in.

Apify provides:

  • Automatic proxy rotation — Indeed actively blocks repeated requests from the same IP
  • Scheduling — Run your scraper daily, weekly, or on any cron schedule
  • Storage — Results saved to datasets you can export as JSON, CSV, or connect via API
  • Monitoring — Email alerts on failures, automatic retries

Here's how to convert the above code into an Apify Actor:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();
const {
    searchQueries = ['software engineer'],
    locations = ['United States'],
    maxResults = 100,
    includeDetails = true,
} = input;

// Build start URLs from search parameters
const startUrls = [];
for (const query of searchQueries) {
    for (const location of locations) {
        const params = new URLSearchParams({
            q: query,
            l: location,
        });
        startUrls.push({
            url: `https://www.indeed.com/jobs?${params}`,
            userData: { label: 'SEARCH', query, location },
        });
    }
}

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: maxResults,
    maxConcurrency: 3,
    requestHandlerTimeoutSecs: 60,

    async requestHandler({ request, $, enqueueLinks }) {
        // ... same handlers as above
    },

    async failedRequestHandler({ request }) {
        console.log(`Request ${request.url} failed after retries`);
    },
});

await crawler.run(startUrls);
await Actor.exit();

Handling Indeed's Anti-Scraping Measures

Indeed employs several anti-scraping techniques:

  1. Rate limiting — Too many requests from one IP get blocked. Solution: Use residential proxies and keep concurrency low (3-5).
  2. CAPTCHA challenges — Triggered by suspicious patterns. Solution: Rotate user agents and add random delays between requests.
  3. Dynamic rendering — Some content loads via JavaScript. Solution: Use a browser-based crawler (Playwright) for these elements.
  4. Session tracking — Indeed tracks browsing patterns. Solution: Rotate sessions and clear cookies between searches.
Here's how those mitigations translate into crawler configuration:

// Anti-detection configuration
const crawler = new CheerioCrawler({
    proxyConfiguration,

    // Randomize delays between requests
    minConcurrency: 1,
    maxConcurrency: 3,

    preNavigationHooks: [
        async ({ request }) => {
            // Random delay 2-5 seconds
            const delay = 2000 + Math.random() * 3000;
            await new Promise(r => setTimeout(r, delay));

            // Rotate user agents
            request.headers = {
                'User-Agent': getRandomUserAgent(),
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept': 'text/html,application/xhtml+xml',
            };
        },
    ],
});
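The hook above calls getRandomUserAgent(), which isn't a Crawlee built-in, so you have to supply it yourself. A minimal sketch — the UA strings here are illustrative examples; in production, maintain a larger, regularly updated pool:

```javascript
// A small pool of desktop browser user-agent strings (illustrative).
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0',
];

// Pick one at random for each request.
function getRandomUserAgent() {
    return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}
```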

Practical Use Cases

1. Salary Benchmarking Tool

Combine Indeed salary data across roles and locations to build a benchmarking dashboard:

// Aggregate salary data by role and location
function buildSalaryBenchmark(jobs) {
    const benchmark = {};

    for (const job of jobs) {
        const salary = normalizeToAnnual(parseSalary(job.salary));
        if (!salary) continue;

        const key = `${job.title}|${job.location}`;
        if (!benchmark[key]) {
            benchmark[key] = { salaries: [], title: job.title, location: job.location };
        }

        if (salary.minAnnual) benchmark[key].salaries.push(salary.minAnnual);
        if (salary.maxAnnual) benchmark[key].salaries.push(salary.maxAnnual);
    }

    // Calculate statistics
    return Object.values(benchmark).map(entry => ({
        title: entry.title,
        location: entry.location,
        median: median(entry.salaries),
        p25: percentile(entry.salaries, 25),
        p75: percentile(entry.salaries, 75),
        sampleSize: entry.salaries.length,
    }));
}
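The median() and percentile() calls above are ordinary statistics helpers, not library functions, so they need to be defined. Here's one implementation using linear interpolation between ranks:

```javascript
// p-th percentile of a numeric array via linear interpolation
// between the two nearest sorted ranks. Returns null for empty input.
function percentile(values, p) {
    if (!values.length) return null;
    const sorted = [...values].sort((a, b) => a - b);
    const rank = (p / 100) * (sorted.length - 1);
    const lo = Math.floor(rank);
    const hi = Math.ceil(rank);
    if (lo === hi) return sorted[lo];
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
}

// The median is just the 50th percentile.
function median(values) {
    return percentile(values, 50);
}
```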

2. Job Market Trend Tracker

Run daily scrapes and compare results over time:

// Track new listings, removed listings, and salary changes
async function trackTrends(todayData, yesterdayData) {
    const todayKeys = new Set(todayData.map(j => j.jobKey));
    const yesterdayKeys = new Set(yesterdayData.map(j => j.jobKey));

    const newListings = todayData.filter(j => !yesterdayKeys.has(j.jobKey));
    const removedListings = yesterdayData.filter(j => !todayKeys.has(j.jobKey));

    return {
        date: new Date().toISOString().split('T')[0],
        totalActive: todayData.length,
        newToday: newListings.length,
        removedToday: removedListings.length,
        averageSalary: calculateAverageSalary(todayData),
        topHiringCompanies: getTopCompanies(todayData, 10),
    };
}
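trackTrends() leans on two helpers not shown above. Here are sketches, under the assumption (mine, not from the original) that each job record carries a pre-computed numeric annualSalary field, e.g. a midpoint derived from normalizeToAnnual:

```javascript
// Mean of the numeric annualSalary fields; listings without one are skipped.
function calculateAverageSalary(jobs) {
    const values = jobs
        .map(j => j.annualSalary)
        .filter(v => typeof v === 'number');
    if (!values.length) return null;
    return values.reduce((a, b) => a + b, 0) / values.length;
}

// Top-n companies by number of active listings.
function getTopCompanies(jobs, n) {
    const counts = new Map();
    for (const job of jobs) {
        counts.set(job.company, (counts.get(job.company) || 0) + 1);
    }
    return [...counts.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, n)
        .map(([company, openings]) => ({ company, openings }));
}
```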

3. Competitive Intelligence

Monitor specific companies' hiring patterns to understand their growth areas:

const targetCompanies = ['Google', 'Meta', 'Amazon', 'Apple'];

for (const company of targetCompanies) {
    const searchUrl = `https://www.indeed.com/jobs?q=company:${encodeURIComponent(company)}`;
    // Track: number of openings, departments hiring, salary ranges, locations
}

Data Quality and Validation

Raw scraped data needs cleaning. Here's a validation pipeline:

function validateJob(job) {
    const issues = [];

    if (!job.title || job.title.length < 3) issues.push('Invalid title');
    if (!job.company) issues.push('Missing company');
    if (!job.location) issues.push('Missing location');
    if (job.salary && !parseSalary(job.salary)) issues.push('Unparseable salary');
    if (!job.jobKey) issues.push('Missing job key');

    return {
        ...job,
        isValid: issues.length === 0,
        validationIssues: issues,
        scrapedAt: new Date().toISOString(),
    };
}

// Deduplicate by job key
function deduplicateJobs(jobs) {
    const seen = new Map();
    for (const job of jobs) {
        // Keep the first copy, but let a later copy that carries
        // salary data replace one that doesn't
        if (!seen.has(job.jobKey) || job.salary) {
            seen.set(job.jobKey, job);
        }
    }
    return Array.from(seen.values());
}

Legal and Ethical Considerations

Before scraping Indeed (or any site), consider these important points:

  • Terms of Service: Indeed's ToS restricts automated access. Be aware of the legal landscape in your jurisdiction. The US Ninth Circuit's hiQ v. LinkedIn ruling established that scraping publicly accessible data may not violate the CFAA, but this doesn't override contractual terms.
  • Rate limiting: Even if scraping is legal, hammering a server with requests can constitute a denial of service. Always throttle your requests.
  • Personal data: Job listings are generally non-personal, but be careful with data that could identify individuals.
  • robots.txt: Indeed's robots.txt restricts certain paths. Respecting it demonstrates good faith.

Output Format and Integration

Apify actors output data to datasets that you can access via API or export:

// Access your results via the Apify API
// (for private datasets, append &token=YOUR_API_TOKEN)
const datasetId = 'your-dataset-id';
const response = await fetch(
    `https://api.apify.com/v2/datasets/${datasetId}/items?format=json`
);
const jobs = await response.json();

// Or export as CSV for spreadsheet analysis
const csvUrl = `https://api.apify.com/v2/datasets/${datasetId}/items?format=csv`;
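For large runs, the items endpoint also accepts offset and limit query parameters, so you can page through results in chunks instead of downloading everything at once. A small URL builder (datasetItemsUrl is a helper name of my own):

```javascript
// Build a paged items URL for the Apify dataset API, which accepts
// `format`, `offset`, and `limit` query parameters.
function datasetItemsUrl(datasetId, { offset = 0, limit = 1000, format = 'json' } = {}) {
    const params = new URLSearchParams({
        format,
        offset: String(offset),
        limit: String(limit),
    });
    return `https://api.apify.com/v2/datasets/${datasetId}/items?${params}`;
}
```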

You can also set up webhooks to trigger downstream processing whenever a scrape completes — push to a database, send Slack notifications, or feed into your analytics pipeline.


Conclusion

Indeed scraping opens up powerful possibilities for job market analysis, salary benchmarking, and competitive intelligence. The key challenges are handling Indeed's anti-scraping measures and normalizing the varied salary formats.

Using Apify's infrastructure, you can run these scrapers at scale with built-in proxy rotation, scheduling, and monitoring. Check out the Apify Store for ready-made Indeed scrapers you can run immediately, or build your own custom actor following the patterns in this guide.

The job market is one of the most dynamic datasets on the internet — every day, thousands of listings appear and disappear, salaries shift, and hiring patterns change. With the right scraping setup, you can turn this chaos into structured, actionable intelligence.


Happy scraping! If you have questions about Indeed scraping or want to share your use case, drop a comment below.
