DEV Community

agenthustler
Levels.fyi Scraping: Extract Salary Data and Tech Compensation Reports

Compensation transparency has become one of the most powerful forces reshaping the tech job market. Platforms like Levels.fyi sit at the center of this revolution, aggregating verified salary data from thousands of tech professionals across companies like Google, Meta, Amazon, Apple, Netflix, and hundreds more.

Whether you're a recruiter benchmarking compensation packages, a job seeker evaluating an offer, a startup founder setting salary bands, or a data analyst studying compensation trends — extracting structured data from Levels.fyi can unlock insights that are otherwise buried behind countless web pages.

In this guide, we'll break down how Levels.fyi organizes its data, walk through practical scraping approaches using both Node.js and Python, and show how to scale your extraction using Apify's cloud scraping platform.


Understanding How Levels.fyi Structures Its Data

Before writing a single line of scraping code, you need to understand the data architecture of the site you're targeting. Levels.fyi organizes compensation data across several key dimensions:

Company Profiles

Each company has a dedicated page showing aggregated compensation data. For a company like Google, you'll find:

  • Total compensation by level (L3, L4, L5, L6, L7, etc.)
  • Compensation breakdowns: base salary, stock/equity, bonus
  • Historical trends: how pay has changed year over year
  • Location-based adjustments: Bay Area vs. Seattle vs. New York vs. remote

Role and Level Taxonomy

Levels.fyi maintains a proprietary leveling system that maps different company titles to equivalent levels. For example:

Company   Title           Levels.fyi Level
Google    L5 Senior SWE   Senior
Meta      E5              Senior
Amazon    SDE II          Mid-Senior
Apple     ICT4            Senior

This normalization is what makes cross-company comparison possible and is a key part of the data's value.
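In code, this normalization is essentially a lookup table. The mapping below is illustrative only — the real Levels.fyi taxonomy is proprietary and covers far more companies and levels than the sample rows above:

```python
# Illustrative (company, title) -> normalized-level map, mirroring the
# sample table above. The real taxonomy is proprietary and much larger.
LEVEL_MAP = {
    ("google", "L5 Senior SWE"): "Senior",
    ("meta", "E5"): "Senior",
    ("amazon", "SDE II"): "Mid-Senior",
    ("apple", "ICT4"): "Senior",
}


def normalize_level(company: str, title: str) -> str:
    """Map a company-specific title to a normalized level, if known."""
    return LEVEL_MAP.get((company.lower(), title), "Unknown")


print(normalize_level("Google", "L5 Senior SWE"))  # Senior
```

Building out this table for your target companies is what lets the cross-company aggregations later in this guide line up apples to apples.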

Compensation Components

Every data point on Levels.fyi includes a detailed breakdown:

  • Base Salary: Annual fixed compensation
  • Stock/Equity: RSUs, stock options, or equity grants (annualized)
  • Bonus: Annual performance bonus, signing bonus (sometimes amortized)
  • Total Compensation (TC): The sum of all components
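As a quick sanity check for your pipeline, you can model these components explicitly and verify that TC is their sum — a minimal sketch:

```python
from dataclasses import dataclass


@dataclass
class CompEntry:
    base_salary: float     # annual fixed compensation
    stock_per_year: float  # equity grant, annualized
    bonus: float           # annual bonus plus amortized signing bonus

    @property
    def total_comp(self) -> float:
        """Total compensation is the sum of the three components."""
        return self.base_salary + self.stock_per_year + self.bonus


entry = CompEntry(base_salary=180_000.0, stock_per_year=120_000.0, bonus=30_000.0)
print(entry.total_comp)  # 330000.0
```

Entries whose components don't add up to the reported TC are worth flagging during validation later on.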

Geographic Data

Salary figures are tagged with location information, which is critical because a $200K TC in San Francisco has very different purchasing power than $200K in Austin, Texas.
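When comparing offers across cities, it helps to normalize TC by a cost-of-living index. The index values below are made-up placeholders for illustration — in a real analysis, pull them from a maintained source such as BLS or Numbeo data:

```python
# Hypothetical cost-of-living indices (San Francisco = 1.00), for
# illustration only — substitute real index data in practice.
COL_INDEX = {"san-francisco": 1.00, "seattle": 0.85, "austin": 0.70}


def col_adjusted_tc(tc: float, location: str) -> int:
    """Express total comp in San Francisco-equivalent purchasing power."""
    return round(tc / COL_INDEX.get(location, 1.0))


print(col_adjusted_tc(200_000, "austin"))  # 285714
```

Under these sample indices, $200K in Austin buys roughly what $286K would in San Francisco — exactly the kind of difference the location tags let you quantify.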


The Technical Challenge of Scraping Levels.fyi

Levels.fyi is a modern React-based single-page application (SPA). This means:

  1. Data is loaded dynamically via API calls, not rendered in the initial HTML
  2. Content requires JavaScript execution to appear in the DOM
  3. Pagination and filtering happen client-side
  4. Rate limiting and bot detection are in place

This makes simple HTTP request-based scraping insufficient. You need either:

  • A headless browser (Puppeteer, Playwright) to render the JavaScript
  • API endpoint interception to capture the underlying data requests

Approach 1: Headless Browser Scraping with Node.js

Let's start with a Puppeteer-based approach that navigates the site and extracts compensation data:

const puppeteer = require('puppeteer');

async function scrapeLevelsFyi(company) {
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox', '--disable-setuid-sandbox']
    });

    const page = await browser.newPage();

    // Set a realistic user agent
    await page.setUserAgent(
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ' +
        'AppleWebKit/537.36 (KHTML, like Gecko) ' +
        'Chrome/120.0.0.0 Safari/537.36'
    );

    // Navigate to the company's compensation page
    const url = `https://www.levels.fyi/companies/${company}/salaries`;
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    // Wait for salary data to load. The selector is a heuristic: class
    // names on a React SPA are generated and change between builds, so
    // inspect the live DOM and adjust as needed.
    await page.waitForSelector('[class*="salary"]', { timeout: 15000 })
        .catch(() => console.log('Salary selector not found, trying alternatives...'));

    // Extract compensation data from the page
    const salaryData = await page.evaluate(() => {
        const results = [];
        const rows = document.querySelectorAll('tr, [class*="row"]');

        rows.forEach(row => {
            const cells = row.querySelectorAll('td, [class*="cell"]');
            if (cells.length >= 4) {
                results.push({
                    level: cells[0]?.textContent?.trim(),
                    title: cells[1]?.textContent?.trim(),
                    totalComp: cells[2]?.textContent?.trim(),
                    base: cells[3]?.textContent?.trim(),
                    stock: cells[4]?.textContent?.trim() || null,
                    bonus: cells[5]?.textContent?.trim() || null,
                });
            }
        });

        return results;
    });

    console.log(`Found ${salaryData.length} salary entries for ${company}`);
    await browser.close();
    return salaryData;
}

// Extract data for multiple companies
async function scrapeMultipleCompanies(companies) {
    const allData = {};

    for (const company of companies) {
        console.log(`Scraping ${company}...`);
        allData[company] = await scrapeLevelsFyi(company);

        // Respectful delay between requests
        await new Promise(resolve => setTimeout(resolve, 3000));
    }

    return allData;
}

// Usage
const companies = ['google', 'meta', 'amazon', 'apple', 'microsoft'];
scrapeMultipleCompanies(companies)
    .then(data => {
        const fs = require('fs');
        fs.writeFileSync('salary_data.json', JSON.stringify(data, null, 2));
        console.log('Data saved to salary_data.json');
    })
    .catch(console.error);

Intercepting API Calls for Cleaner Data

A more efficient approach is to intercept the network requests that Levels.fyi's frontend makes to its backend API:

const puppeteer = require('puppeteer');

async function interceptSalaryAPI(company) {
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    const apiResponses = [];

    // Listen for API responses that look like salary data. The
    // '/api/' + 'salaries' pattern is a heuristic — confirm the real
    // endpoint paths in the browser DevTools Network tab first.
    page.on('response', async (response) => {
        const url = response.url();
        if (url.includes('/api/') && url.includes('salaries')) {
            try {
                const data = await response.json();
                apiResponses.push({
                    url: url,
                    data: data,
                    timestamp: new Date().toISOString()
                });
            } catch (e) {
                // Not a JSON response, skip
            }
        }
    });

    await page.goto(
        `https://www.levels.fyi/companies/${company}/salaries`,
        { waitUntil: 'networkidle2' }
    );

    // Scroll to trigger lazy-loaded content
    await autoScroll(page);

    await browser.close();
    return apiResponses;
}

async function autoScroll(page) {
    await page.evaluate(async () => {
        await new Promise((resolve) => {
            let totalHeight = 0;
            const distance = 300;
            const timer = setInterval(() => {
                window.scrollBy(0, distance);
                totalHeight += distance;
                if (totalHeight >= document.body.scrollHeight) {
                    clearInterval(timer);
                    resolve();
                }
            }, 200);
        });
    });
}

Approach 2: Python-Based Extraction

If Python is more your style, here's an equivalent approach using Playwright:

import asyncio
import json
from playwright.async_api import async_playwright


async def scrape_levels_fyi(company: str) -> list[dict]:
    """Scrape salary data for a specific company from Levels.fyi."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/120.0.0.0 Safari/537.36"
            )
        )
        page = await context.new_page()

        api_data = []

        # Intercept API responses
        async def handle_response(response):
            if "/api/" in response.url and "salaries" in response.url:
                try:
                    data = await response.json()
                    api_data.append(data)
                except Exception:
                    pass

        page.on("response", handle_response)

        url = f"https://www.levels.fyi/companies/{company}/salaries"
        await page.goto(url, wait_until="networkidle")

        # Scroll through the page to load all data
        previous_height = 0
        while True:
            current_height = await page.evaluate("document.body.scrollHeight")
            if current_height == previous_height:
                break
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await asyncio.sleep(1)
            previous_height = current_height

        await browser.close()
        return api_data


async def build_compensation_report(companies: list[str]):
    """Build a comprehensive compensation report across companies."""
    report = {}

    for company in companies:
        print(f"Extracting data for {company}...")
        data = await scrape_levels_fyi(company)
        report[company] = data
        await asyncio.sleep(2)  # Respectful delay

    return report


# Run the scraper
companies = ["google", "meta", "amazon", "apple", "netflix"]
report = asyncio.run(build_compensation_report(companies))

with open("compensation_report.json", "w") as f:
    json.dump(report, f, indent=2)

print(f"Report saved with data for {len(report)} companies")

Processing and Analyzing the Data

Once you have raw data, you'll want to structure it for analysis:

import pandas as pd
import json


def process_salary_data(raw_data: dict) -> pd.DataFrame:
    """Transform raw salary data into a structured DataFrame."""
    records = []

    for company, entries in raw_data.items():
        for entry in entries:
            # NOTE: the key names below are illustrative — map them to the
            # fields the intercepted API actually returns.
            if isinstance(entry, dict):
                records.append({
                    "company": company,
                    "level": entry.get("level", "Unknown"),
                    "title": entry.get("title", ""),
                    "base_salary": parse_currency(entry.get("baseSalary", 0)),
                    "stock_grant": parse_currency(entry.get("stockGrantValue", 0)),
                    "bonus": parse_currency(entry.get("bonus", 0)),
                    "total_comp": parse_currency(entry.get("totalCompensation", 0)),
                    "location": entry.get("location", "Unknown"),
                    "years_experience": entry.get("yearsExperience", None),
                    "years_at_company": entry.get("yearsAtCompany", None),
                })

    df = pd.DataFrame(records)

    # Calculate derived metrics (drop zero-TC rows first to avoid
    # division-by-zero artifacts in the ratios)
    df = df[df["total_comp"] > 0]
    df["equity_percentage"] = (df["stock_grant"] / df["total_comp"] * 100).round(1)
    df["base_percentage"] = (df["base_salary"] / df["total_comp"] * 100).round(1)

    return df


def parse_currency(value) -> float:
    """Parse currency strings like "$250K" or "$1.2M" into float values."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        cleaned = value.replace("$", "").replace(",", "").strip()
        # Treat K/M as multipliers rather than string substitutions, so
        # decimal values like "1.5K" parse correctly.
        multiplier = 1.0
        if cleaned.endswith(("K", "k")):
            multiplier, cleaned = 1_000.0, cleaned[:-1]
        elif cleaned.endswith(("M", "m")):
            multiplier, cleaned = 1_000_000.0, cleaned[:-1]
        try:
            return float(cleaned) * multiplier
        except ValueError:
            return 0.0
    return 0.0


# Generate comparison report
def generate_comparison(df: pd.DataFrame):
    """Generate cross-company compensation comparison."""
    summary = df.groupby(["company", "level"]).agg({
        "total_comp": ["mean", "median", "min", "max", "count"],
        "base_salary": "mean",
        "stock_grant": "mean",
        "bonus": "mean",
    }).round(0)

    print("\n=== Compensation Comparison by Company and Level ===")
    print(summary.to_string())
    return summary

Scaling with Apify: Cloud-Based Scraping Infrastructure

While local scripts work for small-scale extraction, real-world salary research requires scraping hundreds of companies and thousands of data points. This is where Apify excels.

Apify provides a cloud-based platform for running web scrapers (called "Actors") at scale. Here's how to use it for Levels.fyi data extraction:

Using Apify's Web Scraper Actor

// Apify Actor for Levels.fyi salary extraction
// (written against the Apify SDK v2 API; in SDK v3 the crawler classes
// moved to the separate `crawlee` package)
const Apify = require('apify');

Apify.main(async () => {
    const input = await Apify.getInput();
    const { companies = ['google'], maxResults = 100 } = input;

    const requestQueue = await Apify.openRequestQueue();
    const dataset = await Apify.openDataset();

    // Queue company pages
    for (const company of companies) {
        await requestQueue.addRequest({
            url: `https://www.levels.fyi/companies/${company}/salaries`,
            userData: { company }
        });
    }

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        maxConcurrency: 3,
        navigationTimeoutSecs: 60,
        handlePageFunction: async ({ request, page }) => {
            const { company } = request.userData;

            // Wait for data to load
            await page.waitForTimeout(5000);

            // Extract salary table data
            const salaries = await page.evaluate(() => {
                const data = [];
                // Extract from rendered compensation tables — the selectors
                // below are illustrative, so verify them against the live DOM
                document.querySelectorAll('[data-testid*="salary"], .salary-row')
                    .forEach(el => {
                        const text = el.textContent;
                        data.push({ rawText: text });
                    });
                return data;
            });

            // Store results
            for (const salary of salaries) {
                await dataset.pushData({
                    company,
                    ...salary,
                    scrapedAt: new Date().toISOString(),
                    sourceUrl: request.url,
                });
            }

            console.log(
                `Extracted ${salaries.length} salary entries for ${company}`
            );
        },

        handleFailedRequestFunction: async ({ request, error }) => {
            console.error(`Failed: ${request.url} - ${error.message}`);
        },
    });

    await crawler.run();
    console.log('Scraping complete!');
});

Calling an Apify Actor from Python

You can also trigger Apify Actors programmatically from Python:

from apify_client import ApifyClient
import json


def run_levels_scraper(companies: list[str], api_token: str) -> list[dict]:
    """Run Levels.fyi scraper on Apify cloud."""
    client = ApifyClient(api_token)

    # Start the actor run
    run = client.actor("your-username/levels-fyi-scraper").call(
        run_input={
            "companies": companies,
            "maxResults": 500,
            "includeHistorical": True,
        },
        timeout_secs=300,
    )

    # Fetch results from the dataset
    results = []
    for item in client.dataset(run["defaultDatasetId"]).iterate_items():
        results.append(item)

    print(f"Retrieved {len(results)} salary entries")
    return results


# Run the scraper
companies = [
    "google", "meta", "amazon", "apple", "microsoft",
    "netflix", "uber", "airbnb", "stripe", "coinbase"
]

results = run_levels_scraper(companies, "your_apify_api_token")

# Save results
with open("levels_fyi_data.json", "w") as f:
    json.dump(results, f, indent=2)

Practical Use Cases for Levels.fyi Data

1. Compensation Benchmarking for Recruiters

Recruiters can build real-time compensation benchmarks by scraping salary data across competing companies. This data helps craft competitive offers and reduces the back-and-forth in salary negotiations.

2. Career Planning and Offer Evaluation

Job seekers can extract data for their target company and level to understand whether an offer is at the 25th, 50th, or 75th percentile. Cross-referencing with location data helps account for cost-of-living differences.
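Once you have a list of scraped TC values for a given company and level, percentile rank is a few lines with the standard library:

```python
from bisect import bisect_left


def offer_percentile(offer_tc: float, market_tcs: list[float]) -> float:
    """Return the percentile rank of an offer within scraped market TCs."""
    ranked = sorted(market_tcs)
    rank = bisect_left(ranked, offer_tc)  # entries strictly below the offer
    return round(100 * rank / len(ranked), 1)


# Sample market data for illustration — replace with your scraped TCs
market = [180_000, 210_000, 250_000, 300_000, 350_000, 400_000]
print(offer_percentile(260_000, market))  # 50.0 — a median offer
```

An offer landing below the 25th percentile of the scraped distribution is a strong signal there's room to negotiate.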

3. Market Research for Startups

Startup founders setting compensation bands can use Levels.fyi data to understand what Big Tech pays at equivalent levels, then decide how much to offset with equity versus cash.

4. Academic and Economic Research

Researchers studying wage inequality, the gender pay gap, or the impact of remote work on compensation can build longitudinal datasets by scraping historical data points.

5. Investment Analysis

Understanding compensation trends at specific companies can provide signals about talent retention, hiring velocity, and overall company health — all relevant to investment decisions.


Handling Common Challenges

Dynamic Content Loading

Levels.fyi loads data asynchronously. Always use waitUntil: 'networkidle2' and add explicit waits for data elements:

// Wait for specific data elements before extracting
await page.waitForFunction(() => {
    const rows = document.querySelectorAll('[class*="compensation"]');
    return rows.length > 0;
}, { timeout: 15000 });

Anti-Bot Measures

Rotate user agents, add random delays between requests, and consider using residential proxies for large-scale scraping. Apify's proxy infrastructure handles this automatically.
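A minimal sketch of the first two techniques in Python — the user-agent pool and delay parameters are arbitrary examples, not tuned values:

```python
import random
import time

# Small pool of realistic user-agent strings to rotate through; extend as needed
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def pick_user_agent() -> str:
    """Choose a random user agent for the next browser session."""
    return random.choice(USER_AGENTS)


def polite_delay(base: float = 2.0, jitter: float = 3.0) -> float:
    """Sleep for base + random jitter seconds so request timing looks human."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay
```

Drop `polite_delay()` into the per-company loops shown earlier in place of the fixed `sleep` calls, and pass `pick_user_agent()` to each new browser context.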

Data Quality

Not all entries on Levels.fyi are verified. Build validation into your pipeline:

def validate_salary_entry(entry: dict) -> bool:
    """Basic validation for salary data quality."""
    tc = entry.get("total_comp", 0)
    base = entry.get("base_salary", 0)

    # Filter obvious outliers
    if tc < 30000 or tc > 2000000:
        return False
    if base < 20000 or base > 500000:
        return False
    if base > tc:
        return False

    return True

Ethical Considerations and Best Practices

When scraping compensation data, keep these principles in mind:

  1. Respect robots.txt: Always check and honor the site's robots.txt directives
  2. Rate limiting: Don't overwhelm the server — add delays between requests
  3. Data privacy: Salary data should be aggregated, never tied to individuals
  4. Terms of service: Review and respect the platform's ToS
  5. Caching: Store results locally to avoid redundant requests
  6. Attribution: If publishing analysis, credit Levels.fyi as the data source
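The rate-limiting and caching points can be combined in a small helper that stores each company's results on disk, so re-runs never re-hit the site. A sketch, where `fetch_fn` stands in for whichever scraper function from this guide you use:

```python
import json
from pathlib import Path


def cached_fetch(company: str, fetch_fn, cache_dir: str = "cache"):
    """Return cached results for a company if present; otherwise call
    fetch_fn(company), cache the result as JSON, and return it."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)
    cache_file = cache_path / f"{company}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    data = fetch_fn(company)
    cache_file.write_text(json.dumps(data))
    return data
```

Delete the cache directory (or add a max-age check) when you want fresh data; otherwise every repeated run is served locally.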

Conclusion

Levels.fyi is one of the richest sources of tech compensation data available. By combining headless browser techniques with cloud-based scraping infrastructure like Apify, you can extract, structure, and analyze salary data at scale.

Whether you're building a compensation benchmarking tool, evaluating a job offer, or conducting market research, the techniques covered in this guide give you the foundation to work with Levels.fyi data programmatically.

Start small with a single company, validate your extraction pipeline, and then scale up using Apify's cloud infrastructure to cover the entire tech industry's compensation landscape.
