DEV Community

NexGenData

H1B Salary Research Toolkit for Job Seekers (2026)

If you are on (or considering) an H1B visa in 2026, the single highest-ROI hour you can spend is pulling your own compensation benchmark data. Employers know the prevailing-wage floor for your LCA. Recruiters know the tier bands at your company. The only person who usually does not know is you — which is exactly why you end up $25k under market.

Some context on the 2026 landscape: USCIS received roughly 442,000 H1B registrations for FY2026, against an 85,000-visa annual cap (65,000 regular + 20,000 US master's). The cap hit in March 2025 at 5.2x oversubscription, and that tightness has real leverage effects on compensation — both good and bad. Employers who genuinely want specific candidates are willing to pay premium multiples above prevailing wage. Employers who rely on high-volume LCA filings (the classic "body shops") pay closer to the prevailing-wage floor. Knowing which side of that line your target employer falls on, before you walk into the negotiation, is worth anywhere from $15k to $80k on a senior-engineer offer.

And unlike most career advice, the data for this is 100% public — the Department of Labor releases quarterly LCA disclosure datasets: every filing by every employer, searchable by job title and location. The problem is that nobody outside specialty firms and immigration lawyers actually parses it.

This post builds a personal salary research toolkit using public H1B filing data plus general compensation scrapers. You will end up with a spreadsheet showing what your target role pays by company, location, and seniority — the same data recruiters use to anchor their first offer. The mental model to adopt: you are not asking "what should I make?" — you are answering "what has this company actually paid people with my profile, as proven by their own federal filings, in the last 24 months?" That shift in framing alone changes how the conversation goes.

Why this is hard

Public H1B data is technically free. The DOL publishes Labor Condition Applications quarterly. But getting to usable insight is rough:

  1. Raw DOL data is 2-3 GB of CSV per quarter with inconsistent column names across years.
  2. Employer names are messy. "Amazon.com Services LLC", "Amazon Web Services Inc.", and "Amazon Dev Center U.S., Inc." are all Amazon.
  3. Prevailing wage != actual wage. The LCA lists the floor. Real TC (base + bonus + equity) is 20-60% higher.
  4. Location adjustments. An SF offer at $180k base is different from a Seattle $180k base when equity and cost-of-living differ.

  5. SOC code ambiguity. The Standard Occupational Classification is granular, but many employers file under a generic "Software Developers, Applications" (15-1252) regardless of the actual role. A data scientist and a backend engineer can look identical in the LCA.
  6. Wage-unit confusion. Most filings are annual, but a minority are filed hourly or monthly. Aggregating without normalizing to annual throws your bands off by orders of magnitude.

In short: you need to join LCA data with actual compensation data (Levels.fyi, Glassdoor, Blind) and salary benchmark APIs, and clean all of the above along the way.
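Wage-unit confusion is the cheapest of these to fix but the most damaging when missed. A minimal normalization sketch — the unit strings are assumptions about the raw data, not a guaranteed DOL schema:

```python
HOURS_PER_YEAR = 2080  # standard full-time assumption

def annualize(wage: float, unit: str) -> float:
    """Convert an LCA wage to an annual figure based on its wage unit."""
    unit = unit.strip().lower()
    if unit in ("year", "yr", "annual"):
        return wage
    if unit in ("hour", "hr"):
        return wage * HOURS_PER_YEAR
    if unit in ("week", "wk"):
        return wage * 52
    if unit in ("month", "mth"):
        return wage * 12
    raise ValueError(f"Unknown wage unit: {unit!r}")

print(annualize(75.0, "Hour"))  # 156000.0
```

Run this before any group-by, or an employer with a handful of hourly filings will wreck the percentiles.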

The architecture

[Target company + role]
          |
          v
 [h1b-visa-salary-search] --> LCA history, prevailing wages, counts
          |
          v
 [salary-data-search]     --> market compensation, bands, percentiles
          |
          v
     [Google Sheet]
          |
          v
 [Negotiation cheat sheet]

Step 1: Pull H1B filings for a target company

The h1b-visa-salary-search actor aggregates DOL LCA filings with fuzzy employer matching already done.

from apify_client import ApifyClient

client = ApifyClient("APIFY_TOKEN")  # replace with your Apify API token

run = client.actor("nexgendata/h1b-visa-salary-search").call(run_input={
    "employers": ["Stripe", "Databricks", "Anthropic"],
    "job_titles": ["Software Engineer", "Senior Software Engineer",
                   "Staff Software Engineer", "Machine Learning Engineer"],
    "years": [2024, 2025, 2026],
    "locations": ["San Francisco, CA", "New York, NY", "Seattle, WA"],
})

filings = list(client.dataset(run["defaultDatasetId"]).iterate_items())

Each record:

{
  "employer": "Stripe, Inc.",
  "job_title": "Software Engineer II",
  "base_salary": 210000,
  "wage_unit": "Year",
  "location": "San Francisco, CA",
  "filing_date": "2025-03-14",
  "case_status": "Certified",
  "soc_code": "15-1252",
  "prevailing_wage": 186300
}

Note both base_salary (what the employer committed to pay) and prevailing_wage (the DOL floor for the role and location). The gap is informative: a tight gap means the employer is paying close to the floor; a wide gap means they are comfortable paying well above it.
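To quantify that gap per filing, a one-liner helps (using the record shape shown above):

```python
def wage_premium(filing: dict) -> float:
    """Percent by which the committed base exceeds the DOL prevailing-wage floor."""
    return round((filing["base_salary"] / filing["prevailing_wage"] - 1) * 100, 1)

# With the Stripe record above: $210,000 committed vs. a $186,300 floor
print(wage_premium({"base_salary": 210000, "prevailing_wage": 186300}))  # 12.7
```

Anything consistently under ~5% across an employer's filings is a floor-payer signal; double digits suggests room to negotiate above the filed number.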

Step 2: Build band distributions

import pandas as pd
df = pd.DataFrame(filings)
bands = (df.groupby(["employer","job_title","location"])
           .base_salary.quantile([0.25,0.5,0.75])
           .unstack())
bands.columns = ["p25","p50","p75"]
print(bands)

Output:

                                                         p25      p50      p75
employer  job_title              location
Stripe    Software Engineer II   San Francisco, CA    205000   215000   228000
Stripe    Senior Engineer        San Francisco, CA    245000   260000   280000

That p50 is your floor: anything below the median of what the company has already filed for the same title is an instant counter.
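One caveat before quoting those numbers: drop thin groups, since a "band" built from two filings is noise. A small guard, shown here with synthetic data purely so the snippet is self-contained:

```python
import pandas as pd

# Synthetic filings, illustrative only
df = pd.DataFrame({
    "employer":    ["Stripe"] * 6 + ["AcmeCo"] * 2,
    "job_title":   ["Software Engineer II"] * 6 + ["Software Engineer"] * 2,
    "location":    ["San Francisco, CA"] * 8,
    "base_salary": [205000, 208000, 215000, 218000, 225000, 228000, 90000, 400000],
})

keys = ["employer", "job_title", "location"]
counts = df.groupby(keys).base_salary.count()
bands = df.groupby(keys).base_salary.quantile([0.25, 0.5, 0.75]).unstack()
bands.columns = ["p25", "p50", "p75"]

# Only trust bands backed by 5+ filings; AcmeCo's 2-filing "band" is dropped
trusted = bands[counts >= 5]
```

Five is an arbitrary cutoff; the point is to have one at all.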

Step 3: Cross-check with market comp

H1B base salary does not include bonus or equity. For total comp, pull market data via salary-data-search:

comp_run = client.actor("nexgendata/salary-data-search").call(run_input={
    "roles": ["Senior Software Engineer"],
    "companies": ["Stripe", "Databricks", "Anthropic"],
    "locations": ["San Francisco, CA"],
    "include_bonus": True,
    "include_equity": True,
})

comp = list(client.dataset(comp_run["defaultDatasetId"]).iterate_items())

Returns ranges like:

{
  "company": "Stripe",
  "role": "Senior Software Engineer",
  "level": "L4",
  "location": "San Francisco, CA",
  "base_range": [210000, 260000],
  "bonus_range": [20000, 40000],
  "equity_4yr": [400000, 700000],
  "sample_size": 82
}
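A record like that converts to an annual total-comp range in a few lines, amortizing equity over the 4-year vest (the usual convention):

```python
def tc_range(c: dict) -> tuple[int, int]:
    """Low/high annual total comp from a market-comp record."""
    low  = c["base_range"][0] + c["bonus_range"][0] + c["equity_4yr"][0] // 4
    high = c["base_range"][1] + c["bonus_range"][1] + c["equity_4yr"][1] // 4
    return low, high

record = {"base_range": [210000, 260000], "bonus_range": [20000, 40000],
          "equity_4yr": [400000, 700000]}
print(tc_range(record))  # (330000, 475000)
```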

Step 4: Compare offer vs. band

Quick negotiation sanity check:

def evaluate_offer(offer_base, offer_bonus, offer_equity_4yr, market):
    """Return the % gap between the offer's TC and the market-median TC."""
    mid = lambda r: (r[0] + r[1]) / 2
    tc_offer = offer_base + offer_bonus + offer_equity_4yr / 4
    tc_market_median = (mid(market["base_range"])
                        + mid(market["bonus_range"])
                        + mid(market["equity_4yr"]) / 4)
    return round((tc_offer / tc_market_median - 1) * 100, 1)

pct = evaluate_offer(225000, 25000, 400000, comp[0])
print(f"Offer is {pct:+.1f}% vs. the market median")  # e.g. -13.0% with the Stripe record above

If the result is negative, you have concrete data to push back with.

Here is a more complete end-to-end script that produces a Google-Sheets-ready CSV combining LCA + market comp for a list of target companies. Designed to be copy-pasted before a round of final-round interviews:

import csv, statistics
from apify_client import ApifyClient
from collections import defaultdict

client = ApifyClient("APIFY_TOKEN")

TARGETS = [
    ("Stripe", ["Software Engineer", "Senior Software Engineer", "Staff Software Engineer"]),
    ("Databricks", ["Software Engineer", "Senior Software Engineer"]),
    ("Anthropic", ["Member of Technical Staff"]),
]
LOCS = ["San Francisco, CA", "New York, NY", "Seattle, WA"]

rows = []
for company, titles in TARGETS:
    lca = client.actor("nexgendata/h1b-visa-salary-search").call(run_input={
        "employers": [company], "job_titles": titles,
        "years": [2024, 2025, 2026], "locations": LOCS,
    })
    comp = client.actor("nexgendata/salary-data-search").call(run_input={
        "companies": [company], "roles": titles, "locations": LOCS,
        "include_bonus": True, "include_equity": True,
    })

    lca_by_key = defaultdict(list)
    for f in client.dataset(lca["defaultDatasetId"]).iterate_items():
        if f.get("case_status") != "Certified" or f.get("wage_unit") != "Year":
            continue
        key = (f["employer"], f["job_title"], f["location"])
        lca_by_key[key].append(f["base_salary"])

    comp_by_key = {(c["company"], c["role"], c["location"]): c
                   for c in client.dataset(comp["defaultDatasetId"]).iterate_items()}

    for (emp, title, loc), salaries in lca_by_key.items():
        if len(salaries) < 3:
            continue
        c = comp_by_key.get((company, title, loc), {})
        rows.append({
            "company": emp, "title": title, "location": loc, "n_filings": len(salaries),
            "lca_p25": int(statistics.quantiles(salaries, n=4)[0]),
            "lca_p50": int(statistics.median(salaries)),
            "lca_p75": int(statistics.quantiles(salaries, n=4)[2]),
            "market_base_low": c.get("base_range", [None, None])[0],
            "market_base_high": c.get("base_range", [None, None])[1],
            "market_tc_midpoint": (
                sum(c["base_range"]) / 2
                + sum(c.get("bonus_range", [0, 0])) / 2
                + sum(c.get("equity_4yr", [0, 0])) / 8  # equity midpoint over a 4-year vest
            ) if c else None,
        })

if not rows:
    raise SystemExit("No rows collected - check company names and filters")

with open("negotiation_prep.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    w.writeheader(); w.writerows(rows)
print(f"Wrote {len(rows)} rows to negotiation_prep.csv")

Open that CSV in Google Sheets, conditional-format the columns, and you walk into the negotiation with a spreadsheet recruiters cannot easily argue with — it is their own employer's federal filings.

Step 5: Track employers with healthy sponsorship

For H1B visa holders, approval history matters as much as salary. The actor also returns per-employer filing counts; load the dataset into any SQL engine (DuckDB works well) and a query like this surfaces the active sponsors:

-- Top sponsors in your field (proxy for: they actually file)
SELECT employer, COUNT(*) AS filings, AVG(base_salary) AS avg_pay
FROM filings
WHERE job_title ILIKE '%machine learning%'
  AND case_status = 'Certified'
  AND filing_date > '2024-01-01'
GROUP BY employer
ORDER BY filings DESC
LIMIT 30;
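If you would rather stay in Python than load a SQL engine, the same aggregation is a few lines over the filings list:

```python
from collections import defaultdict

def top_sponsors(filings, title_substr="machine learning",
                 since="2024-01-01", limit=30):
    """Pure-Python equivalent of the query above: certified filings per employer,
    ranked by volume, with average base pay."""
    by_employer = defaultdict(list)
    for f in filings:
        if (title_substr in f["job_title"].lower()
                and f["case_status"] == "Certified"
                and f["filing_date"] > since):
            by_employer[f["employer"]].append(f["base_salary"])
    ranked = sorted(by_employer.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(emp, len(s), sum(s) / len(s)) for emp, s in ranked[:limit]]
```

String comparison works for the date filter here only because the dates are ISO-formatted (YYYY-MM-DD).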

Use cases

1. Pre-offer prep. Before signing, a senior engineer pulled their potential employer's last 24 months of LCAs. Found peers filed at +15% of their offered base. Negotiated and got it.

2. Targeting companies that actually sponsor. A mid-career PM filters for employers who have filed 20+ successful LCAs in the past 2 years. Saves months of dead-end applications.

3. Location arbitrage. A senior dev compared NYC vs. Austin vs. Seattle LCAs for the same role at the same employer. Austin came out ahead on effective comp.

4. Salary review for green card cases. Pulling your employer's LCA filings for people at your level is a strong anchor for "am I being underpaid" conversations.

5. PERM/I-140 prevailing wage awareness. For green card sponsorship, the employer files a PERM labor certification that references the DOL prevailing wage. Knowing your target prevailing wage level (I, II, III, IV) before conversations with your immigration attorney can flag potential downgrading, where an employer files at a lower level than your actual responsibilities warrant.

6. Relocation decision support. A senior engineer weighing a cross-country transfer used the toolkit to quantify the effective comp delta. LCA data showed her company paid equivalent titles 18% less in the Raleigh office than in the NYC office. She renegotiated the transfer offer upward by 12% before accepting.
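Use case 3 (location arbitrage) is simple to operationalize. The index values below are placeholders, not real cost-of-living figures — swap in whatever source you trust:

```python
# Hypothetical cost-of-living multipliers (SF = 1.00) - replace with real figures
COL_INDEX = {"San Francisco, CA": 1.00, "Austin, TX": 0.70, "Seattle, WA": 0.85}

def effective_comp(total_comp: float, location: str) -> int:
    """Total comp deflated to SF-equivalent dollars by a cost-of-living index."""
    return round(total_comp / COL_INDEX[location])

offers = {"San Francisco, CA": 420000, "Austin, TX": 330000, "Seattle, WA": 370000}
for loc, tc in offers.items():
    print(loc, effective_comp(tc, loc))
```

With these illustrative multipliers, a lower nominal offer in a cheaper metro can beat the SF number in purchasing-power terms.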

Pricing comparison

| Service | Monthly cost | LCA history | Market comp | Sponsor counts |
| --- | --- | --- | --- | --- |
| myvisajobs.com | Free (limited) | Yes | No | Yes |
| h1bdata.info | Free | Yes | No | Limited |
| Levels.fyi | Free + paid | No | Yes | No |
| Payscale | $29/month | No | Yes | No |
| Apify combo | ~$5 | Yes | Yes | Yes |

You get the DOL data that myvisajobs paywalls plus Levels-grade comp data in one pipeline for <$5/month at typical job-search volume.

Common pitfalls

LCA data is rich but full of edge cases. Here are the ones that regularly mislead people using it for negotiation:

  • LCA != offer letter. Companies file LCAs for ranges or multiple candidates. Treat as floor, not ceiling. The filed wage is often the minimum the employer is committing to — the real offer may be 10-30% higher, especially at senior levels.
  • Case-status filter. Only "Certified" filings are valid; "Denied" and "Withdrawn" are noise. Some public tools aggregate all statuses and skew distributions. Always filter with case_status IN ('Certified', 'Certified-Withdrawn') — the latter means the employer got the certification but withdrew it, usually because the candidate chose another offer, so it is still a valid data point.
  • Location field is messy. "San Francisco, CA" and "SAN FRANCISCO, CALIFORNIA" both appear. Normalize before grouping. Also beware metro-level filings: some employers file with a metro MSA (e.g., "San Jose-Sunnyvale-Santa Clara, CA") that needs a crosswalk to the city you care about.
  • Remote work complicates location. Post-pandemic, some LCAs list the employer's HQ location while the actual work is remote. Cross-reference with the employer's job postings if the location seems off.
  • Amendment filings inflate counts. An employer amending an existing H1B (title change, location change, material job-duty change) files a new LCA. These show up as new filings but do not represent a new hire. For "is this employer actively hiring" analysis, watch for duplicate beneficiary fingerprints.
  • Level of experience confounds title. "Software Engineer" at Company A might mean "L3 new-grad" while at Company B it means "L5 staff-engineer equivalent." Always cross-reference with levels.fyi or Blind mappings before concluding a company pays more/less.
  • The 2026 H1B wage floor rule changes. In late 2025, DOL proposed raising prevailing wage levels across the board. As of this writing the rule is partially implemented. Filings from Q1 2026 onward may reflect higher floors. Do not compare 2026 filings directly to 2023 without accounting for this.
  • OPT STEM salaries aren't in LCA data. If you are on F1 OPT, your salary is not in this data. LCAs only cover H1B/H1B1/E-3 workers. For OPT negotiation, rely on levels.fyi and Blind instead.
  • Prevailing wage levels I-IV. Level I is "entry"; Level IV is "fully competent." The level determines the minimum. An employer filing a senior engineer at Level II is legally questionable — that is a downgrading pattern flag.
  • Secondary filings (H1B transfers) look like new hires. They are not. A transfer from Company A to Company B generates a new LCA at Company B. For hiring-volume metrics, distinguish new employment from continuing employment and change-of-employer filings (the raw DOL disclosure files break these out as separate fields).
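For the location-messiness bullet above, a rough first pass looks like this — the state-name map is deliberately tiny, so extend it or use a real geocoding library:

```python
STATE_ABBREV = {"CALIFORNIA": "CA", "NEW YORK": "NY", "WASHINGTON": "WA", "TEXAS": "TX"}

def normalize_location(raw: str) -> str:
    """Collapse casing and state-name variants to a single 'City, ST' key."""
    city, _, state = raw.partition(",")
    state = state.strip().upper()
    return f"{city.strip().title()}, {STATE_ABBREV.get(state, state)}"

print(normalize_location("SAN FRANCISCO, CALIFORNIA"))  # San Francisco, CA
```

This deliberately does not handle metro-MSA filings; those need a crosswalk table, as noted above.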

How NexGenData handles this

The h1b-visa-salary-search actor is built to solve these problems rather than expose them:

Pre-normalized employer names. We run every raw DOL employer string through an entity-resolution pipeline that collapses variations. "AMAZON.COM SERVICES LLC", "Amazon Web Services, Inc.", and "Amazon Dev Center U.S., Inc." resolve to a canonical Amazon entity with a stable ID.
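For readers working from raw DOL CSVs instead of the actor, a crude first approximation of that entity resolution is possible — the suffix list here is illustrative, and real pipelines layer fuzzy matching on top:

```python
import re

# Common legal suffixes to strip (illustrative, not exhaustive)
LEGAL_SUFFIXES = re.compile(r"\b(inc|llc|corp|corporation|ltd|lp|plc)\b\.?")

def canonical_employer(raw: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes."""
    s = LEGAL_SUFFIXES.sub(" ", raw.lower())
    s = re.sub(r"[.,&]", " ", s)
    return re.sub(r"\s+", " ", s).strip()

print(canonical_employer("Stripe, Inc."))  # stripe
```

This collapses suffix noise ("Stripe, Inc." vs. "STRIPE INC") but will not merge genuinely different strings like "Amazon.com Services" and "Amazon Web Services" — that is where real entity resolution earns its keep.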

Wage-unit normalization. Hourly and monthly filings are converted to annualized equivalents with a standard assumption (2080 hours/year for hourly, 12 months for monthly) and flagged with a wage_source field so you can filter if you prefer.

Location normalization. All location strings are mapped to canonical MSA + state. You can group by MSA without messy string matching.

Status filtering by default. Denied and Withdrawn filings are excluded from default output. A flag lets you include them for specific analyses (e.g., "what employers have a high denial rate").

Fresh data weekly. We pull the quarterly DOL release and interpolate with weekly OFLC datasets. The data is rarely more than 5-7 days stale.

Pay-per-result pricing. A single company's 24-month history is typically under $1. Compare with myvisajobs.com's $99/year subscription — the actor is cheaper at any volume under 100 company queries per year.

Conclusion

The information asymmetry in hiring is entirely on the employer's side by default. A three-hour afternoon with these two actors flips it. You walk into the negotiation with percentiles, not vibes.

Start the toolkit with the two actors used above: h1b-visa-salary-search and salary-data-search.

FAQ

Is this legal to use for negotiation?
Absolutely. LCA data is public information the employer files with the federal government. You are allowed to cite it in negotiation. Most recruiters will not be surprised — some will be impressed you did the work.

What if my target company doesn't sponsor H1B?
Then LCA data will not help you, but salary-data-search still will. Use the market-comp side of the pipeline on its own.

Does H1B salary data reflect total compensation?
No. LCA only covers base salary. Bonus and equity are not disclosed. That is why you must cross-reference with market-comp data for total compensation.

How do I identify my "level" at a target company?
Use levels.fyi's role mapping. Most major tech companies have well-documented leveling guides on levels.fyi and in Blind's anonymous comp threads. Match your current scope of work and years of experience against their level descriptions, then filter LCAs for the corresponding title.

Can I use this data in a salary dispute with my current employer?
Yes, and it's a useful tool. If your current employer has filed LCAs for peers at your level at a higher wage than you make, that is a reasonable conversation to have in a comp review — though the cultural norms vary by company.

What about non-tech H1B roles?
The actor works for any SOC code. Filings are not tech-specific. Pharmaceutical scientists, financial analysts, civil engineers — all in the data. Adjust your SOC code filter accordingly.

How recent is the data?
Quarterly releases are about 45-60 days behind. Weekly OFLC updates are 7-14 days behind. For negotiation purposes, data within 3 months is considered current.

Is there an equivalent for L1, O1, or TN visas?
LCA data is specific to H1B/H1B1/E-3. L1 visa wages are not published. For O1 and TN, there is no equivalent public filing. These visa types require different research strategies (Glassdoor, levels.fyi, peer networks).
