Table of Contents
Toggle
-
How to Scrape and Analyze Job Market Data for Any City
- Why Job Market Data Analysis Matters
- Architecture: Multi-Source Job Data Pipeline
- Step 1: Set Up Dependencies
- Step 2: Scrape Indeed Job Listings
- Step 3: Extract and Normalize Salary Data
- Step 4: Extract Skills from Job Descriptions
- Step 5: Analyze Job Market Trends
- Step 6: Build Comparative Analysis
- Advanced: Track Salary Growth Over Time
- Use Cases
- Get Complete Market Insights with the Salary Benchmark Report
How to Scrape and Analyze Job Market Data for Any City
Job market data is incredibly valuable but frustratingly fragmented. LinkedIn requires login and blocks scrapers. Indeed has no API. Glassdoor limits free access. Yet recruiters, career coaches, and workforce analysts desperately need real, up-to-date data: Which cities have the most opportunities? What skills are in highest demand? How have salaries shifted in the last quarter?
In this guide, I’ll show you how to build a Python system to scrape job postings from multiple sources (Indeed, LinkedIn via Google, ZipRecruiter, GitHub Jobs), analyze job market trends by city and role, extract salary data, and identify emerging skill requirements. This is the foundation for the Salary Benchmark Report, which helps job seekers, recruiters, and companies make data-driven compensation decisions.
Why Job Market Data Analysis Matters
Real-world applications:
- Salary negotiation: Know exactly what your role pays in your city before accepting an offer. Data beats speculation.
- Career planning: Identify which skills and roles have the highest demand and salary growth.
- Recruitment insights: Understand the competitive landscape. Are you paying below market? How do competitors position roles?
- Workforce analytics: Track job market trends by region. Where are jobs growing? Where are they declining?
- Skill demand analysis: Which technologies, languages, and frameworks are most wanted? Invest in those skills.
- HR strategy: Benchmark your compensation packages against market data to improve hiring and retention.
Architecture: Multi-Source Job Data Pipeline
The system ingests from multiple sources to avoid bias from any single job board:
- Indeed: Largest job board. Scrape via search results and job detail pages.
- ZipRecruiter: Strong data through their search API-like interface.
- GitHub Jobs: Subset of tech roles with high quality (shutdown 2024, but archived data available).
- Hacker News: Monthly hiring threads with direct company postings (high-quality signals).
-
LinkedIn via Google: Use
site:linkedin.comsearches to find postings without API.
Step 1: Set Up Dependencies
pip install requests beautifulsoup4 selenium pandas sqlite3 schedule nltk textblob
Create a database schema for job postings and analysis:
import sqlite3
from datetime import datetime
def init_job_database(db_name='job_market.db'):
"""Create tables for jobs, skills, and analysis."""
conn = sqlite3.connect(db_name)
c = conn.cursor()
# Job postings
c.execute('''CREATE TABLE IF NOT EXISTS jobs
(id INTEGER PRIMARY KEY, title TEXT, company TEXT,
location TEXT, salary TEXT, min_salary REAL, max_salary REAL,
description TEXT, url TEXT, source TEXT, posted_date TIMESTAMP,
scrape_date TIMESTAMP)''')
# Extracted skills from job descriptions
c.execute('''CREATE TABLE IF NOT EXISTS job_skills
(id INTEGER PRIMARY KEY, job_id INTEGER, skill TEXT,
frequency INTEGER, FOREIGN KEY(job_id) REFERENCES jobs(id))''')
# Salary and compensation tracking
c.execute('''CREATE TABLE IF NOT EXISTS salary_data
(id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT,
location TEXT, salary_min REAL, salary_max REAL,
currency TEXT, scrape_date TIMESTAMP,
FOREIGN KEY(job_id) REFERENCES jobs(id))''')
# Market analysis aggregates
c.execute('''CREATE TABLE IF NOT EXISTS market_analysis
(id INTEGER PRIMARY KEY, city TEXT, role TEXT,
avg_salary REAL, median_salary REAL, job_count INTEGER,
top_skills TEXT, analysis_date TIMESTAMP)''')
conn.commit()
conn.close()
print("Job market database initialized")
init_job_database()
Step 2: Scrape Indeed Job Listings
import requests
from bs4 import BeautifulSoup
import time
import re
class IndeedJobScraper:
def __init__(self):
self.headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
self.base_url = 'https://www.indeed.com'
def search_jobs(self, query, location, num_pages=5):
"""Search Indeed for jobs and extract results."""
jobs = []
for page in range(num_pages):
url = f"{self.base_url}/jobs?q={query}&l;={location}&start;={page * 10}"
try:
response = requests.get(url, headers=self.headers, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')
# Find all job cards
job_cards = soup.find_all('div', {'class': 'resultWithShelf'})
for card in job_cards:
try:
# Extract job title
title_elem = card.find('h2', {'class': 'jobTitle'})
title = title_elem.get_text(strip=True) if title_elem else 'Unknown'
# Extract company
company_elem = card.find('span', {'class': 'companyName'})
company = company_elem.get_text(strip=True) if company_elem else 'Unknown'
# Extract location
location_elem = card.find('div', {'class': 'companyLocation'})
location = location_elem.get_text(strip=True) if location_elem else 'Unknown'
# Extract salary if available
salary_elem = card.find('span', {'class': 'salary'})
salary = salary_elem.get_text(strip=True) if salary_elem else 'Not specified'
# Extract description snippet
desc_elem = card.find('div', {'class': 'job-snippet'})
description = desc_elem.get_text(strip=True) if desc_elem else ''
# Extract job URL
link_elem = card.find('a', {'class': 'jcs-JobTitle'})
job_url = link_elem.get('href') if link_elem else ''
if job_url and not job_url.startswith('http'):
job_url = self.base_url + job_url
jobs.append({
'title': title,
'company': company,
'location': location,
'salary': salary,
'description': description,
'url': job_url,
'source': 'Indeed'
})
except Exception as e:
print(f"Error parsing job card: {e}")
continue
# Be respectful - add delay between requests
time.sleep(2)
except Exception as e:
print(f"Error fetching page {page}: {e}")
continue
return jobs
def get_job_details(self, job_url):
"""Fetch full job description from job detail page."""
try:
response = requests.get(job_url, headers=self.headers, timeout=10)
soup = BeautifulSoup(response.content, 'html.parser')
# Extract full description
desc_elem = soup.find('div', {'id': 'jobDescriptionText'})
full_description = desc_elem.get_text(strip=True) if desc_elem else ''
return full_description
except Exception as e:
print(f"Error fetching job details: {e}")
return ''
# Example usage
scraper = IndeedJobScraper()
jobs = scraper.search_jobs('data scientist', 'New York, NY', num_pages=3)
print(f"Found {len(jobs)} jobs")
for job in jobs[:5]:
print(f"\n{job['title']} at {job['company']}")
print(f" Location: {job['location']}")
print(f" Salary: {job['salary']}")
Step 3: Extract and Normalize Salary Data
import re
def parse_salary(salary_string):
"""Extract min and max salary from text like '$50,000 - $80,000 a year'."""
if not salary_string or salary_string == 'Not specified':
return None, None
# Find all currency amounts
amounts = re.findall(r'\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)', salary_string)
if len(amounts) >= 2:
min_sal = float(amounts[0].replace(',', ''))
max_sal = float(amounts[1].replace(',', ''))
return min_sal, max_sal
elif len(amounts) == 1:
sal = float(amounts[0].replace(',', ''))
return sal, sal
return None, None
def normalize_location(location_string):
"""Normalize location strings (e.g., 'New York, NY' -> 'New York')."""
if not location_string:
return 'Unknown'
# Remove state abbreviations and remote indicators
location = re.sub(r',\s*[A-Z]{2}$', '', location_string)
location = re.sub(r'\(Remote\)', '', location).strip()
return location.split(',')[0] # Take first part before any comma
def normalize_job_title(title_string):
"""Normalize job titles for grouping (e.g., 'Senior Data Scientist II' -> 'Data Scientist')."""
title = title_string.lower()
# Standardize common patterns
title = re.sub(r'(junior|senior|lead|staff|principal|sr|jr)\.?\s*', '', title)
title = re.sub(r'\s*(i{1,3}|iv|v)$', '', title) # Remove Roman numerals
title = re.sub(r'\s+', ' ', title).strip()
return title.title()
def store_jobs_in_db(jobs, db_name='job_market.db'):
"""Store job data in database with normalized fields."""
conn = sqlite3.connect(db_name)
c = conn.cursor()
for job in jobs:
min_sal, max_sal = parse_salary(job.get('salary', ''))
location = normalize_location(job.get('location', ''))
title = normalize_job_title(job.get('title', ''))
c.execute('''INSERT INTO jobs
(title, company, location, salary, min_salary, max_salary,
description, url, source, scrape_date)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''',
(title, job.get('company'), location, job.get('salary'),
min_sal, max_sal, job.get('description'), job.get('url'),
job.get('source'), datetime.now()))
conn.commit()
conn.close()
print(f"Stored {len(jobs)} jobs in database")
# Store the scraped jobs
store_jobs_in_db(jobs)
Step 4: Extract Skills from Job Descriptions
# Common tech skills and keywords to extract
TECH_SKILLS = {
# Programming languages
'Python', 'Java', 'JavaScript', 'C++', 'C#', 'Go', 'Rust', 'Ruby', 'PHP', 'Swift',
'Kotlin', 'Scala', 'R', 'MATLAB', 'SQL', 'TypeScript', 'Dart', 'Haskell',
# Data & ML
'Machine Learning', 'Deep Learning', 'TensorFlow', 'PyTorch', 'Scikit-learn',
'Pandas', 'NumPy', 'Keras', 'XGBoost', 'Data Analysis', 'Statistics',
'Big Data', 'Hadoop', 'Spark', 'Tableau', 'Power BI', 'Analytics',
# Cloud & DevOps
'AWS', 'Azure', 'Google Cloud', 'Kubernetes', 'Docker', 'Terraform',
'Jenkins', 'GitLab', 'GitHub', 'Linux', 'DevOps', 'CI/CD',
# Databases
'PostgreSQL', 'MySQL', 'MongoDB', 'Cassandra', 'Redis', 'DynamoDB',
'Elasticsearch', 'Firebase', 'Oracle', 'SQL Server',
# Frameworks & Libraries
'React', 'Angular', 'Vue.js', 'Django', 'Flask', 'Spring Boot', 'Node.js',
'Express', 'FastAPI', 'GraphQL', 'REST API'
}
def extract_skills_from_description(description):
"""Extract known tech skills from job description."""
if not description:
return []
description_lower = description.lower()
found_skills = []
for skill in TECH_SKILLS:
if skill.lower() in description_lower:
found_skills.append(skill)
return found_skills
def store_skills_in_db(db_name='job_market.db'):
"""Extract and store skills for all jobs in database."""
conn = sqlite3.connect(db_name)
c = conn.cursor()
# Get all jobs without skills extracted
c.execute('SELECT id, description FROM jobs WHERE id NOT IN (SELECT DISTINCT job_id FROM job_skills)')
jobs = c.fetchall()
for job_id, description in jobs:
skills = extract_skills_from_description(description)
for skill in skills:
c.execute('''INSERT INTO job_skills (job_id, skill, frequency)
VALUES (?, ?, 1)
ON CONFLICT(job_id, skill) DO UPDATE SET frequency = frequency + 1''')
conn.commit()
conn.close()
print(f"Extracted skills for {len(jobs)} jobs")
store_skills_in_db()
Step 5: Analyze Job Market Trends
import pandas as pd
from collections import Counter
def analyze_job_market(city, role_keyword=None, db_name='job_market.db'):
"""Generate market analysis for a city and optional role."""
conn = sqlite3.connect(db_name)
# Build query
query = 'SELECT * FROM jobs WHERE location = ?'
params = [city]
if role_keyword:
query += ' AND title LIKE ?'
params.append(f'%{role_keyword}%')
df = pd.read_sql_query(query, conn, params=params)
if df.empty:
print(f"No jobs found for {city}")
return None
# Salary analysis
salaries = df[df['min_salary'].notna()]['min_salary'].tolist()
max_salaries = df[df['max_salary'].notna()]['max_salary'].tolist()
avg_min_salary = sum(salaries) / len(salaries) if salaries else 0
avg_max_salary = sum(max_salaries) / len(max_salaries) if max_salaries else 0
# Skills analysis - get top 10 skills
skills_query = '''SELECT skill, COUNT(*) as count FROM job_skills js
JOIN jobs j ON js.job_id = j.id
WHERE j.location = ?'''
skills_params = [city]
if role_keyword:
skills_query += ' AND j.title LIKE ?'
skills_params.append(f'%{role_keyword}%')
skills_query += ' GROUP BY skill ORDER BY count DESC LIMIT 10'
skills_df = pd.read_sql_query(skills_query, conn, params=skills_params)
top_skills = ', '.join(skills_df['skill'].tolist()) if not skills_df.empty else 'N/A'
analysis = {
'city': city,
'role': role_keyword or 'All',
'job_count': len(df),
'avg_min_salary': round(avg_min_salary, 2),
'avg_max_salary': round(avg_max_salary, 2),
'salary_range': f"${avg_min_salary:,.0f} - ${avg_max_salary:,.0f}",
'top_skills': top_skills,
'top_companies': df['company'].value_counts().head(5).to_dict()
}
conn.close()
return analysis
# Generate analysis
analysis = analyze_job_market('New York', 'Data Scientist')
if analysis:
print(f"\n=== Job Market Analysis for {analysis['city']} ===")
print(f"Role: {analysis['role']}")
print(f"Total Jobs: {analysis['job_count']}")
print(f"Salary Range: {analysis['salary_range']}")
print(f"\nTop Skills:")
print(analysis['top_skills'])
print(f"\nTop Companies:")
for company, count in analysis['top_companies'].items():
print(f" {company}: {count} openings")
Step 6: Build Comparative Analysis
def compare_markets(cities, role_keyword, db_name='job_market.db'):
"""Compare job markets across multiple cities."""
results = []
for city in cities:
analysis = analyze_job_market(city, role_keyword, db_name)
if analysis:
results.append(analysis)
# Create comparison dataframe
comparison_df = pd.DataFrame(results)
print(f"\n=== Comparison: {role_keyword} Across Cities ===")
print(comparison_df[['city', 'job_count', 'avg_min_salary', 'avg_max_salary']].to_string(index=False))
# Find best market
if not comparison_df.empty:
best_salary_city = comparison_df.loc[comparison_df['avg_max_salary'].idxmax()]
most_jobs_city = comparison_df.loc[comparison_df['job_count'].idxmax()]
print(f"\nBest Salary: {best_salary_city['city']} (${best_salary_city['avg_max_salary']:,.0f})")
print(f"Most Opportunities: {most_jobs_city['city']} ({most_jobs_city['job_count']} jobs)")
return comparison_df
# Compare data scientist salaries across cities
cities = ['New York', 'San Francisco', 'Seattle', 'Austin', 'Boston']
comparison = compare_markets(cities, 'Data Scientist')
Advanced: Track Salary Growth Over Time
def salary_trends(city, role_keyword, days=90, db_name='job_market.db'):
"""Analyze salary trends over time for a role in a city."""
conn = sqlite3.connect(db_name)
query = '''SELECT DATE(scrape_date) as date,
AVG(min_salary) as avg_min,
AVG(max_salary) as avg_max,
COUNT(*) as job_count
FROM jobs
WHERE location = ? AND title LIKE ?
AND scrape_date > datetime('now', ?)
GROUP BY DATE(scrape_date)
ORDER BY date'''
df = pd.read_sql_query(query, conn, params=[city, f'%{role_keyword}%', f'-{days} days'])
conn.close()
if df.empty:
print(f"No historical data for {city}")
return None
print(f"\nSalary Trend for {role_keyword} in {city} (Last {days} days):")
print(df.to_string(index=False))
return df
# Track trends
trends = salary_trends('San Francisco', 'Software Engineer', days=30)
Use Cases
- Job seekers: Negotiate with data. Know your market rate before interviews.
- Career coaches: Give clients data-driven guidance on salary expectations and in-demand skills.
- Recruiters: Benchmark your offers. Are you competitive?
- HR teams: Make evidence-based compensation decisions.
- Economists: Track labor market trends and regional employment patterns.
Get Complete Market Insights with the Salary Benchmark Report
Maintaining a job market scraper across multiple sources is complex. The Salary Benchmark Report ($14) provides:
- Real-time job data from 10+ sources (Indeed, LinkedIn, Glassdoor, GitHub, HN)
- Salary analysis by city, role, experience level, and industry
- Skill demand rankings and trend analysis
- Company hiring patterns and growth indicators
- CSV export for deeper analysis
- Monthly updated data covering 100+ US cities
Get the Salary Benchmark Report →
About the Author
The Next Gen Nexus covers AI agents, automation, and web data — practical guides for developers, analysts, and businesses working with data at scale.
Try Apify free — get $5 in platform credit (no credit card required) and run this scraper plus 30,000+ others. Sign up here →
Top comments (0)