DEV Community

NexGenData
NexGenData

Posted on • Originally published at thenextgennexus.com

How to Scrape and Analyze Job Market Data for Any City

Table of Contents

Toggle

How to Scrape and Analyze Job Market Data for Any City

Job market data is incredibly valuable but frustratingly fragmented. LinkedIn requires login and blocks scrapers. Indeed has no API. Glassdoor limits free access. Yet recruiters, career coaches, and workforce analysts desperately need real, up-to-date data: Which cities have the most opportunities? What skills are in highest demand? How have salaries shifted in the last quarter?

In this guide, I’ll show you how to build a Python system to scrape job postings from multiple sources (Indeed, LinkedIn via Google, ZipRecruiter, GitHub Jobs), analyze job market trends by city and role, extract salary data, and identify emerging skill requirements. This is the foundation for the Salary Benchmark Report, which helps job seekers, recruiters, and companies make data-driven compensation decisions.

Why Job Market Data Analysis Matters

Real-world applications:

  • Salary negotiation: Know exactly what your role pays in your city before accepting an offer. Data beats speculation.
  • Career planning: Identify which skills and roles have the highest demand and salary growth.
  • Recruitment insights: Understand the competitive landscape. Are you paying below market? How do competitors position roles?
  • Workforce analytics: Track job market trends by region. Where are jobs growing? Where are they declining?
  • Skill demand analysis: Which technologies, languages, and frameworks are most wanted? Invest in those skills.
  • HR strategy: Benchmark your compensation packages against market data to improve hiring and retention.

Architecture: Multi-Source Job Data Pipeline

The system ingests from multiple sources to avoid bias from any single job board:

  • Indeed: Largest job board. Scrape via search results and job detail pages.
  • ZipRecruiter: Strong data through their search API-like interface.
  • GitHub Jobs: Subset of tech roles with high quality (shutdown 2024, but archived data available).
  • Hacker News: Monthly hiring threads with direct company postings (high-quality signals).
  • LinkedIn via Google: Use site:linkedin.com searches to find postings without API.

Step 1: Set Up Dependencies


    pip install requests beautifulsoup4 selenium pandas sqlite3 schedule nltk textblob
Enter fullscreen mode Exit fullscreen mode

Create a database schema for job postings and analysis:


    import sqlite3
    from datetime import datetime

    def init_job_database(db_name='job_market.db'):
        """Create tables for jobs, skills, and analysis."""
        conn = sqlite3.connect(db_name)
        c = conn.cursor()

        # Job postings
        c.execute('''CREATE TABLE IF NOT EXISTS jobs
                     (id INTEGER PRIMARY KEY, title TEXT, company TEXT, 
                      location TEXT, salary TEXT, min_salary REAL, max_salary REAL,
                      description TEXT, url TEXT, source TEXT, posted_date TIMESTAMP,
                      scrape_date TIMESTAMP)''')

        # Extracted skills from job descriptions
        c.execute('''CREATE TABLE IF NOT EXISTS job_skills
                     (id INTEGER PRIMARY KEY, job_id INTEGER, skill TEXT, 
                      frequency INTEGER, FOREIGN KEY(job_id) REFERENCES jobs(id))''')

        # Salary and compensation tracking
        c.execute('''CREATE TABLE IF NOT EXISTS salary_data
                     (id INTEGER PRIMARY KEY, job_id INTEGER, title TEXT, 
                      location TEXT, salary_min REAL, salary_max REAL, 
                      currency TEXT, scrape_date TIMESTAMP,
                      FOREIGN KEY(job_id) REFERENCES jobs(id))''')

        # Market analysis aggregates
        c.execute('''CREATE TABLE IF NOT EXISTS market_analysis
                     (id INTEGER PRIMARY KEY, city TEXT, role TEXT, 
                      avg_salary REAL, median_salary REAL, job_count INTEGER,
                      top_skills TEXT, analysis_date TIMESTAMP)''')

        conn.commit()
        conn.close()
        print("Job market database initialized")

    init_job_database()
Enter fullscreen mode Exit fullscreen mode

Step 2: Scrape Indeed Job Listings


    import requests
    from bs4 import BeautifulSoup
    import time
    import re

    class IndeedJobScraper:
        def __init__(self):
            self.headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            self.base_url = 'https://www.indeed.com'

        def search_jobs(self, query, location, num_pages=5):
            """Search Indeed for jobs and extract results."""
            jobs = []

            for page in range(num_pages):
                url = f"{self.base_url}/jobs?q={query}&l;={location}&start;={page * 10}"

                try:
                    response = requests.get(url, headers=self.headers, timeout=10)
                    soup = BeautifulSoup(response.content, 'html.parser')

                    # Find all job cards
                    job_cards = soup.find_all('div', {'class': 'resultWithShelf'})

                    for card in job_cards:
                        try:
                            # Extract job title
                            title_elem = card.find('h2', {'class': 'jobTitle'})
                            title = title_elem.get_text(strip=True) if title_elem else 'Unknown'

                            # Extract company
                            company_elem = card.find('span', {'class': 'companyName'})
                            company = company_elem.get_text(strip=True) if company_elem else 'Unknown'

                            # Extract location
                            location_elem = card.find('div', {'class': 'companyLocation'})
                            location = location_elem.get_text(strip=True) if location_elem else 'Unknown'

                            # Extract salary if available
                            salary_elem = card.find('span', {'class': 'salary'})
                            salary = salary_elem.get_text(strip=True) if salary_elem else 'Not specified'

                            # Extract description snippet
                            desc_elem = card.find('div', {'class': 'job-snippet'})
                            description = desc_elem.get_text(strip=True) if desc_elem else ''

                            # Extract job URL
                            link_elem = card.find('a', {'class': 'jcs-JobTitle'})
                            job_url = link_elem.get('href') if link_elem else ''
                            if job_url and not job_url.startswith('http'):
                                job_url = self.base_url + job_url

                            jobs.append({
                                'title': title,
                                'company': company,
                                'location': location,
                                'salary': salary,
                                'description': description,
                                'url': job_url,
                                'source': 'Indeed'
                            })

                        except Exception as e:
                            print(f"Error parsing job card: {e}")
                            continue

                    # Be respectful - add delay between requests
                    time.sleep(2)

                except Exception as e:
                    print(f"Error fetching page {page}: {e}")
                    continue

            return jobs

        def get_job_details(self, job_url):
            """Fetch full job description from job detail page."""
            try:
                response = requests.get(job_url, headers=self.headers, timeout=10)
                soup = BeautifulSoup(response.content, 'html.parser')

                # Extract full description
                desc_elem = soup.find('div', {'id': 'jobDescriptionText'})
                full_description = desc_elem.get_text(strip=True) if desc_elem else ''

                return full_description
            except Exception as e:
                print(f"Error fetching job details: {e}")
                return ''

    # Example usage
    scraper = IndeedJobScraper()
    jobs = scraper.search_jobs('data scientist', 'New York, NY', num_pages=3)
    print(f"Found {len(jobs)} jobs")
    for job in jobs[:5]:
        print(f"\n{job['title']} at {job['company']}")
        print(f"  Location: {job['location']}")
        print(f"  Salary: {job['salary']}")
Enter fullscreen mode Exit fullscreen mode

Step 3: Extract and Normalize Salary Data


    import re

    def parse_salary(salary_string):
        """Extract min and max salary from text like '$50,000 - $80,000 a year'."""
        if not salary_string or salary_string == 'Not specified':
            return None, None

        # Find all currency amounts
        amounts = re.findall(r'\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)', salary_string)

        if len(amounts) >= 2:
            min_sal = float(amounts[0].replace(',', ''))
            max_sal = float(amounts[1].replace(',', ''))
            return min_sal, max_sal
        elif len(amounts) == 1:
            sal = float(amounts[0].replace(',', ''))
            return sal, sal

        return None, None

    def normalize_location(location_string):
        """Normalize location strings (e.g., 'New York, NY' -> 'New York')."""
        if not location_string:
            return 'Unknown'

        # Remove state abbreviations and remote indicators
        location = re.sub(r',\s*[A-Z]{2}$', '', location_string)
        location = re.sub(r'\(Remote\)', '', location).strip()

        return location.split(',')[0]  # Take first part before any comma

    def normalize_job_title(title_string):
        """Normalize job titles for grouping (e.g., 'Senior Data Scientist II' -> 'Data Scientist')."""
        title = title_string.lower()

        # Standardize common patterns
        title = re.sub(r'(junior|senior|lead|staff|principal|sr|jr)\.?\s*', '', title)
        title = re.sub(r'\s*(i{1,3}|iv|v)$', '', title)  # Remove Roman numerals
        title = re.sub(r'\s+', ' ', title).strip()

        return title.title()

    def store_jobs_in_db(jobs, db_name='job_market.db'):
        """Store job data in database with normalized fields."""
        conn = sqlite3.connect(db_name)
        c = conn.cursor()

        for job in jobs:
            min_sal, max_sal = parse_salary(job.get('salary', ''))
            location = normalize_location(job.get('location', ''))
            title = normalize_job_title(job.get('title', ''))

            c.execute('''INSERT INTO jobs 
                         (title, company, location, salary, min_salary, max_salary, 
                          description, url, source, scrape_date)
                         VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''',
                      (title, job.get('company'), location, job.get('salary'),
                       min_sal, max_sal, job.get('description'), job.get('url'),
                       job.get('source'), datetime.now()))

        conn.commit()
        conn.close()
        print(f"Stored {len(jobs)} jobs in database")

    # Store the scraped jobs
    store_jobs_in_db(jobs)
Enter fullscreen mode Exit fullscreen mode

Step 4: Extract Skills from Job Descriptions


    # Common tech skills and keywords to extract
    TECH_SKILLS = {
        # Programming languages
        'Python', 'Java', 'JavaScript', 'C++', 'C#', 'Go', 'Rust', 'Ruby', 'PHP', 'Swift',
        'Kotlin', 'Scala', 'R', 'MATLAB', 'SQL', 'TypeScript', 'Dart', 'Haskell',

        # Data & ML
        'Machine Learning', 'Deep Learning', 'TensorFlow', 'PyTorch', 'Scikit-learn',
        'Pandas', 'NumPy', 'Keras', 'XGBoost', 'Data Analysis', 'Statistics',
        'Big Data', 'Hadoop', 'Spark', 'Tableau', 'Power BI', 'Analytics',

        # Cloud & DevOps
        'AWS', 'Azure', 'Google Cloud', 'Kubernetes', 'Docker', 'Terraform',
        'Jenkins', 'GitLab', 'GitHub', 'Linux', 'DevOps', 'CI/CD',

        # Databases
        'PostgreSQL', 'MySQL', 'MongoDB', 'Cassandra', 'Redis', 'DynamoDB',
        'Elasticsearch', 'Firebase', 'Oracle', 'SQL Server',

        # Frameworks & Libraries
        'React', 'Angular', 'Vue.js', 'Django', 'Flask', 'Spring Boot', 'Node.js',
        'Express', 'FastAPI', 'GraphQL', 'REST API'
    }

    def extract_skills_from_description(description):
        """Extract known tech skills from job description."""
        if not description:
            return []

        description_lower = description.lower()
        found_skills = []

        for skill in TECH_SKILLS:
            if skill.lower() in description_lower:
                found_skills.append(skill)

        return found_skills

    def store_skills_in_db(db_name='job_market.db'):
        """Extract and store skills for all jobs in database."""
        conn = sqlite3.connect(db_name)
        c = conn.cursor()

        # Get all jobs without skills extracted
        c.execute('SELECT id, description FROM jobs WHERE id NOT IN (SELECT DISTINCT job_id FROM job_skills)')
        jobs = c.fetchall()

        for job_id, description in jobs:
            skills = extract_skills_from_description(description)

            for skill in skills:
                c.execute('''INSERT INTO job_skills (job_id, skill, frequency)
                             VALUES (?, ?, 1)
                             ON CONFLICT(job_id, skill) DO UPDATE SET frequency = frequency + 1''')

        conn.commit()
        conn.close()
        print(f"Extracted skills for {len(jobs)} jobs")

    store_skills_in_db()
Enter fullscreen mode Exit fullscreen mode

Step 5: Analyze Job Market Trends


    import pandas as pd
    from collections import Counter

    def analyze_job_market(city, role_keyword=None, db_name='job_market.db'):
        """Generate market analysis for a city and optional role."""
        conn = sqlite3.connect(db_name)

        # Build query
        query = 'SELECT * FROM jobs WHERE location = ?'
        params = [city]

        if role_keyword:
            query += ' AND title LIKE ?'
            params.append(f'%{role_keyword}%')

        df = pd.read_sql_query(query, conn, params=params)

        if df.empty:
            print(f"No jobs found for {city}")
            return None

        # Salary analysis
        salaries = df[df['min_salary'].notna()]['min_salary'].tolist()
        max_salaries = df[df['max_salary'].notna()]['max_salary'].tolist()

        avg_min_salary = sum(salaries) / len(salaries) if salaries else 0
        avg_max_salary = sum(max_salaries) / len(max_salaries) if max_salaries else 0

        # Skills analysis - get top 10 skills
        skills_query = '''SELECT skill, COUNT(*) as count FROM job_skills js
                          JOIN jobs j ON js.job_id = j.id
                          WHERE j.location = ?'''
        skills_params = [city]

        if role_keyword:
            skills_query += ' AND j.title LIKE ?'
            skills_params.append(f'%{role_keyword}%')

        skills_query += ' GROUP BY skill ORDER BY count DESC LIMIT 10'
        skills_df = pd.read_sql_query(skills_query, conn, params=skills_params)

        top_skills = ', '.join(skills_df['skill'].tolist()) if not skills_df.empty else 'N/A'

        analysis = {
            'city': city,
            'role': role_keyword or 'All',
            'job_count': len(df),
            'avg_min_salary': round(avg_min_salary, 2),
            'avg_max_salary': round(avg_max_salary, 2),
            'salary_range': f"${avg_min_salary:,.0f} - ${avg_max_salary:,.0f}",
            'top_skills': top_skills,
            'top_companies': df['company'].value_counts().head(5).to_dict()
        }

        conn.close()
        return analysis

    # Generate analysis
    analysis = analyze_job_market('New York', 'Data Scientist')

    if analysis:
        print(f"\n=== Job Market Analysis for {analysis['city']} ===")
        print(f"Role: {analysis['role']}")
        print(f"Total Jobs: {analysis['job_count']}")
        print(f"Salary Range: {analysis['salary_range']}")
        print(f"\nTop Skills:")
        print(analysis['top_skills'])
        print(f"\nTop Companies:")
        for company, count in analysis['top_companies'].items():
            print(f"  {company}: {count} openings")
Enter fullscreen mode Exit fullscreen mode

Step 6: Build Comparative Analysis


    def compare_markets(cities, role_keyword, db_name='job_market.db'):
        """Compare job markets across multiple cities."""
        results = []

        for city in cities:
            analysis = analyze_job_market(city, role_keyword, db_name)
            if analysis:
                results.append(analysis)

        # Create comparison dataframe
        comparison_df = pd.DataFrame(results)

        print(f"\n=== Comparison: {role_keyword} Across Cities ===")
        print(comparison_df[['city', 'job_count', 'avg_min_salary', 'avg_max_salary']].to_string(index=False))

        # Find best market
        if not comparison_df.empty:
            best_salary_city = comparison_df.loc[comparison_df['avg_max_salary'].idxmax()]
            most_jobs_city = comparison_df.loc[comparison_df['job_count'].idxmax()]

            print(f"\nBest Salary: {best_salary_city['city']} (${best_salary_city['avg_max_salary']:,.0f})")
            print(f"Most Opportunities: {most_jobs_city['city']} ({most_jobs_city['job_count']} jobs)")

        return comparison_df

    # Compare data scientist salaries across cities
    cities = ['New York', 'San Francisco', 'Seattle', 'Austin', 'Boston']
    comparison = compare_markets(cities, 'Data Scientist')
Enter fullscreen mode Exit fullscreen mode

Advanced: Track Salary Growth Over Time


    def salary_trends(city, role_keyword, days=90, db_name='job_market.db'):
        """Analyze salary trends over time for a role in a city."""
        conn = sqlite3.connect(db_name)

        query = '''SELECT DATE(scrape_date) as date, 
                          AVG(min_salary) as avg_min, 
                          AVG(max_salary) as avg_max,
                          COUNT(*) as job_count
                   FROM jobs
                   WHERE location = ? AND title LIKE ?
                   AND scrape_date > datetime('now', ?)
                   GROUP BY DATE(scrape_date)
                   ORDER BY date'''

        df = pd.read_sql_query(query, conn, params=[city, f'%{role_keyword}%', f'-{days} days'])
        conn.close()

        if df.empty:
            print(f"No historical data for {city}")
            return None

        print(f"\nSalary Trend for {role_keyword} in {city} (Last {days} days):")
        print(df.to_string(index=False))

        return df

    # Track trends
    trends = salary_trends('San Francisco', 'Software Engineer', days=30)
Enter fullscreen mode Exit fullscreen mode

Use Cases

  • Job seekers: Negotiate with data. Know your market rate before interviews.
  • Career coaches: Give clients data-driven guidance on salary expectations and in-demand skills.
  • Recruiters: Benchmark your offers. Are you competitive?
  • HR teams: Make evidence-based compensation decisions.
  • Economists: Track labor market trends and regional employment patterns.

Get Complete Market Insights with the Salary Benchmark Report

Maintaining a job market scraper across multiple sources is complex. The Salary Benchmark Report ($14) provides:

  • Real-time job data from 10+ sources (Indeed, LinkedIn, Glassdoor, GitHub, HN)
  • Salary analysis by city, role, experience level, and industry
  • Skill demand rankings and trend analysis
  • Company hiring patterns and growth indicators
  • CSV export for deeper analysis
  • Monthly updated data covering 100+ US cities

Get the Salary Benchmark Report →


About the Author

The Next Gen Nexus covers AI agents, automation, and web data — practical guides for developers, analysts, and businesses working with data at scale.


Try Apify free — get $5 in platform credit (no credit card required) and run this scraper plus 30,000+ others. Sign up here →

Top comments (0)