How to figure out what skills are being hired for right now, programmatically

#beginners #scraping #ruby #jobs

Ever been curious which skills are really in demand on job boards? We're programmers, so why not build a scraper to help figure that out?

Here's the type of output we want to get:

{
  "javascript": 648,
  "react": 442,
  "java": 382,
  "agile": 345,
  "cloud": 309,
  "css": 305,
  "python": 301,
  "apis": 243,
  "sql": 241
}

Prerequisites

To get started, you'll want to follow this guide, which will walk you through getting your environment set up and some basic scraping working.

Getting the job URLs

Following the same idea that the guide shows us, here's how we'll lay out our code. This first file collects the URLs of the individual job listings that we'll pull the words out of. I opted to split the work into two different files in case either step ran into errors. This file puts the URL of each job listing into a JSON file that we'll use in the next file.

I put in a few sample search result URLs; feel free to change the search terms or add as many as you'd like. The run I ended up doing used about 80 URLs with a few different search terms.

# indeed_url.rb
require 'kimurai'
require 'selenium-webdriver'
require 'json'

class Indeed < Kimurai::Base
  @name = 'indeed_scrape'
  @start_urls = [
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY', 
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=10', 
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=20',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=30',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=40',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=50',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=60',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=70',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=80',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=90',
  ]
  @engine = :selenium_chrome
  @@jobs = []

  def scrape_page
    # Update response to current response after interaction with a browser
    doc = browser.current_response
    # browser.save_screenshot
    sleep 2

    while doc.css('div.jobsearch-SerpJobCard')[0]
      # each pass handles the first remaining job card and then removes it,
      # so the loop ends once every listing on the page has been collected
      doc = browser.current_response
      # get first job listing
      single_job = doc.css('div.jobsearch-SerpJobCard')[0]
      # get job information
      job_url = single_job.css('a.jobtitle').attribute('href')
      job_url = 'https://indeed.com' + job_url

      puts ' ===== '
      puts job_url
      puts " ===== "

      @@jobs << job_url unless @@jobs.include?(job_url)
      # remove the card from the parsed copy and from the live page
      doc.css('div.jobsearch-SerpJobCard')[0].remove
      browser.execute_script("document.querySelector('div.jobsearch-SerpJobCard').remove()")
      sleep 0.1
    end
  end

  def parse(response, url:, data: {})
    scrape_page
    File.open("tmp/indeed_jobs_urls.json","w") do |f|
      f.write(JSON.pretty_generate(@@jobs))
    end

    @@jobs
  end
end

Indeed.crawl!
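
The hard-coded search URLs above all follow one pattern: the same query and location, with `start` going up by 10 per page. As a rough sketch, the list could be generated instead of typed out. The `search_urls` helper below is hypothetical (it is not part of Kimurai), and it assumes the step of 10 seen in the sample URLs holds for every page.

```ruby
require 'uri'

# Hypothetical helper (not part of Kimurai): builds paginated search URLs.
# Assumes the `start` parameter advances in steps of 10, as in the list above.
def search_urls(query:, location:, pages: 10)
  (0...pages).map do |page|
    params = { q: query, l: location }
    # the first page has no `start` parameter in the sample URLs
    params[:start] = page * 10 if page.positive?
    "https://www.indeed.com/jobs?#{URI.encode_www_form(params)}"
  end
end

urls = search_urls(query: 'full stack developer', location: 'New York, NY', pages: 3)
puts urls
```

Note that `URI.encode_www_form` writes spaces as `+` rather than `%20`, which should be equivalent inside a query string. If the helper is defined above the class, something like `@start_urls = search_urls(query: 'full stack developer', location: 'New York, NY')` could replace the literal list.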

Getting the words

Now that we have the URLs to parse, let's go through each individual posting and see which words pop up the most.

# indeed_posting.rb
require 'kimurai'
require 'json'

# this loads each url from the JSON file and pulls the description,
# removes all punctuation and converts it all to lowercase
# then, throw each word into a hash for JSON

class JobScraper < Kimurai::Base
  @name = 'indeed_scrape'
  @start_urls = JSON.parse(File.read("tmp/indeed_jobs_urls.json"))
  @engine = :selenium_chrome

  @@word_count = {}

  def scrape_page
    sleep 2
    doc = browser.current_response
    job_desc = doc.css('div.jobsearch-jobDescriptionText').text.gsub(/[[:punct:]]/, '').downcase
    job_array = job_desc.split(' ')
    job_array.each do |word|
      @@word_count[word] = @@word_count.fetch(word, 0) + 1
    end

    puts @@word_count

  end

  def parse(response, url:, data: {})
    scrape_page
    # sort words by descending count, then turn the pairs back into a hash
    sorted_counts = @@word_count.sort_by { |_word, count| -count }.to_h

    File.open("tmp/new_sorted_skills.json","w") do |f|
      f.write(JSON.pretty_generate(sorted_counts))
    end

  end
end

JobScraper.crawl!
puts 'done scraping'
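
To see what the cleaning and counting steps do without launching a browser, here's a small standalone sketch that runs the same `gsub`, `downcase` and `split` chain on a string. The sample text is made up for illustration; in the scraper the input is the job description pulled from the page.

```ruby
# Made-up sample text standing in for a scraped job description
sample = 'React, React and SQL. Experience with SQL/APIs; React!'

# Same cleaning as the scraper: drop punctuation, lowercase, split on spaces
words = sample.gsub(/[[:punct:]]/, '').downcase.split(' ')

# Count each word, then sort by descending count
counts = Hash.new(0)
words.each { |word| counts[word] += 1 }
sorted = counts.sort_by { |_word, count| -count }.to_h

puts sorted
```

One thing this makes visible: because punctuation is removed rather than replaced, "SQL/APIs" ends up counted as the single word "sqlapis". Replacing punctuation with a space instead would be one possible variation, at the cost of splitting terms that rely on punctuation.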

The only problem I have with this is that the resulting JSON file includes every word, so you'll eventually have to trim it down and remove all the non-tech words. It's pretty easy to see where the block of tech words pops up, and it only takes a few minutes.
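
One possible way to shorten that manual cleanup is to keep only the words that appear on an allowlist of tech terms. This is just a sketch: the `TECH_TERMS` list and the `tech_only` helper are hypothetical names, and the counts below are made up to stand in for the contents of tmp/new_sorted_skills.json.

```ruby
require 'json'

# Hypothetical allowlist; extend it with whatever terms matter to you
TECH_TERMS = %w[javascript react java agile cloud css python apis sql].freeze

# Keeps only allowlisted words, preserving the order of the input hash
def tech_only(word_counts, allowlist = TECH_TERMS)
  word_counts.select { |word, _count| allowlist.include?(word) }
end

# Made-up counts standing in for the real sorted output file
sample_counts = { 'and' => 2100, 'javascript' => 648, 'team' => 500, 'react' => 442 }
filtered = tech_only(sample_counts)

puts JSON.pretty_generate(filtered)
```

To run it on the real output, `tech_only(JSON.parse(File.read('tmp/new_sorted_skills.json')))` should work the same way. The trade-off is that an allowlist only finds skills you already thought to list, so skimming the full file once is still worthwhile.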

I'm very open to improvement on this though, so let me know if there's something that I should tweak!