Ever curious which skills are really in demand on job boards? We're programmers, so why not build a scraper to find out?
Here's the type of output we want to get:
{
  "javascript": 648,
  "react": 442,
  "java": 382,
  "agile": 345,
  "cloud": 309,
  "css": 305,
  "python": 301,
  "apis": 243,
  "sql": 241
}
Prerequisites
To get started, follow this guide, which walks you through setting up your environment and getting some basic scraping working.
Getting the job URLs
Following the same layout the guide uses, here's how we'll structure our code. This first script collects the URL of each individual job listing so we can pull the words out of those listings later. I split the work into two files in case there were errors along the way. This one writes the listing URLs to a JSON file that the next file reads.
I put in a few sample search result URLs; feel free to change the search terms or add as many as you'd like. My own run used about 80 URLs across a few different search terms.
# indeed_url.rb
require 'kimurai'
require 'selenium-webdriver'
require 'json'

class Indeed < Kimurai::Base
  @name = 'indeed_scrape'
  @start_urls = [
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=10',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=20',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=30',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=40',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=50',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=60',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=70',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=80',
    'https://www.indeed.com/jobs?q=full%20stack%20developer&l=New%20York%2C%20NY&start=90',
  ]
  @engine = :selenium_chrome

  # Class variable, so URLs accumulate across every search results page
  @@jobs = []

  def scrape_page
    # Update response to current response after interaction with a browser
    doc = browser.current_response
    # browser.save_screenshot
    sleep 2
    # This loop goes through however many job listings are on the page
    while doc.css('div.jobsearch-SerpJobCard')[0]
      doc = browser.current_response
      # get first job listing
      single_job = doc.css('div.jobsearch-SerpJobCard')[0]
      # get job information (.value turns the href attribute into a plain String)
      job_url = single_job.css('a.jobtitle').attribute('href').value
      job_url = 'https://indeed.com' + job_url
      puts ' ===== '
      puts job_url
      puts ' ===== '
      @@jobs << job_url unless @@jobs.include?(job_url)
      # remove the processed card from the parsed doc and from the live page,
      # so the next pass sees the following listing as the first one
      doc.css('div.jobsearch-SerpJobCard')[0].remove
      browser.execute_script("document.querySelector('div.jobsearch-SerpJobCard').remove()")
      sleep 0.1
    end
  end

  def parse(response, url:, data: {})
    scrape_page
    # make sure the output directory exists before writing
    Dir.mkdir('tmp') unless Dir.exist?('tmp')
    # rewrite the file after every results page so partial runs are still saved
    File.open('tmp/indeed_jobs_urls.json', 'w') do |f|
      f.write(JSON.pretty_generate(@@jobs))
    end
    @@jobs
  end
end

Indeed.crawl!
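The ten start URLs in that script differ only in their start offset, so the list could also be generated instead of typed out. This is a minimal sketch rather than part of the original script: search_urls and encode are hypothetical helper names, and it assumes Indeed keeps paginating in steps of 10 through the start parameter, as the hard-coded list suggests.

```ruby
require 'uri'

# Hypothetical helper: percent-encodes a value the same way the
# hard-coded URLs do (spaces as %20 rather than +).
def encode(value)
  URI.encode_www_form_component(value).gsub('+', '%20')
end

# Hypothetical helper: builds one search URL per results page.
# Assumption: each page advances the `start` parameter by `per_page`.
def search_urls(query:, location:, pages: 10, per_page: 10)
  (0...pages).map do |page|
    url = "https://www.indeed.com/jobs?q=#{encode(query)}&l=#{encode(location)}"
    # the first page carries no start parameter in the original list
    page.zero? ? url : "#{url}&start=#{page * per_page}"
  end
end

urls = search_urls(query: 'full stack developer', location: 'New York, NY')
puts urls.first
puts urls.last
```

If the helpers are defined above the class, the literal array could be swapped for something like @start_urls = search_urls(query: 'full stack developer', location: 'New York, NY'), and extra search terms become one more call whose result is appended to the list.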
Getting the words
Now that we have the URLs to parse, let's go through each individual posting and see which words pop up the most.
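The counting idea itself can be tried without a browser. The sketch below is only an illustration: the sample string is made up and stands in for a scraped job description, while the cleanup (strip punctuation, lowercase, split on whitespace) mirrors what the scraper does.

```ruby
# Made-up sample text standing in for a scraped job description.
sample = 'We use React, Node.js and SQL. Experience with React is a plus!'

# Strip punctuation, lowercase everything, then split on whitespace.
words = sample.gsub(/[[:punct:]]/, '').downcase.split

# Count occurrences using a hash whose missing keys default to 0.
counts = Hash.new(0)
words.each { |word| counts[word] += 1 }

# Sort by count, highest first, and turn the pairs back into a hash.
sorted = counts.sort_by { |_word, count| -count }.to_h
puts sorted
```

One thing this makes visible: because punctuation is deleted rather than replaced with a space, "Node.js" ends up counted as "nodejs", so names containing punctuation will show up in the results in a squashed form.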
# indeed_posting.rb
require 'kimurai'
require 'json'

# This loads each URL from the JSON file and pulls the description,
# removes all punctuation and converts it all to lowercase,
# then throws each word into a hash for JSON
class JobScraper < Kimurai::Base
  @name = 'indeed_scrape'
  @start_urls = JSON.parse(File.read('tmp/indeed_jobs_urls.json'))
  @engine = :selenium_chrome

  # word => number of occurrences, shared across all postings
  @@word_count = Hash.new(0)

  def scrape_page
    sleep 2
    doc = browser.current_response
    job_desc = doc.css('div.jobsearch-jobDescriptionText').text.gsub(/[[:punct:]]/, '').downcase
    job_desc.split.each do |word|
      @@word_count[word] += 1
    end
    puts @@word_count
  end

  def parse(response, url:, data: {})
    scrape_page
    # sort by count, highest first, and rewrite the file after every posting
    sorted_words = @@word_count.sort_by { |_word, count| -count }.to_h
    File.open('tmp/new_sorted_skills.json', 'w') do |f|
      f.write(JSON.pretty_generate(sorted_words))
    end
  end
end

JobScraper.crawl!
puts 'done scraping'
The only problem I have with this is that you'll get a JSON file that you'll eventually have to trim down by removing all the non-tech words. It's pretty easy to see where the block of tech words pops up, and the cleanup only takes a few minutes.
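If trimming by hand gets tedious, one possible shortcut is to filter the counts against a list of words you care about. This is a rough sketch under stated assumptions: TECH_WORDS and tech_only are hypothetical names, the allowlist is just the skills from the sample output at the top, and the counts for "and" and "experience" are made up for illustration.

```ruby
require 'json'

# Hypothetical allowlist; extend it with whatever skills you want to track.
TECH_WORDS = %w[javascript react java agile cloud css python apis sql].freeze

# Hypothetical helper: keeps only allowlisted words from a
# word => count hash, preserving the existing order.
def tech_only(word_counts, allowlist = TECH_WORDS)
  word_counts.select { |word, _count| allowlist.include?(word) }
end

# Illustrative data shaped like tmp/new_sorted_skills.json;
# the "and" and "experience" counts are invented.
counts = { 'and' => 2100, 'javascript' => 648, 'experience' => 590, 'react' => 442 }
filtered = tech_only(counts)
puts JSON.pretty_generate(filtered)
```

The trade-off is that an allowlist only finds skills you already thought of, so a quick skim of the unfiltered file is still worth doing to spot terms that are missing from the list.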
I'm very open to improvement on this though, so let me know if there's something that I should tweak!