The problem
I have run out of jobs to apply for while trying to break into tech. I applied to all the remote jobs on LinkedIn, AngelList, We Work Remotely, and countless other job boards. I had to find a new source for remote job listings, so I started looking on Craigslist. Searching every city in the US by hand is very tedious, and it is easy to lose track of which cities you have already checked.
My solution
I have played around with Python a little in the past but never built anything with it. Python works very well for web scraping, so I decided to give it a shot.
First I got a list of all the Craigslist city board URLs from Craigslist. Then I used the urlopen function from urllib.request to read the HTML from the page and BeautifulSoup to work with the page data. I got all the city page URLs by grabbing all of the page's "a" tags, pulling the "href" URL out of each one, and adding it to a list. I had to remove the first three because they were not city URLs.
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

## get all US cities
cities_list = 'https://geo.craigslist.org/iso/us'
uClient = uReq(cities_list)
cities_html = uClient.read()
uClient.close()
cities_soup = soup(cities_html, "html.parser")

## grab every link on the page
cities_urls = cities_soup.findAll("a")

city_url_list = []
## loop through and get the url for each city
for a in cities_urls:
    cities_url = a['href']
    city_url_list.append(cities_url)

## the first three links are not city urls, so drop them
city_url_list.pop(0)
city_url_list.pop(0)
city_url_list.pop(0)
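A quick way to make sure the scrape worked is to print the count and the first few entries of "city_url_list":

## sanity check the city list
print(len(city_url_list), "city urls collected")
for url in city_url_list[:5]:
    print(url)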
Now I have all the US city URLs in a list in memory. I can use them to get the URLs for the job postings on each city's "web/info design" job board. There will be thousands of jobs, so I stored them in a txt file with one URL per line.
First I loop through "city_url_list", use urlopen to read the HTML, and use BeautifulSoup to parse it and get all the "a" tags with the class "result-title", which are all the job posts in the "web/info design" board for that city. Then we loop through "job_titles" and add the URL from each job post to a new list. We then write that list to the txt file on each iteration, with each URL on its own line.
for url in city_url_list:
    my_url = '{}/d/web-html-info-design/search/web'.format(url)
    # get the html from this city's web/info design board
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    # parse the html
    page_soup = soup(page_html, "html.parser")
    # grabs each job title link
    job_titles = page_soup.findAll("a", {"class": "result-title"})
    jobs_url_list = []
    for j in job_titles:
        if j['href']:
            # wrap each url in its own list so csv.writer writes one url per line
            jobs_url_list.append([j['href']])
    # append this city's urls to the txt file
    with open('jobs.txt', 'a', newline='') as file:
        write = csv.writer(file)
        write.writerows(jobs_url_list)
Now we have a txt file with over 2,000 job posting URLs that may be developer jobs, but we want to filter through those postings and find only the jobs that have the word "developer" in the title and/or description.
First we need to get all the URLs from the txt file and add them to a list. That gives us a list inside a list, so we assign the first inner list to a variable. Now the variable "job_urls" can be iterated through, and we can again use urlopen to read the HTML and BeautifulSoup to parse it.

For each post we check whether it has been deleted by looking for a "span" tag with the "id" of "has_been_removed". If it has been removed we move on to the next URL; if not, we continue. Next we look for the title and description so we can search them for the word "developer": the "span" tag with the "id" of "titletextonly" is the title, and the "section" tag with the "id" of "postingbody" is the description. We check that both the title and the description exist, and if they do, we check whether either one contains the word "developer". If it does, we add the URL for that job post to a list and write that list to a txt file that will contain one URL per line of only developer jobs.
import csv
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

## read the job urls back out of the txt file
with open('jobs.txt') as f:
    l = list(map(str.split, f.read().split('\n\n')))

## get only developer jobs
job_urls = l[0]
for j in job_urls:
    filtered_list = []
    uClient = uReq(j)
    job_html = uClient.read()
    uClient.close()
    job_soup = soup(job_html, "html.parser")
    ## skip posts that have been removed
    job_removed = job_soup.find("span", {"id": "has_been_removed"})
    if job_removed is None:
        job_title = job_soup.findAll("span", {"id": "titletextonly"})
        job_body = job_soup.findAll("section", {"id": "postingbody"})
        ## make sure the post has both a title and a description
        if len(job_title) != 0 and len(job_body) != 0:
            ## keep the post if "developer" shows up in the title or description
            if job_title[0].text.find("developer") != -1 or job_body[0].text.find("developer") != -1:
                filtered_list.append(j)
    ## write any match to the filtered txt file, one url per line
    if len(filtered_list) != 0:
        with open('filteredJobs.txt', 'a', newline='') as file:
            write = csv.writer(file)
            write.writerow(filtered_list)
Now we have our list of jobs that we know are developer jobs, and they are all in one place.
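If you want a quick look at the results, a minimal sketch like this (assuming "filteredJobs.txt" is in the working directory) reads the filtered file back in and prints the count and the first few URLs:

## read the filtered developer job urls back in
with open('filteredJobs.txt') as f:
    developer_jobs = [line.strip() for line in f if line.strip()]

print(len(developer_jobs), "developer job urls")
for url in developer_jobs[:5]:
    print(url)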
You can find the code and list on my GitHub
Project code
List of developer jobs
Also, if your company is hiring Web Developers, check out my portfolio. I am looking to break into tech and am very excited to learn more about your company.