Web scraping is the act of extracting data from the web. In this tutorial, we will build a web scraper that extracts information about various jobs. There are several Python web scraping libraries, such as requests and Beautiful Soup.
Let's start by creating a virtual environment for our project:
python -m venv venv
Now activate your virtual environment
. venv/bin/activate
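If you are on Windows, the activation command is different:
venv\Scripts\activate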
Now that your virtual environment is activated, any library you install will be isolated to this project rather than your system-wide Python.
Now create your working directory and enter it:
mkdir python-scraper
cd python-scraper
Create a requirements.txt file to keep track of all the libraries/dependencies you install:
touch requirements.txt
Install the requests library:
python -m pip install requests
Create a file scraper.py to carry out the scraping from the fossjobs.net job board:
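touch scraper.py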
We will start by writing a simple program that uses the requests library to get all the HTML from the fossjobs website and write it to a text file:
import requests
URL = "https://www.fossjobs.net/"
page = requests.get(URL)
with open('readme.txt', 'w') as f:
    f.write(page.text)
Parsing with Beautiful Soup
Now that we have scraped data from the fossjobs.net site, the output looks like a huge mess of HTML tags. We are going to use Beautiful Soup to parse the HTML. First, let's install Beautiful Soup:
python -m pip install beautifulsoup4
We will add two extra lines to our code: the first imports Beautiful Soup, and the second creates a BeautifulSoup object with our HTML content as its input.
import requests
from bs4 import BeautifulSoup
URL = "https://www.fossjobs.net/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('readme.txt', 'w') as f:
    f.write(page.text)
The second argument of the BeautifulSoup constructor specifies which parser to use; in this case, Python's built-in HTML parser.
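As a side note, Beautiful Soup also accepts third-party parsers. For example, if you have the lxml package installed (python -m pip install lxml), you can pass "lxml" instead; this is optional, and html.parser is all this tutorial needs:
soup = BeautifulSoup(page.content, "lxml")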
When you visit a job board, there are some key things you look for: the job title, its description, and a link to apply. These are what we currently want to extract from our HTML content. We can locate information using certain HTML attributes, like classes or ids. In our HTML content, we can see that every job post sits inside an article tag with the class post. In this scenario, we want to keep it simple by extracting just the links to all the job application pages, so we will make use of the a tag and the neutral-link class. (A sketch that scopes the search to each article element follows the output below.)
import requests
from bs4 import BeautifulSoup
URL = "https://www.fossjobs.net/"
# Get the HTML content of fossjobs.net
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
# Extract all 'a' elements with the 'neutral-link' class and an href attribute
job_elements = soup.find_all("a", class_="neutral-link", href=True)
for job_element in job_elements:
    job_link = job_element["href"]
    print(job_link, end="\n" * 2)
Our output should look something like this:
https://tryquiet.org/jobs/
https://grnh.se/230c253d1us
https://apply.workable.com/digidem/j/96343A29BE/
https://okfn.org/en/jobs/senior-developer/
https://www.fsf.org/news/fsf-job-opportunity-outreach-and-communications-coordinator
https://wikimedia-deutschland.softgarden.io/job/40776229?l=de
https://wikimedia-deutschland.softgarden.io/job/40775956?l=de
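As mentioned above, each job post sits inside an article element with the class post. If you want to scope the search to one posting at a time (for example, to later pair each link with its title), a minimal sketch could look like the following; note that treating the link's visible text as the job title is an assumption about the markup, so verify it against the live page:
import requests
from bs4 import BeautifulSoup

URL = "https://www.fossjobs.net/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
# Walk every posting container instead of searching the whole page at once
for post in soup.find_all("article", class_="post"):
    # Look for the application link inside this single posting
    link = post.find("a", class_="neutral-link", href=True)
    if link is not None:
        # Assumption: the link's visible text corresponds to the job title
        print(link.get_text(strip=True), "->", link["href"])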
Now record all dependencies in the requirements.txt file:
pip freeze > requirements.txt
We have successfully scraped the fossjobs.net website and collected the links to all the jobs on the home page. The next thing we will do is parse the content behind these links and get other useful details; a rough preview of that step follows.
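Here is a minimal sketch of that next step (assuming the job_elements list from the script above) that fetches each job page and prints its title; which elements hold the useful details will vary from site to site:
# Collect the hrefs gathered earlier
job_links = [element["href"] for element in job_elements]
# Only fetch a few pages while experimenting, to be polite to the servers
for job_link in job_links[:3]:
    job_page = requests.get(job_link)
    job_soup = BeautifulSoup(job_page.content, "html.parser")
    if job_soup.title is not None:
        # The <title> tag is about the only element shared across different sites
        print(job_soup.title.get_text(strip=True))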