Ida Delphine

Web Scraping With Python (Part 1)

Web scraping is the act of extracting data from the web. In this tutorial, we will build a web scraper that extracts information about various jobs. There are several Python web scraping libraries, such as Requests and Beautiful Soup.

Let's start by creating a virtual environment for our project

python -m venv venv

Now activate your virtual environment

. venv/bin/activate

Now that your virtual environment is activated, you can install the libraries we need.

Now create your working directory and enter it

mkdir python-scraper
cd python-scraper

Create a requirements.txt file to keep track of all the libraries/dependencies you install

touch requirements.txt

Install the requests library

python -m pip install requests

Create a file scraper.py to carry out the scraping from the fossjobs.net job board.
We will start by writing a simple program that uses the requests library to get all the HTML from the fossjobs website and write it to a text file

import requests

URL = "https://www.fossjobs.net/"

# fetch the home page of fossjobs.net
page = requests.get(URL)

# write the raw HTML to a text file so we can inspect it
with open('readme.txt', 'w') as f:
    f.write(page.text)

Parsing with Beautiful Soup

Now that we have scraped data from the fossjobs.net site, the output looks like a huge mess with HTML tags all over the place. We are going to use Beautiful Soup to parse the HTML. First, let's install Beautiful Soup

python -m pip install beautifulsoup4

We will add two extra lines to our code: one importing Beautiful Soup, and another that creates a BeautifulSoup object with our HTML content as its input.

import requests
from bs4 import BeautifulSoup


URL = "https://www.fossjobs.net/"
page = requests.get(URL)
# parse the raw HTML into a BeautifulSoup object we can query
soup = BeautifulSoup(page.content, "html.parser")

with open('readme.txt', 'w') as f:
    f.write(page.text)

The second argument to BeautifulSoup specifies which parser to use, in this case Python's built-in HTML parser.
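As a quick sanity check that the soup object is working (this line isn't part of the final script, and it assumes the page has a title tag), you can print the page title.

# quick sanity check: the soup object lets us navigate the parsed HTML
# (assumes the page actually has a <title> tag)
print(soup.title.string if soup.title else "no <title> found")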

When you visit a job board, there are some key things you are looking for: the job title, its description, and a link to apply. These are what we want to extract from our HTML content. We can locate information by certain HTML elements, such as classes or IDs. In our HTML content, we can see that for every job post there is an article tag with the class post. In this scenario, we want to keep it simple by extracting just the links to all the job application pages, so we will make use of the a tag and the neutral-link class.

import requests
from bs4 import BeautifulSoup


URL = "https://www.fossjobs.net/"

# Gets html content of fossjobs.net
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# extracts all html elements with an 'a' tag and 'neutral-link' class
job_elements = soup.find_all("a", class_="neutral-link", href=True)

for job_element in job_elements:
    job_link = job_element['href']
    print(job_link, end="\n"*2)

Our output should look somewhat like this:

https://tryquiet.org/jobs/

https://grnh.se/230c253d1us

https://apply.workable.com/digidem/j/96343A29BE/

https://okfn.org/en/jobs/senior-developer/

https://www.fsf.org/news/fsf-job-opportunity-outreach-and-communications-coordinator

https://wikimedia-deutschland.softgarden.io/job/40776229?l=de

https://wikimedia-deutschland.softgarden.io/job/40775956?l=de
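We noted above that every job post lives in an article tag with the post class. If you also wanted the visible job titles, a minimal sketch could reuse the same job_elements and print each link's text next to its URL. The assumption that the anchor text is the job title comes from the page structure described earlier, so verify it against the actual HTML.

# sketch: pair each link's visible text with its URL
# (assumes the anchor text of each 'neutral-link' element is the job title)
for job_element in job_elements:
    title = job_element.get_text(strip=True)
    print(title, "->", job_element["href"])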

Now record all installed dependencies in the requirements.txt file

pip freeze > requirements.txt

We have successfully scraped the fossjobs.net website and collected the links to all the jobs on the home page. Next, we will parse the content of these links and extract other useful details.
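As a rough preview of that next step, here is a minimal sketch of fetching each collected link and parsing it with the same tools. Every external job board has its own markup, so the job_links list and the title lookup here are assumptions for illustration, not the final extractor.

import requests
from bs4 import BeautifulSoup

# job_links would hold the hrefs we collected above;
# this example entry is taken from the output shown earlier
job_links = ["https://tryquiet.org/jobs/"]

for job_link in job_links:
    job_page = requests.get(job_link)
    job_soup = BeautifulSoup(job_page.content, "html.parser")
    # real extraction will need per-site handling; printing the
    # page title (or the link if there is none) is just a placeholder
    print(job_soup.title.string if job_soup.title else job_link)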
