Web scraping is the act of extracting data from the web. In this tutorial, we will build a web scraper that extracts information about various jobs. There are several Python web scraping libraries, such as requests and Beautiful Soup.
Let's start by creating a virtual environment for our project:
python -m venv venv
Now activate your virtual environment
. venv/bin/activate
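If you are on Windows, the activation command is different:
venv\Scripts\activate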
Now that your virtual environment is activated, any library you install will be isolated to this project rather than your system-wide Python.
Now create your working directory and enter it:
mkdir python-scraper
cd python-scraper
Create a requirements.txt file to keep track of all the libraries/dependencies you install:
touch requirements.txt
Install the requests library:
python -m pip install requests
Create a file scraper.py to carry out the scraping from the fossjobs.net job board:
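touch scraper.py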
We will start by writing a simple program that uses the requests library to get all the HTML from the fossjobs website and write it to a text file:
import requests
URL = "https://www.fossjobs.net/"
page = requests.get(URL)
with open('readme.txt', 'w') as f:
    f.write(page.text)
Parsing with Beautiful Soup
Now that we have scraped data from the fossjobs.net site, the output looks like a huge mess of HTML tags. We are going to use Beautiful Soup to parse the HTML. First, let's install Beautiful Soup:
python -m pip install beautifulsoup4
We will add two extra lines to our code: the first imports Beautiful Soup, and the second creates a BeautifulSoup object with our HTML content as its input.
import requests
from bs4 import BeautifulSoup
URL = "https://www.fossjobs.net/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
with open('readme.txt', 'w') as f:
    f.write(page.text)
The second argument of the BeautifulSoup constructor specifies which parser to use; in this case, Python's built-in HTML parser.
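As a side note, Beautiful Soup also accepts third-party parsers. For example, if you have the lxml package installed (python -m pip install lxml), you can pass "lxml" instead; this is optional, and html.parser is all this tutorial needs:
soup = BeautifulSoup(page.content, "lxml")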
When you visit a job board, there are some key things you look for: the job title, its description, and a link to apply. These are what we currently want to extract from our HTML content. We can locate information using certain HTML attributes, like classes or ids. In our HTML content, we can see that every job post sits inside an article tag with the class post. In this scenario, we want to keep it simple by extracting just the links to all the job application pages, so we will make use of the a tag and the neutral-link class. (A sketch that scopes the search to each article element follows the output below.)
import requests
from bs4 import BeautifulSoup
URL = "https://www.fossjobs.net/"
# Get the HTML content of fossjobs.net
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
# Extract all 'a' elements with the 'neutral-link' class and an href attribute
job_elements = soup.find_all("a", class_="neutral-link", href=True)
for job_element in job_elements:
    job_link = job_element["href"]
    print(job_link, end="\n" * 2)
Our output should look something like this:
https://tryquiet.org/jobs/
https://grnh.se/230c253d1us
https://apply.workable.com/digidem/j/96343A29BE/
https://okfn.org/en/jobs/senior-developer/
https://www.fsf.org/news/fsf-job-opportunity-outreach-and-communications-coordinator
https://wikimedia-deutschland.softgarden.io/job/40776229?l=de
https://wikimedia-deutschland.softgarden.io/job/40775956?l=de
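As mentioned above, each job post sits inside an article element with the class post. If you want to scope the search to one posting at a time (for example, to later pair each link with its title), a minimal sketch could look like the following; note that treating the link's visible text as the job title is an assumption about the markup, so verify it against the live page:
import requests
from bs4 import BeautifulSoup

URL = "https://www.fossjobs.net/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
# Walk every posting container instead of searching the whole page at once
for post in soup.find_all("article", class_="post"):
    # Look for the application link inside this single posting
    link = post.find("a", class_="neutral-link", href=True)
    if link is not None:
        # Assumption: the link's visible text corresponds to the job title
        print(link.get_text(strip=True), "->", link["href"])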
Now record all dependencies in the requirements.txt file:
pip freeze > requirements.txt
We have successfully scraped the fossjobs.net website and collected the links to all the jobs on the home page. The next thing we will do is parse the content behind these links and get other useful details; a rough preview of that step follows.
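Here is a minimal sketch of that next step (assuming the job_elements list from the script above) that fetches each job page and prints its title; which elements hold the useful details will vary from site to site:
# Collect the hrefs gathered earlier
job_links = [element["href"] for element in job_elements]
# Only fetch a few pages while experimenting, to be polite to the servers
for job_link in job_links[:3]:
    job_page = requests.get(job_link)
    job_soup = BeautifulSoup(job_page.content, "html.parser")
    if job_soup.title is not None:
        # The <title> tag is about the only element shared across different sites
        print(job_soup.title.get_text(strip=True))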