We'll be continuing where we left our Python scraper. Our program was able to return all the job links on the homepage of fossjobs.net, but we shouldn't stop there. How about we build a tool that sends users SMS messages about jobs that match them? That means we'll have to visit each individual link, pull its info and check that it matches the user. For now we will extract info from each job link: the summary, the job description, the eligibility/qualifications/skills required, and the link or steps to apply. This might be a little tricky, so we'll start slowly.
Each of those job links has a different format for how its data is organized. You can clearly see that the format of this Rust developer role is different from this software engineer role. Since there is no consistent format across all links, we will basically just pull everything from each link and pass the output to a generative AI model to extract the relevant data we need.
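To make that hand-off concrete, here's a rough sketch of what the AI step could eventually look like. It assumes the official openai package with an API key in the environment; the model name, prompt and output format are placeholders, not the final design we'll land on.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_job_fields(raw_text):
    # Ask the model to pull the fields we care about out of messy page text
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Extract the job summary, description, required skills and application link from the text. Reply in JSON.",
            },
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content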
This is how we previously left our code:
import requests
from bs4 import BeautifulSoup
URL = "https://www.fossjobs.net/"
# Gets html content of fossjobs.net
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
# extracts all html elements with an 'a' tag and 'neutral-link' class
job_elements = soup.find_all("a", class_="neutral-link", href=True)
for job_element in job_elements:
    job_link = job_element['href']
    print(job_link, end="\n"*2)
It's high time we "modularized" our code by placing different functionalities in different functions. The first function will accept a URL and return a list of job links, while the second will accept that list of job URLs, iterate through them and grab all the job details.
import requests
from bs4 import BeautifulSoup

def get_job_urls(url):
    # Gets the html content of the page at the given url
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    # each job listing is wrapped in an 'h3' heading that contains its link
    job_elements = soup.find_all("h3")
    job_links = []
    for job_element in job_elements:
        job_link = job_element.find("a")["href"]
        job_links.append(job_link)
    return job_links
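Assuming fossjobs.net is reachable and its markup still matches, a quick sanity check of this function could look like this:

job_urls = get_job_urls("https://www.fossjobs.net/")
print(f"Found {len(job_urls)} job links")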
Now let's code the second function. We will call it get_job_details; it will accept a list of strings and, for now, print out the job details. Looking at the html content, there is a div with a class called job-description, and inside this div we have a paragraph tag.
def get_job_details(urls):
    job_details = []
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # the posting's body text lives in a 'p' inside the job-description div
        job = soup.find("div", class_="job-description")
        description = job.find("p").text
        print(description)
        job_details.append(description)
    return job_details
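Wiring the two functions together, a minimal run might look like the following (the __main__ guard is just a convention, not something we had before):

if __name__ == "__main__":
    job_urls = get_job_urls("https://www.fossjobs.net/")
    job_details = get_job_details(job_urls)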
The next thing we will do is organize the data in a format that lets us send jobs that are a good fit for our users and exclude jobs whose deadlines have passed.
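As a rough sketch of where that could go, the parsed fields might end up in a small structure like the one below. The field names and the deadline check are assumptions about the final design, not code we've written yet.

from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class JobPosting:
    summary: str
    description: str
    skills: str
    apply_link: str
    deadline: Optional[date] = None  # not every posting states one

def is_still_open(job, today=None):
    # jobs without a stated deadline are treated as open
    today = today or date.today()
    return job.deadline is None or job.deadline >= today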