DEV Community

hub


Some approaches to scraping the clutch page with bs4 and pandas: a comparison

I am trying to gather the data from the page "https://clutch.co/il/it-services",
and I think there are probably several options to do that:

a. using bs4 and requests
b. using pandas

This first approach uses option a., bs4 and requests:

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://clutch.co/il/it-services"

    # Fetch the page and fail early on an HTTP error instead of parsing an error page
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.content, "html.parser")

    # Collect every element that carries the expected class names
    company_names = soup.find_all("h3", class_="company-name")
    locations = soup.find_all("span", class_="locality")

    company_names_list = [name.get_text(strip=True) for name in company_names]
    locations_list = [location.get_text(strip=True) for location in locations]

    # pd.DataFrame requires both lists to have the same length
    data = {"Company Name": company_names_list, "Location": locations_list}
    df = pd.DataFrame(data)

    # Write the result to disk without the index column
    df.to_csv("it_services_data.csv", index=False)

Provided the page really uses the class names "company-name" and "locality", this code

a. scrapes the company names and locations from the specified webpage,
b. stores them in a pandas DataFrame, and
c. saves the data to a CSV file named "it_services_data.csv" in the current working directory.
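One weak spot of two separate find_all() calls is that the lists can end up with different lengths (for example when one company has no location), and pd.DataFrame then raises a ValueError. A minimal sketch of a row-wise alternative follows. It runs on an inline HTML sample, and the wrapper class "provider" as well as the helper name extract_rows are assumptions made for illustration, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "company-name" and "locality" mirror the selectors
# used above; the wrapper class "provider" is an assumption.
SAMPLE_HTML = """
<ul>
  <li class="provider">
    <h3 class="company-name">Acme IT</h3>
    <span class="locality">Tel Aviv</span>
  </li>
  <li class="provider">
    <h3 class="company-name">Beta Soft</h3>
  </li>
</ul>
"""


def extract_rows(html):
    """Return one dict per company card, with None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("li", class_="provider"):
        # Search inside the card, so name and location stay paired
        name = card.find("h3", class_="company-name")
        location = card.find("span", class_="locality")
        rows.append({
            "Company Name": name.get_text(strip=True) if name else None,
            "Location": location.get_text(strip=True) if location else None,
        })
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows), which keeps one row per company even when a field is missing.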

I am wondering whether a pandas-only approach could be useful as well:

    import pandas as pd

    url = "https://clutch.co/il/it-services"

    # Use pandas to read the HTML content and extract tables from the webpage.
    # read_html returns a list of DataFrames, one per <table> element,
    # and raises ValueError if the page contains no tables.
    tables = pd.read_html(url)

    # Assuming the desired table is the first one on the page
    table = tables[0]

    # Extract the columns we're interested in; .copy() avoids modifying a view of `table`
    df = table[["Company Name", "Location"]].copy()

    # Optional: Clean up the column names if needed
    df.columns = ["Company Name", "Location"]

    # Optional: Perform further data processing or analysis using the pandas DataFrame

    # Save the data to a CSV file
    df.to_csv("it_services_data.csv", index=False)


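In this approach, pandas' read_html() function reads the HTML content of the webpage and returns every <table> element it finds as a DataFrame.
It only works if the listing is actually marked up as an HTML table; if the page has no <table> elements, read_html() raises a ValueError.
Assuming the desired table is the first one on the page, we assign it to the table variable.
Then we select the columns we're interested in, which requires that the table really has columns named "Company Name" and "Location" (otherwise the selection raises a KeyError).
Finally, we can perform further data processing or analysis if needed and save the data to a CSV file.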
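Taking tables[0] is fragile when a page contains more than one table. As a hedged sketch, the table can instead be chosen by the columns it contains. The helper name pick_table and the hand-built example frames are assumptions for illustration; on a real page the list would come from pd.read_html().

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame that has every required column, else None."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    return None


# Hand-built stand-ins for the list that pd.read_html() would return
example_tables = [
    pd.DataFrame({"Rank": [1, 2]}),
    pd.DataFrame({"Company Name": ["Acme IT"], "Location": ["Tel Aviv"]}),
]

wanted = ["Company Name", "Location"]
table = pick_table(example_tables, wanted)
if table is None:
    print("No table with the expected columns was found")
else:
    print(table[wanted])
```

Returning None instead of raising lets the caller decide whether to fall back to the bs4 approach.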
