some approaches to scrape the clutch page: bs4 vs. pandas - a comparison

trying to gather the data from the page "https://clutch.co/il/it-services" -
and that said, i think there are probably several options to do that:

a. using bs4 and requests
b. using pandas

this first approach uses option a.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://clutch.co/il/it-services"
    # a browser-like User-Agent makes it less likely the request is rejected
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # fail early on 403/404 instead of parsing an error page

    soup = BeautifulSoup(response.content, "html.parser")

    # these selectors assume the page marks companies up with these classes
    company_names = soup.find_all("h3", class_="company-name")
    locations = soup.find_all("span", class_="locality")

    company_names_list = [name.get_text(strip=True) for name in company_names]
    locations_list = [location.get_text(strip=True) for location in locations]

    data = {"Company Name": company_names_list, "Location": locations_list}
    df = pd.DataFrame(data)

    df.to_csv("it_services_data.csv", index=False)

This code will

a. scrape the company names and locations from the specified webpage,
b. store them in a pandas DataFrame, and
c. save the data to a CSV file named "it_services_data.csv" in the current working directory.
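One thing to watch with step a: building the DataFrame from two independent `find_all()` lists silently misaligns rows if one company is missing its location tag. Pairing the lists with `zip()` is a bit safer. A minimal sketch on an inline HTML snippet (the class names mirror the selectors above, but are assumptions about the real markup):

```python
from bs4 import BeautifulSoup
import pandas as pd

# a small inline snippet standing in for the live page, so the pairing
# logic can be shown without a network call
html = """
<div>
  <h3 class="company-name">Acme IT</h3><span class="locality">Tel Aviv</span>
  <h3 class="company-name">Beta Systems</h3><span class="locality">Haifa</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
names = [h.get_text(strip=True) for h in soup.find_all("h3", class_="company-name")]
places = [s.get_text(strip=True) for s in soup.find_all("span", class_="locality")]

# zip() pairs each name with its location and stops at the shorter list,
# so one missing <span> cannot shift every later row
df = pd.DataFrame(zip(names, places), columns=["Company Name", "Location"])
print(df)
```

For stricter checking, comparing `len(names)` against `len(places)` before zipping would surface the mismatch instead of truncating it.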

i am wondering if a pandas-only approach could be useful as well?

    import pandas as pd

    url = "https://clutch.co/il/it-services"

    # use pandas to read the HTML content and extract all <table> elements
    # (note: this raises ValueError if the page contains no <table> markup)
    tables = pd.read_html(url)

    # assuming the desired table is the first one on the page
    table = tables[0]

    # keep only the columns we're interested in
    # (the column names here are assumptions about the table's headers)
    df = table[["Company Name", "Location"]]

    # optional: perform further data processing or analysis using the DataFrame

    # save the data to a CSV file
    df.to_csv("it_services_data.csv", index=False)


In this approach, pandas' read_html() function reads the HTML content of the webpage and extracts every <table> element it finds - note that it only sees actual <table> markup, so it will fail on pages that render their listings with divs.
Assuming the desired table is the first one on the page, we assign it to the table variable.
Then we extract the columns we're interested in to create the DataFrame.
Finally, we can perform further data processing or analysis if needed and save the data to a CSV file.
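Because read_html() raises ValueError when it finds no matching table, it's worth guarding the call - and the match= parameter is safer than blindly taking tables[0] on a page with several tables. A minimal sketch on an inline HTML string (standing in for a page response, which is an assumption about what the live site returns):

```python
import io
import pandas as pd

# inline HTML with a real <table>; on a div-based listing page,
# read_html() would instead raise ValueError("No tables found")
html = """
<table>
  <tr><th>Company Name</th><th>Location</th></tr>
  <tr><td>Acme IT</td><td>Tel Aviv</td></tr>
</table>
"""

try:
    # match= keeps only tables whose text matches the pattern
    tables = pd.read_html(io.StringIO(html), match="Company Name")
    df = tables[0]
except ValueError:
    # raised when no <table> (or no match) is present in the HTML
    df = pd.DataFrame(columns=["Company Name", "Location"])

print(df)
```

The fallback empty DataFrame keeps the rest of the pipeline (column selection, to_csv) from crashing when the page turns out not to use tables at all.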

