Anuoluwapo Balogun

Extracting data from e-commerce websites

Basic web scraping is one of the essential skills for a data analyst. The ability to gather your own data for a project is an undervalued skill.

I recently scraped data from four big art shops' websites in Nigeria, and I would like to share the code (written with some help from ChatGPT) for learning purposes, for other data analysts who might find it useful.

The first website is Crafts Village; I scraped the Art Tools category.

Here is the code for scraping the website:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Initialize lists to store the data
product_names = []
prices = []

# Scrape all 6 pages (the category is paginated as /page/{n}/)
for page in range(1, 7):
    url = f"https://craftsvillage.com.ng/product-category/art-tools/page/{page}/"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the relevant HTML elements for product information
    products = soup.find_all("li", class_="product")

    # Extract data from each product element
    for product in products:
        # Product name
        name_element = product.find("a", class_="woocommerce-LoopProduct-link")
        name = name_element.text.replace("\n", "").strip()
        name = re.sub(r"[₦\,|–]", "", name)  # Remove unwanted characters
        product_names.append(name)


        # Price
        price_element = product.find("bdi")
        price = price_element.text if price_element else None
        prices.append(price)

# Create a Pandas DataFrame from the scraped data
data = {
    "Product Name": product_names,
    "Price": prices
}
df = pd.DataFrame(data)

# Remove "\n\n\n\n\n" from "Product Name" column
df["Product Name"] = df["Product Name"].str.replace("\n", "")

# Display the Data Frame
print(df)

To find the class of the name element, I inspected it in my browser: hover over the product name, right-click, and choose Inspect.


I did the same for the price.


The code above extracts the product names and prices from all six pages of the Art Tools category.
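
The prices come back as text such as "₦1,500.00", so a small cleaning step makes them easier to analyse. Here is a minimal sketch, assuming the ₦ prefix and comma thousands separators shown on the site; the CSV file name is just an example:

import pandas as pd

# Strip the currency symbol and thousands separators, then convert to numbers.
# Anything unexpected (e.g. a price range) becomes NaN instead of raising an error.
df["Price"] = pd.to_numeric(
    df["Price"]
    .str.replace("₦", "", regex=False)
    .str.replace(",", "", regex=False),
    errors="coerce",
)

# Save the cleaned data for later analysis (file name is just an example)
df.to_csv("craftsvillage_art_tools.csv", index=False)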

Here is how I scraped information from Crafties Hobbies:

import requests
from bs4 import BeautifulSoup
import pandas as pd

base_url = 'https://craftieshobbycraft.com/product-category/painting-drawing/page/{}/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Create lists to store data
categories = []
product_names = []
product_prices = []

# Iterate over each page
for page in range(1, 8):
    url = base_url.format(page)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')

    category_elements = soup.find_all('p', class_='category uppercase is-smaller no-text-overflow product-cat op-7')
    product_names_elements = soup.find_all('a', class_='woocommerce-LoopProduct-link woocommerce-loop-product__link')
    product_prices_elements = soup.find_all('bdi')

    for category_element, product_name_element, product_price_element in zip(category_elements, product_names_elements, product_prices_elements):
        category = category_element.get_text(strip=True)
        product_name = product_name_element.get_text(strip=True)
        product_price = product_price_element.get_text(strip=True)

        categories.append(category)
        product_names.append(product_name)
        product_prices.append(product_price)

# Create a pandas DataFrame
data = {
    'Category': categories,
    'Product Name': product_names,
    'Product Price': product_prices
}
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

Here is how I scraped data from the Kaenves store:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create empty lists to store the data
product_names = []
prices = []

# Iterate through each page
for page in range(1, 4):
    # Send a GET request to the page
    url = f"https://www.kaenves.store/collections/floating-wood-frame?page={page}"
    response = requests.get(url)

    # Create a BeautifulSoup object to parse the HTML content
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all span elements with the specified class
    price_elements = soup.find_all('span', class_='price-item price-item--regular')
    name_elements = soup.find_all('h3', class_='card__heading h5')

    # Extract the prices and product names
    for price_element, name_element in zip(price_elements, name_elements):
        price = price_element.get_text(strip=True)
        name = name_element.get_text(strip=True)
        product_names.append(name)
        prices.append(price)

# Create a pandas DataFrame
data = {'Product Name': product_names, 'Price': prices}
df = pd.DataFrame(data)

# Save the DataFrame as a CSV file
df.to_csv('paperandboard.csv', index=False)

Here is how I scraped data from Art Easy:

import requests
from bs4 import BeautifulSoup
import pandas as pd

prices = []
product_names = []

# Iterate over both pages
for page_num in range(1, 3):
    url = f"https://arteasy.com.ng/product-category/canvas-surfaces/page/{page_num}/"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content
    soup = BeautifulSoup(response.text, "html.parser")

    # Find all the span elements with class "price"
    product_prices = [span.get_text(strip=True) for span in soup.find_all("span", class_="price")]

    # Find all the h3 elements with class "product-title"
    product_names += [product_name.get_text(strip=True) for product_name in soup.find_all("h3", class_="product-title")]

    # Add the prices to the list
    prices += product_prices

# Check if the lengths of product_names and prices are equal
if len(product_names) == len(prices):
    # Create a pandas DataFrame
    data = {"Product Name": product_names, "Price": prices}
    df = pd.DataFrame(data)

    # Print the DataFrame
    print(df)
else:
    print("Error: The lengths of product_names and prices are not equal.")
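The length check above is there because zipping two separate find_all() lists can silently misalign names and prices when one element is missing. A more defensive pattern is to walk each product container and read the name and price from inside it. Here is a minimal sketch, assuming the same li.product containers as the Crafts Village example; confirm the classes in the inspector before relying on them:

import requests
from bs4 import BeautifulSoup
import pandas as pd

rows = []
for page_num in range(1, 3):
    url = f"https://arteasy.com.ng/product-category/canvas-surfaces/page/{page_num}/"
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Walk each product container so a missing price cannot shift the pairing
    # ("li.product" is the usual WooCommerce list-item class -- verify it in the inspector)
    for product in soup.find_all("li", class_="product"):
        name_el = product.find("h3", class_="product-title")
        price_el = product.find("span", class_="price")
        rows.append({
            "Product Name": name_el.get_text(strip=True) if name_el else None,
            "Price": price_el.get_text(strip=True) if price_el else None,
        })

df = pd.DataFrame(rows)
print(df)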

If you want to reuse this code, make sure to change the URL to your preferred e-commerce website and update the class names to match that site's product name and product price elements.
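
To make that reuse easier, here is a minimal, generic sketch. The URL in the example call is a placeholder, and the selectors shown are the WooCommerce ones used above; inspect your own target site to find the right values:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_category(base_url, product_selector, name_selector, price_selector,
                    pages, headers=None):
    """Scrape product names and prices from a paginated category page.

    base_url needs a {} placeholder for the page number; the three selectors
    are CSS selectors you find by inspecting the target site in the browser.
    """
    rows = []
    for page in range(1, pages + 1):
        response = requests.get(base_url.format(page), headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        # Read name and price from inside each product container
        for product in soup.select(product_selector):
            name_el = product.select_one(name_selector)
            price_el = product.select_one(price_selector)
            rows.append({
                "Product Name": name_el.get_text(strip=True) if name_el else None,
                "Price": price_el.get_text(strip=True) if price_el else None,
            })
    return pd.DataFrame(rows)

# Example call (placeholder URL -- the selectors match the WooCommerce sites above)
# df = scrape_category(
#     "https://example-shop.com/product-category/art-tools/page/{}/",
#     product_selector="li.product",
#     name_selector="a.woocommerce-LoopProduct-link",
#     price_selector="bdi",
#     pages=6,
# )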

The scraped information can be used for the following:

  • Price comparison: You can use the scraped data to compare prices of products across different websites. This can help you find the best deal on the product you are looking for (see the sketch after this list).

  • Product research: You can use the scraped data to research products. This can help you learn more about a product's features, specifications, and reviews.

  • Market analysis: You can use the scraped data to analyze the market for a particular product. This can help you identify trends and opportunities.

  • Product recommendations: You can use the scraped data to recommend products to users. This can help you increase sales and improve customer satisfaction.
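
As a quick illustration of the price-comparison idea, here is a minimal sketch that combines scraped CSVs (the file names are illustrative), converts the price text to numbers, and lists the cheapest matches for a keyword:

import pandas as pd

# Load the scraped data (file names are illustrative examples)
stores = {
    "Crafts Village": "craftsvillage_art_tools.csv",
    "Kaenves": "paperandboard.csv",
}

frames = []
for store, path in stores.items():
    store_df = pd.read_csv(path)
    store_df["Store"] = store
    frames.append(store_df)

combined = pd.concat(frames, ignore_index=True)

# Convert price strings like "₦1,500.00" to numbers
combined["Price"] = pd.to_numeric(
    combined["Price"]
    .astype(str)
    .str.replace("₦", "", regex=False)
    .str.replace(",", "", regex=False),
    errors="coerce",
)

# Example: find the cheapest listings containing a keyword
keyword = "canvas"
matches = combined[combined["Product Name"].str.contains(keyword, case=False, na=False)]
print(matches.sort_values("Price").head(10))

Keep in mind that different shops describe the same product differently, so a keyword match like this is only a starting point; a proper comparison needs some manual matching of like-for-like products.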
