DEV Community

MegaWatts


First glance - Sales Dataset

Introduction

The sales dataset is a collection of orders placed by customers in various geographical areas [CITY, STATE, POSTALCODE, COUNTRY]. Each entry includes detailed information about the customer [CUSTOMERNAME, CONTACTLASTNAME, CONTACTFIRSTNAME], the order [ORDERNUMBER, ORDERDATE, STATUS], and the products ordered [PRODUCTLINE, MSRP, PRODUCTCODE]. The dataset spans 2003 to 2005 and also records a contact person for each customer.

Each entry carries an order number [ORDERNUMBER] that identifies the order it belongs to. An order can contain several products [PRODUCTCODE], each associated with a product line [PRODUCTLINE].
Columns such as deal size [DEALSIZE] and status [STATUS] categorize each entry. The dataset also includes the unit price [PRICEEACH], the quantity ordered [QUANTITYORDERED], and the total sales amount [SALES].

The dataset contains 2823 entries and 25 columns, of which 9 are numerical and 3 are categorical [STATUS, DEALSIZE, PRODUCTLINE].

Fig. 1: Main statistical properties of the retail sales dataset

Key Components of the Dataset:

  • Customer Information: [CUSTOMERNAME, CONTACTLASTNAME, CONTACTFIRSTNAME, PHONE]

  • Geographical Information: [ADDRESSLINE1, ADDRESSLINE2, CITY, STATE, POSTALCODE, COUNTRY, TERRITORY]

  • Order Information: [ORDERNUMBER, ORDERDATE, STATUS]

  • Product Information: [PRODUCTLINE, PRODUCTCODE, QUANTITYORDERED, PRICEEACH, SALES]

  • Additional Attributes: [QTR_ID, MONTH_ID, YEAR_ID, MSRP, DEALSIZE]

Fig. 2: Shape of the retail sales dataset
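
A structural overview like the one in the figures can be reproduced with a few pandas calls. The following is a minimal sketch, assuming the data is loaded into a pandas DataFrame; the helper name `summarize_structure` and the toy rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> dict:
    """Return row/column counts and the columns split by numeric vs. other dtype."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    other_cols = df.select_dtypes(exclude="number").columns.tolist()
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "numeric": numeric_cols,
        "non_numeric": other_cols,
    }


# Toy rows (hypothetical values) that reuse a few of the real column names.
toy = pd.DataFrame(
    {
        "ORDERNUMBER": [10100, 10100, 10101],
        "QUANTITYORDERED": [30, 50, 25],
        "PRICEEACH": [95.70, 100.00, 100.00],
        "STATUS": ["Shipped", "Shipped", "Shipped"],
    }
)

summary = summarize_structure(toy)
print(summary)
```

On the full dataset, `df.describe()` would give the kind of statistical summary shown in Fig. 1, and `df.shape` the dimensions shown in Fig. 2.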

Purpose of the Review
The purpose of this review is to provide an initial understanding of the dataset's structure, content, and data quality. This will help to:

  • Detect missing values and assess their impact.

  • Identify necessary data type conversions for accurate analysis.

  • Determine important columns for analysis such as sales figures, order details, and customer information.
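
As a rough illustration of the first goal, missing values can be counted per column. This is a sketch under assumptions: `missing_value_report` is a hypothetical helper, and the toy rows below are invented, so they say nothing about which columns are actually incomplete in the dataset.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """List columns that contain missing values, with counts and shares."""
    counts = df.isna().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    # Keep only columns with at least one missing value, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy rows (hypothetical values) with a few gaps.
toy = pd.DataFrame(
    {
        "ORDERNUMBER": [10100, 10101, 10102],
        "ADDRESSLINE2": [None, "Suite 101", None],
        "STATE": ["NY", None, "CA"],
    }
)

report = missing_value_report(toy)
print(report)
```

The `share` column helps judge impact: a column that is mostly empty may be better dropped, while one with a handful of gaps may be worth filling.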

Observations

  1. Multiple Products per Order: An [ORDERNUMBER] may be associated with more than one product [PRODUCTCODE], from different product lines [PRODUCTLINE].
  2. Sales Calculation Discrepancy: The [SALES] column does not always equal [QUANTITYORDERED] multiplied by [PRICEEACH]. 1304 entries do not follow this calculation.
  3. Order Date Format: [ORDERDATE] is stored as an object (string) column and needs to be converted to a datetime type before any time series analysis.
  4. Inconsistent Phone Data: [PHONE] values do not follow a consistent format across countries.
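
The first three observations can be checked with short pandas helpers. This is a minimal sketch, not the exact method used for this review: the function names, the tolerance, the toy rows, and the date format string `"%m/%d/%Y %H:%M"` are all assumptions chosen for illustration.

```python
import pandas as pd


def products_per_order(df: pd.DataFrame) -> pd.Series:
    """Count distinct product codes for each order number."""
    return df.groupby("ORDERNUMBER")["PRODUCTCODE"].nunique()


def flag_sales_mismatch(df: pd.DataFrame, tol: float = 0.01) -> pd.Series:
    """Mark rows where SALES differs from QUANTITYORDERED * PRICEEACH by more than tol."""
    expected = df["QUANTITYORDERED"] * df["PRICEEACH"]
    return (df["SALES"] - expected).abs() > tol


def parse_order_dates(df: pd.DataFrame, date_format: str) -> pd.DataFrame:
    """Return a copy with ORDERDATE converted from text to datetime."""
    out = df.copy()
    # Values that do not match the format become NaT instead of raising.
    out["ORDERDATE"] = pd.to_datetime(out["ORDERDATE"], format=date_format, errors="coerce")
    return out


# Toy rows (hypothetical values); the date layout is an assumed example.
toy = pd.DataFrame(
    {
        "ORDERNUMBER": [10100, 10100, 10101],
        "PRODUCTCODE": ["S10_1678", "S18_1749", "S10_1678"],
        "QUANTITYORDERED": [30, 50, 25],
        "PRICEEACH": [95.70, 100.00, 100.00],
        "SALES": [2871.00, 5000.00, 3000.00],
        "ORDERDATE": ["1/6/2003 0:00", "1/6/2003 0:00", "1/9/2003 0:00"],
    }
)

per_order = products_per_order(toy)
mismatch = flag_sales_mismatch(toy)
parsed = parse_order_dates(toy, date_format="%m/%d/%Y %H:%M")

print(per_order)
print("rows with a SALES mismatch:", int(mismatch.sum()))
print(parsed["ORDERDATE"].dtype)
```

A small tolerance is used for the comparison because the values are floating point, so an exact equality test could flag rows that differ only by rounding.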

Potential Areas for Further Analysis:

  • Sales Performance:
    • Analyze completed sales data over time, by product, and by region.
    • Analyze sales data based on status by product, region, and time.
  • Customer Insights:
    • Segment customers and analyze their purchasing behavior.
  • Geographical Insights:
    • Map sales data and customer locations.
  • Forecasting:
    • Predict quantity and product sales for the next year.
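
As a starting point for the sales performance analysis, completed sales could be totalled by year and product line. The sketch below assumes that a [STATUS] value of "Shipped" marks a completed sale, which this review has not verified; the helper name and the toy rows are likewise hypothetical.

```python
import pandas as pd


def completed_sales_by_year_and_line(
    df: pd.DataFrame, completed_status: str = "Shipped"
) -> pd.DataFrame:
    """Sum SALES per YEAR_ID (rows) and PRODUCTLINE (columns) for completed orders."""
    completed = df[df["STATUS"] == completed_status]
    totals = completed.groupby(["YEAR_ID", "PRODUCTLINE"])["SALES"].sum()
    # Pivot product lines into columns; combinations with no sales become 0.
    return totals.unstack(fill_value=0.0)


# Toy rows (hypothetical values).
toy = pd.DataFrame(
    {
        "YEAR_ID": [2003, 2003, 2004, 2004],
        "PRODUCTLINE": ["Classic Cars", "Motorcycles", "Classic Cars", "Classic Cars"],
        "STATUS": ["Shipped", "Shipped", "Cancelled", "Shipped"],
        "SALES": [1000.0, 500.0, 700.0, 1200.0],
    }
)

table = completed_sales_by_year_and_line(toy)
print(table)
```

Swapping `YEAR_ID` for `QTR_ID`, `MONTH_ID`, or `COUNTRY` in the `groupby` would give the same breakdown by quarter, month, or region.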

I am currently an intern in the HNG Internship boot camp that builds apps and solves different problems in teams. The HNG Internship is a fast-paced boot camp for learning digital skills. HNG Hire makes it easy to find and hire elite talent.
