In the budding field of Business Analytics, there are three broad categories of analytical techniques: Descriptive, Predictive, and Prescriptive.
Descriptive analytics techniques describe what has happened in the past. They help summarize historical data, find patterns or relationships in the data, and monitor specific variables for performance or decision-making purposes. Examples of descriptive analytics techniques include data visualization, descriptive statistics (e.g. mean, median, variance, and standard deviation), data dashboards, and basic spreadsheet models.
Predictive analytics techniques use models constructed from historical data to predict the impact of one variable on another. Statistical methods like linear regression and time series analysis are examples of predictive analytics techniques. Another example is simulation, which uses probability and statistics to build a computer model for studying the impact of uncertain scenarios on a decision.
Prescriptive analytics techniques take models constructed through predictive analytics and combine them with theories, rules, or logic to prescribe particular courses of action and decisions. Prescriptive analytics techniques include optimization models (linear, non-linear, integer, and simulation) and decision analysis.
This project focuses on descriptive analytics applied to the sample sales dataset. For clarity and ease of reference, we will call the dataset the Photon Dataset and refer to the automobile company as Photon Inc.
Data Preprocessing
One of the first things to do when working with a dataset in R is to set the working directory before importing the dataset into R.
> setwd("~/R-Studio Practice")#This is to set the working directory
> Photon.Dataset <- read.csv("C:...\\Learn Python and R\\sales_data_sample.csv")
Next, we explore the dataset to determine what columns and variables it contains. This also informs the types of analyses one can perform and the kinds of insights to look out for.
> head(Photon.Dataset)
#The output is as follows:
ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE
1 10107 30 95.70 2 2871.00 2/24/2003 0:00
2 10121 34 81.35 5 2765.90 5/7/2003 0:00
3 10134 41 94.74 2 3884.34 7/1/2003 0:00
4 10145 45 83.26 6 3746.70 8/25/2003 0:00
5 10159 49 100.00 14 5205.27 10/10/2003 0:00
6 10168 36 96.66 1 3479.76 10/28/2003 0:00
STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE
1 Shipped 1 2 2003 Motorcycles 95 S10_1678
2 Shipped 2 5 2003 Motorcycles 95 S10_1678
3 Shipped 3 7 2003 Motorcycles 95 S10_1678
4 Shipped 3 8 2003 Motorcycles 95 S10_1678
5 Shipped 4 10 2003 Motorcycles 95 S10_1678
6 Shipped 4 10 2003 Motorcycles 95 S10_1678
CUSTOMERNAME PHONE ADDRESSLINE1
1 Land of Toys Inc. 2125557818 897 Long Airport Avenue
2 Reims Collectables 26.47.1555 59 rue de l'Abbaye
3 Lyon Souveniers +33 1 46 62 7555 27 rue du Colonel Pierre Avia
4 Toys4GrownUps.com 6265557265 78934 Hillside Dr.
5 Corporate Gift Ideas Co. 6505551386 7734 Strong St.
6 Technics Stores Inc. 6505556809 9408 Furth Circle
ADDRESSLINE2 CITY STATE POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME
1 NYC NY 10022 USA <NA> Yu
2 Reims 51100 France EMEA Henriot
3 Paris 75508 France EMEA Da Cunha
4 Pasadena CA 90003 USA <NA> Young
5 San Francisco CA USA <NA> Brown
6 Burlingame CA 94217 USA <NA> Hirano
CONTACTFIRSTNAME DEALSIZE
1 Kwai Small
2 Paul Small
3 Daniel Medium
4 Julie Medium
5 Julie Medium
6 Juri Medium
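head() shows the first six rows, but it does not summarize the column types. For a quick overview of every column and its type, str() (or dim() for the overall size) works well; this is just an extra sanity check, not part of the original walkthrough:
> str(Photon.Dataset) #Lists every column with its type and the first few values
> dim(Photon.Dataset) #Number of rows and columns in the dataset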
The next step is to check for missing data in the dataset. This is to ensure that the dataset is complete, or at least that the variables we are most interested in have no missing values.
> is.na(Photon.Dataset) #Returns TRUE for every cell whose value is missing
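Printed on the whole data frame, is.na() returns one TRUE/FALSE per cell, which is hard to scan. A more compact check (an optional alternative, not part of the original code) is to count the missing values per column:
> colSums(is.na(Photon.Dataset)) #Number of NA values in each column
> sum(is.na(Photon.Dataset)) #Total number of NA values in the whole dataset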
The output indicates the ADDRESSLINE2, STATE, and POSTALCODE columns have missing values. Given that these columns are not particularly important to the insights we are trying to glean from the dataset, we can either remove them or leave them as they are. I will use the dplyr library to remove the unneeded columns.
> library(dplyr) #This loads the dplyr library into the session.
> Photon.Dataset2 <- Photon.Dataset %>%
select(-ADDRESSLINE2, -STATE, -POSTALCODE) #This removes the ADDRESSLINE2, STATE, and POSTALCODE columns
> Photon.Dataset$TERRITORY[is.na(Photon.Dataset$TERRITORY)] <- "North America" #This replaces NA entries in the TERRITORY column with "North America"; in the raw file "NA" stands for North America, but R reads it as missing data.
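To confirm the replacement worked, a quick sanity check (not part of the original walkthrough) is to tabulate the TERRITORY column; an NA category would still show up if any values remained missing:
> table(Photon.Dataset$TERRITORY, useNA = "ifany") #Counts of orders per territory, including any remaining NA values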
That is all the data preprocessing we will be conducting on this dataset.
Sales Analysis
In this section, we will compute the total sales volume, identify the top-selling product lines, plot sales by product line, and examine the correlation between quantity ordered and sales as well as the relationship between price and sales.
> Total_sales <- Photon.Dataset %>%
summarise(Total_sales = sum(SALES)) %>%
pull(Total_sales)
> print(paste("Total_sales:", round(Total_sales, 3)))
[1] "Total_sales: 10032628.85" #This is the total sales volume
To determine the top-selling product lines and rank them by total sales, we did the following:
> top_product_lines <- Photon.Dataset %>%
group_by(PRODUCTLINE) %>%
summarise(total_sales = sum(SALES)) %>%
arrange(desc(total_sales))
> print(top_product_lines)
# A tibble: 7 × 2
PRODUCTLINE total_sales
<chr> <dbl>
1 Classic Cars 3919616.
2 Vintage Cars 1903151.
3 Motorcycles 1166388.
4 Trucks and Buses 1127790.
5 Planes 975004.
6 Ships 714437.
7 Trains 226243.
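If we also want to see how much each product line contributes to the total, one extra mutate() line on the same result gives the percentage share (a small optional extension, not part of the original output):
> top_product_lines %>%
mutate(sales_share = total_sales / sum(total_sales) * 100) #Each product line's percentage share of total sales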
Next, we create a visualization of the top-selling product lines.
> library(ggplot2)
> library(scales) #for formatting the axis numbers
> ggplot(top_product_lines, aes(x = reorder(PRODUCTLINE, -total_sales), y = total_sales)) + #aes defines the aesthetics: reorder() sorts the product lines on the X axis by descending total sales, and total_sales goes on the Y axis
geom_bar(stat = "identity") + #bar heights are taken directly from total_sales
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + #tilts the X-axis labels so they do not overlap
labs(x = "Product Line", y = "Total Sales", title = "Sales by Product Line") + #This defines the axis labels and the chart title
scale_y_continuous(labels = label_number(big.mark = ",", prefix = "$")) #This formats the Y-axis labels with a dollar sign and thousands separators
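If you want to keep a copy of the chart, ggsave() writes the most recently displayed plot to a file; the file name and dimensions below are only examples:
> ggsave("sales_by_product_line.png", width = 8, height = 5) #Saves the last ggplot to the working directory as a PNG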
The next task is to determine the correlation between quantity ordered and sales, and to plot this relationship.
> correlation <- cor(Photon.Dataset$QUANTITYORDERED, Photon.Dataset$SALES)
> print(paste("Correlation between QUANTITYORDERED and SALES:", round(correlation, 3)))
> ggplot(Photon.Dataset, aes(x = QUANTITYORDERED, y = SALES)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Quantity Ordered", y = "Sales", title = "Relationship between Quantity Ordered and Sales")
`geom_smooth()` using formula = 'y ~ x'
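The cor() call above returns only a point estimate. If you also want a confidence interval and a significance test for the same correlation, base R's cor.test() provides both (an optional extra check, not part of the original code):
> cor.test(Photon.Dataset$QUANTITYORDERED, Photon.Dataset$SALES) #Pearson correlation with confidence interval and p-value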
Lastly, to determine the correlation between the price of the product and sales, we used the following:
> ggplot(Photon.Dataset, aes(x = PRICEEACH, y = SALES)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(x = "Price Each", y = "Sales", title = "Relationship between Price of Product and Sales")#When writing a code like this, remember to input the “+” after each line. This is so that R can treat the code as one code.
> price_sales_correlation <- cor(Photon.Dataset$PRICEEACH, Photon.Dataset$SALES)
> print(paste("Correlation between PRICEEACH and SALES:", round(price_sales_correlation, 3)))
[1] "Correlation between PRICEEACH and SALES: 0.658"