In the budding field of Business Analytics, there are three broad categories of analytical techniques: Descriptive, Predictive, and Prescriptive.
Descriptive analytics techniques describe what has happened in the past. They help summarize historical data, find patterns or relationships in the data, and monitor specific variables for performance or decision-making purposes. Examples of descriptive analytics techniques include data visualization, descriptive statistics (e.g. mean, median, variance, and standard deviation), data dashboards, and basic spreadsheet models.
Predictive analytics techniques use models constructed from historical data to predict the impact of one variable on another. Statistical methods like linear regression and time series analysis are examples of predictive analytics techniques. Another example is simulation, which uses probability and statistics to build a computer model for studying the impact of uncertain scenarios on a decision.
Prescriptive analytics techniques take models constructed through predictive analytics and combine them with theories, rules, or logic to prescribe particular courses of action and decisions. Prescriptive analytics techniques include optimization models (linear, non-linear, integer, and simulation) and decision analysis.
This project focuses on descriptive analytics applied to the sample sales dataset. For clarity and ease of reference, we will call the dataset the Photon Dataset and refer to the automobile company as Photon Inc.
Data Preprocessing
One of the first things to do when working with a dataset in R is to set the working directory before importing the dataset into R.
> setwd("~/R-Studio Practice")#This is to set the working directory
> Photon.Dataset <- read.csv("C:...\\Learn Python and R\\sales_data_sample.csv")
Next, we explore the dataset to determine what columns and variables it contains. This also informs the types of analyses one can perform and the kinds of insights to look out for.
> head(Photon.Dataset)
#The output is as follows:
ORDERNUMBER QUANTITYORDERED PRICEEACH ORDERLINENUMBER SALES ORDERDATE
1 10107 30 95.70 2 2871.00 2/24/2003 0:00
2 10121 34 81.35 5 2765.90 5/7/2003 0:00
3 10134 41 94.74 2 3884.34 7/1/2003 0:00
4 10145 45 83.26 6 3746.70 8/25/2003 0:00
5 10159 49 100.00 14 5205.27 10/10/2003 0:00
6 10168 36 96.66 1 3479.76 10/28/2003 0:00
STATUS QTR_ID MONTH_ID YEAR_ID PRODUCTLINE MSRP PRODUCTCODE
1 Shipped 1 2 2003 Motorcycles 95 S10_1678
2 Shipped 2 5 2003 Motorcycles 95 S10_1678
3 Shipped 3 7 2003 Motorcycles 95 S10_1678
4 Shipped 3 8 2003 Motorcycles 95 S10_1678
5 Shipped 4 10 2003 Motorcycles 95 S10_1678
6 Shipped 4 10 2003 Motorcycles 95 S10_1678
CUSTOMERNAME PHONE ADDRESSLINE1
1 Land of Toys Inc. 2125557818 897 Long Airport Avenue
2 Reims Collectables 26.47.1555 59 rue de l'Abbaye
3 Lyon Souveniers +33 1 46 62 7555 27 rue du Colonel Pierre Avia
4 Toys4GrownUps.com 6265557265 78934 Hillside Dr.
5 Corporate Gift Ideas Co. 6505551386 7734 Strong St.
6 Technics Stores Inc. 6505556809 9408 Furth Circle
ADDRESSLINE2 CITY STATE POSTALCODE COUNTRY TERRITORY CONTACTLASTNAME
1 NYC NY 10022 USA <NA> Yu
2 Reims 51100 France EMEA Henriot
3 Paris 75508 France EMEA Da Cunha
4 Pasadena CA 90003 USA <NA> Young
5 San Francisco CA USA <NA> Brown
6 Burlingame CA 94217 USA <NA> Hirano
CONTACTFIRSTNAME DEALSIZE
1 Kwai Small
2 Paul Small
3 Daniel Medium
4 Julie Medium
5 Julie Medium
6 Juri Medium
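head() shows the first six rows, but it does not summarize the column types. For a quick overview of every column and its type, str() (or dim() for the overall size) works well; this is just an extra sanity check, not part of the original walkthrough:
> str(Photon.Dataset) #Lists every column with its type and the first few values
> dim(Photon.Dataset) #Number of rows and columns in the dataset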
The next step is to check for missing data in the dataset. This is to ensure that the dataset is complete, or at least that the variables we are most interested in have no missing values.
> is.na(Photon.Dataset) #Returns TRUE for every cell whose value is missing
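Printed on the whole data frame, is.na() returns one TRUE/FALSE per cell, which is hard to scan. A more compact check (an optional alternative, not part of the original code) is to count the missing values per column:
> colSums(is.na(Photon.Dataset)) #Number of NA values in each column
> sum(is.na(Photon.Dataset)) #Total number of NA values in the whole dataset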
The output indicates the ADDRESSLINE2, STATE, and POSTALCODE columns have missing values. Given that these columns are not particularly important to the insights we are trying to glean from the dataset, we can either remove them or leave them as they are. I will use the dplyr library to remove the unneeded columns.
> library(dplyr) #This loads the dplyr library into the session.
> Photon.Dataset2 <- Photon.Dataset %>%
select(-ADDRESSLINE2, -STATE, -POSTALCODE) #This removes the ADDRESSLINE2, STATE, and POSTALCODE columns
> Photon.Dataset$TERRITORY[is.na(Photon.Dataset$TERRITORY)] <- "North America" #This replaces NA entries in the TERRITORY column with "North America"; in the raw file "NA" stands for North America, but R reads it as missing data.
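To confirm the replacement worked, a quick sanity check (not part of the original walkthrough) is to tabulate the TERRITORY column; an NA category would still show up if any values remained missing:
> table(Photon.Dataset$TERRITORY, useNA = "ifany") #Counts of orders per territory, including any remaining NA values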
That is all the data preprocessing we will be conducting on this dataset.
Sales Analysis
In this section, we will compute the total sales volume, identify the top-selling product lines, plot sales by product line, and examine the correlation between quantity ordered and sales as well as the relationship between price and sales.
> Total_sales <- Photon.Dataset %>%
summarise(Total_sales = sum(SALES)) %>%
pull(Total_sales)
> print(paste("Total_sales:", round(Total_sales, 3)))
[1] "Total_sales: 10032628.85" #This is the total sales volume
To determine the top-selling product lines and rank them by total sales, we did the following:
> top_product_lines <- Photon.Dataset %>%
group_by(PRODUCTLINE) %>%
summarise(total_sales = sum(SALES)) %>%
arrange(desc(total_sales))
> print(top_product_lines)
# A tibble: 7 × 2
PRODUCTLINE total_sales
<chr> <dbl>
1 Classic Cars 3919616.
2 Vintage Cars 1903151.
3 Motorcycles 1166388.
4 Trucks and Buses 1127790.
5 Planes 975004.
6 Ships 714437.
7 Trains 226243.
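If we also want to see how much each product line contributes to the total, one extra mutate() line on the same result gives the percentage share (a small optional extension, not part of the original output):
> top_product_lines %>%
mutate(sales_share = total_sales / sum(total_sales) * 100) #Each product line's percentage share of total sales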
Next, we create a visualization of the top-selling product lines.
> library(ggplot2)
> library(scales) #for formatting the axis numbers
> ggplot(top_product_lines, aes(x = reorder(PRODUCTLINE, -total_sales), y = total_sales)) + #aes defines the aesthetics: reorder() sorts the product lines on the X axis by descending total sales, and total_sales goes on the Y axis
geom_bar(stat = "identity") + #bar heights are taken directly from total_sales
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + #tilts the X-axis labels so they do not overlap
labs(x = "Product Line", y = "Total Sales", title = "Sales by Product Line") + #This defines the axis labels and the chart title
scale_y_continuous(labels = label_number(big.mark = ",", prefix = "$")) #This formats the Y-axis labels with a dollar sign and thousands separators
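If you want to keep a copy of the chart, ggsave() writes the most recently displayed plot to a file; the file name and dimensions below are only examples:
> ggsave("sales_by_product_line.png", width = 8, height = 5) #Saves the last ggplot to the working directory as a PNG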
The next task is to determine the correlation between quantity ordered and sales, and to plot this relationship.
> correlation <- cor(Photon.Dataset$QUANTITYORDERED, Photon.Dataset$SALES)
> print(paste("Correlation between QUANTITYORDERED and SALES:", round(correlation, 3)))
> ggplot(Photon.Dataset, aes(x = QUANTITYORDERED, y = SALES)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(x = "Quantity Ordered", y = "Sales", title = "Relationship between Quantity Ordered and Sales")
`geom_smooth()` using formula = 'y ~ x'
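The cor() call above returns only a point estimate. If you also want a confidence interval and a significance test for the same correlation, base R's cor.test() provides both (an optional extra check, not part of the original code):
> cor.test(Photon.Dataset$QUANTITYORDERED, Photon.Dataset$SALES) #Pearson correlation with confidence interval and p-value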
Lastly, to determine the correlation between the price of the product and sales, we used the following:
> ggplot(Photon.Dataset, aes(x = PRICEEACH, y = SALES)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = FALSE, color = "blue") +
labs(x = "Price Each", y = "Sales", title = "Relationship between Price of Product and Sales")#When writing a code like this, remember to input the “+” after each line. This is so that R can treat the code as one code.
> price_sales_correlation <- cor(Photon.Dataset$PRICEEACH, Photon.Dataset$SALES)
> print(paste("Correlation between PRICEEACH and SALES:", round(price_sales_correlation, 3)))
[1] "Correlation between PRICEEACH and SALES: 0.658"