
Exploratory Data Analysis, Feature Engineering and Modelling using Supermarket Sales Data. (Part 1)

Rising Odegua

Originally posted on Towards Data Science
In this series of posts, we’re going to dive deep and fully explore the amazing world of data exploration, feature engineering and modelling. If you are a beginner in machine learning and data science and need a practical, intuitive explanation of these concepts, then this series is for you.

The series is divided into three parts which I’ll write over the coming weeks.

Part 1: Learn how to explore and gain insights from a dataset.
Part 2: Learn the basics of feature engineering with knowledge gained from data exploration.
Part 3: Modelling and feature importance.

Before we start this journey, we’ll need to define some important terms that we’ll use in this and subsequent posts.
First, EDA (Exploratory Data Analysis): what is it?

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. (source: Wikipedia)

In summary, EDA can show us hidden relationships and attributes present in our data even before we throw it at a machine learning model.

Next up, FE (Feature Engineering): who are you?


Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. (source: Wikipedia)

In summary, FE is simply using your existing knowledge of the dataset to create new features that can help a machine learning model perform better. Let’s see an example. Assume I have a dataset of different car types, as shown below:

sample car dataset

Now, let’s say I want to predict the price of a car from the information in this dataset. I can use my domain knowledge of cars to create extra features that can help a machine learning model perform better on this task, features like the speed of the car or the type of engine. I could even take it further by adding boolean features like:


  1. Does the car have an infotainment system?
  2. Does the car have rear-seat entertainment?
  3. Is the car an all-wheel or a four-wheel drive? etc.

You get the hang of it. I could also create features from the existing ones by doing what we call feature crossing (more on this in the next post).

Now that we’re familiar with the terms EDA and FE, let’s get our data and start exploring.
We’ll be using a dataset of supermarket sales provided by Data Science Nigeria. Here’s a link to the data.
We’ll use the popular prototyping tool Jupyter Notebook, and sorry “R” folks, I’ll be using Python for this exploration.

After downloading your data, place it in the same folder as your notebook so you can access it. Start your notebook and import the following libraries as shown below.

Import cell

  1. Here, we import numpy, pandas and matplotlib, which are used for array manipulation, CSV processing and plot visualization respectively.
  2. Here, we import the stats module, which contains statistical functions such as norm and skew that we’ll use for calculating some statistics.
  3. We import seaborn, a powerful plotting library built on top of matplotlib. We’ll use this for creating some insightful plots.
  4. Python may throw some annoying warnings; we silence these using the ignore_warns function.
  5. We import os, a package for accessing files and folders easily. We use os.listdir to show the contents of our current (present working) directory.
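Since the original screenshot of the import cell isn’t reproduced here, below is a minimal sketch of what it likely contained, following the five steps above (the ignore_warns helper is reconstructed from the description):

```python
# 1. numpy for arrays, pandas for CSV processing, matplotlib for plots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# 2. statistical functions such as norm and skew
from scipy import stats
from scipy.stats import norm, skew

# 3. seaborn, a plotting library built on top of matplotlib
import seaborn as sns

# 4. silence Python's warnings by replacing warnings.warn with a no-op
import warnings
def ignore_warns(*args, **kwargs):
    pass
warnings.warn = ignore_warns

# 5. os for file/folder access; list the present working directory
import os
print(os.listdir("."))
```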

Next, let’s read in our data.

Reading in dataset

  1. We read in our data using the pandas read_csv() command.
  2. We print out the first five rows of the data.

Note: I transposed the train.head() command because I wanted to see all rows and columns on a single page without scrolling horizontally.
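A sketch of that cell (the file name train.csv is an assumption, inferred from the train variable used throughout this post):

```python
# 1. read the CSV into a pandas DataFrame
train = pd.read_csv("train.csv")

# 2. print the first five rows, transposed so all columns fit on one page
print(train.head().T)
```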

Now that we can see our data and the features it contains, let’s get down to business.


First, we need to know our target variable.

We could pick any feature to be our target and that will in turn tell us the kind of model we’ll build.

For example, if we pick Product_Fat_Content to be our target variable, then we’ll have a classification problem, and if we decide to pick Product_Price, then it becomes a regression problem.

But since we’re not concerned with building models in this post, we’ll just assume we’re trying to predict Product_Supermarket_Sales (the total amount of sales made by a supermarket).

Next, let’s get to know our features. Since the number of features is small, we can look at them manually and instantly remove the ones that aren’t worth exploring or putting into our model. We’ll use our domain knowledge of a supermarket for this.

I usually take out a pen and paper, draw five columns as shown below and manually fill them in for all features:
Feature == Important == In-between == Not-important == Reason

You could do this any other way though. Let’s look at these features.

  • Product_Identifier: This is a unique ID for each particular product.
    • Verdict: In-between.
    • Reason: Sometimes it is best to remove any Unique ID columns because our model can overfit to this, but sometimes it may help. We can experiment with this.
  • Supermarket_Identifier: This is a unique ID for each supermarket.
    • Verdict: In-between.
    • Reason: Same as above
  • Product_Supermarket_Identifier: This is a concatenation of the product and supermarket identifier.
    • Verdict: In-between
    • Reason: This feature can be used in place of Product_Identifier and Supermarket_Identifier. It is probably a cross between the two features. It might be important, but may drastically increase our dimension space when encoding. (More on this in a later post.)
  • Product_Weight: Weight of the product.
    • Verdict: Important
    • Reason: The weight of a product may contribute to its price, which in turn contributes to the total supermarket sales.
  • Product_Fat_Content: The amount of fat contained in the product.
    • Verdict: Important
    • Reason: Fat content can also influence price, which in turn contributes to total Supermarket sales.
  • Product_Shelf_Visibility: According to the data description, this is the total display area in percentage allocated to a group of product.
    • Verdict: Important
    • Reason: The visibility of a product will probably determine if that product will be seen by customers or not. This can influence sales of that product.
  • Product_Type: This is the category to which the product belongs.
    • Verdict: Important
    • Reason: Some types of products will generally sell more than others.
  • Product_Price: The price of the product.
    • Verdict: Important
    • Reason: This is certainly an important feature because the price of products will definitely affect our target variable, Product_Supermarket_Sales.
  • Supermarket_Opening_Year: The year the supermarket was opened.
    • Verdict: Important
    • Reason: The year of opening may influence the sales of products.
  • Supermarket_Size: This is the size of the supermarket. It has been categorized into small, medium and high.
    • Verdict: Important
    • Reason: The size of a supermarket can mean more sales by the supermarket.
  • Supermarket_Location_Type: This feature is a description of the location of the supermarket. According to the data, there are 3 clusters (Clusters 1, 2 and 3). We do not know which ranks higher, but we want to keep this feature.
    • Verdict: Important
    • Reason: The location type might contain information about the places supermarkets are located in, whether they’re in urban, rural, busy or isolated areas.
  • Supermarket_Type: The type of supermarket, categorized into Grocery Store, Supermarket Type1, Supermarket Type2 and Supermarket Type3.
    • Verdict: Important
    • Reason: It gives us useful information about the supermarket which may be important in predicting total Sales.
  • Product_Supermarket_Sales: This is our target variable for this post. It is the total amount of sales made in the supermarket.
    • Verdict: Yes, it is important!
    • Reason: It is our target variable.

Now that we’ve manually gone through our data and found almost all the features to be important (some are still under scrutiny though), we can start our EDA.

The first thing I like to do when doing EDA on a dataset with a reasonable number of numeric columns is to check the relationship between my target variable and these numeric features. One quick way to do this is with a seaborn heatmap, which takes the correlation matrix calculated on the numerical features in our dataset and makes a heated plot of it. Enter the code below to make a heatmap plot.
heatmap

  1. First, we calculate the correlation matrix of the data using the pandas .corr() function and store it in the variable corrmat.
  2. Here, we simply set the figure size of our plot.
  3. Here, we pass corrmat to the seaborn heatmap() function, which makes the beautiful plot.
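Since the screenshot isn’t shown here, a sketch of that cell follows; the figure size is an assumption, and note that newer pandas versions may require corr(numeric_only=True) on a DataFrame with non-numeric columns:

```python
# 1. correlation matrix of the numeric columns
corrmat = train.corr()

# 2. set the figure size of our plot
plt.subplots(figsize=(12, 9))

# 3. draw the heatmap from the correlation matrix
sns.heatmap(corrmat, vmax=1.0, square=True)
plt.show()
```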

Now let’s interpret the plot above. Take a look at it and see what insight you can draw from it.
If you don’t know how to interpret a heatmap plot, the next few paragraphs are for you. If you do, please skip them.

Heatmap plot of Supermarket dataset

Interpreting Heatmaps

Each square in a heatmap shows how strongly two features correlate, i.e. whether there is some kind of relationship between them. Does an increase in one feature result in an increase in the other (positive correlation), or does an increase in one result in a decrease in the other (negative correlation)?

  1. The vertical scale running from 0.0 to 1.0 shows you the relative size of the correlation between any two features, where hotter colors mean higher correlation.
  2. The almost-whitish diagonal of the matrix shows that every feature is perfectly correlated with itself (of course it should be).

Now, let’s interpret the labels inside the square.

Tracing label 3 horizontally to the left leads to Product_Weight, and tracing it vertically downwards leads to Product_Price. This means square (3) shows us how correlated Product_Weight and Product_Price are. We can see that they’re not really correlated, as the square is almost cold.

Label 4 shows the correlation between Product_Price and Product_Supermarket_Sales. We clearly see that the region is hot, meaning there’s a high correlation.

And that’s how you interpret a heatmap.

Okay, you’ve just graduated with a degree in heatmap reading. Now, you’re at your first job and the heatmap below is laid before you.

Heatmap of dataset

Your boss looks up at you with full confidence that he has hired the right guy and demands an answer.
You get up, take one look at the heatmap, smile and say:

Well, this plot shows that Product_Price is the feature most correlated with our target of interest, Product_Supermarket_Sales. It matches our intuition that if more expensive products are sold, then the total Product_Supermarket_Sales will likely increase, and vice versa.

Your boss nods and asks you to go on. You’re full of confidence now and continue.

Another correlation I can see — though small — is between Product_Price and Supermarket_Opening_Year.

“So what does this mean?” your boss asks.

Well, in summary, this tells us that the supermarket opening year seems to affect the prices of products. The effect could be positive (newer supermarkets sell more) or negative (older supermarkets have more loyal customers and sell more). I’d have to investigate to find out more about this.

Besides that, there are no other interesting correlations in this plot, sir. But these are just the numeric features; I’ll still have to check the categorical features.

Your boss is happy; he gives you a firm handshake and asks you to investigate further and report to him the next day.

And off you go, to do some more exploration.
First, you decide to investigate the relationships between all the features in the dataset. You want to plot each feature against the others, but in one go, without writing multiple for loops.
Well, Seaborn’s pairplot( ) comes to the rescue.

seaborn pairplot on supermarket dataset

If you’ve not seen a seaborn pairplot before, this is what it looks like. It’s basically a one-line wonder: it plots all the features in a dataset against each other, just like the heatmap() function, except it gives a more visual plot. We can also specify a categorical feature to use for coloring.

Note: The Seaborn pairplot only plots numerical features, but we can color our plots based on a selected categorical column. Also, the pairplot will throw an error if the data set you pass into it contains missing values.

So first, let’s check for columns with missing values, so we don’t add them to our pairplot.

Check for missing values
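A quick sketch of that check:

```python
# count missing values per column and show only the columns that have any
missing = train.isnull().sum()
print(missing[missing > 0])
```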

Well, we have two columns with missing values: Supermarket_Size, which is a categorical feature, and Product_Weight, which is a numeric feature. We’ll work on these missing values in the next post; for now, let’s just exclude the two columns from our pairplot() columns.

code block to pairplot features

  1. Here we list out all the categorical features in our dataset (we’ll be using them for coloring, meaning we’ll have one pairplot for each categorical feature).
  2. Inside the for loop, we list out all the columns without missing values.
  3. We create a new figure each time.
  4. We call the one-line wonder pairplot() and pass in our data train and the columns we want. We specify a size of 3.0 so we can see the complete pairplot without scrolling.
  5. The hue parameter takes a column from our categorical columns and uses it to color the pairplot.
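A sketch of that cell; the column lists are assumptions inferred from the plots below, and newer seaborn versions use height where older ones used size:

```python
# 1. all the categorical features in the dataset (one pairplot per feature)
cat_cols = ['Product_Fat_Content', 'Product_Type',
            'Supermarket_Location_Type', 'Supermarket_Type']

for col in cat_cols:
    # 2. numeric columns without missing values (Product_Weight is excluded)
    num_cols = ['Product_Shelf_Visibility', 'Product_Price',
                'Supermarket_Opening_Year', 'Product_Supermarket_Sales']
    # 3. create a new figure each time
    plt.figure()
    # 4./5. pairplot the chosen columns, coloring (hue) by the current category
    sns.pairplot(train, vars=num_cols, hue=col, height=3.0)
    plt.show()
```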

After running the code above, we’ll have a total of four plots. Let’s look at them one after the other and try to interpret or draw insights from them.

Plot one: Category (Product_Fat_Content)

Interpretation:

  • We can see that a vast majority of products contain Normal Fat. So what can we possibly do with this knowledge? Well, we could create a boolean feature column is_normal_fat that separates Normal Fat (the majority class) from the other two, Low Fat and Ultra Low Fat (the minority classes). It’s just a suggestion.
  • There is a positive trend between product price and supermarket sales. This confirms the positive correlation shown by the heatmap plot. So what can we possibly do with this knowledge? Well, this feature is definitely very important, and we could create more features from it.
  • We also notice that our target variable (Product_Supermarket_Sales) and the variable Product_Shelf_Visibility are right-skewed. We can transform them using the log function to make them more normally distributed (see the sketch after this list); we’ll do this later. We can also see that the prices of Low Fat and Ultra Low Fat products are relatively higher than those of Normal Fat products. So what can we possibly do with this knowledge? Well, we could create a categorical feature where we set the price of Normal Fat products to low and Low/Ultra Low Fat products to high.
  • Another thing I noticed is that there seems to be a gap in Supermarket_Opening_Year. Notice the cluster of supermarkets in the 90s and another cluster in the 2000s. What can we possibly do with this knowledge? Well, we could create a new feature column to capture these two clusters.
  • I couldn’t notice any other interesting relationship in this pairplot, so we move on to the next.
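As a quick illustration of the log transform mentioned above, a one-line sketch per column (np.log1p computes log(1 + x), which handles zero values gracefully; we’ll apply this properly in the next post):

```python
# log-transform the right-skewed columns to pull in the long right tail
train['Product_Supermarket_Sales_log'] = np.log1p(train['Product_Supermarket_Sales'])
train['Product_Shelf_Visibility_log'] = np.log1p(train['Product_Shelf_Visibility'])
```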

Plot two: Category (Product_Type)

Interpretation:

  • From the Product_Type category, Household, Health-and-Hygiene and Others dominate the plot. So what can we possibly do with this knowledge? Well, we could create a feature column that classifies the data into two groups: one group captures Household, Health-and-Hygiene and Others, and the other captures the rest of the products. Who knows? It might be a good feature.
  • Nothing else seemed interesting, so I’ll stop here for now and go to the next plot.

Plot three: Category (Supermarket_Location_Type)

Interpretation:

  • It seems Cluster 2 is the most expensive location. Its Product_Price and Product_Supermarket_Sales look higher.
  • We also notice that supermarkets in cluster 2 are newer (2010 upwards).
  • Cluster 1 has lower product prices, and Cluster 3 sits in between. So what can we possibly do with this knowledge? Well, now we know that Supermarket_Location_Type is a very important feature which seems to have some ordinality. If we’re to label-encode this variable, we should try to keep this ordinal structure (see the sketch after this list).
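Here is a sketch of such an ordinal encoding; the exact cluster strings and their ordering are assumptions based on the plot reading above:

```python
# hypothetical ordinal encoding that preserves the observed price ordering
# (Cluster 1 lowest, Cluster 3 in between, Cluster 2 highest -- an assumption)
location_order = {'Cluster 1': 0, 'Cluster 3': 1, 'Cluster 2': 2}
train['Supermarket_Location_Type_enc'] = train['Supermarket_Location_Type'].map(location_order)
```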

Remember, if you notice any interpretation I missed, use the comment box below.

Plot four: Category (Supermarket_Type)

Interpretation:
Looking at the plot, we can instantly see that the different supermarket types are almost finely separated. There are lots of insights we can draw from this:

  • First, we can see that the price of goods in SuperMarket Type3 generally starts from around 350, the price of goods in SuperMarket Type1 is approximately between 0 and 250, SuperMarket Type2 is somewhere between 250 and 350, and Grocery Store takes prices along all ranges. What can we do with this insight? We could create categorical columns to capture these ranges, or we could create bins and group each supermarket according to its price range (a sketch of both ideas follows this list).
  • Looking at Supermarket_Opening_Year, we immediately notice that supermarkets belonging to the category SuperMarket Type3 were all opened in the 90s, meaning the feature we proposed to create earlier (partitioning opening years into two clusters) will be quite important.
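A sketch of both ideas; the bin edges and the year cutoff are assumptions read off the plot:

```python
# hypothetical price bands based on the ranges observed in the plot
train['Product_Price_band'] = pd.cut(train['Product_Price'],
                                     bins=[0, 250, 350, train['Product_Price'].max()],
                                     labels=['low', 'mid', 'high'])

# hypothetical flag separating the 90s cluster of opening years from the 2000s one
train['opened_in_90s'] = (train['Supermarket_Opening_Year'] < 2000).astype(int)
```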

As always, if you notice any good interpretation I may have missed, please notify me in the comment box below.


Wheeeeew! It’s been a long first post. Hopefully, you learnt a lot.

In my next post, we’ll talk about Feature Engineering and Creation. There, we’ll bring to life all the intuition we’ve gathered while doing this EDA.

My parting advice to you is to get a dataset, poke it, tear it apart, find the hidden gems and derive insights from it, and have fun while at it.

But remember, EDA is an art; it can be done in many different ways. The steps listed above are by no means an outline you must follow rigidly. Do what works for you.

So, go have fun doing some exploration. Bye for now!
Notebook and data for this post can be found here
Questions, comments and contributions are always welcome.
