What is exploratory data analysis
Data is an invaluable resource that, when used appropriately, yields information that shapes decisions across the world. In its raw form, however, data is of little use; it must be transformed before insights can be extracted from it. To do that, one must first understand the data, and this is where exploratory data analysis comes in. Data analysis is a crucial component of many fields, from data science to business, and exploratory data analysis is its first and most crucial step.
Put simply, exploratory data analysis (EDA) is getting to know your data. You examine the dataset closely to identify patterns and anomalies, then manipulate and visualize it to generate insights that inform further analysis.
Objectives of exploratory data analysis
There are two main goals of exploratory data analysis:
- Gain a deep understanding of the data, its quality and structure.
- Get a wider perspective of the data: understand the relationships between variables, and use those relationships to develop models of the data.
Types of exploratory data analysis
There are two major types of exploratory data analysis:
- Univariate exploratory data analysis
- Multivariate exploratory data analysis
Univariate EDA
Here, "uni" means one and "variate" comes from variable, so in univariate EDA we examine a single variable in isolation. Techniques employed in univariate EDA include computing summary statistics for the measures of central tendency and dispersion, and visualizing the data graphically with histograms and box plots. A histogram is a graph in which each bar along the x-axis represents the frequency of values falling in an interval. A box plot is a rectangle drawn to represent the 5-number summary of the data: the minimum value, lower quartile, median, upper quartile and maximum value.
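As a minimal sketch of these univariate techniques on a single hypothetical column (the column name and values below are illustrative, not taken from the survey dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import pandas as pd

# Hypothetical single variable: yearly salaries in EUR (illustrative values)
salaries = pd.Series([48000, 52000, 55000, 60000, 61000, 65000, 70000, 120000],
                     name="salary")

# Measures of central tendency and dispersion
print(salaries.mean(), salaries.median(), salaries.std())

# The 5-number summary that a box plot draws
five_num = salaries.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_num)

# Graphical views: histogram and box plot
salaries.plot.hist(bins=5)
salaries.plot.box()
```

Note how the one outlying value (120000) pulls the mean well above the median; this is exactly the kind of observation univariate EDA is meant to surface.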
Multivariate EDA
Here, multi stands for multiple and variate stands for variable, meaning multivariate EDA is where you consider multiple variables in a dataset. The goal is to determine the correlations between the variables. Multivariate EDA techniques include:
- Scatter plots: a graphical plot of two quantitative variables on the x–y axes.
- Heatmaps: graphical representations in which data values are encoded as colors.
- Correlation matrix: displays the pairwise correlation coefficients between all pairs of variables in a dataset.
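A minimal sketch of two of these techniques on a pair of hypothetical variables (the column names and values are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import pandas as pd

# Hypothetical dataset: years of experience vs. salary (illustrative values)
df = pd.DataFrame({
    "experience": [1, 2, 3, 5, 8, 10],
    "salary": [40000, 45000, 50000, 60000, 75000, 90000],
})

# Scatter plot of the two quantitative variables
df.plot.scatter(x="experience", y="salary")

# Pairwise correlation coefficients between all numeric columns
corr = df.corr()
print(corr)
```

Here the correlation matrix shows a strong positive correlation, which the scatter plot confirms visually as a roughly linear upward trend.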
Exploratory data analysis tools
The choice of data analysis tool largely depends on the data you are working with, the tools your organization uses and which tool you are comfortable with. For instance, you might use Python for general datasets and MATLAB for datasets in the engineering field. In this article, we use Python as our tool for exploratory data analysis.
Exploratory data analysis steps
There are four main steps involved in exploratory data analysis: data collection, cleaning, analysis and visualization. The steps are performed in that order.
Data collection
This is the process of gathering data from various sources and consolidating it into one dataset. Sites like Kaggle, the UCI Machine Learning Repository, NASA Earthdata, AWS Open Data, and GitHub, among others, provide public datasets for developers and data scientists.
In the following illustrations, we use the "IT Salary Survey for EU region (2018-2020)" dataset from Kaggle.
First, import all the necessary libraries you’ll need.
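For example, assuming the common Python EDA stack (seaborn is optional and only needed for the heatmap later):

```python
# Core EDA libraries
import pandas as pd               # dataframes and data manipulation
import numpy as np                # numeric operations
import matplotlib.pyplot as plt   # plotting
import seaborn as sns             # statistical visualizations (heatmaps)
```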
Next, load the datasets into dataframes:
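The survey is split into one CSV file per year. The filenames below are assumptions; for illustration, this sketch first writes tiny stand-in CSVs so it runs without downloading the Kaggle files:

```python
import pandas as pd

# Stand-in data so this sketch is self-contained (not the real survey)
for year in (2018, 2019, 2020):
    pd.DataFrame({"Age": [30, 35], "Salary": [50000, 60000]}).to_csv(
        f"salary_{year}.csv", index=False)

# Load each year's survey into its own dataframe
salary2018 = pd.read_csv("salary_2018.csv")
salary2019 = pd.read_csv("salary_2019.csv")
salary2020 = pd.read_csv("salary_2020.csv")
```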
Here we have three dataframes, but we want to combine them into one. To do this, use the pandas concat() function, then write the combined dataframe to a CSV file.
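A sketch of combining the three yearly dataframes (built inline here so the snippet runs on its own):

```python
import pandas as pd

# Stand-ins for the three yearly survey dataframes (illustrative values)
salary2018 = pd.DataFrame({"Age": [30], "Salary": [50000]})
salary2019 = pd.DataFrame({"Age": [32], "Salary": [55000]})
salary2020 = pd.DataFrame({"Age": [34], "Salary": [60000]})

# Stack the frames vertically into a single dataframe
combined = pd.concat([salary2018, salary2019, salary2020], ignore_index=True)

# Persist the combined dataframe to one CSV file
combined.to_csv("salary_combined.csv", index=False)
```

Passing ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's original index.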
Display the first 5 rows of the dataframe:
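For instance, with an illustrative dataframe standing in for the survey:

```python
import pandas as pd

# Illustrative dataframe (not the actual survey data)
df = pd.DataFrame({"Age": [25, 30, 35, 40, 45, 50],
                   "Salary": [40000, 50000, 60000, 70000, 80000, 90000]})

# head() returns the first 5 rows by default
print(df.head())
```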
Get a summary of the dataframe using the info() function. This helps in checking the structure and completeness of the dataframe, and it also reports the datatype of each column.
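For example (the columns here are illustrative):

```python
import pandas as pd

# Illustrative dataframe with one missing value
df = pd.DataFrame({"Age": [25, 30, None],
                   "City": ["Berlin", "Munich", "Hamburg"]})

# Prints column names, non-null counts and dtypes
df.info()
```

The non-null counts are a quick first check for completeness: here info() would show only 2 non-null entries in the Age column out of 3 rows.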
Get the descriptive information of the dataframe. Use the .describe() function to perform this operation:
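For example, on an illustrative numeric column:

```python
import pandas as pd

# Illustrative numeric dataframe
df = pd.DataFrame({"Salary": [40000, 50000, 60000, 70000]})

# Count, mean, std, min, quartiles and max per numeric column
print(df.describe())
```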
Data cleaning
Datasets may have inconsistencies or undesired values; therefore you may need to clean your data.
Check whether there are any missing values in the dataframe by using the .isna() function.
Our data has numerous features, but we want to work with just a few of them. We will use the loc accessor of the pandas dataframe to create a new dataframe containing only the selected features.
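A sketch of this selection (the column names below are assumptions, not the survey's actual headers):

```python
import pandas as pd

# Illustrative stand-in for the full survey dataframe
df = pd.DataFrame({
    "Age": [30, 35, 40],
    "Gender": ["M", "F", "M"],
    "Salary": [50000, 60000, 70000],
    "Comments": ["a", "b", "c"],
})

# Keep only the features of interest: all rows, selected columns
features = ["Age", "Gender", "Salary"]
df_selected = df.loc[:, features]
```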
Check for missing values using the isna() function.
If your data has missing values, consider checking what percentage of each column is missing: the pandas expression df.isna().mean()*100 returns the percentage of missing values in each column. Once you know where the missing values are, there are several approaches you can take to handle them.
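For example, on an illustrative dataframe with some missing entries:

```python
import pandas as pd
import numpy as np

# Illustrative dataframe with missing values
df = pd.DataFrame({"Age": [30, np.nan, 40, 45],
                   "Salary": [50000, 60000, np.nan, np.nan]})

# Count of missing values per column
print(df.isna().sum())

# Percentage of missing values per column
print(df.isna().mean() * 100)
```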
Handling missing values
Missing values can either make or break your subsequent data analysis. Therefore you must handle them appropriately. The following are some commonly used methods:
- Dropping missing values: simply getting rid of the rows or columns containing the missing values. This is most appropriate when the missing data is small relative to the entire dataset, so removing it has little effect on the analysis. Use the dropna() function to remove missing values.
- Filling in missing values: use the fillna() function to fill in the missing values with, for example, the mean or median. Take care to avoid generating misleading values.
- Interpolating missing values: this method returns a dataframe with missing values replaced by values interpolated from neighbouring rows or columns. Use the interpolate() function.
In the salary2018 dataset, we replaced the missing values with the median.
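A sketch of that median imputation on an illustrative column:

```python
import pandas as pd
import numpy as np

# Illustrative stand-in for salary2018, with one missing salary
salary2018 = pd.DataFrame({"Salary": [40000.0, 50000.0, np.nan, 60000.0]})

# Replace missing values with the column median
median = salary2018["Salary"].median()
salary2018["Salary"] = salary2018["Salary"].fillna(median)
```

Median imputation is often preferred over the mean when the column has outliers, since the median is not pulled towards extreme values.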
Data analysis
This is where you explore the data to identify the correlations that exist. This could be either univariate or bivariate analysis, depending on the dataset you are working with. Below is a pie chart of the gender distribution.
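A sketch of such a pie chart (the Gender column and its values here are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative gender column (not the real survey distribution)
df = pd.DataFrame({"Gender": ["M", "M", "F", "M", "F"]})

# Pie chart of the gender distribution, with percentage labels
counts = df["Gender"].value_counts()
counts.plot.pie(autopct="%1.1f%%")
plt.ylabel("")
plt.savefig("gender_pie.png")
```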
Use the corr() function to check for correlations in the dataset. Below is the correlation matrix of the salary2018 dataset.
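A sketch of computing the matrix (the columns and values below are an illustrative stand-in, not the real survey data):

```python
import pandas as pd

# Illustrative numeric columns resembling the survey's (values are made up)
salary2018 = pd.DataFrame({
    "Age": [28, 32, 36, 40],
    "Years of experience": [3, 6, 10, 14],
    "Current Salary": [45000, 55000, 68000, 80000],
})

# Pairwise correlation coefficients between all numeric columns
corr = salary2018.corr()
print(corr)
```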
To visualize the correlation, use the seaborn library to generate a heatmap as shown below:
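A sketch of the heatmap, assuming seaborn is installed (the dataframe is the same illustrative stand-in as above):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the salary2018 dataframe
salary2018 = pd.DataFrame({
    "Age": [28, 32, 36, 40],
    "Years of experience": [3, 6, 10, 14],
    "Current Salary": [45000, 55000, 68000, 80000],
})

# Heatmap of the correlation matrix, with coefficients annotated in each cell
sns.heatmap(salary2018.corr(), annot=True, cmap="coolwarm")
plt.savefig("correlation_heatmap.png")
```

Setting annot=True prints each correlation coefficient inside its cell, which makes the heatmap readable without referring back to the raw matrix.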
From the heatmap, it is observable that the current salary correlates strongly with the salary one and two years ago. Age correlates only weakly with salary in all years but more strongly with years of experience.
In conclusion, EDA is an important step in the data analysis process that enables data scientists to extract accurate insights and trends from their data. It is important to note that EDA is an iterative process: the steps taken may vary depending on the dataset and the objectives of the analysis.