Definition:
Exploratory Data Analysis (EDA), also referred to as Data Exploration, is the process of analyzing, investigating, and summarizing datasets to gain insights into the underlying patterns and relationships within the data. This is done by employing data visualization techniques and graphical statistical methods such as histograms, heatmaps, violin plots, and joint plots. Technically, EDA is all about 'understanding the dataset'.
'Understanding' in this context might refer to quite a number of things:
- Extracting the most important variables, commonly referred to as feature selection.
- Identifying and dealing with outliers and missing values.
- Understanding the relationships between variables, whether linear or non-linear.
By employing EDA techniques, you can turn a very messy dataset into a very clean one. Overall, EDA is a crucial and critical part of any data analysis project, and it often guides the data analyst through further analysis and data modelling.
In this article we will dive deeper into EDA and discuss several topics:
- Data cleaning and preparation
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis
- Visualization Techniques
- Descriptive Statistics
Data Cleaning and Preparation
The first step in data analysis is to clean and prepare the data. This might involve identifying and correcting missing values, removing outliers, and transforming variables as necessary, so that the data going into the rest of the analysis is as accurate and reliable as possible.
Let's have a look at how you do this:
#Import the necessary libraries
import pandas as pd
# Read in data
df = pd.read_csv('data.csv')
# Check for missing values
print(df.isnull().sum())
# Remove outliers (here, by keeping only rows below a chosen threshold;
# 'column_name' and the cutoff of 100 are placeholders for your own data)
df = df[df['column_name'] < 100]
# Transform variables
df['new_column'] = df['column_name'] * 2
Univariate Analysis
Univariate analysis involves analyzing each variable in the dataset individually. Say you have a variable named age; using univariate analysis, you can calculate its summary statistics, e.g. the mean, median, mode, standard deviation, and variance. This step also involves visualizing the distribution of each variable using histograms, box plots, density plots, and the like (a box-plot and density-plot sketch follows the histogram example below).
We can use the Seaborn library to perform a univariate analysis on a sample dataset:
import seaborn as sns
# Load data
tips = sns.load_dataset('tips')
# Calculate summary statistics
print(tips.describe())
# Visualize distribution with histogram
sns.histplot(tips['total_bill'], kde=False)
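The same variable can also be inspected with the box plot and density plot mentioned above. A minimal sketch, using the same tips dataset:

import seaborn as sns
import matplotlib.pyplot as plt
# Load data
tips = sns.load_dataset('tips')
# Visualize the spread and any outliers with a box plot
sns.boxplot(x=tips['total_bill'])
plt.show()
# Visualize the smoothed distribution with a density (KDE) plot
sns.kdeplot(tips['total_bill'])
plt.show()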
Bivariate Analysis
On the other side of univariate analysis we have bivariate analysis, and your guess is as good as mine: bivariate analysis involves analyzing the relationship between two variables in a dataset. Again, let's look at this from a practical point of view. You have two variables, height and weight, and you need to understand the relationship between the two. Bivariate analysis lets you use graphical methods such as scatter plots, bar charts, and line plots to visualize these relationships. It also includes calculating correlation coefficients, cross-tabulations, and contingency tables (a cross-tabulation sketch follows the scatter-plot example below).
Let's use the matplotlib library to illustrate bivariate analysis:
#Import the necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Load data
iris = sns.load_dataset('iris')
# Calculate correlation coefficients (numeric columns only; iris also has a string 'species' column)
print(iris.corr(numeric_only=True))
# Visualize relationship with scatter plot
plt.scatter(iris['sepal_length'], iris['sepal_width'])
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.show()
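The correlation matrix and scatter plot above cover numerical pairs; for the cross-tabulations mentioned earlier, pandas provides crosstab(). A minimal sketch using the tips dataset that ships with seaborn:

import pandas as pd
import seaborn as sns
# Load data
tips = sns.load_dataset('tips')
# Build a contingency table of two categorical variables
print(pd.crosstab(tips['sex'], tips['smoker']))

Each cell counts how often the two category values occur together, which is exactly the input the chi-squared test discussed later expects.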
Multivariate Analysis
This is the statistical procedure that involves analyzing the relationships between more than two variables. Alternatively, multivariate analysis can be used to analyze the relationship between dependent and independent variables. Its major applications include clustering, feature selection, dimensionality reduction, and hypothesis testing.
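As a quick illustration, seaborn's pairplot() looks at every pairwise relationship in one figure, which is often the first multivariate step. A minimal sketch on the iris dataset used above:

import seaborn as sns
# Load data
iris = sns.load_dataset('iris')
# Plot every pairwise relationship, colored by the species category
sns.pairplot(iris, hue='species')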
Visualization Techniques
Another very important component of EDA is data visualization, which gives the data analyst a chance to explore and understand the data visually. This step is crucial for any organization, as visualizations are easily understood by the non-technical people in the organization. Non-technical people sometimes have a hard time understanding the 'under-the-hood' variable relationships, but with data visualization they can easily see the relationships between different variables in the dataset. There are several techniques and tools used in this process. Some tools that data analysts commonly use to visualize data include:
- MS Power BI
- MS Excel
- Tableau
- Google Data Studio
As for the techniques, there are numerous ways to visualize your data. We will discuss some of them using the pandas library:
Histograms
Histograms are used to visualize the distribution of a continuous variable like height. The hist() method is used to generate a histogram.
# Generate a histogram
df['column_name'].hist()
Boxplots
Boxplots are used to visualize the distribution of continuous variables and to detect outliers. The boxplot() method is used, called on the DataFrame rather than on a single column.
# Generate a boxplot
df.boxplot(column='column_name')
Scatterplots
This technique is used to visualize the relationship between two continuous variables, e.g. height and weight. The plot.scatter() method is used to generate a scatterplot.
# Generate a scatterplot
df.plot.scatter(x='column1', y='column2')
Bar Charts
Bar charts are used to visualize the distribution of categorical variables in a dataset. Examples of categorical variables are gender, race, type of job, etc. The plot.bar() method is used to generate a bar chart.
# Generate a bar chart
df['column_name'].value_counts().plot.bar()
Hands-on EDA
We've talked a lot about the theoretical side of EDA, but now let's get to the fun part, where we apply these techniques to a real-world dataset. Working with real-world data can at times be quite hard and frustrating, as it involves paying careful attention to data cleaning, exploration, handling outliers, dealing with missing data, and finally understanding the data. It's also good to keep in mind that the ultimate goal of any data scientist is an accurate, meaningful analysis that is relevant to the problem at hand. To get your hands on real-world data, there are various open-source and free websites that provide a wide pool of datasets; they include Data.gov, Kaggle, World Bank Open Data, OpenML, Datahub, etc.
Enough said, let's now dive right into the nitty-gritty. For this particular article, I'll be using an East African dataset that can be found at: https://www.kaggle.com/datasets/enockmokua/financial-dataset
Importing Libraries
#Import the necessary libraries
# Display plots directly below the code cell
%matplotlib inline
import matplotlib.pyplot as plt #for creating plots
import pandas as pd #for data manipulation and analysis
import numpy as np # for working with arrays and matrices
import seaborn as sns #for complex visualizations that can't be achieved by plt
sns.set() # apply seaborn's default plotting theme
Loading and Exploring the Data
#load the data
df=pd.read_csv("Datasets/finance.csv")
#check the dataset size and shape
df.shape, df.size
#Display the first 5 rows
df.head()
#Display the last 5 rows
df.tail()
#View the column names
df.columns
#view the column data types
df.dtypes
#view the summary statistics of the numerical columns
df.describe()
#Display a summary of the DataFrame (dtypes, non-null counts, memory usage)
df.info()
#Display the total number of missing values in each column
df.isnull().sum()
#Create a mini dataframe to see the % of the missing values
missing=(df.isnull().sum()*100/len(df))
missing_df=pd.DataFrame({'Percentage missing': missing})
missing_df
Data Cleaning
#Drop irrelevant columns, or columns with too many missing values
df.drop(['Unnamed: 0','year'], axis=1, inplace=True)
#Fill the missing values with the mean (or median)
df['Respondent Age'] = df['Respondent Age'].fillna(df['Respondent Age'].mean())
#Check for any duplicates and drop them if they exist
df.drop_duplicates(inplace=True)
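After these steps, it's worth confirming that the cleaning actually took effect:

#Confirm there are no missing ages or duplicate rows left
print(df['Respondent Age'].isnull().sum())
print(df.duplicated().sum())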
Creating visualizations
We will create histogram visualizations using the Matplotlib and Seaborn libraries.
#Using matplotlib
x = df["Respondent Age"]
plt.hist(x, bins=100, density=True, facecolor="green")
plt.show()
#Using Seaborn
sns.histplot(df['Respondent Age']);
- From the two visualizations, we can see the differences between the two libraries. Seaborn tends to produce clearer visualizations out of the box than matplotlib.pyplot.
#Scatter plots for two numerical columns
sns.scatterplot(data=df, x='Respondent Age',y='household_size',hue='Has a Bank account');
#Boxplot of a numerical column by a categorical column
sns.boxplot(x='Respondent Age',y='country', data=df);
- Boxplots are mostly used to check for outliers in a dataset
#A heatmap of the correlation between the numerical columns
sns.heatmap(df.corr(numeric_only=True));
#A bar chart of the headcount per country
df.country.value_counts().plot(kind='bar')
plt.xlabel("Country")
plt.ylabel("Count");
#A pie chart for the number of respondents per country
counts=df.country.value_counts()
plt.pie(counts, labels=counts.index, autopct='%1.1f%%')
plt.show();
Chi-Squared Test
So far we've mostly talked about and visualized numerical variables, and you might be wondering, "What about the categorical variables?". That's where the chi-squared test comes in: it is a statistical test used to determine whether there is a significant association between two categorical variables. It compares observed data with expected data to determine whether the differences between them are significant enough to reject the null hypothesis that there is no association between the variables.
In Python, you can use the scipy.stats module, which provides a function called chi2_contingency() that calculates the chi-squared statistic, degrees of freedom, p-value, and expected frequencies for a contingency table. Let's try it out.
import numpy as np
from scipy.stats import chi2_contingency
# create a contingency table
table = np.array([[10, 20, 30], [15, 25, 35]])
# perform the chi-squared test
chi2, p, dof, expected = chi2_contingency(table)
# print the results
print('Chi-squared statistic:', chi2)
print('Degrees of freedom:', dof)
print('P-value:', p)
print('Expected frequencies:\n', expected)
- The contingency table we created, with two rows and three columns, represents the frequencies of two categorical variables.
The expected output will be:
Chi-squared statistic: 0.27692307692307694
Degrees of freedom: 2
P-value: 0.870696738961232
Expected frequencies:
[[11.11111111 20. 28.88888889]
[13.88888889 25. 36.11111111]]
- The p-value is greater than the significance level of 0.05, which means we fail to reject the null hypothesis that there is no association between the variables. Therefore, we conclude that there is no significant association between them.
- Check out this article for further understanding of the Chi-Square Test.
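As a closing sketch, the same test can be run on the survey data itself by building the contingency table with pd.crosstab(). This assumes the df loaded earlier still has the country and Has a Bank account columns used in the plots above:

import pandas as pd
from scipy.stats import chi2_contingency
# Build a contingency table from two categorical columns
# (assumes 'country' and 'Has a Bank account' exist in df, as in the earlier plots)
table = pd.crosstab(df['country'], df['Has a Bank account'])
# Test whether bank-account ownership is associated with country
chi2, p, dof, expected = chi2_contingency(table)
print('Chi-squared statistic:', chi2)
print('P-value:', p)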