What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the process data scientists use to analyze and investigate data sets and summarize their main characteristics, often with data visualization methods.
Why perform EDA?
EDA is probably one of the most important parts of a machine learning project, which is why solving a machine learning problem should start with it.
Charts and graphs help us make sense of the data and check whether variables are related, which in turn helps a company make firm, profitable decisions.
The insights gained from EDA are useful for both supervised and unsupervised machine learning modeling.
How can we perform EDA?
We can perform EDA with programming languages such as R and Python, or with Business Intelligence tools such as Tableau and IBM Cognos.
Python has several libraries that are commonly used for EDA: NumPy, Pandas, Matplotlib, and Seaborn.
The main types of analysis are:
Univariate analysis - explores each variable in a data set separately.
Bivariate analysis - analyzes two variables together to determine the relationship between them.
Multivariate analysis - evaluates more than two variables at once to identify possible associations among them. It offers a more complete examination of the data by looking at many variables and their relationships to one another.
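The three types of analysis above can be sketched with a few pandas calls. This is only a minimal illustration: the DataFrame `df` and all of its values are invented here, and correlation is just one of several ways to look at relationships between variables.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
df = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72],
    "MonthlyCharges": [30.0, 35.0, 45.0, 55.0, 70.0, 80.0, 95.0, 105.0],
    "SeniorCitizen": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Univariate: summarize one variable on its own.
tenure_summary = df["tenure"].describe()

# Bivariate: measure the relationship between two variables.
pair_corr = df["tenure"].corr(df["MonthlyCharges"])

# Multivariate: look at all pairwise relationships at once.
corr_matrix = df.corr()

print(tenure_summary)
print(f"tenure vs MonthlyCharges correlation: {pair_corr:.3f}")
print(corr_matrix.round(2))
```

In practice the same numbers are usually read alongside the plots described later, since a single correlation value can hide non-linear patterns.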
To perform EDA, we need to import the following libraries:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
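The plotting snippets in this post use a DataFrame called `data` with columns such as tenure, MonthlyCharges, InternetService, SeniorCitizen, and Churn, but how it is loaded is not shown. If you just want something to run the snippets against, here is a rough stand-in. Every value is randomly generated, and the category labels "A", "B", "C" are placeholders, so the resulting plots will not match a real data set.

```python
import numpy as np
import pandas as pd

# Stand-in for `data`: all values are randomly generated, not real records.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "tenure": rng.integers(1, 73, size=n),                  # whole numbers 1..72
    "MonthlyCharges": rng.uniform(20, 120, size=n).round(2),
    "InternetService": rng.choice(["A", "B", "C"], size=n),  # placeholder labels
    "SeniorCitizen": rng.integers(0, 2, size=n),             # 0 or 1
    "Churn": rng.choice(["Yes", "No"], size=n),
})

print(data.head())        # first five rows
print(data.dtypes)        # column types
print(data.isna().sum())  # missing values per column
print(data.describe())    # summary statistics for numeric columns
```

A quick look at the first rows, column types, and missing values is a sensible first step before drawing any plot.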
Common plots used in EDA are:
Scatter Plot
A scatter plot uses dots to represent values for two different numeric variables. The position of each dot on the horizontal and vertical axes gives the values for an individual data point. Scatter plots are used to observe relationships between variables.
data.plot(kind='scatter', x='tenure', y='MonthlyCharges')
plt.title('ScatterPlot of MonthlyCharges vs tenure')
plt.show()
Output:
The plot above is hard to read because every point has the same color and many points overlap. Let's use the seaborn module and color the points by InternetService to see whether we can get more insight.
sns.set_style('whitegrid')
sns.FacetGrid(data,hue='InternetService',height=5) \
.map(plt.scatter, 'tenure', 'MonthlyCharges') \
.add_legend()
plt.title('Scatterplot using seaborn of MonthlyCharges vs tenure')
plt.show()
Output:
Pair Plots
Pair plots draw a plot for every pair of numeric features. They help us find the set of features that best explains a relationship between two variables or that forms the most clearly separated clusters. They can also suggest simple classification models, for example by showing where a straight line would separate the classes in our data set.
sns.set_style('whitegrid')
sns.pairplot(data,hue='InternetService',height=3)
plt.show()
Output:
Box Plots
Box plots show percentiles, which other plots cannot show as easily. They also help in detecting outliers.
A box plot displays the five-number summary of a set of data: the minimum, first quartile, median, third quartile, and maximum. We draw a box from the first quartile to the third quartile, and a line across the box marks the median.
sns.boxplot(x='SeniorCitizen', y='MonthlyCharges', data=data)
plt.title('Box plot')
plt.show()
Output:
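The five-number summary behind a box plot can also be computed directly. Below is a minimal sketch with NumPy on an invented sample of charges. It also applies the common 1.5 × IQR rule for flagging outliers, which is the default whisker setting in seaborn's boxplot; other conventions exist.

```python
import numpy as np

# Invented sample; 250.0 is a deliberately extreme value.
charges = np.array([20.0, 25.0, 30.0, 45.0, 50.0, 55.0, 70.0, 80.0, 90.0, 250.0])

# Five-number summary.
minimum = charges.min()
q1, median, q3 = np.percentile(charges, [25, 50, 75])
maximum = charges.max()

# Points beyond 1.5 * IQR from the box are treated as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = charges[(charges < lower_fence) | (charges > upper_fence)]

print("min, Q1, median, Q3, max:", minimum, q1, median, q3, maximum)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Note that NumPy interpolates between data points when computing quartiles, so other tools may report slightly different quartile values for small samples.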
Histogram
Histogram plots are used to depict the distribution of any continuous variable. These types of plots are very popular in statistical analysis.
sns.FacetGrid(data, hue='Churn', height=8)\
.map(sns.histplot, 'tenure')\
.add_legend()
plt.title('Histogram of tenure by Churn')
plt.show()
Output:
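A histogram is just a count of how many values fall into each bin. Here is a small sketch with np.histogram on invented tenure values; the choice of 6 bins over the range 0 to 72 is an assumption made for this example, not something seaborn would necessarily pick.

```python
import numpy as np

# Invented tenure values (in months) for illustration only.
tenure = np.array([1, 2, 3, 5, 8, 12, 15, 24, 30, 36, 48, 60, 66, 70, 72])

# Six equal-width bins between 0 and 72.
counts, edges = np.histogram(tenure, bins=6, range=(0, 72))

# Print each bin with the number of values it contains.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Changing the number of bins can change the apparent shape of the distribution, so it is worth trying a few values.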
Violin Plots
A violin plot is an extension of the box plot: in addition to the box plot summary, it also draws a kernel density estimate of the data.
sns.violinplot(x='SeniorCitizen', y='MonthlyCharges', data=data)
plt.title('Violin plot')
plt.show()
Output:
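To get a feel for the kernel density part of a violin plot, here is a hand-written Gaussian kernel density estimate in NumPy. It is only a sketch: the helper `gaussian_kde_1d`, the sample values, and the rule-of-thumb bandwidth are all assumptions for illustration, and seaborn's own bandwidth choice may differ.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One row per grid point, one column per sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale by the bandwidth.
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Invented sample for illustration only.
charges = np.array([20.0, 25.0, 30.0, 45.0, 50.0, 55.0, 70.0, 80.0, 90.0, 105.0])

# Normal-reference rule of thumb for the bandwidth (an assumption here).
bandwidth = 1.06 * charges.std(ddof=1) * len(charges) ** (-1 / 5)

grid = np.linspace(charges.min() - 3 * bandwidth,
                   charges.max() + 3 * bandwidth, 400)
density = gaussian_kde_1d(charges, grid, bandwidth)

# Trapezoid rule: the area under a density curve should be close to 1.
area = np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid))
print(f"bandwidth: {bandwidth:.2f}, area under curve: {area:.3f}")
```

The width of a violin at a given height corresponds to this density value, mirrored on both sides.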
These are some basic plots used in EDA. It is always important to read and understand what the plot is saying. It is never good to skip EDA for a machine learning project.





