What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is the process of describing data with statistical and visualization techniques in order to bring its important aspects into focus for further analysis. It involves inspecting the dataset from many angles, and describing and summarizing it, without making assumptions about its contents up front.
Exploratory data analysis is a significant step to take before diving into statistical modeling or machine learning, to ensure the data is really what it is claimed to be and that there are no obvious errors. It should be part of data science projects in every organization.
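A first sanity check of this kind could look roughly like the sketch below. It uses pandas, and the small table is invented for illustration: the column names `age`, `nodes`, and `status` are hypothetical, and one missing value and one duplicated row were planted on purpose so the checks have something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for this sketch (not a real dataset).
df = pd.DataFrame({
    "age": [34, 45, 52, 45, 61, np.nan],
    "nodes": [0, 3, 1, 3, 12, 0],
    "status": ["yes", "yes", "no", "yes", "no", "yes"],
})

# Summary statistics for the numeric columns (count, mean, std, quartiles, ...).
summary = df.describe()

# Missing values per column, and the number of fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

print(summary)
print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On a real dataset the same three calls give a quick view of value ranges, gaps, and repeated records before any modeling starts.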
Why is it important to perform EDA?
EDA helps you gather insights and make better sense of the data, and it exposes irregularities and unnecessary values so that they can be removed. That prepares your dataset for analysis, which in turn helps a machine learning model trained on it make better predictions and gives you more accurate results.
What is Data Visualization?
It is the process of presenting insights through plots, charts, and graphs to communicate findings effectively.
What are some of the visual techniques?
1. Distribution plots: Also known as PDF (probability density function) plots, these are used to analyze one variable at a time. Each feature is plotted as the variable on the X-axis, and the values on the Y-axis represent the normalized density. For instance, suppose our aim is to determine the survival status from a single feature, the patient's age. This is an example of univariate analysis.
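As a rough sketch of such a plot, the snippet below draws normalized histograms of age for two outcome groups with matplotlib. The ages are synthetic random numbers generated for this example, and the group names and distribution parameters are assumptions, not values from the dataset discussed here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic ages for two hypothetical outcome groups (not real patient data).
rng = np.random.default_rng(0)
age_survived = rng.normal(loc=52, scale=10, size=200)
age_not_survived = rng.normal(loc=54, scale=10, size=80)

fig, ax = plt.subplots()
bins = np.linspace(20, 90, 15)

# density=True rescales the bar heights so that each histogram has area 1,
# which makes groups of different sizes comparable.
dens_yes, edges, _ = ax.hist(age_survived, bins=bins, density=True,
                             alpha=0.5, label="yes")
dens_no, _, _ = ax.hist(age_not_survived, bins=bins, density=True,
                        alpha=0.5, label="no")

ax.set_xlabel("age")
ax.set_ylabel("normalized density")
ax.legend(title="survived")
plt.show()
```

If the two histograms overlap heavily, the feature on its own is unlikely to separate the classes well.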
2. Box plots and violin plots: These also fall under univariate analysis. A box plot, also known as a box-and-whisker plot, summarizes the data in five numbers: minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. A violin plot displays the same information as the box-and-whisker plot and additionally shows a density-smoothed plot of the underlying distribution.
The isolated points seen in the box plot of positive axillary nodes are the outliers in the data, that is, values lying beyond the whiskers. A high number of outliers like this is not unusual in medical datasets. Also, the patient age and the operation year plots show similar statistics.
Violin plots are generally more informative than box plots because they show the underlying distribution of the data in addition to the statistical summary. In the violin plot of positive axillary nodes, the distribution is highly skewed for class label = ‘yes’, while it is moderately skewed for ‘no’.
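To make the five-number summary and the outlier markers concrete, here is a small sketch that computes them with NumPy and then draws both plot types side by side. The data is a synthetic right-skewed sample made up for this example; it only stands in for a skewed feature such as a node count.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, right-skewed values (an assumption for illustration only).
rng = np.random.default_rng(1)
values = rng.exponential(scale=4.0, size=300)

# Five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
iqr = q3 - q1

# Matplotlib's default whiskers stop at the last point within 1.5 * IQR of the
# box; anything beyond that is drawn as an isolated outlier marker.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, sharey=True)
ax_box.boxplot(values)
ax_box.set_title("box plot")
ax_violin.violinplot(values, showmedians=True)
ax_violin.set_title("violin plot")
plt.show()
```

Note that with these default settings the whiskers do not necessarily reach the minimum and maximum; for skewed data the extreme values show up as separate points instead.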
3. Heatmap: Heatmaps are used to observe the correlations among the feature variables. This is particularly important when we are trying to obtain feature importance in regression analysis. Although correlated features do not necessarily hurt the predictive performance of a statistical model, they can distort the post-modeling analysis, for example by making individual regression coefficients unstable and hard to interpret.
The values in the cells are Pearson's r values, which indicate the linear correlation between each pair of feature variables. As we can see, these values are nearly 0 for every pair, so there is no appreciable linear correlation between any pair of variables.
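A correlation heatmap of this kind can be sketched with NumPy and matplotlib alone. The three features below are synthetic and named `a`, `b`, and `c` purely for illustration: `b` is deliberately built from `a` so that one cell shows a strong correlation, while `c` is independent noise and should land near 0.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic features: "b" is built from "a", while "c" is independent noise.
rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = a + 0.5 * rng.normal(size=500)
c = rng.normal(size=500)
names = ["a", "b", "c"]

# Pearson correlation matrix; rowvar=False treats each column as one variable.
corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

fig, ax = plt.subplots()
image = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names)
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)

# Write each coefficient into its cell.
for i in range(len(names)):
    for j in range(len(names)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson's r")
plt.show()
```

Fixing the color scale to the range -1 to 1 keeps the colors comparable between datasets, so a matrix of near-zero values does not get stretched into a misleadingly vivid picture.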
4. Contour plot: A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant-z slices, called contours, in a 2-dimensional format. It lets us read three-dimensional information from a flat chart. Here is a diagrammatic representation of how the information from the 3rd dimension can be consolidated into a flat 2-D chart.
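The idea of constant-z slices can be tried out with a minimal sketch like the one below. The surface z = x² + y² and the chosen contour levels are arbitrary assumptions picked for illustration; each contour line joins the points where the surface has the same height, so for this surface the lines come out as circles.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid of (x, y) points and an illustrative surface z = x^2 + y^2.
x = np.linspace(-3, 3, 61)
y = np.linspace(-3, 3, 61)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

fig, ax = plt.subplots()

# Each contour line joins the points where z equals one of these levels.
contours = ax.contour(X, Y, Z, levels=[1, 2, 4, 8, 16])
ax.clabel(contours, inline=True, fontsize=8)

ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_aspect("equal")
plt.show()
```

Closely spaced lines mark regions where the surface rises steeply, and widely spaced lines mark flatter regions.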
5. Scatter plots: A scatter plot is a diagram where each observation in the data set is represented by a dot, positioned by its values on the x- and y-axes. For example:
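```python
import matplotlib.pyplot as plt

# Paired observations: x[i] and y[i] describe the same data point.
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]

# Draw one blue dot per (x, y) pair.
plt.scatter(x, y, c="blue")

# Show the plot.
plt.show()
```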