
Carol Ajando


Understanding Your Data: The Essentials of Exploratory Data Analysis

Originally developed in the 1970s by John Tukey, Exploratory Data Analysis (EDA) is still widely used in data science today.
Data scientists use EDA to examine and understand data sets, often with the help of data visualization methods.
It lets them analyze, investigate, and identify the main characteristics of a data set.
EDA also shows what the data can reveal beyond formal modeling or hypothesis testing, giving a better understanding of the variables in a data set and the relationships between them.

## Why is EDA important in data science?

  1. The main purpose of EDA is to understand the data before making any assumptions about it.
  2. EDA supports data cleaning by making it easier to identify outliers, duplicates, and other errors within the data set.
  3. EDA helps data scientists produce valid, reliable results that can be used for decision-making.
  4. EDA is used to identify patterns and features within the data set, such as categorical variables, and to compute summary statistics such as the mean, mode, standard deviation, and confidence intervals.
  5. Once EDA is complete and insights have been drawn, the features identified can be used for more sophisticated analysis or modelling, such as machine learning.
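
The cleaning and summary checks listed above can be sketched with nothing more than the Python standard library. This is a minimal illustration, not a prescribed workflow: the sample data, the function names (`summarize`, `find_outliers_iqr`, `find_duplicates`), and the 1.5 × IQR outlier rule are all assumptions chosen for the example.

```python
from collections import Counter
import statistics

# Hypothetical sample data, invented purely for illustration.
ages = [23, 25, 25, 27, 29, 31, 31, 31, 95]
rows = [("alice", 23), ("bob", 25), ("alice", 23)]


def summarize(values):
    """Return basic non-graphical summary statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def find_outliers_iqr(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR rule is one common convention, assumed here;
    other thresholds or methods may suit a given data set better.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def find_duplicates(records):
    """Return each record that appears more than once."""
    counts = Counter(records)
    return [record for record, n in counts.items() if n > 1]


summary = summarize(ages)
outliers = find_outliers_iqr(ages)
duplicates = find_duplicates(rows)
```

Here `outliers` picks up the value 95 and `duplicates` picks up the repeated `("alice", 23)` record, which are the kinds of entries a cleaning step would then inspect.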

## Types of EDA
EDA can be categorized as univariate, bivariate, or multivariate.
Each category can be further divided into graphical and non-graphical methods.
Non-graphical methods are mostly used to compute summary statistics.
Graphical methods are used to get a full picture of the data set.

  1. Univariate: The simplest form of data analysis, used when the analysis looks at a single variable/column. Common plots include box plots and histograms.
  2. Bivariate: Used to describe data and find patterns in the relationship between two variables/columns. The most commonly used graphical presentations include scatter plots and bar plots.
  3. Multivariate: Used for EDA of data sets that contain multiple variables/columns. Common plots include scatter plots, heatmaps, run charts, and bubble charts.

Python and R are the most widely used programming languages for EDA.
Common Python libraries for EDA include NumPy, pandas, Matplotlib, seaborn, and Plotly.
There is no single method for performing EDA; the approach depends on the data set you have.
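
As a rough sketch of how the univariate, bivariate, and multivariate levels might look in pandas, the example below runs one non-graphical check for each. The data set and its column names (`hours_studied`, `exam_score`, `sleep_hours`, `group`) are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data set, invented for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 83],
    "sleep_hours": [8, 7, 7, 6, 6, 5],
    "group": ["a", "b", "a", "b", "a", "b"],
})

# Univariate: summarize one column at a time.
score_summary = df["exam_score"].describe()   # count, mean, std, quartiles, ...
group_counts = df["group"].value_counts()     # frequency of each category

# Bivariate: Pearson correlation between two numeric columns.
r = df["hours_studied"].corr(df["exam_score"])

# Multivariate: pairwise correlations across all numeric columns.
corr_matrix = df.select_dtypes(include="number").corr()

print(score_summary)
print(group_counts)
print(corr_matrix.round(2))
```

The graphical counterparts follow the same split. Assuming Matplotlib and seaborn are installed, `df["exam_score"].plot.hist()` gives a univariate histogram, `df.plot.scatter(x="hours_studied", y="exam_score")` a bivariate scatter plot, and `seaborn.heatmap(corr_matrix)` a multivariate heatmap.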
