DEV Community

Christopher Njoroge

Understanding Your Data: The Essentials of Exploratory Data Analysis

Exploratory Data Analysis (EDA) is a vital process in data science because it helps you understand the data you are dealing with and draw conclusions from it. EDA serves as a bridge between data collection and building machine learning models.

What is EDA?

EDA is the process of analyzing data to discover insights, trends, patterns and anomalies, test hypotheses, and draw conclusions from the data.

Types of EDA

1.Univariate Non-graphical- The data has only one variable, so there are no relationships to examine. The aim is to describe that variable on its own, typically with summary statistics.

2.Multivariate Non-graphical- This shows the relationship between two or more variables using cross-tabulation or statistics.

3.Univariate graphical- Non-graphical methods are quantitative and objective, but they cannot give a complete picture of the data. Graphical methods, which involve a degree of subjective analysis, are therefore also required.

4.Multivariate graphical- This uses graphics to display the relationships between two or more variables. A popular choice is a bar plot or bar chart.
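
The two non-graphical types can be sketched in a few lines of Python. This is only a minimal illustration that uses the standard library; the `records` data and its field names are invented for the example and do not come from a real dataset.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented for illustration.
records = [
    {"age": 23, "gender": "F", "purchased": "yes"},
    {"age": 35, "gender": "M", "purchased": "no"},
    {"age": 31, "gender": "F", "purchased": "yes"},
    {"age": 42, "gender": "M", "purchased": "yes"},
    {"age": 29, "gender": "F", "purchased": "no"},
]

# Univariate non-graphical: summary statistics for a single variable.
ages = [r["age"] for r in records]
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
    "min": min(ages),
    "max": max(ages),
}

# Multivariate non-graphical: cross-tabulation of two categorical variables.
crosstab = Counter((r["gender"], r["purchased"]) for r in records)

print(summary)
for (gender, purchased), count in sorted(crosstab.items()):
    print(gender, purchased, count)
```

In day-to-day work a data-frame library would usually do this for you; the plain-Python version is used here only so the sketch runs without extra dependencies.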

Process of EDA

1.Clean your Dataset
When your dataset is first loaded into the coding environment of your choice, the most crucial step is to clean it before analysis begins, because a 'dirty' dataset is compromised and will reduce the accuracy of your analysis. Some of the key steps in this stage include:
checking for null values; once you have identified null values in your dataset, you can replace them with the mean, median or mode of that column. Where a column has too many null values, you can drop the entire column.
checking for outliers; outliers are data points that deviate significantly from the rest of your dataset. They can skew your data visualizations, distort your summary statistics and negatively affect your models.
identifying duplicate data; duplicate data also affects the integrity of your data and the accuracy of your analysis. The most common practice is to drop the duplicates.
The final stage of data cleaning is to ensure uniformity within your columns: no column should contain two or more distinct data types.
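
The cleaning steps above can be sketched as follows. This is a minimal, standard-library-only illustration, not a definitive recipe: the `rows` data is invented, the 1.5 * IQR rule is just one common convention for flagging outliers, and the order of the steps is an assumption (types are fixed first so the later numeric steps work).

```python
import statistics

# Hypothetical raw rows, invented for illustration: one missing value,
# one exact duplicate, one extreme value, and one number stored as text.
rows = [
    {"id": 1, "income": 42000},
    {"id": 2, "income": None},
    {"id": 3, "income": 39000},
    {"id": 3, "income": 39000},
    {"id": 4, "income": "45000"},
    {"id": 5, "income": 41000},
    {"id": 6, "income": 950000},
]

# Enforce one data type per column: convert numeric strings to int.
for row in rows:
    if isinstance(row["income"], str):
        row["income"] = int(row["income"])

# Drop exact duplicates while preserving the original order.
seen = set()
deduped = []
for row in rows:
    key = (row["id"], row["income"])
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Fill null values with the median of the observed values.
observed = [r["income"] for r in deduped if r["income"] is not None]
median_income = statistics.median(observed)
for row in deduped:
    if row["income"] is None:
        row["income"] = median_income

# Flag outliers with the 1.5 * IQR rule (an assumed convention).
incomes = [r["income"] for r in deduped]
q1, _, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in deduped if not lower <= r["income"] <= upper]

print(len(deduped), median_income, [r["id"] for r in outliers])
```

The median is used for imputation here because, unlike the mean, it is not pulled upward by the extreme income value. Whether to drop, cap or keep flagged outliers depends on the dataset.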

2.Visualize your Dataset
Once you have cleaned your original dataset, you can visualize what remains. The right visualization depends on the number and type of variables. For instance, you can use correlation matrices or scatter plots for data with two or more variables, bar graphs or pie charts for categorical data, and box plots for a single variable.
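
As a rough, dependency-free illustration of a bar graph for categorical data, the sketch below prints a text-based bar per category. The `colors` list is invented for the example; in practice you would normally use a plotting library rather than text output.

```python
from collections import Counter

# Hypothetical categorical column, invented for illustration.
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)

# One line per category, most frequent first; bar length equals the count.
lines = [f"{label:<6}{'#' * n} {n}" for label, n in counts.most_common()]
print("\n".join(lines))
```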

3.Perform analyses on your variables
This step helps us gain insight into the distribution of our variables and the correlation between them. Once again, the analysis technique depends on the number of variables and their data types. Once we have analyzed our variables, we can identify the relationships between them.

4.Identify data Patterns
This step is crucial because it allows us to observe the behavior of our variables, both independent and dependent, and in the long term make predictions about them. It is a core reason why EDA is performed in the first place.

The final step of EDA is documentation and reporting, as you will need to present your findings in an easy-to-understand manner. After all, the whole point of data analysis is to make sense of facts and figures.
Some of the tools commonly used for EDA are Python, R and, in some cases, SQL.
