DEV Community

Allan Ouko

Exploratory Data Analysis using Data Visualization Techniques

Exploratory Data Analysis (EDA) is an essential step in any analysis because it lets the analyst investigate the characteristics of a dataset before modelling. It also helps reveal relationships, trends, and anomalies among the variables in the dataset. Statistical summaries give useful insight into these characteristics, but visualizations are needed as well to check how the variables are distributed. This article discusses the visualizations most commonly used in EDA and what each contributes to understanding the data.

First, however, we will look at why EDA matters during analysis.

Importance of EDA

1. Detecting missing values: EDA reveals which fields have missing data and how the missing entries are distributed. With this information, one can choose the best way to handle them, such as deletion or imputation.
2. Detecting outliers: EDA also exposes outlier values in each variable. This helps in judging which variables could hurt model performance and how the outliers should be handled.
3. Correlation analysis: EDA shows how the variables in the dataset relate to one another. This helps in identifying highly correlated variables (multicollinearity) and removing some of them if they would affect model performance.
4. Variable transformation (Feature Engineering): Through EDA, analysts can determine which variables need transformation to suit the chosen modelling approach. It also guides feature engineering by showing which new variables could be created from existing data fields for a more robust analysis.
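
The four checks above can be sketched with pandas. This is only an illustration on a made-up dataset: the column names (`age`, `income`, `spend`), the 1.5 × IQR outlier rule, and the log transformation are assumptions chosen for the example, not steps prescribed by the article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names and values are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, np.nan, 29, 52, 38, 30],
    "income": [32_000, 48_000, 51_000, 45_000, 39_000, 400_000, 47_000, np.nan],
    "spend": [1_100, 1_900, 2_050, 1_800, 1_500, 9_800, 1_850, 1_600],
})

# 1. Missing values: number of missing entries in each column.
missing_counts = df.isna().sum()

# 2. Outliers: flag incomes more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# 3. Correlation: pairwise Pearson coefficients (rows with missing values
#    are skipped pair by pair).
corr = df.corr()

# 4. Transformation: log1p compresses the long right tail of income.
df["log_income"] = np.log1p(df["income"])

print(missing_counts)
print(df.loc[is_outlier, "income"])
print(corr.round(2))
```

On real data, the same calls would typically be the first thing run after loading the file, before any plotting.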

Visualization in EDA

Different graphs and plots are available for visualizing variables during EDA. The right choice depends on the characteristic the analyst wants to investigate and the type of data being analyzed.

Univariate Plots

Univariate plots are visualizations used to graph individual variables to check their distributions. They include:

1. Histograms
Histograms display the frequency distribution of a numerical variable by grouping its values into bins and plotting the count in each bin. The shape of the plot indicates how the variable is spread: positively skewed, negatively skewed, or roughly symmetric, as in a normal distribution.

[Figure: Histogram]
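
A histogram like this could be produced with matplotlib roughly as follows. The data here is synthetic (a log-normal sample, chosen because it is right-skewed), and the bin count of 30 is an arbitrary assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic, right-skewed sample standing in for a real numerical column.
values = rng.lognormal(mean=3.0, sigma=0.5, size=1_000)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=30, edgecolor="black")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram of a right-skewed variable")
fig.savefig("histogram.png")

# A mean above the median is a quick numeric hint of positive skew.
print(f"mean={values.mean():.2f}, median={np.median(values):.2f}")
```

The long tail to the right of the peak is the visual signature of positive skew; a mirrored shape would indicate negative skew.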

2. Probability Distribution Plots
Probability distribution plots show the range of values a random variable can take and how likely each value, or interval of values, is.

[Figure: Probability Distribution Plot]
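
One way to draw such a plot from sample data is a normalized histogram overlaid with a kernel density estimate. The sketch below computes a Gaussian kernel density estimate by hand with NumPy; the synthetic sample and the normal-reference rule-of-thumb bandwidth are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=50.0, scale=10.0, size=500)  # synthetic data

# Gaussian kernel density estimate with a normal-reference rule-of-thumb
# bandwidth: 1.06 * standard deviation * n^(-1/5).
n = sample.size
bandwidth = 1.06 * sample.std(ddof=1) * n ** (-1 / 5)
grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# Sanity check: a density should integrate to roughly 1 (trapezoidal rule).
area = float(np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid)))

fig, ax = plt.subplots()
ax.hist(sample, bins=25, density=True, alpha=0.4, label="Normalized histogram")
ax.plot(grid, density, label="Kernel density estimate")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()
fig.savefig("density.png")
```

Unlike the raw histogram, the y-axis here is a density, so the area under the curve over an interval approximates the probability of falling in that interval.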

Bivariate Plots

Bivariate plots are used to visualize two variables to determine their relationship.

1. Bar Graphs

Bar graphs are used to compare a numerical value across the categories of a nominal or ordinal variable. For example, a bar graph can compare prices across various store outlets.

[Figure: Bar Graph]
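
As a rough sketch of the store-outlet example, the snippet below averages prices per outlet and plots one bar per outlet. The outlet names and prices are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical prices recorded at three store outlets.
sales = pd.DataFrame({
    "outlet": ["North", "South", "East", "North", "South", "East"],
    "price": [4.50, 5.10, 4.80, 4.70, 5.30, 4.60],
})

# One numerical summary (mean price) per category, highest first.
mean_price = sales.groupby("outlet")["price"].mean().sort_values(ascending=False)

fig, ax = plt.subplots()
ax.bar(list(mean_price.index), mean_price.to_numpy())
ax.set_xlabel("Outlet")
ax.set_ylabel("Mean price")
ax.set_title("Mean price by store outlet")
fig.savefig("bar_graph.png")
```

Sorting the bars by height is optional, but it usually makes the comparison easier to read when the categories have no natural order.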

2. Scatter Plots

Scatter plots allow visualization of the relationships between two numerical variables. For example, visualizing the relationship between data science professionals' years of experience and their salary would indicate how the two variables are related.

[Figure: Scatter Plot]
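
The experience-versus-salary example might look like this in matplotlib. The data is simulated under an assumed linear relationship with noise, so the numbers carry no real-world meaning.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=2)
# Simulated data: salary rises with experience, plus random noise.
years = rng.uniform(0, 20, size=200)
salary = 50_000 + 4_000 * years + rng.normal(0, 8_000, size=200)

# Pearson correlation summarizes the strength of the linear relationship.
r = np.corrcoef(years, salary)[0, 1]

fig, ax = plt.subplots()
ax.scatter(years, salary, alpha=0.6)
ax.set_xlabel("Years of experience")
ax.set_ylabel("Salary")
ax.set_title(f"Experience vs. salary (r = {r:.2f})")
fig.savefig("scatter.png")
```

An upward-sloping cloud of points indicates a positive relationship; a shapeless cloud would suggest little or no linear relationship.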

3. Correlation Plots (Heat Maps)

Heat maps show the relationships among several numerical variables in a dataset at once. Each cell is colored by the correlation coefficient of one pair of variables, so both the direction and the strength of every pairwise relationship are visible at a glance. This makes it easier to see which variables are closely related.
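
A correlation heat map can be sketched with pandas and plain matplotlib, as below. The four columns are synthetic and constructed so that one is positively correlated with `x`, one negatively, and one not at all; the color map is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
x = rng.normal(size=300)
# Synthetic columns with known relationships to x.
df = pd.DataFrame({
    "x": x,
    "y_pos": 2 * x + rng.normal(scale=0.5, size=300),
    "y_neg": -x + rng.normal(scale=0.5, size=300),
    "noise": rng.normal(size=300),
})

corr = df.corr()  # pairwise Pearson correlation matrix
labels = list(corr.columns)

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each coefficient inside its cell.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
fig.savefig("heatmap.png")
```

Fixing the color scale to the range from -1 to 1 keeps the colors comparable across datasets.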

4. Box Plot (Whisker Plot)

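A box plot displays the five-number summary of a variable: the minimum, first quartile, median, third quartile, and maximum. By common convention, the whiskers extend only to the most extreme values within 1.5 times the interquartile range of the box, and anything beyond them is drawn as an individual point. This makes the box plot useful for spotting outliers during data cleaning. Drawing one box per category of a second variable turns it into a bivariate plot.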

[Figure: Box Plot]
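
The five-number summary and the outlier rule can be computed directly before plotting. The sketch below uses a small invented sample containing one obvious outlier and assumes the 1.5 × IQR convention, which is also matplotlib's default for whiskers.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Small invented sample with one obvious outlier (40).
data = np.array([12, 15, 14, 10, 13, 16, 14, 15, 11, 13, 40])

# Five-number summary.
minimum, q1, median, q3, maximum = np.percentile(data, [0, 25, 50, 75, 100])

# Values beyond 1.5 * IQR from the box are treated as outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_ylabel("Value")
ax.set_title("Box plot with one outlier")
fig.savefig("box_plot.png")

print(minimum, q1, median, q3, maximum)
print(outliers)
```

In the resulting plot, the value 40 appears as a separate point above the upper whisker rather than stretching the whisker itself.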

Multivariate Analysis

Multivariate analysis examines more than two variables in a single plot. As in bivariate analysis, a scatter plot can be used: two variables go on the axes, and additional variables are encoded through color, marker size, or marker shape.
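
A minimal sketch of a multivariate scatter plot follows, with a third variable mapped to color. The three simulated variables (experience, education, salary) and their assumed relationship are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=4)
# Simulated data: salary depends on both experience and education.
years = rng.uniform(0, 20, size=150)
education = rng.integers(12, 22, size=150)  # years of schooling, 12 to 21
salary = 30_000 + 3_500 * years + 2_000 * education + rng.normal(0, 6_000, size=150)

fig, ax = plt.subplots()
# The third variable (education) is encoded as the color of each point.
points = ax.scatter(years, salary, c=education, cmap="viridis")
fig.colorbar(points, ax=ax, label="Years of education")
ax.set_xlabel("Years of experience")
ax.set_ylabel("Salary")
ax.set_title("Salary by experience and education")
fig.savefig("multivariate_scatter.png")
```

A fourth variable could be added through the `s` (marker size) argument of `scatter`, although plots tend to become hard to read beyond three or four encoded variables.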

Conclusion

EDA is important for understanding data before performing further analysis and model development. It therefore pays to take a careful approach when examining the distributions, characteristics, and relationships of the variables. Although many visualizations can be used in EDA, the plots highlighted here are the core ones and are enough to build a basic understanding of the data. It is equally essential to know when and how to use each visualization for different data types.
