Introduction
Exploratory data analysis (EDA) is an approach to extracting meaningful information from data in order to identify patterns and gain a deeper understanding of it.
It relies mainly on statistical graphics and other data visualization methods. Libraries commonly used for EDA in Python include pandas, matplotlib and seaborn.
In this article we are going to focus on EDA using data visualization.
Data Visualization
Data visualization is a method of presenting data in a visual format to get to know variables and the relationship between them.
Python provides several libraries for data visualization, including matplotlib, seaborn and plotly.
To select and design a visualization we consider the following:
- The type of data available (numerical or categorical).
- The number of variables and the questions we want to answer with the data.
Univariate analysis.
Univariate analysis is the analysis of one variable at a time. In univariate analysis, numerical and categorical data are plotted differently as they require different types of plots.
Categorical data.
Categorical data can assume only a limited number of values. Let us look at the types of plots that can be used to visualize categorical data.
1. Count plots.
Count plots quickly show the frequency of each category as a bar graph, with one bar per category. A count plot is essentially a visual representation of the pandas value_counts() method.
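As a minimal sketch of the idea, the example below counts categories with pandas and then draws the same counts with seaborn's countplot. The DataFrame and its "fruit" column are made up for illustration only.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sample data, for illustration only.
df = pd.DataFrame(
    {"fruit": ["apple", "banana", "apple", "cherry", "banana", "apple"]}
)

# Frequency of each category as numbers.
counts = df["fruit"].value_counts()
print(counts)

# The same frequencies as a bar graph, one bar per category.
fig, ax = plt.subplots()
sns.countplot(data=df, x="fruit", ax=ax)
ax.set_title("Count of each fruit")
# In a script, call plt.show() to display the figure.
```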
2. Pie charts.
Pie charts are similar to count plots but show each category as a percentage of the whole. They work best for variables with only a few categories.
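A pie chart of the same kind of data could look roughly like this. The data is again made up, and the percentages come from value_counts(normalize=True).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data, for illustration only.
df = pd.DataFrame(
    {"fruit": ["apple", "banana", "apple", "cherry", "banana", "apple"]}
)

# Share of each category as a fraction of all rows.
shares = df["fruit"].value_counts(normalize=True)

fig, ax = plt.subplots()
# autopct prints the percentage on each slice.
ax.pie(shares.values, labels=shares.index.tolist(), autopct="%1.1f%%")
ax.set_title("Share of each fruit")
# In a script, call plt.show() to display the figure.
```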
Numerical data
This is data that can be quantified. Numerical data can be continuous or discrete.
Continuous data can take any value within a range, while discrete data takes only distinct, countable values.
Analyzing how numerical data is distributed is important because it guides further processing of the data.
Numerical data can be presented visually by the use of:
1. Histogram
A histogram is a distribution plot for a numerical column. It groups the values into bins, each covering a range of values, and plots the number of observations in each bin. This helps visualize how the values are distributed.
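The binning can be sketched with matplotlib as follows. The values are synthetic (drawn from a normal distribution) and the choice of 20 bins is an assumption made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values, for illustration only.
rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=500)

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges it used.
bin_counts, bin_edges, _ = ax.hist(heights, bins=20)
ax.set_xlabel("height")
ax.set_ylabel("number of observations")
# In a script, call plt.show() to display the figure.
```

With 20 bins there are 21 bin edges, and the bin counts add up to the number of observations.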
2. Dist plot.
Dist plots are similar to histograms with one improvement: they also draw a kernel density estimate (KDE), a smooth curve that approximates the distribution.
3. Boxplot
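Boxplots summarize data in five numbers: the minimum, the maximum and the 25th, 50th and 75th percentiles.
To read the five-number summary we need to define some terms.
- Median – the middle value in a series after sorting (the 50th percentile).
- Percentile – the value below which a given percentage of the observations fall; for example, 25% of the values lie below the 25th percentile.
- Minimum and Maximum – in a boxplot these are the ends of the whiskers, not necessarily the smallest and largest values. They are usually calculated from the interquartile range (IQR = 75th percentile − 25th percentile) as the most extreme values within 1.5 × IQR below the 25th percentile and above the 75th percentile; points beyond them are drawn as outliers.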
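The quantities behind a boxplot can be computed by hand, as in this sketch. It assumes the common convention of placing the whisker limits at 1.5 × IQR from the quartiles (matplotlib's default), and the values are made up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up values; 30 is deliberately extreme.
values = np.array([2, 4, 4, 5, 6, 7, 8, 9, 10, 30])

# 25th, 50th and 75th percentiles.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whisker limits under the 1.5 * IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are drawn as individual outlier points.
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, iqr, outliers)

fig, ax = plt.subplots()
ax.boxplot(values)
# In a script, call plt.show() to display the figure.
```

Here only the value 30 falls outside the fences, so it is the one point plotted beyond the upper whisker.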
Bivariate Analysis.
Bivariate analysis is the analysis of two variables and is essential for understanding the relationship between them.
Depending on the type of variables, different visualizations can be used for bivariate analysis.
1. Numerical and numerical.
When both variables are numerical we can use scatter plots. They are a great choice for presenting the relationship between two variables.
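A basic scatter plot might be sketched like this. The two variables, named hours and score here, are synthetic and hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data: score rises with hours, plus random noise.
rng = np.random.default_rng(42)
hours = rng.uniform(0, 10, size=100)
score = 50 + 4 * hours + rng.normal(0, 5, size=100)

fig, ax = plt.subplots()
ax.scatter(hours, score, alpha=0.7)
ax.set_xlabel("hours")
ax.set_ylabel("score")

# The correlation coefficient puts a number on the visible trend.
r = np.corrcoef(hours, score)[0, 1]
print(round(r, 2))
# In a script, call plt.show() to display the figure.
```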
2. Numerical and Categorical.
For bivariate analysis of numerical and categorical variables we can use box plots, scatterplots, overlapping histograms (density plots), bar plots and dist plots.
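As one example of this combination, the sketch below draws one box per category with seaborn. The "group" and "score" columns are made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data: a numerical score for two groups.
df = pd.DataFrame(
    {
        "group": ["A", "A", "A", "B", "B", "B"],
        "score": [5, 6, 7, 8, 9, 10],
    }
)

# Median score per group, the line drawn inside each box.
medians = df.groupby("group")["score"].median()
print(medians)

fig, ax = plt.subplots()
# One box per category on the x axis.
sns.boxplot(data=df, x="group", y="score", ax=ax)
# In a script, call plt.show() to display the figure.
```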
3. Categorical and categorical.
Stacked bar plots, cluster maps and heatmaps can be used to visualize two categorical variables.
A heatmap colors each cell of a table, for example the number of observations for each pair of categories, so that frequent and rare combinations stand out.
A cluster map is a heatmap whose rows and columns are reordered by hierarchical clustering; the accompanying dendrogram shows which categories behave similarly.
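One way to build a heatmap for two categorical variables is to cross-tabulate them first, as in this sketch. The "region" and "product" columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data: one row per sale.
df = pd.DataFrame(
    {
        "region": ["north", "north", "north", "south", "south", "south", "south"],
        "product": ["tea", "tea", "coffee", "coffee", "coffee", "coffee", "tea"],
    }
)

# Count how often each (region, product) pair occurs.
table = pd.crosstab(df["region"], df["product"])
print(table)

fig, ax = plt.subplots()
# annot=True writes the count inside each colored cell.
sns.heatmap(table, annot=True, fmt="d", ax=ax)
# In a script, call plt.show() to display the figure.
```

seaborn also provides clustermap, which takes a similar table and adds the clustering and dendrograms.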
Multivariate analysis.
Multivariate analysis is the analysis of multiple variables at a time. Bar plots, scatter plots and box plots can be used.
1. Multivariate analysis using scatterplots.
Scatterplots are effective for visualizing the relationship between two numerical variables. Visual cues such as color, marker shape or marker size can encode a third or even a fourth variable on the same plot, which makes it easier to see how several factors relate to each other. For example, different colors might represent different categories or groups within your data, allowing you to see how each group behaves across the two numerical variables.
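A sketch of this with seaborn is shown below. Color (hue) encodes a category and marker size encodes a third numerical variable; all column names and values are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up data with two plotted variables and two extra ones.
df = pd.DataFrame(
    {
        "height_cm": [150, 160, 165, 170, 180, 185],
        "weight_kg": [50, 58, 63, 68, 80, 86],
        "group": ["a", "a", "b", "b", "c", "c"],
        "age": [20, 35, 28, 42, 31, 55],
    }
)

fig, ax = plt.subplots()
sns.scatterplot(
    data=df,
    x="height_cm",
    y="weight_kg",
    hue="group",  # color encodes a categorical variable
    size="age",   # marker size encodes a numerical variable
    ax=ax,
)
# In a script, call plt.show() to display the figure.
```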
2. Multivariate analysis using bar plots.
Bar plots compare a numerical value across categories or groups. In seaborn, the hue argument introduces a further categorical variable by drawing one colored bar per level within each category, which shows how that third factor affects the relationship between the other two. For instance, with data on sales (a numerical variable) for different products (a categorical variable) in different regions (another categorical variable), a bar plot of sales by product with hue set to region shows how the sales of each product differ between regions.
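The sales example could be sketched as follows. The figures are invented for illustration, and the column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up sales figures, for illustration only.
df = pd.DataFrame(
    {
        "product": ["tea", "tea", "coffee", "coffee"],
        "region": ["north", "south", "north", "south"],
        "sales": [120, 90, 150, 200],
    }
)

# The same numbers as a table: one row per product, one column per region.
pivot = df.pivot(index="product", columns="region", values="sales")
print(pivot)

fig, ax = plt.subplots()
# hue splits each product's bar into one bar per region.
sns.barplot(data=df, x="product", y="sales", hue="region", ax=ax)
# In a script, call plt.show() to display the figure.
```

By default barplot shows the mean of the y variable for each group; here each group has a single row, so the bar heights equal the sales values.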