Jean Wasike

Exploratory Data Analysis using Data Visualization Techniques.

Exploratory Data Analysis is a term commonly used by data scientists and is a key concept in your journey to become a data scientist. So let’s delve into it.

What is Exploratory Data Analysis (EDA)?

A simple definition is that Exploratory Data Analysis is the process of observing and investigating data for patterns and key characteristics. The intuition is that, when working with data, you need to explore and analyze it before you can draw conclusions that matter for the task at hand. For example, in the data science life cycle, insights drawn from the data before the modeling stage inform the choice of machine learning model and other modeling decisions.

A more technical definition is as follows.
Exploratory Data Analysis refers to the critical process of performing initial investigations on data to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations. - Prasad Patil

EDA Using Data Visualization Techniques

EDA techniques are either graphical or non-graphical. In this article, we will focus on the graphical techniques used for data visualization. Graphical techniques are further divided into univariate and multivariate methods: univariate analysis focuses on one variable, whereas multivariate analysis compares two or more variables. To make the techniques easy to follow, we will demonstrate them on a sample Airbnb dataset for price prediction.

Boxplot

A boxplot describes the distribution of a single variable using five values: the minimum, first quartile, median, third quartile, and maximum. In our dataset, one boxplot is drawn for each city. The distributions for Amsterdam, Barcelona, and Paris are more spread out than those of the other cities, which are tightly concentrated around the median. The dots above each box are outliers, that is, values that lie unusually far from the rest of the data.

[Figure: Boxplot]
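
As a minimal sketch of how such a plot could be produced, the snippet below uses pandas and matplotlib on synthetic data. The column names `city` and `price`, the extra cities Rome and Vienna, and all the numbers are assumptions made for illustration; they are not taken from the Airbnb dataset used in this article. The snippet also counts outliers with the 1.5 x IQR rule, which is matplotlib's default for deciding which points to draw as dots.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the Airbnb data (hypothetical columns "city" and "price").
# A larger sigma gives a more spread-out price distribution for that city.
rng = np.random.default_rng(0)
spread = {"Amsterdam": 0.7, "Barcelona": 0.7, "Paris": 0.6, "Rome": 0.3, "Vienna": 0.3}
box_df = pd.DataFrame(
    [(city, price)
     for city, sigma in spread.items()
     for price in rng.lognormal(mean=5.3, sigma=sigma, size=200)],
    columns=["city", "price"],
)

# The five values a boxplot is built from, computed per city.
five_number = box_df.groupby("city")["price"].describe()[["min", "25%", "50%", "75%", "max"]]
print(five_number.round(1))


def count_outliers(prices: pd.Series) -> int:
    """Count points lying more than 1.5 * IQR outside the box."""
    q1, q3 = prices.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)).sum())


outliers = box_df.groupby("city")["price"].apply(count_outliers)
print(outliers)

# One box per city, drawn side by side.
cities = list(spread)
fig_box, ax_box = plt.subplots(figsize=(8, 4))
ax_box.boxplot([box_df.loc[box_df["city"] == c, "price"].to_numpy() for c in cities])
ax_box.set_xticklabels(cities)
ax_box.set_xlabel("City")
ax_box.set_ylabel("Price")
ax_box.set_title("Price distribution per city (synthetic data)")
```

To view the result, call `fig_box.savefig("boxplot.png")`, or drop the `matplotlib.use("Agg")` line and call `plt.show()`.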

Histogram

A histogram shows the frequency distribution of a single variable across the range of the data. It works best with continuous data and reveals the shape of the underlying distribution. In our case, the distribution of Airbnb prices is skewed to the right, with most listings at lower prices and a long tail of expensive ones, so it may need a transformation (for example, a log transform) before the data is passed to a model for prediction.

[Figure: Histogram]
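
The following is a rough sketch of how skewness can be checked and reduced, again on synthetic data. The prices are drawn from a log-normal distribution, which is an assumption chosen only because it produces a skewed sample, and the log transform is one common option and not necessarily the one used for this article's dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, skewed "price" values (made up for illustration).
rng = np.random.default_rng(1)
prices = pd.Series(rng.lognormal(mean=5.3, sigma=0.7, size=2000), name="price")

# Sample skewness: positive means a longer right tail, negative a longer left tail.
raw_skew = prices.skew()

# log1p(x) = log(1 + x) compresses the long tail and is safe for zero values.
log_prices = np.log1p(prices)
log_skew = log_prices.skew()
print(f"skewness before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")

# Histograms of the raw and the transformed variable, side by side.
fig_hist, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
ax_raw.hist(prices, bins=40)
ax_raw.set_xlabel("Price")
ax_raw.set_ylabel("Number of listings")
ax_raw.set_title("Raw price (synthetic data)")
ax_log.hist(log_prices, bins=40)
ax_log.set_xlabel("log(1 + price)")
ax_log.set_title("Log-transformed price")
```

Checking the sign of `prices.skew()` on your own data is a quick way to confirm the direction of the skew before deciding on a transformation.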

Bar plot

A bar plot is used in multivariate analysis to compare a numeric variable across the categories of another variable. The length of each bar shows a summary of the numeric variable for that category, most commonly its mean. In our case, the bar plot shows price for each room type, and entire homes are more costly than the other room types.

[Figure: Barplot]
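
A bar plot of this kind boils down to a group-by followed by a mean. The sketch below illustrates that on synthetic data; the column names `room_type` and `price`, the room type labels, and the price levels are all assumptions for the sake of the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic listings (hypothetical room types and made-up price levels).
rng = np.random.default_rng(2)
log_price_level = {"Entire home/apt": 5.6, "Private room": 5.0, "Shared room": 4.6}
bar_df = pd.DataFrame(
    [(room, price)
     for room, mu in log_price_level.items()
     for price in rng.lognormal(mean=mu, sigma=0.5, size=300)],
    columns=["room_type", "price"],
)

# The summary that the bars display: mean price per room type, highest first.
mean_price = bar_df.groupby("room_type")["price"].mean().sort_values(ascending=False)
print(mean_price.round(1))

# One bar per category; bar height is the mean price of that category.
fig_bar, ax_bar = plt.subplots(figsize=(6, 4))
ax_bar.bar(list(mean_price.index), mean_price.to_numpy())
ax_bar.set_xlabel("Room type")
ax_bar.set_ylabel("Mean price")
ax_bar.set_title("Mean price by room type (synthetic data)")
```

Because the mean is sensitive to outliers, swapping `.mean()` for `.median()` is a reasonable variation when the price distribution is heavily skewed.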

Scatter plot

A scatter plot is a two-dimensional plot that uses dots to show how the values of two variables relate to each other. The scatter plot below shows Cleanliness Rating vs Price: listings with higher ratings tend to have higher prices. However, since the dots do not fall along a straight line, we cannot infer a linear relationship between the cleanliness rating and the price.

[Figure: Scatterplot]
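
One way to back up what the eye sees in a scatter plot is to compute a correlation coefficient alongside it. The sketch below does this on synthetic data; the column names `cleanliness_rating` and `price` and the way the prices are generated are assumptions for illustration only. Pearson's r measures linear association, while the rank-based version (Spearman's coefficient, computed here as Pearson's r on ranks) only asks whether one variable tends to increase with the other.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic listings: integer ratings from 2 to 10 and noisy, weakly related prices.
rng = np.random.default_rng(3)
n = 500
scatter_df = pd.DataFrame({"cleanliness_rating": rng.integers(2, 11, size=n)})
scatter_df["price"] = 100 * np.exp(
    0.08 * scatter_df["cleanliness_rating"] + rng.normal(0, 0.6, size=n)
)

# Pearson's r: strength of the linear association, between -1 and 1.
pearson_r = scatter_df["cleanliness_rating"].corr(scatter_df["price"])

# Rank correlation: Pearson's r applied to the ranks of both variables.
rank_r = scatter_df["cleanliness_rating"].rank().corr(scatter_df["price"].rank())
print(f"Pearson r: {pearson_r:.2f}, rank correlation: {rank_r:.2f}")

# Semi-transparent dots make overlapping points at the same rating visible.
fig_scatter, ax_scatter = plt.subplots(figsize=(6, 4))
ax_scatter.scatter(scatter_df["cleanliness_rating"], scatter_df["price"], alpha=0.3)
ax_scatter.set_xlabel("Cleanliness rating")
ax_scatter.set_ylabel("Price")
ax_scatter.set_title("Cleanliness rating vs price (synthetic data)")
```

A coefficient that is positive but well below 1 matches the reading above: prices tend to rise with the rating, but the relationship is far from a straight line.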

Conclusion

The techniques discussed above are only a few of the many data visualization techniques used in EDA. EDA is a critical step for obtaining insights about your data before embarking on machine learning, so it should be done carefully to ensure that the conclusions drawn from it are reliable.

References

What is Exploratory Data Analysis? (https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)
Introduction to Boxplot Visualizations (https://www2.microstrategy.com/producthelp/Current/MSTRWeb/WebHelp/Lang_1033/Content/Introduction_to_Box_Plot_Visualizations.htm#:~:text=A%20box%20plot%20visualization%20allows,standard%20deviation%20as%20dashed%20lines.)