kamandenduati

EXPLORATORY DATA ANALYSIS ULTIMATE GUIDE

In this article, I will give you the ultimate experience of EDA (exploratory data analysis) using the iris dataset. The iris dataset ships with scikit-learn (the sklearn module). First things first, what is EDA? EDA is all about learning about data using summarising and visualization techniques. It's about getting to know the data, interacting with it and wanting to know its every nook and cranny.
Let's create a simple analogy for EDA so that we can understand what it means in layman's terms. Think of the talking stage before two people decide to date: the questions and interrogations to find out more about the person you want to spend a part of your life with. The same can be said about data; you do all of this to get meaningful information about it. Hopefully, that made sense.

What shall we cover?

1. How to load the dataset

2. How to convert the data into a data frame for analysis

3. A step-by-step guide to analysing the data

Let's go!!

Loading the dataset

First of all, we will be using Jupyter Notebook. You can install Anaconda (see the Installation guide), which comes with Jupyter pre-installed.
Let's go ahead and import all the tools we will need:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
import seaborn as sns

Next, we load the dataset and give it a variable name. Since we imported load_iris directly, we call it without the datasets prefix:
iris = load_iris()

Fig 1: output of head()

The next step is transforming it into a data frame for analysis, but first, let's get to know the target and feature variables.

The 'species' column comes from a NumPy array and, once converted to a pandas data frame, is of float type, as you can see in the figure after running the head() function. head() outputs the first five rows, tail() gives the last five rows, and info() gives more information on the data, such as the column types and non-null counts.
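Here is a minimal sketch of the loading and conversion steps described above. The original notebook code is not shown, so the variable name df and the way the columns are stacked with np.c_ are assumptions; the 'species' column name follows the article.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Feature variables (the four measurements) and target classes (the species)
print(iris.feature_names)
print(iris.target_names)

# Stack the features and the target side by side.
# 'df' is an assumed name; 'species' is the column name used in the article.
df = pd.DataFrame(
    data=np.c_[iris.data, iris.target],
    columns=iris.feature_names + ["species"],
)

print(df.head())  # first five rows
print(df.tail())  # last five rows
df.info()         # column types and non-null counts
```

Because the integer species codes are stacked next to float measurements, the whole array, including 'species', ends up as float.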

Conversion of species from numpy to pandas

The species codes are converted back to their original names, as shown above.
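One possible way to do this conversion is sketched below. It assumes a data frame called df whose 'species' column holds the numeric codes 0.0, 1.0 and 2.0; the look-up dictionary code_to_name is a hypothetical helper, not something from the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target.astype(float)  # numeric codes, as in the article

# Hypothetical look-up table from integer code to species name
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}

# Cast the float codes to int, then replace each code with its name
df["species"] = df["species"].astype(int).map(code_to_name)

print(df["species"].unique())
```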

The describe() function gives statistical information on the data given for univariate analysis.
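As a short sketch (df is again an assumed name for the data frame), describe() can be called like this. By default it summarises only the numeric columns, so the species names are left out.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]  # species names as text

# count, mean, std, min, quartiles and max for each numeric column
summary = df.describe()
print(summary)
```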

Checking for missing values

The expression '.isnull().sum()' is used to check whether there are any missing values in the data.
The call iris.value_counts('species') is used to check how many rows belong to each value of the species column.
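Both checks could look roughly like this. The data frame is called df here, which is an assumption; use whichever name you gave your own data frame.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Number of missing values in each column
missing = df.isnull().sum()
print(missing)

# Number of rows for each species
counts = df["species"].value_counts()
print(counts)
```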

Univariate analysis and multivariate analysis

The word 'uni' means one, therefore univariate analysis means the analysis of one variable independently. We use plots to conduct such analysis, and the explanation accompanies each diagram.
In multivariate analysis, we try to establish sensible relationships between the variables.

Data visualization enables us to conduct such analysis. Common choices include boxplots, histograms, and histograms with distplots. I will take you through boxplots and histograms. We deduce information from these plots to draw conclusions.

Histogram
What do you observe? For each variable, which values occur most frequently, and over what range do they spread?
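A sketch of how such histograms might be drawn with matplotlib, one panel per feature. The 2 x 2 layout, the figure size and the number of bins are arbitrary choices, not taken from the original figure.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One histogram per feature in a 2 x 2 grid
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, column in zip(axes.ravel(), iris.feature_names):
    ax.hist(df[column], bins=20)
    ax.set_title(column)
    ax.set_ylabel("frequency")
fig.tight_layout()
# plt.show()  # uncomment when running as a script rather than in Jupyter
```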

Boxplot
There are outliers in the sepal width, which may need to be removed or otherwise handled, because outliers can hurt the performance of some machine learning algorithms.
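The sketch below draws a boxplot for each feature and then flags sepal width outliers with the 1.5 x IQR rule, the same rule matplotlib uses by default to place boxplot whiskers. Dropping the flagged rows is only one option and is shown as an assumption; the names df, outliers and cleaned are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# One box per feature; points beyond the whiskers are drawn as outliers
ax = df.plot(kind="box", figsize=(8, 5), title="Iris features")

# Flag sepal width outliers with the 1.5 * IQR rule
col = "sepal width (cm)"
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df[col] < lower) | (df[col] > upper)]
print(outliers[[col]])

# One possible treatment: drop the flagged rows
cleaned = df.drop(outliers.index)
print(len(df), "rows before,", len(cleaned), "rows after")
```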

Correlation
We find the correlation between different variables. A strong negative correlation is indicated by a coefficient close to -1, a strong positive correlation by a coefficient close to 1, and a coefficient near 0 indicates little or no linear relationship.
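A minimal sketch of computing the correlation matrix with pandas. The corr() method uses the Pearson coefficient by default; df and corr are assumed names.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pairwise Pearson correlation between the four measurements
corr = df.corr()
print(corr.round(2))
```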

Heat map
This is a visual representation of correlation.
From the heat map we can see:
Petal width and sepal length are positively correlated
Petal length and sepal length are positively correlated
Petal length and petal width are positively correlated
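A heat map like this could be drawn with seaborn, roughly as follows. The colour map, the annotations and the fixed colour range are styling assumptions, not taken from the original figure.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
corr = df.corr()

# annot=True writes each coefficient inside its cell;
# vmin/vmax pin the colour scale to the full correlation range
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation between iris features")
fig.tight_layout()
# plt.show()  # uncomment when running as a script rather than in Jupyter
```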

Bar graph
The bar graph shows each column grouped by species. This makes it possible to compare the values of one species against those of the others. Observations could include:
- The sepal length and petal length of virginica are larger than those of the other species
- The sepal width of setosa is larger than the sepal width of the other species
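The sketch below builds a grouped bar graph from the mean of each measurement per species. Using the mean as the summary is an assumption, since the article does not say which statistic the original graph shows.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target_names[iris.target]

# Mean of every measurement for each species (assumed summary statistic)
means = df.groupby("species").mean()
print(means)

# One group of bars per species, one bar per measurement
ax = means.plot(kind="bar", figsize=(8, 5), rot=0)
ax.set_ylabel("mean value (cm)")
# plt.show()  # uncomment when running as a script rather than in Jupyter
```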

Conclusion

Based on the length and width of the petal and sepal alone, we can dare say that versicolor and virginica resemble each other in size, while the setosa species differs from both.
There are many ways you can conduct EDA, but I couldn't go through them all because of time constraints, and also because you have to figure out some stuff by yourself. EDA is fun, data analysis is fun.
Until the next article, ciao
