Exploratory Data Analysis Ultimate Guide

#datascience #python #tutorial

Exploratory Data Analysis (EDA) is one of the most important initial steps in the data analysis process. It helps you understand the data you have and make early decisions that simplify the rest of the process and help you reach your specific goal.

EDA lets you investigate the data for anomalies that may affect your analysis. It gives the analyst an overview of the data: its distribution, null values and more.

The basic EDA process involves the following general steps:

1. Loading data
2. Viewing data
3. Cleaning data
4. Analyzing data

1. LOADING DATA
This is an important step in the EDA process. Here you locate your dataset and load it into the platform where you will perform the EDA. Python is used as the example in this guide.
# import libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# define working directory
os.chdir("C:/users/LENOVO/Desktop/bootcamp")

# import dataset
salary = pd.read_csv("ITSalarySurveyEU2020.csv")

In the code above, the libraries used in the process are imported, the working directory containing the dataset is set, and the dataset is read into a DataFrame. This brings the dataset into the platform so that it can be accessed and EDA can be performed.
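
As an alternative to changing the working directory, read_csv can be given the path to the file directly. The sketch below is only an illustration: the temporary folder, the file name toy_survey.csv and the two rows of values are made up so that the example runs on its own. In practice you would point csv_path at your real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a throwaway folder holding a tiny stand-in CSV (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy_survey.csv"  # hypothetical file name
    csv_path.write_text("Age,Gender,City\n26,Male,Munich\n31,Female,Berlin\n")

    # read_csv accepts a full path, so os.chdir is not required.
    toy = pd.read_csv(csv_path)

print(toy.shape)  # (number of rows, number of columns)
```

Passing the path explicitly keeps the script independent of whatever the current working directory happens to be.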

2. VIEWING DATA
In this step, you get an overview of your dataset. This may include:
The size of your dataset in terms of number of rows and columns.
# data info
salary.info()

RangeIndex: 1253 entries, 0 to 1252
Data columns (total 23 columns)
You may also check the first rows of the dataset.
# top five obs
salary.head()
This shows the first five rows of the dataset, giving you an overview of what the dataset contains.
You may also check the last rows of your dataset.
# bottom five obs
salary.tail()
This shows the last five rows of your dataset.
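
The viewing calls above can be tried on a small stand-in table. This is a minimal sketch: the DataFrame named toy and its values are invented for illustration and only borrow the column names Age, Gender and City from the survey. It also shows shape and dtypes, which give the size and column types without the full info() report.

```python
import pandas as pd

# Made-up rows; only the column names come from the survey dataset.
toy = pd.DataFrame({
    "Age": [26.0, 31.0, 45.0, 29.0],
    "Gender": ["Male", "Female", "Male", "Female"],
    "City": ["Munich", "Berlin", "Berlin", "Hamburg"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# dtypes lists the data type of each column.
print(toy.dtypes)

# head and tail take an optional row count (the default is 5).
print(toy.head(2))
print(toy.tail(2))
```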

3. CLEANING DATA
This step involves finding and cleaning any anomalies in your dataset, such as duplicate rows and missing (null) values.
Finding duplicate data.
salary.duplicated().sum()

If your dataset has duplicated rows, you can delete them as follows:
salary = salary.drop_duplicates(keep='first')
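
The cleaning step also mentions missing values, which can be counted per column with isnull().sum(). The sketch below runs both checks on a made-up table (the name toy and its values are assumptions for illustration). Note that drop_duplicates returns a new DataFrame rather than changing the original, so the result has to be assigned.

```python
import numpy as np
import pandas as pd

# Made-up rows: one exact duplicate and one row with missing values.
toy = pd.DataFrame({
    "Age": [26.0, 31.0, 31.0, np.nan],
    "City": ["Munich", "Berlin", "Berlin", None],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows that repeat an earlier row.
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# Keep the first copy of each duplicated row; the result is a new DataFrame.
deduped = toy.drop_duplicates(keep="first")
print(len(toy), len(deduped))
```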

4. ANALYZING DATA
Once the dataset has been cleaned, it's time to understand it in depth by exploring it. Exploration may include the following:
You may explore a specific column in your dataset to look at its count of non-null values, mean, standard deviation, data type and maximum.
Example.
salary.Age.describe()
count 1226.000000
mean 32.509788
std 5.663804
min 20.000000
25% 29.000000
50% 32.000000
75% 35.000000
max 69.000000
Name: Age, dtype: float64

salary.Gender.describe()

count 1243
unique 3
top Male
freq 1049
Name: Gender, dtype: object

salary.City.describe()

count 1253
unique 119
top Berlin
freq 681
Name: City, dtype: object
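
For text columns such as Gender or City, describe() reports only the most frequent value. If the full breakdown is of interest, value_counts() is one option. A minimal sketch on made-up city values (the table toy is an assumption, not the survey data):

```python
import pandas as pd

# Made-up values standing in for the City column.
toy = pd.DataFrame({"City": ["Berlin", "Berlin", "Munich", "Hamburg", "Berlin"]})

# Count of each distinct value, most frequent first.
counts = toy["City"].value_counts()
print(counts)

# Same breakdown as a fraction of all rows.
shares = toy["City"].value_counts(normalize=True)
print(shares)
```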

After exploring the data, you can also plot graphs of the data. This will help you see whether there are any outliers in your dataset.
Example.
salary.Age.plot()
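
Outliers can also be flagged numerically. One common convention, used here as an assumption rather than as part of the steps above, is to mark values more than 1.5 times the interquartile range (IQR) outside the quartiles. The ages in this sketch are made up for illustration.

```python
import pandas as pd

# Made-up ages; not taken from the survey.
ages = pd.Series([20, 29, 30, 32, 33, 35, 36, 69], name="Age")

# Quartiles and interquartile range.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

A flagged value is only a candidate for review; whether it is an error or a genuine observation depends on the dataset.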

These four steps are a simple way to carry out a basic EDA on a dataset.
