DEV Community

Simone Kamande

EDA IN PYTHON

Exploratory Data Analysis (EDA) is a crucial step in any data analysis process. It involves exploring and understanding the structure, patterns, and relationships within the data. In this post, we will discuss how to perform EDA in Python using various libraries and techniques.

Python has a wide range of libraries for data analysis, including NumPy, Pandas, Matplotlib, and Seaborn, to mention but a few. These libraries provide an extensive set of tools to handle, clean, and visualize data. Let's dive into how to perform EDA in Python.

Importing Data (preferably in a Jupyter Notebook)
The first step in any data analysis process is importing the data. In Python, we can read data with libraries such as Pandas and NumPy, or with the built-in csv module. Here, we will use the Pandas library to import the data.

python

import pandas as pd
data = pd.read_csv('data.csv')
The Pandas read_csv function reads the CSV file and returns a Pandas DataFrame. A DataFrame is a 2D labeled data structure with rows and columns, similar to a spreadsheet.
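
To try read_csv without a file on disk, here is a minimal sketch that reads CSV text from memory. The column names match the ones used later in this post, but the values are made up for illustration and stand in for the real data.csv.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for 'data.csv'; the values are made up.
csv_text = """age,income,gender
25,32000,F
41,58000,M
37,,F
"""

# read_csv accepts a file path or any file-like object, so StringIO works here.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)             # (number of rows, number of columns)
print(data.columns.tolist())  # column labels taken from the header line
print(data)                   # the empty income field is read as NaN
```

The empty field in the last row becomes a missing value (NaN), which is exactly the kind of thing the cleaning step later on looks for.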

Understanding the Data
Before analyzing the data, it's essential to understand its structure and properties. The Pandas library provides various functions to explore the data, such as head, describe, and info.

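python

# Check the first five rows of data
print(data.head())

# Get the summary statistics of the numerical columns
print(data.describe())

# Get the column names, data types, and non-null counts
# (info() prints its report itself and returns None, so no print() is needed)
data.info()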
The head function displays the first five rows of the data, while the describe function provides summary statistics for the numerical columns. The info function displays the column names, data types, and non-null counts of each column.
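
By default, describe leaves out text columns. As a rough sketch, assuming a small made-up DataFrame with the same age, income, and gender columns, the difference looks like this:

```python
import pandas as pd

# Made-up sample data for illustration only.
data = pd.DataFrame({
    "age": [25, 41, 37, 52],
    "income": [32000.0, 58000.0, None, 61000.0],
    "gender": ["F", "M", "F", "M"],
})

# Numerical columns only (the default behaviour).
numeric_summary = data.describe()

# include="all" also summarizes non-numerical columns such as 'gender'.
full_summary = data.describe(include="all")

# Non-null count per column, the same numbers that info() reports.
non_null_counts = data.count()

print(numeric_summary)
print(full_summary)
print(non_null_counts)
```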

Data Cleaning
Data cleaning involves identifying and correcting errors, missing values, and inconsistencies in the data.

python

# Check for missing values
print(data.isnull().sum())

# Drop rows with missing values
data.dropna(inplace=True)

# Check for duplicate rows
print(data.duplicated().sum())

# Drop duplicate rows
data.drop_duplicates(inplace=True)
The isnull function flags missing values in the data, and chaining the sum function returns the number of missing values in each column. The dropna function drops every row that contains at least one missing value. The duplicated function flags rows that repeat an earlier row, and the drop_duplicates function drops those repeated rows.
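
To see the effect of these steps, here is a small sketch on made-up data with one missing value and one repeated row. It uses the non-inplace form, which returns a new DataFrame and keeps the original for comparison.

```python
import pandas as pd

# Made-up sample data: row 2 repeats row 1, and row 3 has a missing income.
data = pd.DataFrame({
    "age": [25, 41, 41, 37, 52],
    "income": [32000.0, 58000.0, 58000.0, None, 61000.0],
    "gender": ["F", "M", "M", "F", "M"],
})

missing_per_column = data.isnull().sum()  # missing values in each column
duplicate_rows = data.duplicated().sum()  # rows that repeat an earlier row

# Drop incomplete rows, then repeated rows, and renumber the index.
cleaned = data.dropna().drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("rows before:", len(data), "rows after:", len(cleaned))
```

Dropping rows is the simplest option, but it throws information away, so it is worth checking how many rows are lost before relying on it.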

Data Visualization
Data visualization is a powerful tool to understand and communicate insights from the data. Python has several libraries for data visualization, including Matplotlib and Seaborn.

python

import matplotlib.pyplot as plt
import seaborn as sns

# Plot a histogram of a numerical variable
sns.histplot(data['age'])
plt.show()

# Plot a scatterplot of two numerical variables
sns.scatterplot(x='age', y='income', data=data)
plt.show()

# Plot a barplot of the mean income for each gender
sns.barplot(x='gender', y='income', data=data)
plt.show()
The histplot function in Seaborn plots a histogram of a numerical variable, while the scatterplot function plots a scatterplot of two numerical variables. The barplot function plots one bar per category, showing by default the mean of a numerical variable within that category along with an error bar.

Data Analysis
After cleaning and visualizing the data, we can perform analysis to understand the relationships and patterns in the data.

python

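# Calculate the correlation matrix of the numerical columns
# (numeric_only=True skips text columns such as 'gender')
corr_matrix = data.corr(numeric_only=True)

# Plot a heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True)
plt.show()

# Calculate the mean income by gender
mean_income = data.groupby('gender')['income'].mean()

# Plot a barplot of the mean income by gender
mean_income.plot(kind='bar')
plt.show()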
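
To make the numbers behind these plots concrete, here is a small sketch on made-up data. It selects the numerical columns explicitly with select_dtypes before calling corr, which is one way to keep text columns such as gender out of the correlation matrix.

```python
import pandas as pd

# Made-up sample data for illustration only.
data = pd.DataFrame({
    "age": [25, 41, 37, 52],
    "income": [30000.0, 60000.0, 50000.0, 80000.0],
    "gender": ["F", "M", "F", "M"],
})

# Correlation is only defined for numbers, so keep the numerical columns.
corr_matrix = data.select_dtypes(include="number").corr()

# One mean per gender, returned as a Series indexed by gender.
mean_income = data.groupby("gender")["income"].mean()

print(corr_matrix)
print(mean_income)
```

With these values, the mean income is 40000 for F and 70000 for M, and the diagonal of the correlation matrix is 1 because every column correlates perfectly with itself.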

In short, EDA is the first step in understanding a data set, and an essential tool that comes in handy for data science.
