Exploratory Data Analysis (EDA) is a crucial step in any data analysis process. It involves exploring and understanding the structure, patterns, and relationships within the data. In this post, we will discuss how to perform EDA in Python using various libraries and techniques.
Python has a wide range of libraries for data analysis, including NumPy, Pandas, Matplotlib, and Seaborn, to mention but a few. These libraries provide an extensive set of tools to handle, clean, and visualize data. Let's dive into how to perform EDA in Python.
Importing Data in Jupyter Notebook (the preferred environment)
The first step in any data analysis process is importing the data. In Python, we can import data using various libraries such as Pandas, NumPy, or the built-in csv module. Here, we will use the Pandas library to import the data.
python
import pandas as pd
data = pd.read_csv('data.csv')
The Pandas read_csv function reads the CSV file and creates a Pandas DataFrame. A DataFrame is a 2D labeled data structure with rows and columns, similar to a spreadsheet.
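To try read_csv without a file on disk, here is a minimal sketch that reads from an in-memory string instead. The column names and values are invented for illustration and are not the contents of the data.csv used above.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv (values are invented)
csv_text = """age,income,gender
25,40000,F
32,52000,M
47,61000,F
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column labels taken from the header row
```

Checking the shape and column labels right after loading is a quick way to confirm that the header row and the separator were parsed as expected.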
Understanding the Data
Before analyzing the data, it's essential to understand its structure and properties. The Pandas library provides various functions to explore the data, such as head, describe, and info.
python
# Check the first five rows of data
print(data.head())
# Get the summary statistics of data
print(data.describe())
# Get the information about the data (info prints directly and returns None)
data.info()
The head function displays the first five rows of the data by default, while the describe function provides summary statistics for the numerical columns. The info function displays the column names, data types, and non-null counts of each column.
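The three functions can be tried on a small hand-made DataFrame. This is only a sketch: the column names follow the later examples in this post, and the values (including the deliberately missing income) are invented.

```python
import pandas as pd

# Hypothetical data set; values are invented for illustration
sample = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40000, 52000, 61000, 58000, None, 45000],
    "gender": ["F", "M", "F", "M", "F", "M"],
})

first_rows = sample.head(3)   # head takes an optional row count
summary = sample.describe()   # covers the numerical columns by default

print(first_rows)
print(summary.loc[["count", "mean"]])
sample.info()                 # prints directly and returns None
```

The count row of describe already hints at missing data: income has one fewer value than age.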
Data Cleaning
Data cleaning involves identifying and correcting errors, missing values, and inconsistencies in the data.
python
# Check for missing values
print(data.isnull().sum())
# Drop rows with missing values
data.dropna(inplace=True)
# Check for duplicate values
print(data.duplicated().sum())
# Drop duplicate rows
data.drop_duplicates(inplace=True)
The isnull function identifies missing values in the data, and the sum function returns the number of missing values in each column. We can drop the rows with missing values using the dropna function. The duplicated function identifies duplicate rows in the data, and the drop_duplicates function drops the duplicate rows.
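Dropping rows is not the only option, since it discards the other values in those rows too. The sketch below compares it with filling the gap using the column median. The fill step is an alternative added here for comparison, not part of the steps above, and the data is invented.

```python
import pandas as pd

# Hypothetical data with one missing value and one duplicated row
sample = pd.DataFrame({
    "age": [25, 32, 32, 47, 51],
    "income": [40000.0, 52000.0, 52000.0, None, 58000.0],
})

missing_per_column = sample.isnull().sum()
duplicate_rows = sample.duplicated().sum()

# Option 1 (as above): drop incomplete rows, then drop duplicates
dropped = sample.dropna().drop_duplicates()

# Option 2 (an alternative, assumed here): fill with the column median
filled = sample.fillna({"income": sample["income"].median()}).drop_duplicates()

print(missing_per_column)
print(duplicate_rows, len(dropped), len(filled))
```

Which option is appropriate depends on how much data is missing and why. Both versions return new DataFrames, so the original stays available for comparison.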
Data Visualization
Data visualization is a powerful tool to understand and communicate insights from the data. Python has several libraries for data visualization, including Matplotlib and Seaborn.
python
import matplotlib.pyplot as plt
import seaborn as sns
# Plot a histogram of a numerical variable
sns.histplot(data['age'])
plt.show()
# Plot a scatterplot of two numerical variables
sns.scatterplot(x='age', y='income', data=data)
plt.show()
# Plot a barplot of mean income for each gender category
sns.barplot(x='gender', y='income', data=data)
plt.show()
The histplot function in Seaborn plots a histogram of a numerical variable, while the scatterplot function plots a scatterplot of two numerical variables. The barplot function plots one bar per category of a categorical variable, with the bar height showing the mean (by default) of a numerical variable within that category.
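It can help to look at the numbers behind these plots. The sketch below uses NumPy and Pandas rather than Seaborn to compute the per-bin counts a four-bin histogram of age would draw, and the per-category means a bar plot of income by gender shows by default. The data and the choice of four bins are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical values; column names mirror the examples above
sample = pd.DataFrame({
    "age": [22, 25, 27, 31, 38, 44, 52, 58],
    "income": [30000, 34000, 36000, 41000, 50000, 55000, 62000, 60000],
    "gender": ["F", "M", "F", "M", "F", "M", "F", "M"],
})

# A histogram counts how many values fall into each bin
counts, edges = np.histogram(sample["age"], bins=4)

# A bar plot of income by gender shows one aggregate per category (the mean by default)
bar_heights = sample.groupby("gender")["income"].mean()

print(counts, edges)
print(bar_heights)
```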
Data Analysis
After cleaning and visualizing the data, we can perform analysis to understand the relationships and patterns in the data.
python
# Calculate the correlation matrix (numerical columns only)
corr_matrix = data.corr(numeric_only=True)
# Plot a heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True)
plt.show()
# Calculate the mean income by gender
mean_income = data.groupby('gender')['income'].mean()
# Plot a barplot of the mean income by gender
mean_income.plot(kind='bar')
plt.show()
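The correlation step can be checked on data where the answer is known in advance. In this sketch the values are invented so that income rises exactly in step with age and spend falls exactly in step with it. The spend column is hypothetical and does not appear elsewhere in this post.

```python
import pandas as pd

# Hypothetical data: income rises with age, spend falls with age
sample = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "income": [30000, 35000, 40000, 45000, 50000],
    "spend": [900, 800, 700, 600, 500],
    "gender": ["F", "M", "F", "M", "F"],
})

# numeric_only=True skips the text column; in recent pandas versions,
# calling corr() on a frame that contains string columns raises an error
corr_matrix = sample.corr(numeric_only=True)

print(corr_matrix.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship. Correlation alone does not show that one variable causes the other.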
In short, EDA is the first step in understanding a data set, and an essential tool that comes in handy for data science.