Exploratory Data Analysis (EDA) is a critical step in the data analysis process, involving a variety of techniques to understand the structure, relationships, and patterns in a dataset.
Some of the essentials include:
1. Identifying the Data Structure
This step shows how many rows and columns the dataset has, that is, the shape of the dataframe.
It is also important to check the data type of each column and to use the describe function to get the mean, standard deviation, minimum, maximum and percentiles of the numerical columns.
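As a minimal sketch of these checks, assuming the data lives in a pandas DataFrame (the toy columns `age`, `city` and `income` are hypothetical and only for illustration):

```python
import pandas as pd

# Hypothetical toy dataset, used only for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "city": ["Nairobi", "Mombasa", "Kisumu", "Nairobi"],
    "income": [32000.0, 54000.0, 61000.0, 45000.0],
})

n_rows, n_cols = df.shape   # (number of rows, number of columns)
column_types = df.dtypes    # data type of each column
summary = df.describe()     # count, mean, std, min, quartiles, max of numeric columns

print(n_rows, n_cols)
print(column_types)
print(summary)
```

By default `describe` only summarises the numerical columns, so `city` does not appear in `summary`.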
2. Handling Missing Data
Identify how much data is missing in each column (for example, null values) and decide how to impute it, e.g. with the column mean. If you choose not to impute, removing the rows that have missing data is also an option.
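A rough sketch of both options with pandas, on a hypothetical dataframe with a few gaps (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values, for illustration only.
df = pd.DataFrame({
    "age": [23.0, np.nan, 41.0, 29.0],
    "income": [32000.0, 54000.0, np.nan, np.nan],
})

missing_counts = df.isna().sum()   # number of missing values per column
missing_share = df.isna().mean()   # fraction of missing values per column

# Option 1: impute each numeric column with its own mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Option 2: drop every row that has at least one missing value.
dropped = df.dropna()

print(missing_counts)
print(imputed)
print(dropped)
```

Mean imputation keeps every row but shrinks the variance of the column, while dropping rows keeps the values untouched at the cost of a smaller dataset; which trade-off is acceptable depends on how much data is missing.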
3. Detecting Outliers
Plot graphs such as boxplots to see whether there are outliers. You can use statistical methods such as the z-score or the interquartile range (IQR) to quantify them.
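The two statistical methods could be sketched as follows. The sample values are hypothetical, and the cut-offs (1.5 times the IQR, and an absolute z-score above 2) are common conventions assumed here rather than fixed rules; a z-score cut-off of 3 is often used on larger samples.

```python
import pandas as pd

# Hypothetical sample with one obviously extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95], name="value")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# z-score rule: flag points far from the mean in units of standard deviation.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]

print(iqr_outliers)
print(z_outliers)
```

The IQR rule is based on quartiles, so it is less affected by the extreme value itself than the z-score, whose mean and standard deviation are both pulled towards the outlier.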
4. Plot Visualisations
- Histograms display the distribution of numerical variables.
- Boxplots show the spread and identify outliers in numerical data.
- Bar charts display the distribution of categorical variables.
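One way to draw the three plots listed above is with matplotlib, which is an assumption here since no plotting library is named. The dataframe and its `income` and `city` columns are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical dataset, for illustration only.
df = pd.DataFrame({
    "income": [32, 54, 61, 45, 38, 47, 120, 51],
    "city": ["Nairobi", "Mombasa", "Kisumu", "Nairobi",
             "Nairobi", "Kisumu", "Mombasa", "Mombasa"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: distribution of a numerical variable.
axes[0].hist(df["income"], bins=5)
axes[0].set_title("Histogram of income")

# Boxplot: spread of a numerical variable, with outliers drawn as points.
axes[1].boxplot(df["income"])
axes[1].set_title("Boxplot of income")

# Bar chart: how often each category occurs.
city_counts = df["city"].value_counts()
axes[2].bar(list(city_counts.index), city_counts.values)
axes[2].set_title("Rows per city")

fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script you would drop the `matplotlib.use("Agg")` line and call `plt.show()`, or save the figure with `fig.savefig(...)`.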
5. Explore Relationships Between Variables (Bivariate Analysis)
Correlation Analysis: Use correlation matrices and heatmaps to explore relationships between numerical variables.
Pair Plots: Create scatter plot matrices to visualize relationships between pairs of variables.
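As a small illustration of the correlation step, the matrix can be computed with pandas before any plotting. The columns `hours`, `score` and `absences` are hypothetical, and Pearson correlation (the pandas default) is assumed.

```python
import pandas as pd

# Hypothetical numerical data, for illustration only.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
})

# Pairwise Pearson correlation between all numerical columns.
corr = df.corr()

print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship and values near 0 a weak one. The resulting matrix is what a heatmap would colour, and a scatter plot matrix of the same columns gives the pair plot view.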
6. Feature Engineering
You can add new columns derived from existing ones to get additional information for analysis.
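For instance, assuming a hypothetical orders table, new columns can be derived from existing ones like this (all column names and the bulk threshold of 5 are made up for illustration):

```python
import pandas as pd

# Hypothetical orders data, for illustration only.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-02-03", "2024-02-20"]),
    "quantity": [2, 5, 1],
    "unit_price": [10.0, 4.0, 25.0],
})

# Combine two numerical columns into a new one.
df["total_price"] = df["quantity"] * df["unit_price"]

# Extract a date part to allow grouping by month.
df["order_month"] = df["order_date"].dt.month

# Turn a numerical column into a yes/no flag (threshold chosen arbitrarily).
df["is_bulk"] = df["quantity"] >= 5

print(df)
```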
7. Identifying Patterns and Trends
Once the data is clean and you have all the columns you need, interpret the plots you have produced and explain the patterns and trends you have noticed.
- Data cleaning involves handling outliers, ensuring each column has the right data type, removing duplicates and detecting empty rows.
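A minimal sketch of the remaining cleaning steps (empty rows, duplicates and data types) with pandas, on a hypothetical dataframe whose numbers were read in as text:

```python
import pandas as pd

# Hypothetical raw data: numbers stored as text, one duplicate row, one empty row.
df = pd.DataFrame({
    "id": ["1", "2", "2", None],
    "price": ["10.5", "20", "20", None],
})

empty_rows = df.isna().all(axis=1).sum()   # rows where every column is missing
duplicate_rows = df.duplicated().sum()     # rows that repeat an earlier row

cleaned = (
    df.dropna(how="all")     # remove completely empty rows
      .drop_duplicates()     # keep the first copy of each repeated row
      .assign(
          id=lambda d: pd.to_numeric(d["id"]),        # text -> integer
          price=lambda d: pd.to_numeric(d["price"]),  # text -> float
      )
)

print(empty_rows, duplicate_rows)
print(cleaned.dtypes)
```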