Exploratory Data Analysis (EDA) is a critical first step in any data analysis process: it summarizes a dataset's main characteristics, often using visual methods. The aim is to understand the data's structure, its patterns, any anomalies, and the assumptions it must satisfy, which helps in forming hypotheses for further analysis.
To gain effective insight from a dataset, EDA typically follows a stepwise procedure:
1. Data Collection and Exploration
A crucial step that involves sourcing data for analysis, whether from a database, an API, files (.csv, .json, .xlsx), or even web scraping.
The collected data is then explored using Python's pandas (or R) to display basic information about it, using methods such as .describe(), .info(), and .head():
import pandas as pd
# Load data
data = pd.read_csv('data.csv')
# Display basic information
print(data.info())
print(data.describe())
print(data.head())
2. Data Cleaning and Preprocessing
This stage involves identifying and handling missing values, and fixing errors and inconsistencies in the data.
Data normalization and transformation are also applied here as preprocessing techniques.
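As a minimal sketch of this stage (the column names and values below are hypothetical), duplicates can be dropped and missing values imputed like this:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values and a duplicate row
df = pd.DataFrame({
    'age': [25, np.nan, 31, 31],
    'city': ['Nairobi', 'Lagos', None, None],
})

df = df.drop_duplicates()                          # remove exact duplicate rows
df['age'] = df['age'].fillna(df['age'].median())   # impute numeric column with the median
df['city'] = df['city'].fillna('Unknown')          # impute categorical column with a placeholder

print(df)
```

Imputation choices (median vs. mean vs. placeholder) depend on the column's type and distribution; the point here is simply that no missing values remain afterwards.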
3. Descriptive Statistics
In this stage, statistical analyses are performed, such as measures of central tendency (mean, median, mode), measures of dispersion (range, standard deviation, variance), and a general statistical summary of the data.
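These measures can be computed directly with pandas; the sample values below are hypothetical:

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical numeric column

mean = s.mean()                  # central tendency: arithmetic mean
median = s.median()              # central tendency: middle value
mode = s.mode()[0]               # central tendency: most frequent value
value_range = s.max() - s.min()  # dispersion: range
std = s.std()                    # dispersion: sample standard deviation

print(mean, median, mode, value_range, std)
```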
4. Visualization of Data
This step uses plotting tools and libraries to gain visual insight from a dataset. Each visualization type is effective for a particular purpose, e.g.
- Histograms: effective for visualizing distributions
- Scatter plots: effective for exploring relationships between continuous variables
- Heatmaps: effective for visualizing correlation matrices, etc.
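A short sketch of the first two plot types using matplotlib (the data is randomly generated for illustration; the Agg backend is used so the script runs without a display):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)  # y depends on x, plus noise

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(x, bins=20)        # histogram: distribution of x
axes[0].set_title('Histogram')
axes[1].scatter(x, y, s=10)     # scatter plot: relationship between x and y
axes[1].set_title('Scatter plot')
fig.savefig('eda_plots.png')
```

For heatmaps of correlation matrices, seaborn's `heatmap` function is a common choice.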
5. Handling Missing Data, Identifying and Handling Outliers
In this phase the data is inspected visually and summarized statistically. Rows or columns with missing values are removed, or the gaps are filled using measures of central tendency or advanced methods like KNN imputation. Outliers are detected using box or scatter plots, or by calculating Z-scores or applying the IQR method. Identified outliers are then handled by removal, transformation, or capping.
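The IQR method mentioned above can be sketched as follows (the series is hypothetical, with one obvious outlier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 98])  # 98 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # conventional 1.5 * IQR fences

outliers = s[(s < lower) | (s > upper)]  # values outside the fences
capped = s.clip(lower, upper)            # "capping" keeps the row but limits its value

print(outliers.tolist(), capped.max())
```

Capping (also called winsorizing) is often preferred over removal when every row carries information the analysis cannot afford to lose.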
6. Understanding Data Distribution
This phase involves analyzing each variable's distribution, checking whether it is normal and measuring its skewness and kurtosis, then applying transformations such as log or square root to normalize the data.
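A minimal sketch of checking skewness and applying a log transform, using synthetic right-skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000))  # strongly right-skewed

print('skew before:', s.skew())
transformed = np.log1p(s)  # log(1 + x), safe even when values include zero
print('skew after :', transformed.skew())
```

The transformed series should show a skewness much closer to zero, indicating a more symmetric, approximately normal distribution.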
7. Feature Engineering and Transformation
Feature engineering involves creating new features, such as extracting year, month, and day from date-time columns, while transformation involves scaling or encoding existing features.
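Extracting date-time features with pandas might look like this (the column name and dates are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'order_date': ['2023-01-15', '2023-06-01', '2024-03-20']})
df['order_date'] = pd.to_datetime(df['order_date'])

# Derive new features from the parsed dates
df['year'] = df['order_date'].dt.year
df['month'] = df['order_date'].dt.month
df['day'] = df['order_date'].dt.day
df['day_of_week'] = df['order_date'].dt.dayofweek  # Monday = 0

print(df)
```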
8. Correlation Analysis
Correlation coefficients (Pearson, Spearman, or Kendall) and correlation matrices are computed. In this phase multicollinearity is detected using VIF or the condition index, and mitigated by removing or combining correlated features.
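A correlation matrix can be computed with pandas alone (the data below is synthetic, with `b` deliberately constructed to correlate with `a`); VIF would additionally require statsmodels' `variance_inflation_factor`:

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    'a': a,
    'b': a * 0.9 + rng.normal(scale=0.1, size=100),  # highly correlated with a
    'c': rng.normal(size=100),                        # independent of a
})

corr = df.corr(method='pearson')  # 'spearman' and 'kendall' are also supported
print(corr)
```

A pair with a coefficient near ±1 (here `a` and `b`) is a candidate for removal or combination before modeling.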
9. Iterative Review and Refinement
This final stage involves documenting and reviewing the insights gained from the Exploratory Data Analysis (EDA) and refining the data preprocessing steps based on the findings. Findings are shared with stakeholders for feedback, and the EDA process is iterated based on new insights and feedback.
In conclusion, the EDA implementation cycle is an iterative and comprehensive procedure that provides a thorough insight into and understanding of the data. By systematically following the steps above, an analyst can uncover valuable insights, detect potential issues in the data, and prepare the data for further analysis and modeling.