Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the process of investigating and visualizing data to highlight its key features, patterns and relationships.

Steps in EDA

Data Cleaning
Data cleaning is the process of identifying and then correcting or removing discrepancies, inconsistencies and missing values in the dataset. The dataset may come as a JSON, CSV or plain-text (TXT) file. Cleaning targets three types of errors: missing values, bad values and duplicates. Removing them keeps the later analysis accurate.
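As an illustration, the three error types can be handled in a few lines of standard-library Python. This is only a sketch: the CSV content, the column names and the plausible age range of 0 to 120 are all assumptions made for the example, and dropping a row is just one possible policy (imputing or correcting the value is another).

```python
import csv
import io

# Hypothetical raw CSV: one duplicate row, one missing age, one bad age.
RAW_CSV = """name,age
Ann,34
Bob,
Ann,34
Cara,-5
Dev,41
"""

def clean_rows(text):
    """Drop duplicate rows and rows whose age is missing or implausible."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["name"], row["age"])
        if key in seen:                    # duplicate
            continue
        seen.add(key)
        age = row["age"].strip()
        if not age:                        # missing value
            continue
        try:
            age_value = int(age)
        except ValueError:                 # bad value: not an integer
            continue
        if not 0 <= age_value <= 120:      # bad value: outside the assumed range
            continue
        cleaned.append({"name": row["name"], "age": age_value})
    return cleaned

cleaned = clean_rows(RAW_CSV)
print(cleaned)
```

Only the rows for Ann (once) and Dev survive. In a real project the rejected rows would usually be logged or reviewed rather than silently discarded.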
Data Visualization
Data visualization uses graphs, charts and other visual tools to represent the data. This step makes the patterns and relationships in the data easy to see, which in turn makes it easier to communicate insights to stakeholders.
Descriptive Statistics
Descriptive statistics covers the calculation of basic statistical measures such as the mean, median, mode, standard deviation and variance. These measures describe the central tendency and spread of the data, and comparing the mean with the median gives a first hint of skewness.
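The measures named above can be computed with Python's built-in statistics module. The sample values below are made up for illustration, and the sample (n - 1) versions of variance and standard deviation are used, which is an assumption about what the analysis needs.

```python
import statistics

# Hypothetical sample with one large value pulling the mean upward.
values = [2, 3, 3, 4, 5, 7, 18]

summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(values),        # square root of the sample variance
}

for name, value in summary.items():
    print(f"{name:>8}: {value:.3f}")

# A mean well above the median is a rough sign of right (positive) skew.
right_skew_hint = summary["mean"] > summary["median"]
print("mean > median:", right_skew_hint)
```

Here the mean sits above the median because of the single large value, which is the kind of signal that prompts a closer look at the distribution.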
Data Distribution
Data distribution describes how data points are spread across a range of values. Understanding the distribution makes it possible to gauge the spread of the data and to identify outliers and trends.
Correlation
Correlation refers to the degree of association between two variables. Identifying correlations between variables can provide insights into their relationships and point to potential causal factors, although correlation alone does not establish causation.
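One common way to quantify the degree of association is the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive one). The sketch below computes it by hand. The choice of Pearson rather than a rank-based coefficient, and the study-hours data, are assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson(hours, scores)
print(f"r = {r:.3f}")
```

A value close to +1 here indicates a strong positive linear association. It says nothing by itself about whether more study hours cause higher scores.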

EDA Techniques
EDA can be carried out with a variety of tools and techniques. The choice depends on the nature of the data and the research questions. Frequently applied techniques include:

Histograms
Histograms are used to visualize the distribution of a single variable. They show the number of data points that fall into each bin, where a bin is a range of values.
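The binning behind a histogram can be sketched without any plotting library. The function name, the equal-width bins and the sample values are assumptions made for this example. Plotting libraries offer other binning rules.

```python
def histogram(values, bins):
    """Count values in `bins` equal-width bins between min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value goes into the last bin instead of a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

# Hypothetical measurements.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(values, bins=4)

# Text rendering: one '#' per data point in the bin.
# Bins are half-open on the right, except the last bin, which also holds the maximum.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}) {'#' * count}")
```

Most points fall in the two lowest bins, with a single value far to the right, which is the pattern a plotted histogram would make visible at a glance.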
Box Plots
Box plots are used to depict the distribution of a variable and to identify outliers. They show the median, the quartiles and any outliers in the data.
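The numbers a box plot draws can be computed directly. The sketch below assumes the widely used convention that points more than 1.5 times the interquartile range (IQR) beyond the quartiles count as outliers. That rule, the "inclusive" quartile method and the sample values are assumptions and not something the text above prescribes.

```python
import statistics

# Hypothetical measurements with one suspiciously large value.
values = [12, 15, 14, 10, 13, 15, 40, 11]

# Quartiles; "inclusive" treats the data as the full population range.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]
inliers = [v for v in values if lower_fence <= v <= upper_fence]

# Whiskers extend to the most extreme points that are not outliers.
whisker_low, whisker_high = min(inliers), max(inliers)

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

Different quartile methods give slightly different boxes on small samples, so the exact quartile values depend on the method chosen.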
Scatter Plots
Scatter plots are used to show the relationship between two variables. They can reveal positive and negative correlations, patterns and outliers in the data.
Heatmaps
Heatmaps are used to visualize the correlation between multiple variables, typically as a color-coded correlation matrix. They make strongly related groups of variables easy to spot.
Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique used to identify the underlying structure of complex datasets. It reduces the number of variables by projecting the data onto the directions of greatest variance, retaining the most important information.
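To make the idea concrete, the sketch below reduces two variables to one using the closed-form eigen-decomposition of a 2x2 covariance matrix. It is a deliberately small, standard-library illustration with made-up points and a hypothetical function name. For real datasets with many variables, a numerical library would normally be used instead.

```python
import math

# Hypothetical two-variable dataset with a strong linear trend.
points = [(2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 6.0)]

def pca_2d(points):
    """Return the first principal axis, the 1-D scores and the explained variance ratio."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = half_trace + radius, half_trace - radius

    # Eigenvector belonging to the larger eigenvalue lam1.
    if b != 0:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # Project each centered point onto the first principal axis.
    scores = [x * vx + y * vy for x, y in centered]
    explained = lam1 / (lam1 + lam2)
    return (vx, vy), scores, explained

axis, scores, explained = pca_2d(points)
print("first principal axis:", axis)
print("share of variance retained:", round(explained, 4))
```

Because the two variables move almost together, a single component keeps nearly all of the variance, which is the situation in which dropping the second dimension loses little information.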

EDA is therefore pivotal in the data analysis process. By combining data cleaning, data visualization, descriptive statistics, distribution analysis and correlation analysis, data analysts can uncover insights and patterns in the data.