Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data in order to discover patterns, identify outliers, test hypotheses and check assumptions with the help of summary statistics and visualisations.
Put simply, EDA is the process of understanding your data before performing complex analysis or building models.
The main checks done on the imported data frame are:
1. Understanding Your Data
This is the process of checking the dataset to establish the datatype of each column, as well as checking the shape of the data to see its size (the number of rows and columns). These checks allow you to better plan your analysis.
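As a minimal sketch of these first checks, assuming the data has been loaded into a pandas DataFrame (the small `df` below is made-up sample data, not from a real dataset):

```python
import pandas as pd

# Hypothetical sample data standing in for an imported dataset.
df = pd.DataFrame({
    "name": ["Ann", "Ben", "Cara", "Dan"],
    "age": [23, 35, 41, 29],
    "salary": [42000.0, 58000.5, 61000.0, 39000.0],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# dtypes lists the data type pandas inferred for each column.
print(df.dtypes)

# head() shows the first few rows as a quick sanity check.
print(df.head(2))
```

With a real file you would typically build `df` with something like `pd.read_csv(...)` instead of constructing it by hand.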
2. Handling Duplicates
This is the process of checking for duplicate records within the dataset. Deleting or modifying the duplicates improves the quality of the analysis.
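One possible way to do this in pandas is with `duplicated()` and `drop_duplicates()`. The sketch below uses made-up sample data and treats a row as a duplicate only when every column matches an earlier row:

```python
import pandas as pd

# Hypothetical sample data: the third row repeats the second.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Nairobi", "Mombasa", "Mombasa", "Kisumu"],
})

# duplicated() marks rows that repeat an earlier row (all columns equal).
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())
print(f"Duplicate rows found: {n_duplicates}")

# drop_duplicates() keeps the first occurrence and removes the repeats.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

If only some columns define a duplicate (for example an ID column), both methods accept a `subset` argument.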
3. Handling Missing Values
This is the process of checking the dataset to establish any omissions or incomplete data entries. The 'isnull' function flags the null values. For modification, the missing data is replaced with either the mean, median, mode or whichever measure is best suited for the datatype being replaced. This process ensures that gaps in the data do not distort the analysis that follows.
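A minimal sketch of this step, assuming a pandas DataFrame with made-up data. The choice of median for the numeric column and mode for the categorical column is only an illustration; the right measure depends on your data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Nairobi", "Kisumu", None, "Nairobi"],
})

# isnull() returns True where a value is missing; sum() counts per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Work on a copy so the original data stays available for comparison.
filled = df.copy()

# Numeric column: replace missing values with the median.
filled["age"] = filled["age"].fillna(filled["age"].median())

# Categorical column: replace missing values with the most frequent value.
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(filled)
```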
4. Describing the Data
This is the process of carrying out Descriptive Statistical Analysis to inform the kind of data models that can be used for analysis. The 'describe' function is used to get an overview of the variables. For numerical columns it reports basic statistics such as the count, mean, standard deviation, min, max and percentiles. Other measures, such as the mode and variance, can be calculated separately.
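The sketch below shows what this might look like in pandas, using a made-up `score` column. Note that `describe()` does not report the mode or variance, so they are computed separately here:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"score": [10, 20, 20, 30, 40]})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max.
summary = df["score"].describe()
print(summary)

# mode() returns a Series because there can be ties; take the first value.
mode_value = df["score"].mode()[0]

# var() computes the sample variance (divides by n - 1 by default).
variance = df["score"].var()

print(f"mode={mode_value}, variance={variance}")
```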
5. Understanding the Distribution of Data
At this point, visualisation can be done to establish the distribution of each variable and the relationships between the variables.
Univariate Analysis (One Variable at a Time): Visualise individual variables. Use histograms for numerical data to see the distribution and bar charts for categorical data to understand frequency.
Bivariate Analysis (Two Variables Together): Explore relationships between two variables using scatter plots for numerical variables and bar charts for categorical and numerical pairings.
Scatter Plots and other visualisations can be used to further explore data to identify patterns or trends.
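The univariate and bivariate plots described above could be sketched as follows. This assumes matplotlib is installed alongside pandas; the data, column names and output filename are all made up for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "age": [22, 25, 31, 35, 38, 41, 45, 52],
    "income": [28, 32, 40, 45, 47, 55, 58, 70],
    "segment": ["A", "B", "A", "C", "B", "A", "C", "A"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Univariate, numerical: histogram of a single column.
axes[0].hist(df["age"], bins=4)
axes[0].set_title("Histogram of age")

# Univariate, categorical: bar chart of how often each category occurs.
segment_counts = df["segment"].value_counts().sort_index()
axes[1].bar(segment_counts.index.tolist(), segment_counts.values)
axes[1].set_title("Count per segment")

# Bivariate, numerical vs numerical: scatter plot.
axes[2].scatter(df["age"], df["income"])
axes[2].set_xlabel("age")
axes[2].set_ylabel("income")
axes[2].set_title("age vs income")

fig.tight_layout()
fig.savefig("eda_plots.png")  # hypothetical output file
```

In a notebook you would usually drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving to a file.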
6. Handling Outliers
Outliers are individual data points that lie far away from the rest of the values in a dataset.
Outliers can be identified by the use of visualisations such as Box Plots.
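Box plots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles and show anything further out as an individual point. The same rule can be applied numerically. The sketch below is one common convention rather than the only definition of an outlier, and the `delivery_days` data is made up:

```python
import pandas as pd

# Hypothetical sample data with one obviously extreme value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 95], name="delivery_days")

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the fences is flagged as a potential outlier.
outliers = values[(values < lower) | (values > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers)
```

Whether a flagged point should be removed, capped or kept depends on whether it is a data entry error or a genuine extreme value.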
7. Exploring Correlation between the Variables
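Correlation analysis is carried out to understand how strongly two variables are related, that is, how one tends to change as the other changes. It aids in establishing the extent to which the variables depend on each other. A strong correlation does not by itself show that one variable causes the other.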
The strength of correlation can be measured and displayed in the following ways:
Pearson Correlation Coefficient (PCC): a measure of linear correlation between two variables, ranging from -1 (perfect negative) to +1 (perfect positive).
Correlation Matrix: displays the correlation values between all pairs of variables. A heatmap can be used to visualise a correlation matrix.
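As a rough sketch, both can be computed in pandas with `corr()`, which uses the Pearson method by default. The columns below are made-up sample data:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [5, 4, 3, 2, 1],
})

# Pearson correlation coefficient between two specific columns.
pcc = df["hours_studied"].corr(df["exam_score"])
print(f"PCC(hours_studied, exam_score) = {pcc:.3f}")

# Correlation matrix across all numeric columns (Pearson by default).
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The resulting `corr_matrix` is itself a DataFrame, so it can be passed to a plotting library to draw the heatmap.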
This careful examination and cleaning of your data lays the foundation for more accurate and meaningful analysis, helping you to draw better insights from your data.