The Ultimate Guide to Exploratory Data Analysis (EDA).

Exploratory data analysis (EDA) is an essential step in the data analysis process. It is used to understand the data and find the relationships between the variables in a dataset. Analysts and professionals apply a range of activities and techniques, but in this guide we will walk through the steps and techniques used to carry out an effective EDA on your dataset.
The code blocks provided in this guide are in Python. They can be modified to carry out your specific task.
1. Knowing the Data.
This is the very first step in EDA. It aids in understanding your dataset: you get to know the structure of the dataset and the variables in your data, together with their properties. Techniques used to know your data include:
Data Summary: This involves calculating summary statistics such as the median, mean, mode, variance, and standard deviation to get an overview of the data. It helps you find, or at least get a clue of, the range of values, the central tendency, and the spread of the data.
Data Visualization: This is the art of creating charts and graphs to represent the data so that anyone (including the client) can visualize it. Visualization helps a great deal in seeing how your data is spread or distributed even before you dig into the numbers, and it helps you identify patterns, trends, and anomalies in the data. The most popular visualization techniques include histograms, box plots, bar charts, scatter plots, pie charts, heat maps, and count plots (see the sketch after the code block below).
Data Sampling: Sometimes you may obtain an extremely large dataset that is too hard to analyze at once, so you can take a sample of the data, which makes it easier to find its properties. This technique also helps a great deal when it comes to finding biases and sampling errors in the data (a sampling example is included in the sketch below).

Code Block

import pandas as pd

# Read data from a CSV file
df = pd.read_csv('data.csv')

# Display the first 5 rows of the data
print(df.head())

# Display the last 5 rows of the data
print(df.tail())

# Display the number of rows and columns in the data
print(df.shape)

# Display the column names
print(df.columns)

# Display the data types of each column
print(df.dtypes)

# Display summary statistics for numeric columns
print(df.describe())

# Display unique values in a column
print(df['column_name'].unique())

# Display the count of each unique value in a column
print(df['column_name'].value_counts())

# Display the number of missing values in each column
print(df.isna().sum())

# Display the number of non-missing values in each column
print(df.count())

# Display the correlation matrix for numeric columns
# (numeric_only=True is needed in pandas 2.0+ when non-numeric columns are present)
print(df.corr(numeric_only=True))

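To illustrate the visualization and sampling techniques described above, here is a minimal sketch using matplotlib and seaborn. The column names 'column_name' and 'category_column' and the sample size of 1,000 are placeholders, not values from any particular dataset.

import matplotlib.pyplot as plt
import seaborn as sns

# Histogram showing the distribution of a numeric column
df['column_name'].hist(bins=30)
plt.xlabel('column_name')
plt.ylabel('frequency')
plt.show()

# Count plot showing how often each category occurs
sns.countplot(x=df['category_column'])
plt.show()

# Draw a random sample of 1,000 rows for quicker exploration
sample_df = df.sample(n=1000, random_state=42)
print(sample_df.describe())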

These techniques, as explained above, will aid you as an analyst or data professional in understanding your data and pointing out any anomalies that may need to be fixed before starting the analysis.

2. Identifying the missing values and outliers in the data.
Identifying missing values and outliers is the second step when carrying out EDA on any dataset. Missing values and outliers greatly impact the quality and accuracy of the analysis, hence they should be checked and fixed before analysis of the data begins.
Techniques used to handle missing values and outliers include:
Identifying missing values: You can check for missing values by looking for NaN values, null values, or placeholder zeros in the variable columns. Missing values can be handled by replacing them with the median or mean value, or by deleting the rows with missing values (see the sketch after the code block below).
Outliers: Outliers are identified using visualizations such as box plots, scatter plots, and histograms. Statistical techniques can also be used to identify outliers; these include the z-score and Tukey fences. Once outliers are identified, they can either be removed or replaced with a more reasonable value. How you handle outliers greatly depends on the impact they will have on the analysis of the dataset.

Code Block

# Check for missing values
print(df.isna().sum())

# Identify outliers using box plots
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(x=df['column_name'])
plt.show()

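Once missing values and outliers have been identified, the sketch below shows one way to handle them: median imputation, dropping incomplete rows, and Tukey fences (1.5 × IQR) for outliers. 'column_name' is a placeholder, and the 1.5 multiplier is the conventional default rather than a value tuned to any specific dataset.

# Replace missing values with the column median (the mean works the same way)
df['column_name'] = df['column_name'].fillna(df['column_name'].median())

# Alternatively, drop rows that contain missing values
df_clean = df.dropna()

# Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1 = df['column_name'].quantile(0.25)
q3 = df['column_name'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(df[(df['column_name'] < lower) | (df['column_name'] > upper)])

# Keep only the rows inside the fences
df_no_outliers = df[(df['column_name'] >= lower) & (df['column_name'] <= upper)]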

Identifying outliers and missing values contributes largely to making sure your data is clean before proceeding with the analysis.
3. Assess the data quality.
This involves examining the data for errors, inconsistencies, and other issues that can affect the quality and accuracy of the analysis. The following techniques are used to assess data quality.
Data Completeness: This involves checking whether the data in the required columns and variables is complete and finding out if there are any missing values. This ensures that the data necessary for the analysis is available.
Data Consistency: The process of checking whether the data is consistent across different sources. For example, if you have data from several sources, consistency is checked across all of them.
Data Accuracy: Data accuracy can be affirmed by cross-checking it against external data sources or by using logical reasoning to check whether the values are valid and reasonable (see the validity-check sketch after the code block below).
Data Relevance: This is used to check that the data is relevant and aligns with the research question or problem statement. The process ensures that the values and variables in the dataset are appropriate for the analysis.
Code Block

# Check data completeness
print(df.isna().sum())

# Check data consistency by merging sources on a shared key
df2 = pd.read_csv('data2.csv')
df3 = pd.read_csv('data3.csv')
df_merged = pd.merge(df, df2, on='key')
df_merged = pd.merge(df_merged, df3, on='key')
# Columns present in more than one source can now be compared side by side

# Check data accuracy
print(df['column_name'].describe())

# Check data relevance
print(df.info())

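For the accuracy check in particular, simple logical rules often catch invalid records. The sketch below assumes a hypothetical 'age' column with a plausible range of 0 to 120; substitute rules that make sense for your own variables.

# Flag rows whose values fall outside a plausible range
invalid_rows = df[(df['age'] < 0) | (df['age'] > 120)]
print(len(invalid_rows), 'rows have an implausible age')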

By assessing the data quality, you can identify any potential issues or errors in the data and take appropriate steps to address them. This stage helps guarantee the accuracy and reliability of the analysis.

4. Exploring the relationship between variables.
EDA involves checking the relationships between variables. This means calculating the correlation between variables and looking for patterns in the dataset. The techniques commonly applied include:
Correlation Analysis: Calculating the correlation between variables checks for linear relationships. This helps identify which variables are strongly correlated and which are weakly correlated.
Scatter Plots: Scatter plots are a great way to visualize the relationship between two variables. They help identify patterns, and one can easily tell the trend between the variables.
Heat Maps: These help show the correlation between variables. They are used to show clusters of strongly correlated variables as well as those that are weakly correlated.
Regression Analysis: Regression analysis is used to identify the relationship between one dependent variable and one or more independent variables. This can help you identify which independent variables are most strongly related to the dependent variable.
Code Block

# Correlation analysis (numeric_only=True skips non-numeric columns)
print(df.corr(numeric_only=True))

# Scatter plot
import matplotlib.pyplot as plt
plt.scatter(x=df['column1'], y=df['column2'])
plt.show()

# Heat map of the correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()

# Regression analysis
import statsmodels.formula.api as smf
model = smf.ols('dependent_variable ~ independent_variable', data=df).fit()
print(model.summary())


By exploring relationships between variables, one can easily identify the variables of utmost importance to the analysis and how they relate to each other. This stage helps identify issues or biases in the data and guides how to carry out the analysis.
5. Test Hypotheses
Testing hypotheses is an instrumental step, especially when carrying out analysis for research. It involves formulating a hypothesis about the data based on the analysis and testing it using statistical methods. The following are techniques for testing hypotheses (a note on interpreting the resulting p-values follows the code block below).
T-Tests: T-tests are used to compare the means of two groups and determine whether they are statistically different.
ANOVA: An ANOVA (Analysis of Variance) test is carried out to compare the means of multiple groups and determine whether they are statistically different.
Chi-square Test: This is carried out to check whether there is a significant association between two categorical variables.
Regression Analysis: As explained above, regression is carried out to check whether there is a relationship between a dependent variable and one or more independent variables.
Code Block

# T-Tests
from scipy.stats import ttest_ind
group1 = df[df['group'] == 1]['column_name']
group2 = df[df['group'] == 2]['column_name']
t_stat, p_value = ttest_ind(group1, group2)
print(t_stat, p_value)

# ANOVA
import statsmodels.api as sm
from statsmodels.formula.api import ols
model = ols('column_name ~ C(group)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Chi-Square Test
from scipy.stats import chi2_contingency
obs = pd.crosstab(df['column1'], df['column2'])
chi2, p_value, dof, expected = chi2_contingency(obs)
print(chi2, p_value)

# Regression Analysis
import statsmodels.formula.api as smf
model = smf.ols('dependent_variable ~ independent_variable', data=df).fit()
print(model.summary())

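Whichever test you run, the decision rule is the same: compare the test's p-value to a significance level chosen in advance. The sketch below uses the conventional 0.05 threshold (an assumption, not a universal rule), with p_value standing for the p-value returned by whichever test you ran above.

alpha = 0.05  # significance level chosen before running the test
if p_value < alpha:
    print('Reject the null hypothesis: the result is statistically significant.')
else:
    print('Fail to reject the null hypothesis: no significant effect detected.')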

By testing hypotheses, one can determine whether there is a significant difference or relationship between the variables being analyzed. This step helps validate the results of the analysis and can provide insights into the data that may not be apparent from exploratory work alone.

6. Document Findings.
The final step is documenting the findings of the project. Documenting involves summarizing the key insights and conclusions from the analysis and presenting them in a clear, concise manner. The following are examples of techniques for documenting findings.
Summary Statistics: Provide summary statistics such as the mean, median, mode, range, standard deviation, and correlation coefficients of each variable analyzed.
Visualization: Present visualizations such as scatter plots, histograms, and box plots to help communicate patterns and trends in the data.
Key Findings: Summarize the key findings of the analysis in a clear, concise manner. These can include insights into relationships between variables, potential biases or issues with the data, and any significant differences or trends identified in the analysis (a sketch for surfacing the strongest correlation follows the code block below).
Conclusions: Draw conclusions based on the analysis and provide recommendations for next steps or further analysis that may be needed.
Sample Code Block on Summarization

# Summary Statistics
print(df.describe())

# Visualizations
sns.pairplot(df)
plt.show()

# Key Findings
print('There is a strong positive correlation between column1 and column2.')

# Conclusions
print('Based on the analysis, it is recommended that further research be conducted to investigate the relationship between column1 and column2.')

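As a small extension of the key-findings step, you can pull the strongest pairwise correlation out of the matrix programmatically rather than reading it off by eye. A minimal sketch; all column names come from your own dataset:

import numpy as np

# Absolute correlation matrix for numeric columns
corr = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and self-correlations drop out
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# The pair of columns with the strongest correlation
strongest_pair = upper.stack().idxmax()
print('Strongest correlation:', strongest_pair, round(upper.stack().max(), 3))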

By documenting findings, you can communicate the insights and conclusions from the analysis to others and help inform decision-making based on the data. This step helps to ensure that the analysis is well understood and can be effectively used to support business decisions or research findings.
