Exploratory Data Analysis (EDA): The Ultimate Guide
Introduction
Data is everywhere around us, in spreadsheets, on various social media platforms, in survey forms, and more. The process of cleaning, transforming, interpreting, analyzing, and visualizing this data to extract useful information and gain valuable insights to make more effective business decisions is called Data Analysis.
Exploratory data analysis (EDA) is an important step in the data analysis process, where we try to understand the data and uncover patterns, trends, and relationships between variables. In this guide, we'll cover some of the key steps and techniques involved in EDA.
Data Collection and Preparation
Before conducting EDA, it's important to collect and prepare the data. This includes identifying the sources of data, gathering the data, and cleaning and transforming the data so that it's ready for analysis.
Data Collection:
Data collection involves obtaining data from various sources and storing it in a structured format that can be easily analyzed. Here is an example of how to collect data from a CSV file using the Pandas library:
import pandas as pd
df = pd.read_csv('data.csv')
This code uses the 'read_csv()' function from Pandas to read data from a CSV file named 'data.csv'. The result is stored in a pandas DataFrame named 'df'.
Data Preparation:
Data preparation involves cleaning and transforming the data to make it suitable for analysis. Here are some common data preparation steps:
a. Data Cleaning:
Data cleaning involves removing or correcting any errors or inconsistencies in the data. Here is an example of how to remove rows with missing values using the Pandas library:
df = df.dropna()
This code drops every row that contains at least one missing value; 'dropna()' returns a new DataFrame, so the result is assigned back to 'df'.
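Dropping rows can discard useful information. A common alternative, sketched below with an illustrative column name, is to fill missing values instead, for example with the column median:
# Fill missing values in a numeric column with that column's median
df['numeric_var'] = df['numeric_var'].fillna(df['numeric_var'].median())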
b. Data Transformation:
Data transformation involves converting the data into a more suitable format for analysis. Here are some examples of how to transform data using the Pandas library:
Convert categorical variables into numerical variables using one-hot encoding.
df = pd.get_dummies(df, columns=['category'])
Scale numerical variables using standardization.
from sklearn.preprocessing import StandardScaler
# Standardize the column to zero mean and unit variance
scaler = StandardScaler()
df['numeric_var'] = scaler.fit_transform(df[['numeric_var']]).ravel()
c. Feature Engineering:
Feature engineering involves creating new features from existing features that may be more relevant for analysis. Here is an example of how to create a new feature based on existing features using the Pandas library:
df['new_feature'] = df['feature_1'] + df['feature_2']
This code creates a new feature by adding the values of 'feature_1' and 'feature_2' element-wise.
Once the data has been collected and prepared, you can proceed with EDA to gain insights into the data.
Descriptive Statistics
Descriptive statistics provide a summary of the data, including measures of central tendency, dispersion, and shape. These statistics can help us understand the distribution of the data and identify any outliers or unusual values.
Here are some common descriptive statistics used in EDA:
Measures of central tendency
Mean: The average value of the data.
Median: The middle value of the data.
Mode: The most common value of the data.
Measures of dispersion
Range: The difference between the maximum and minimum values of the data.
Variance: The average squared deviation from the mean.
Standard deviation: The square root of the variance.
Quantiles
Quartiles: The values that divide the data into four equal parts.
Interquartile range (IQR): The difference between the upper and lower quartiles.
Skewness and kurtosis
Skewness: A measure of the asymmetry of the data distribution.
Kurtosis: A measure of the "tailedness" of the data distribution, i.e. how heavy its tails are.
Here is an example of how to calculate some of these descriptive statistics using Python and the Pandas library:
import pandas as pd
# Load data
df = pd.read_csv('data.csv')
# Calculate mean, median, and mode
mean = df['column_name'].mean()
median = df['column_name'].median()
mode = df['column_name'].mode()
# Calculate range, variance, and standard deviation
value_range = df['column_name'].max() - df['column_name'].min()  # avoid shadowing the built-in range
variance = df['column_name'].var()
std_dev = df['column_name'].std()
# Calculate quartiles and interquartile range
q1 = df['column_name'].quantile(0.25)
q3 = df['column_name'].quantile(0.75)
iqr = q3 - q1
# Calculate skewness and kurtosis
skewness = df['column_name'].skew()
kurtosis = df['column_name'].kurt()
In this example, column_name represents the column of interest in the dataset. The code calculates the mean, median, and mode
using the mean(), median(), and mode() methods, respectively. It calculates the range, variance, and standard deviation using the max(), min(), var(), and std() methods. It calculates the quartiles and interquartile range using the quantile() method, and it calculates the skewness and kurtosis using the skew() and kurt() methods.
Data Visualization
Data visualization is a powerful tool for EDA, as it allows us to see patterns and relationships in the data. Some common types of visualizations include scatter plots, histograms, bar charts, box plots, line charts, and heat maps. Visualization can be done using libraries such as Matplotlib, Seaborn, or Plotly. Below, I will provide an overview of some commonly used visualization techniques in EDA.
•Histograms: A histogram is a graphical representation of the distribution of numerical data. It consists of bars that represent the frequency of data within a certain range of values. Histograms are useful for identifying the shape of the data distribution, as well as any outliers or gaps in the data.
•Scatter plots: A scatter plot is a graph that represents the relationship between two numerical variables. Each point on the plot represents the value of the two variables for a single observation. Scatter plots are useful for identifying patterns and trends in the data, as well as any outliers or clusters of data points.
•Box plots: A box plot is a graphical representation of the distribution of numerical data. It consists of a box that represents the middle 50% of the data (the interquartile range), with a line inside the box that represents the median value. The "whiskers" typically extend to the most extreme data points within 1.5 times the IQR beyond the quartiles, and any points beyond the whiskers are drawn individually as outliers. Box plots are useful for identifying the shape of the data distribution, as well as any outliers or extreme values.
•Bar charts: A bar chart is a graph that represents categorical data using bars of different heights or lengths. Bar charts are useful for comparing the frequency or proportion of different categories.
•Heatmaps: A heatmap is a graphical representation of data in a matrix format, where the values in each cell are represented by a colour. Heatmaps are useful for identifying patterns and trends in large datasets, particularly when the data can be organized into categories or groups.
•Line charts: A line chart is a graph that represents the relationship between two numerical variables over time or some other continuous variable. Line charts are useful for identifying trends and patterns in data over time.
Here is an example of data visualization using Python and the Matplotlib library:
import pandas as pd
import matplotlib.pyplot as plt
# Load data
df = pd.read_csv('data.csv')
# Histogram
plt.hist(df['column_name'], bins=10)
plt.title('Histogram of Column Name')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Scatter plot
plt.scatter(df['column_name_1'], df['column_name_2'])
plt.title('Scatter Plot of Column Name 1 vs. Column Name 2')
plt.xlabel('Column Name 1')
plt.ylabel('Column Name 2')
plt.show()
# Box plot
plt.boxplot(df['column_name'])
plt.title('Box Plot of Column Name')
plt.xlabel('Column Name')
plt.show()
# Bar chart
counts = df['category_column'].value_counts()
plt.bar(counts.index, counts.values)
plt.title('Bar Chart of Category Column')
plt.xlabel('Category')
plt.ylabel('Frequency')
plt.show()
# Heatmap
corr_matrix = df.corr(numeric_only=True)  # correlations among numeric columns only
plt.imshow(corr_matrix, cmap='hot', interpolation='nearest')
plt.title('Heatmap of Correlation Matrix')
plt.colorbar()
plt.show()
# Line chart
plt.plot(df['date_column'], df['value_column'])
plt.title('Line Chart of Value Column Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
In this example, column_name, column_name_1, column_name_2, and value_column represent the columns of interest in the dataset, and category_column represents a categorical column. The code creates a histogram, scatter plot, box plot, bar chart, heatmap, and line chart using the Matplotlib library.
Univariate Analysis
Univariate analysis focuses on a single variable and explores its distribution and characteristics. This can include calculating summary statistics, creating histograms or density plots, and looking for outliers or missing values.
Assuming we have a dataset called "data" and we are interested in analyzing a variable called "variable_of_interest":
•Load the necessary libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
•Load the dataset:
data = pd.read_csv("filename.csv")
•Calculate basic summary statistics:
print(data["variable_of_interest"].describe())
This will provide us with the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum values of the variable.
•Plot the distribution of the variable:
sns.histplot(data=data, x="variable_of_interest", kde=True)
plt.show()
This will create a histogram of the variable, with a density curve overlaid on top.
•Check for outliers:
sns.boxplot(data=data, y="variable_of_interest")
plt.show()
This will create a boxplot of the variable, which can help us identify any outliers or extreme values.
•Check for skewness:
sns.kdeplot(data=data, x="variable_of_interest")
plt.show()
This will create a density plot of the variable, which can help us determine if the data is skewed.
•Check for normality:
from scipy.stats import shapiro
stat, p = shapiro(data["variable_of_interest"])
print("Shapiro-Wilk test statistic: ", stat)
print("p-value: ", p)
This will conduct a Shapiro-Wilk test of normality on the variable. If the p-value is less than 0.05, we reject the null hypothesis of normality and conclude that the data is unlikely to be normally distributed.
Bivariate Analysis
Bivariate analysis examines the relationship between two variables. This can include creating scatterplots, calculating correlation coefficients, and conducting hypothesis tests to determine whether there is a significant relationship between the variables.
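As a minimal sketch, the snippet below plots two hypothetical numeric columns against each other and computes their Pearson correlation coefficient with SciPy; the column names variable_1 and variable_2 are placeholders for columns in your own dataset:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
# Load the data and drop rows where either variable is missing
df = pd.read_csv('data.csv').dropna(subset=['variable_1', 'variable_2'])
# Scatter plot of the two variables
plt.scatter(df['variable_1'], df['variable_2'])
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.title('Variable 1 vs. Variable 2')
plt.show()
# Pearson correlation coefficient and the p-value of its significance test
r, p = pearsonr(df['variable_1'], df['variable_2'])
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")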
Multivariate Analysis
Multivariate analysis explores the relationship between three or more variables. This can include creating heat maps or correlation matrices to visualize the relationship between multiple variables.
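As a minimal sketch, the snippet below renders the correlation matrix of all numeric columns as an annotated Seaborn heatmap and then draws pairwise scatter plots; it assumes the dataset contains several numeric columns:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('data.csv')
# Correlation matrix of the numeric columns, drawn as an annotated heatmap
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
# Pairwise scatter plots (and per-variable histograms) of all numeric columns
sns.pairplot(df)
plt.show()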
Hypothesis Testing
Hypothesis testing is used to determine whether a certain hypothesis or claim about the data is true or false. This can include testing for differences between groups, testing for the significance of correlation coefficients, or conducting ANOVA tests to compare means across multiple groups.
In this example, I will demonstrate how to perform hypothesis testing using Python.
I will use the scipy library to perform hypothesis testing. The scipy.stats module provides a wide range of statistical tests. In this example, I will use the t-test to compare the means of two samples.
Let's assume that we have two samples of data, sample1 and sample2, and we want to test the hypothesis that their means are equal.
import numpy as np
from scipy import stats
# Generate two samples of data
sample1 = np.random.normal(loc=10, scale=2, size=100)
sample2 = np.random.normal(loc=12, scale=2, size=100)
# Compute the mean and standard deviation of each sample
mean1, std1 = np.mean(sample1), np.std(sample1)
mean2, std2 = np.mean(sample2), np.std(sample2)
# Perform a two-sided t-test assuming equal variances
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)
# Print the results
print(f"Sample 1: mean={mean1:.2f}, std={std1:.2f}")
print(f"Sample 2: mean={mean2:.2f}, std={std2:.2f}")
print(f"t-statistic={t_statistic:.2f}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the means are significantly different")
else:
    print("Fail to reject the null hypothesis: no evidence that the means differ")
In this example, two samples of data are generated from normal distributions with means of 10 and 12 and standard deviations of 2. We compute the mean and standard deviation of each sample using the np.mean() and np.std() functions, then perform a two-sided t-test assuming equal variances using the stats.ttest_ind() function. Finally, we print the results and check whether the p-value is less than 0.05, the conventional significance level. If it is, we reject the null hypothesis that the means are equal; otherwise, we fail to reject it.
Note that we assumed equal variances when performing the t-test. If the variances are unequal, we can use Welch's t-test by setting the equal_var parameter to False.
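For example:
# Welch's t-test does not assume equal population variances
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=False)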
Machine Learning
Machine learning can be used in EDA to predict or classify data based on the relationships between variables. This can include using regression models to predict a continuous variable or classification models to predict a categorical variable.
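As an illustration, here is a minimal sketch of fitting a linear regression with scikit-learn during EDA; the column names feature_1, feature_2, and target are placeholders for columns in your own dataset:
import pandas as pd
from sklearn.linear_model import LinearRegression
# Load the data and drop rows with missing values
df = pd.read_csv('data.csv').dropna()
# Hypothetical feature and target columns
X = df[['feature_1', 'feature_2']]
y = df['target']
# Fit a linear regression and check how much variance in the target it explains
model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)
print("R^2 on the training data:", model.score(X, y))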
Interpretation and Conclusion
The final step in EDA is to interpret the results and draw conclusions based on the findings. This can include summarizing the key findings, discussing any limitations of the analysis, and suggesting areas for further research.
In conclusion, EDA is a critical step in the data analysis process, as it helps us understand the data and uncover patterns and relationships between
variables. By following these steps and techniques, we can gain valuable insights from our data and make informed decisions based on the results.