DEV Community

MEROLINE LIZLENT


Exploratory Data Analysis (EDA) Workflow

Introduction

Exploratory Data Analysis (EDA) is the crucial first step in any data project. It involves summarizing, visualizing, and understanding your dataset to uncover patterns, detect anomalies, and generate insights before modeling.

Why EDA Matters

  • Data is messier than ever (streaming sources, mixed types, missing values from APIs).
  • Models (including LLMs) are sensitive to data quality; skimping on EDA means garbage in, garbage out.
  • It's iterative and creative: ask questions, visualize, transform, repeat.

Core EDA Workflow

A solid, reusable EDA workflow typically follows these phases. It's not strictly linear; you'll loop back often.

  1. Understand the Business/Problem Context

    What question are you trying to answer? (e.g., "Why is churn increasing?" or "What drives sales?")

    Know the domain, key metrics, and stakeholders. This guides what to look for and prevents aimless exploration.

  2. Load and Inspect the Data

    Import your dataset and get a quick overview.

    • Shape, data types, first/last rows, summary statistics.
    • Check for duplicates, unexpected values, memory usage.
  3. Data Cleaning & Quality Checks

    Handle issues that would break downstream analysis:

    • Missing values (percentage per column, patterns in missingness).
    • Outliers and anomalies.
    • Inconsistent formatting (dates, strings, categories).
    • Incorrect data types (e.g., numeric stored as object).
  4. Univariate Analysis (Understand individual variables)

    • Numerical: histograms, box plots, summary stats (mean, median, std, skewness, kurtosis).
    • Categorical: value counts, bar plots, frequency tables.

    Goal: understand distributions and spot weirdness.
  5. Bivariate & Multivariate Analysis (Relationships between variables)

    • Numerical vs Numerical: scatter plots, correlation matrices/heatmaps.
    • Numerical vs Categorical: box plots, violin plots, groupby aggregations.
    • Categorical vs Categorical: stacked bars, crosstabs, chi-square if needed.

    Look for correlations, interactions, and potential confounders.
  6. Feature Engineering & Transformations (during EDA)

    Create new features (ratios, bins, groupings, time-based).

    Apply log/power transforms for skewed data.

    Test assumptions (normality, linearity).

  7. Advanced Checks & Hypothesis Generation

    • Segment analysis (group by key categories).
    • Time-series specific (trends, seasonality if applicable).
    • Dimensionality reduction (PCA, t-SNE) for high-dim data.

    Document insights, questions, and next steps.
  8. Document & Communicate Findings

    Create a clean notebook/report with key plots and takeaways.

    Use storytelling: "We noticed X correlates strongly with Y, but only in segment Z."
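
As one illustration of phase 6, here is a minimal sketch of a ratio feature, a binned feature, and a log transform. The column names, the family_size column, and the age bin edges are assumptions made up for this example, not part of any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; values are invented for illustration.
df = pd.DataFrame({
    "age": [4.0, 22.0, 35.0, 54.0, 71.0],
    "fare": [7.25, 13.0, 51.86, 211.34, 0.0],
    "family_size": [3, 1, 2, 1, 1],  # hypothetical column
})

# Ratio feature: fare paid per family member.
df["fare_per_person"] = df["fare"] / df["family_size"]

# Binned feature: right-inclusive intervals (0, 12], (12, 18], (18, 60], (60, 120].
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Log transform for a right-skewed column; log1p(x) = log(1 + x) is defined at zero.
df["log_fare"] = np.log1p(df["fare"])

print(df)
print("skew before:", df["fare"].skew(), "after:", df["log_fare"].skew())
```

Comparing skewness before and after is a quick way to check whether a transform actually helped for a given column.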

Tools & Libraries

  • Core Stack:

    • pandas + numpy — loading, cleaning, stats.
    • matplotlib + seaborn — beautiful, statistical visualizations.
    • plotly or altair — interactive plots.
  • Automation Boosters:

    • ydata-profiling (formerly pandas-profiling): one-line HTML report.
    • Sweetviz, AutoViz, D-Tale — quick overviews.
    • PandasAI — ask natural language questions about your DataFrame.
  • Others: scipy/statsmodels for deeper stats, missingno for missing data visualization.

Example starter imports:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional modern additions
import plotly.express as px
# from ydata_profiling import ProfileReport

Practical Code Examples (Titanic Dataset Style)

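Assume you have a DataFrame df with Titanic-style columns such as age, sex, pclass, and fare (the names used in the snippets below).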
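
If you don't have a dataset at hand, one option is a small synthetic frame like the sketch below. The values are invented for illustration and only the column names matter. seaborn's sns.load_dataset("titanic") is an alternative that uses the same lowercase column names, but it needs network access on first use.

```python
import numpy as np
import pandas as pd

# Hypothetical Titanic-style sample; the values are made up for illustration.
df = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 3, 2, 1],
    "sex": ["female", "male", "female", "male", "male", "female", "male", "female"],
    "age": [29.0, 22.0, np.nan, 35.0, 54.0, np.nan, 27.0, 38.0],
    "fare": [211.34, 7.25, 13.0, 8.05, 51.86, 7.75, 13.0, 71.28],
})

print(df.head())
```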

Step 2-3: Quick Inspection & Cleaning

print(df.shape)
df.info()  # prints its report directly and returns None, so no print() needed
print(df.describe(include='all'))  # includes categorical

# Missing values
print(df.isnull().sum() / len(df) * 100)

# Quick profile (if using ydata-profiling)
# ProfileReport(df, title="EDA Report").to_file("report.html")
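
The remaining checks from phases 2 and 3 (duplicates, memory usage, numeric columns stored as text) can be sketched in the same spirit. The quality_report helper and the sample data are hypothetical names and values introduced for this example.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing percentage, and number of distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

# Hypothetical sample: "fare" arrives as text and one row is duplicated.
df = pd.DataFrame({
    "pclass": [1, 3, 3, 2],
    "age": [29.0, np.nan, np.nan, 40.0],
    "fare": ["211.34", "7.25", "7.25", "n/a"],
})

print(quality_report(df))
print("duplicate rows:", df.duplicated().sum())
print("memory (bytes):", df.memory_usage(deep=True).sum())

# Coerce the text column to numeric; unparseable entries become NaN,
# so they show up in the missing-value count instead of raising an error.
df["fare"] = pd.to_numeric(df["fare"], errors="coerce")
df = df.drop_duplicates()
```

Using errors="coerce" is a design choice: it is convenient during exploration, but it can hide bad values silently, so it is worth re-checking the missing percentages afterwards.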

Univariate Visualization

# Numerical
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()

# Categorical
sns.countplot(data=df, x='sex')
plt.show()
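
Plots pair well with numbers. The sketch below computes the summary statistics named in the univariate phase (mean, median, std, skewness, kurtosis) and a frequency table, using small made-up series rather than real data.

```python
import pandas as pd

# Hypothetical fares: mostly small values plus one large one (right-skewed).
fare = pd.Series([7.25, 7.75, 8.05, 13.0, 13.0, 26.0, 51.86, 211.34], name="fare")
sex = pd.Series(
    ["male", "female", "male", "male", "female", "male", "female", "male"],
    name="sex",
)

# Numerical: a mean well above the median hints at right skew.
summary = fare.agg(["mean", "median", "std", "skew", "kurt"])
print(summary.round(2))

# Categorical: share of each category rather than raw counts.
print(sex.value_counts(normalize=True))
```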

Bivariate

# Correlation heatmap
numeric_df = df.select_dtypes(include=np.number)
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Relationship
sns.boxplot(data=df, x='pclass', y='fare')
plt.show()
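
The groupby aggregations and crosstabs mentioned in the workflow give the numbers behind plots like these. A minimal sketch, again on invented data:

```python
import pandas as pd

# Hypothetical sample data; values are invented for illustration.
df = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "sex": ["female", "male", "female", "male", "male", "male", "female", "male"],
    "fare": [211.34, 51.86, 13.0, 13.0, 7.25, 8.05, 7.75, 7.9],
})

# Numerical vs categorical: fare statistics per passenger class.
fare_by_class = df.groupby("pclass")["fare"].agg(["count", "mean", "median"])
print(fare_by_class)

# Categorical vs categorical: row-normalized crosstab,
# i.e. the share of each sex within each class.
class_by_sex = pd.crosstab(df["pclass"], df["sex"], normalize="index")
print(class_by_sex.round(2))
```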

Handling Issues

# Impute example (mean/median/mode or advanced)
df['age'] = df['age'].fillna(df['age'].median())

# Outlier removal (simple IQR rule): keeps only rows within 1.5*IQR of the quartiles
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df['fare'] < (Q1 - 1.5*IQR)) | (df['fare'] > (Q3 + 1.5*IQR)))]
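
Dropping rows is not always what you want during exploration, since a high fare may be a real first-class ticket rather than an error. A non-destructive variant is to flag outliers and inspect them first. The iqr_outlier_mask helper below is a hypothetical name, and the data is made up.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical fares with one extreme value.
df = pd.DataFrame({"fare": [7.25, 7.75, 8.05, 13.0, 13.0, 26.0, 51.86, 211.34]})

# Keep every row, but mark the suspicious ones for review.
df["fare_outlier"] = iqr_outlier_mask(df["fare"])
print(df[df["fare_outlier"]])
```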

Further Reading

https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/
https://www.geeksforgeeks.org/data-analysis/exploratory-data-analysis-in-python/
https://docs.profiling.ydata.ai/
https://seaborn.pydata.org/
https://python.plainenglish.io/the-complete-guide-to-exploratory-data-analysis-eda-with-python-40f84e1f9a6c
