Loading data is not the same as understanding it.
You have done the loading. You have done the cleaning. You have merged tables and filtered rows and built charts one at a time.
But those were isolated skills. EDA is what happens when you use all of them together with a specific purpose. You are not just making charts. You are interrogating a dataset. Asking questions. Finding answers. Forming new questions from those answers.
Real data science looks like this: you load a dataset, you do not know what is in it, and forty minutes later you know its shape, its problems, its patterns, and you have three specific hypotheses to test with a model.
This post walks through that entire process on one real dataset, start to finish.
The Dataset
We will use the Ames Housing dataset, available on Kaggle (search "House Prices Advanced Regression Techniques Kaggle"). It has 79 features and 1460 rows describing residential homes in Ames, Iowa. The target is the sale price.
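If you already have the Kaggle files on disk, loading the training split is one line. This sketch assumes the default file name from the Kaggle download, train.csv, sitting next to your script:
import pandas as pd
df = pd.read_csv("train.csv")  # Kaggle training split for the Ames data
print(df.shape)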
If you cannot download it right now, use this simplified version to follow along:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
np.random.seed(42)
n = 500
df = pd.DataFrame({
"SalePrice": np.random.normal(180000, 50000, n).clip(50000, 500000),
"GrLivArea": np.random.normal(1500, 400, n).clip(500, 4000),
"YearBuilt": np.random.randint(1900, 2010, n),
"OverallQual": np.random.randint(1, 11, n),
"GarageCars": np.random.choice([0, 1, 2, 3], n, p=[0.1, 0.2, 0.5, 0.2]),
"TotalBsmtSF": np.random.normal(1000, 300, n).clip(0, 3000),
"Neighborhood": np.random.choice(["A", "B", "C", "D", "E"], n),
"HouseStyle": np.random.choice(["1Story", "2Story", "1.5Fin"], n),
"MasVnrArea": np.random.exponential(100, n).clip(0, 1600),
})
# Inject some missing values and a handful of extreme living-area outliers so the EDA has something to find
df.loc[np.random.choice(n, 30, replace=False), "TotalBsmtSF"] = np.nan
df.loc[np.random.choice(n, 10, replace=False), "GrLivArea"] = np.random.uniform(8000, 15000, 10)
Phase 1: First Contact
The very first thing. No charts yet. Just numbers and structure.
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Shape: {df.shape}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024:.1f} KB\n")
print("Column Types:")
print(df.dtypes.value_counts())
print("\nFirst 5 rows:")
print(df.head())
print("\nBasic Statistics:")
print(df.describe().round(2))
What you are looking for at this stage:
How many rows and columns. The full Ames data has 1460 rows and 79 features, which is wide relative to its length. That matters for modeling.
Dtypes. Are numerical columns actually numerical? Are categorical columns stored as strings or codes?
The min and max values in describe(). An age of -5 or a salary of 10 billion tells you something went wrong. Spot it here.
The difference between mean and median (50%). Large differences signal skewness or outliers.
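Most of these checks can be scripted instead of eyeballed. A minimal sketch, using the column names from the simplified dataset above:
# Gap between mean and median as a fraction of the median: big gaps hint at skew or outliers
num = df.select_dtypes(include="number")
print(((num.mean() - num.median()) / num.median()).round(2).sort_values(ascending=False))
# Range sanity check: impossible minimums or maximums jump out here
print(num.agg(["min", "max"]).T)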
Phase 2: Missing Values Map
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
missing_df = pd.DataFrame({
"missing_count": missing,
"missing_pct": missing_pct
}).query("missing_count > 0").sort_values("missing_pct", ascending=False)
print("Missing Values:")
print(missing_df)
fig, ax = plt.subplots(figsize=(8, 4))
ax.barh(missing_df.index, missing_df["missing_pct"], color="coral")
ax.set_xlabel("Missing %")
ax.set_title("Missing Values by Column")
ax.axvline(x=50, color="red", linestyle="--", label="50% threshold")
ax.legend()
plt.tight_layout()
plt.savefig("missing_values.png", dpi=150)
plt.show()
Columns with more than 50% missing are usually not worth imputing. They carry too little signal. Drop them or note them for investigation.
Columns with less than 5% missing are safe to impute with mean, median, or mode.
The pattern of missingness matters too. If MasVnrArea is missing, is MasVnrType missing in the same rows? Columns that go missing together suggest a structural relationship, not random noise.
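Those rules translate directly into code. A minimal sketch that works on a copy, so the rest of the walkthrough still sees the raw data (the MasVnrType line assumes the full Kaggle dataset, which has that column; the simplified frame does not):
clean = df.copy()
drop_cols = missing_pct[missing_pct > 50].index.tolist()                         # too empty to trust
impute_cols = missing_pct[(missing_pct > 0) & (missing_pct <= 5)].index.tolist()
clean = clean.drop(columns=drop_cols)
for col in impute_cols:
    if pd.api.types.is_numeric_dtype(clean[col]):
        clean[col] = clean[col].fillna(clean[col].median())    # numeric: median
    else:
        clean[col] = clean[col].fillna(clean[col].mode()[0])   # categorical: mode
print(f"Dropped: {drop_cols}, imputed: {impute_cols}")
# Co-missingness check on the full dataset:
# pd.crosstab(df["MasVnrArea"].isnull(), df["MasVnrType"].isnull())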
Phase 3: Target Variable First
Before anything else, understand what you are trying to predict.
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].hist(df["SalePrice"], bins=40, color="steelblue", edgecolor="white")
axes[0].set_title("SalePrice Distribution (Raw)")
axes[0].set_xlabel("Price")
axes[1].hist(np.log1p(df["SalePrice"]), bins=40, color="coral", edgecolor="white")
axes[1].set_title("SalePrice Distribution (Log)")
axes[1].set_xlabel("Log Price")
stats_text = (
f"Mean: ${df['SalePrice'].mean():,.0f}\n"
f"Median: ${df['SalePrice'].median():,.0f}\n"
f"Std: ${df['SalePrice'].std():,.0f}\n"
f"Skew: {df['SalePrice'].skew():.2f}"
)
axes[2].text(0.5, 0.5, stats_text, transform=axes[2].transAxes,
             fontsize=12, va="center", ha="center",
             bbox=dict(boxstyle="round", facecolor="lightblue", alpha=0.5))
axes[2].set_title("Target Statistics")
axes[2].axis("off")
plt.tight_layout()
plt.savefig("target_distribution.png", dpi=150)
plt.show()
Sale prices are almost always right-skewed. A few very expensive homes pull the mean above the median. Log transformation often makes them more normally distributed, which helps many ML algorithms.
The skew value confirms this. Skew above 1 or below -1 usually means you should consider transforming the target.
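That decision can be made in code rather than by eye. A small sketch that only reports the numbers; the actual transform belongs in your preprocessing step, and the 1.0 cutoff is the rule of thumb above, not a law:
skew = df["SalePrice"].skew()
log_skew = np.log1p(df["SalePrice"]).skew()
print(f"Skew raw: {skew:.2f}, after log1p: {log_skew:.2f}")
if abs(skew) > 1:
    print("Plan: model log1p(SalePrice), then invert predictions with np.expm1")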
Phase 4: Numerical Features Deep Dive
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
num_cols = [c for c in num_cols if c != "SalePrice"]
n_cols = 3
n_rows = (len(num_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 3.5))
axes = axes.flatten()
for i, col in enumerate(num_cols):
    axes[i].hist(df[col].dropna(), bins=30, color="steelblue", edgecolor="white", alpha=0.8)
    axes[i].set_title(col, fontsize=10)
    axes[i].set_xlabel("")
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)
plt.suptitle("Distribution of All Numerical Features", fontsize=14, y=1.01)
plt.tight_layout()
plt.savefig("feature_distributions.png", dpi=150, bbox_inches="tight")
plt.show()
You are looking for (a scripted version of these checks follows the list):
Skewed features that might need transformation.
Binary features disguised as continuous (only values 0 and 1).
Features with very low variance (almost all the same value). These add noise to models.
Bimodal distributions suggesting two subpopulations that might need to be modeled separately.
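Here is a minimal scripted version of the first three checks; bimodality is still easiest to spot by eye in the histograms:
for col in num_cols:
    s = df[col].dropna()
    flags = []
    if abs(s.skew()) > 1:
        flags.append(f"skewed ({s.skew():.1f})")
    if s.nunique() == 2:
        flags.append("binary disguised as continuous")
    if s.nunique() <= 1 or s.std() < 1e-8:
        flags.append("near-zero variance")
    if flags:
        print(f"{col}: {', '.join(flags)}")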
Phase 5: Correlation Analysis
corr = df[num_cols + ["SalePrice"]].corr()
top_corr = corr["SalePrice"].abs().sort_values(ascending=False).drop("SalePrice")
print("Top correlations with SalePrice:")
print(top_corr.head(8).round(3))
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
top_features = top_corr.head(6).index.tolist()
sns.heatmap(
df[top_features + ["SalePrice"]].corr(),
annot=True, fmt=".2f", cmap="coolwarm",
center=0, square=True, ax=axes[0]
)
axes[0].set_title("Correlation: Top Features vs Target")
top_corr.head(8).sort_values().plot(kind="barh", ax=axes[1], color="steelblue")
axes[1].set_title("Feature Correlation with SalePrice")
axes[1].set_xlabel("Absolute Correlation")
axes[1].axvline(x=0.5, color="red", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.savefig("correlations.png", dpi=150)
plt.show()
Features with correlation above 0.5 with the target are strong candidates for your model.
Features highly correlated with each other (above 0.8) are redundant. Keeping both adds noise without adding information. You will need to choose one.
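Finding those redundant pairs by hand is tedious on 79 features, so script it. A minimal sketch using the corr matrix computed above (0.8 is the threshold from this section, adjust to taste):
feat_corr = corr.drop(index="SalePrice", columns="SalePrice").abs()
upper = np.triu(np.ones(feat_corr.shape, dtype=bool), k=1)   # each pair counted once
pairs = feat_corr.where(upper).stack().sort_values(ascending=False)
print("Feature pairs with |r| > 0.8 (keep only one of each):")
print(pairs[pairs > 0.8].round(2))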
Phase 6: Scatter Plots for Top Features
top_4 = top_corr.head(4).index.tolist()
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()
for i, col in enumerate(top_4):
    axes[i].scatter(df[col], df["SalePrice"], alpha=0.4, color="steelblue", s=20)
    m, b = np.polyfit(df[col].fillna(df[col].median()), df["SalePrice"], 1)
    x_line = np.linspace(df[col].min(), df[col].max(), 100)
    axes[i].plot(x_line, m * x_line + b, color="red", linewidth=2)
    axes[i].set_xlabel(col)
    axes[i].set_ylabel("SalePrice")
    corr_val = df[[col, "SalePrice"]].corr().iloc[0, 1]
    axes[i].set_title(f"{col} vs SalePrice (r={corr_val:.2f})")
    axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("scatter_top_features.png", dpi=150)
plt.show()
Scatter plots reveal what correlation numbers cannot.
A linear relationship looks like a clean diagonal cloud. A curved relationship means linear regression will underfit and you need polynomial features or a tree-based model; a quick numeric check for curvature is sketched below. Heteroscedasticity (a cone-shaped scatter) means variance increases with the feature value, which is common in house prices.
Outliers stand out visually here. One point far above the trend line is not noise, it is a story. Investigate it.
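The curvature question has a cheap numeric companion: compare Pearson correlation (linear) with Spearman correlation (rank-based, monotonic). When Spearman is clearly higher, the relationship is monotonic but bent, exactly the case where a transform or a tree-based model helps. A small sketch over the four features plotted above:
for col in top_4:
    pearson = df[col].corr(df["SalePrice"], method="pearson")
    spearman = df[col].corr(df["SalePrice"], method="spearman")
    print(f"{col}: pearson={pearson:.2f}, spearman={spearman:.2f}")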
Phase 7: Categorical Features
cat_cols = df.select_dtypes(include=["object"]).columns.tolist()
fig, axes = plt.subplots(1, len(cat_cols), figsize=(14, 5))
for i, col in enumerate(cat_cols):
    order = df.groupby(col)["SalePrice"].median().sort_values(ascending=False).index
    sns.boxplot(
        data=df, x=col, y="SalePrice",
        order=order, ax=axes[i], palette="Set2"
    )
    axes[i].set_title(f"SalePrice by {col}")
    axes[i].set_xlabel("")
    axes[i].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.savefig("categorical_analysis.png", dpi=150)
plt.show()
Box plots by category show you the median and spread of the target for each category value. If neighborhoods have very different median prices, neighborhood is a powerful feature. If house styles have similar medians with overlapping boxes, the style might not matter much.
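The same comparison can be written down numerically, which is useful for the summary report at the end. A minimal sketch:
for col in cat_cols:
    by_cat = (df.groupby(col)["SalePrice"]
                .agg(["median", "std", "count"])
                .sort_values("median", ascending=False))
    print(f"\nSalePrice by {col}:")
    print(by_cat.round(0))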
Phase 8: Outlier Detection and Investigation
from scipy import stats
# Flag rows where SalePrice or GrLivArea is more than 3 standard deviations from the mean
subset = df[["SalePrice", "GrLivArea"]].dropna()
z_scores = np.abs(stats.zscore(subset))
outlier_mask = (z_scores > 3).any(axis=1)
outlier_idx = subset.index[outlier_mask]
print(f"Outliers detected: {outlier_mask.sum()}")
fig, ax = plt.subplots(figsize=(10, 6))
normal = df[~df.index.isin(outlier_idx)]
outliers = df[df.index.isin(outlier_idx)]
ax.scatter(normal["GrLivArea"], normal["SalePrice"],
           alpha=0.5, color="steelblue", s=20, label="Normal")
ax.scatter(outliers["GrLivArea"], outliers["SalePrice"],
           color="red", s=80, marker="X", label="Outlier", zorder=5)
ax.set_xlabel("Above Grade Living Area (sq ft)")
ax.set_ylabel("Sale Price")
ax.set_title("Outlier Detection: GrLivArea vs SalePrice")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("outliers.png", dpi=150)
plt.show()
The classic housing data problem: some very large houses sell for surprisingly low prices. These are often partial sales, auction sales, or non-standard transactions. For modeling normal market behavior, you might remove them. Document the decision.
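Whichever way you go, make the handling explicit in code so the decision is reproducible. A minimal sketch of both options; the 99th-percentile cap is an illustrative choice, not a standard:
# Option 1: drop the flagged rows (outlier_idx comes from the detection code above)
df_no_outliers = df.drop(index=outlier_idx)
print(f"Removed {len(outlier_idx)} flagged rows")
# Option 2: cap GrLivArea at its 99th percentile instead of dropping anything
cap = df["GrLivArea"].quantile(0.99)
df_capped = df.assign(GrLivArea=df["GrLivArea"].clip(upper=cap))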
Phase 9: EDA Summary Report
After all that exploration, write down what you found.
summary = """
EDA FINDINGS SUMMARY
=====================
Dataset: Housing Prices
Shape: {} rows x {} columns
KEY FINDINGS:
1. Target (SalePrice) is right-skewed (skew={:.2f}). Log transform recommended.
2. {} columns have missing values. TotalBsmtSF has {:.1f}% missing.
3. Top predictors: OverallQual (r={:.2f}), GrLivArea (r={:.2f}), GarageCars (r={:.2f})
4. {} outliers detected in GrLivArea vs SalePrice. Investigate before modeling.
5. Neighborhoods show significant price variation. Include as categorical feature.
RECOMMENDED PREPROCESSING:
- Log transform SalePrice
- Impute TotalBsmtSF with median
- Encode Neighborhood and HouseStyle
- Remove or cap outliers in GrLivArea
- Consider polynomial features for GrLivArea (curved relationship)
""".format(
    df.shape[0], df.shape[1],
    df["SalePrice"].skew(),
    df.isnull().any().sum(),
    df["TotalBsmtSF"].isnull().mean() * 100,
    df[["OverallQual", "SalePrice"]].corr().iloc[0, 1],
    df[["GrLivArea", "SalePrice"]].corr().iloc[0, 1],
    df[["GarageCars", "SalePrice"]].corr().iloc[0, 1],
    outlier_mask.sum()
)
print(summary)
with open("eda_summary.txt", "w") as f:
    f.write(summary)
Always end EDA with a written summary. What you found, what it means, what you will do about it. This is the document that guides every preprocessing and modeling decision that follows.
The EDA Mindset
EDA is not a checklist. It is a conversation with your data.
Every chart raises a question. Why does this neighborhood have such a wide price range? Why are there so many houses with zero masonry veneer area? Why does the scatter for GrLivArea show two clusters?
Follow the questions. The answers tell you what your model needs to learn and what obstacles it will face.
The best data scientists are not the ones who build the fanciest models. They are the ones who understand their data so thoroughly that when the model behaves unexpectedly, they already have a hypothesis about why.
EDA is how you build that understanding.
A Blog That Shaped How People Think About EDA
Will Koehrsen wrote a piece on Towards Data Science called "A Gentle Introduction to Exploratory Data Analysis" using the exact Ames Housing dataset we referenced here. It became one of the most-read EDA tutorials in the data science community. His approach of treating EDA as hypothesis generation rather than just plotting everything influenced a generation of practitioners. Search "Will Koehrsen gentle introduction EDA Towards Data Science."
Try This
Create eda_practice.py.
Download the Ames Housing dataset from Kaggle (search "House Prices Advanced Regression Techniques Kaggle"). The full version has 79 features.
Run the complete EDA workflow from this post on the real dataset. All nine phases.
Then answer these specific questions using only code and charts:
Which five features have the strongest linear correlation with SalePrice?
Is the relationship between GrLivArea and SalePrice truly linear or does it curve at the high end?
Which neighborhoods have the highest median SalePrice? Which have the highest variance?
Are there any two features that are so correlated with each other that keeping both is redundant?
Write your own EDA summary report and save it to a text file.
What's Next
Phase 3 ends with one more post: a full data project pulling together everything from loading to EDA on a real dataset with a real question to answer. After that, Phase 4: SQL. Then Phase 5: dev tools. Then the real thing begins. Machine learning.