Table of Contents
- Overview
- Part 01: Basic Visualizations
- Part 02: Geographic Visualizations
- Part 03: Statistical Visualizations
- Part 04: 3D Visualizations
- Part 05: Missing Data Visualization
- Comparison Table
- Best Practices
- Quick Reference
Overview
This documentation provides a comprehensive guide to five enhanced Jupyter notebooks designed for machine learning and data science visualization. Each notebook progressively builds upon visualization concepts, from basic plots to advanced 3D visualizations and data quality assessment.
Purpose
These notebooks serve as:
- Educational Resource: Step-by-step tutorials for beginners
- Reference Guide: Quick lookup for visualization techniques
- Best Practices: Production-ready code examples
- Portfolio Projects: Demonstrable data science skills
Prerequisites
| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.7+ | Core language |
| Pandas | 1.0+ | Data manipulation |
| NumPy | 1.18+ | Numerical operations |
| Matplotlib | 3.1+ | Static visualizations |
| Seaborn | 0.10+ | Statistical plots |
| Plotly | 4.0+ | Interactive visualizations |
Workflow Overview
graph TD
A[Data Loading] --> B[Data Exploration]
B --> C[Data Preprocessing]
C --> D{Visualization Type}
D -->|Basic| E[Part 01: Scatter, Bar, Line]
D -->|Geographic| F[Part 02: Maps, Choropleth]
D -->|Statistical| G[Part 03: Distributions, Correlations]
D -->|Advanced| H[Part 04: 3D, Multi-dimensional]
D -->|Quality| I[Part 05: Missing Data, Binning]
E --> J[Insights & Interpretation]
F --> J
G --> J
H --> J
I --> J
Part 01: Basic Visualizations
Overview
File: machine-learning-visualization-part-1.ipynb
File on Kaggle: Kaggle link
File on Github: Github link
Focus: Fundamental visualization techniques using Matplotlib, Seaborn, and Plotly.
Code Cells: 80 | Markdown Cells: 18
Learning Objectives
- Load and explore datasets
- Create basic scatter plots
- Build bar charts and categorical visualizations
- Understand marginal distributions
- Master plot customization
Visualization Flow
flowchart LR
A[Load Dataset] --> B[Data Exploration]
B --> C[Scatter Plots]
C --> D[Bar Charts]
D --> E[Line Plots]
E --> F[Marginal Distributions]
F --> G[Customization]
G --> H[Export & Share]
Key Features
| Feature | Description | Library | Complexity |
|---|---|---|---|
| Scatter Plots | Relationship between 2 variables | Matplotlib/Plotly | ★ Basic |
| Bar Charts | Categorical comparisons | Matplotlib/Seaborn | ★ Basic |
| Line Plots | Trends over time/sequence | Matplotlib | ★ Basic |
| Histograms | Distribution visualization | Seaborn | ★★ Intermediate |
| Box Plots | Statistical summaries | Seaborn | ★★ Intermediate |
| Marginal Plots | Combined distributions | Plotly | ★★★ Advanced |
Key Sections
Section 1: Data Loading
# Standard data loading pattern
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv('dataset.csv')
# Initial exploration
df.head()
df.info()
df.describe()
Purpose: Understand data structure, types, and basic statistics.
Section 2: Scatter Plots
Techniques Covered:
- Basic scatter plot
- Colored by category
- Sized by variable
- With trend lines
- Interactive with Plotly
When to Use:
- Exploring relationships between two continuous variables
- Identifying correlations
- Detecting outliers
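The "with trend lines" technique listed above can be sketched with a least-squares fit. This is a minimal illustration, not the notebook's own code: the helper name `fit_trend_line` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_trend_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)

# Synthetic data (assumed for illustration): y = 2x + 1 plus small noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

slope, intercept = fit_trend_line(x, y)
# Values to draw on top of the scatter, e.g. with plt.plot(x, trend_y)
trend_y = slope * x + intercept
```

Fitting the line separately from plotting keeps the slope and intercept available for reporting alongside the figure.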
Section 3: Bar Charts
Variations:
- Vertical/horizontal bars
- Grouped bars
- Stacked bars
- Percentage bars
Best For:
- Comparing categories
- Showing rankings
- Displaying distributions across groups
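Percentage bars need the counts converted to within-group shares before plotting. One way to prepare that table, assuming a hypothetical toy DataFrame with `group` and `outcome` columns, is a row-normalized cross-tabulation:

```python
import pandas as pd

# Hypothetical toy data: one row per observation
demo = pd.DataFrame({
    "group":   ["A", "A", "A", "B", "B"],
    "outcome": ["yes", "no", "yes", "no", "no"],
})

# Row-normalized cross-tabulation: each group's outcomes sum to 100%
pct = pd.crosstab(demo["group"], demo["outcome"], normalize="index") * 100

# pct.plot(kind="bar", stacked=True) would then draw one full-height bar per group
```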
Section 4: Marginal Distributions
Combines:
- Central scatter plot
- Marginal histograms/box plots on axes
- Statistical overlays
Value: Shows both individual variable distributions AND their relationship.
Customization Techniques
| Aspect | Options | Code Example |
|---|---|---|
| Colors | Named, hex, RGB, colormaps | color='red', cmap='viridis' |
| Markers | Shapes and sizes | marker='o', s=100 |
| Labels | Titles, axes, legends | plt.title(), plt.xlabel() |
| Style | Themes and presets | sns.set_style('darkgrid') |
| Layout | Subplots, grids | plt.subplot(2,2,1) |
Best Practices (Part 01)
- Always explore data first: Use .info(), .describe(), .head()
- Handle missing values: Before visualizing
- Choose appropriate plot types: Match visualization to data type
- Label everything: Axes, titles, legends
- Use color purposefully: Not just for aesthetics
- Consider accessibility: Color-blind friendly palettes
Sample Code Pattern
import matplotlib.pyplot as plt
import seaborn as sns
# Set style
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
# Create visualization
sns.scatterplot(data=df, x='feature1', y='feature2',
hue='category', size='value', alpha=0.6)
# Customize
plt.title('Feature Relationship Analysis', fontsize=16)
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.legend(title='Category', bbox_to_anchor=(1.05, 1))
# Display
plt.tight_layout()
plt.show()
Part 02: Geographic Visualizations
Overview
File: machine-learning-visualization-part-2.ipynb
File on Kaggle: Kaggle link
File on Github: Github link
Focus: Interactive geographic visualizations using Plotly and mapping techniques.
Code Cells: 12 | Markdown Cells: 18
Learning Objectives
- Create choropleth maps
- Master Plotly Express for interactive plots
- Perform geocoding (location → coordinates)
- Build animated time-series maps
- Visualize spatial distributions
Visualization Workflow
flowchart TD
A[Geographic Data] --> B{Has Coordinates?}
B -->|Yes| C[Direct Mapping]
B -->|No| D[Geocoding]
D --> E[Get Lat/Long]
E --> C
C --> F{Map Type}
F -->|Regions| G[Choropleth Map]
F -->|Points| H[Scatter Geo]
F -->|Density| I[Density Mapbox]
G --> J[Add Interactivity]
H --> J
I --> J
J --> K[Time Animation]
K --> L[Final Map]
Key Features
| Visualization | Purpose | Interactivity | Best Use Case |
|---|---|---|---|
| Choropleth Map | Color-coded regions | Hover, zoom, pan | Country/state comparisons |
| Scatter Geo | Points on map | Click, hover | City locations, events |
| Density Map | Heat mapping | Zoom, filter | Population density, hotspots |
| Animated Map | Time-series | Play/pause, slider | Data evolution over time |
| Line Map | Routes/connections | Hover paths | Migration, trade routes |
Key Sections
Section 1: Environment Setup
Libraries:
- plotly.express: High-level interactive plots
- plotly.graph_objects: Low-level customization
- geocoder: Location to coordinates conversion
Section 2: Data Exploration
Dataset: Heart Disease Dataset (with geographic augmentation)
Key Operations:
# Load data
df = pd.read_csv('heart.csv')
# Check structure
print(df.shape)
df.head()
Section 3: Gapminder Dataset
What is Gapminder?
- Historical statistics (GDP, life expectancy, population)
- Multiple countries and years
- Perfect for animated visualizations
Loading:
import plotly.express as px
gapminder = px.data.gapminder()
Section 4: Geocoding
Purpose: Convert location names to latitude/longitude
Example:
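import geocoder
# Get coordinates for a location (needs network access, and the lookup can fail)
g = geocoder.osm('New York City')
if not g.ok:
    raise ValueError('Geocoding failed for New York City')
lat, lng = g.latlng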
Use Cases:
- Customer locations
- Store addresses
- Event venues
Section 5: Choropleth Maps
Code Pattern:
import plotly.express as px
fig = px.choropleth(
df,
locations='country_code', # ISO country codes
color='value', # Color by this column
hover_name='country', # Show on hover
color_continuous_scale='Viridis',
title='World Data Visualization'
)
fig.show()
Key Parameters:
| Parameter | Description | Example Values |
|-----------|-------------|----------------|
| locations | Geographic identifiers | ISO codes, state names |
| locationmode | Type of location | 'ISO-3', 'USA-states' |
| color | Data for coloring | Any numeric column |
| scope | Map region | 'world', 'usa', 'europe' |
| projection | Map projection | 'natural earth', 'orthographic' |
Section 6: Animated Visualizations
Creating Time-Series Animations:
fig = px.choropleth(
gapminder,
locations='iso_alpha',
color='lifeExp',
hover_name='country',
animation_frame='year', # Animate by year
animation_group='country',
color_continuous_scale='Plasma',
title='Life Expectancy Over Time'
)
fig.show()
Controls:
- Play/Pause button
- Year slider
- Speed adjustment
Customization Options
fig.update_layout(
geo=dict(
showframe=False,
showcoastlines=True,
projection_type='natural earth'
),
title=dict(
text='Custom Title',
x=0.5,
font=dict(size=20, color='darkblue')
)
)
Best Practices (Part 02)
- Use appropriate projections: Natural Earth for world, Albers for USA
- Choose color scales wisely: Sequential for continuous, categorical for discrete
- Include hover information: Make maps informative
- Test geocoding results: Verify coordinates before plotting
- Optimize for performance: Limit data points for smooth interaction
- Consider map context: Show coastlines, borders as needed
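The "test geocoding results" practice can be partly automated with a range check before any point reaches the map. The helper below is a hypothetical sketch (the name `is_valid_latlng` is not from the notebooks); it only checks that the values are plausible coordinates, not that they match the intended place.

```python
import math

def is_valid_latlng(latlng):
    """Return True if latlng is a (lat, lng) pair within valid geographic ranges."""
    try:
        lat, lng = float(latlng[0]), float(latlng[1])
    except (TypeError, ValueError, IndexError):
        # None, empty results, or non-numeric entries are all rejected
        return False
    if math.isnan(lat) or math.isnan(lng):
        return False
    return -90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0

# Example: keep only plausible results from a batch of lookups
results = [(40.71, -74.01), None, (95.0, 10.0), (48.86, 2.35)]
usable = [pair for pair in results if is_valid_latlng(pair)]
```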
Advanced Techniques
Multi-Layer Maps:
import plotly.graph_objects as go
# Combine choropleth with scatter points
fig = go.Figure()
# Add choropleth layer
fig.add_trace(go.Choropleth(...))
# Add scatter points
fig.add_trace(go.Scattergeo(...))
fig.show()
Part 03: Statistical Visualizations
Overview
File: machine-learning-visualization-part-3.ipynb
File on Kaggle: Kaggle link
File on Github: Github link
Focus: Statistical analysis through visualization using Seaborn.
Code Cells: 5 | Markdown Cells: 6
Learning Objectives
- Create and interpret joint plots
- Visualize distributions effectively
- Compare distributions across categories
- Build correlation heatmaps
- Use pair plots for multi-variable analysis
Statistical Visualization Pipeline
flowchart LR
A[Loaded Data] --> B[Univariate Analysis]
B --> C[Distribution Plots]
A --> D[Bivariate Analysis]
D --> E[Joint Plots]
D --> F[Regression Plots]
A --> G[Multivariate Analysis]
G --> H[Pair Plots]
G --> I[Heatmaps]
C --> J[Insights]
E --> J
F --> J
H --> J
I --> J
Key Features
| Plot Type | Purpose | Shows | Seaborn Function |
|---|---|---|---|
| Joint Plot | Bivariate + distributions | 2 variables + margins | sns.jointplot() |
| Distribution Plot | Data spread | Histogram + KDE | sns.displot() |
| Box Plot | Statistical summary | Quartiles, outliers | sns.boxplot() |
| Violin Plot | Distribution shape | Density + quartiles | sns.violinplot() |
| Heatmap | Matrix visualization | Correlations, patterns | sns.heatmap() |
| Pair Plot | Multiple relationships | All variable pairs | sns.pairplot() |
Key Sections
Section 1: Environment Setup
Core Libraries:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy import stats
Configuration:
# Set style for better aesthetics
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)
Section 2: Data Loading
Dataset: Heart Disease Dataset
Initial Exploration:
- Shape and structure
- Data types
- Missing values
- Basic statistics
Section 3: Joint Plots
What is a Joint Plot?
A joint plot combines:
- Central plot: Scatter, hexbin, or KDE of two variables
- Marginal plots: Distribution of each variable on the axes
Types:
| Kind | Central Plot | Use Case |
|---|---|---|
| scatter | Scatter plot | Individual data points |
| reg | Scatter + regression | Linear relationships |
| hex | Hexbin density | Large datasets |
| kde | 2D density | Smooth distributions |
| hist | 2D histogram | Binned counts |
Code Example:
# Basic joint plot
sns.jointplot(data=df, x='age', y='chol', kind='scatter')
# With regression
sns.jointplot(data=df, x='age', y='chol', kind='reg',
color='steelblue', height=8)
# KDE joint plot
sns.jointplot(data=df, x='age', y='chol', kind='kde',
fill=True, cmap='Blues')
Section 4: Regression Analysis
Understanding Regression Lines:
- Shows linear trend
- Confidence interval (shaded area)
- Pearson correlation coefficient
Interpretation:
# Calculate correlation
from scipy.stats import pearsonr
corr, p_value = pearsonr(df['age'], df['chol'])
print(f'Correlation: {corr:.3f}, P-value: {p_value:.4f}')
Statistical Significance:
- p < 0.05: Statistically significant linear relationship
- p ≥ 0.05: No statistically significant linear relationship detected
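To make the correlation coefficient less of a black box, here is a NumPy-only sketch of the Pearson formula. The function name `pearson_r` and the small age/cholesterol lists are made up for illustration; they are not values from the Heart Disease dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

# Made-up values, for illustration only
ages = [29, 35, 41, 50, 58, 63]
chol = [180, 195, 210, 240, 235, 260]
r = pearson_r(ages, chol)
```

The result should agree with `np.corrcoef(ages, chol)[0, 1]`, which is a convenient cross-check.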
Section 5: Distribution Plots
Visualizing Single Variables:
# Histogram with KDE
sns.histplot(data=df, x='age', kde=True, bins=30)
# Distribution plot with hue
sns.displot(data=df, x='age', hue='target',
kind='kde', fill=True, alpha=0.5)
Options:
- kde=True: Add kernel density estimate
- hue: Separate by category
- bins: Number of histogram bins
- fill: Fill KDE area
Section 6: Box & Violin Plots
Box Plot Structure:
Max (or Q3 + 1.5*IQR)
     ─────
       │
Q3 ────┤
       │   ← IQR (Interquartile Range)
Q2 ────┤   ← Median
       │
Q1 ────┤
       │
     ─────
Min (or Q1 - 1.5*IQR)
       •   ← Outliers
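The quantities in the diagram can be computed directly, which helps when checking what a box plot is showing. This is a sketch under the common 1.5 × IQR convention; the helper name `box_plot_stats` and the sample values are assumptions for the example.

```python
import numpy as np

def box_plot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1), "median": float(median), "q3": float(q3),
        # Whiskers stop at the most extreme data points inside the fences
        "whisker_low": float(inside.min()), "whisker_high": float(inside.max()),
        "outliers": outliers.tolist(),
    }

stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
```

Note that the whiskers end at actual data points inside the fences, so they are usually shorter than Q1 - 1.5*IQR and Q3 + 1.5*IQR.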
Code Examples:
# Box plot
sns.boxplot(data=df, x='target', y='age')
# Violin plot (shows distribution shape); split=True is meant for a two-level hue
sns.violinplot(data=df, x='cp', y='age', hue='target',
               split=True, inner='quartile')
# Grouped comparison
sns.boxplot(data=df, x='cp', y='chol', hue='target')
Section 7: Correlation Heatmaps
Purpose: Visualize relationships between all numeric variables
Code Pattern:
# Calculate correlation matrix (numeric columns only)
corr_matrix = df.select_dtypes(include='number').corr()
# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix,
annot=True, # Show values
fmt='.2f', # Format to 2 decimals
cmap='coolwarm', # Color scheme
center=0, # Center colormap at 0
square=True, # Square cells
linewidths=1, # Grid lines
cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
Interpreting Correlations:
| Value Range | Interpretation |
|-------------|----------------|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.5 to 0.7 | Moderate positive |
| 0.3 to 0.5 | Weak positive |
| -0.3 to 0.3 | Negligible |
| -0.5 to -0.3 | Weak negative |
| -0.7 to -0.5 | Moderate negative |
| -0.9 to -0.7 | Strong negative |
| -1.0 to -0.9 | Very strong negative |
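The table can be turned into a small lookup for labeling heatmap cells or summaries. This is a hypothetical helper (`describe_correlation` is not from the notebooks), and assigning each boundary value to the stronger category is an assumption, since the table's ranges share their endpoints.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the labels used in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude < 0.3:
        return "Negligible"
    if magnitude >= 0.9:
        strength = "Very strong"
    elif magnitude >= 0.7:
        strength = "Strong"
    elif magnitude >= 0.5:
        strength = "Moderate"
    else:
        strength = "Weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

labels = [describe_correlation(r) for r in (0.95, -0.6, 0.1, 0.35)]
```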
Section 8: Pair Plots
Multi-Variable Exploration:
# Basic pair plot
sns.pairplot(df)
# With categorical coloring
sns.pairplot(df, hue='target', palette='Set2',
diag_kind='kde', # Diagonal plots
plot_kws={'alpha': 0.6})
What It Shows:
- Diagonal: Distribution of each variable
- Off-diagonal: Scatter plots between variable pairs
- Colored by category (with hue)
When to Use:
- Initial data exploration
- Feature selection
- Identifying patterns across multiple variables
Best Practices (Part 03)
- Check assumptions: Linearity, normality for appropriate tests
- Handle outliers: Identify and treat before statistical analysis
- Choose appropriate plots: Match plot to data distribution
- Report statistics: Include correlation coefficients, p-values
- Use appropriate color scales: Diverging for correlations
- Consider sample size: Some plots need sufficient data points
Statistical Interpretation Guide
P-Value Interpretation:
- p < 0.001: Very highly significant
- p < 0.01: Highly significant
- p < 0.05: Significant
- p ≥ 0.05: Not significant
Effect Size:
- Small: |r| < 0.3
- Medium: 0.3 ≤ |r| < 0.5
- Large: |r| ≥ 0.5
Part 04: 3D Visualizations
Overview
File: machine-learning-visualization-part-4.ipynb
File on Kaggle: Kaggle link
File on Github: Github link
Focus: Three-dimensional and advanced multi-dimensional visualizations.
Code Cells: 6 | Markdown Cells: 8
Learning Objectives
- Create 3D scatter and surface plots
- Build interactive 3D visualizations with Plotly
- Visualize multi-dimensional data
- Use dimensionality reduction (PCA) for visualization
- Create bubble charts (4D visualization)
3D Visualization Pipeline
flowchart TD
A[Multi-Dimensional Data] --> B{Dimensions}
B -->|3D| C[Direct 3D Plot]
B -->|>3D| D[Dimensionality Reduction]
D --> E[PCA/t-SNE]
E --> C
C --> F{Plot Type}
F --> G[3D Scatter]
F --> H[3D Surface]
F --> I[3D Line]
G --> J[Add Interactivity]
H --> J
I --> J
J --> K{Library}
K -->|Matplotlib| L[Static 3D]
K -->|Plotly| M[Interactive 3D]
L --> N[Final Visualization]
M --> N
Key Features
| Visualization | Dimensions | Best For | Library |
|---|---|---|---|
| 3D Scatter | 3-4 (with color/size) | Point distributions | Matplotlib/Plotly |
| 3D Surface | Z = f(X, Y) | Continuous functions | Matplotlib/Plotly |
| 3D Line | Time-series in 3D | Trajectories, paths | Matplotlib |
| Bubble Chart | 4 (x, y, z, size) | Multi-dimensional relationships | Plotly |
| PCA 3D | N → 3 | High-dimensional data | Plotly + sklearn |
Key Sections
Section 1: Environment Setup
Core Libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
import plotly.graph_objects as go
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
3D Matplotlib Setup:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
Section 2: Data Loading
Dataset: Brain Stroke Dataset
Features for 3D Visualization:
- Age
- BMI (Body Mass Index)
- Average Glucose Level
- (Color/size for additional dimensions)
Section 3: Basic 3D Scatter Plot
Matplotlib Example:
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
# Create scatter plot
scatter = ax.scatter(df['age'],
df['bmi'],
df['avg_glucose_level'],
c=df['stroke'], # Color by target
cmap='viridis',
s=50, # Point size
alpha=0.6,
edgecolors='k')
# Labels
ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('BMI', fontsize=12)
ax.set_zlabel('Glucose Level', fontsize=12)
ax.set_title('3D Patient Data Visualization', fontsize=14)
# Add colorbar
plt.colorbar(scatter, label='Stroke')
plt.show()
Key Parameters:
| Parameter | Description | Example |
|-----------|-------------|---------|
| projection='3d' | Enable 3D axes | Required for 3D |
| c | Color values | Numeric or categorical |
| cmap | Color map | 'viridis', 'plasma' |
| s | Point size | 50, or array |
| alpha | Transparency | 0.0 to 1.0 |
Section 4: Interactive 3D with Plotly
Why Plotly?
- Interactive rotation
- Zoom and pan
- Hover information
- Better for presentations
Basic Plotly 3D:
fig = px.scatter_3d(df,
x='age',
y='bmi',
z='avg_glucose_level',
color='stroke',
symbol='gender',
size='age',
hover_data=['work_type', 'smoking_status'],
title='Interactive 3D Patient Analysis',
labels={'age': 'Age (years)',
'bmi': 'Body Mass Index',
'avg_glucose_level': 'Glucose Level'})
fig.update_traces(marker=dict(line=dict(width=0.5, color='DarkSlateGrey')))
fig.show()
Plotly Advantages:
- Interactive controls
- Hover tooltips
- Export to HTML
- Better for web dashboards
Section 5: 3D Surface Plots
Creating Meshgrid Data:
# Generate grid
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
# Define function
Z = np.sin(np.sqrt(X**2 + Y**2))
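The orientation of the meshgrid arrays is a common source of confusion when moving to surface plots. The sketch below uses deliberately different lengths for `x` and `y` (an assumption for illustration) so the shapes show which axis is which.

```python
import numpy as np

x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 30)   # different length on purpose, to expose the orientation

# With the default indexing='xy', both grids have shape (len(y), len(x))
X, Y = np.meshgrid(x, y)

# Each row of X repeats x; each column of Y repeats y
Z = np.sin(np.sqrt(X**2 + Y**2))
```

Because `Z` is computed element-wise, it inherits the grid shape, which is what the surface plotting calls expect.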
Matplotlib Surface:
fig = plt.figure(figsize=(12, 9))
ax = fig.add_subplot(111, projection='3d')
surf = ax.plot_surface(X, Y, Z,
cmap='coolwarm',
edgecolor='none',
alpha=0.8)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
plt.colorbar(surf)
plt.show()
Plotly Surface:
fig = go.Figure(data=[go.Surface(x=X, y=Y, z=Z,
colorscale='Viridis')])
fig.update_layout(title='3D Surface Plot',
scene=dict(
xaxis_title='X Axis',
yaxis_title='Y Axis',
zaxis_title='Z Axis'),
width=900,
height=700)
fig.show()
Section 6: Dimensionality Reduction
Why PCA for Visualization?
- Reduce high-dimensional data to 3D
- Preserve maximum variance
- Visualize complex datasets
PCA Workflow:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
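# Select numeric features; exclude the target (and any ID column) so they do not leak into the components
features = df.select_dtypes(include=[np.number]).columns.drop(['stroke', 'id'], errors='ignore')
# PCA cannot handle NaN, so keep complete rows only
X = df[features].dropna()
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)
# Create DataFrame
pca_df = pd.DataFrame(data=X_pca,
                      columns=['PC1', 'PC2', 'PC3'])
# Align the target with the rows that survived dropna()
pca_df['target'] = df.loc[X.index, 'stroke'].values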
# Visualize
fig = px.scatter_3d(pca_df,
x='PC1', y='PC2', z='PC3',
color='target',
title='PCA 3D Visualization')
fig.show()
Explained Variance:
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance explained:",
sum(pca.explained_variance_ratio_))
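For intuition about what the explained variance ratio measures, here is a NumPy-only sketch that derives the same kind of ratio from the singular values of standardized data. It is an illustration rather than scikit-learn's implementation, and the helper name and synthetic data (five features driven by two latent factors) are assumptions.

```python
import numpy as np

def explained_variance_ratio(X, n_components=3):
    """Share of total variance captured by each of the first n_components
    principal components of the standardized data."""
    X = np.asarray(X, dtype=float)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    singular_values = np.linalg.svd(X_std, compute_uv=False)
    variances = singular_values ** 2
    return variances[:n_components] / variances.sum()

# Synthetic data: five observed features generated from two latent factors plus tiny noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -1.0, 2.0, 0.3],
                   [0.2, 1.0, 1.0, -0.5, 1.5]])
X_demo = latent @ mixing + rng.normal(scale=0.01, size=(200, 5))

ratios = explained_variance_ratio(X_demo)
cumulative = np.cumsum(ratios)
```

When the cumulative share of the first three components is low, the 3D PCA plot hides most of the structure and should be read with that caveat.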
Section 7: Bubble Charts (4D Visualization)
Adding Fourth and Fifth Dimensions with Color and Size:
fig = px.scatter_3d(df,
x='age',
y='bmi',
z='avg_glucose_level',
color='stroke', # 4th dimension
size='heart_disease', # 5th dimension!
hover_name='id',
title='5D Visualization (x, y, z, color, size)')
fig.show()
Dimension Mapping:
| Dimension | Visual Encoding | Best For |
|-----------|-----------------|----------|
| X-axis | Horizontal position | Primary variable |
| Y-axis | Vertical position | Secondary variable |
| Z-axis | Depth | Tertiary variable |
| Color | Hue | Categorical or continuous |
| Size | Point radius | Magnitude or importance |
| Shape | Marker type | Categories (limited) |
Best Practices (Part 04)
- Limit data points: Too many points obscure patterns
- Use appropriate projections: Orthographic for technical, perspective for natural
- Add interactivity: Rotation enhances understanding
- Choose colors carefully: 3D depth perception affected by color
- Provide multiple views: Show from different angles
- Consider accessibility: Some users struggle with 3D perception
- Standardize data: Before PCA or other dimensionality reduction
- Explain axes: Especially for PCA (variance explained)
Customization Techniques
Camera Position (Plotly):
fig.update_layout(
scene_camera=dict(
eye=dict(x=1.5, y=1.5, z=1.5),
center=dict(x=0, y=0, z=0),
up=dict(x=0, y=0, z=1)
)
)
Viewing Angle (Matplotlib):
ax.view_init(elev=30, azim=45) # Elevation and azimuth
When to Use 3D Visualizations
Good Use Cases:
- Truly 3-dimensional data (spatial, physical)
- Demonstrations and presentations (interactive)
- Exploratory analysis of multi-dimensional data
- Showing trajectories or time-series paths
When to Avoid:
- 2D alternatives are clearer
- Printed/static reports (hard to interpret)
- Precise value reading required
- Large datasets (performance issues)
Part 05: Missing Data Visualization
Overview
File: machine-learning-visualization-part-5.ipynb
File on Kaggle: Kaggle link
File on Github: Github link
Focus: Visualizing and handling missing data, binning, and data preprocessing.
Code Cells: 7 | Markdown Cells: 8
Learning Objectives
- Visualize missing data patterns
- Assess data quality
- Perform binning and discretization
- Handle missing values appropriately
- Create preprocessed datasets for modeling
Missing Data Analysis Pipeline
flowchart TD
A[Raw Dataset] --> B[Load Data]
B --> C[Check Missing Values]
C --> D{Missing Data?}
D -->|Yes| E[Visualize Patterns]
D -->|No| K[Proceed to Analysis]
E --> F[Missing Matrix]
E --> G[Bar Chart]
E --> H[Heatmap]
E --> I[Dendrogram]
F --> J{Action Required?}
G --> J
H --> J
I --> J
J -->|Drop| L[Remove Rows/Columns]
J -->|Impute| M[Fill Values]
J -->|Keep| K
L --> K
M --> K
K --> N[Binning/Discretization]
N --> O[Final Clean Dataset]
Key Features
| Visualization | Purpose | Library | Insights Provided |
|---|---|---|---|
| Missing Matrix | Overview of missingness | missingno | Patterns, extent |
| Bar Chart | Non-missing counts per column | missingno | Which features affected |
| Heatmap | Correlation of missingness | missingno | Related missing patterns |
| Dendrogram | Hierarchical clustering | missingno | Groups of missingness |
Key Sections
Section 1: Environment Setup
Core Libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno # Specialized for missing data
Installing missingno:
pip install missingno
Section 2: Data Loading
Dataset: Heart Disease Dataset (with induced missing values for demonstration)
Initial Check:
# Load data
df = pd.read_csv('heart.csv')
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
# Percentage missing
print("\nPercentage missing:")
print((df.isnull().sum() / len(df)) * 100)
Section 3: Why Visualize Missing Data?
Importance:
- Pattern Detection: Random vs. systematic missingness
- Impact Assessment: How much data is affected
- Relationship Analysis: Which variables have correlated missingness
- Decision Making: Drop, impute, or keep as-is
Types of Missingness:
| Type | Description | Example | Handling |
|------|-------------|---------|----------|
| MCAR | Missing Completely At Random | Random survey non-response | Safe to drop |
| MAR | Missing At Random | Income missing for unemployed | Impute conditionally |
| MNAR | Missing Not At Random | High earners hide income | Complex imputation |
Section 4: Missing Data Visualizations
Matrix Visualization:
# Missing data matrix
msno.matrix(df, figsize=(12, 6), fontsize=12)
plt.title('Missing Data Matrix')
plt.show()
Interpretation:
- White lines = missing values
- Black/colored = present values
- Patterns indicate systematic missingness
Bar Chart:
# Bar chart of completeness per column
msno.bar(df, figsize=(12, 6), fontsize=12, color='steelblue')
plt.title('Data Completeness by Feature')
plt.show()
Shows:
- Proportion of non-missing values per column (left axis)
- Absolute count of non-missing values (right axis and bar labels)
Heatmap:
# Missing data correlation heatmap
msno.heatmap(df, figsize=(12, 10), fontsize=12)
plt.title('Missing Data Correlation')
plt.show()
Interpretation:
- Values close to 1: Missingness strongly correlated
- Values close to 0: Independent missingness
- Negative values: Inverse relationship
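A rough pandas-only sketch of the quantity behind this heatmap, assuming it is the pairwise correlation of the missing-value indicators. The toy DataFrame and its column names are made up so that each case in the interpretation list appears once.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],    # missing in the same rows as "a"
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],  # present only where "a" is missing
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],              # complete
})

is_missing = demo.isnull()

# Columns that are never or always missing have no variance, so leave them out
varying = [col for col in demo.columns if 0 < is_missing[col].sum() < len(demo)]

nullity_corr = is_missing[varying].astype(int).corr()
```

Here "a" and "b" are always missing together, while "c" is missing exactly when "a" is present, which gives the two extremes of the scale.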
Dendrogram:
# Hierarchical clustering of missingness
msno.dendrogram(df, figsize=(12, 6), fontsize=12)
plt.title('Missing Data Dendrogram')
plt.show()
Use: Identifies groups of features with similar missing patterns
Section 5: Handling Missing Data
Strategies:
| Method | When to Use | Pros | Cons |
|---|---|---|---|
| Drop Rows | MCAR, <5% missing | Simple, no bias | Data loss |
| Drop Columns | >50% missing, not important | Clean dataset | Feature loss |
| Mean/Median Imputation | MCAR, numeric data | Simple, fast | Reduces variance |
| Mode Imputation | Categorical data | Preserves distribution | May increase mode frequency |
| Forward/Backward Fill | Time series | Maintains trends | Propagates errors |
| Interpolation | Ordered data | Smooth estimates | Assumes continuity |
| Model-Based | MAR, complex patterns | Sophisticated | Computationally expensive |
Code Examples:
# Drop rows with any missing values
df_dropped = df.dropna()
# Drop columns with >50% missing (thresh is the minimum number of non-missing values)
threshold = int(len(df) * 0.5)
df_dropped_cols = df.dropna(axis=1, thresh=threshold)
# Mean imputation (assign back; inplace=True on a selected column is deprecated in recent pandas)
df['age'] = df['age'].fillna(df['age'].mean())
# Median imputation (more robust to outliers)
df['chol'] = df['chol'].fillna(df['chol'].median())
# Mode imputation for categorical
df['cp'] = df['cp'].fillna(df['cp'].mode()[0])
# Forward fill (time series); fillna(method='ffill') is deprecated in recent pandas
df = df.ffill()
# Interpolation
df['chol'] = df['chol'].interpolate(method='linear')
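The "reduces variance" drawback of mean imputation noted in the strategies table is easy to verify on a small example. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up measurements with three missing entries
values = pd.Series([50.0, 55.0, np.nan, 62.0, np.nan, 70.0, 48.0, np.nan])

imputed = values.fillna(values.mean())

std_before = values.std()    # pandas skips NaN by default
std_after = imputed.std()    # filled entries sit exactly at the mean, so spread shrinks
```

The mean is unchanged, but every filled value adds an observation with zero deviation, so the standard deviation drops. This is one reason to compare model results with and without imputation.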
Advanced: Multiple Imputation:
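from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the import below)
from sklearn.impute import IterativeImputer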
imputer = IterativeImputer(random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df),
columns=df.columns)
Section 6: Binning and Discretization
Purpose: Convert continuous variables to categorical bins
Why Bin Data?
- Simplify models: Reduce continuous complexity
- Handle outliers: Group extreme values
- Create categories: For business rules (e.g., age groups)
- Improve interpretability: Easier to understand
Equal-Width Binning:
# Create bins of equal width
df['age_bin'] = pd.cut(df['age'],
bins=5, # Number of bins
labels=['Very Young', 'Young', 'Middle',
'Senior', 'Elderly'])
# Custom bin edges (open-ended top bin so values above 300 do not become NaN)
df['chol_bin'] = pd.cut(df['chol'],
                        bins=[0, 200, 240, np.inf],
                        labels=['Low', 'Normal', 'High'])
Equal-Frequency Binning (Quantiles):
# Each bin has approximately same number of observations
df['age_qbin'] = pd.qcut(df['age'],
q=4, # Quartiles
labels=['Q1', 'Q2', 'Q3', 'Q4'])
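The difference between the two strategies is clearest on skewed data. This sketch uses a synthetic series of squared integers (an assumption for illustration) and compares how many observations land in each bin.

```python
import numpy as np
import pandas as pd

# Skewed synthetic data: the squares of 1..100
values = pd.Series(np.arange(1, 101) ** 2)

equal_width = pd.cut(values, bins=4)   # same bin width, uneven counts
equal_freq = pd.qcut(values, q=4)      # uneven widths, roughly equal counts

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
```

Equal-width bins concentrate most observations in the first bin here, while quantile bins keep the counts balanced at the cost of very different bin widths.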
Visualizing Bins:
# Distribution of binned data
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
df['age_bin'].value_counts().plot(kind='bar', color='skyblue')
plt.title('Age Distribution (Equal-Width Bins)')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.subplot(1, 2, 2)
df['age_qbin'].value_counts().plot(kind='bar', color='lightcoral')
plt.title('Age Distribution (Quantile Bins)')
plt.xlabel('Quartile')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
Section 7: Advanced Joint Plots with Hue
Multi-Dimensional Visualization:
# Joint plot with categorical hue
sns.jointplot(data=df,
x='age',
y='chol',
hue='target',
kind='kde',
fill=True,
alpha=0.5,
height=10)
plt.suptitle('Age vs. Cholesterol by Heart Disease Status',
y=1.02, fontsize=14)
plt.show()
Benefits:
- Shows distributions for each category
- Identifies class separation
- Useful for feature selection
Best Practices (Part 05)
- Always visualize first: Before handling missing data
- Document decisions: Record why you dropped/imputed
- Check assumptions: Ensure MCAR before simple imputation
- Test sensitivity: See how imputation affects models
- Preserve original data: Keep a copy before modifications
- Consider domain knowledge: Subject matter experts guide imputation
- Bin carefully: Too few bins lose information, too many overfit
- Choose appropriate bin strategy: Equal-width vs. quantile based on use case
Data Quality Checklist
- [ ] Missing values identified and quantified
- [ ] Missingness patterns analyzed
- [ ] Appropriate handling strategy selected
- [ ] Imputation assumptions validated
- [ ] Outliers identified and addressed
- [ ] Binning applied where beneficial
- [ ] Data types correct
- [ ] Ranges validated (no impossible values)
- [ ] Duplicates checked
- [ ] Final dataset documented
Comparison Table
Feature Comparison Across All Notebooks
| Feature | Part 01 | Part 02 | Part 03 | Part 04 | Part 05 |
|---|---|---|---|---|---|
| Primary Focus | Basic plots | Geographic | Statistical | 3D/Multi-D | Data quality |
| Code Cells | 80 | 12 | 5 | 6 | 7 |
| Markdown Cells | 18 | 18 | 6 | 8 | 8 |
| Difficulty | ★ Beginner | ★★ Intermediate | ★★ Intermediate | ★★★ Advanced | ★★ Intermediate |
| Interactivity | Medium | High | Low | High | Medium |
| Main Library | Matplotlib | Plotly | Seaborn | Plotly/Matplotlib | missingno |
| Dataset Used | Various | Gapminder + Heart | Heart Disease | Brain Stroke | Heart Disease |
| Key Technique | Scatter/Bar | Choropleth maps | Joint plots | 3D scatter | Missing data viz |
| Animation | β | β | β | β | β |
| 3D Support | β | β | β | β | β |
| Statistical Tests | β | β | β | β | β |
| Best For | Learning basics | Location data | Correlations | Complex data | Preprocessing |
Library Usage Matrix
| Library | Part 01 | Part 02 | Part 03 | Part 04 | Part 05 |
|---|---|---|---|---|---|
| Pandas | β | β | β | β | β |
| NumPy | β | β | β | β | β |
| Matplotlib | β | β | β | β | β |
| Seaborn | β | β | β | β | β |
| Plotly Express | β | β | β | β | β |
| Plotly Graph Objects | β | β | β | β | β |
| Geocoder | β | β | β | β | β |
| SciPy | β | β | β | β | β |
| Scikit-learn | β | β | β | β | β |
| missingno | β | β | β | β | β |
Best Practices
General Visualization Principles
- Know Your Audience
  - Technical vs. non-technical
  - Adjust complexity accordingly
  - Provide context and interpretation
- Choose the Right Chart Type
  - Comparison → Bar charts
  - Distribution → Histograms, box plots
  - Relationship → Scatter plots
  - Composition → Pie charts, stacked bars
  - Trends → Line charts
  - Geographic → Choropleth, point maps
- Design for Clarity
  - Clear titles and labels
  - Appropriate color schemes
  - Sufficient white space
  - Readable font sizes
  - Legends when needed
- Color Usage
  - Sequential: One variable, ordered (e.g., low to high)
  - Diverging: Data with a meaningful center (e.g., correlations)
  - Categorical: Distinct categories
  - Accessibility: Color-blind friendly palettes
- Storytelling with Data
  - Guide the viewer's attention
  - Highlight key insights
  - Provide context
  - Explain unexpected patterns
Code Quality
- Reproducibility
# Set random seed
np.random.seed(42)
# Document versions
# Python 3.8.10
# pandas 1.3.0
# matplotlib 3.4.2
- Modularity
import matplotlib.pyplot as plt
import seaborn as sns

def create_scatter_plot(df, x, y, hue=None, title=''):
    """
    Create standardized scatter plot.
    Parameters:
    -----------
    df : DataFrame
    x, y : str, column names
    hue : str, optional categorical column
    title : str
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    sns.scatterplot(data=df, x=x, y=y, hue=hue, ax=ax)
    ax.set_title(title, fontsize=14)
    fig.tight_layout()
    return fig, ax
- Error Handling
import pandas as pd

try:
    df = pd.read_csv('data.csv')
except FileNotFoundError:
    print("Error: File not found")
except pd.errors.EmptyDataError:
    print("Error: File is empty")
- Documentation
  - Comment complex operations
  - Use docstrings for functions
  - Explain non-obvious choices
  - Include sources for data/methods
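Rather than typing version numbers into comments by hand, the versions can be recorded programmatically at the top of a notebook. The following is a minimal sketch; `library_versions` is a hypothetical helper name, and it assumes each package exposes a `__version__` attribute, which is a common convention rather than a guarantee.

```python
import importlib
import platform


def library_versions(names):
    """Return {name: version string} for Python and each requested package."""
    versions = {"python": platform.python_version()}
    for name in names:
        try:
            module = importlib.import_module(name)
            # Most scientific packages expose __version__, but not all do
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    return versions


versions = library_versions(["numpy", "pandas", "matplotlib"])
for name, version in versions.items():
    print(f"{name}: {version}")
```

Printing this once per notebook run keeps the recorded versions in sync with the environment that actually produced the plots.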
Performance Optimization
- Large Datasets
  - Sample for initial exploration
  - Use appropriate data types
  - Consider aggregation
  - Use hexbin for dense scatter plots
- Interactive Plots
  - Limit data points for Plotly (< 10k recommended)
  - Use the WebGL renderer for large datasets (for example, render_mode='webgl' in px.scatter, or go.Scattergl)
  - Disable unused features
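The two large-dataset tactics above, sampling and hexbin aggregation, can be compared side by side. This is a sketch on synthetic data; the row count, sample size, and grid size are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script runs; drop this line in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic dense data (an assumption for illustration)
rng = np.random.default_rng(42)
n_rows = 200_000
df = pd.DataFrame({"x": rng.normal(size=n_rows), "y": rng.normal(size=n_rows)})

# Option 1: plot a reproducible random sample instead of every row
sample = df.sample(n=5_000, random_state=42)

fig, (ax_sample, ax_hex) = plt.subplots(1, 2, figsize=(12, 5))
ax_sample.scatter(sample["x"], sample["y"], s=4, alpha=0.4)
ax_sample.set_title("Scatter of a 5,000-row sample")

# Option 2: aggregate all rows into hexagonal bins colored by point count
hb = ax_hex.hexbin(df["x"], df["y"], gridsize=40, cmap="viridis", mincnt=1)
ax_hex.set_title("Hexbin of all rows")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")
fig.tight_layout()
```

Sampling keeps individual points visible but discards data; hexbin keeps every row but shows density rather than points. Which one fits depends on whether outliers or overall shape matter more.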
Memory Management
# Delete unnecessary DataFrames
del df_temp
# Use categorical dtype
df['category'] = df['category'].astype('category')
# Load only needed columns
df = pd.read_csv('data.csv', usecols=['col1', 'col2'])
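To check whether a dtype change is worth making, measure the column before and after. The sketch below uses a synthetic low-cardinality string column; the column names and sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic frame with a repetitive string column (an assumption for illustration)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "category": rng.choice(["low", "medium", "high"], size=100_000),
    "value": rng.normal(size=100_000),
})

# deep=True counts the actual string payload, not just the pointers
before = df["category"].memory_usage(deep=True)
df["category"] = df["category"].astype("category")
after = df["category"].memory_usage(deep=True)

print(f"before conversion: {before / 1e6:.2f} MB")
print(f"after conversion:  {after / 1e6:.2f} MB")
```

The saving comes from storing each distinct label once plus a small integer code per row, so it is largest when a column has few unique values relative to its length.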
Quick Reference
Common Plot Types and When to Use Them
| Data Type | Comparison | Distribution | Relationship | Composition | Trend |
|---|---|---|---|---|---|
| Categorical | Bar, column | - | - | Pie, stacked bar | - |
| Continuous | Box plot | Histogram, KDE | Scatter, joint plot | Area chart | Line chart |
| Time Series | - | - | Line, area | Stacked area | Line chart |
| Geographic | - | - | - | Choropleth | Animated map |
| 3D | - | - | 3D scatter | 3D surface | 3D line |
Essential Code Snippets
Matplotlib Basic Setup
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
# Your plot code here, e.g. ax.plot(x, y)
ax.set_title('Title', fontsize=14)
ax.set_xlabel('X Label', fontsize=12)
ax.set_ylabel('Y Label', fontsize=12)
fig.tight_layout()
plt.show()
Seaborn Quick Plot
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
sns.scatterplot(data=df, x='var1', y='var2', hue='category')
plt.show()
Plotly Interactive
import plotly.express as px
fig = px.scatter(df, x='var1', y='var2', color='category',
                 hover_data=['additional_info'])
fig.show()
Missing Data Check
import matplotlib.pyplot as plt
import missingno as msno
# Quick overview
msno.matrix(df)
plt.show()
# Detailed analysis: missing count per column, then missing percentage
print(df.isnull().sum())
print((df.isnull().sum() / len(df)) * 100)
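The two print statements above can be folded into one sorted table, which is easier to scan on wide datasets. This is a minimal sketch using pandas only; `missing_summary` is a hypothetical helper name and the example columns are made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only affected columns, worst first
    affected = summary[summary["missing_count"] > 0]
    return affected.sort_values("missing_count", ascending=False)


# Tiny illustrative frame (values are invented, not from the notebook datasets)
df = pd.DataFrame({
    "age": [63, np.nan, 41, 56, np.nan],
    "chol": [233, 250, np.nan, 236, 354],
    "sex": [1, 0, 1, 1, 0],
})
summary = missing_summary(df)
print(summary)
```

Columns without missing values are dropped from the table, so an empty result means the frame is complete.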
Color Palettes
Seaborn Built-in:
- deep, muted, pastel, bright, dark, colorblind
Matplotlib Colormaps:
- Sequential: viridis, plasma, inferno, magma, cividis
- Diverging: coolwarm, RdYlBu, seismic
- Qualitative: tab10, tab20, Set1, Set2, Set3
Plotly Color Scales:
- Sequential: Blues, Greens, Reds, Viridis, Plasma
- Diverging: RdBu, PiYG, Spectral
File Export
# Matplotlib
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.savefig('plot.svg')  # Vector format
plt.savefig('plot.pdf')
# Plotly (fig is a Plotly figure here; write_image needs the kaleido package installed)
fig.write_html('plot.html')
fig.write_image('plot.png', width=1200, height=800)
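When the same Matplotlib figure is needed in several formats, a small wrapper avoids repeating the savefig calls. The sketch below is one possible shape for such a helper; `save_figure` and its parameters are hypothetical names, and it writes to a temporary directory only so the example cleans up after itself.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend for script runs; drop this line in Jupyter
import matplotlib.pyplot as plt


def save_figure(fig, stem, out_dir, formats=("png", "svg", "pdf"), dpi=300):
    """Save one figure in several formats and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for fmt in formats:
        path = out_dir / f"{stem}.{fmt}"
        # Matplotlib infers the output format from the file extension
        fig.savefig(path, dpi=dpi, bbox_inches="tight")
        paths.append(path)
    return paths


fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Example figure")

with tempfile.TemporaryDirectory() as tmp:
    saved = save_figure(fig, "plot", tmp)
    all_exist = all(p.exists() and p.stat().st_size > 0 for p in saved)
    saved_names = [p.name for p in saved]
plt.close(fig)
```

In a real project the output directory would be a persistent folder such as a figures directory next to the notebook.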
Troubleshooting
| Issue | Solution |
|---|---|
| Plot not showing | Call plt.show() or use %matplotlib inline in Jupyter |
| Overlapping labels | Use plt.tight_layout() or adjust figure size |
| Too slow (Plotly) | Reduce data points or use sampling |
| Memory error | Load data in chunks or use smaller sample |
| Font too small | Increase with fontsize parameter |
| Legend outside plot | bbox_to_anchor=(1.05, 1), loc='upper left' |
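For the memory error row above, loading in chunks can look like the following sketch. It aggregates per category while reading, so only one chunk is held in memory at a time. The CSV is generated in memory to keep the example self-contained; the column names and chunk size are illustrative assumptions.

```python
import io

import pandas as pd

# In-memory CSV standing in for a file too large to load at once (an assumption)
csv_text = "category,value\n" + "\n".join(
    f"{'abc'[i % 3]},{i}" for i in range(1000)
)

totals = {}
# chunksize makes read_csv yield DataFrames of at most 250 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=250):
    chunk_sums = chunk.groupby("category")["value"].sum()
    for category, subtotal in chunk_sums.items():
        totals[category] = totals.get(category, 0) + int(subtotal)

print(totals)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path; the aggregated result is then small enough to plot directly.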
Conclusion
These five notebooks provide a comprehensive journey through data visualization for machine learning:
- Part 01: Foundation - Basic plots and techniques
- Part 02: Geographic - Maps and spatial data
- Part 03: Statistical - Correlations and distributions
- Part 04: Advanced - 3D and multi-dimensional
- Part 05: Quality - Missing data and preprocessing
Learning Path Recommendation
graph LR
A[Complete Beginner] --> B[Part 01: Basics]
B --> C{Interest?}
C -->|Location Data| D[Part 02: Geographic]
C -->|Statistics| E[Part 03: Statistical]
C -->|Advanced Tech| F[Part 04: 3D]
C -->|Data Cleaning| G[Part 05: Missing Data]
D --> H[Intermediate Level]
E --> H
F --> H
G --> H
H --> I[Combine Techniques]
I --> J[Real Projects]
Next Steps
- Practice: Apply techniques to your own datasets
- Combine: Use multiple visualization types together
- Customize: Develop your own plotting functions
- Share: Create dashboards and reports
- Contribute: Improve these notebooks on GitHub
Resources
- Matplotlib Documentation
- Seaborn Tutorial
- Plotly Python Guide
- missingno Documentation
- Kaggle Datasets
Author: Kaggle User thecoder8890
Repository: thecoder8890/ml-visual-handbook
Last Updated: 2025
License: MIT (if applicable)
This documentation is maintained alongside the notebooks. For issues, suggestions, or contributions, please open an issue on GitHub.