Over the past four parts, we zoomed in on single features (univariate analysis), then looked at pairs (bivariate). Now it’s time for the real fun: multivariate analysis. This is where we weave three or more features together and look for patterns we simply couldn’t see before.
The analogy — the blind men and the elephant 🐘
You have probably heard this story. Six blind men each touch a different part of an elephant — trunk, tusk, leg, tail, ear, body. Each one describes something completely different and none of them is wrong. But none of them sees the full picture either.
That is exactly what happens when you look at features in isolation.
- Sex alone tells you about "women and children first"
- Pclass alone tells you about wealth and lifeboat access
- Age alone shows almost nothing (r = -0.05)
But combine all three together? Suddenly you see the elephant. A young boy (Age < 14) in 1st class had very different odds from an old man in 3rd class. Neither Sex, Pclass, nor Age reveals this alone.
That is what multivariate analysis is for—finally seeing the whole elephant.
So, what exactly is multivariate analysis?
Just investigating three or more variables at the same time.
Let us break that down:
| Type | Features involved | Example |
|---|---|---|
| Univariate | 1 | What does the Age distribution look like? (One variable) |
| Bivariate | 2 | How does Age relate to Survived? (Two variables) |
| Multivariate | 3+ | How does Age + Sex + Pclass together relate to Survived? |
Each step reveals patterns the previous step couldn't see.
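To make the table concrete, here is a quick sketch of the three levels as pandas calls, run on a tiny toy frame (the real analysis uses the full Titanic df loaded back in Part 2):

```python
import pandas as pd

# Toy stand-in for the real Titanic df from Part 2
toy = pd.DataFrame({
    'Age':      [22, 35, 8, 60],
    'Sex':      ['male', 'female', 'male', 'female'],
    'Pclass':   [3, 1, 2, 1],
    'Survived': [0, 1, 1, 1],
})

# Univariate: one column on its own
print(toy['Age'].describe())

# Bivariate: one column against the target
print(toy.groupby('Sex')['Survived'].mean())

# Multivariate: several columns crossed at once
print(toy.groupby(['Sex', 'Pclass'])['Survived'].mean())
```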
Step 1: Three-way survival breakdown — Sex + Pclass + Survived
Remember the crosstab from Part 4? Almost all first-class women survived, but only a handful of third-class men did. Let us make that visual with a grouped bar chart.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Survival rate grouped by both Sex and Pclass
survival_grouped = df.groupby(['Pclass', 'Sex'])['Survived'].mean().reset_index()

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(
    data=survival_grouped,
    x='Pclass',
    y='Survived',
    hue='Sex',
    palette={'male': '#378ADD', 'female': '#D85A30'},
    ax=ax
)
ax.set_title('Survival rate by Passenger Class and Sex', fontsize=14, pad=15)
ax.set_xlabel('Passenger class (1 = Upper, 2 = Middle, 3 = Lower)')
ax.set_ylabel('Survival rate (0 = 0%, 1 = 100%)')
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(['1st class', '2nd class', '3rd class'])
ax.legend(title='Gender')
ax.set_ylim(0, 1.1)

# Add percentage labels on each bar
for p in ax.patches:
    if p.get_height() > 0:  # skip the "hidden" zero-height patches
        ax.annotate(
            f'{p.get_height():.0%}',
            (p.get_x() + p.get_width() / 2., p.get_height()),
            ha='center', va='bottom', fontsize=10, fontweight='bold'
        )

plt.tight_layout()
plt.show()
| Class | Female survival | Male survival | Gap |
|---|---|---|---|
| 1st | 97% | 37% | 60 points |
| 2nd | 92% | 16% | 76 points |
| 3rd | 50% | 14% | 36 points |
What pops out? The gender gap is huge, especially in second class—92% of women survived vs just 16% of men. Even third-class women had a survival rate of 50%; men, just 14%. The “women and children first” approach wasn’t just folklore—it is visible in the data.
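If you only want the numbers and not the chart, a pivot table gets you the same three-way breakdown in one call. A minimal sketch on a tiny hand-made sample (the real version runs on the full Titanic df):

```python
import pandas as pd

# Tiny illustrative sample standing in for the full Titanic df
sample = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'female', 'male', 'female', 'male', 'male'],
    'Survived': [1, 0, 1, 1, 0, 1, 0, 0],
})

# Rows = class, columns = sex, cells = mean survival rate
table = pd.pivot_table(sample, values='Survived', index='Pclass',
                       columns='Sex', aggfunc='mean')
print(table)
```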
Step 2: Add Age into the picture — the FacetGrid
Let us complicate things. Using seaborn’s FacetGrid, we can compare age distributions split by both Sex and Pclass, and see how survival changed in every combination.
grid = sns.FacetGrid(
    df,
    row='Pclass',
    col='Sex',
    hue='Survived',
    palette={0: '#E24B4A', 1: '#1D9E75'},
    height=3,
    aspect=1.4
)
grid.map(plt.hist, 'Age', bins=20, alpha=0.6, edgecolor='white')
grid.add_legend(title='Survived', labels=['Did not survive', 'Survived'])

# Add axis labels to every subplot
for ax in grid.axes.flat:
    ax.set_xlabel('Age (years)')
    ax.set_ylabel('Number of passengers')

grid.set_titles(row_template='Class {row_name}', col_template='{col_name}')
grid.figure.suptitle('Age distribution by Class, Sex and Survival', y=1.02, fontsize=13)
plt.tight_layout()
plt.show()
Key Observations from the FacetGrid:
- First-class females — green (survived) bars dominate at almost every age. Near-universal survival.
- Third-class males — red (died) bars dominate at almost every age. Very few survived regardless of age.
- Children (Age < 10) — in 1st and 2nd class, green bars appear in the youngest age bins even for males. The "children first" policy had a small effect in upper classes.
- Elderly passengers — older passengers in 3rd class have almost zero green bars. Age + class + sex combined paints a very different picture than any one feature alone.
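You can back up the "children first" observation with numbers instead of eyeballing histograms. A sketch using a hand-made sample in place of the full df; the Is_Child flag is my own throwaway name, not a column from the dataset:

```python
import pandas as pd

# Tiny hand-made sample standing in for the full Titanic df
sample = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 1, 3],
    'Sex':      ['male', 'male', 'male', 'male', 'female', 'female'],
    'Age':      [8, 40, 9, 35, 6, 70],
    'Survived': [1, 0, 0, 0, 1, 0],
})

# Flag children (illustrative cutoff), then cross the flag with class and sex
sample['Is_Child'] = sample['Age'] < 10
rates = sample.groupby(['Pclass', 'Sex', 'Is_Child'])['Survived'].mean()
print(rates)
```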
Step 3: Engineering features — extracting Title from Name
Back in Part 4, we saw that Age alone barely gave any clues about survival. We also saw that children did much better than adults, but you can't separate them cleanly using the Age column alone. Here is the trick: the Name column contains titles, and a title encodes both age and sex together far more efficiently than either column on its own.
# Extract title from Name
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Check all titles
print(df['Title'].value_counts())
Mr 517
Miss 182
Mrs 125
Master 40 # ← key one: 'Master' = young boys under ~14
Dr 7
Rev 6
...
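To see what the regex is doing in isolation, here it is run on three names in the dataset's "Surname, Title. Given names" format:

```python
import pandas as pd

# Names in the Titanic format: 'Surname, Title. Given names'
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Palsson, Master. Gosta Leonard',
])

# Same pattern as above: a space, a run of letters, then a literal dot
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```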
Some titles need normalising to a common form: Mme (the French abbreviation for Madame) maps to Mrs, while Mlle and Ms map to Miss. That is why the counts differ slightly between the output above and the one below.
# Group rare titles into one bucket
df['Title'] = df['Title'].replace(
    ['Lady', 'Countess', 'Capt', 'Col', 'Don',
     'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare'
)
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
# Survival rate by Title
print(df.groupby('Title')['Survived'].agg(['mean', 'count']).round(2))
Now, look at survival rates by Title:
mean count
Title
Master 0.57 40 # ← young boys: much higher than adult men
Miss 0.70 185 # ← unmarried women + girls: very high
Mr 0.16 517 # ← adult men: lowest by far
Mrs 0.79 126 # ← married women: highest
Rare 0.35 23
Note: This is the power of multivariate thinking. Sex alone gives you male vs female. Age alone gives you a weak signal. But Title gives you Master (young boy, 57%) vs Mr (adult man, 16%) — a massive difference that neither Sex nor Age could show individually. Two features combined into one engineered feature reveals what neither showed alone.
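One quick sanity check worth doing (not shown above) is confirming that Master really does mean young boys, by comparing median ages per title. A sketch on a hand-made sample:

```python
import pandas as pd

# Hand-made sample standing in for the full Titanic df
sample = pd.DataFrame({
    'Title': ['Master', 'Master', 'Mr', 'Mr', 'Miss'],
    'Age':   [4, 12, 35, 50, 22],
})

# Median age per title: 'Master' should sit far below 'Mr'
print(sample.groupby('Title')['Age'].median())
```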
Now plot it visually:
fig, ax = plt.subplots(figsize=(9, 5))
title_survival = df.groupby('Title')['Survived'].mean().sort_values(ascending=False)

# Colour-code bars: green = high, orange = middling, red = low survival
colors = ['#1D9E75' if v >= 0.6 else '#EF9F27' if v >= 0.35 else '#E24B4A'
          for v in title_survival.values]

bars = ax.bar(
    title_survival.index,
    title_survival.values,
    color=colors,
    edgecolor='white',
    width=0.6
)
ax.set_title('Survival rate by passenger title', fontsize=13, pad=12)
ax.set_xlabel('Title extracted from Name column')
ax.set_ylabel('Survival rate (0 = 0%, 1 = 100%)')
ax.set_ylim(0, 1.0)

# Percentage label above each bar
for bar, val in zip(bars, title_survival.values):
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        val + 0.02,
        f'{val:.0%}',
        ha='center', fontsize=11, fontweight='bold'
    )

plt.tight_layout()
plt.show()
Step 4: Family size — creating another multivariate feature
We saw in Part 4 that SibSp and Parch each showed very weak individual correlations with survival. But combined into a single Family_Size feature, a clear non-linear pattern emerges.
# Combine SibSp + Parch into Family_Size
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1 # +1 = the passenger themselves
# Survival rate by family size
family_survival = df.groupby('Family_Size')['Survived'].agg(['mean', 'count'])
family_survival.columns = ['Survival Rate', 'Count']
print(family_survival.round(2))
Survival rates:
Survival Rate Count
Family_Size
1 0.30 537 # ← alone: 30%
2 0.55 161 # ← pair: 55%
3 0.58 102 # ← small family: 58%
4 0.72 29 # ← medium family: 72%!
5 0.20 15 # ← large family: drops to 20%
6 0.14 22
7 0.33 12
8 0.00 6
11 0.00 7 # ← huge family: 0%
fig, ax = plt.subplots(figsize=(10, 5))

# Colour-code bars by survival-rate band
colors = ['#1D9E75' if v >= 0.55 else '#EF9F27' if v >= 0.3 else '#E24B4A'
          for v in family_survival['Survival Rate']]

ax.bar(
    family_survival.index,
    family_survival['Survival Rate'],
    color=colors,
    edgecolor='white',
    width=0.7
)
ax.set_title('Survival rate by family size aboard', fontsize=13, pad=12)
ax.set_xlabel('Family size (passenger + number of relatives aboard)')
ax.set_ylabel('Survival rate (0 = 0%, 1 = 100%)')
ax.set_xticks(family_survival.index)
ax.set_ylim(0, 0.85)

# Percentage label above each bar
for size, row in family_survival.iterrows():
    ax.text(size, row['Survival Rate'] + 0.02,
            f"{row['Survival Rate']:.0%}", ha='center', fontsize=9)

plt.tight_layout()
plt.show()
What does this show? Neither SibSp nor Parch showed a strong survival signal individually. But Family_Size reveals a clear sweet spot at 2–4 people. Small families could coordinate and stay together; solo travellers had nobody helping them; very large families (8+) couldn't move together and had 0% survival. Classic multivariate insight: more than the sum of its parts.
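A common follow-up, which this article does not do, is to collapse Family_Size into a binary flag for solo travellers. The Is_Alone name below is my own choice, not part of the dataset. A minimal sketch:

```python
import pandas as pd

# Tiny sample with SibSp/Parch counts; the real version uses the full Titanic df
sample = pd.DataFrame({
    'SibSp':    [0, 1, 3, 0],
    'Parch':    [0, 2, 4, 0],
    'Survived': [0, 1, 0, 1],
})

# Same Family_Size formula as above: relatives aboard plus the passenger
sample['Family_Size'] = sample['SibSp'] + sample['Parch'] + 1

# Binary flag: 1 = travelling alone, 0 = with family (hypothetical feature name)
sample['Is_Alone'] = (sample['Family_Size'] == 1).astype(int)
print(sample[['Family_Size', 'Is_Alone']])
```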
Step 5: Updated correlation heatmap with engineered features
Now that we have engineered Title and Family_Size, let us re-run the correlation heatmap and see how much it improved compared to the raw features from Part 4.
# Encode categoricals as numbers for correlation
df['Sex_enc'] = (df['Sex'] == 'female').astype(int)
df['Title_enc'] = df['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})

corr_cols = ['Survived', 'Pclass', 'Sex_enc', 'Age', 'Fare',
             'Has_Cabin', 'Family_Size', 'Title_enc']
corr_matrix = df[corr_cols].corr()

fig, ax = plt.subplots(figsize=(9, 7))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt='.2f',
    cmap='RdYlGn',
    center=0,
    square=True,
    linewidths=0.5,
    ax=ax
)
ax.set_title('Correlation heatmap — raw + engineered features', fontsize=13, pad=12)
ax.set_xlabel('Features')
ax.set_ylabel('Features')
plt.tight_layout()
plt.show()
Before vs after engineering:
| Feature | Before (raw) | After (engineered) |
|---|---|---|
| Age | r = -0.05 | r = -0.05 (unchanged) |
| Title_enc | — | r = +0.54 (new — encodes age + sex together) |
| Family_Size | — | r = +0.02 (new — linear r is weak because the pattern is non-linear) |
| Sex_enc | r = +0.54 | r = +0.54 (unchanged) |
| Pclass | r = -0.34 | r = -0.34 (unchanged) |
Engineering Title alone gave us a brand new feature that correlates as strongly with Survived as Sex itself — and it came purely from multivariate thinking.
The complete EDA journey — all 5 parts in one view
Part 1 → WHY EDA matters
Analogies: Sherlock Holmes, the Doctor, the Chef, the Navigator
Titanic dataset introduced — 891 rows, 12 columns
Part 2 → FIRST LOOK
df.shape, df.dtypes, df.describe(), df.isnull().sum()
Reading the describe() output — mean vs median, skewness clues
Categorical columns: Sex, Pclass, Embarked
Part 3 → UNIVARIATE ANALYSIS
Histograms and box plots for Age and Fare
3 types of missing data: MCAR, MAR, MNAR
Fixing nulls correctly for each type
Part 4 → BIVARIATE ANALYSIS
Survival rates: 74% female vs 19% male
Correlation heatmap — which raw features predict survival
Crosstab: 97% of 1st class women survived vs 14% of 3rd class men
Part 5 → MULTIVARIATE ANALYSIS (this article)
Three-way breakdown: Sex + Pclass + Survived
FacetGrid: Age + Sex + Pclass all at once
Engineered features: Title and Family_Size
How combining features reveals what none showed alone
The final definition of EDA
After 5 parts and a real dataset, here is how I now define EDA in one sentence:
EDA is the practice of asking honest questions about your data before you ask it to predict anything.
You ask: what is in here? What is broken? What is missing? What relates to what? And only after you have answered those honestly do you move to modelling.
Skipping EDA is like a surgeon operating without reading your X-ray first. The surgery might go fine. Or they might operate on the wrong leg.
What is coming next — the ML algorithms series
Now that EDA is done, the natural next question is: how do we actually build a model from all these insights?
That is exactly what I am planning for the next series. I am currently learning and will be writing about each algorithm as I understand it:
- Linear Regression — predicting a continuous number (e.g. house prices)
- Logistic Regression — predicting a category (e.g. survived or not)
- Decision Trees — how machines make if-then decisions
- Random Forests — why many trees beat one tree
- And more — as I learn them, I will write them honestly
Each series will follow the same approach as this one — real datasets, real code, real-world analogies, and zero pretending I know more than I do.
Note: I am writing these as I learn. So if something is unclear or you spot a mistake, please drop a comment — your feedback genuinely makes these articles better. We are learning together.
Thank you 🙏
This series took a lot of effort to put together and I truly hope it helped you see EDA not as a boring checkbox, but as the most important habit you can build as a learner.
If you found it useful, pass it on to someone else on the same journey. And stick around for the ML algorithms series—Linear Regression is up next.
See you there!



