Shivappa

(EDA Part-4) Bivariate Analysis — What Actually Predicted Who Survived (on Titanic)?

Part 4 of 5 — Intermediate → Advanced


This is the most exciting part of EDA. We've learned each feature individually in Part 3. Now we ask the real question: who survived, and why?


The analogy — the detective reveals the suspect 🔍

Sherlock has examined every clue individually. Now he starts connecting them.

The mud on the shoes came from the east garden, which only the gardener used between 9 and 11 AM, and the victim was last seen at 10:30...

Each piece connects. That is what bivariate analysis does: it connects features to each other and to the outcome. We stop looking at columns in isolation and start asking: what do two columns tell us together?


What does the real Titanic data say?

Let us look at the actual numbers right out of the dataset:

Group                  Survival rate
Female passengers      74%
Male passengers        19%
1st class passengers   63%
2nd class passengers   47%
3rd class passengers   24%

No complicated model needed. A few lines of pandas are enough to make clear patterns jump out.
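As a rough sketch of where numbers like these come from: because Survived is stored as 0/1, the mean of each group is that group's survival rate. The tiny `toy` DataFrame below is made up for illustration and is not the real Titanic data, so its rates differ from the table above.

```python
import pandas as pd

# Hypothetical toy data: 8 invented passengers, NOT the real Titanic file.
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass":   [1, 2, 3, 3, 1, 2, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Survived is 0/1, so the group mean is the share of survivors in that group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the real dataset, the same two `groupby` lines applied to `df` should produce the rates listed above.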


Step 1: who survived? Feature vs target

import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Sex vs Survival
survival_by_sex = df.groupby('Sex')['Survived'].mean()
bars = axes[0].bar(survival_by_sex.index, survival_by_sex.values,
                   color=['#1D9E75', '#E24B4A'])
axes[0].set_title('Survival rate by Sex')
axes[0].set_ylabel('Survival rate')
# Add percentage labels on bars
for bar, val in zip(bars, survival_by_sex.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, val + 0.01,
                 f'{val:.1%}', ha='center', fontweight='bold')

# Pclass vs Survival
survival_by_class = df.groupby('Pclass')['Survived'].mean()
bars2 = axes[1].bar(['1st', '2nd', '3rd'],
                    survival_by_class.values,
                    color=['#1D9E75', '#EF9F27', '#E24B4A'])
axes[1].set_title('Survival rate by Passenger Class')
for bar, val in zip(bars2, survival_by_class.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, val + 0.01,
                 f'{val:.1%}', ha='center', fontweight='bold')

# Age distribution: survived vs died
df[df['Survived']==1]['Age'].hist(bins=25, ax=axes[2], alpha=0.6,
                                   label='Survived', color='#1D9E75')
df[df['Survived']==0]['Age'].hist(bins=25, ax=axes[2], alpha=0.6,
                                   label='Died', color='#E24B4A')
axes[2].set_title('Age: survived vs died')
axes[2].set_xlabel('Age (years)')
axes[2].legend()

plt.tight_layout()
plt.show()

Graph7

Key insight: 74% of female passengers survived, compared with only 19% of male passengers. In 1st class, 63% survived vs 24% in 3rd class. Sex and Pclass are likely to be among the most important features in any Titanic model.


Step 2: Violin plots — see the full distribution, not just the average

Bar charts show only averages. Violin plots show the full shape of the distribution for each group.

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Fare distribution by survival
sns.violinplot(
    data=df, 
    x='Survived', 
    y='Fare', 
    hue='Survived',
    ax=axes[0], 
    palette=['#E24B4A', '#1D9E75'],
    legend=False
)
axes[0].set_title('Fare distribution: survived vs died')
axes[0].set_xticks([0, 1])
axes[0].set_xticklabels(['Died', 'Survived'])

# Age distribution by survival
sns.violinplot(
    data=df, 
    x='Survived', 
    y='Age', 
    hue='Survived',
    ax=axes[1], 
    palette=['#E24B4A', '#1D9E75'],
    legend=False
)
axes[1].set_title('Age distribution: survived vs died')
axes[1].set_xticks([0, 1])
axes[1].set_xticklabels(['Died', 'Survived'])

plt.tight_layout()
plt.show()

Graph8

What does the Fare violin tell us?
Survivors paid higher fares on average. The violin for "Survived" is thicker at higher fare values. A likely explanation is that passengers who paid more had better cabin locations, closer to the lifeboats, so Fare acts as a proxy for wealth and location on the ship.

What does the Age violin tell us?
The shapes look quite similar! Both groups have a similar age distribution. This matches the very weak linear correlation between Age and Survived (about -0.05 in the heatmap in Step 3). But look carefully: the survived group has a slightly fatter section in the 0-10 range. Kids did get priority. You can see the "women and children first" effect right in the data.
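A violin is a picture of the quantiles of each group, so it can help to print those numbers next to the plot. Here is a minimal sketch on invented data (the `toy` frame is hypothetical, not the Titanic file) that compares medians and quartiles per outcome group:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 0, 1, 1, 1, 1],
    "Fare":     [7.0, 8.0, 8.0, 13.0, 10.0, 26.0, 50.0, 80.0],
    "Age":      [22.0, 30.0, 35.0, 40.0, 4.0, 24.0, 30.0, 38.0],
})

# Quartiles per outcome group: the numbers behind the shape of each violin.
quartiles = toy.groupby("Survived")[["Fare", "Age"]].quantile([0.25, 0.5, 0.75])
print(quartiles)

# Medians alone are often enough for a quick side-by-side comparison.
medians = toy.groupby("Survived")[["Fare", "Age"]].median()
print(medians)
```

In this toy data the median fare differs a lot between the groups while the median age differs only a little, which is the same kind of contrast the two violins show.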


Step 3: The correlation heatmap — who are the strongest predictors?

import numpy as np

# Encode Sex as numeric for correlation analysis
df['Sex_encoded'] = (df['Sex'] == 'female').astype(int)

corr_cols = ['Survived', 'Pclass', 'Sex_encoded', 'Age',
             'SibSp', 'Parch', 'Fare', 'Has_Cabin']
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(9, 7))
mask = np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr_matrix,
            annot=True,
            fmt='.2f',
            cmap='RdYlGn',
            center=0,
            square=True,
            linewidths=0.5,
            mask=mask)
plt.title('Feature correlation matrix — Titanic')
plt.tight_layout()
plt.show()

Graph9

Check out these results (correlation with Survived):

Feature       Correlation   What it means
Sex_encoded   +0.54         Strong predictor: female passengers were more likely to survive
Has_Cabin     +0.32         Having a recorded cabin (mostly upper-class passengers) predicts survival
Fare          +0.26         More money, better outcome
Pclass        -0.34         Higher class number (lower class) predicts death
Age           -0.05         Almost no linear relationship with survival
SibSp         -0.04         Negligible

Remember, this is Pearson correlation, which only measures linear relationships. If an effect is nonlinear, like kids having extra odds, it won’t show up here. That is why you should always pair the heatmap with a closer look at individual groups.
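To make the limitation concrete, here is a small sketch on made-up data. The ages and outcomes in `toy` are hypothetical and were chosen so that survivors and non-survivors have the same mean age, which forces the Pearson correlation to zero even though every child survives. The age cut-off of 12 is also just an assumption for the example.

```python
import pandas as pd

# Hypothetical toy data, constructed so that both outcome groups
# have the same mean age (31), which makes Pearson r equal to zero.
toy = pd.DataFrame({
    "Age":      [2, 6, 20, 25, 30, 35, 40, 45, 50, 57],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Linear view: Pearson correlation between Age and Survived.
r = toy["Age"].corr(toy["Survived"])
print(f"Pearson r = {r:.3f}")

# Group view: bin Age into bands (cut-off of 12 is an assumption) and compare rates.
band = pd.cut(toy["Age"], bins=[0, 12, 100], labels=["child", "adult"])
rate_by_band = toy.groupby(band, observed=True)["Survived"].mean()
print(rate_by_band)
```

The correlation says "nothing here", while the banded rates show a clear gap between children and adults. On the real data the same `pd.cut` plus `groupby` pattern can be applied to `df['Age']`.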


Step 4: The crosstab—the two-line powerhouse

Want the biggest insight with almost no code? Try a crosstab:

import pandas as pd

# Survival rate by Sex AND Pclass combined
pivot = pd.crosstab(
    index=df['Pclass'],
    columns=df['Sex'],
    values=df['Survived'],
    aggfunc='mean'
).round(2)

print(pivot)

Result:

Sex        female    male
Pclass
1           0.97     0.37    ← 97% of 1st class women survived!
2           0.92     0.16
3           0.50     0.14    ← only 14% of 3rd class men survived

This tiny table tells a huge story:

  • First-class women had a 97% survival rate: almost guaranteed rescue.
  • Third-class men had a 14% survival rate, so the large majority of them died.
  • Even for men, being in first class (37%) more than doubled your odds vs third class (14%).
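A rate in a crosstab cell is only as trustworthy as the number of passengers behind it, so it can be worth printing the counts next to the means. The sketch below uses a hypothetical `toy` frame with invented rows, not the real dataset:

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male",
                 "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Same call as above: mean of Survived per (Pclass, Sex) cell.
rates = pd.crosstab(index=toy["Pclass"], columns=toy["Sex"],
                    values=toy["Survived"], aggfunc="mean")

# Without values/aggfunc, crosstab simply counts the rows in each cell.
counts = pd.crosstab(index=toy["Pclass"], columns=toy["Sex"])

print(rates)
print(counts)
```

Reading the two tables together shows which rates rest on many passengers and which rest on only a handful.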

Step 5: Point biserial correlation for binary targets

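For a binary target like Survived, the point-biserial correlation is the usual name for this measure. It is mathematically the same as the Pearson correlation between a 0/1 variable and a numeric one, so the r values match the heatmap. What scipy adds is a p-value for each feature: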

from scipy import stats

numeric_features = ['Age', 'Fare', 'SibSp', 'Parch', 'Has_Cabin', 'Sex_encoded']

print("Point Biserial Correlation with Survived:\n")
# Assumes missing values (e.g. in Age) were already filled; NaNs would give a NaN result
for col in numeric_features:
    corr, pval = stats.pointbiserialr(df['Survived'], df[col])
    significance = "✓ significant" if pval < 0.05 else "✗ not significant"
    print(f"  {col:<16} r={corr:+.3f}   p={pval:.4f}   {significance}")

Result:

Point Biserial Correlation with Survived:

  Age              r=-0.047   p=0.1587   ✗ not significant
  Fare             r=+0.257   p=0.0000   ✓ significant
  SibSp            r=-0.035   p=0.2922   ✗ not significant
  Parch            r=+0.082   p=0.0148   ✓ significant
  Has_Cabin        r=+0.317   p=0.0000   ✓ significant
  Sex_encoded      r=+0.543   p=0.0000   ✓ significant
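These r values line up with the heatmap because the point-biserial coefficient is Pearson correlation written in a different form. As a sketch, the textbook formula r = (M1 - M0) / s * sqrt(p * q) can be checked against `np.corrcoef` with plain NumPy. The arrays below are hypothetical values, not Titanic rows:

```python
import numpy as np

# Hypothetical toy data: a 0/1 outcome and a numeric feature.
survived = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
fare = np.array([7.0, 8.0, 8.0, 13.0, 10.0, 26.0, 50.0, 80.0, 15.0, 30.0])

m1 = fare[survived == 1].mean()   # mean fare of survivors
m0 = fare[survived == 0].mean()   # mean fare of non-survivors
p = survived.mean()               # share of survivors
q = 1 - p                         # share of non-survivors
s = fare.std()                    # population standard deviation (ddof=0)

# Point-biserial formula written out by hand.
r_pb = (m1 - m0) / s * np.sqrt(p * q)

# Ordinary Pearson correlation on the same two arrays.
r_pearson = np.corrcoef(survived, fare)[0, 1]

print(f"point-biserial: {r_pb:+.6f}")
print(f"pearson:        {r_pearson:+.6f}")
```

The two printed numbers agree up to floating-point error. The formula also makes the interpretation clear: r is large only when the two group means are far apart relative to the overall spread.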

So, what did we learn?

Here is the short list, straight from EDA:

Feature     Signal strength   Direction                    Notes
Sex         Very strong       Female = survive             Most important feature
Pclass      Strong            1st = survive                Proxy for wealth + cabin location
Fare        Moderate          Higher = survive             Correlated with Pclass; redundant but useful
Has_Cabin   Moderate          Has cabin = survive          We engineered this from the missing Cabin column
Age         Weak but real     Younger = slight advantage   Non-linear: children specifically helped
SibSp       Not significant   -                            Drop or combine into Family_Size
Parch       Weak              -                            Combine with SibSp
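The table suggests folding SibSp and Parch into a single Family_Size column. One common way to do that, shown here as a sketch on hypothetical rows, is to add the two counts plus 1 for the passenger themselves. The `+ 1` is a convention assumed for this example, not something fixed by the dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 3, 1, 0],
    "Parch":    [0, 0, 2, 2, 1, 0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Siblings/spouses + parents/children + the passenger themselves (assumed convention).
toy["Family_Size"] = toy["SibSp"] + toy["Parch"] + 1

# The combined feature can then be checked against the target like any other.
rate_by_size = toy.groupby("Family_Size")["Survived"].mean()
print(rate_by_size)
```

Applied to the real `df`, the same two lines would give one family feature to examine in place of two weak ones.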

What is next?

In Part 5, we’ll bring all this together with multivariate analysis: combining everything we have learned and building a true predictive model. See you there.

