Akhilesh

Posted on Apr 27

Statistical Visualizations With Seaborn

#ai #python #programming #productivity

You spent the last post fighting with Matplotlib.

Setting figure sizes. Adjusting label positions. Choosing colors. Adding grids. Writing five lines to make one chart look presentable.

Seaborn does not eliminate Matplotlib. It sits on top of it. What it does is handle the common statistical chart types with sensible defaults, so you spend time on insights instead of formatting.

The difference is real. A grouped statistical chart that takes around 20 lines in Matplotlib often takes a single Seaborn call plus a few lines of labels.

Setup

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

sns.set_theme(style="whitegrid", palette="husl")

titanic = sns.load_dataset("titanic")
tips    = sns.load_dataset("tips")
iris    = sns.load_dataset("iris")

print(titanic.shape, tips.shape, iris.shape)

Output:

(891, 15) (244, 7) (150, 5)

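sns.load_dataset() downloads practice datasets from Seaborn's online data repository, so it needs an internet connection the first time, and caches them locally. Titanic, tips, and iris show up in most data science tutorials. They are small, mostly clean (the Titanic age column has missing values), and cover common analysis scenarios.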

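sns.set_theme() applies a consistent visual style to everything that follows. whitegrid is clean and readable. husl gives evenly spaced hues of similar brightness, which keeps categories visually distinct. It is not designed for color-vision deficiency, though. If that matters for your audience, use palette="colorblind" instead.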
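To see what a palette name actually resolves to, you can ask Seaborn for the colors directly. This is a small sketch assuming a reasonably recent Seaborn version; the variable names are just for illustration.

```python
import seaborn as sns

# "husl" spaces hues evenly around the color wheel; ask for as many as you need.
husl_colors = sns.color_palette("husl", 8)

# Seaborn also ships a palette literally named "colorblind".
cb_colors = sns.color_palette("colorblind")

print(len(husl_colors), len(cb_colors))

# Each entry is an (r, g, b) tuple; as_hex() is handy for copying into other tools.
print(cb_colors.as_hex()[:3])
```

Any of these names can be passed as palette= to set_theme() or to an individual plotting call.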

The Plot That Replaces Three Matplotlib Charts

fig, ax = plt.subplots(figsize=(9, 5))

sns.histplot(
    data=titanic,
    x="age",
    hue="survived",
    multiple="stack",
    bins=30,
    ax=ax
)

ax.set_title("Age Distribution by Survival Status", fontsize=14)
ax.set_xlabel("Age")
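# Relabel Seaborn's own legend; ax.legend(labels=...) can mismatch labels and colors.
legend = ax.get_legend()
legend.set_title("Outcome")
for text, label in zip(legend.get_texts(), ["Did Not Survive", "Survived"]):
    text.set_text(label)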

plt.tight_layout()
plt.savefig("age_survival.png", dpi=150)
plt.show()

One histplot call. It draws the histogram, colors it by survival status, stacks the bars, adds a legend. In Matplotlib you would need to split the data manually, draw two separate histograms with matching bin sizes, and align everything.

hue is the key argument in Seaborn. Pass a column name and Seaborn splits and colors the chart by that column automatically.
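To make the "splits automatically" part concrete, here is roughly the manual work that hue saves you, using only NumPy and pandas. The DataFrame is a hypothetical stand-in for the Titanic columns, not the real data, and this is a sketch of the idea rather than Seaborn's internal code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "age" and "survived" columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.uniform(1, 80, size=200),
    "survived": rng.integers(0, 2, size=200),
})

# Shared bin edges so both groups line up bar for bar.
edges = np.histogram_bin_edges(df["age"], bins=30)

# One histogram per group, all using the same edges.
counts = {
    level: np.histogram(group["age"], bins=edges)[0]
    for level, group in df.groupby("survived")
}

# Stacking: the second group's bars start where the first group's bars end.
stacked_top = counts[0] + counts[1]
print(stacked_top.sum())  # every row lands in exactly one bin
```

With hue="survived" and multiple="stack", a single histplot call covers the grouping, the shared bins, and the stacking.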

Boxplot: See Distribution and Outliers at Once

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.boxplot(
    data=titanic,
    x="pclass",
    y="fare",
    hue="survived",
    palette={0: "coral", 1: "steelblue"},  # keys must match the integer values in "survived"
    ax=axes[0]
)
axes[0].set_title("Fare Distribution by Class and Survival")
axes[0].set_xlabel("Passenger Class")
axes[0].set_ylabel("Fare (£)")

sns.violinplot(
    data=titanic,
    x="pclass",
    y="age",
    hue="sex",
    split=True,
    inner="quart",
    ax=axes[1]
)
axes[1].set_title("Age Distribution by Class and Gender")
axes[1].set_xlabel("Passenger Class")
axes[1].set_ylabel("Age")

plt.tight_layout()
plt.savefig("box_violin.png", dpi=150)
plt.show()

A boxplot shows the median (middle line), the interquartile range (box), and outliers (dots beyond the whiskers, which by default reach to the furthest point within 1.5 times the IQR of the box). A compact summary of a distribution, though it hides the shape between those markers.

A violin plot shows what the box hides: a density curve that reveals where values actually cluster within the range. With inner="quart", the quartiles are drawn as lines inside the violin. The split=True argument splits the violin by the hue variable, male on one side, female on the other. Clean way to compare two groups within each category.
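If you want to check the numbers behind a box, they are easy to compute by hand. The sketch below uses made-up values and the default 1.5 x IQR whisker rule.

```python
import numpy as np

# Illustrative values, not taken from the Titanic data.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual dot.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(median, (q1, q3), (whisker_low, whisker_high), outliers)
```

Here the single value of 50 falls outside the upper fence, so it would appear as a lone dot above the whisker.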

Barplot: Mean With Confidence Interval

fig, ax = plt.subplots(figsize=(9, 5))

sns.barplot(
    data=titanic,
    x="pclass",
    y="survived",
    hue="sex",
    errorbar="ci",
    capsize=0.1,
    ax=ax
)

ax.set_title("Survival Rate by Class and Gender", fontsize=14)
ax.set_xlabel("Passenger Class")
ax.set_ylabel("Survival Rate")
ax.set_ylim(0, 1)

plt.tight_layout()
plt.savefig("barplot_survival.png", dpi=150)
plt.show()

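Seaborn's barplot shows the mean of the y column for each group and automatically draws a 95% confidence interval as an error bar, estimated by bootstrapping. Because survived is coded as 0 and 1, the mean here is the survival rate. The error bar tells you how certain the estimate is. A short error bar means the group has enough data for a stable estimate. A long error bar means the estimate is uncertain, usually because the group is small.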

This is the key difference from Matplotlib's bar(). Matplotlib draws whatever value you give it. Seaborn calculates the statistic and its uncertainty for you.
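The idea behind that error bar is a percentile bootstrap: resample the data many times and see how much the mean moves. The sketch below reproduces the idea with NumPy on made-up 0/1 outcomes. It illustrates the technique and is not Seaborn's own implementation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical outcomes for one group: 30 survivors, 20 non-survivors.
outcomes = np.array([1] * 30 + [0] * 20)

# Resample with replacement many times and record each resample's mean.
boot_means = np.array([
    rng.choice(outcomes, size=outcomes.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of those means is the confidence interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean={outcomes.mean():.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

Shrink the array to ten values and the interval widens noticeably, which is exactly what a long error bar on a small group is telling you.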

Heatmap: Correlation at a Glance

numeric_cols = titanic.select_dtypes(include=[np.number]).drop(columns=["survived"])
corr_matrix  = numeric_cols.corr()

fig, ax = plt.subplots(figsize=(8, 6))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    center=0,
    vmin=-1, vmax=1,
    square=True,
    linewidths=0.5,
    ax=ax
)

ax.set_title("Correlation Matrix of Numeric Features", fontsize=13)

plt.tight_layout()
plt.savefig("heatmap.png", dpi=150)
plt.show()

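Every cell shows the Pearson correlation between two variables. Red means positively correlated. Blue means negatively correlated. Near-white means little or no linear relationship.

Before building any machine learning model, run this. If two features are highly correlated with each other (close to 1 or -1), you might only need one of them. If a feature has near-zero correlation with everything, it might not be useful, but treat that as a hint only: correlation measures linear relationships, and this matrix leaves out the target (survived), so it says nothing yet about which features predict survival.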
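On a wide dataset it is easier to list the strongly correlated pairs than to hunt for them by eye. A minimal sketch with pandas follows; the three features are synthetic and the 0.9 cutoff is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=300)

# Hypothetical features: b is nearly a scaled copy of a, c is unrelated noise.
df = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=300),
    "c": rng.normal(size=300),
})

corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Flag pairs whose absolute correlation exceeds the (arbitrary) threshold.
redundant = pairs[pairs.abs() > 0.9]
print(redundant)
```

Only the (a, b) pair should be flagged here, which marks one of the two as a candidate to drop.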

Pairplot: Every Variable Against Every Other

pair_df = iris.copy()

g = sns.pairplot(
    pair_df,
    hue="species",
    diag_kind="kde",
    plot_kws={"alpha": 0.6, "s": 40},
    height=2.5
)

g.figure.suptitle("Iris Feature Relationships", y=1.02, fontsize=14)

plt.savefig("pairplot.png", dpi=150, bbox_inches="tight")
plt.show()

One function call. Every scatter plot combination. Diagonal shows each feature's distribution. Colors by species.

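On a dataset with 10 features, this makes a 10x10 grid in one command: 10 distribution plots on the diagonal and 90 scatter plots, which are 45 unique feature pairs each drawn twice (mirrored across the diagonal). You can scan them quickly to find which feature pairs show the clearest separation between classes. That gives you an early hint about which features will be useful for classification before you even start building a model.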
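Before running a pairplot on a wide dataset, it helps to count how many panels you are about to draw. A quick sketch with hypothetical feature names:

```python
from itertools import combinations

features = [f"f{i}" for i in range(10)]  # ten hypothetical feature names

total_panels = len(features) ** 2               # the full square grid
diagonal_panels = len(features)                 # one distribution per feature
scatter_panels = total_panels - diagonal_panels
unique_pairs = list(combinations(features, 2))  # each pair once, order ignored

print(total_panels, diagonal_panels, scatter_panels, len(unique_pairs))
# prints: 100 10 90 45
```

If the mirrored half is just noise to you, pairplot accepts corner=True to draw only the lower triangle.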

Scatterplot With Regression Line

fig, axes = plt.subplots(1, 2, figsize=(13, 5))

sns.scatterplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="time",
    size="size",
    sizes=(40, 200),
    alpha=0.7,
    ax=axes[0]
)
axes[0].set_title("Tips vs Total Bill")

sns.regplot(
    data=tips,
    x="total_bill",
    y="tip",
    scatter_kws={"alpha": 0.4},
    line_kws={"color": "red", "linewidth": 2},
    ax=axes[1]
)
axes[1].set_title("Tips vs Total Bill with Regression")

plt.tight_layout()
plt.savefig("scatter_reg.png", dpi=150)
plt.show()

scatterplot with size encodes a third variable as point size. Big parties (larger table size) show as bigger dots. hue adds a fourth. That is four variables in one chart.

regplot adds the regression line automatically, with a shaded confidence interval around it. One call replaces np.polyfit + manual line drawing + confidence band calculation.
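For comparison, this is the manual fit that regplot saves you from writing, minus the confidence band. The bills and tips below are synthetic, generated with a known slope so you can see that the fit recovers it.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical bills and tips with an underlying slope of 0.15 and intercept of 1.0.
total_bill = rng.uniform(5, 50, size=200)
tip = 0.15 * total_bill + 1.0 + rng.normal(scale=0.5, size=200)

# A degree-1 polyfit returns (slope, intercept) of the least-squares line.
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Points along the fitted line, ready for ax.plot(x_line, y_line).
x_line = np.linspace(total_bill.min(), total_bill.max(), 100)
y_line = slope * x_line + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

The shaded band would still need a bootstrap loop over the fit on top of this, which is the part that makes the one-call version attractive.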

Catplot: The Swiss Army Knife

g = sns.catplot(
    data=titanic,
    x="pclass",
    y="age",
    col="survived",
    kind="box",
    height=5,
    aspect=0.8,
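    hue="pclass",      # recent Seaborn versions warn when palette is passed without hue
    palette="Set2",
    dodge=False,       # keep each box centered on its class
    legend=False       # the x axis already names the classes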
)

g.set_axis_labels("Passenger Class", "Age")
g.set_titles("Survived: {col_name}")
g.figure.suptitle("Age Distribution by Class and Survival", y=1.03)

plt.savefig("catplot.png", dpi=150, bbox_inches="tight")
plt.show()

catplot is a figure-level function. The col argument splits the chart into separate panels, one per unique value of the column. kind switches the chart type inside each panel. One call creates a multi-panel comparison that would take dozens of lines in Matplotlib.

Figure-level functions (catplot, relplot, displot, lmplot) return a FacetGrid object instead of an Axes, and they create their own figure, so they do not take an ax argument. Use g.figure.suptitle() instead of ax.set_title(). Use g.set_axis_labels() instead of ax.set_xlabel().
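The difference in return types is easy to check for yourself. The sketch below uses a tiny made-up DataFrame and an off-screen backend so it runs anywhere; drop the matplotlib.use line if you want a window.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny hypothetical dataset, just enough to draw two panels.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a", "b", "a", "b"],
    "value": [1.0, 2.0, 1.5, 2.5, 3.0, 4.0, 3.5, 4.5],
    "panel": ["x", "x", "x", "x", "y", "y", "y", "y"],
})

# Axes-level: you supply the Axes and get the same Axes back.
fig, ax = plt.subplots()
returned_ax = sns.boxplot(data=df, x="group", y="value", ax=ax)

# Figure-level: Seaborn builds the figure and returns a grid object.
g = sns.catplot(data=df, x="group", y="value", col="panel", kind="box")

print(returned_ax is ax, type(g).__name__)
print(g.axes.shape)  # one Axes per panel

plt.close("all")
```

The individual panels are still ordinary Matplotlib Axes, reachable through g.axes, so per-panel tweaks remain possible.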

The Key Arguments You Will Use Constantly

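# hue:      color by a column (categorical or numeric)
# size:     marker or line size by a column (relational plots)
# style:    marker or dash style by a categorical column (relational plots)
# col:      one panel per value, laid out in columns (figure-level functions)
# row:      one panel per value, laid out in rows (figure-level functions)
# palette:  color scheme ("husl", "Set2", "viridis", "coolwarm", etc.)
# alpha:    transparency 0-1
# ax:       which matplotlib axes to draw on (axes-level functions)
# height:   height of each panel in inches (figure-level functions)
# aspect:   panel width = height * aspect (figure-level functions)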

Almost every Seaborn plotting function accepts hue. The relational functions (scatterplot, lineplot, relplot) also accept size and style, so you can encode up to five variables in one chart. Know when to stop. x, y, and one or two extra encodings is usually the limit before it gets confusing.
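Here is what combining hue, style, and size looks like on a scatterplot. The data is random and the column names are made up for the sketch; the off-screen backend is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(3)
n = 60

# Hypothetical data: two numeric axes plus three extra columns to encode.
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
    "group": rng.choice(["a", "b"], size=n),
    "kind": rng.choice(["p", "q"], size=n),
    "weight": rng.integers(1, 6, size=n),
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.scatterplot(
    data=df, x="x", y="y",
    hue="group",      # color
    style="kind",     # marker shape
    size="weight",    # marker area
    sizes=(20, 200),
    ax=ax,
)

# The legend gains one section per encoding.
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)

plt.close(fig)
```

The length of that legend is a decent readability check: if it needs three sections, the chart is probably asking too much of the reader.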

Seaborn Plus Matplotlib Together

Seaborn draws on a Matplotlib figure. You can always drop back to Matplotlib to add things Seaborn cannot do directly.

fig, ax = plt.subplots(figsize=(9, 5))

sns.kdeplot(
    data=titanic,
    x="age",
    hue="pclass",
    fill=True,
    alpha=0.4,
    ax=ax
)

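median_age = titanic["age"].median()
ax.axvline(x=median_age, color="black", linestyle="--", linewidth=1.5)
# Label the line directly; calling ax.legend() here would replace Seaborn's class legend.
ax.text(median_age + 1, 0.95, f"Median age: {median_age:.0f}",
        transform=ax.get_xaxis_transform(), va="top")

ax.set_title("Age Distribution by Passenger Class", fontsize=14)
ax.set_xlabel("Age")
ax.set_ylabel("Density")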

plt.tight_layout()
plt.savefig("kde.png", dpi=150)
plt.show()

Seaborn draws the KDE curves. Matplotlib adds the reference line. They share the same ax. This is the standard workflow. Use Seaborn for the statistical chart, Matplotlib for custom annotations and lines.

A Resource Worth Your Time

Michael Waskom, who created Seaborn, wrote detailed explanations of the design decisions behind each chart type in the Seaborn documentation at seaborn.pydata.org. Not just reference docs. Actual explanations of when to use each chart and why. Most library documentation does not come close to this quality.

Towards Data Science has a piece by Naveen Venkatesan called "Seaborn: The Most Useful Python Visualization Library" that walks through real datasets with practical examples and covers the exact scenarios that come up in data science work. Search "Naveen Venkatesan Seaborn useful visualization library".

Try This

Create seaborn_practice.py.

Load the tips dataset: tips = sns.load_dataset("tips").

Build six charts: five in a 2x3 grid of subplots, plus one pairplot saved on its own. Each one should answer a different question about the data.

Top row: who tips better, smokers or non-smokers? Use a boxplot. Which day has the highest average total bill? Use a barplot with confidence intervals. Is there a relationship between total bill and tip on weekends vs weekdays? Use a scatterplot with hue="day".

Bottom row: what does the distribution of tips look like separated by gender (the sex column)? Use a violinplot. Is tip percentage (tip/total_bill) higher at lunch or dinner? Calculate the column, then use a histplot with hue="time". Leave the last panel of the grid empty or hide it. Then show the full pairplot of numeric columns colored by time of day, and save it as a separate file since pairplot creates its own figure.

Give everything proper titles, axis labels, and a consistent color palette.

What's Next

Static charts have limits. The next post is Plotly, which makes charts interactive. Hover tooltips. Zoom. Pan. Dropdown filters. Charts that live in a browser and respond to user input. The kind of visualization you actually send to stakeholders.
