Bala Priya C

Descriptive Statistics with Python for Beginner Data Scientists

You've probably heard that data scientists spend most of their time cleaning and exploring data... not building fancy models. That's true, and descriptive statistics is the foundation of that exploration.

Before you can ask "what does this data predict?", you need to ask "what does this data look like?" Descriptive statistics gives you the tools to answer that. It tells you where your data is centered, how spread out it is, and whether it's shaped in a way that might cause problems for the models you'll build later.

In this article, you'll work through the core concepts of descriptive statistics using Python, pandas, and matplotlib. Along the way you'll build intuition — not just know which function to call, but understand what the numbers are actually telling you.

If you'd like, you can follow along with the Google Colab notebook.


Prerequisites

Before reading this article, you should be comfortable with:

  • Basic Python (variables, lists, loops, functions)
  • What a DataFrame is in pandas (you don't need to be an expert; knowing pd.DataFrame() and df.head() is enough)
  • Installing packages with pip

You do not need any prior statistics knowledge. We'll build everything from scratch.


Table of Contents

  1. What Is Descriptive Statistics?
  2. The Dataset
  3. Measures of Central Tendency — Mean and Median
  4. Measures of Spread — Std Dev, Variance, IQR
  5. The Five-Number Summary
  6. Skewness — Is Your Distribution Symmetric?
  7. Putting It All Together
  8. Which Chart for Which Job?
  9. Summary

What Is Descriptive Statistics?

Descriptive statistics is the practice of summarizing and describing the features of a dataset. Unlike inferential statistics — which draws conclusions about a larger population from a sample — descriptive statistics just tells you about the data you actually have in front of you.

Think of it this way: if a colleague hands you a spreadsheet of 10,000 sales transactions and asks "what do you see?", you can't read every row. Descriptive statistics is how you compress that spreadsheet into a handful of numbers that tell the real story.

There are three categories of descriptive statistics you'll use often:

| Category | What it answers | Examples |
| --- | --- | --- |
| Central tendency | Where is the "middle" of my data? | mean, median, mode |
| Spread / variability | How spread out is the data? | variance, std dev, range, IQR |
| Shape | What does the distribution look like? | skewness, kurtosis |

Let's work through each one.


The Dataset

We'll use sales records from 12 retail store branches across a single quarter. Each row represents one branch's performance: total revenue, units sold, and product return rate.

This is small enough to reason about by hand but realistic enough to make the statistics meaningful.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = {
    "branch": ["Atlanta", "Boston", "Chicago", "Dallas",
               "Denver", "Houston", "Miami", "Nashville",
               "Phoenix", "Portland", "Seattle", "Tampa"],
    "revenue_usd": [142000, 198000, 175000, 161000,
                    134000, 189000, 210000, 123000,
                    155000, 168000, 202000, 139000],
    "units_sold": [940, 1280, 1105, 1020,
                   870, 1190, 1340, 790,
                   985, 1060, 1295, 855],
    "return_rate_pct": [3.1, 2.4, 2.9, 4.2,
                        3.8, 2.1, 1.9, 5.1,
                        3.4, 2.7, 2.2, 4.6]
}

df = pd.DataFrame(data)
print(df.head())

Output:

    branch  revenue_usd  units_sold  return_rate_pct
0  Atlanta       142000         940              3.1
1   Boston       198000        1280              2.4
2  Chicago       175000        1105              2.9
3   Dallas       161000        1020              4.2
4   Denver       134000         870              3.8

Measures of Central Tendency

Central tendency tells you the "typical" value in your data — the value around which everything else clusters.

Mean

The mean is the sum of all values divided by the count. It's the most commonly used measure of center, but it has a weakness: it is sensitive to outliers.

If one branch had an exceptionally strong quarter at $900,000, the mean would jump and no longer represent the typical branch.

Median

The median is the middle value when your data is sorted. Half the values fall below it, half above.

Because it only cares about position — and not magnitude — extreme values don't move it. This makes it more robust than the mean.
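
To make the contrast concrete, here is a minimal sketch in plain Python using the standard-library `statistics` module. The revenue figures are the article's; the $900,000 quarter is a hypothetical value swapped in for the top branch purely for illustration.

```python
from statistics import mean, median

# Quarterly revenue for the 12 branches (same values as the article's dataset).
revenue = [142000, 198000, 175000, 161000, 134000, 189000,
           210000, 123000, 155000, 168000, 202000, 139000]

# Hypothetical scenario: the top branch (210,000) reports 900,000 instead.
with_outlier = [900000 if v == 210000 else v for v in revenue]

for label, values in [("Original", revenue), ("With outlier", with_outlier)]:
    print(f"{label:<13} mean: {mean(values):>9,.0f}   median: {median(values):>9,.0f}")
```

The mean moves by tens of thousands of dollars, while the median stays where it was, because the swapped value was already the largest one and its position in the sorted order did not change.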

A useful rule of thumb: if the mean and median are far apart relative to the spread of the data, something interesting is going on. It could be outliers, skewness, or both.

Let's compute both:

revenue = df["revenue_usd"]

mean_rev   = revenue.mean()
median_rev = revenue.median()

print(f"Mean revenue:   ${mean_rev:,.0f}")
print(f"Median revenue: ${median_rev:,.0f}")

Output:

Mean revenue:   $166,333
Median revenue: $164,500

The two values are close — only $1,833 apart. That tells us no single branch is substantially distorting the average. The data is reasonably balanced.

Visualizing Mean vs Median

A bar chart with reference lines is the clearest way to see this relationship in action. Bars below the mean are colored blue, above it green, so you can immediately see which branches are performing above and below average.

fig, ax = plt.subplots(figsize=(10, 5))

colors = ["#4a90d9" if v < mean_rev else "#2ecc71"
          for v in df["revenue_usd"]]

ax.bar(df["branch"], df["revenue_usd"] / 1000,
       color=colors, edgecolor="white", linewidth=0.8)

ax.axhline(mean_rev / 1000, color="#e74c3c", linewidth=2,
           linestyle="--", label=f"Mean: ${mean_rev/1000:.0f}k")
ax.axhline(median_rev / 1000, color="#f39c12", linewidth=2,
           linestyle="-.", label=f"Median: ${median_rev/1000:.0f}k")

ax.set_title("Q3 Revenue by Branch with Mean and Median",
             fontsize=14, fontweight="bold")
ax.set_xlabel("Branch")
ax.set_ylabel("Revenue (USD thousands)")
ax.legend()
ax.tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.savefig("revenue_central_tendency.png", dpi=150)
plt.show()

Output:

Q3 Revenue by Branch with Mean and Median

In this dataset, Nashville has the lowest revenue, while Miami, Seattle, and Boston lead. The two reference lines sit almost on top of each other, which confirms that no outlier is pulling the average off-center.


Measures of Spread

Two datasets can have identical means and still look completely different. One might have all values clustered tightly around the center; another might be all over the place. Measures of spread capture that difference.

Standard Deviation

Standard deviation tells you the typical distance of a data point from the mean. A small std dev means values are packed tightly together. A large one means they're spread out widely.

It's expressed in the same units as your data — dollars here — which makes it easy to interpret directly.

Variance

Variance is the square of the standard deviation. It's used heavily in statistical formulas behind the scenes, but its units are squared (dollars²), which makes it awkward to communicate. Day-to-day, stick with std dev.
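
As a rough illustration of what these two numbers are made of, the sketch below computes variance and standard deviation by hand in plain Python. It assumes the sample convention (dividing by n − 1), which is what pandas uses by default in `.var()` and `.std()` (`ddof=1`); NumPy's `np.var` and `np.std` default to dividing by n instead, so the two libraries can disagree slightly on small datasets.

```python
import math

revenue = [142000, 198000, 175000, 161000, 134000, 189000,
           210000, 123000, 155000, 168000, 202000, 139000]

n = len(revenue)
mean_rev = sum(revenue) / n

# Sum of squared distances from the mean.
squared_devs = sum((v - mean_rev) ** 2 for v in revenue)

# Sample convention: divide by n - 1 (pandas default, ddof=1).
sample_var = squared_devs / (n - 1)
sample_std = math.sqrt(sample_var)

# Population convention: divide by n (NumPy default, ddof=0).
pop_var = squared_devs / n
pop_std = math.sqrt(pop_var)

print(f"Sample variance:     {sample_var:,.0f} (USD squared)")
print(f"Sample std dev:      {sample_std:,.0f} USD")
print(f"Population std dev:  {pop_std:,.0f} USD")
```

With only 12 branches the choice of divisor is visible in the result; on large datasets the difference becomes negligible.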

Range and IQR

The range (max − min) is simple but not robust: a single extreme value changes it completely.

The IQR (interquartile range) is the distance between the 25th and 75th percentiles. It describes the spread of the middle 50% of your data. Because it ignores the top and bottom 25%, it's resistant to outliers — just like the median.
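
With only 12 values, the 25th and 75th percentiles usually fall between two data points. The sketch below shows one way to compute them by hand, assuming linear interpolation between the two nearest sorted values, which is the default behaviour of pandas' `.quantile()`. The helper name `quantile_linear` is made up for this example.

```python
import math

def quantile_linear(values, q):
    """Quantile q (0..1) using linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)              # fractional index into the sorted data
    lower = math.floor(pos)
    upper = min(lower + 1, len(ordered) - 1)  # stay in bounds when q == 1
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])

revenue = [142000, 198000, 175000, 161000, 134000, 189000,
           210000, 123000, 155000, 168000, 202000, 139000]

q1 = quantile_linear(revenue, 0.25)
q3 = quantile_linear(revenue, 0.75)
print(f"Q1 = {q1:,.0f}, Q3 = {q3:,.0f}, IQR = {q3 - q1:,.0f}")
```

Other interpolation rules exist and give slightly different quartiles on small samples, so it is worth knowing which one your tool uses before comparing numbers across libraries.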

std_rev   = revenue.std()
var_rev   = revenue.var()
range_rev = revenue.max() - revenue.min()
iqr_rev   = revenue.quantile(0.75) - revenue.quantile(0.25)

print(f"Std deviation: ${std_rev:,.0f}")
print(f"Variance:      {var_rev:,.0f} (USD squared)")
print(f"Range:         ${range_rev:,.0f}")
print(f"IQR:           ${iqr_rev:,.0f}")

Output:

Std deviation: $28,908
Variance:      835,696,970 (USD squared)
Range:         $87,000
IQR:           $50,000

The typical branch deviates from the mean by about $29k, and the middle 50% of branches span a $50k revenue range. These are the numbers you'd quote in a report. The variance of roughly 836 million squared dollars, while mathematically correct, would confuse anyone you hand it to.

Visualizing Spread: The Box Plot

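The box plot helps visualize the spread of the data. It encodes five numbers in one chart: the minimum, Q1, median, Q3, and maximum.

  • The box represents the IQR, from Q1 to Q3.
  • The line inside the box is the median.
  • The whiskers extend to the most extreme values within 1.5 × IQR of the box (matplotlib's default). When there are no outliers, they end at the minimum and maximum.
  • Points beyond the whiskers are potential outliers.
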
fig, axes = plt.subplots(1, 3, figsize=(12, 5))

cols   = ["revenue_usd", "units_sold", "return_rate_pct"]
labels = ["Revenue (USD)", "Units Sold", "Return Rate (%)"]
colors = ["#4a90d9", "#2ecc71", "#e67e22"]

for ax, col, label, color in zip(axes, cols, labels, colors):
    bp = ax.boxplot(df[col], patch_artist=True, widths=0.5,
                    medianprops=dict(color="black", linewidth=2))
    bp["boxes"][0].set_facecolor(color)
    bp["boxes"][0].set_alpha(0.7)
    ax.set_title(label, fontsize=12, fontweight="bold")
    ax.set_xticks([])

plt.suptitle("Spread of Key Metrics Across Branches",
             fontsize=14, fontweight="bold", y=1.02)
plt.tight_layout()
plt.savefig("boxplots_spread.png", dpi=150)
plt.show()

Output:

Three side-by-side box plots. Revenue (blue) and Units Sold (green) have compact, roughly symmetric boxes with the median near the center. Return Rate (orange) has an asymmetric box skewed upward.
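
None of the three plots shows points beyond the whiskers. A quick way to check that numerically is to compute the whisker limits yourself. The sketch below assumes matplotlib's default rule (whiskers reach at most 1.5 × IQR beyond the box) and uses the standard library's `statistics.quantiles` with `method="inclusive"` (Python 3.8+), which interpolates the quartiles the same way as the values reported in this article. The function name `outliers_beyond_whiskers` is hypothetical.

```python
from statistics import quantiles

columns = {
    "revenue_usd": [142000, 198000, 175000, 161000, 134000, 189000,
                    210000, 123000, 155000, 168000, 202000, 139000],
    "units_sold": [940, 1280, 1105, 1020, 870, 1190,
                   1340, 790, 985, 1060, 1295, 855],
    "return_rate_pct": [3.1, 2.4, 2.9, 4.2, 3.8, 2.1,
                        1.9, 5.1, 3.4, 2.7, 2.2, 4.6],
}

def outliers_beyond_whiskers(values, k=1.5):
    """Return (low limit, high limit, values outside the limits)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [v for v in values if v < low or v > high]

for name, values in columns.items():
    low, high, flagged = outliers_beyond_whiskers(values)
    print(f"{name:<16} limits = [{low:,.2f}, {high:,.2f}]  outliers = {flagged}")
```

Every column comes back with an empty outlier list, which matches what the box plots show.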


The Five-Number Summary

The five-number summary — min, Q1, median, Q3, max — gives you a compact, complete picture of a column's spread. In pandas, describe() gives you this plus the mean and standard deviation for every numeric column at once.

Make this the first thing you run on any new dataset:

print(df[["revenue_usd", "units_sold", "return_rate_pct"]].describe().round(2))

Output:

       revenue_usd  units_sold  return_rate_pct
count        12.00       12.00            12.00
mean     166333.33     1060.83             3.20
std       28908.42      184.29             1.04
min      123000.00      790.00             1.90
25%      141250.00      922.50             2.35
50%      164500.00     1040.00             3.00
75%      191250.00     1212.50             3.90
max      210000.00     1340.00             5.10

A few things stand out:

  • The return rate range (1.9% to 5.1%) is wide relative to its mean of 3.2%
  • Revenue mean ($166k) and median ($164.5k) are close — no major outlier distortion
  • The count row is your quick data quality check: if any column shows fewer rows than expected, you have missing data

Skewness

Skewness measures whether your distribution is symmetric around the mean or has a longer "tail" pulling to one side.

  • Skewness ≈ 0 — roughly symmetric; the normal distribution is the classic example
  • Skewness > 0 — right-skewed; a few very high values stretch the tail to the right
  • Skewness < 0 — left-skewed; a few very low values stretch the tail to the left

Why does this matter? Many statistical tests, and some machine learning algorithms, assume your data is roughly symmetric (often, approximately normal). If it isn't, results can be unreliable. Knowing skewness early lets you decide whether to transform the data before going further.

for col in ["revenue_usd", "units_sold", "return_rate_pct"]:
    skew = df[col].skew()
    print(f"{col:<25} skewness = {skew:.3f}")

Output:

revenue_usd               skewness = 0.095
units_sold                skewness = 0.167
return_rate_pct           skewness = 0.549
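
If you're curious where those numbers come from, here is a by-hand sketch in plain Python. It assumes the bias-adjusted sample skewness (the third central moment divided by the second moment to the power 1.5, then scaled by a small-sample correction), which is the convention pandas' `.skew()` follows. The function name `sample_skewness` is made up for this example.

```python
import math

def sample_skewness(values):
    """Bias-adjusted sample skewness (needs at least 3 values)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # uncorrected skewness
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)    # small-sample correction

revenue = [142000, 198000, 175000, 161000, 134000, 189000,
           210000, 123000, 155000, 168000, 202000, 139000]
return_rate = [3.1, 2.4, 2.9, 4.2, 3.8, 2.1,
               1.9, 5.1, 3.4, 2.7, 2.2, 4.6]

print(f"revenue_usd      skewness = {sample_skewness(revenue):.3f}")
print(f"return_rate_pct  skewness = {sample_skewness(return_rate):.3f}")
```

Cubing the deviations is what gives skewness its sign: values far above the mean contribute large positive terms, values far below contribute large negative ones, and whichever side dominates decides the direction.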

Visualizing Skewness: The Histogram

Histograms are the best way to see skewness. The shape of the bars tells you where data clusters and where it trails off. Overlaying the mean and median makes the direction of any skew immediately visible — when they're apart, the tail is on the side of the mean.

fig, axes = plt.subplots(1, 3, figsize=(13, 4))

cols   = ["revenue_usd", "units_sold", "return_rate_pct"]
labels = ["Revenue (USD)", "Units Sold", "Return Rate (%)"]
colors = ["#4a90d9", "#2ecc71", "#e67e22"]

for ax, col, label, color in zip(axes, cols, labels, colors):
    ax.hist(df[col], bins=6, color=color, edgecolor="white",
            linewidth=0.8, alpha=0.85)

    mean_val   = df[col].mean()
    median_val = df[col].median()
    ax.axvline(mean_val,   color="#e74c3c", linestyle="--",
               linewidth=1.8, label="Mean")
    ax.axvline(median_val, color="black",   linestyle="-.",
               linewidth=1.8, label="Median")

    skew_val = df[col].skew()
    ax.set_title(f"{label}\nskew = {skew_val:.3f}",
                 fontsize=11, fontweight="bold")
    ax.legend(fontsize=9)

plt.suptitle("Histograms with Mean vs Median — Detecting Skewness",
             fontsize=13, fontweight="bold", y=1.03)
plt.tight_layout()
plt.savefig("histograms_skewness.png", dpi=150)
plt.show()

Output:

Three histograms side by side.

Look at the return rate panel on the right. The mean (red dashed line) sits to the right of the median (black dash-dot). That gap is the visual fingerprint of positive skew. Revenue and units sold, by contrast, are close to symmetric, with skewness below 0.2.
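
To see what the bars in the return rate panel are counting, the sketch below reproduces the binning by hand. It assumes the behaviour of `ax.hist()` when `bins` is an integer and no range is given: the span from the minimum to the maximum is cut into equal-width bins, and the largest value falls into the last bin. The helper `equal_width_counts` is a hypothetical name.

```python
def equal_width_counts(values, bins):
    """Split [min, max] into equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum sits exactly on the top edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts

return_rate = [3.1, 2.4, 2.9, 4.2, 3.8, 2.1,
               1.9, 5.1, 3.4, 2.7, 2.2, 4.6]

edges, counts = equal_width_counts(return_rate, bins=6)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:4.2f} to {right:4.2f}: {'#' * count}")
```

Most branches land in the lower bins and only a few in the higher ones, which is the right-hand tail that the positive skewness value describes. With just 12 data points the picture is sensitive to the number of bins, so treat it as a rough guide.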


Putting It All Together

Here's a reusable function that prints the key descriptive stats for any numeric column:

def summarize(series, label):
    print(f"\n── {label} ──")
    print(f"  Mean:    {series.mean():>10.2f}")
    print(f"  Median:  {series.median():>10.2f}")
    print(f"  Std Dev: {series.std():>10.2f}")
    print(f"  IQR:     {series.quantile(0.75) - series.quantile(0.25):>10.2f}")
    print(f"  Skew:    {series.skew():>10.3f}")

summarize(df["revenue_usd"],     "Revenue (USD)")
summarize(df["units_sold"],      "Units Sold")
summarize(df["return_rate_pct"], "Return Rate (%)")

Output:

── Revenue (USD) ──
  Mean:     166333.33
  Median:   164500.00
  Std Dev:   28908.42
  IQR:       50000.00
  Skew:         0.095

── Units Sold ──
  Mean:       1060.83
  Median:     1040.00
  Std Dev:     184.29
  IQR:         290.00
  Skew:         0.167

── Return Rate (%) ──
  Mean:          3.20
  Median:        3.00
  Std Dev:       1.04
  IQR:           1.55
  Skew:         0.549

Drop this function into any exploratory notebook and run it on every column before you analyze the data further.


Which Chart for Which Job?

Here's an overview of which chart you should use:

| What you want to show | Best chart | Key method |
| --- | --- | --- |
| Distribution shape and skewness | Histogram | ax.hist() |
| Spread, median, and outliers | Box plot | ax.boxplot() |
| Comparing values across categories | Bar chart | ax.bar() |
| Mean vs median on a bar chart | Bar + reference lines | ax.axhline() |

Summary

| Concept | What it tells you | pandas method |
| --- | --- | --- |
| Mean | Average value | .mean() |
| Median | Middle value (outlier-robust) | .median() |
| Std deviation | Typical distance from the mean | .std() |
| IQR | Spread of the middle 50% | .quantile() |
| Skewness | Symmetry of the distribution | .skew() |
| Five-number summary | Full spread at a glance | .describe() |

What's Next?

In this article you learned to describe your data with numbers and charts. But there's a deeper question behind all of this: why do values cluster and spread the way they do? The answer lies in probability distributions, and that's exactly what we'll cover in the next article in this series.
