Before you train a single model, you need to understand your data.
Not a vague understanding. A precise one.
Where do the values cluster? How spread out are they? Are there extreme values dragging everything off-center? Is the data balanced or heavily skewed to one side?
If you skip this step, you build models on data you do not understand. They behave strangely. They perform worse than they should. You cannot debug them because you do not have a clear picture of what went in.
Four numbers fix most of this. Mean. Median. Variance. Standard deviation. Not glamorous. Genuinely essential.
Mean: The Center
The mean is the average. Add everything up. Divide by how many there are.
import numpy as np
scores = [72, 85, 91, 68, 78, 95, 62, 88, 74, 83]
total = sum(scores)
count = len(scores)
mean = total / count
print(f"Scores: {scores}")
print(f"Sum: {total}")
print(f"Count: {count}")
print(f"Mean: {mean:.1f}")
print(f"NumPy check: {np.mean(scores):.1f}")
Output:
Scores: [72, 85, 91, 68, 78, 95, 62, 88, 74, 83]
Sum: 796
Count: 10
Mean: 79.6
NumPy check: 79.6
The mean of this class is 79.6. Sounds reasonable. Most students scored around there.
But the mean has a weakness. It gets pulled toward extreme values.
salaries = [45000, 52000, 48000, 55000, 51000, 2500000]
print(f"Salaries: {salaries}")
print(f"Mean: {np.mean(salaries):,.0f}")
Output:
Salaries: [45000, 52000, 48000, 55000, 51000, 2500000]
Mean: 458,500
Five people earn around 50,000. One person earns 2,500,000. The mean is 458,500. That number represents nobody in the room. The millionaire dragged the average far from where most people actually sit.
This is why mean alone is not enough.
Median: The Middle
The median is the middle value when everything is sorted. Half the values are above it, half below. Extreme values cannot drag it anywhere.
salaries = [45000, 52000, 48000, 55000, 51000, 2500000]
sorted_salaries = sorted(salaries)
n = len(sorted_salaries)
if n % 2 == 1:
    median = sorted_salaries[n // 2]
else:
    median = (sorted_salaries[n//2 - 1] + sorted_salaries[n//2]) / 2
print(f"Sorted: {sorted_salaries}")
print(f"Median: {median:,.0f}")
print(f"NumPy check: {np.median(salaries):,.0f}")
Output:
Sorted: [45000, 48000, 51000, 52000, 55000, 2500000]
Median: 51,500
NumPy check: 51,500
Median is 51,500. That actually represents most people in the group. The millionaire had zero effect on it.
When to use which: the mean works well when data is roughly symmetric with no extreme outliers. The median is better when there are outliers or skewed distributions. Income, house prices, and response times are almost always summarized with the median for this reason.
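A quick way to internalize the rule: run both statistics on a symmetric dataset and a skewed one. This is a small sketch with made-up numbers:

```python
import numpy as np

# Symmetric data: mean and median land in the same place
symmetric = np.array([10, 12, 14, 16, 18])

# Skewed data: one large value drags the mean, but not the median
skewed = np.array([10, 12, 14, 16, 180])

print(np.mean(symmetric), np.median(symmetric))  # 14.0 14.0
print(np.mean(skewed), np.median(skewed))        # 46.4 14.0
```

When the two disagree badly, that disagreement itself is a signal: the data is skewed or contains outliers.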
Variance: How Spread Out Is the Data
Mean tells you the center. Variance tells you how far values typically are from that center.
Two classes can have the same mean but very different spreads.
class_a = [75, 76, 78, 79, 80, 81, 82, 84, 85]
class_b = [40, 52, 61, 72, 80, 88, 94, 100, 100]
print(f"Class A mean: {np.mean(class_a):.1f}")
print(f"Class B mean: {np.mean(class_b):.1f}")
Output:
Class A mean: 80.0
Class B mean: 76.3
Close means. Now look at the spread.
Computing variance manually so you see what it is actually measuring:
data = class_a
mean = np.mean(data)
deviations = [x - mean for x in data]
squared_deviations = [d**2 for d in deviations]
variance = sum(squared_deviations) / len(data)
print(f"Values: {data}")
print(f"Mean: {mean:.1f}")
print(f"Deviations from mean: {[round(d, 1) for d in deviations]}")
print(f"Squared deviations: {[round(d, 1) for d in squared_deviations]}")
print(f"Variance: {variance:.2f}")
print(f"NumPy check: {np.var(data):.2f}")
Output:
Values: [75, 76, 78, 79, 80, 81, 82, 84, 85]
Mean: 80.0
Deviations from mean: [-5.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 5.0]
Squared deviations: [25.0, 16.0, 4.0, 1.0, 0.0, 1.0, 4.0, 16.0, 25.0]
Variance: 10.22
NumPy check: 10.22
Variance measures average squared distance from the mean. Squaring does two things: makes all values positive (negative and positive deviations cancel out without squaring) and penalizes large deviations more than small ones.
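You can verify the cancellation claim directly. A small check, reusing Class A's values:

```python
import numpy as np

data = np.array([75, 76, 78, 79, 80, 81, 82, 84, 85])
deviations = data - np.mean(data)

# Without squaring, positive and negative deviations cancel to zero,
# so their plain average says nothing about spread
print(deviations.sum())          # 0.0
# Squaring removes the signs; the average squared deviation is the variance
print((deviations ** 2).mean())  # ~10.22
```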
print(f"\nClass A variance: {np.var(class_a):.2f}")
print(f"Class B variance: {np.var(class_b):.2f}")
Output:
Class A variance: 10.22
Class B variance: 416.44
Class B has about 40 times more variance. One class has mostly similar students. The other has a huge range from struggling to exceptional. Same rough average, completely different story underneath.
Standard Deviation: Back to Original Units
Variance is in squared units. If your data is in meters, variance is in square meters. That is hard to interpret directly.
Standard deviation is the square root of variance. It brings you back to the original units.
class_a_std = np.std(class_a)
class_b_std = np.std(class_b)
print(f"Class A: mean={np.mean(class_a):.1f}, std={class_a_std:.1f}")
print(f"Class B: mean={np.mean(class_b):.1f}, std={class_b_std:.1f}")
Output:
Class A: mean=80.0, std=3.2
Class B: mean=76.3, std=20.4
Class A students are typically within 3.2 points of the mean. Class B students are typically within 20.4 points. Now you can say something concrete about the spread in the same units as the original scores.
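The relationship between the two spread measures is easy to confirm in code: standard deviation is exactly the square root of variance.

```python
import numpy as np

class_a = [75, 76, 78, 79, 80, 81, 82, 84, 85]

# np.std is the square root of np.var, by definition
print(np.var(class_a))           # ~10.22 (squared points)
print(np.sqrt(np.var(class_a)))  # ~3.20  (points)
print(np.std(class_a))           # same as the line above
```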
Standard deviation is what you will use constantly in AI. Normalization. Feature scaling. Detecting outliers. Evaluating model uncertainty. It is everywhere.
Using These Together: Outlier Detection
One practical use you will need immediately. Finding values that are unusually far from the center.
The rule of thumb: anything more than 2 or 3 standard deviations from the mean is suspicious.
temperatures = np.array([22, 24, 21, 23, 25, 22, 24, 98, 23, 21, 25, 22])
mean = np.mean(temperatures)
std = np.std(temperatures)
print(f"Mean: {mean:.1f}")
print(f"Std: {std:.1f}")
print()
for temp in temperatures:
    z_score = (temp - mean) / std
    flag = " <-- OUTLIER" if abs(z_score) > 2 else ""
    print(f"Temp: {temp:5.1f} z-score: {z_score:6.2f}{flag}")
Output:
Mean: 29.2
Std: 20.8
Temp:  22.0 z-score:  -0.34
Temp:  24.0 z-score:  -0.25
Temp:  21.0 z-score:  -0.39
Temp:  23.0 z-score:  -0.30
Temp:  25.0 z-score:  -0.20
Temp:  22.0 z-score:  -0.34
Temp:  24.0 z-score:  -0.25
Temp:  98.0 z-score:   3.31 <-- OUTLIER
Temp:  23.0 z-score:  -0.30
Temp:  21.0 z-score:  -0.39
Temp:  25.0 z-score:  -0.20
Temp:  22.0 z-score:  -0.34
The z-score measures how many standard deviations a value is from the mean. Everything normal clusters between -1 and 1. The 98 degree reading is 3.31 standard deviations out. Something happened there. Sensor error. Data entry mistake. Worth investigating before feeding it to a model.
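The flagging loop above can be wrapped into a reusable filter that drops the outliers instead of just marking them. A sketch — the function name and default threshold are my own choices, not a standard API:

```python
import numpy as np

def remove_outliers(values, threshold=2.0):
    """Keep only values whose z-score magnitude is within the threshold.

    Caveat: mean and std are computed on the full array, so one huge
    outlier inflates the std and can hide smaller outliers. For those
    cases, median-based methods are more robust.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) <= threshold]

temps = [22, 24, 21, 23, 25, 22, 24, 98, 23, 21, 25, 22]
clean = remove_outliers(temps)
print(clean)  # the 98 reading is gone; 11 values remain
```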
Normalization: Why These Numbers Matter for AI
Machine learning models struggle when features are on wildly different scales. Age runs from 0 to 100. Income runs from 0 to 10,000,000. If you feed both raw into a model, the income feature dominates simply because its numbers are bigger.
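To see the domination concretely, compare two hypothetical people with a 35-year age gap but only a small income gap (values made up for illustration):

```python
import numpy as np

# [age, income] for two hypothetical people
person_1 = np.array([25.0, 60000.0])
person_2 = np.array([60.0, 61000.0])

# Raw Euclidean distance: sqrt(35^2 + 1000^2), roughly 1000.6.
# The 35-year age gap contributes almost nothing next to the
# 1000-dollar income gap, purely because of scale.
print(np.linalg.norm(person_1 - person_2))
```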
Normalization fixes this. Subtract the mean, divide by the standard deviation. Every feature ends up centered at zero with a spread of one.
ages = np.array([23, 45, 31, 52, 28, 38, 44, 29])
incomes = np.array([35000, 85000, 52000, 120000, 42000, 67000, 91000, 48000])
def normalize(data):
    return (data - np.mean(data)) / np.std(data)
ages_normalized = normalize(ages)
incomes_normalized = normalize(incomes)
print("Ages normalized:")
print(np.round(ages_normalized, 2))
print(f"Mean: {np.mean(ages_normalized):.4f}, Std: {np.std(ages_normalized):.4f}")
print("\nIncomes normalized:")
print(np.round(incomes_normalized, 2))
print(f"Mean: {np.mean(incomes_normalized):.4f}, Std: {np.std(incomes_normalized):.4f}")
Output:
Ages normalized:
[-1.40 0.93 -0.56 1.67 -0.87 0.19 0.82 -0.77]
Mean: 0.0000, Std: 1.0000
Incomes normalized:
[-1.19 0.64 -0.57 1.93 -0.94 -0.02 0.86 -0.72]
Mean: 0.0000, Std: 1.0000
Both features now live on the same scale. The model treats them equally. This single step often improves model performance noticeably without changing anything else.
You will apply this to almost every dataset you ever work with.
Try This
Create statistics_practice.py.
Use this dataset:
house_prices = [
245000, 312000, 198000, 425000, 289000,
315000, 178000, 520000, 267000, 305000,
2800000, 234000, 298000, 341000, 187000
]
Do all of the following:
Calculate the mean, median, and standard deviation. Print all three. Notice the difference between mean and median. Why are they so different here?
Find any outliers using z-scores. Print which values are more than 2 standard deviations from the mean.
Calculate the mean and median again after removing the outliers. How much did they change?
Normalize the prices (without the outlier). Print the first five normalized values. Verify the normalized data has mean close to 0 and std close to 1.
What's Next
You know how to describe your data. But data is not always certain. Your model does not say "the answer is 7." It says "the answer is probably 7 with 83% confidence."
That probability language is what the next post is about. How AI deals with uncertainty. How it turns raw numbers into probabilities. Why probabilities matter more than hard predictions in most real systems.