The Bell Curve and Why It Shows Up Everywhere

Akhilesh

Measure the height of every adult in your city.

Plot how many people are at each height. Short on the left, tall on the right, count of people on the vertical axis.

You get a bell. Narrow at the extremes, wide in the middle. Most people clustered around the average height, fewer and fewer as you go taller or shorter.

Now measure reaction times in a psychology study. Plot them.

Bell.

Measure the weight of apples coming off a production line. Plot them.

Bell.

Measure the errors in any careful scientific measurement. Plot them.

Bell.

This keeps happening. The same shape, over and over, in completely unrelated domains. It is not a coincidence. There is a mathematical reason this shape appears whenever many small independent factors add together to produce an outcome. That reason is what this post is about.


The Shape in Code

import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend: save the plot to a file instead of opening a window
import matplotlib.pyplot as plt

np.random.seed(42)
heights = np.random.normal(loc=170, scale=10, size=10000)

plt.figure(figsize=(10, 5))
plt.hist(heights, bins=60, edgecolor='black', color='steelblue', alpha=0.7)
plt.axvline(heights.mean(), color='red', linewidth=2, label=f'Mean: {heights.mean():.1f}')
plt.xlabel('Height (cm)')
plt.ylabel('Count')
plt.title('Distribution of heights (10,000 people)')
plt.legend()
plt.savefig('normal_dist.png', dpi=100, bbox_inches='tight')
plt.close()

print(f"Mean:   {heights.mean():.2f} cm")
print(f"Std:    {heights.std():.2f} cm")
print(f"Min:    {heights.min():.2f} cm")
print(f"Max:    {heights.max():.2f} cm")

Output:

Mean:   169.98 cm
Std:    10.03 cm
Min:    131.74 cm
Max:    209.85 cm

np.random.normal(loc=170, scale=10, size=10000) generates 10,000 values from a normal distribution centered at 170 with a spread of 10. The histogram you get from this is a bell curve.

loc is the mean. The center of the bell.
scale is the standard deviation. How wide the bell is.

Change scale to 2 and the bell gets narrow and tall. Change it to 30 and it gets wide and flat. Same center, different spread.
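
Here is a quick way to see that in numbers rather than pictures (a minimal sketch; the 5 cm window is an arbitrary choice, just to compare how tightly the samples cluster):

import numpy as np

np.random.seed(0)
for scale in [2, 10, 30]:
    sample = np.random.normal(loc=170, scale=scale, size=10000)
    # What fraction of values lands within 5 cm of the center?
    near_center = np.mean(np.abs(sample - 170) < 5) * 100
    print(f"scale={scale:2d}  std={sample.std():5.2f}  within 5 cm of mean: {near_center:.1f}%")

With scale=2 nearly every value sits within 5 cm of the mean; with scale=30 only a small fraction does. Same center, very different spread.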


The 68-95-99.7 Rule

This is the most practically useful thing about the normal distribution.

mean = 170
std  = 10

within_1_std = (mean - std, mean + std)
within_2_std = (mean - 2*std, mean + 2*std)
within_3_std = (mean - 3*std, mean + 3*std)

sample = np.random.normal(mean, std, 100000)

pct_1 = np.mean((sample >= within_1_std[0]) & (sample <= within_1_std[1])) * 100
pct_2 = np.mean((sample >= within_2_std[0]) & (sample <= within_2_std[1])) * 100
pct_3 = np.mean((sample >= within_3_std[0]) & (sample <= within_3_std[1])) * 100

print(f"Within 1 std ({within_1_std[0]} to {within_1_std[1]}): {pct_1:.1f}%")
print(f"Within 2 std ({within_2_std[0]} to {within_2_std[1]}): {pct_2:.1f}%")
print(f"Within 3 std ({within_3_std[0]} to {within_3_std[1]}): {pct_3:.1f}%")

Output:

Within 1 std (160 to 180): 68.3%
Within 2 std (150 to 190): 95.4%
Within 3 std (140 to 200): 99.7%

68% of the data falls within one standard deviation of the mean.
95% within two.
99.7% within three.

The remaining 0.3% falls beyond three standard deviations. These are your outliers. The anomalies. The things worth investigating.

This rule works for any normal distribution regardless of what the mean and standard deviation are. The percentages stay the same. Only the actual values change.
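
You can check this yourself with a completely different mean and standard deviation (the numbers below are arbitrary; any pair behaves the same way):

import numpy as np

np.random.seed(1)
mean, std = 500, 25   # arbitrary choice; the percentages do not depend on it
sample = np.random.normal(mean, std, 100000)

for k in (1, 2, 3):
    pct = np.mean(np.abs(sample - mean) <= k * std) * 100
    print(f"Within {k} std ({mean - k*std} to {mean + k*std}): {pct:.1f}%")

The percentages come out at roughly 68.3%, 95.4%, and 99.7% again; only the interval endpoints change.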


Why This Matters for AI

Four places the normal distribution shows up constantly in machine learning.

Weight initialization. When you create a neural network, its weights cannot all start at zero. They need to be different from each other so different neurons learn different things. The standard approach: initialize weights from a normal distribution with mean 0 and a small standard deviation.

layer_weights = np.random.normal(loc=0, scale=0.01, size=(256, 128))
print(f"Weight matrix shape: {layer_weights.shape}")
print(f"Mean of weights: {layer_weights.mean():.6f}")
print(f"Std of weights:  {layer_weights.std():.6f}")

Output:

Weight matrix shape: (256, 128)
Mean of weights: 0.000023
Std of weights:  0.010001

Random, small, centered at zero, normally distributed. This is how most neural networks start life; schemes like Xavier and He initialization refine the standard deviation based on layer size, but the idea is the same.
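
If you want the refined version, here is a sketch of He initialization, which sets the standard deviation to sqrt(2 / fan_in) and is the common choice for ReLU layers (the layer dimensions below are just examples):

import numpy as np

np.random.seed(6)
fan_in, fan_out = 256, 128              # example layer dimensions
he_std = np.sqrt(2 / fan_in)            # He initialization: std shrinks as the layer gets wider
he_weights = np.random.normal(loc=0, scale=he_std, size=(fan_in, fan_out))

print(f"Target std: {he_std:.4f}")
print(f"Actual std: {he_weights.std():.4f}")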

Feature distributions. Many real-world features are approximately normally distributed. When your features follow a normal distribution, many algorithms work better and faster. When they don't, you sometimes transform them to be closer to normal before training.
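
One common transform is the logarithm. A minimal sketch on synthetic skewed data (the exponential draw stands in for a skewed real feature; np.log1p computes log(1 + x), which is safe at zero):

import numpy as np

np.random.seed(2)
skewed = np.random.exponential(scale=50, size=5000)
transformed = np.log1p(skewed)

# The mean-median gap is a rough symmetry check: it shrinks after the transform
for name, data in [("raw", skewed), ("log1p", transformed)]:
    gap = abs(data.mean() - np.median(data))
    print(f"{name:6s} mean={data.mean():7.2f}  median={np.median(data):7.2f}  gap={gap:.2f}")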

Residuals in regression. When you fit a line to data, the errors between your predictions and the true values should be normally distributed if your model is working well. If they're not, something is wrong with your model assumptions.
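
A rough sketch of that check using np.polyfit (the data here is synthetic, so the residuals are normal by construction; on real data this is the pattern you hope to see):

import numpy as np

np.random.seed(3)
x = np.linspace(0, 10, 500)
y = 3 * x + 7 + np.random.normal(0, 2, size=500)   # a true line plus normal noise

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Healthy residuals: centered on zero, symmetric, roughly 68% within one std
print(f"Residual mean:   {residuals.mean():.3f}")
print(f"Residual median: {np.median(residuals):.3f}")
within_1 = np.mean(np.abs(residuals) <= residuals.std()) * 100
print(f"Within 1 std:    {within_1:.1f}%")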

Anomaly detection. Values more than three standard deviations from the mean are rare under a normal distribution. Mark them as anomalies.

sensor_readings = np.array([
    23.1, 22.8, 23.4, 22.9, 23.2, 23.0,
    22.7, 23.3, 22.6, 87.4, 23.1, 22.9
])

mean = sensor_readings.mean()
std  = sensor_readings.std()

print(f"Mean: {mean:.2f}, Std: {std:.2f}\n")

for i, reading in enumerate(sensor_readings):
    z = (reading - mean) / std            # distance from the mean, in units of std
    status = "ANOMALY" if abs(z) > 3 else "normal"
    print(f"Reading {i+1:2d}: {reading:6.1f}  z={z:6.2f}  {status}")

Output:

Mean: 28.37, Std: 17.80

Reading  1:   23.1  z= -0.30  normal
Reading  2:   22.8  z= -0.31  normal
Reading  3:   23.4  z= -0.28  normal
Reading  4:   22.9  z= -0.31  normal
Reading  5:   23.2  z= -0.29  normal
Reading  6:   23.0  z= -0.30  normal
Reading  7:   22.7  z= -0.32  normal
Reading  8:   23.3  z= -0.28  normal
Reading  9:   22.6  z= -0.32  normal
Reading 10:   87.4  z=  3.32  ANOMALY
Reading 11:   23.1  z= -0.30  normal
Reading 12:   22.9  z= -0.31  normal

One sensor reading spiked to 87.4. Everything else was between 22 and 24. The z-score of 3.32 flags it immediately. Note that the outlier itself inflates the mean and standard deviation it is measured against, which is why its z-score is not even larger.


When Data Is Not Normal

Real data is often not perfectly normal. It is skewed, has heavy tails, or has multiple peaks. Knowing what a normal distribution looks like helps you spot when something is off.

normal_data = np.random.normal(100, 15, 5000)
skewed_data = np.random.exponential(scale=50, size=5000)

print("Normal data:")
print(f"  Mean:   {normal_data.mean():.1f}")
print(f"  Median: {np.median(normal_data):.1f}")
print(f"  Diff:   {abs(normal_data.mean() - np.median(normal_data)):.1f}")

print("\nSkewed data:")
print(f"  Mean:   {skewed_data.mean():.1f}")
print(f"  Median: {np.median(skewed_data):.1f}")
print(f"  Diff:   {abs(skewed_data.mean() - np.median(skewed_data)):.1f}")

Output:

Normal data:
  Mean:   99.9
  Median: 100.0
  Diff:   0.1

Skewed data:
  Mean:   49.8
  Median: 34.3
  Diff:   15.5

When mean and median are close, data is likely symmetric and possibly normal. When they diverge significantly, the distribution is skewed. Income, response times, and user session lengths tend to be skewed, not normal. Always check before assuming.
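
If you want a single number instead of eyeballing the mean-median gap, you can compute skewness with plain numpy (this is the standard standardized-third-moment definition; scipy has scipy.stats.skew, but it is not needed here):

import numpy as np

def skewness(x):
    # Standardized third moment: ~0 for symmetric data,
    # positive for a long right tail, negative for a long left tail
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

np.random.seed(4)
print(f"Normal data skewness:      {skewness(np.random.normal(100, 15, 5000)):.2f}")
print(f"Exponential data skewness: {skewness(np.random.exponential(50, 5000)):.2f}")

The normal sample lands near 0; the exponential sample lands near 2, which is its theoretical skewness.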


The Central Limit Theorem: Why the Bell Appears Everywhere

Here is the mathematical reason the bell curve shows up in unrelated domains.

Take any distribution, a die roll, say. Draw from it randomly. Average several draws together. Repeat this many times. Plot the distribution of those averages.

Normal distribution. Every time. Regardless of the original distribution, as long as it has a finite mean and variance, which nearly everything in practice does.

np.random.seed(42)

die_rolls_single = np.random.randint(1, 7, size=10000)

sample_means = []
for _ in range(10000):
    sample = np.random.randint(1, 7, size=30)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

print("Single die roll:")
print(f"  Mean: {die_rolls_single.mean():.2f}")
print(f"  Std:  {die_rolls_single.std():.2f}")
print(f"  Shape: roughly uniform (1 through 6)")

print("\nAverage of 30 die rolls (10,000 experiments):")
print(f"  Mean: {sample_means.mean():.2f}")
print(f"  Std:  {sample_means.std():.2f}")
print(f"  Shape: bell curve, centered at 3.5")

Output:

Single die roll:
  Mean: 3.50
  Std:  1.71
  Shape: roughly uniform (1 through 6)

Average of 30 die rolls (10,000 experiments):
  Mean: 3.50
  Std:  0.31
  Shape: bell curve, centered at 3.5

A single die roll is uniformly distributed. Flat. Every outcome equally likely. But average 30 rolls together and suddenly you have a bell curve.

Human heights result from adding up many small genetic and environmental factors. Measurement errors are the sum of many tiny random disturbances. Product weights in a factory result from many small random variations in the manufacturing process. Sums and averages of many independent things come out approximately normal. That is why the bell shows up everywhere.
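
You can simulate that story directly: build each "height" as a sum of many small independent contributions and the bell appears on its own (the uniform factors below are a toy model, not real genetics):

import numpy as np

np.random.seed(5)
# Each simulated height = a baseline plus the sum of 100 small independent factors
factors = np.random.uniform(-0.5, 0.5, size=(10000, 100))
heights = 170 + factors.sum(axis=1)

print(f"Mean: {heights.mean():.2f}")
print(f"Std:  {heights.std():.2f}")
# A quick bell check: about 68% should fall within one std of the mean
within = np.mean(np.abs(heights - heights.mean()) <= heights.std()) * 100
print(f"Within 1 std: {within:.1f}%")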

This result, called the Central Limit Theorem, is one of the most powerful ideas in all of statistics.


Try This

Create normal_distribution_practice.py.

Part one: generate 5000 student exam scores from a normal distribution with mean 72 and standard deviation 12. Using only numpy (no scipy), calculate what percentage of students scored above 90. What percentage scored below 50. What percentage scored between 60 and 85.

Then verify using the 68-95-99.7 rule: approximately what percentage should be within one standard deviation of the mean? Count how many actually are and compare.

Part two: you have this real dataset of daily temperatures:

temps = np.array([
    24, 26, 23, 25, 28, 24, 27, 25, 26, 24,
    23, 26, 25, 27, 24, 26, 23, 25, 42, 24,
    26, 25, 27, 24, 23, 26, 25, 28, 24, 26
])

Calculate mean and standard deviation. Find any temperatures more than 2 standard deviations from the mean. Remove those outliers and recalculate statistics. How much did things change?

Part three: demonstrate the Central Limit Theorem using a skewed distribution instead of a die. Use np.random.exponential(scale=10, size=...). Take samples of size 50 and compute their means 5000 times. Print the mean and standard deviation of your sample means. Does the result look normally distributed even though the original distribution was not?


What's Next

Phase 2 is almost done. One post left: all of this math running as real code using NumPy. No theory. Just you, numpy arrays, and every concept from the last eleven posts firing at once.

After that, Phase 3. The actual data tools. NumPy, Pandas, visualization. The stuff you will use every single day.

Top comments (2)

PEACEBINFLOW

The Central Limit Theorem section is the one that made something click that I'd never quite put together. It's not just that the bell curve appears in unrelated domains—it's that it appears specifically when many small independent factors add together. Height is the sum of hundreds of genetic and environmental variables. Measurement error is the sum of tiny disturbances. Factory apple weights are the sum of small random variations in water, sunlight, soil nutrients. The bell curve isn't magic. It's just what happens when you average enough independent things. The die-roll example makes that concrete in a way that abstract explanations never did.

What I find myself thinking about is the anomaly detection example with the sensor reading of 87.4. The z-score flag works beautifully for that case—a single clear spike in otherwise stable data. But real-world anomaly detection is rarely that clean. A sensor that's slowly drifting upward by 0.1 degrees per day will never trigger a 3-sigma flag until it's been broken for weeks. The mean and standard deviation adapt to the drift, because they're recalculated from recent data. The anomaly hides inside the normal distribution by gradually reshaping it. That's the limitation of purely statistical anomaly detection: it catches point anomalies easily but misses slow shifts until they're extreme.

The weight initialization point is one of those things that's easy to gloss over but actually contains a deep idea: the normal distribution with mean zero and small standard deviation isn't just a convenient default. It's mathematically motivated. If weights were all the same, every neuron in a layer would learn the same thing. The randomness breaks symmetry. The normal distribution is the least-assumptions way to break it—it doesn't bias toward positive or negative, large or small. Just a clean, symmetric spread. Most people who use np.random.normal(0, 0.01) for layer initialization are using a mathematical insight they may not know they're using. Do you find that the "check if residuals are normally distributed" diagnostic actually gets used in practice on the ML projects you see, or is it one of those textbook techniques that tends to get skipped in real workflows?

Akhilesh

Yeah, that makes sense.
And the drift point is true; that's where the z-score approach falls short.
As for residuals, most real projects don't check them much unless it's more stats-focused work.