Introduction
Ever looked at a dataset and thought "what does this even mean?" That's exactly where statistics comes in.
Statistics helps us summarize, understand, and draw conclusions from data. And Python makes it ridiculously easy to do all of that with just a few lines of code.
In this blog, we'll cover three core pillars:
- 📦 Descriptive Statistics
- 🎲 Probability Distributions
- 🔬 Hypothesis Testing
No prior stats knowledge needed. Let's go!
🛠️ Setup
pip install numpy pandas scipy matplotlib seaborn
📦 Part 1: Descriptive Statistics
Descriptive statistics are the first thing you compute for any dataset: they describe and summarize your data.
Think of it like meeting someone new. You'd ask basic questions: How old are you? Where are you from? That's descriptive stats for data.
Key Concepts
| Term | What it means |
|---|---|
| Mean | Average value |
| Median | Middle value when sorted |
| Mode | Most frequently occurring value |
| Std Dev | Typical distance of values from the mean |
| Variance | Std Dev squared (average squared distance from the mean) |
Code Example
import numpy as np
import pandas as pd
data = [23, 45, 67, 23, 89, 45, 23, 56, 78, 90]
print("Mean:", np.mean(data)) # 53.9
print("Median:", np.median(data)) # 50.5
print("Std Dev:", np.std(data)) # ≈ 25.13 (population std, ddof=0)
print("Variance:", np.var(data)) # ≈ 631.49 (population variance, ddof=0)
# Mode using pandas
s = pd.Series(data)
print("Mode:", s.mode()[0]) # 23
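One detail about the snippet above is easy to miss: `np.std` and `np.var` divide by n by default (the population formulas), while pandas' `Series.std` and `Series.var` divide by n - 1 (the sample formulas). Here is a minimal sketch of the difference, reusing the same data. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

data = [23, 45, 67, 23, 89, 45, 23, 56, 78, 90]

# Population formulas (divide by n): NumPy's default, ddof=0
pop_std = np.std(data)
pop_var = np.var(data)

# Sample formulas (divide by n - 1): pass ddof=1 to NumPy
sample_std = np.std(data, ddof=1)
sample_var = np.var(data, ddof=1)

# pandas uses ddof=1 by default, so it matches the sample version
pandas_std = pd.Series(data).std()

print(f"Population std: {pop_std:.2f}, variance: {pop_var:.2f}")
print(f"Sample std:     {sample_std:.2f}, variance: {sample_var:.2f}")
print(f"pandas std:     {pandas_std:.2f}")
```

If your data is a sample from a larger population, which is the usual case, the `ddof=1` version is generally the one you want.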
Visualizing It
import matplotlib.pyplot as plt
import seaborn as sns
sns.histplot(data, kde=True, color='steelblue')
plt.title("Data Distribution")
plt.xlabel("Values")
plt.show()
💡 Tip: Always visualize your data before jumping to conclusions. A histogram tells you more than a single number ever will.
🎲 Part 2: Probability Distributions
A probability distribution tells you how likely different outcomes are.
Imagine rolling a fair six-sided die: each number has an equal chance of appearing. That's a (discrete) uniform distribution. But if you measure people's heights, most values cluster around the average. That's a normal distribution.
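You can check the die example with a quick simulation. This is a rough sketch, assuming a fair six-sided die. The seed and the number of rolls are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seed chosen only for reproducibility

# Simulate 60,000 rolls of a fair six-sided die (high=7 is exclusive)
rolls = rng.integers(low=1, high=7, size=60_000)

# Count how often each face appeared and convert counts to proportions
faces, counts = np.unique(rolls, return_counts=True)
proportions = counts / rolls.size

for face, prop in zip(faces, proportions):
    print(f"Face {face}: {prop:.3f}")  # each should be close to 1/6 ≈ 0.167
```

The more rolls you simulate, the closer each proportion gets to 1/6.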
Normal Distribution (The Bell Curve)
The most famous distribution in statistics. Many natural quantities follow it at least approximately: heights, exam scores, measurement errors.
from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt
# Evaluate the normal density (mean=0, std=1) on a grid of x values
# Note: this computes the curve itself; it does not draw random samples
x = np.linspace(-4, 4, 100)
y = norm.pdf(x, loc=0, scale=1)
plt.plot(x, y, color='tomato', linewidth=2)
plt.fill_between(x, y, alpha=0.2, color='tomato')
plt.title("Normal Distribution (Mean=0, Std=1)")
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.show()
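The curve above is a density, so probabilities come from the area under it rather than from its height. As a small sketch, `norm.cdf` gives the area to the left of a point, which is enough to reproduce the well-known 68-95-99.7 rule for the standard normal. The dictionary name `within` is only illustrative.

```python
from scipy.stats import norm

# Probability of landing within k standard deviations of the mean
within = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"Within {k} std dev: {prob:.4f}")

# Prints approximately 0.6827, 0.9545 and 0.9973
```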
Binomial Distribution
Use this when you count successes across a fixed number of independent yes/no trials that all share the same success probability, like flipping a coin 10 times and counting heads.
from scipy.stats import binom
n = 10 # number of trials
p = 0.5 # probability of success
# Probability of getting exactly 6 heads
print(binom.pmf(6, n, p)) # 0.205
# Plot
import matplotlib.pyplot as plt
x = range(0, 11)
y = [binom.pmf(k, n, p) for k in x]
plt.bar(x, y, color='mediumseagreen')
plt.title("Binomial Distribution (n=10, p=0.5)")
plt.xlabel("Number of Heads")
plt.ylabel("Probability")
plt.show()
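`pmf` answers "exactly k" questions. For "at most" or "at least" questions you need cumulative probabilities. Here is a short sketch using the same n and p as above, with `binom.cdf` for the lower tail and `binom.sf` (the survival function, 1 - cdf) for the upper tail.

```python
from scipy.stats import binom

n, p = 10, 0.5

at_most_6 = binom.cdf(6, n, p)   # P(X <= 6)
at_least_6 = binom.sf(5, n, p)   # P(X >= 6), the same as P(X > 5)

print(f"P(at most 6 heads):  {at_most_6:.4f}")   # 0.8281
print(f"P(at least 6 heads): {at_least_6:.4f}")  # 0.3770

# Expected number of heads and its variance: n*p and n*p*(1-p)
mean, var = binom.stats(n, p, moments="mv")
print(f"Mean: {float(mean):.1f}, Variance: {float(var):.2f}")  # 5.0 and 2.50
```

Note the off-by-one in `binom.sf(5, ...)`: `sf(k)` is P(X > k), so "at least 6" means passing 5.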
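💡 Rule of thumb: If your outcome is continuous and roughly symmetric around its average (height, weight) → think Normal. If you're counting binary results (yes/no, pass/fail) → think Binomial. Strongly skewed quantities such as prices often need a different distribution.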
🔬 Part 3: Hypothesis Testing
This is where statistics gets powerful.
Hypothesis testing helps you answer questions like:
- Is this new drug actually effective?
- Did my website redesign improve conversions?
- Are these two groups actually different?
The Core Idea
You start with two hypotheses:
- H₀ (Null Hypothesis): Nothing is happening. No difference. Status quo.
- H₁ (Alternative Hypothesis): Something IS happening. There IS a difference.
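Then you calculate a p-value: the probability of seeing data at least as extreme as yours if H₀ were true. If p falls below a threshold chosen in advance (0.05 by convention), you reject H₀ and call the result statistically significant.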
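To make the p-value concrete, here is a small sketch built on a made-up scenario: 60 heads in 100 coin flips. These numbers are an illustrative assumption, not data from this post. Under H₀ (a fair coin), the code adds up the probability of every outcome at least as far from 50 heads as the observed one.

```python
from scipy.stats import binom

n_flips = 100        # hypothetical experiment size
observed_heads = 60  # hypothetical observed result
p_fair = 0.5         # H0: the coin is fair

# One-sided tail: P(X >= 60) if the coin really is fair
upper_tail = binom.sf(observed_heads - 1, n_flips, p_fair)

# Two-sided p-value: the fair-coin distribution is symmetric, so
# "at least as extreme" covers both >= 60 heads and <= 40 heads
p_value = 2 * upper_tail

print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 0.05 level")
else:
    print("Fail to reject H0 at the 0.05 level")
```

With these numbers the p-value lands just above 0.05, so 60 heads out of 100 is suspicious but not quite enough to reject a fair coin at that threshold.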
One-Sample T-Test
"Is the average height of my sample different from the national average?"
from scipy import stats
# Sample data (heights in cm)
sample = [165, 170, 168, 172, 160, 175, 163, 169, 171, 167]
# Test against national average of 170cm
t_stat, p_value = stats.ttest_1samp(sample, popmean=170)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("✅ Reject H₀: significant difference found!")
else:
    print("❌ Fail to reject H₀: no significant difference.")
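If you're curious what `ttest_1samp` does under the hood, the sketch below recomputes the same test by hand for the same sample: a t-statistic from the sample mean and standard error, then a two-sided p-value from the t distribution with n - 1 degrees of freedom. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

sample = [165, 170, 168, 172, 160, 175, 163, 169, 171, 167]
popmean = 170

n = len(sample)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)        # sample std (divide by n - 1)
standard_error = sample_std / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (sample_mean - popmean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(f"Manual t: {t_manual:.4f}, p: {p_manual:.4f}")
```

Both numbers should match the output of `stats.ttest_1samp(sample, popmean=170)`.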
Two-Sample T-Test
"Are Group A and Group B actually different?"
from scipy import stats
group_a = [78, 82, 85, 90, 88, 76, 95, 84]
group_b = [70, 74, 68, 72, 80, 65, 77, 71]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("✅ The two groups are significantly different!")
else:
    print("❌ No significant difference between groups.")
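By default, `ttest_ind` runs Student's t-test, which assumes both groups have the same variance. When you're not sure that holds, Welch's t-test (`equal_var=False`) is a common, safer alternative. Here is a quick side-by-side sketch on the same two groups:

```python
from scipy import stats

group_a = [78, 82, 85, 90, 88, 76, 95, 84]
group_b = [70, 74, 68, 72, 80, 65, 77, 71]

# Default: Student's t-test, which assumes equal variances in both groups
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: drops the equal-variance assumption
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={t_student:.4f}, p={p_student:.4f}")
print(f"Welch:   t={t_welch:.4f}, p={p_welch:.4f}")
```

Because both groups here have the same size, the two t-statistics are identical. Only the degrees of freedom change, so the p-values differ slightly.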
💡 Remember: A small p-value doesn't mean your result is important. It only means data like yours would be unlikely if H₀ were true. Always pair stats with context!
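One common way to add that context is an effect size, which measures how big a difference is rather than how surprising it is. The sketch below computes Cohen's d (the difference in means divided by the pooled standard deviation) for the two groups from the previous example. `cohens_d` is a hypothetical helper written for this post, not a SciPy function.

```python
import numpy as np

group_a = [78, 82, 85, 90, 88, 76, 95, 84]
group_b = [70, 74, 68, 72, 80, 65, 77, 71]

def cohens_d(a, b):
    """Difference in means divided by the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

mean_diff = np.mean(group_a) - np.mean(group_b)
d = cohens_d(group_a, group_b)

print(f"Mean difference: {mean_diff:.3f}")
print(f"Cohen's d: {d:.2f}")
```

Here d comes out above 2, meaning the group means are more than two pooled standard deviations apart. With very large samples, the opposite can happen: a tiny p-value paired with an effect too small to matter.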
🧠 Quick Recap
| Concept | What it does | Python library |
|---|---|---|
| Descriptive Stats | Summarizes data | numpy, pandas |
| Normal Distribution | Models continuous data | scipy.stats |
| Binomial Distribution | Models yes/no outcomes | scipy.stats |
| T-Test | Compares means | scipy.stats |
🚀 What's Next?
Now that you've got the basics down, here's where to go next:
- 📈 Regression Analysis — predict one variable from another
- 🤖 Intro to Machine Learning — use stats to build models
- 📊 Exploratory Data Analysis (EDA) — combine all of the above on real datasets
🙌 Final Thoughts
Statistics isn't about memorizing formulas — it's about asking the right questions about your data. Python gives you the tools; curiosity gives you the direction.
If this helped you, drop a like and follow for more beginner-friendly data science content! 🔔
Tags: #Python #Statistics #DataScience #BeginnerFriendly #MachineLearning #Pandas #NumPy #SciPy