Kaarthik Sekar

Posted on Apr 18

Statistics in Python for Absolute Beginners 🐍📊

#datascience #python #beginners #tutorial

Introduction

Ever looked at a dataset and thought, "What does this even mean?" That's exactly where statistics comes in.

Statistics helps us summarize, understand, and draw conclusions from data. And Python makes it ridiculously easy to do all of that with just a few lines of code.

In this blog, we'll cover three core pillars:

📦 Descriptive Statistics
🎲 Probability Distributions
🔬 Hypothesis Testing

No prior stats knowledge needed. Let's go!

🛠️ Setup

pip install numpy pandas scipy matplotlib seaborn

📦 Part 1: Descriptive Statistics

Descriptive statistics are the first thing you compute for any dataset: they describe and summarize your data.

Think of it like meeting someone new. You'd ask basic questions: How old are you? Where are you from? That's descriptive stats for data.

Key Concepts

Term	What it means
Mean	Average value
Median	Middle value when sorted
Mode	Most frequently occurring value
Std Dev	Typical distance of values from the mean
Variance	Average squared distance from the mean (Std Dev squared)

Code Example

import numpy as np
import pandas as pd

data = [23, 45, 67, 23, 89, 45, 23, 56, 78, 90]

print("Mean:", np.mean(data))        # 53.9
print("Median:", np.median(data))    # 50.5
print("Std Dev:", np.std(data))      # ~25.13 (population std, ddof=0)
print("Variance:", np.var(data))     # ~631.49 (population variance, ddof=0)

# Mode using pandas
s = pd.Series(data)
print("Mode:", s.mode()[0])          # 23
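
One detail worth knowing: NumPy's `np.std` and `np.var` divide by n by default (the "population" formulas), while pandas' `Series.std()` and `Series.var()` divide by n - 1 (the "sample" formulas). Here's a minimal sketch of the difference using the same `data` list. The variable names are just for illustration.

```python
import numpy as np

data = [23, 45, 67, 23, 89, 45, 23, 56, 78, 90]

# Population formulas (divide by n): NumPy's default, ddof=0
pop_var = np.var(data)
pop_std = np.std(data)

# Sample formulas (divide by n - 1): pass ddof=1
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

# The same population variance computed by hand
mean = sum(data) / len(data)
manual_var = sum((x - mean) ** 2 for x in data) / len(data)

print(f"Population variance: {pop_var:.2f}, std: {pop_std:.2f}")
print(f"Sample variance:     {sample_var:.2f}, std: {sample_std:.2f}")
print(f"Manual variance:     {manual_var:.2f}")
```

If your data is a sample from a larger group (which it usually is), `ddof=1` is typically the safer choice.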

Visualizing It

import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(data, kde=True, color='steelblue')
plt.title("Data Distribution")
plt.xlabel("Values")
plt.show()

💡 Tip: Always visualize your data before jumping to conclusions. A histogram tells you more than a single number ever will.

🎲 Part 2: Probability Distributions

A probability distribution tells you how likely different outcomes are.

Imagine rolling a fair die: each number has an equal chance of appearing. That's a uniform distribution. But if you measure people's heights, most cluster around the average, with fewer people at the extremes. That's roughly a normal distribution.
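
You can see both shapes with a quick simulation. This is just a sketch: the seed, the number of rolls, and the height parameters (mean 170, std 10) are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for repeatability

# Roll a fair six-sided die 60,000 times (high is exclusive)
rolls = rng.integers(low=1, high=7, size=60_000)

# Uniform: each face should show up about 1/6 of the time
faces, counts = np.unique(rolls, return_counts=True)
for face, count in zip(faces, counts):
    print(f"Face {face}: {count / rolls.size:.3f}")

# Normal: simulated heights cluster around the mean
heights = rng.normal(loc=170, scale=10, size=60_000)
within_one_std = np.mean(np.abs(heights - 170) < 10)
print(f"Share of heights within 1 std of the mean: {within_one_std:.3f}")
```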

Normal Distribution (The Bell Curve)

The most famous distribution in all of statistics. Many natural phenomena follow it approximately: heights, measurement errors, and often exam scores.

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the standard normal density (mean=0, std=1)
# on a grid of 100 x values between -4 and 4
x = np.linspace(-4, 4, 100)
y = norm.pdf(x, loc=0, scale=1)

plt.plot(x, y, color='tomato', linewidth=2)
plt.fill_between(x, y, alpha=0.2, color='tomato')
plt.title("Normal Distribution (Mean=0, Std=1)")
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.show()
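
The curve gives densities, not probabilities. To get an actual probability you need the area under the curve, which `norm.cdf` (and its complement `norm.sf`) computes. A small sketch follows; the height numbers (mean 170 cm, std 10 cm) are hypothetical.

```python
from scipy.stats import norm

# Probability that a standard normal value lands within
# 1, 2, or 3 standard deviations of the mean
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} std: {prob:.4f}")

# Hypothetical heights: mean 170 cm, std 10 cm
p_taller = norm.sf(185, loc=170, scale=10)  # sf = 1 - cdf
print(f"P(height > 185 cm): {p_taller:.4f}")
```

This is where the familiar "68-95-99.7" rule comes from.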

Binomial Distribution

Use this when you count successes across a fixed number of independent yes/no trials, like flipping a coin 10 times and counting heads.

from scipy.stats import binom

n = 10      # number of trials
p = 0.5     # probability of success

# Probability of getting exactly 6 heads
print(binom.pmf(6, n, p))   # 0.205

# Plot the probability of each possible number of heads (0 through 10)
import matplotlib.pyplot as plt
x = range(0, 11)
y = [binom.pmf(k, n, p) for k in x]

plt.bar(x, y, color='mediumseagreen')
plt.title("Binomial Distribution (n=10, p=0.5)")
plt.xlabel("Number of Heads")
plt.ylabel("Probability")
plt.show()
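
`pmf` answers "exactly k". For "at most k" or "at least k" questions, `cdf` and `sf` are handy. Here's a short sketch with the same coin-flip setup. The variable names are mine.

```python
from scipy.stats import binom

n, p = 10, 0.5

p_at_most_6 = binom.cdf(6, n, p)    # P(X <= 6)
p_at_least_6 = binom.sf(5, n, p)    # P(X >= 6), i.e. P(X > 5)
expected_heads = binom.mean(n, p)   # n * p

print(f"P(at most 6 heads):  {p_at_most_6:.4f}")
print(f"P(at least 6 heads): {p_at_least_6:.4f}")
print(f"Expected heads:      {expected_heads}")
```

Note the off-by-one in `sf`: it gives P(X > k), so "at least 6" means passing 5.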

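💡 Rule of thumb: If your outcome is continuous and roughly symmetric (height, weight) → start with Normal. If it's a count of successes across yes/no trials (pass/fail) → think Binomial. Skewed data such as prices may need a different distribution.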

🔬 Part 3: Hypothesis Testing

This is where statistics gets powerful.

Hypothesis testing helps you answer questions like:

Is this new drug actually effective?
Did my website redesign improve conversions?
Are these two groups actually different?

The Core Idea

You start with two hypotheses:

H₀ (Null Hypothesis): Nothing is happening. No difference. Status quo.
H₁ (Alternative Hypothesis): Something IS happening. There IS a difference.

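Then you calculate a p-value: the probability of seeing data at least as extreme as yours if H₀ were true. If p < 0.05 (a common but arbitrary threshold), you reject H₀ and say the result is statistically significant.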
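
To get a feel for what the 0.05 threshold means, here's a rough simulation sketch. It runs many experiments where H₀ is true by construction and counts how often the test still says "significant". The seed, the sample size, and the population values (mean 170, std 5) are all assumptions made up for the demo.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# 2,000 simulated experiments where H0 is true:
# each row is a sample of 20 values from a population whose mean really is 170
samples = rng.normal(loc=170, scale=5, size=(2000, 20))

# One t-test per row (axis=1), each against the true mean
_, p_values = stats.ttest_1samp(samples, popmean=170, axis=1)

false_positive_rate = np.mean(p_values < 0.05)
print(f"Share of 'significant' results when H0 is true: {false_positive_rate:.3f}")
```

The share should land near 0.05. In other words, with this threshold you accept being fooled by chance about 1 time in 20 when nothing is really going on.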

One-Sample T-Test

"Is the average height of my sample different from the national average?"

from scipy import stats

# Sample data (heights in cm)
sample = [165, 170, 168, 172, 160, 175, 163, 169, 171, 167]

# Test against national average of 170cm
t_stat, p_value = stats.ttest_1samp(sample, popmean=170)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("✅ Reject H₀ — significant difference found!")
else:
    print("❌ Fail to reject H₀ — no significant difference.")
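
If you're curious what `ttest_1samp` does under the hood, the sketch below recomputes the same numbers by hand with the standard one-sample t formula: t = (sample mean - population mean) / (sample std / sqrt(n)). The variable names are mine.

```python
import numpy as np
from scipy import stats

sample = [165, 170, 168, 172, 160, 175, 163, 169, 171, 167]
popmean = 170

n = len(sample)
sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)       # sample std (divide by n - 1)
standard_error = sample_std / np.sqrt(n)

# How many standard errors the sample mean sits from the hypothesized mean
t_manual = (sample_mean - popmean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=popmean)

print(f"Manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Both lines should print the same values.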

Two-Sample T-Test

"Are Group A and Group B actually different?"

from scipy import stats

group_a = [78, 82, 85, 90, 88, 76, 95, 84]
group_b = [70, 74, 68, 72, 80, 65, 77, 71]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

print(f"T-statistic: {t_stat:.4f}")
print(f"P-value: {p_value:.4f}")

if p_value < 0.05:
    print("✅ The two groups are significantly different!")
else:
    print("❌ No significant difference between groups.")
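
By default, `ttest_ind` assumes both groups have the same variance. If you're not sure that holds, Welch's t-test (`equal_var=False`) is a common alternative. This sketch runs both side by side on the same groups. It's a comparison for illustration, not a recommendation for every dataset.

```python
from scipy import stats

group_a = [78, 82, 85, 90, 88, 76, 95, 84]
group_b = [70, 74, 68, 72, 80, 65, 77, 71]

# Student's t-test: assumes equal variances (SciPy's default)
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {t_student:.4f}, p = {p_student:.4f}")
print(f"Welch:   t = {t_welch:.4f}, p = {p_welch:.4f}")
```

With equal group sizes, as here, the two t-statistics match and only the p-values differ slightly.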

💡 Remember: A small p-value doesn't mean your result is important. It only means data like yours would be unlikely if H₀ were true. Always pair stats with context!
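
One way to add that context is an effect size, which tells you how big the difference is rather than just whether it exists. Cohen's d isn't covered in this post, so treat the sketch below as an optional extra. It uses the simple pooled-std form, which assumes equal group sizes.

```python
import numpy as np

group_a = [78, 82, 85, 90, 88, 76, 95, 84]
group_b = [70, 74, 68, 72, 80, 65, 77, 71]

mean_diff = np.mean(group_a) - np.mean(group_b)

# Pooled standard deviation (this simple form assumes equal group sizes)
pooled_std = np.sqrt((np.var(group_a, ddof=1) + np.var(group_b, ddof=1)) / 2)

# Cohen's d: the difference in means, measured in standard deviations
cohens_d = mean_diff / pooled_std
print(f"Cohen's d: {cohens_d:.2f}")
```

By Cohen's rough conventions, 0.2 is a small effect, 0.5 medium, and 0.8 large.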

🧠 Quick Recap

Concept	What it does	Python library
Descriptive Stats	Summarizes data	`numpy`, `pandas`
Normal Distribution	Models continuous data	`scipy.stats`
Binomial Distribution	Models yes/no outcomes	`scipy.stats`
T-Test	Compares means	`scipy.stats`

🚀 What's Next?

Now that you've got the basics down, here's where to go next:

📈 Regression Analysis — predict one variable from another
🤖 Intro to Machine Learning — use stats to build models
📊 Exploratory Data Analysis (EDA) — combine all of the above on real datasets

🙌 Final Thoughts

Statistics isn't about memorizing formulas — it's about asking the right questions about your data. Python gives you the tools; curiosity gives you the direction.

If this helped you, drop a like and follow for more beginner-friendly data science content! 🔔

Tags: #Python #Statistics #DataScience #BeginnerFriendly #MachineLearning #Pandas #NumPy #SciPy

DEV Community