Mastering Statistics in a Week: A Sarcastically Professional Dive
This week has been a deep dive into the core of statistics, tackling foundational concepts with technical rigor while sprinkling in a little sarcasm to keep it digestible. Here’s an exhaustive breakdown of my statistical odyssey, complete with detailed theory, practical examples, and Python implementations.
1. Descriptive Statistics: Summarizing the Data
Descriptive statistics are the tools that help us summarize and organize raw data to make it more interpretable. It's the first step in understanding the dataset and lays the groundwork for further analysis.
Types of Data
- Nominal Data:
  - Qualitative, unordered categories.
  - Examples: Colors (red, green, blue), brands (Apple, Samsung).
  - Operations: Counting, mode.
- Ordinal Data:
  - Qualitative data with a meaningful order, but differences between values are not measurable.
  - Examples: Education levels (high school, bachelor’s, master’s), ratings (poor, fair, good).
  - Operations: Ranking, median.
- Interval Data:
  - Quantitative data with meaningful differences but no true zero.
  - Examples: Temperature (Celsius, Fahrenheit).
  - Operations: Addition, subtraction.
- Ratio Data:
  - Quantitative data with a true zero, enabling all arithmetic operations.
  - Examples: Weight, height, income.
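To make the distinction concrete, here's a small sketch (assuming pandas is available) of which operations are valid on ordinal data, using a hypothetical set of survey ratings:

```python
import pandas as pd

# Hypothetical survey ratings: ordinal data with a meaningful order
ratings = pd.Categorical(
    ["poor", "good", "fair", "good", "good"],
    categories=["poor", "fair", "good"],
    ordered=True,
)

# Ordering-based operations are valid for ordinal data
print(ratings.min(), ratings.max())                # poor good
# The mode is valid for nominal and ordinal data alike
print(pd.Series(ratings).value_counts().idxmax())  # good
# Differences are NOT meaningful: "good" minus "fair" has no defined magnitude
```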
Measures of Central Tendency
- Mean: Arithmetic average of data values.
- Median: Middle value in a sorted dataset.
- Mode: Most frequently occurring value in a dataset.
Python Example:

```python
import numpy as np
from scipy import stats

# Sample data
data = [12, 15, 14, 10, 12, 17, 18]

mean = np.mean(data)
median = np.median(data)
# keepdims=False works on SciPy >= 1.9; older versions need stats.mode(data).mode[0]
mode = stats.mode(data, keepdims=False).mode
print(f"Mean: {mean}, Median: {median}, Mode: {mode}")
```
2. Measures of Dispersion: Understanding Variability
While measures of central tendency provide a snapshot of the data’s center, measures of dispersion explain the spread or variability in the data.
Key Metrics
- Variance (σ² for population, s² for sample):
  - Measures the average squared deviation from the mean.
  - Population formula: σ² = ∑(xᵢ - μ)² / N
  - Sample formula: s² = ∑(xᵢ - x̄)² / (n - 1)
- Standard Deviation (σ for population, s for sample):
  - Square root of the variance; expresses spread in the same units as the data.
- Skewness:
  - Describes the asymmetry of the data distribution.
  - Positive skew: longer tail on the right.
  - Negative skew: longer tail on the left.
Python Example:

```python
# Continuing with the same data and imports as above
std_dev = np.std(data, ddof=1)   # sample standard deviation (ddof=1)
variance = np.var(data, ddof=1)  # sample variance
print(f"Standard Deviation: {std_dev}, Variance: {variance}")
```
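To connect the formulas to code, here's a quick sketch that computes the sample variance directly from the formula above and checks the direction of skew with `scipy.stats.skew`:

```python
import numpy as np
from scipy.stats import skew

data = [12, 15, 14, 10, 12, 17, 18]

# Sample variance straight from the formula: s² = ∑(xᵢ - x̄)² / (n - 1)
x_bar = np.mean(data)
s2 = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)
print(f"Manual sample variance: {s2:.3f}")  # matches np.var(data, ddof=1)

# Skewness: a positive value means the right tail is longer
print(f"Skewness: {skew(data):.3f}")
```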
3. Probability Distributions: Modeling Data Behavior
Probability distributions describe how the values of a random variable are distributed.
Probability Functions
- Probability Mass Function (PMF):
  - For discrete random variables.
  - Example: Rolling a die.
- Probability Density Function (PDF):
  - For continuous random variables.
  - Example: Heights of individuals.
- Cumulative Distribution Function (CDF):
  - Gives the probability that a variable takes a value less than or equal to x.
Python Example:

```python
import numpy as np
from scipy.stats import norm

# PDF and CDF for a standard normal distribution
x = np.linspace(-3, 3, 100)
pdf = norm.pdf(x, loc=0, scale=1)
cdf = norm.cdf(x, loc=0, scale=1)

print(f"PDF at x=1: {norm.pdf(1)}")
print(f"CDF at x=1: {norm.cdf(1)}")
```
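For the discrete side, `scipy.stats.randint` provides a discrete uniform distribution, which models the die-roll example from the PMF bullet:

```python
from scipy.stats import randint

# Discrete uniform distribution as a model for a fair six-sided die
die = randint(1, 7)  # support is {1, 2, ..., 6} (upper bound exclusive)

print(f"PMF at 3: {die.pmf(3):.4f}")  # 1/6 ≈ 0.1667
print(f"CDF at 3: {die.cdf(3):.4f}")  # P(X <= 3) = 0.5
```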
Types of Distributions
- Normal/Gaussian Distribution:
  - Symmetric, bell-shaped curve.
  - Examples: Heights, exam scores.
- Binomial Distribution:
  - Number of successes in n independent Bernoulli trials.
  - Example: Flipping a coin.
- Poisson Distribution:
  - Probability of a given number of events occurring in a fixed interval.
  - Example: Number of emails received per hour.
- Log-Normal Distribution:
  - A distribution of a variable whose logarithm is normally distributed.
- Power Law Distribution:
  - A few values dominate while the rest form a long tail.
  - Examples: Wealth distribution, internet traffic.
Python Example for Normal Distribution:

```python
import numpy as np
import matplotlib.pyplot as plt

# 1000 draws from a standard normal distribution
samples = np.random.normal(0, 1, 1000)
plt.hist(samples, bins=30, density=True, alpha=0.6, color='g')
plt.title('Normal Distribution')
plt.show()
```
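The binomial and Poisson examples above can be sketched the same way; the parameters here (10 coin flips, an average of 2 emails per hour) are illustrative:

```python
from scipy.stats import binom, poisson

# Binomial: P(exactly 5 heads in 10 fair coin flips)
p_heads = binom.pmf(5, n=10, p=0.5)
print(f"P(5 heads in 10 flips): {p_heads:.4f}")

# Poisson: P(exactly 3 emails in an hour, given an average of 2 per hour)
p_emails = poisson.pmf(3, mu=2)
print(f"P(3 emails | mean 2/hr): {p_emails:.4f}")
```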
4. Inferential Statistics: Generalizing Insights
Inferential statistics enable us to make inferences about a population based on a sample.
Key Topics
- Point Estimation:
  - A single best guess for a population parameter.
- Confidence Intervals:
  - A range of values within which the parameter is expected to lie, at a stated confidence level.
- Hypothesis Testing:
  - Null Hypothesis (H₀): The default assumption.
  - Alternative Hypothesis (Hₐ): The claim the test seeks evidence for.
  - P-Value: The probability of observing results at least as extreme as the current ones, assuming H₀ is true.
- Student’s t-Distribution:
  - Used in place of the normal distribution when samples are small and the population variance is unknown.
Python Example for Hypothesis Testing:

```python
from scipy.stats import ttest_1samp

# Sample data: does the sample mean differ from the hypothesized population mean?
data = [1.83, 1.91, 1.76, 1.77, 1.89]
mean_population = 1.80

stat, p_value = ttest_1samp(data, mean_population)
print(f"T-statistic: {stat}, P-value: {p_value}")
```
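Point estimation and confidence intervals can be sketched with the same sample. Since n = 5 is small, this uses the t-distribution rather than the normal (a sketch, assuming the underlying population is roughly normal):

```python
import numpy as np
from scipy import stats

data = [1.83, 1.91, 1.76, 1.77, 1.89]

mean = np.mean(data)     # point estimate of the population mean
sem = stats.sem(data)    # standard error of the mean (ddof=1 by default)

# 95% confidence interval for the mean via the t-distribution
ci = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"Point estimate: {mean:.3f}, 95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```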
5. Central Limit Theorem (CLT)
The CLT states that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the population's underlying distribution.
Python Example:

```python
import numpy as np
import matplotlib.pyplot as plt

# Means of 1000 samples (n=30) drawn from a uniform (non-normal) population
sample_means = [np.mean(np.random.randint(1, 100, 30)) for _ in range(1000)]
plt.hist(sample_means, bins=30, density=True, alpha=0.6, color='b')
plt.title('Central Limit Theorem')
plt.show()
```
Final Thoughts
This week has been a thorough exploration of the fascinating (and sometimes overwhelming) world of statistics. From summarizing data to understanding distributions and making inferences, it’s been an enlightening journey. Stay tuned as I continue to tackle data science’s many challenges, one Python snippet at a time.