Bravim Purohit

Week 1: Statistics

Mastering Statistics in a Week: A Sarcastically Professional Dive

This week has been a deep dive into the core of statistics, tackling foundational concepts with technical rigor while sprinkling in a little sarcasm to keep it digestible. Here’s an exhaustive breakdown of my statistical odyssey, complete with detailed theory, practical examples, and Python implementations.


1. Descriptive Statistics: Summarizing the Data

Descriptive statistics are the tools that help us summarize and organize raw data to make it more interpretable. They are the first step in understanding a dataset and lay the groundwork for further analysis.

Types of Data

  1. Nominal Data:

    • Qualitative and unordered categories.
    • Examples: Colors (red, green, blue), brands (Apple, Samsung).
    • Operations: Counting, mode.
  2. Ordinal Data:

    • Qualitative data with a meaningful order, but differences between values are not measurable.
    • Examples: Education levels (high school, bachelor’s, master’s), ratings (poor, fair, good).
    • Operations: Ranking, median.
  3. Interval Data:

    • Quantitative data with meaningful differences but no true zero.
    • Examples: Temperature (Celsius, Fahrenheit).
    • Operations: Addition, subtraction (ratios are not meaningful: 20 °C is not "twice as hot" as 10 °C).
  4. Ratio Data:

    • Quantitative data with a true zero, enabling all arithmetic operations.
    • Examples: Weight, height, income.

Measures of Central Tendency

  • Mean: Arithmetic average of data values.
  • Median: Middle value in a sorted dataset.
  • Mode: Most frequently occurring value in a dataset.

Python Example:

import numpy as np
from scipy import stats

# Sample data
data = [12, 15, 14, 10, 12, 17, 18]

mean = np.mean(data)
median = np.median(data)
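# Recent SciPy versions return a scalar mode, older ones a 1-element array;
# np.atleast_1d handles both, so indexing with [0] no longer raises an error
mode = np.atleast_1d(stats.mode(data).mode)[0]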

print(f"Mean: {mean}, Median: {median}, Mode: {mode}")

2. Measures of Dispersion: Understanding Variability

While measures of central tendency provide a snapshot of the data’s center, measures of dispersion explain the spread or variability in the data.

Key Metrics

  1. Variance (σ² for population, s² for sample):

    • Measures the average squared deviation from the mean.
    • Population formula: σ² = ∑(xᵢ - μ)² / N
    • Sample formula: s² = ∑(xᵢ - x̄)² / (n-1)
  2. Standard Deviation (σ for population, s for sample):

    • Square root of variance; represents the spread in the same units as the data.
  3. Skewness:

    • Describes the asymmetry of the data distribution (strictly a measure of shape rather than spread, but usually reported alongside the metrics above).
    • Positive skew: Tail on the right.
    • Negative skew: Tail on the left.

Python Example:

std_dev = np.std(data, ddof=1)  # Sample standard deviation
variance = np.var(data, ddof=1)  # Sample variance

print(f"Standard Deviation: {std_dev}, Variance: {variance}")
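
To make the formulas above concrete, here is a minimal sketch that computes both variances by hand and compares them with NumPy's `ddof` option. It reuses the sample from the central-tendency example; calling `scipy.stats.skew` for the skewness is my addition, not something the snippet above does.

```python
import numpy as np
from scipy import stats

# Same sample as in the central-tendency example
data = np.array([12, 15, 14, 10, 12, 17, 18])

n = data.size
squared_deviations = (data - data.mean()) ** 2

# Population variance divides by N; sample variance divides by n - 1
pop_var = squared_deviations.sum() / n
sample_var = squared_deviations.sum() / (n - 1)

# NumPy exposes the same choice through ddof (0 = population, 1 = sample)
print(f"Population variance: {pop_var:.3f} vs np.var: {np.var(data, ddof=0):.3f}")
print(f"Sample variance: {sample_var:.3f} vs np.var: {np.var(data, ddof=1):.3f}")

# Skewness: positive means a longer right tail, negative a longer left tail
skewness = stats.skew(data)
print(f"Skewness: {skewness:.3f}")
```

The only difference between the two variances is the divisor, which is why the sample version is always a little larger for the same data.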

3. Probability Distributions: Modeling Data Behavior

Probability distributions describe how the values of a random variable are distributed.

Probability Functions

  1. Probability Mass Function (PMF):

    • For discrete random variables.
    • Example: Rolling a die.
  2. Probability Density Function (PDF):

    • For continuous random variables.
    • Example: Heights of individuals.
  3. Cumulative Distribution Function (CDF):

    • Represents the probability that a variable takes a value less than or equal to x.

Python Example:

from scipy.stats import norm

# PDF and CDF for a normal distribution
x = np.linspace(-3, 3, 100)
pdf = norm.pdf(x, loc=0, scale=1)
cdf = norm.cdf(x, loc=0, scale=1)

print(f"PDF at x=1: {norm.pdf(1)}")
print(f"CDF at x=1: {norm.cdf(1)}")
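
The snippet above covers the continuous case, so here is a small sketch of the discrete one, using the die example from the PMF bullet. Modeling a fair six-sided die with `scipy.stats.randint` is my choice for illustration.

```python
import numpy as np
from scipy.stats import randint

# A fair six-sided die: randint(low, high) covers the integers low..high-1
die = randint(1, 7)
faces = np.arange(1, 7)

pmf = die.pmf(faces)  # probability of each individual face
cdf = die.cdf(faces)  # probability of rolling that face or lower

print(f"P(X = 3): {die.pmf(3):.4f}")
print(f"P(X <= 3): {die.cdf(3):.4f}")
print(f"PMF sums to: {pmf.sum():.1f}")
```

For a discrete variable the CDF is just the running total of the PMF, which is easy to confirm with `np.cumsum(pmf)`.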

Types of Distributions

  1. Normal/Gaussian Distribution:

    • Symmetrical, bell-shaped curve.
    • Examples: Heights, exam scores.
  2. Binomial Distribution:

    • Number of successes in n independent Bernoulli trials, each with the same success probability p.
    • Example: Number of heads in 10 coin flips.
  3. Poisson Distribution:

    • Probability of a given number of events occurring in a fixed interval, when events happen independently at a constant average rate.
    • Example: Number of emails received per hour.
  4. Log-Normal Distribution:

    • A distribution of a variable whose logarithm is normally distributed.
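  5. Power Law Distribution:

    • Probability of a value falls off as a power of that value, p(x) ∝ x^(−α), which produces a heavy tail.
    • Examples: Wealth distribution, internet traffic.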

Python Example for Normal Distribution:

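import numpy as np
import matplotlib.pyplot as plt

# Histogram of 1,000 draws from a standard normal (mean 0, std 1)
samples = np.random.normal(0, 1, 1000)
plt.hist(samples, bins=30, density=True, alpha=0.6, color='g')
plt.title('Normal Distribution')
plt.show()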
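
The other distributions in the list can be sampled the same way. The sketch below uses NumPy's `default_rng` generator; the parameters (10 coin flips, 4 emails per hour, a log-normal with underlying mean 0 and sigma 1) and the seed are arbitrary values I picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility
size = 100_000

# Binomial: number of heads in 10 fair coin flips
heads = rng.binomial(n=10, p=0.5, size=size)

# Poisson: emails per hour at an assumed average rate of 4 per hour
emails = rng.poisson(lam=4, size=size)

# Log-normal: the logarithm of each sample is normally distributed
lognormal_samples = rng.lognormal(mean=0.0, sigma=1.0, size=size)
log_samples = np.log(lognormal_samples)

print(f"Binomial mean (theory: n*p = 5): {heads.mean():.2f}")
print(f"Poisson mean and variance (theory: both 4): {emails.mean():.2f}, {emails.var():.2f}")
print(f"Mean and std of log(samples) (theory: 0 and 1): {log_samples.mean():.2f}, {log_samples.std():.2f}")
```

Comparing the sample statistics with the theoretical values is a quick sanity check that each generator is parameterized the way you think it is.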

4. Inferential Statistics: Generalizing Insights

Inferential statistics enable us to make inferences about a population based on a sample.

Key Topics

  1. Point Estimation:

    • Single best guess for a parameter.
  2. Confidence Intervals:

    • Range of values, computed from the sample, that would contain the true parameter in a stated share (e.g., 95%) of repeated samples.
  3. Hypothesis Testing:

    • Null Hypothesis (H₀): Default assumption.
    • Alternate Hypothesis (Hₐ): The claim you’re looking for evidence to support.
    • P-Value: Probability of observing results at least as extreme as the current ones, assuming H₀ is true.
  4. Student’s T-Distribution:

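    • Used in place of the normal distribution when the sample is small and the population standard deviation is unknown; its tails are heavier to reflect the extra uncertainty.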

Python Example for Hypothesis Testing:

from scipy.stats import ttest_1samp

# Sample data
data = [1.83, 1.91, 1.76, 1.77, 1.89]
mean_population = 1.80

stat, p_value = ttest_1samp(data, mean_population)
print(f"T-statistic: {stat}, P-value: {p_value}")
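
The test above covers hypothesis testing, so here is a rough sketch of the other topics on the list: a point estimate, the t-statistic worked out by hand, and a 95% confidence interval from the t-distribution. It reuses the same sample; the 95% level is my choice.

```python
import numpy as np
from scipy import stats

# Same sample and hypothesized mean as in the t-test example
data = np.array([1.83, 1.91, 1.76, 1.77, 1.89])
mean_population = 1.80

n = data.size
sample_mean = data.mean()            # point estimate of the population mean
sem = data.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# t-statistic by hand: distance from the hypothesized mean in standard errors
t_manual = (sample_mean - mean_population) / sem

# 95% confidence interval from the t-distribution with n - 1 degrees of freedom
low, high = stats.t.interval(0.95, n - 1, loc=sample_mean, scale=sem)

print(f"Point estimate: {sample_mean:.3f}")
print(f"T-statistic (manual): {t_manual:.3f}")
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The manual t-statistic should match the one from `ttest_1samp`. When the 95% interval contains the hypothesized mean, the two-sided test does not reject H₀ at the 5% level.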

5. Central Limit Theorem (CLT)

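The CLT states that the distribution of sample means approaches a normal distribution as the sample size increases, regardless of the shape of the population's distribution, provided the observations are independent and the population has finite variance.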

Python Example:

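import numpy as np
import matplotlib.pyplot as plt

# Means of 1,000 samples, each of 30 integers drawn uniformly from 1 to 99
sample_means = [np.mean(np.random.randint(1, 100, 30)) for _ in range(1000)]
plt.hist(sample_means, bins=30, density=True, alpha=0.6, color='b')
plt.title('Central Limit Theorem')
plt.show()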
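
The histogram shows the bell shape, but the CLT also predicts how wide it should be: the standard deviation of the sample means is roughly the population standard deviation divided by √n. Here is a sketch of that check without plotting; the seed and the 10,000 repetitions are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed

n = 30              # size of each sample, as in the example above
n_samples = 10_000  # number of sample means to collect

# Draw integers uniformly from 1..99 and average each row of n values
draws = rng.integers(1, 100, size=(n_samples, n))
sample_means = draws.mean(axis=1)

# Population mean and standard deviation of a discrete uniform on 1..99
values = np.arange(1, 100)
pop_mean = values.mean()
pop_std = values.std()

# CLT prediction for the spread of the sample means
expected_se = pop_std / np.sqrt(n)

print(f"Mean of sample means: {sample_means.mean():.2f} (population mean: {pop_mean:.2f})")
print(f"Std of sample means: {sample_means.std():.2f} (predicted: {expected_se:.2f})")
```

Increasing `n` should shrink the spread in proportion to 1/√n, which is worth trying with a few values.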

Final Thoughts

This week has been a thorough exploration of the fascinating (and sometimes overwhelming) world of statistics. From summarizing data to understanding distributions and making inferences, it’s been an enlightening journey. Stay tuned as I continue to tackle data science’s many challenges, one Python snippet at a time.
