DEV Community

Sai Vishwa B


Comparing and Contrasting Popular Probability Distributions: A Practical Approach 📊

Understanding different statistical distributions and their properties is crucial for data analysis and modeling. In this blog, we'll explore several distributions using Python: Bernoulli, uniform, binomial, normal (with a log-normal sample), and Poisson. We'll use NumPy, SciPy, Matplotlib, and Seaborn for this purpose. Let's dive in! 🚀

A distribution model is a mathematical function that describes the probability of different outcomes or values in a dataset. It helps to understand the patterns and structure of data.

Why Are Distribution Models Used in Machine Learning?

  • Understanding Data: Helps in summarizing and describing the dataset.
  • Data Generation: Creates synthetic data for testing algorithms.
  • Model Assumptions: Many algorithms assume specific data distributions (e.g., normally distributed errors in linear regression).
  • Feature Engineering: Transforms data to meet model assumptions (e.g., using logarithms for skewed data).
  • Probability-Based Models: Used in probabilistic methods like Naive Bayes.
  • Evaluation Metrics: Helps in evaluating and improving model performance by understanding error distributions.
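To illustrate the feature-engineering point, here is a minimal sketch of how a log transform can reduce skew. The synthetic `feature` array, the seed, and the log-normal parameters are assumptions made up for this illustration, not data from the article.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed feature, drawn from a log-normal distribution
feature = rng.lognormal(mean=1.0, sigma=0.8, size=10_000)

# Taking logs maps a log-normal variable back to a normal one
log_feature = np.log(feature)

skew_before = skew(feature)
skew_after = skew(log_feature)
print(f"skewness before log: {skew_before:.2f}")
print(f"skewness after log:  {skew_after:.2f}")
```

A skewness near 0 after the transform suggests the data is now roughly symmetric, which is closer to what many models expect.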

Some of the distribution models are:

  1. Bernoulli distribution
  2. Uniform distribution
  3. Binomial distribution
  4. Normal distribution
  5. Poisson distribution

🎯 Bernoulli Distribution

Represents the outcome of a single experiment with two possible outcomes: success (1) or failure (0).

$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$

The two outcomes are complementary: if success has probability p, failure has probability 1 - p.

Example: Flipping a coin once. If heads is considered a success (p = 0.5), the probability of getting heads (success) is 0.5, and the probability of getting tails (failure) is also 0.5.

Sample implementation:

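import numpy as np
import matplotlib.pyplot as plt

# A Bernoulli trial is a binomial draw with a single trial (n=1)
s = np.random.binomial(1, 0.5, 1000)
# Two bins, centered on the outcomes 0 and 1
plt.hist(s, bins=[-0.5, 0.5, 1.5], color='g', rwidth=0.8)
plt.xticks([0, 1])
plt.title('Bernoulli Distribution')
plt.xlabel('Outcome (0 = failure, 1 = success)')
plt.ylabel('Frequency')
plt.show()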
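As a quick numerical check of the coin-flip example, the sketch below compares the theoretical Bernoulli probabilities with the observed share of heads in simulated flips. It uses `scipy.stats.bernoulli`; the seed and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.5  # probability of success (heads)
rng = np.random.default_rng(42)

# Theoretical probabilities from the Bernoulli PMF
prob_success = bernoulli.pmf(1, p)
prob_failure = bernoulli.pmf(0, p)

# 10,000 simulated single flips, each either 0 or 1
flips = rng.binomial(n=1, p=p, size=10_000)
empirical_success = flips.mean()

print(f"P(success) = {prob_success}, P(failure) = {prob_failure}")
print(f"observed share of heads: {empirical_success:.3f}")
```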

🔀 Uniform Distribution

All outcomes within a certain range are equally likely: each outcome of the experiment has the same probability of occurring.

For discrete values:

$$P(X = x) = \frac{1}{n}$$

where n is the number of possible outcomes.

For continuous values:

$$f(x) = \frac{1}{b - a}, \quad a \le x \le b$$

where a and b are the minimum and maximum values.

Example: Rolling a fair six-sided die. Each number (1 through 6) has an equal probability of 1/6.

Sample implementation:

s = np.random.uniform(low=0, high=1, size=1000)
plt.hist(s, bins=30, color='r', edgecolor='black', alpha=0.75)
plt.title('Uniform Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
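The sample above draws continuous values, while the die example is the discrete case. A minimal sketch of the die, assuming NumPy's `Generator.integers` as the sampler and an arbitrary seed and roll count:

```python
import numpy as np

rng = np.random.default_rng(7)

# integers() excludes the upper bound, so high=7 gives faces 1..6
rolls = rng.integers(low=1, high=7, size=60_000)

# Count how often each face appears and convert to relative frequencies
faces, counts = np.unique(rolls, return_counts=True)
frequencies = counts / rolls.size

for face, freq in zip(faces, frequencies):
    print(f"face {face}: {freq:.3f} (expected {1/6:.3f})")
```

Each observed frequency should land close to 1/6, and the match tightens as the number of rolls grows.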

📊 Visualizing Binomial Distribution with Seaborn

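Describes the number of successes in a fixed number of independent Bernoulli trials, each with the same success probability p.

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$

where n is the number of trials.

Binomial and Bernoulli may look similar, but they differ in the number of trials: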

Bernoulli: You flip a coin once. The distribution tells you the probability of getting heads (success) or tails (failure).

Binomial: You flip a coin 10 times. The distribution tells you the probability of getting a certain number of heads (e.g., exactly 5 heads) out of 10 flips.
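To make the difference concrete, here is a small sketch that builds binomial counts by summing individual Bernoulli flips, then compares the simulated probability of exactly 5 heads in 10 flips with the exact value from `scipy.stats.binom`. The seed and the number of experiments are assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import binom

n, p = 10, 0.5
rng = np.random.default_rng(1)

# 20,000 experiments, each made of 10 single Bernoulli coin flips
flips = rng.binomial(n=1, p=p, size=(20_000, n))

# Summing the flips in each row gives one binomial outcome per experiment
heads_per_experiment = flips.sum(axis=1)

exact_prob = binom.pmf(5, n, p)  # C(10, 5) / 2**10
simulated_prob = np.mean(heads_per_experiment == 5)

print(f"exact P(5 heads):     {exact_prob:.4f}")
print(f"simulated P(5 heads): {simulated_prob:.4f}")
```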

Sample implementation:

import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import binom

# 1010 draws, each counting successes in 17 trials with p = 0.7
data = binom.rvs(n=17, p=0.7, loc=0, size=1010)
# discrete=True centers one bin on each integer count (no empty gaps)
ax = sns.histplot(data, discrete=True, color='g', stat='density', element='step', linewidth=2.2, alpha=0.7)
plt.title('Binomial Distribution with Seaborn')
plt.xlabel('Number of successes')
plt.ylabel('Density')
plt.show()

🌟 Normal Distribution

Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a "bell curve" when graphed.

This distribution is characterized by its mean (average) and standard deviation (which measures the spread of data).

The mean, median, and mode are all equal and located at the center of the distribution. This equality follows from the symmetric, single-peaked bell shape of the normal distribution. As a reminder:

Mean: The average value of all the data points.

Median: The middle value when all the data points are arranged in ascending order.

Mode: The most frequently occurring value in the data set.

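$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

where μ is the mean and σ is the standard deviation.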

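Sample implementation (this example draws from a log-normal distribution, i.e. a variable whose logarithm is normally distributed, so the histogram is right-skewed rather than bell-shaped):

import scipy.stats
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1234)
# mean and sigma describe the underlying normal distribution of log(samples)
samples = np.random.lognormal(mean=1.0, sigma=0.4, size=10000)
# Fit with loc fixed at 0: shape estimates sigma, scale estimates exp(mean)
shape, loc, scale = scipy.stats.lognorm.fit(samples, floc=0)
num_bins = 50
counts, edges, patches = plt.hist(samples, bins=num_bins, color='b')
plt.title('Log-Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()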
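For comparison, here is a minimal sketch that samples from the normal distribution itself and checks that the sample mean and median nearly coincide, as the symmetry argument suggests. The values of `mu` and `sigma` and the seed are illustrative assumptions; `scipy.stats.norm.fit` returns the estimated mean and standard deviation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1234)
mu, sigma = 1.0, 0.4  # illustrative mean and standard deviation

samples = rng.normal(loc=mu, scale=sigma, size=10_000)

# For a symmetric distribution the mean and median should nearly coincide
sample_mean = samples.mean()
sample_median = np.median(samples)

# Maximum-likelihood estimates of the mean and standard deviation
fitted_mu, fitted_sigma = norm.fit(samples)

print(f"mean = {sample_mean:.3f}, median = {sample_median:.3f}")
print(f"fitted mu = {fitted_mu:.3f}, fitted sigma = {fitted_sigma:.3f}")
```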

🎲 Poisson Distribution

Describes the number of events occurring within a fixed interval of time or space, where these events happen with a known constant mean rate and independently of the time since the last event. 🌟

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$

Example: The number of emails a person receives in an hour. If a person receives an average of 4 emails per hour (λ = 4), the probability of receiving exactly 2 emails in an hour is:

$$P(X = 2) = \frac{4^2 e^{-4}}{2!} \approx 0.1465$$
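The email example can be checked numerically. This sketch evaluates the Poisson formula by hand and compares it with `scipy.stats.poisson.pmf`; the variable names are my own.

```python
import math
from scipy.stats import poisson

lam = 4  # average number of emails per hour
k = 2    # number of emails we ask about

# Poisson PMF written out: lam^k * e^(-lam) / k!
by_hand = lam**k * math.exp(-lam) / math.factorial(k)

# Same quantity from SciPy
from_scipy = poisson.pmf(k, lam)

print(f"P(X = 2) = {by_hand:.4f}")  # prints P(X = 2) = 0.1465
```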

Sample implementation:

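# lam=4 matches the emails-per-hour example (an average of 4 events per interval)
s = np.random.poisson(4, 10000)
# One bin per integer count, centered on the integer
bins = np.arange(s.max() + 2) - 0.5
plt.hist(s, bins=bins, color='b')
plt.title('Poisson Distribution')
plt.xlabel('Number of events')
plt.ylabel('Frequency')
plt.show()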

Colab notebook: https://colab.research.google.com/drive/1uKp3FCC5QmQy53fz83eS7hwengOhk9zx?usp=sharing

By understanding these distributions and their properties, we can better analyze and interpret data in various fields such as finance, science, and engineering. Happy analyzing! 🧠📈
