Understanding different statistical distributions and their properties is crucial for data analysis and modeling. In this blog, we'll explore several types of distributions using Python, including the Bernoulli, uniform, binomial, normal (with a log-normal example), and Poisson distributions. We'll use libraries such as NumPy, Matplotlib, and Seaborn for this purpose. Let's dive in!
A distribution model is a mathematical function that describes the probability of different outcomes or values in a dataset. It helps to understand the patterns and structure of data.
Why Are Distribution Models Used in Machine Learning?
- Understanding Data: Helps in summarizing and describing the dataset.
- Data Generation: Creates synthetic data for testing algorithms.
- Model Assumptions: Many algorithms assume specific data distributions (e.g., normal distribution in linear regression).
- Feature Engineering: Transforms data to meet model assumptions, e.g., applying a log transform to skewed data (see the sketch after this list).
- Probability-Based Models: Used in probabilistic methods like Naive Bayes.
- Evaluation Metrics: Helps in evaluating and improving model performance by understanding error distributions.
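To make the feature engineering point concrete, here is a minimal sketch (the data is synthetic and purely illustrative): a log transform pulls a right-skewed variable towards a roughly symmetric, normal-like shape.
import numpy as np
from scipy.stats import skew
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=2.0, sigma=0.8, size=10000)  # right-skewed synthetic data
transformed = np.log(skewed)  # log transform
print(f"skewness before: {skew(skewed):.2f}")  # strongly positive (right-skewed)
print(f"skewness after: {skew(transformed):.2f}")  # close to 0 (roughly symmetric)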
Some of the distribution models are:
- Bernoulli distribution
- Uniform distribution
- Binomial distribution
- Normal distribution
- Poisson distribution
Bernoulli Distribution
Represents the outcome of a single experiment with two possible outcomes: success (1) or failure (0).
The two outcomes are complementary: if the probability of success is p, the probability of failure is 1 − p.
Example: Flipping a coin once. If heads is considered a success (p = 0.5), the probability of getting heads (success) is 0.5, and the probability of getting tails (failure) is also 0.5.
Sample implementation:
import numpy as np
import matplotlib.pyplot as plt
# A Bernoulli trial is a binomial experiment with a single trial (n=1)
s = np.random.binomial(1, 0.5, 1000)  # 1,000 coin flips with p(success) = 0.5
plt.hist(s, bins=2, color='g')
plt.title('Bernoulli Distribution')
plt.xlabel('Outcome (0 = failure, 1 = success)')
plt.ylabel('Frequency')
plt.show()
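As a quick sanity check, the probabilities themselves can be computed with scipy.stats (a small sketch; SciPy is not imported in the snippet above):
from scipy.stats import bernoulli
p = 0.5
print(bernoulli.pmf(1, p))  # P(success) = 0.5
print(bernoulli.pmf(0, p))  # P(failure) = 0.5
print(bernoulli.mean(p), bernoulli.var(p))  # mean = p, variance = p * (1 - p)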
Uniform Distribution
All outcomes within a certain range are equally likely, i.e., each outcome of the experiment has the same probability of occurring.
For discrete values: P(X = x) = 1/n, where n is the number of possible outcomes.
For continuous values: f(x) = 1/(b − a) for a ≤ x ≤ b, where a and b are the minimum and maximum values.
Example: Rolling a fair six-sided die. Each number (1 through 6) has an equal probability of 1/6.
Sample implementation:
s = np.random.uniform(low=0, high=1, size=1000)  # 1,000 samples drawn uniformly from [0, 1)
plt.hist(s, bins=30, color='r', edgecolor='black', alpha=0.75)
plt.title('Uniform Distribution')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
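The formulas above can also be checked directly with scipy.stats (a sketch; randint models a discrete uniform distribution and uniform a continuous one):
from scipy.stats import randint, uniform
die = randint(low=1, high=7)  # discrete uniform over 1..6 (the upper bound is exclusive)
print(die.pmf(3))  # 1/n = 1/6 ≈ 0.1667
a, b = 0, 4
u = uniform(loc=a, scale=b - a)  # continuous uniform on [a, b]
print(u.pdf(2.5))  # 1/(b − a) = 0.25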
Visualizing the Binomial Distribution with Seaborn
Describes the number of successes in a fixed number of independent Bernoulli trials.
Binomial and Bernoulli may look similar, but they differ:
Bernoulli: You flip a coin once. The distribution tells you the probability of getting heads (success) or tails (failure).
Binomial: You flip a coin 10 times. The distribution tells you the probability of getting a certain number of heads (e.g., exactly 5 heads) out of 10 flips.
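For example, the probability of getting exactly 5 heads in 10 fair flips can be computed directly (a small sketch using scipy.stats):
from scipy.stats import binom
print(binom.pmf(5, n=10, p=0.5))  # P(exactly 5 heads) ≈ 0.246
print(binom.cdf(5, n=10, p=0.5))  # P(at most 5 heads) ≈ 0.623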
Sample implementation:
import seaborn as sns
from scipy.stats import binom
# 1,010 samples of the number of successes in n=17 independent trials with success probability p=0.7
data = binom.rvs(n=17, p=0.7, loc=0, size=1010)
ax = sns.histplot(data, kde=True, color='g', bins=30, stat='density', element='step', linewidth=2.2, alpha=0.7)
plt.title('Binomial Distribution with Seaborn')
plt.xlabel('Number of successes')
plt.ylabel('Density')
plt.show()
Normal Distribution
Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a "bell curve" when graphed.
This distribution is characterized by its mean (average) and standard deviation (which measures the spread of data).
The mean, median, and mode are all equal and located at the center of the distribution, because the bell-shaped curve is symmetric about its single peak. As a reminder:
Mean: The average value of all the data points.
Median: The middle value when all the data points are arranged in ascending order.
Mode: The most frequently occurring value in the data set.
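These properties are easy to verify numerically. Here is a minimal sketch (the parameters are illustrative):
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=100000)  # mean = 50, standard deviation = 10
print(f"mean: {np.mean(x):.2f}")  # ≈ 50
print(f"median: {np.median(x):.2f}")  # ≈ 50, matching the mean
plt.hist(x, bins=60, color='purple')
plt.title('Normal Distribution (mean = 50, std = 10)')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()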
Sample implementation: the snippet below draws from the closely related log-normal distribution (a variable whose logarithm is normally distributed, which is why log transforms tame skewed data) and fits a log-normal model to the samples with SciPy:
import scipy.stats
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1234)
# mean and sigma are the mean and standard deviation of the underlying normal (of the log-values)
samples = np.random.lognormal(mean=1.0, sigma=0.4, size=10000)
# Fit a log-normal model to the samples and overlay its PDF on the histogram
shape, loc, scale = scipy.stats.lognorm.fit(samples, floc=0)
num_bins = 50
counts, edges, patches = plt.hist(samples, bins=num_bins, density=True, color='b')
x = np.linspace(edges[0], edges[-1], 200)
plt.plot(x, scipy.stats.lognorm.pdf(x, shape, loc=loc, scale=scale), 'r', linewidth=2)
plt.title('Log-Normal Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
Poisson Distribution
Describes the number of events occurring within a fixed interval of time or space, where these events happen with a known constant mean rate and independently of the time since the last event.
Example: the number of emails a person receives in an hour. If a person receives an average of 4 emails per hour (λ = 4), the probability of receiving exactly 2 emails in that hour is P(X = 2) = (λ^2 · e^(−λ)) / 2! = (16 · e^(−4)) / 2 ≈ 0.147.
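That value can be verified with scipy.stats (a quick sketch):
from scipy.stats import poisson
print(poisson.pmf(2, mu=4))  # P(exactly 2 emails at a rate of 4 per hour) ≈ 0.1465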
Sample implementation:
s = np.random.poisson(5, 10000)  # 10,000 samples with an average rate of 5 events per interval
plt.hist(s, bins=16, color='b')
plt.title('Poisson Distribution')
plt.xlabel('Number of events')
plt.ylabel('Frequency')
plt.show()
Colab notebook: https://colab.research.google.com/drive/1uKp3FCC5QmQy53fz83eS7hwengOhk9zx?usp=sharing
By understanding these distributions and their properties, we can better analyze and interpret data in various fields such as finance, science, and engineering. Happy analyzing!