Mukesh Mithrakumar
Common Probability Distributions with Tensorflow 2.0

A probability distribution is a function that describes how likely you are to obtain the different possible values of a random variable.

Following are a few examples of popular distributions.

3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter ϕ∈[0,1], which gives the probability of the random variable being equal to 1. It has the following properties:

P(x = 1) = ϕ
P(x = 0) = 1 − ϕ
P(x = x) = ϕ^x (1 − ϕ)^(1−x)
E_x[x] = ϕ
Var_x(x) = ϕ(1 − ϕ)

The Bernoulli distribution is a special case of the binomial distribution in which there is only one trial; a binomial distribution is the sum of independent and identically distributed Bernoulli random variables. For example, suppose you do a single coin toss where the probability of getting heads is p. The random variable that represents your winnings after one coin toss is a Bernoulli random variable. Now suppose you want to know the probability of landing a given number of heads in 100 tosses; this is where Bernoulli trials come in: in general, if there are n Bernoulli trials, then the sum of those trials follows a binomial distribution with parameters n and p. Below, we sample 1,000 trials from a Bernoulli distribution and plot the resulting outcomes.

import tensorflow as tf
import tensorflow_probability as tfp
import seaborn as sns
import matplotlib.pyplot as plt

tfd = tfp.distributions

# Create a Bernoulli distribution with a probability of .5 and draw 1000 samples
bernoulli_distribution = tfd.Bernoulli(probs=.5)
bernoulli_trials = bernoulli_distribution.sample(1000)

# Plot the Bernoulli distribution (color_b is assumed to be a plot color defined earlier in the notebook)
sns.distplot(bernoulli_trials, color=color_b)

# Properties of Bernoulli distribution
property_1 = bernoulli_distribution.prob(1)
print("P(x = 1) = {}".format(property_1))

property_2 = bernoulli_distribution.prob(0)
print("P(x = 0) = 1 - {} = {}".format(property_1, property_2))

print("Property three is a generalization of property 1 and 2")

print("For Bernoulli distribution The expected value of a Bernoulli random variable  X is p (E[X] = p)")

# Variance is calculated as Var = E[(X - E[X])**2] = p(1 - p)
property_5 = bernoulli_distribution.variance()
print("Var(x) = {0} * (1 - {0}) = {1}".format(property_1, property_5))

P(x = 1) = 0.5
P(x = 0) = 1 - 0.5 = 0.5
Property three is a generalization of property 1 and 2
The expected value of a Bernoulli random variable X is p (E[X] = p)
Var(x) = 0.5 * (1 - 0.5) = 0.25

Bernoulli Distribution
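
To connect this back to the 100-toss question above, here is a minimal sketch (not part of the original notebook) using tfd.Binomial from TensorFlow Probability; it reuses the tfd alias defined earlier.

# Sketch: the sum of 100 fair-coin Bernoulli trials follows a Binomial(n = 100, p = 0.5) distribution
binomial_distribution = tfd.Binomial(total_count=100., probs=.5)

# Probability of landing exactly 50 heads in 100 tosses
print("P(exactly 50 heads in 100 tosses) = {}".format(binomial_distribution.prob(50.).numpy()))

# The mean and variance follow n*p and n*p*(1 - p)
print("E[X] = {}, Var(X) = {}".format(binomial_distribution.mean().numpy(),
                                      binomial_distribution.variance().numpy()))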

3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite. The multinoulli distribution is a special case of the multinomial distribution, which is a generalization of the binomial distribution. A multinomial distribution is the distribution over vectors in {0, ⋯, n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution.

# For a fair die
p = [1/6.]*6

# Multinomial distribution with 60 trials, sampled once
multinoulli_distribution = tfd.Multinomial(total_count=60., probs=p)
multinoulli_pdf = multinoulli_distribution.sample(1)

print("""Dice throw values: {}
In sixty trials, index 0 represents the times the dice landed on 1 (= {} times) and
index 1 represents the times the dice landed on 2 (= {} times)\n""".format(multinoulli_pdf,
                                                                           multinoulli_pdf[0][0],
                                                                           multinoulli_pdf[0][1]))

g = sns.distplot(multinoulli_pdf, color=color_b)
plt.grid()

Dice throw values: [[ 8. 10. 13. 12.  9.  8.]]
In sixty trials, index 0 represents the times the dice landed on 1 (= 8.0 times) and
index 1 represents the times the dice landed on 2 (= 10.0 times)

Multinomial Distribution

There are other discrete distributions, such as the following (two of which are sketched after the list):

  • Hypergeometric Distribution: models sampling without replacement
  • Poisson Distribution: expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
  • Geometric Distribution: counts the number of Bernoulli trials needed to get one success.
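
Here is a minimal sketch (not in the original notebook) instantiating two of these with tfp.distributions; note that TFP's Geometric counts the number of failures before the first success rather than the total number of trials.

# Sketch: two of the discrete distributions listed above
# Poisson: on average 3 events per interval; probability of observing exactly 5 events
poisson = tfd.Poisson(rate=3.)
print("P(5 events | rate = 3) = {}".format(poisson.prob(5.).numpy()))

# Geometric: in TFP this counts the number of failures before the first success (p = 0.5)
geometric = tfd.Geometric(probs=0.5)
print("P(2 failures before the first success) = {}".format(geometric.prob(2.).numpy()))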

Since this is not an exhaustive introduction to distributions, I present only the major ones; if you are curious and want to learn more, take a look at the references I mention at the end of the notebook.

Next, we will take a look at some continuous distributions.

3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

N(x; μ, σ²) = √(1/(2πσ²)) exp(−(1/(2σ²)) (x − μ)²)

The two parameters μ∈R and σ∈(0,∞) control the normal distribution. The parameter μ gives the coordinate of the central peak. This is also the mean of the distribution: E[x]=μ. The standard deviation of the distribution is given by σ, and the variance by σ^2.

# We use linspace to create a range of values from -8 to 8 with increments of (stop - start) / (num - 1)
rand_x= tf.linspace(start=-8., stop=8., num=150)

# Gaussian distribution with a standard deviation of 1 and mean 0
sigma = float(1.)
mu = float(0.)
gaussian_pdf = tfd.Normal(loc=mu, scale=sigma).prob(rand_x)

# convert tensors into numpy ndarrays for plotting
[rand_x_, gaussian_pdf_] = evaluate([rand_x, gaussian_pdf])

# Plot of the Gaussian distribution
plt.plot(rand_x_, gaussian_pdf_, color=color_b)
plt.fill_between(rand_x_, gaussian_pdf_, color=color_b)
plt.grid()

Gaussian Distribution

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

  1. Many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed; a quick demonstration follows this list.
  2. Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model.
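
To make the first reason concrete, here is a minimal sketch (not part of the original notebook) that sums independent uniform random variables; the resulting histogram is approximately bell-shaped. It assumes the tf, tfd, sns, and plt aliases and the color_b plot color defined earlier in the notebook.

# Sketch: central limit theorem demonstration
# Each row sums 30 independent Uniform(0, 1) samples; the sums are approximately normally distributed
uniform_sums = tf.reduce_sum(tfd.Uniform().sample([1000, 30]), axis=1)

sns.distplot(uniform_sums.numpy(), color=color_b)
plt.title("Sums of 30 independent Uniform(0, 1) samples")
plt.grid()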

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parameterized with a positive definite symmetric matrix Σ:

N(x; μ, Σ) = √(1/((2π)^n det(Σ))) exp(−½ (x − μ)^⊤ Σ^(−1) (x − μ))

The parameter μ still gives the mean of the distribution, though now it is vector valued. The parameter Σ gives the covariance matrix of the distribution.

# Create a bivariate normal distribution with mean 0 and a standard deviation of 2 in each dimension
mvn = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag=[2., 2.])

# Take 1000 samples from the distribution
samples = mvn.sample(1000)

# Plot the multivariate distribution
g = sns.jointplot(samples[:, 0], samples[:, 1], kind='scatter', color=color_b)
plt.show()

multivariate normal distribution

3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

p(x;λ) = λ1_(x≥0) exp(−λx)

The exponential distribution uses the indicator function 1_(x≥0) to assign probability zero to all negative values of x.

# We use linspace to create a range of values from 0 to 4 with increments of (stop - start) / (num - 1)
a = tf.linspace(start=0., stop=4., num=41)

# The tf.newaxis expression is used to increase the dimension of the existing array by one
a = a[..., tf.newaxis]

# Create an Exponential distribution with rate 1 and calculate the PDF over a
expo_pdf = tfd.Exponential(rate=1.).prob(a)

# convert tensors into numpy ndarrays for plotting
[a_, expo_pdf_] = evaluate([a,expo_pdf])

# Plot of Exponential distribution
plt.figure(figsize=(12.5, 4))
plt.plot(a_.T[0], expo_pdf_.T[[0]][0], color=color_sb)
plt.fill_between(a_.T[0], expo_pdf_.T[[0]][0],alpha=.33, color=color_b)
plt.title(r"Probability density function of Exponential distribution with $\lambda$ = 1")
plt.grid()

Exponential distribution

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point μ is the Laplace distribution:

Laplace(x; μ, γ) = (1/(2γ)) exp(−|x − μ| / γ)

# We use linspace to create a range of values from 0 to 4 with increments of (stop - start) / (num - 1)
a = tf.linspace(start=0., stop=4., num=41)

# The tf.newaxis expression is used to increase the dimension of the existing array by one
a = a[..., tf.newaxis]

# Create a Laplace distribution with location 1 and scale 1, and calculate the PDF over a
laplace_pdf = tfd.Laplace(loc=1., scale=1.).prob(a)

# convert tensors into numpy ndarrays for plotting
[a_, laplace_pdf_] = evaluate([a, laplace_pdf])

# Plot of laplace distribution
plt.figure(figsize=(12.5, 4))
plt.plot(a_.T[0], laplace_pdf_.T[[0]][0], color=color_sb)
plt.fill_between(a_.T[0], laplace_pdf_.T[[0]][0],alpha=.33, color=color_b)
plt.title(r"Probability density function of Laplace distribution")
plt.grid()

Laplace distribution

3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

p(x) = δ(x−μ)

The Dirac delta function is defined such that it is zero valued everywhere except 0, yet integrates to 1. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than zero.

By defining p(x) to be δ(x) shifted by −μ, we obtain an infinitely narrow and infinitely high peak of probability mass where x = μ.

"""
There is no dirac distribution in tensorflow, you will be able to plot using the fast fourier transform in
the tf.signal but that would take us outside the scope of the book so we use the normal distribution
to plot a dirac distribution. Play around with the delta and mu values to see how the distribution moves.
"""

# We use linspace to create a range of values from -8 to 8 with increments of (stop - start) / (num - 1)
rand_x= tf.linspace(start=-8., stop=8., num=150)

# Gaussian distribution with a standard deviation of 1/6 and mean 2
delta = float(1./6.)
mu = float(2.)
dirac_pdf = tfd.Normal(loc=mu, scale=delta).prob(rand_x)

# convert tensors into numpy ndarrays for plotting
[rand_x_, dirac_pdf_] = evaluate([rand_x, dirac_pdf])

# Plot of the dirac distribution
plt.plot(rand_x_, dirac_pdf_, color=color_sb)
plt.fill_between(rand_x_, dirac_pdf_, color=color_b)
plt.grid()

Dirac distribution

A common use of the Dirac delta distribution is as a component of an empirical distribution:

p̂(x) = (1/m) ∑_{i=1}^{m} δ(x − x^(i))

which puts probability mass 1/m on each of the m points x^(1), ⋯, x^(m) forming a given dataset or collection of samples.

The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables.

For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated with each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data.
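
TensorFlow Probability also provides tfd.Empirical; here is a minimal sketch (not part of the original notebook) showing the 1/m mass placement on a small toy dataset.

# Sketch: an empirical distribution over a toy dataset of m = 6 points
data = tf.constant([1., 1., 2., 3., 3., 3.])

# tfd.Empirical puts probability mass 1/m on each of the m observed points
empirical = tfd.Empirical(samples=data)

# The mass at a point equals its relative frequency in the data
print("P(x = 3) = {}".format(empirical.prob(3.).numpy()))  # 3 of the 6 points, so 0.5
print("E[x] = {}".format(empirical.mean().numpy()))        # the sample mean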

3.9.6 Mixtures of Distributions

One common way of combining simpler distributions to define a probability distribution is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution should generate the sample is determined by sampling a component identity from a multinoulli distribution:

P(x) = ∑_i P(c = i) P(x | c = i)

where P(c) is the multinoulli distribution over component identities.

"""
We will be creating two variable with two components to plot the mixture of distributions.

The tfd.MixtureSameFamily distribution implements a batch of mixture distribution where all components are from
different parameterizations of the same distribution type. In our example, we will be using tfd.Categorical to
manage the probability of selecting components. Followed by tfd.MultivariateNormalDiag as components.
The MultivariateNormalDiag constructs Multivariate Normal distribution on R^k
"""

num_vars = 2        # Number of variables (`n` in formula).
var_dim = 1         # Dimensionality of each variable `x[i]`.
num_components = 2  # Number of components for each mixture (`K` in formula).
sigma = 5e-2        # Fixed standard deviation of each component.

# Set seed.
tf.random.set_seed(77)

# categorical distribution
categorical = tfd.Categorical(logits=tf.zeros([num_vars, num_components]))

# Choose some random (component) modes.
component_mean = tfd.Uniform().sample([num_vars, num_components, var_dim])

# component distribution for the mixture family
components = tfd.MultivariateNormalDiag(loc=component_mean, scale_diag=[sigma])

# create the mixture same family distribution
distribution_family = tfd.MixtureSameFamily(mixture_distribution=categorical, components_distribution=components)

# Combine the distributions
mixture_distribution = tfd.Independent(distribution_family, reinterpreted_batch_ndims=1)

# Extract a sample from the distribution
samples = mixture_distribution.sample(1000).numpy()

# Plot the distributions
g = sns.jointplot(x=samples[:, 0, 0], y=samples[:, 1, 0], kind="scatter", color=color_b, marginal_kws=dict(bins=50))
plt.show()

Mixtures of distribution

The mixture model allows us to briefly glimpse a concept that will be of paramount importance later—the latent variable. A latent variable is a random variable that we cannot observe directly. Latent variables may be related to x through the joint distribution.

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(x|c=i) are Gaussians. Each component has a separate parametrized mean μ^(i) and covariance Σ^(i). As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components.
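
As a complement to the multivariate example above, here is a minimal sketch (not part of the original notebook) of a one-dimensional Gaussian mixture, using tfd.Categorical for P(c) and tfd.Normal components; color_b is the plot color defined earlier in the notebook.

# Sketch: a one-dimensional Gaussian mixture with two components
gmm = tfd.MixtureSameFamily(
    mixture_distribution=tfd.Categorical(probs=[0.3, 0.7]),               # P(c)
    components_distribution=tfd.Normal(loc=[-1., 1.], scale=[0.5, 0.5]))  # p(x | c = i)

# Evaluate the mixture density on a grid and plot it
x = tf.linspace(-4., 4., 200)
plt.plot(x.numpy(), gmm.prob(x).numpy(), color=color_b)
plt.title("Two-component Gaussian mixture")
plt.grid()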

Some of the other continuous distribution functions include the following (a few of which are sketched after the list):

  • Erlang Distribution: in a Poisson process with rate λ, the waiting time until the k-th event follows an Erlang distribution.
  • Gamma Distribution: generalizes the Erlang distribution to non-integer shape parameters; in a Poisson process with rate λ, it gives the time to the k-th event.
  • Beta Distribution: represents a family of distributions over [0, 1] and is a versatile way to model outcomes that are percentages or proportions.
  • Dirichlet Distribution: a multivariate generalization of the Beta distribution, commonly used as a prior distribution in Bayesian statistics.
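
A minimal sketch (not part of the original notebook) instantiating a few of these with tfp.distributions follows; the parameter values are arbitrary illustrations.

# Sketch: a few of the continuous distributions listed above
gamma = tfd.Gamma(concentration=3., rate=2.)             # time to the 3rd event in a rate-2 Poisson process
beta = tfd.Beta(concentration1=2., concentration0=5.)    # a distribution over proportions in [0, 1]
dirichlet = tfd.Dirichlet(concentration=[1., 1., 1.])    # uniform over 3-component probability vectors

print("Gamma mean = {}".format(gamma.mean().numpy()))        # concentration / rate = 1.5
print("Beta mean = {}".format(beta.mean().numpy()))          # 2 / (2 + 5) ≈ 0.286
print("Dirichlet sample = {}".format(dirichlet.sample().numpy()))  # a random probability vector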

This is section nine of the chapter on Probability and Information Theory of the book Deep Learning with TensorFlow 2.0.

You can read this section and the following topics:

03.00 - Probability and Information Theory
03.01 - Why Probability?
03.02 - Random Variables
03.03 - Probability Distributions
03.04 - Marginal Probability
03.05 - Conditional Probability
03.06 - The Chain Rule of Conditional Probabilities
03.07 - Independence and Conditional Independence
03.08 - Expectation, Variance and Covariance
03.09 - Common Probability Distributions
03.10 - Useful Properties of Common Functions
03.11 - Bayes' Rule
03.12 - Technical Details of Continuous Variables
03.13 - Information Theory
03.14 - Structured Probabilistic Models

at Deep Learning With TF 2.0: 03.00- Probability and Information Theory. You can get the code for this article and the rest of the chapter here. Links to the notebook in Google Colab and Jupyter Binder are at the end of the notebook.
