Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. In the context of machine learning, we can also apply information theory to continuous variables, where some of the message-length interpretations from its origins in communication no longer apply.

The basic intuition behind information theory is that a likely event should have low information content, a less likely event should have higher information content, and independent events should have additive information content.

Let me give you a simple example. Say you have a male friend who is head over heels in love with a girl, so he asks her out pretty much every week, and there's a 99% chance she says no. Since you're his best friend, he texts you every time after he asks her out to let you know what happened: "Hey guess what she said, NO 😭😭😭". This is of course wasteful: considering he has a very low chance, it makes more sense for your friend to just send "😭", and if she ever says yes, he can send a longer text. That way, the number of bits used to convey the message (and your corresponding data bill) is minimized. P.S. Don't tell your friend he has a low chance; that's how you lose friends 😬.

To satisfy these properties, we define the **self-information** of an event **x**=x to be:

I(x) = −log P(x)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of **I(x)** is therefore written in units of **nats**. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called **bits** or **shannons**; information measured in bits is just a rescaling of information measured in nats.
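To make the units concrete, here is a quick sketch in plain Python (the function name `self_information` is mine) that computes the self-information of a fair coin flip in nats and rescales it to bits:

```python
import math

def self_information(p):
    """Self-information I(x) = -log P(x) in nats (natural log, as in the text)."""
    return -math.log(p)

nats = self_information(0.5)
bits = nats / math.log(2)  # rescale nats to bits
print(f"{nats:.4f} nats = {bits:.4f} bits")  # 0.6931 nats = 1.0000 bits

# A likely event carries very little information:
print(f"{self_information(0.99):.4f} nats")
```

Observing the fair coin gives log 2 ≈ 0.693 nats, which is exactly 1 bit, while the 99%-likely event carries only about 0.01 nats.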

```
"""
No matter what combination of tosses you get, the entropy stays the same, but if you change the
probability of the trial, the entropy changes. Play around with the probs and see whether the
increase or decrease makes sense.
"""
import tensorflow_probability as tfp
import seaborn as sns
import matplotlib.pyplot as plt

tfd = tfp.distributions

coin_entropy = [0]                              # creating the coin entropy list
for i in range(10, 11):
    coin = tfd.Bernoulli(probs=0.5)             # Bernoulli distribution
    coin_sample = coin.sample(i)                # draw i samples
    coin_entropy.append(coin.entropy())         # append the entropy of the distribution
    sns.distplot(coin_entropy, hist=False, kde_kws={"shade": True})  # plot of the entropy
print("Entropy of the coin in nats: {} \nFor 10 tosses: {}".format(coin_entropy[1], coin_sample))
plt.grid()

# Entropy of the coin in nats: 0.6931471824645996
# For 10 tosses: [0 1 1 1 0 1 1 1 0 1]
```

Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the **Shannon entropy**:

H(x) = E_(x∼P)[I(x)] = −E_(x∼P)[log P(x)]

also denoted as **H(P)**.

Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. When **x** is continuous, the Shannon entropy is known as the **differential entropy**.

```
"""
Note: for a Bernoulli distribution the Shannon entropy has the closed form
H(p) = -p log p - (1 - p) log(1 - p); if you change the distribution,
you need to compute the expectation accordingly.
"""
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def shannon_entropy_func(p):
    """Calculates the Shannon entropy of a Bernoulli distribution.

    Arguments:
        p : a Bernoulli distribution.
    Returns:
        The Shannon entropy in nats.
    """
    prob = p.mean()  # for a Bernoulli, the mean is the success probability
    return -(prob * tf.math.log(prob) + (1. - prob) * tf.math.log(1. - prob))

# Create a Bernoulli distribution
bernoulli_distribution = tfd.Bernoulli(probs=.5)

# Use TFP's entropy method to calculate the entropy of the distribution
shannon_entropy = bernoulli_distribution.entropy()
print("TFPs entropy: {} matches with the Shannon Entropy Function we wrote: {}".format(
    shannon_entropy, shannon_entropy_func(bernoulli_distribution)))

# TFPs entropy: 0.6931471824645996 matches with the Shannon Entropy Function we wrote: 0.6931471824645996
```

Entropy is remarkable not for its interpretation but for its properties. For example, unlike variance, entropy doesn't care about the actual values *x* takes; it only considers their probabilities. So if we increase the number of values *x* may take, the probabilities become less concentrated and the entropy increases.

```
# You can see below that by widening the range of values x can take, we increase the entropy
import tensorflow_probability as tfp
import matplotlib.pyplot as plt

tfd = tfp.distributions

shannon_list = []
for i in range(1, 20):
    uniform_distribution = tfd.Uniform(low=0.0, high=float(i))  # create a uniform distribution on [0, i)
    shannon_entropy = uniform_distribution.entropy()            # entropy of Uniform(0, i) is log(i)
    shannon_list.append(shannon_entropy)                        # append the result to the list

# Plot of the Shannon entropies
plt.hist(shannon_list)
plt.grid()
```

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the **Kullback-Leibler (KL) divergence**:

D_KL(P∥Q) = E_(x∼P)[log(P(x)/Q(x))] = E_(x∼P)[log P(x) − log Q(x)]

In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

```
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def kl_func(p, q):
    """Calculates the KL divergence of two normal distributions (closed form).

    Arguments:
        p : Normal distribution p.
        q : Normal distribution q.
    Returns:
        The divergence value.
    """
    r = p.loc - q.loc
    return (tf.math.log(q.scale) - tf.math.log(p.scale)
            - .5 * (1. - (p.scale**2 + r**2) / q.scale**2))

# We create two normal distributions
p = tfd.Normal(loc=1., scale=1.)
q = tfd.Normal(loc=0., scale=2.)

# Using TFP's KL divergence
kl = tfd.kl_divergence(p, q)
print("TFPs KL_Divergence: {} matches with the KL Function we wrote: {}".format(kl, kl_func(p, q)))

# TFPs KL_Divergence: 0.4431471824645996 matches with the KL Function we wrote: 0.4431471824645996
```

The KL divergence has many useful properties, most notably being nonnegative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.
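We can check both of these properties numerically. The sketch below uses a plain-Python helper (`gaussian_kl`, my name for the same closed form as `kl_func` above) on the P = N(1, 1), Q = N(0, 2) pair from the previous example:

```python
import math

def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form D_KL(P||Q) for two univariate Gaussians."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
            - 0.5)

# Nonnegative for the P = N(1, 1), Q = N(0, 2) pair used above
kl_pq = gaussian_kl(1., 1., 0., 2.)
print(f"D_KL(P||Q) = {kl_pq:.4f}")  # 0.4431, and always >= 0

# Exactly zero when P and Q are the same distribution
kl_pp = gaussian_kl(1., 1., 1., 1.)
print(f"D_KL(P||P) = {kl_pp:.4f}")  # 0.0000
```

Swapping the arguments (`gaussian_kl(0., 2., 1., 1.)`) gives a different value, since the KL divergence is not symmetric in its arguments.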

A quantity that is closely related to the KL divergence is the **cross-entropy** H(P, Q) = H(P) + D_KL(P∥Q), which is similar to the KL divergence but lacks the H(P) term:

H(P, Q) = −E_(x∼P)[log Q(x)]

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.

```
"""
cross_entropy computes Shannon's cross entropy, defined as:
H[P, Q] = E_P[-log q(X)] = -int_F p(x) log q(x) dr(x)
"""
import tensorflow_probability as tfp

tfd = tfp.distributions

# We create two normal distributions
p = tfd.Normal(loc=1., scale=1.)
q = tfd.Normal(loc=0., scale=2.)

# Calculating the cross entropy H(P, Q) = E_(x~P)[-log Q(x)]
cross_entropy = p.cross_entropy(q)
print("TFPs cross entropy: {}".format(cross_entropy))

# TFPs cross entropy: ≈ 1.8621 (= H(P) + D_KL(P||Q) = 1.4189 + 0.4431)
```
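The identity H(P, Q) = H(P) + D_KL(P∥Q) can also be verified by hand using the closed forms for univariate Gaussians (a sketch, reusing the same P = N(1, 1) and Q = N(0, 2) as above; all variable names here are mine):

```python
import math

# P = N(1, 1), Q = N(0, 2), as in the examples above
mu_p, sigma_p, mu_q, sigma_q = 1., 1., 0., 2.

# Differential entropy of a Gaussian: H(P) = 0.5 * log(2 * pi * e * sigma^2)
h_p = 0.5 * math.log(2 * math.pi * math.e * sigma_p**2)

# Closed-form KL divergence between two Gaussians
kl_pq = (math.log(sigma_q / sigma_p)
         + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2) - 0.5)

# Closed-form cross-entropy H(P, Q) = E_(x~P)[-log Q(x)]
h_pq = (0.5 * math.log(2 * math.pi * sigma_q**2)
        + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2))

print(f"H(P) + D_KL(P||Q) = {h_p + kl_pq:.4f}")  # 1.8621
print(f"H(P, Q)           = {h_pq:.4f}")         # 1.8621 -- the identity holds
```

Since H(P) does not involve Q at all, minimizing H(P, Q) with respect to Q moves exactly the same term as minimizing D_KL(P∥Q), which is why the two objectives are equivalent.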

This is section thirteen of the Chapter on Probability and Information Theory with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

03.00 - Probability and Information Theory

03.01 - Why Probability?

03.02 - Random Variables

03.03 - Probability Distributions

03.04 - Marginal Probability

03.05 - Conditional Probability

03.06 - The Chain Rule of Conditional Probabilities

03.07 - Independence and Conditional Independence

03.08 - Expectation, Variance and Covariance

03.09 - Common Probability Distributions

03.10 - Useful Properties of Common Functions

03.11 - Bayes' Rule

03.12 - Technical Details of Continuous Variables

03.13 - Information Theory

03.14 - Structured Probabilistic Models

at Deep Learning With TF 2.0: 03.00- Probability and Information Theory. You can get the code for this article and the rest of the chapter here. Links to the notebook in Google Colab and Jupyter Binder are at the end of the notebook.
