DEV Community: Mukesh Mithrakumar

Common Probability Distributions with Tensorflow 2.0

Mukesh Mithrakumar — Wed, 24 Jul 2019 00:12:06 +0000

A probability distribution is a function that describes how likely a random variable is to take each of its possible values.

Following are a few examples of popular distributions.

3.9.1 Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable. It is controlled by a single parameter ϕ∈[0,1], which gives the probability of the random variable being equal to 1. It has the following properties:

P(x=1) = ϕ
P(x=0) = 1−ϕ
P(x=x) = ϕ^x (1−ϕ)^{1−x}
E_x[x] = ϕ
Var_x(x) = ϕ(1−ϕ)

The Bernoulli distribution is a special case of the binomial distribution where there is only one trial. Conversely, a binomial random variable is the sum of independent and identically distributed Bernoulli random variables. For example, let's say you do a single coin toss and the probability of getting heads is p. The random variable that represents your winnings after one coin toss (1 for heads, 0 for tails) is a Bernoulli random variable. So, what about the number of heads you land in 100 tosses? This is where you use Bernoulli trials: in general, if there are n independent Bernoulli trials, each with success probability p, then the sum of those trials follows a binomial distribution with parameters n and p. Below, we draw 1000 Bernoulli trials and plot the resulting samples.
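The relationship between Bernoulli trials and the binomial distribution can also be sketched in plain Python, without TensorFlow. This is only a minimal illustration: the helper names `bernoulli_trial` and `binomial_pmf` and the fixed seed are assumptions made for this example, not part of the notebook.

```python
import math
import random

def bernoulli_trial(p, rng):
    """Return 1 with probability p and 0 otherwise (hypothetical helper)."""
    return 1 if rng.random() < p else 0

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

rng = random.Random(0)   # fixed seed so the run is repeatable
n, p = 100, 0.5

# One binomial draw is the sum of n Bernoulli draws.
heads = sum(bernoulli_trial(p, rng) for _ in range(n))
print("heads in", n, "tosses:", heads)

# The binomial PMF sums to 1 over k = 0..n and has mean n * p.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
print("sum of PMF:", round(total, 6), "mean:", round(mean, 6))
```

With n = 1 the binomial PMF reduces to the Bernoulli properties listed above: `binomial_pmf(1, 1, p)` is p and `binomial_pmf(0, 1, p)` is 1 − p.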

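import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# color_b, color_o and color_sb (plot colors) and evaluate() (a helper that converts tensors to NumPy arrays)
# are assumed to be defined earlier in the accompanying notebook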

# Create a Bernoulli distribution with a probability .5 and sample size of 1000


bernoulli_distribution = tfd.Bernoulli(probs=.5)
bernoulli_trials = bernoulli_distribution.sample(1000)

# Plot Bernoulli Distribution
sns.distplot(bernoulli_trials, color=color_b)

# Properties of Bernoulli distribution
property_1 = bernoulli_distribution.prob(1)
print("P(x = 1) = {}".format(property_1))

property_2 = bernoulli_distribution.prob(0)
print("P(x = 0) = 1 - {} = {}".format(property_1, property_2))

print("Property three is a generalization of property 1 and 2")

print("For Bernoulli distribution The expected value of a Bernoulli random variable  X is p (E[X] = p)")

# Variance is calculated as Var = E[(X - E[X])**2], which for a Bernoulli variable equals p (1 - p)
property_5 = bernoulli_distribution.variance()
print("Var(x) = {0} (1 - {0}) = {1}".format(property_1, property_5))

P(x = 1) = 0.5
P(x = 0) = 1 - 0.5 = 0.5
Property three is a generalization of property 1 and 2
For Bernoulli distribution The expected value of a Bernoulli random variable  X is p (E[X] = p)
Var(x) = 0.5 (1 - 0.5) = 0.25

3.9.2 Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete variable with k different states, where k is finite. The multinoulli distribution is a special case of the multinomial distribution, which is a generalization of the binomial distribution. A multinomial distribution is the distribution over vectors in {0,⋯,n}^k representing how many times each of the k categories is visited when n samples are drawn from a multinoulli distribution.

# For a fair die
p = [1/6.]*6

# Multinomial distribution with 60 multinoulli trials, sampled once
multinoulli_distribution = tfd.Multinomial(total_count=60., probs=p)
multinoulli_pdf = multinoulli_distribution.sample(1)

print("""Dice throw values: {}
In sixty trials, index 0 represents the times the dice landed on 1 (= {} times) and
index 1 represents the times the dice landed on 2 (= {} times)\n""".format(multinoulli_pdf,
                                                                           multinoulli_pdf[0][0],
                                                                           multinoulli_pdf[0][1]))

g = sns.distplot(multinoulli_pdf, color=color_b)
plt.grid()

Dice throw values: [[ 8. 10. 13. 12.  9.  8.]]
In sixty trials, index 0 represents the times the dice landed on 1 (= 8.0 times) and
index 1 represents the times the dice landed on 2 (= 10.0 times)
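To make the multinoulli/multinomial relationship concrete, here is a small plain-Python sketch: each die throw is one multinoulli draw, and counting the outcomes of 60 throws gives one multinomial sample. The helper names and the seed are assumptions for this illustration only.

```python
import random
from collections import Counter

def sample_multinoulli(probs, rng):
    """Draw one category index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_multinomial(n, probs, rng):
    """Count how often each category appears in n multinoulli draws."""
    counts = Counter(sample_multinoulli(probs, rng) for _ in range(n))
    return [counts[i] for i in range(len(probs))]

rng = random.Random(0)   # fixed seed so the run is repeatable
fair_die = [1 / 6] * 6

counts = sample_multinomial(60, fair_die, rng)
print("counts per face:", counts, "total:", sum(counts))
```

The counts always add up to the number of trials, which is why the multinomial sample above sums to 60.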

There are other discrete distributions like:

Hypergeometric Distribution: models sampling without replacement
Poisson Distribution: expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.
Geometric Distribution: counts the number of Bernoulli trials needed to get one success.
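For the curious, the probability mass functions of these three distributions are short enough to write out directly. The sketch below uses the standard textbook formulas in plain Python; the function names and the example parameters are my own choices for illustration.

```python
import math

def hypergeometric_pmf(k, N, K, n):
    """P(k successes in n draws without replacement from N items, K of which are successes)."""
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

def poisson_pmf(k, rate):
    """P(k events in a fixed interval when events occur at the given average rate)."""
    return rate**k * math.exp(-rate) / math.factorial(k)

def geometric_pmf(k, p):
    """P(the first success happens on trial k), for k = 1, 2, ..."""
    return (1 - p)**(k - 1) * p

# Drawing 5 cards from a 52 card deck: distribution of the number of aces.
aces = [hypergeometric_pmf(k, N=52, K=4, n=5) for k in range(6)]
print("P(number of aces = 0..5):", [round(a, 5) for a in aces])

# Each PMF sums to 1 over its support (truncated where the tail is negligible).
poisson_total = sum(poisson_pmf(k, rate=3.0) for k in range(61))
geometric_mean = sum(k * geometric_pmf(k, p=0.25) for k in range(1, 501))
print("Poisson total:", round(poisson_total, 6), "geometric mean:", round(geometric_mean, 6))
```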

Since this is not meant to be an exhaustive introduction to distributions, I have presented only the major ones. If you are curious and want to learn more, take a look at the references I mention at the end of the notebook.

Next, we will take a look at some continuous distributions.

3.9.3 Gaussian Distribution

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:

N(x; μ, σ^2) = √(1/(2πσ^2)) exp(−(x−μ)^2 / (2σ^2))

The two parameters μ∈R and σ∈(0,∞) control the normal distribution. The parameter μ gives the coordinate of the central peak. This is also the mean of the distribution: E[x]=μ. The standard deviation of the distribution is given by σ, and the variance by σ^2.
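As a quick check of how μ and σ shape the curve, the density can be written out by hand. This is a plain-Python sketch of the standard normal density formula, not the TFP implementation; `normal_pdf` is a name chosen for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

# The peak sits at x = mu and its height shrinks as sigma grows.
peak_standard = normal_pdf(0.0)                       # mu = 0, sigma = 1
peak_wide = normal_pdf(1.0, mu=1.0, sigma=2.0)        # mu = 1, sigma = 2

# A simple Riemann sum over [-8, 8] shows the density integrates to (almost exactly) 1.
steps = 1600
dx = 16.0 / steps
area = sum(normal_pdf(-8.0 + i * dx) for i in range(steps + 1)) * dx

print("peak (sigma=1):", peak_standard, "peak (sigma=2):", peak_wide, "area:", area)
```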

# We use linspace to create a range of values from -8 to 8 with increments of (stop - start) / (num - 1)
rand_x= tf.linspace(start=-8., stop=8., num=150)

# Gaussian distribution with a standard deviation of 1 and mean 0
sigma = float(1.)
mu = float(0.)
gaussian_pdf = tfd.Normal(loc=mu, scale=sigma).prob(rand_x)

# convert tensors into numpy ndarrays for plotting
[rand_x_, gaussian_pdf_] = evaluate([rand_x, gaussian_pdf])

# Plot of the Gaussian distribution
plt.plot(rand_x_, gaussian_pdf_, color=color_b)
plt.fill_between(rand_x_, gaussian_pdf_, color=color_b)
plt.grid()

Normal distributions are a sensible choice for many applications. In the absence of prior knowledge about what form a distribution over the real numbers should take, the normal distribution is a good default choice for two major reasons.

Many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed.
Out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers. We can thus think of the normal distribution as being the one that inserts the least amount of prior knowledge into a model.
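The central limit theorem claim in the first reason is easy to try out. The sketch below adds up uniform random numbers, which are individually far from normal, and checks that the sums behave like a normal variable: about 68% of them fall within one standard deviation of the mean. The number of terms, sample count and seed are arbitrary choices for this illustration.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable
num_terms = 30           # how many uniform variables we add together
num_sums = 20000         # how many sums we draw

sums = [sum(rng.random() for _ in range(num_terms)) for _ in range(num_sums)]

mean = statistics.fmean(sums)
std = statistics.pstdev(sums)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the sum has mean n/2 and variance n/12.
expected_mean = num_terms / 2
expected_std = (num_terms / 12) ** 0.5

# For a normal distribution roughly 68.27% of the mass lies within one standard deviation.
within_one_std = sum(abs(s - mean) <= std for s in sums) / num_sums

print("mean:", mean, "expected:", expected_mean)
print("std:", std, "expected:", expected_std)
print("fraction within one std:", within_one_std)
```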

The normal distribution generalizes to R^n, in which case it is known as the multivariate normal distribution. It may be parameterized with a positive definite symmetric matrix Σ:

N(x; μ, Σ) = √(1/((2π)^n det(Σ))) exp(−(1/2) (x−μ)^T Σ^{−1} (x−μ))

The parameter μ still gives the mean of the distribution, though now it is vector valued. The parameter Σ gives the covariance matrix of the distribution.

# We create a two-dimensional multivariate normal distribution with mean 0. and standard deviation 2. in each dimension
mvn = tfd.MultivariateNormalDiag(loc=[0., 0.], scale_diag = [2., 2.])

# we take 1000 samples from the distribution and convert them to a NumPy array for plotting
samples = mvn.sample(1000).numpy()

# Plot of the multivariate distribution
g = sns.jointplot(x=samples[:, 0], y=samples[:, 1], kind='scatter', color=color_b)
plt.show()
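When the covariance matrix is diagonal, as in the MultivariateNormalDiag example, the multivariate density is just the product of one-dimensional normal densities. Here is a minimal plain-Python sketch of that special case; `mvn_diag_log_pdf` is a hypothetical helper written for this illustration, not a TFP function.

```python
import math

def mvn_diag_log_pdf(x, loc, scale_diag):
    """Log density of a multivariate normal whose covariance is diag(scale_diag**2)."""
    log_p = 0.0
    for xi, mi, si in zip(x, loc, scale_diag):
        z = (xi - mi) / si
        # Log of the one-dimensional normal density for this coordinate.
        log_p += -0.5 * z * z - math.log(si) - 0.5 * math.log(2.0 * math.pi)
    return log_p

loc = [0.0, 0.0]
scale_diag = [2.0, 2.0]

# At the mean, the density equals 1 / (2 * pi * sigma_1 * sigma_2).
density_at_mean = math.exp(mvn_diag_log_pdf([0.0, 0.0], loc, scale_diag))
density_off_mean = math.exp(mvn_diag_log_pdf([2.0, -2.0], loc, scale_diag))

print("density at the mean:", density_at_mean)
print("density at (2, -2):", density_off_mean)
```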

3.9.4 Exponential and Laplace Distributions

In the context of deep learning, we often want to have a probability distribution with a sharp point at x = 0. To accomplish this, we can use the exponential distribution:

p(x;λ) = λ1_(x≥0) exp(−λx)

The exponential distribution uses the indicator function 1_(x≥0) to assign probability zero to all negative values of x.
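The role of the indicator function is easy to see if we write the density by hand. This is a plain-Python sketch of the formula above; the function name, rate and seed are choices made for this example.

```python
import math
import random

def exponential_pdf(x, rate):
    """lambda * 1_(x >= 0) * exp(-lambda * x): zero for negative x."""
    if x < 0:
        return 0.0          # the indicator function switches the density off
    return rate * math.exp(-rate * x)

rate = 2.0
print("p(-1) =", exponential_pdf(-1.0, rate))   # no mass on negative values
print("p(0)  =", exponential_pdf(0.0, rate))    # the sharp point at x = 0 has height lambda

# Samples from the standard library's exponential sampler are never negative
# and their average is close to 1 / lambda.
rng = random.Random(0)
samples = [rng.expovariate(rate) for _ in range(50000)]
sample_mean = sum(samples) / len(samples)
print("sample mean:", sample_mean)
```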

# We use linspace to create a range of values from 0 to 4 with increments of (stop - start) / (num - 1)
a = tf.linspace(start=0., stop=4., num=41)

# the tf.newaxis expression is used to increase the dimension of the existing array by one more dimension
a = a[..., tf.newaxis]
lambdas = tf.constant([1.])

# We create an Exponential distribution with rate lambda and calculate the PDF for a
expo_pdf = tfd.Exponential(rate=lambdas).prob(a)

# convert tensors into numpy ndarrays for plotting
[a_, expo_pdf_] = evaluate([a,expo_pdf])

# Plot of Exponential distribution
plt.figure(figsize=(12.5, 4))
plt.plot(a_.T[0], expo_pdf_.T[[0]][0], color=color_sb)
plt.fill_between(a_.T[0], expo_pdf_.T[[0]][0],alpha=.33, color=color_b)
plt.title(r"Probability density function of Exponential distribution with $\lambda$ = 1")
plt.grid()

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point μ is the Laplace distribution:

Laplace(x; μ, γ) = (1/(2γ)) exp(−|x−μ| / γ)

# We use linspace to create a range of values from 0 to 4 with increments of (stop - start) / (num - 1)
a = tf.linspace(start=0., stop=4., num=41)

# the tf.newaxis expression is used to increase the dimension of the existing array by one more dimension
a = a[..., tf.newaxis]

# We create a Laplace distribution centered at mu = 1 with scale 1 and calculate the PDF for a
laplace_pdf = tfd.Laplace(loc=1., scale=1.).prob(a)

# convert tensors into numpy ndarrays for plotting
[a_, laplace_pdf_] = evaluate([a, laplace_pdf])

# Plot of laplace distribution
plt.figure(figsize=(12.5, 4))
plt.plot(a_.T[0], laplace_pdf_.T[[0]][0], color=color_sb)
plt.fill_between(a_.T[0], laplace_pdf_.T[[0]][0],alpha=.33, color=color_b)
plt.title(r"Probability density function of Laplace distribution")
plt.grid()
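The Laplace density itself is a one-liner, so it is worth writing out to see where the sharp peak comes from. The sketch below uses the standard formula with location μ and scale γ in plain Python; `laplace_pdf_value` is a name I made up for this example.

```python
import math

def laplace_pdf_value(x, mu, gamma):
    """Density of the Laplace distribution with location mu and scale gamma."""
    return math.exp(-abs(x - mu) / gamma) / (2.0 * gamma)

mu, gamma = 1.0, 1.0

# The peak sits at x = mu with height 1 / (2 * gamma), and the density is symmetric around mu.
peak = laplace_pdf_value(mu, mu, gamma)
left = laplace_pdf_value(mu - 0.7, mu, gamma)
right = laplace_pdf_value(mu + 0.7, mu, gamma)

# A Riemann sum over a wide interval around mu shows the density integrates to about 1.
dx = 0.01
area = sum(laplace_pdf_value(mu - 40.0 + i * dx, mu, gamma) for i in range(8001)) * dx

print("peak:", peak, "left:", left, "right:", right, "area:", area)
```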

3.9.5 The Dirac Distribution and Empirical Distribution

In some cases, we wish to specify that all the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the Dirac delta function, δ(x):

p(x) = δ(x−μ)

The Dirac delta function is defined such that it is zero valued everywhere except 0, yet integrates to 1. We can think of the Dirac delta function as being the limit point of a series of functions that put less and less mass on all points other than zero.

By defining p(x) to be δ shifted by −μ we obtain an infinitely narrow and infinitely high peak of probability mass where x=μ.
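One way to picture the limit described above is to take normal distributions centered at μ and shrink their standard deviation. The sketch below (plain Python, with names and values chosen for illustration) shows the peak growing without bound while almost all of the mass collects in a tiny window around μ.

```python
import math

def mass_within(eps, sigma):
    """Probability that a N(mu, sigma^2) variable lies within eps of mu."""
    return math.erf(eps / (sigma * math.sqrt(2.0)))

def peak_height(sigma):
    """Density of N(mu, sigma^2) evaluated at x = mu."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi))

eps = 0.1
sigmas = [1.0, 1.0 / 6.0, 0.01, 0.001]
masses = [mass_within(eps, s) for s in sigmas]
peaks = [peak_height(s) for s in sigmas]

for sigma, peak, mass in zip(sigmas, peaks, masses):
    print("sigma = {:.4f}  peak = {:10.3f}  mass within {} of mu = {:.6f}".format(
        sigma, peak, eps, mass))
```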

"""
TensorFlow does not give us a Dirac density that we can plot directly. You could build one using the fast Fourier
transform in tf.signal, but that would take us outside the scope of the book, so we approximate a Dirac distribution
with a very narrow normal distribution. Play around with the delta and mu values to see how the distribution moves.
"""

# We use linspace to create a range of values from -8 to 8 with increments of (stop - start) / (num - 1)
rand_x= tf.linspace(start=-8., stop=8., num=150)

# Gaussian distribution with a standard deviation of 1/6 and mean 2
delta = float(1./6.)
mu = float(2.)
dirac_pdf = tfd.Normal(loc=mu, scale=delta).prob(rand_x)

# convert tensors into numpy ndarrays for plotting
[rand_x_, dirac_pdf_] = evaluate([rand_x, dirac_pdf])

# Plot of the dirac distribution
plt.plot(rand_x_, dirac_pdf_, color=color_sb)
plt.fill_between(rand_x_, dirac_pdf_, color=color_b)
plt.grid()

A common use of the Dirac delta distribution is as a component of an empirical distribution:

p̂(x) = (1/m) ∑_{i=1}^{m} δ(x − x^(i))

which puts probability mass 1/m on each of the m points x^(1),⋯,x^(m), forming a given data set or collection of samples.

The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables.

For discrete variables, the situation is simpler: an empirical distribution can be conceptualized as a multinoulli distribution, with a probability associated with each possible input value that is simply equal to the empirical frequency of that value in the training set.

We can view the empirical distribution formed from a dataset of training examples as specifying the distribution that we sample from when we train a model on this dataset. Another important perspective on the empirical distribution is that it is the probability density that maximizes the likelihood of the training data.
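For discrete data the empirical distribution is just a table of relative frequencies, and the maximum likelihood property can be checked by comparing it against any other candidate. The sketch below is plain Python with a made-up toy data set; the helper names are assumptions for this example.

```python
import math
from collections import Counter

def empirical_pmf(data):
    """Multinoulli distribution whose probabilities are the observed frequencies."""
    counts = Counter(data)
    m = len(data)
    return {value: count / m for value, count in counts.items()}

def log_likelihood(pmf, data):
    """Log probability of the data set under the given PMF."""
    return sum(math.log(pmf[x]) for x in data)

data = ["a", "b", "a", "c", "a", "b"]    # toy training set
p_hat = empirical_pmf(data)
uniform = {value: 1 / 3 for value in "abc"}

print("empirical PMF:", p_hat)
print("log-likelihood under empirical PMF:", log_likelihood(p_hat, data))
print("log-likelihood under uniform PMF:  ", log_likelihood(uniform, data))
```

The empirical PMF assigns the data a higher likelihood than the uniform alternative, which is what the maximum likelihood perspective predicts.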

3.9.6 Mixtures of Distributions

One common way of combining simpler distributions to define a probability distribution is to construct a mixture distribution. A mixture distribution is made up of several component distributions. On each trial, the choice of which component distribution should generate the sample is determined by sampling a component identity from a multinoulli distribution:

P(x) = ∑_i P(c=i) P(x|c=i)

where P(c) is the multinoulli distribution over component identities.
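The two-step sampling procedure, first pick a component and then sample from it, can be sketched in a few lines of plain Python. The weights, means and standard deviations below are arbitrary values chosen for illustration, and `sample_mixture` is a hypothetical helper, not part of TFP.

```python
import random

def sample_mixture(weights, means, stds, rng):
    """Draw (component identity c, value x) from a mixture of 1-D Gaussians."""
    c = rng.choices(range(len(weights)), weights=weights, k=1)[0]   # c ~ multinoulli P(c)
    x = rng.gauss(means[c], stds[c])                                # x ~ P(x | c)
    return c, x

rng = random.Random(0)   # fixed seed so the run is repeatable
weights = [0.3, 0.7]
means = [-2.0, 3.0]
stds = [0.5, 0.5]

draws = [sample_mixture(weights, means, stds, rng) for _ in range(20000)]

frac_second = sum(c == 1 for c, _ in draws) / len(draws)
mixture_mean = sum(x for _, x in draws) / len(draws)

# The mixture mean is the weighted average of the component means: 0.3 * -2 + 0.7 * 3 = 1.5
print("fraction from second component:", frac_second)
print("sample mean of the mixture:", mixture_mean)
```

Here c plays the role of a latent variable: when we only look at the x values, the component that generated each one is not observed.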

"""
We will be creating two variables, each with two components, to plot the mixture of distributions.

The tfd.MixtureSameFamily distribution implements a batch of mixture distributions where all components are from
different parameterizations of the same distribution type. In our example, we will be using tfd.Categorical to
manage the probability of selecting components, followed by tfd.MultivariateNormalDiag as the components.
The MultivariateNormalDiag constructs a multivariate normal distribution on R^k.
"""

num_vars = 2        # Number of variables (`n` in formula).
var_dim = 1         # Dimensionality of each variable `x[i]`.
num_components = 2  # Number of components for each mixture (`K` in formula).
sigma = 5e-2        # Fixed standard deviation of each component.

# Set seed.
tf.random.set_seed(77)

# categorical distribution
categorical = tfd.Categorical(logits=tf.zeros([num_vars, num_components]))

# Choose some random (component) modes.
component_mean = tfd.Uniform().sample([num_vars, num_components, var_dim])

# component distribution for the mixture family
components = tfd.MultivariateNormalDiag(loc=component_mean, scale_diag=[sigma])

# create the mixture same family distribution
distribution_family = tfd.MixtureSameFamily(mixture_distribution=categorical, components_distribution=components)

# Combine the distributions
mixture_distribution = tfd.Independent(distribution_family, reinterpreted_batch_ndims=1)

# Extract a sample from the distribution
samples = mixture_distribution.sample(1000).numpy()

# Plot the distributions
g = sns.jointplot(x=samples[:, 0, 0], y=samples[:, 1, 0], kind="scatter", color=color_b, marginal_kws=dict(bins=50))
plt.show()

The mixture model allows us to briefly glimpse a concept that will be of paramount importance later—the latent variable. A latent variable is a random variable that we cannot observe directly. Latent variables may be related to x through the joint distribution.

A very powerful and common type of mixture model is the Gaussian mixture model, in which the components p(x|c=i) are Gaussians. Each component has a separate parametrized mean μ^(i) and covariance Σ^(i). As with a single Gaussian distribution, the mixture of Gaussians might constrain the covariance matrix for each component to be diagonal or isotropic. A Gaussian mixture model is a universal approximator of densities, in the sense that any smooth density can be approximated with any specific nonzero amount of error by a Gaussian mixture model with enough components.

Some of the other continuous distribution functions include:

Erlang Distribution: In a Poisson process of rate λ the waiting times between k events have an Erlang distribution.
Gamma Distribution: In a Poisson process with rate λ the gamma distribution gives the time to the k^{th} event.
Beta Distribution: represents a family of probabilities and is a versatile way to represent outcomes for percentages or proportions.
Dirichlet Distribution: is a multivariate generalization of the Beta distribution. Dirichlet distributions are commonly used as prior distributions in Bayesian statistics.
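The Erlang and gamma descriptions can be checked with a quick simulation: in a Poisson process the gaps between events are exponential, so the time to the k-th event is a sum of k exponential waiting times. The sketch below is plain Python with arbitrary parameters and compares that sum against the standard library's gamma sampler.

```python
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
rate = 2.0               # events per unit time (lambda)
k = 3                    # we wait for the k-th event
num_runs = 20000

# Time to the k-th event as a sum of k exponential gaps.
waits = [sum(rng.expovariate(rate) for _ in range(k)) for _ in range(num_runs)]
mean_wait = sum(waits) / num_runs

# The same quantity drawn directly from a gamma distribution with shape k and scale 1 / rate.
gamma_draws = [rng.gammavariate(k, 1.0 / rate) for _ in range(num_runs)]
mean_gamma = sum(gamma_draws) / num_runs

# Both averages should be close to k / rate = 1.5
print("mean of summed exponentials:", mean_wait)
print("mean of gamma draws:        ", mean_gamma)
```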

This is section nine of the Chapter on Probability and Information Theory with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

03.00 - Probability and Information Theory
03.01 - Why Probability?
03.02 - Random Variables
03.03 - Probability Distributions
03.04 - Marginal Probability
03.05 - Conditional Probability
03.06 - The Chain Rule of Conditional Probabilities
03.07 - Independence and Conditional Independence
03.08 - Expectation, Variance and Covariance
03.09 - Common Probability Distributions
03.10 - Useful Properties of Common Functions
03.11 - Bayes' Rule
03.12 - Technical Details of Continuous Variables
03.13 - Information Theory
03.14 - Structured Probabilistic Models

at Deep Learning With TF 2.0: 03.00- Probability and Information Theory. You can get the code for this article and the rest of the chapter here. Links to the notebook in Google Colab and Jupyter Binder are at the end of the notebook.

Information Theory with Tensorflow 2.0

Mukesh Mithrakumar — Tue, 16 Jul 2019 18:32:20 +0000

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal. In the context of machine learning, we can also apply information theory to continuous variables where some of these message length interpretations do not apply.

The basic intuition behind information theory is that a likely event should have low information content, less likely events should have higher information content, and independent events should have additive information.

Let me give you a simple example. Let's say you have a male friend who is head over heels in love with this girl, so he asks her out pretty much every week, and there's a 99% chance she says no. Since you are his best friend, he texts you every time after he asks her out to let you know what happened: "Hey guess what she said, NO 😭😭😭". This is of course wasteful considering how low his chances are, so it makes more sense for your friend to just send "😭" when she says no and save the longer text for when she says yes. This way, the number of bits used to convey the message (and your corresponding data bill) will be minimized. P.S. don't tell your friend he has a low chance, that's how you lose friends 😬.

To satisfy these properties, we define the self-information of an event x=x to be:

I(x) =− log P(x)

In this book, we always use log to mean the natural logarithm, with base e. Our definition of I(x) is therefore written in units of nats. One nat is the amount of information gained by observing an event of probability 1/e. Other texts use base-2 logarithms and units called bits or shannons; information measured in bits is just a rescaling of information measured in nats.
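The definition is simple enough to try out by hand. Here is a minimal plain-Python sketch of self-information in nats and its conversion to bits; the function names are chosen for this example, and the 99% figure comes from the texting story above.

```python
import math

def self_information(p):
    """Self-information I(x) = -log P(x) in nats."""
    return -math.log(p)

def nats_to_bits(nats):
    """Rescale information measured in nats to bits."""
    return nats / math.log(2)

# A "no" with probability 0.99 carries very little information, a "yes" carries a lot.
info_no = self_information(0.99)
info_yes = self_information(0.01)
print("I(no)  = {:.4f} nats = {:.4f} bits".format(info_no, nats_to_bits(info_no)))
print("I(yes) = {:.4f} nats = {:.4f} bits".format(info_yes, nats_to_bits(info_yes)))

# Independent events have additive information: I(p * q) = I(p) + I(q)
combined = self_information(0.5 * 0.25)
separate = self_information(0.5) + self_information(0.25)
print("combined:", combined, "separate:", separate)
```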

"""
No matter what combination of tosses you get, the entropy remains the same, but if you change the probability of the
trial, the entropy changes. Play around with the probs and see how the entropy changes and whether the increase
or decrease makes sense.
"""

import tensorflow_probability as tfp
tfd = tfp.distributions

coin_entropy = [0]                                                                     # creating the coin entropy list

for i in range(10, 11):
    coin = tfd.Bernoulli(probs=0.5)                                                    # Bernoulli distribution
    coin_sample = coin.sample(i)                                                       # we take i (= 10) samples
    coin_entropy.append(coin.entropy())                                                # append the coin entropy
    sns.distplot(coin_entropy, color=color_o, hist=False, kde_kws={"shade": True})     # Plot of the entropy

print("Entropy of a single fair coin toss in nats: {} \nFor tosses: {}".format(coin_entropy[1], coin_sample))
plt.grid()

Entropy of a single fair coin toss in nats: 0.6931471824645996
For tosses: [0 1 1 1 0 1 1 1 0 1]

Self-information deals only with a single outcome. We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:

H(x) = E_(x∼P)[I(x)] =− E_(x∼P)[log P(x)]

also denoted as H(P).

Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits needed on average to encode symbols drawn from a distribution P. Distributions that are nearly deterministic (where the outcome is nearly certain) have low entropy; distributions that are closer to uniform have high entropy. When x is continuous, the Shannon entropy is known as the differential entropy.
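For a discrete distribution the expectation is just a weighted sum, so the Shannon entropy fits in one line. The sketch below is plain Python (not the TFP `entropy()` method) and compares a fair coin, a nearly deterministic coin and a fair die; `shannon_entropy` is a name chosen for this example.

```python
import math

def shannon_entropy(probs):
    """H = -sum_x P(x) log P(x) in nats; outcomes with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
biased_coin = [0.99, 0.01]      # nearly deterministic
fair_die = [1 / 6] * 6          # uniform over six states

h_fair = shannon_entropy(fair_coin)
h_biased = shannon_entropy(biased_coin)
h_die = shannon_entropy(fair_die)

print("fair coin:   {:.4f} nats".format(h_fair))     # equals log 2
print("biased coin: {:.4f} nats".format(h_biased))   # close to zero
print("fair die:    {:.4f} nats".format(h_die))      # equals log 6
```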

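"""
Note that a Bernoulli variable has only two outcomes, so the expectation in H(x) is a sum over x = 1
(with probability p.mean()) and x = 0 (with probability 1 - p.mean()). If you change the distribution,
you need to find the expectation accordingly.
"""

def shannon_entropy_func(p):
    """Calculates the shannon entropy of a Bernoulli distribution.
    Arguments:
        p (tfd.Bernoulli)    : Bernoulli distribution.
    Returns:
        shannon entropy in nats.
    """

    phi = p.mean()    # P(x = 1)
    return -(phi * tf.math.log(phi) + (1. - phi) * tf.math.log(1. - phi))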

# Create a Bernoulli distribution
bernoulli_distribution = tfd.Bernoulli(probs=.5)

# Use TFPs entropy method to calculate the entropy of the distribution
shannon_entropy = bernoulli_distribution.entropy()

print("TFPs entropy: {} matches with the Shannon Entropy Function we wrote: {}".format(shannon_entropy,
                                                                                       shannon_entropy_func(bernoulli_distribution)))

TFPs entropy: 0.6931471824645996 matches with the Shannon Entropy Function we wrote: 0.6931471824645996

Entropy is remarkable less for its interpretation than for its properties. For example, unlike variance, entropy doesn't care about the actual values x takes; it only considers their probabilities. So if we increase the number of values x may take, the probabilities become less concentrated and the entropy increases.

# You can see below that by widening the range of values x can take we increase the entropy

shannon_list = []

for i in range(1, 20):
    uniform_distribution = tfd.Uniform(low=0.0, high=i)    # We create a uniform distribution
    shannon_entropy = uniform_distribution.entropy()       # Calculate the entropy of the uniform distribution
    shannon_list.append(shannon_entropy)                   # Append the results to the list

# Plot of Shannon Entropy
plt.hist(shannon_list, color=color_b)
plt.grid()
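The contrast with variance can be made concrete with a discrete toy example. In the plain-Python sketch below (helper names and numbers are my own), two variables share the same probabilities but take very different values: their variances differ while their entropies are identical. The second part shows the entropy of a uniform distribution growing as the number of states grows.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def variance(values, probs):
    """Variance of a discrete random variable with the given values and probabilities."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

probs = [0.2, 0.5, 0.3]
values_a = [1, 2, 3]
values_b = [10, 200, 3000]     # same probabilities, very different values

print("variance a:", variance(values_a, probs), "variance b:", variance(values_b, probs))
print("entropy of both:", entropy(probs))

# Uniform distribution over k states: the entropy is log(k) and grows with k.
uniform_entropies = [entropy([1 / k] * k) for k in range(1, 7)]
print("uniform entropies:", [round(h, 4) for h in uniform_entropies])
```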

If we have two separate probability distributions P(x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_(KL)(P∥Q)=E_(x∼P)[log P(x)/Q(x)]=E_(x∼P)[log P(x)−log Q(x)]

In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

def kl_func(p, q):
    """Calculates the KL divergence of two distributions.
    Arguments:
        p    : Distribution p.
        q    : Distribution q.
    Returns:
        the divergence value.
    """

    r = p.loc - q.loc
    return (tf.math.log(q.scale) - tf.math.log(p.scale) -.5 * (1. - (p.scale**2 + r**2) / q.scale**2))

# We create two normal distributions
p = tfd.Normal(loc=1., scale=1.)
q = tfd.Normal(loc=0., scale=2.)

# Using TFPs KL Divergence
kl = tfd.kl_divergence(p, q)

print("TFPs KL_Divergence: {} matches with the KL Function we wrote: {}".format(kl, kl_func(p, q)))

TFPs KL_Divergence: 0.4431471824645996 matches with the KL Function we wrote: 0.4431471824645996

The KL divergence has many useful properties, most notably being nonnegative. The KL divergence is 0 if and only if P and Q are the same distribution in the case of discrete variables, or equal “almost everywhere” in the case of continuous variables.
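These properties are easy to verify for discrete distributions. The following is a plain-Python sketch of the KL divergence as a weighted sum; the two example distributions are arbitrary, and the last lines also show that the divergence is not symmetric in its arguments.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) (log P(x) - log Q(x)) in nats, for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)

print("D_KL(P || Q) = {:.4f}".format(kl_pq))
print("D_KL(Q || P) = {:.4f}".format(kl_qp))
print("D_KL(P || P) = {:.4f}".format(kl_pp))   # zero when the distributions are the same
```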

A quantity that is closely related to the KL divergence is the cross-entropy H(P,Q)=H(P)+D_(KL)(P∥Q), which is similar to the KL divergence but lacking the term on the left:

H(P,Q) =− E_(x∼P) log Q(x)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.

"""
The cross_entropy computes the Shannons cross entropy defined as:
H[P, Q] = E_p[-log q(X)] = -int_F p(x) log q(x) dr(x)
"""

# We create two normal distributions
p = tfd.Normal(loc=1., scale=1.)
q = tfd.Normal(loc=0., scale=2.)

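# Calculating the cross entropy. Note that q.cross_entropy(p) computes H[Q, P] = E_q[-log p(X)];
# call p.cross_entropy(q) instead to get H[P, Q] as defined in the text above
cross_entropy = q.cross_entropy(p)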

print("TFPs cross entropy: {}".format(cross_entropy))

TFPs cross entropy: 3.418938636779785
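For two normal distributions, the entropy, KL divergence and cross-entropy all have closed forms, so the identity H(P,Q) = H(P) + D_KL(P∥Q) can be checked without TensorFlow. The sketch below is plain Python using the standard Gaussian formulas; the function names are my own, and the parameters match the p and q used above. It also shows that the order of the arguments matters.

```python
import math

def normal_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)

def normal_kl(mu_p, sigma_p, mu_q, sigma_q):
    """D_KL(P || Q) for two univariate normal distributions."""
    r = mu_p - mu_q
    return math.log(sigma_q / sigma_p) + (sigma_p**2 + r**2) / (2.0 * sigma_q**2) - 0.5

def normal_cross_entropy(mu_p, sigma_p, mu_q, sigma_q):
    """H(P, Q) = -E_{x~P}[log Q(x)] for two univariate normal distributions."""
    r = mu_p - mu_q
    return 0.5 * math.log(2.0 * math.pi * sigma_q**2) + (sigma_p**2 + r**2) / (2.0 * sigma_q**2)

# P = N(1, 1) and Q = N(0, 2), as in the example above.
h_pq = normal_cross_entropy(1.0, 1.0, 0.0, 2.0)
h_qp = normal_cross_entropy(0.0, 2.0, 1.0, 1.0)
identity = normal_entropy(1.0) + normal_kl(1.0, 1.0, 0.0, 2.0)

print("H(P, Q)          = {:.6f}".format(h_pq))
print("H(P) + KL(P||Q)  = {:.6f}".format(identity))
print("H(Q, P)          = {:.6f}".format(h_qp))
```

The value of H(Q, P) agrees with the TFP output printed above, which was obtained from q.cross_entropy(p).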

This is section thirteen of the Chapter on Probability and Information Theory with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

Probability Distributions with Tensorflow 2.0

Mukesh Mithrakumar — Mon, 08 Jul 2019 23:28:25 +0000

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.

3.3.1 Discrete Variables and Probability Mass functions

A probability distribution over discrete variables may be described using a probability mass function (PMF). A probability mass function maps from a state of a random variable to the probability of that random variable taking on that state.

For example, the roll of a die is random, and because the outcome is a discrete variable, the roll can only be 1, 2, 3, 4, 5 or 6 and no values in between.

We denote probability mass functions with P and write the PMF as P(X = x). Here X is the random variable for the outcome of rolling the die, and x is a particular number on the die.

"""
In a fair 6 sided die, when you roll, each number has a chance of 1/6 = 16.7% of landing and we can show
this by rolling it enough times. So in this example, we do 10000 rolls and we verify that P(X=6) is close to 16.7%.
In short, the probability from a PMF says what chance x has. Play around with the different x values, number of rolls and sides and see what kind of probability you get and see if it makes sense.
"""

import random
from collections import defaultdict

def single_dice(x, sides, rolls):
    """Calculates and prints the probability of rolls.
    Arguments:
        x (int)        : is the number you want to calculate the probability for.
        sides (int)    : Number of sides for the dice.
        rolls (int)    : Number of rolls.
    Returns:
        a printout.
    """

    result = roll(sides, rolls)
    for i in range(1, sides +1):
        plt.bar(i, result[i] / rolls)
    print("P(X = {}) = {}%".format(x, tf.divide(tf.multiply(result[x], 100), rolls)))

def roll(sides, rolls):
    """Returns a dictionary of rolls and the sides of each roll.
    Arguments:
        sides (int)    : Number of sides for the dice.
        rolls (int)    : Number of rolls.
    Returns:
        a dictionary.
    """

    d = defaultdict(int)                    # creating a default dictionary
    for _ in range(rolls):
        d[random.randint(1, sides)] += 1    # The random process
    return d


single_dice(x=6, sides=6, rolls=10000)

P(X = 6) = 16.43%

To be a PMF on a random variable x, a function P must satisfy the following properties:

The domain of P must be the set of all possible states of x. In our example above the possible states of x are 1 through 6; try plugging in 7 for x and see what value you get.
∀x ∈ x, 0 ≤ P(x) ≤ 1. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring. If you tried plugging in 7 in our example above, you would have seen that the probability of obtaining a 7 is zero; it is an impossible event because 7 is not in our set.
∑_{x∈x} P(x) = 1. This normalization property prevents us from obtaining probabilities greater than one. In other words, if you add up the individual probabilities of all the faces of our die, they should sum to 1 or 100%.
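These three properties can be checked exactly for the fair die, without any simulation noise. The sketch below is plain Python using exact fractions; `die_pmf` is a hypothetical helper written for this example.

```python
from fractions import Fraction

def die_pmf(x, sides=6):
    """PMF of a fair die: 1/sides for the faces 1..sides and 0 for anything else."""
    if isinstance(x, int) and 1 <= x <= sides:
        return Fraction(1, sides)
    return Fraction(0)

faces = range(1, 7)

# Property 2: every probability lies between 0 and 1, and an impossible outcome gets 0.
print("P(X = 6) =", die_pmf(6))
print("P(X = 7) =", die_pmf(7))

# Property 3: the probabilities of all possible states sum to exactly 1.
total = sum(die_pmf(x) for x in faces)
print("sum over all faces =", total)
```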

Probability mass functions can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability mass function. P(x=x, y=y) denotes the probability that x=x and y=y simultaneously. When x and y are independent, as with two separate dice, the joint probability factorizes as P(x=x, y=y) = P(x)P(y).

"""
In this example, we are rolling two dice. There are ways to simplify the code so it's not this long, but
I wanted to show that we are rolling two dice 10000 times each. In the example we are calculating the probability
of rolling x=4 and y=1; because the two dice are independent, this can be calculated by multiplying the individual probabilities of x and y."""

def multi_dice(x, y, sides, rolls, plot=True):
    """Calculates the joint probability of two dice.
    Arguments:
        x (int)        : is the number you want to calculate the probability for.
        y (int)        : is the number you want to calculate the probability for.
        sides (int)    : Number of sides for the dice.
        rolls (int)    : Number of rolls.
        plot (bool)    : Whether you want to plot the data or not.
    Returns:
        probabilities (float).
    """

    result1 = roll(sides, rolls)                         # first result from the rolls
    result2 = roll(sides, rolls)                         # second result from the rolls
    prob_x = tf.divide(result1[x], rolls)                # calculates the probability of x
    prob_y = tf.divide(result2[y], rolls)                # calculates the probability of y
    joint_prob = tf.multiply(prob_x, prob_y)             # calculates the joint probability of x&y by multiplying

    if plot:
        for i in range(1, sides +1):
            plt.title("Dice 1 {} Rolls".format(rolls))
            plt.bar(i, result1[i] / rolls, color=color_b)
        plt.show()
        for i in range(1, sides +1):
            plt.title("Dice 2 {} Rolls".format(rolls))
            plt.bar(i, result2[i] / rolls, color=color_o)
        plt.show()

    return prob_x, prob_y, joint_prob


prob_x, prob_y, joint_prob = multi_dice(x=4, y=1, sides=6, rolls=10000, plot=True)
print("P(x = {:.4}%), P(y = {:.4}%), P(x = {}; y = {}) = {:.4}%\n\n".format(tf.multiply(prob_x, 100),
                                                                 tf.multiply(prob_y, 100),
                                                                 4, 1, tf.multiply(joint_prob, 100)))

P(x = 16.9%), P(y = 16.39%), P(x = 4; y = 1) = 2.77%
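Multiplying the individual probabilities works here because the two dice are independent. The plain-Python sketch below builds the exact joint PMF for two independent dice and, for contrast, a made-up dependent case where the second "die" always copies the first; the helper names are assumptions for this illustration.

```python
from fractions import Fraction
from itertools import product

sides = range(1, 7)

# Joint PMF of two independent fair dice: every pair (x, y) has probability 1/36.
joint = {(x, y): Fraction(1, 36) for x, y in product(sides, sides)}

# A dependent pair for contrast: y always shows the same face as x.
dependent = {(x, y): (Fraction(1, 6) if x == y else Fraction(0))
             for x, y in product(sides, sides)}

def marginal_x(pmf, x):
    """P(x = x), obtained by summing the joint PMF over all values of y."""
    return sum(p for (xi, _), p in pmf.items() if xi == x)

def marginal_y(pmf, y):
    """P(y = y), obtained by summing the joint PMF over all values of x."""
    return sum(p for (_, yi), p in pmf.items() if yi == y)

for name, pmf in [("independent", joint), ("dependent", dependent)]:
    product_of_marginals = marginal_x(pmf, 4) * marginal_y(pmf, 1)
    print(name, "P(x=4, y=1) =", pmf[(4, 1)], " P(x=4)P(y=1) =", product_of_marginals)
```

The exact joint probability of 1/36 is about 2.78%, which is what the simulated 2.77% above approximates.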

3.3.2 Continuous Variables and Probability Density Functions

When working with continuous random variables, we describe probability distributions using a probability density function (PDF).

Let's play a game, shall we? What if I ask you to guess the integer that I am thinking of between 1 and 10? Regardless of the number you pick, the probability of each of the options is the same (1/10), because you have 10 options and the probabilities must add up to 1.

But what if I told you to guess the real number I am thinking of between 0 and 1? Now this gets tricky: I could be thinking of 0.2, 0.5, 0.0004, and it can go on and on, so the possibilities are endless. We run into the problem of how to describe the probability of each option when there are infinitely many of them. This is where the PDF comes to help: instead of asking for the probability of an exact number, we ask for the probability of landing within a small interval around that number.

"""
In our guessing game example, I told you how difficult it would be for you to guess a real number I am thinking of
between 0 and 1 and below, we plot such a graph with minval of 0 and maxval of 1 and we "guess" the values 500
times and the resulting distribution is plotted.
"""

# Outputs random values from a uniform distribution
continuous = tf.random.uniform([1, 500], minval=0, maxval=1, dtype=tf.float32)
g = sns.distplot(continuous, color=color_b)
plt.grid()

To be a probability density function, a function p must satisfy the following properties:

The domain of p must be the set of all possible states of x
∀ x∈x, p(x)≥0. Note that we do not require p(x)≤1
∫p(x)dx=1

A probability density function p(x) does not give the probability of a specific state directly; instead the probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
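A uniform distribution on a short interval makes both points visible at once: the density can be larger than 1, and the probability of a small region is the density times the width of the region. The sketch below is plain Python with arbitrary numbers; the helper names are chosen for this example.

```python
def uniform_pdf(x, low, high):
    """Density of the uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def uniform_cdf(x, low, high):
    """P(X <= x) for the uniform distribution on [low, high]."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

low, high = 0.0, 0.5
x, delta_x = 0.2, 0.01

density = uniform_pdf(x, low, high)        # a density, not a probability, so it may exceed 1
approx_prob = density * delta_x            # p(x) * delta_x
exact_prob = uniform_cdf(x + delta_x, low, high) - uniform_cdf(x, low, high)

print("p(0.2) =", density)
print("p(0.2) * delta_x =", approx_prob, " exact probability =", exact_prob)
```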

"""
Below is the same histogram plot of our continuous random variable. Note that the values on the y axis look different
between the seaborn distplot and the histogram plot because sns.distplot also draws a density plot.
You can turn it off by setting 'kde=False' and you will get the same plot as you see below.
The goal of the following plot is to show you that to find the probability of landing near 0.3, you need to
calculate the area of the highlighted region of width delta x, which is approximately p(0.3) times delta x
"""

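# Flatten to a 1-D NumPy array so the 500 values are binned as a single dataset
n, bins, patches = plt.hist(tf.reshape(continuous, [-1]).numpy(), color=color_b)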
patches[3].set_fc(color_o)
plt.grid()

We can integrate the density function to find the actual probability mass of a set of points. Specifically, the probability that x lies in some set S is given by the integral of p(x) over that set ( ∫_[a,b]p(x)dx )
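As a rough sketch of that integral, the block below approximates P(a ≤ x ≤ b) with a midpoint Riemann sum and compares it with the closed-form answer from the error function. The normal density with mean 0 and standard deviation 3, the interval [1, 2], and the helper names are my own choices for illustration; only the Python standard library is used.

```python
import math


def normal_pdf(x, loc=0.0, scale=3.0):
    """Density of a normal distribution with mean loc and std deviation scale."""
    z = (x - loc) / scale
    return math.exp(-0.5 * z * z) / (scale * math.sqrt(2.0 * math.pi))


def normal_cdf(x, loc=0.0, scale=3.0):
    """Cumulative distribution function, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - loc) / (scale * math.sqrt(2.0))))


def integrate(f, a, b, steps=100_000):
    """Midpoint Riemann sum of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


a, b = 1.0, 2.0
prob_numeric = integrate(normal_pdf, a, b)    # integral of p(x) over [a, b]
prob_exact = normal_cdf(b) - normal_cdf(a)    # same probability in closed form
print(prob_numeric, prob_exact)
```

For the uniform density on [0, 1] from the guessing game, the same integral over [0.3, 0.4] is simply the width of the interval, 0.1.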

Tensorflow Probability Distribution Library

From here onwards, we will be using TFP distributions module often and we will be calling it as tfd (=tfp.distributions). So, before getting started, let me explain a few things about the module.

TF Probability uses distribution subclasses to represent stochastic random variables. Recall the first cause of uncertainty, inherent stochasticity. This means that even if we knew all the values of the variables' parameters, the outcome would still be random. We will see examples of these distributions in Section 9. In the previous example, we generated samples directly with tf.random.uniform, but extracting and manipulating samples that way is not as intuitive as it is with the tfp distributions library. We usually start by creating a distribution, and when we draw samples from it, those samples become tensorflow tensors which can be deterministically manipulated.

Some common methods in tfd:

sample(sample_shape=(), seed=None): Draws samples of the specified shape
mean(): Calculates the mean
mode(): Calculates the mode
variance(): Calculates the variance
stddev(): Calculates the standard deviation
prob(value): Calculates the probability density (continuous) or probability mass (discrete) at value
log_prob(value): Calculates the log of the probability density/mass at value
entropy(): Calculates the Shannon entropy in nats
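To make these methods a little more concrete, here is a pure-Python sketch of what they compute for a normal distribution. `NormalSketch` is a hypothetical stand-in I wrote for illustration; it is not the TFP class and only mirrors the method names listed above, using the standard closed-form formulas for a normal distribution.

```python
import math


class NormalSketch:
    """Hypothetical stand-in for a normal distribution object (not the TFP class)."""

    def __init__(self, loc, scale):
        self.loc = float(loc)      # mean
        self.scale = float(scale)  # standard deviation

    def mean(self):
        return self.loc

    def mode(self):
        return self.loc  # a normal density peaks at its mean

    def variance(self):
        return self.scale ** 2

    def stddev(self):
        return self.scale

    def log_prob(self, value):
        """Log of the probability density at value."""
        z = (value - self.loc) / self.scale
        return -0.5 * z * z - math.log(self.scale * math.sqrt(2.0 * math.pi))

    def prob(self, value):
        """Probability density at value."""
        return math.exp(self.log_prob(value))

    def entropy(self):
        """Differential entropy in nats."""
        return 0.5 * math.log(2.0 * math.pi * math.e * self.scale ** 2)


dist = NormalSketch(loc=0.0, scale=3.0)
print(dist.mean(), dist.variance(), dist.prob(1.5), dist.entropy())
```

Working with log_prob first and exponentiating afterwards is a common choice because log densities are numerically better behaved far out in the tails.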

"""
Let's say we want to find the density at 1.5 (p(1.5)) for a continuous distribution. We could of course
write out the density formula and evaluate it by hand, but in tensorflow probability you have "prob()", which evaluates
the Probability Mass Function for discrete distributions and the Probability Density Function for continuous ones.
For tfp.distributions.Normal "loc" is the mean and "scale" is the std deviation. Don't worry if you don't
understand those, we will go through distributions in Section 9. And I recommend you come back and go through
these examples again after you finish section 9.

Also, there's nothing special about these numbers, play around with the scale, p(x) values and the k limits to
get a better understanding.
"""
import tensorflow_probability as tfp
tfd = tfp.distributions

# creating an x axis
samples = tf.range(-10, 10, 0.001)

# Create a Normal distribution with mean 0 and std deviation 3
normal_distribution = tfd.Normal(loc=0., scale=3)

# Then we calculate the PDF value at x = 1.5
pdf_x = normal_distribution.prob(1.5)

# We can't plot tensors so evaluate is a helper function to convert to ndarrays
[pdf_x_] = evaluate([pdf_x])


# Finally, we plot both the PDF of the samples and p(1.5)
plt.plot(samples, normal_distribution.prob(samples), color=color_b)
plt.fill_between(samples, normal_distribution.prob(samples), color=color_b)
plt.bar(1.5, pdf_x_, color=color_o)
plt.grid()

print("Probability density at x = 1.5: {:.4} (a density, not a probability)".format(pdf_x_))

Probability density at x = 1.5: 0.1174 (a density, not a probability)

This is section three of the Chapter on Probability and Information Theory with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

What is a Random Variable?

Mukesh Mithrakumar — Mon, 01 Jul 2019 19:35:43 +0000

A random variable is a variable that can take on different values randomly. On its own, a random variable is just a description of the states that are possible (you can think of these like functions), which must be coupled with a probability distribution that specifies how likely each of these states is.

Well, if that doesn't make sense, let me give you an example. When I first heard about random variables, I thought they must work like a random number generator spitting out random values at each call. This is partly correct, so let me clear it up. Random number generators have two main components. The first is a sampler, which is nothing more than a happy soul that flips a coin over and over again, reporting the results. After this sampler comes the random variable, whose job is to translate these Heads or Tails events into numbers based on our rules.

Random variables can be discrete or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note that these states are not necessarily the integers; they can also just be named states that are not considered to have any numerical value. For example, gender (male, female, etc), for which we use an indicator function I to map non-numeric values to numbers, e.g. male=0, female=1. A continuous random variable is associated with a real value.
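Here is a small sketch of the sampler and random variable split described above, using only the Python standard library. The names `sampler` and `random_variable`, the fixed seed, and the 0 to 50 litre range are all assumptions made for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def sampler():
    """The happy soul flipping a coin and reporting the result."""
    return rng.choice(["Heads", "Tails"])


def random_variable(outcome):
    """Translate a named outcome into a number using our rule."""
    return {"Heads": 1, "Tails": -1}[outcome]


outcomes = [sampler() for _ in range(10)]
discrete_values = [random_variable(o) for o in outcomes]

# A continuous random variable takes real values, e.g. litres of fuel filled
continuous_values = [rng.uniform(0.0, 50.0) for _ in range(5)]

print(outcomes)
print(discrete_values)
print(continuous_values)
```

The discrete variable can only ever land on +1 or -1, while the continuous one can take any real value in its range.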

"""
The Rademacher and Rayleigh are two types of distributions we will use to generate our samples.

Rademacher: is a discrete probability distribution where a random variate X has a 50% chance of being +1 and a
50% chance of being -1.

Rayleigh: is a continuous probability distribution for non-negative valued random variables.

Do not worry about what probability distributions mean, we will be looking at it in the next section, for now,
you can think of Rademacher as the sampler, the happy guy who tosses coins over and over again where
heads represent +1 and tails -1.
And Rayleigh is the guy who works at a gas/petrol station who helps you to fill the tank and notes down how much
you filled your tank (eg. 1.2l, 4.5l) which are continuous values.
"""

import tensorflow_probability as tfp

# Discrete random variable
rademacher = tfp.math.random_rademacher([1, 100], dtype=tf.int32)

# Continuous random variable
rayleigh = tfp.math.random_rayleigh([1, 100], dtype=tf.float32)

# Plot discrete random variable 1 and -1
plt.title("Rademacher Discrete Random Variables")
plt.hist(rademacher, color=color_b)
plt.show()

# Plot continuous random variable
plt.title("Rayleigh Continuous Random Variables")
plt.hist(rayleigh, color=color_o)
plt.show()

This is section two of the Chapter on Probability and Information Theory with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

Why Probability for Deep Learning?

Mukesh Mithrakumar — Wed, 26 Jun 2019 05:39:33 +0000

Unlike the world of computer scientists and software engineers where things are entirely deterministic and certain, the world of machine learning must always deal with uncertain quantities and sometimes stochastic (non-deterministic or randomly determined) quantities.

There are three possible sources of uncertainty:

Inherent stochasticity: These are systems that have inherent randomness, like Python's random.random() function, which outputs a different number each time you call it, or the dynamics of subatomic particles, which quantum mechanics describes as probabilistic.
Incomplete observability: The best example of this is the Monty Hall problem, the one Jim Sturgess gets asked in the movie 21: there are three doors, with a Ferrari behind one and a goat behind each of the other two. Watch the scene to understand how to solve the Monty Hall problem. Here, the outcome given the contestant's choice is deterministic, but from the contestant's point of view it is uncertain: deterministic systems appear stochastic when you can't observe all the variables.
Incomplete modeling: Spoiler warning, although at this point I doubt it's a spoiler! At the end of Endgame, when Iron Man snapped away all of Thanos' forces (I know, still recovering from the scene), we are left to wonder what happened to Gamora: was she snapped away because she was initially with Thanos' forces, or was she saved because she turned against Thanos? When we discard some information about the model (in this case, whether Tony knew Gamora was good or bad), the discarded information results in uncertainty in the model's predictions; here, we don't know for certain whether she is alive.
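Going back to the Monty Hall example in the list above, a quick simulation shows how the hidden variable (where the car is) turns into a probability from the contestant's side. This is a minimal standard-library sketch; `play` and `win_rate` are hypothetical names, and I am assuming the usual rules where the host always opens a goat door that the contestant did not pick.

```python
import random


def play(switch, rng):
    """Play one round and report whether the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=20_000, seed=0):
    """Fraction of rounds won with a fixed strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(stay_rate, switch_rate)
```

Staying should win roughly one round in three and switching roughly two in three, even though every individual round is fully determined once the car is placed.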

Okay, swear, last Avengers reference.

When Dr. Strange said we have a 1 in 14 million chance of winning the war, he had practically seen those 14 million futures. This is called frequentist probability, which defines an event's probability as the limit of its relative frequency in a large number of trials. But we don't always have Dr. Strange's Time Stone to see all the possible futures, nor are all events repeatable. In those cases we turn to Bayesian probability, which uses probability to represent a degree of belief in an event, with 1 indicating absolute certainty that the event will happen and 0 indicating absolute certainty that it will not.

Even though the frequentist probability is related to rates at which events occur and Bayesian probability is related to qualitative levels of certainty, we treat both of them as behaving the same and we use the exact same formulas to compute the probability of events.
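The frequentist definition above can be sketched with a fair coin instead of a Time Stone: as the number of trials grows, the relative frequency of heads settles near 0.5. The seed, the trial counts, and the `relative_frequency` helper are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable


def relative_frequency(trials):
    """Fraction of heads in the given number of fair coin flips."""
    heads = sum(rng.random() < 0.5 for _ in range(trials))
    return heads / trials


frequencies = {n: relative_frequency(n) for n in (10, 1_000, 100_000)}
for n, freq in frequencies.items():
    print(n, freq)
```

With only 10 flips the estimate can be far off; with 100,000 it sits very close to the limit.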

This is section one of the Chapter on Probability and Information Theory with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

Here's Everything you need to know about Facebook's Cryptocurrency Libra

Mukesh Mithrakumar — Tue, 18 Jun 2019 21:48:55 +0000

This is a summary of the 29 page Libra whitepaper released today (Tuesday, June 18th) by Facebook.

The document starts with Libra's mission:

To enable a simple global currency and financial infrastructure that empowers billions of people.

And outlines the plans for a new decentralized blockchain, a low-volatility cryptocurrency, and a smart contract platform that together aim to create a new opportunity for responsible financial services innovation.

Problem Statement

If you are reading this, you probably know why blockchain and cryptocurrency are revolutionary technologies. For those who don't, and to guide the rest of the paper, the main motivating factor is that 1.7 billion adults globally remain outside the financial system, even though a billion of them have a mobile phone and nearly half a billion have internet access. People cite not having enough funds, high and unpredictable fees, banks being too far away, and lacking the necessary documentation as reasons for being "unbanked".

The Opportunity

This section is about the belief that Libra can give more people access to financial services and inherent control of their labor, yadda yadda. I am sorry, I really want to know how this works, but if you want to read the opportunity, please see the screenshot below. It's the usual benefits of blockchain and a decentralized system, so if you are impatient like me, you can skip to the next section:

Introducing Libra

Libra is made up of three parts that work together to create a more inclusive financial system:

It is built on a secure, scalable and reliable blockchain;
It is backed by a reserve of assets designed to give it intrinsic value;
It is governed by the independent Libra Association tasked with evolving the ecosystem.

The Libra currency is built on the "Libra Blockchain". Because it is intended to address a global audience, the software that implements the Libra Blockchain is open source. This is pretty cool: it means that if you are a blockchain developer you can build on Libra. The only caveat is that you have to learn "Move", the new language designed for writing transaction logic and smart contracts on the Blockchain, so yep, one more language to learn 😂. But the advantage is that, according to the paper, the Blockchain was built from the ground up to prioritize scalability, security, efficiency in storage and throughput, and future adaptability, so we might end up having a truly reliable blockchain. Btw, there is a technical paper on the Libra Blockchain, which is another 29 pages long, and I plan to summarize it, hopefully in the coming days, so keep an eye out.

Now, the answer to my question on how they are going to create a stable coin that doesn't fluctuate like Bitcoin, which btw has risen by 146 percent against the U.S. dollar, from around $4,000 to $9,150 to date (18th June). And not only Bitcoin: other coins like Ethereum are also on the rise, so it's a good day to do some trading before it obviously crashes or, highly unlikely, plateaus at this rate after a few weeks or so. But hey, I am no financial advisor, so back to Libra. Libra's price is meant to remain stable because it will be backed by a reserve of real assets: a basket of bank deposits and short-term government securities will be held in the Libra Reserve for every Libra that is created.

Now, to the part I was worried about: is Facebook going to be handling all our financial data? The good news is, it's not. Once the Libra network launches in the first half of 2020, Facebook and its affiliates will have the same commitments, privileges, and financial obligations as any other Founding Member, which I think is pretty cool. They have also created Calibra, a regulated subsidiary, to ensure separation between social and financial data and to build and operate services on Facebook's behalf on top of the Libra network. But see, I am a very skeptical guy: what kind of services will Calibra offer on the Libra network under Facebook? I mean, how are they going to make money? If you have any thoughts, take it to the comments, would love to hear.

Anyways, the good thing is, it boils down to the Libra Association Council. The Libra Association is a not-for-profit membership organization headquartered in Geneva, Switzerland, that coordinates and provides a framework for the governance of Libra. All decisions are brought to the council, and major policy or technical decisions require the consent of two-thirds of the votes.

Oh and btw, the affiliates, please see below for a breakdown by the industry:

Another piece of good news is that one of the directives of the association is to move towards a permissionless Libra blockchain within five years of the launch, where anyone who meets the technical requirements can run a validator node. But currently, since there aren't any reliable solutions for a permissionless blockchain, Libra will start as a permissioned blockchain, where you need to be granted access to act as a validator node.

The Libra Blockchain

The three requirements the Libra Blockchain was built on:

Able to scale to billions of accounts, which requires high transaction throughput, low latency, and an efficient, high-capacity storage system.
Highly secure, to ensure the safety of funds and financial data.
Flexible, so it can power the Libra ecosystem’s governance as well as future innovation in financial services.

Oh yay, the fun stuff, the actual technology 😍.

The three decisions regarding the Libra Blockchain:

Designing and using the Move programming language.
Using a Byzantine Fault Tolerant (BFT) consensus approach.
Adopting and iterating on widely adopted blockchain data structures.

Move was designed with safety and security as the highest priorities because of Libra's goal to one day serve billions of people. And the paper goes on to detail some of the features of the language as:

Easier to write smart contracts.
Prevents cloning of assets by enabling "resource types" that constrain digital assets to the same properties as physical assets.
Facilitates automatic proofs that transactions satisfy certain properties.

One thing I am not sure about is the ability for developers to create contracts. The paper says it will be opened up over time, so if you have any idea of when, please let me know in the comments.

Next, the BFT approach. The Libra Blockchain uses the LibraBFT consensus protocol because BFT protocols are designed to function correctly even if some validator nodes (up to one-third of the network) are compromised or fail.
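To put a number on "up to one-third", here is a tiny sketch of the standard BFT bound, where n validators can tolerate f faulty ones as long as n ≥ 3f + 1. I am assuming that bound applies here in its textbook form; the function names are hypothetical and nothing below comes from the Libra codebase.

```python
def max_faulty(validators):
    """Largest f that satisfies validators >= 3f + 1."""
    return (validators - 1) // 3


def quorum(validators):
    """Votes needed so that any two quorums share at least one honest validator."""
    return validators - max_faulty(validators)


for n in (4, 10, 100):
    print(n, max_faulty(n), quorum(n))
```

With 100 validators, for example, 33 can misbehave and a decision still needs 67 votes.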

In order to securely store transactions, data on the Libra Blockchain is protected by Merkle trees, a data structure used by other blockchains that enables the detection of any changes to existing data. Unlike previous blockchains, which view the blockchain as a collection of blocks of transactions, the Libra Blockchain is a single data structure that records the history of transactions and states over time. This implementation simplifies the work of applications accessing the blockchain, allowing them to read any data from any point in time and verify the integrity of that data using a unified framework.
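If you haven't met Merkle trees before, the sketch below shows the basic idea of detecting changes: hash every record, then hash pairs of hashes until a single root remains. This is a generic binary Merkle tree over SHA-256 that I wrote for illustration. It is an assumption-heavy toy, not Libra's actual data structure, and the transaction strings are made up.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Root hash of a simple binary Merkle tree over a list of byte strings."""
    if not leaves:
        return sha256(b"")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


transactions = [b"alice->bob:5", b"bob->carol:2", b"carol->dave:1"]
root = merkle_root(transactions)

# Changing any single record changes the root hash
tampered_root = merkle_root([b"alice->bob:500"] + transactions[1:])
print(root.hex())
print(tampered_root.hex())
```

Anyone holding only the 32-byte root can tell that the history was altered, without storing the history itself.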

The Libra Currency and Reserve

Since Libra will be based on real assets, you can convert the digital currency into local fiat currency based on an exchange rate. It is important to mention that Libra will not always be able to convert into the same amount of a given local currency (i.e., Libra is not a “peg” to a single currency). Rather, as the value of the underlying assets moves, the value of one Libra in any local currency may fluctuate. However, the reserve assets are being chosen to minimize volatility, so holders of Libra can trust the currency’s ability to preserve value over time. The assets in the Libra Reserve will be held by a geographically distributed network of custodians with investment-grade credit rating to provide both security and decentralization of the assets.

Interest on the reserve assets will be used to cover the costs of the system, ensure low transaction fees, and pay dividends to investors who provided capital to jumpstart the ecosystem. So there you go, that's how they make money.

What’s Next for Libra?

Over the coming months, the association will work with the community to gather feedback on the Libra Blockchain prototype and bring it to a production-ready state. In particular, this work will focus on ensuring the security, performance, and scalability of the protocol and implementation.

Okay, answer to my second question, third parties can create smart contracts once the language development has stabilized after launch so, long time 🙁.

How to Get Involved

If you are a researcher or protocol developer, an early preview of the Libra testnet is available under the Apache 2.0 Open Source License, with accompanying documentation.

If your organization is interested in becoming a Founding Member or applying for social impact grants from the Libra Association, read more here.

Conclusion

I mean, I was very skeptical of this, and I still am to an extent, but after reading the paper, I have answers to most of my questions and I think it will be a good thing for the crypto community and blockchain. You get free marketing from Facebook and a lot of awareness, and hey, that's a huge thing blockchain companies and developers lack: public awareness.

So, I plan on summarizing the technical Blockchain paper in the coming days, if you can't wait, you can read the whole paper here.

And if I missed anything or if you like to know more about anything I mentioned, please let me know in the comments.

Peace out ✌🏼.

Principal Components Analysis with Tensorflow 2.0

Mukesh Mithrakumar — Mon, 17 Jun 2019 04:13:16 +0000

PCA is a complexity reduction technique that tries to reduce a set of variables down to a smaller set of components that represent most of the information in the variables. This can be thought of as for a collection of data points applying lossy compression, meaning storing the points in a way that require less memory by trading some precision. At a conceptual level, PCA works by identifying sets of variables that share variance, and creating a component to represent that variance.

Earlier, when we were computing the transpose or the matrix inverse, we relied on Tensorflow's built-in functions, but for PCA there is no such function, except one in Tensorflow Transform (tft), which is part of Tensorflow Extended.

There are multiple ways you can implement a PCA in Tensorflow but since this algorithm is such an important one in the machine learning world, we will take the long route.

The reason for having PCA under Linear Algebra is to show that PCA could be implemented using the theorems we studied in this Chapter.

# To start working with PCA, let's start by creating a 2D data set

x_data = tf.multiply(5, tf.random.uniform([100], minval=0, maxval=100, dtype = tf.float32, seed = 0))
y_data = tf.multiply(2, x_data) + 1 + tf.random.uniform([100], minval=0, maxval=100, dtype = tf.float32, seed = 0)

X = tf.stack([x_data, y_data], axis=1)

plt.rc_context({'axes.edgecolor':'orange', 'xtick.color':'red', 'ytick.color':'red'})
plt.plot(X[:,0], X[:,1], '+', color='b')
plt.grid()

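We start by centering the data, that is, subtracting the mean of each feature so the data sits around the origin. The normalize function below does only this centering step, which is enough here because the data we created are on similar scales. Most of the time the data you will be working with will be on different scales, so it's good practice to also standardize it by dividing each centered feature by its standard deviation.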

def normalize(data):
    # creates a copy of data
    X = tf.identity(data)
    # subtracts the column-wise mean so each feature is centered at zero
    X -=tf.reduce_mean(data, axis=0)
    return X

normalized_data = normalize(X)
plt.plot(normalized_data[:,0], normalized_data[:,1], '+', color='b')
plt.grid()
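The normalize function above only subtracts the mean. For data whose features live on very different scales, a full standardization also divides by the standard deviation. Below is a minimal pure-Python sketch of that extra step; `standardize` is a hypothetical helper, the toy data is made up, and it assumes no column is constant (which would give a zero standard deviation).

```python
import math


def standardize(rows):
    """Center each column and scale it to unit standard deviation.

    Assumes every column has a non-zero standard deviation.
    """
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Two features on very different scales
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize(data)
for row in scaled:
    print(row)
```

After standardizing, both columns have mean 0 and variance 1, so neither feature dominates just because of its units.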

Recall that PCA can be thought of as applying lossy compression to a collection of data points x. To limit the loss of precision, we look for an encoding function f(x) = c that maps each point x to a lower-dimensional code vector c, and a decoding function g such that x ≈ g(f(x)).

PCA is defined by our choice of this decoding function. Specifically, to make the decoder very simple, we choose to use matrix multiplication to map the code back and define g(c) = Dc. Our goal is to minimize the distance between the input point x and its reconstruction g(c), measured with the L^2 norm. With the columns of D constrained to be orthonormal, this boils down to the encoding function c = D^T x.

Finally, we use the same matrix D to decode all the points, and to solve the resulting optimization problem for D we use eigendecomposition.

Please note that the following equation is the final version of a lot of matrix transformations. I don't provide the derivation because the goal is to focus on the mathematical implementation rather than the derivation. But for the curious, you can read about the derivation in Chapter 2 Section 11.

d^* = argmax_d Tr(d^T X^T X d) subject to d^T d = 1

To find d, we calculate the eigenvectors of X^T X; the optimal d is the eigenvector with the largest eigenvalue.
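As a sanity check on that claim, the sketch below brute-forces the maximization over unit vectors d = (cos θ, sin θ) for a tiny centered data set and compares the best value with the largest eigenvalue of X^T X. The data values and helper names are made up for illustration, and the closed-form eigenvalue formula only applies to symmetric 2 x 2 matrices.

```python
import math

# Small centered data set (rows are points); illustrative values only
X = [[-2.0, -1.0], [-1.0, -1.0], [0.0, 0.0], [1.0, 1.0], [2.0, 1.0]]

# Entries of the symmetric 2x2 matrix X^T X = [[a, b], [b, c]]
a = sum(x * x for x, _ in X)
b = sum(x * y for x, y in X)
c = sum(y * y for _, y in X)


def objective(theta):
    """d^T X^T X d for the unit vector d = (cos theta, sin theta)."""
    dx, dy = math.cos(theta), math.sin(theta)
    return a * dx * dx + 2.0 * b * dx * dy + c * dy * dy


# Brute-force search over unit vectors (half a turn is enough, d and -d tie)
best_theta = max((i * math.pi / 10_000 for i in range(10_000)), key=objective)
best_value = objective(best_theta)

# Largest eigenvalue of [[a, b], [b, c]] in closed form
largest_eigenvalue = (a + c) / 2.0 + math.sqrt(((a - c) / 2.0) ** 2 + b * b)

print(best_value, largest_eigenvalue)
```

The brute-force maximum and the largest eigenvalue agree up to the resolution of the search grid, which is why an eigendecomposition can replace the search.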

# Finding the Eigen Values and Vectors of X^T X (eigh returns the eigenvalues in ascending order)
eigen_values, eigen_vectors = tf.linalg.eigh(tf.tensordot(tf.transpose(normalized_data), normalized_data, axes=1))

print("Eigen Vectors: \n{} \nEigen Values: \n{}".format(eigen_vectors, eigen_values))

Eigen Vectors:
[[-0.8908606  -0.45427683]
[ 0.45427683 -0.8908606 ]]
Eigen Values:
[   16500.715 11025234.   ]

The eigenvectors (principal components) determine the directions of the new feature space, and the eigenvalues determine their magnitude.

Now, let's use these Eigenvectors to rotate our data. The goal of the rotation is to end up with a new coordinate system where the data is uncorrelated and where the basis axes gather all the variance. Keeping only the axes with the largest variance is what reduces the dimension.

Recall our encoding function c = D^T x, where D is the matrix containing the eigenvectors that we have calculated before.

X_new = tf.tensordot(tf.transpose(eigen_vectors), tf.transpose(normalized_data), axes=1)

plt.plot(X_new[0, :], X_new[1, :], '+', color='b')
plt.xlim(-500, 500)
plt.ylim(-700, 700)
plt.grid()

That is the transformed data.
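To check what the rotation buys us, here is a small pure-Python sketch of the same steps (center, build X^T X, take its eigenvectors, encode with c = D^T x) on five made-up points. The closed-form eigenvalue and eigenvector formulas below are specific to symmetric 2 x 2 matrices with a non-zero off-diagonal entry, and all names are hypothetical.

```python
import math

points = [[2.0, 1.0], [3.0, 4.0], [5.0, 5.0], [6.0, 8.0], [9.0, 12.0]]
n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [[x - mean_x, y - mean_y] for x, y in points]

# Entries of the symmetric matrix X^T X = [[a, b], [b, c]]
a = sum(x * x for x, _ in centered)
b = sum(x * y for x, y in centered)
c = sum(y * y for _, y in centered)

# Closed-form eigenvalues of a symmetric 2x2 matrix, in ascending order
half_trace = (a + c) / 2.0
radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
eigenvalues = [half_trace - radius, half_trace + radius]


def unit_eigenvector(lam):
    """Unit eigenvector for eigenvalue lam; (b, lam - a) works whenever b != 0."""
    norm = math.hypot(b, lam - a)
    return [b / norm, (lam - a) / norm]


d_small, d_large = (unit_eigenvector(lam) for lam in eigenvalues)

# Encoding c = D^T x: each point becomes its coordinates along the eigenvectors
encoded = [[d_small[0] * x + d_small[1] * y, d_large[0] * x + d_large[1] * y]
           for x, y in centered]

cross_term = sum(u * v for u, v in encoded)     # close to 0: axes are uncorrelated
spread_large = sum(v * v for _, v in encoded)   # matches the largest eigenvalue
print(cross_term, spread_large, eigenvalues[1])
```

Almost all of the spread ends up along the second axis, so dropping the first coordinate loses very little information.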

This is section twelve of the Chapter on Linear Algebra with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

02.01 — Scalars, Vectors, Matrices, and Tensors
02.02 — Multiplying Matrices and Vectors
02.03 — Identity and Inverse Matrices
02.04 — Linear Dependence and Span
02.05 — Norms
02.06 — Special Kinds of Matrices and Vectors
02.07 — Eigendecomposition
02.08 — Singular Value Decomposition
02.09 — The Moore-Penrose Pseudoinverse
02.10 — The Trace Operator
02.11 — The Determinant
02.12 — Example: Principal Components Analysis

at Deep Learning With TF 2.0: 02.00- Linear Algebra. You can get the code for this article and the rest of the chapter here. Links to the notebook in Google Colab and Jupyter Binder are at the end of the notebook.

Singular Value Decomposition with Tensorflow 2.0

Mukesh Mithrakumar — Fri, 14 Jun 2019 04:16:11 +0000

The singular value decomposition (SVD) provides another way to factorize a matrix into singular vectors and singular values. The SVD enables us to discover some of the same kind of information as the eigendecomposition reveals, however, the SVD is more generally applicable. Every real matrix has a singular value decomposition, but the same is not true of the eigenvalue decomposition. SVD can be written as:

A = UDV^T

Suppose A is an m x n matrix, then U is defined to be an m x m rotation matrix, D to be an m x n scaling and projecting matrix, and V to be an n x n rotation matrix.

Each of these matrices is defined to have a special structure. The matrices U and V are both defined to be orthogonal matrices U^T = U^(-1) and V^T = V^(-1). The matrix D is defined to be a diagonal matrix.

The elements along the diagonal of D are known as the singular values of the matrix A. The columns of U are known as the left-singular vectors. The columns of V are known as the right-singular vectors.

# mxn matrix A
svd_matrix_A = tf.constant([[2, 3], [4, 5], [6, 7]], dtype=tf.float32)
print("Matrix A: \n{}\n".format(svd_matrix_A))

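# Using tf.linalg.svd to calculate the singular value decomposition where d: singular values (diagonal of D), u: Matrix U and v: Matrix V
# Note: tf.linalg.svd returns V rather than V^T; for this A, V happens to be symmetric, so V and V^T print the same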
d, u, v = tf.linalg.svd(svd_matrix_A, full_matrices=True, compute_uv=True)
print("Diagonal D: \n{} \n\nMatrix U: \n{} \n\nMatrix V^T: \n{}".format(d, u, v))

Matrix A:
[[2. 3.]
 [4. 5.]
 [6. 7.]]

Diagonal D:
[11.782492    0.41578525]

Matrix U:
[[ 0.30449855 -0.86058956  0.40824753]
 [ 0.54340035 -0.19506174 -0.81649673]
 [ 0.78230214  0.47046405  0.40824872]]

Matrix V^T:
[[ 0.63453555  0.7728936 ]
 [ 0.7728936  -0.63453555]]

# Lets see if we can bring back the original matrix from the values we have

# mxm orthogonal matrix U
svd_matrix_U = tf.constant([[0.30449855, -0.86058956, 0.40824753], [0.54340035, -0.19506174, -0.81649673], [0.78230214, 0.47046405, 0.40824872]])
print("Orthogonal Matrix U: \n{}\n".format(svd_matrix_U))

# mxn diagonal matrix D
svd_matrix_D = tf.constant([[11.782492, 0], [0, 0.41578525], [0, 0]], dtype=tf.float32)
print("Diagonal Matrix D: \n{}\n".format(svd_matrix_D))

# nxn transpose of matrix V
svd_matrix_V_trans = tf.constant([[0.63453555, 0.7728936], [0.7728936, -0.63453555]], dtype=tf.float32)
print("Transpose Matrix V: \n{}\n".format(svd_matrix_V_trans))

# UDV(^T)
svd_RHS = tf.tensordot(tf.tensordot(svd_matrix_U, svd_matrix_D, axes=1), svd_matrix_V_trans, axes=1)

predictor = tf.reduce_all(tf.equal(tf.round(svd_RHS), svd_matrix_A))
def true_print(): print("It WORKS. \nRHS: \n{} \n\nLHS: \n{}".format(tf.round(svd_RHS), svd_matrix_A))
def false_print(): print("Condition FAILED. \nRHS: \n{} \n\nLHS: \n{}".format(tf.round(svd_RHS), svd_matrix_A))

tf.cond(predictor, true_print, false_print)

Orthogonal Matrix U:
[[ 0.30449855 -0.86058956  0.40824753]
 [ 0.54340035 -0.19506174 -0.81649673]
 [ 0.78230214  0.47046405  0.40824872]]

Diagonal Matrix D:
[[11.782492    0.        ]
 [ 0.          0.41578525]
 [ 0.          0.        ]]

Transpose Matrix V:
[[ 0.63453555  0.7728936 ]
 [ 0.7728936  -0.63453555]]

It WORKS.
RHS:
[[2. 3.]
 [4. 5.]
 [6. 7.]]

LHS:
[[2. 3.]
 [4. 5.]
 [6. 7.]]

Matrix A can be seen as a linear transformation. This transformation can be decomposed into three sub-transformations:

Rotation,
Re-scaling and projecting,
Rotation.

These three steps correspond to the three matrices V^T, D and U, applied in that order.

Let's see how these transformations take place in order.

# Let's define a unit square
svd_square = tf.constant([[0, 0, 1, 1],[0, 1, 1, 0]], dtype=tf.float32)

# a new 2x2 matrix
svd_new_matrix = tf.constant([[1, 1.5], [0, 1]])

# SVD for the new matrix (tf.linalg.svd returns V, so we transpose it to get V^T)
new_d, new_u, new_v = tf.linalg.svd(svd_new_matrix, full_matrices=True, compute_uv=True)
new_v_T = tf.transpose(new_v)

# let's change d into a diagonal matrix
new_d_matrix = tf.linalg.diag(new_d)

# Rotation: V^T for a unit square
rotated = tf.tensordot(new_v_T, svd_square, axes=1)
plot_transform(svd_square, rotated, "$Square$", r"$V^T \cdot Square$", "Rotation", axis=[-0.5, 3.5 , -1.5, 1.5])
plt.show()

# Scaling and Projecting: DV^(T)
scaled = tf.tensordot(new_d_matrix, rotated, axes=1)
plot_transform(rotated, scaled, r"$V^T \cdot Square$", r"$D \cdot V^T \cdot Square$", "Scaling and Projecting", axis=[-0.5, 3.5 , -1.5, 1.5])
plt.show()

# Second Rotation: UDV^(T)
trans_1 = scaled
trans_2 = tf.tensordot(new_u, scaled, axes=1)
plot_transform(trans_1, trans_2, r"$D \cdot V^T \cdot Square$", r"$U \cdot D \cdot V^T \cdot Square$", "Second Rotation", color=['#1190FF', '#FF9A13'], axis=[-0.5, 3.5 , -1.5, 1.5])
plt.show()

<img src="https://raw.githubusercontent.com/adhiraiyan/DeepLearningWithTF2.0/master/notebooks/figures/ch02/output_79_0.png" alt="Rotation" class="center-image">


<img src="https://raw.githubusercontent.com/adhiraiyan/DeepLearningWithTF2.0/master/notebooks/figures/ch02/output_79_1.png" alt="Scaling and Projecting" class="center-image">


<img src="https://raw.githubusercontent.com/adhiraiyan/DeepLearningWithTF2.0/master/notebooks/figures/ch02/output_79_2.png" alt="Second Rotation" class="center-image">

The above sub transformations can be found for each matrix as follows:

U corresponds to the eigenvectors of A A^T
V corresponds to the eigenvectors of A^T A
D contains the singular values of A, which are the square roots of the nonzero eigenvalues of A A^T or A^T A (the two share the same nonzero eigenvalues).

As an exercise try proving this is the case.
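Short of a proof, here is a numeric check of how D relates to A^T A for the 3 x 2 matrix used earlier: taking the square roots of the eigenvalues of A^T A reproduces the values that tf.linalg.svd printed for D. The sketch uses only the standard library and the closed-form eigenvalues of a symmetric 2 x 2 matrix; the variable names are my own.

```python
import math

A = [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

# A^T A is a symmetric 2x2 matrix [[p, q], [q, r]]
p = sum(row[0] * row[0] for row in A)
q = sum(row[0] * row[1] for row in A)
r = sum(row[1] * row[1] for row in A)

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
half_trace = (p + r) / 2.0
radius = math.sqrt(((p - r) / 2.0) ** 2 + q * q)
eigenvalues = [half_trace + radius, half_trace - radius]

# Singular values of A are the square roots of these eigenvalues
singular_values = [math.sqrt(lam) for lam in eigenvalues]
print(singular_values)
```

The two numbers match the Diagonal D output shown earlier to within floating point precision.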

Perhaps the most useful feature of the SVD is that we can use it to partially generalize matrix inversion to nonsquare matrices, as we will see in the next section.

This is section eight of the Chapter on Linear Algebra with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

at Deep Learning With TF 2.0: 02.00- Linear Algebra. You can get the code for this article and the rest of the chapter here. Links to the notebook in Google Colab and Jupyter Binder are at the end of the notebook.

Eigendecomposition with Tensorflow 2.0

Mukesh Mithrakumar — Mon, 10 Jun 2019 21:52:51 +0000

We can represent a number, for example 12 as 12 = 2 x 2 x 3. The representation will change depending on whether we write it in base ten or in binary but the above representation will always be true and from that, we can conclude that 12 is not divisible by 5 and that any integer multiple of 12 will be divisible by 3.

Similarly, we can also decompose matrices in ways that show us information about their functional properties that are not obvious from the representation of the matrix as an array of elements. One of the most widely used kinds of matrix decomposition is called eigendecomposition, in which we decompose a matrix into a set of eigenvectors and eigenvalues.

An eigenvector of a square matrix A is a nonzero vector v such that multiplication by A alters only the scale of v. In short, it is a special vector whose direction does not change when the matrix is applied to it:

Av = λv

The scale λ is known as the eigenvalue corresponding to this eigenvector.
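Before moving to Tensorflow, here is the definition checked by hand on a 2 x 2 matrix whose eigenvector is easy to guess. The matrix, the vectors, and the `matvec` helper are my own choices for illustration.

```python
import math

A = [[2.0, 1.0], [1.0, 2.0]]


def matvec(M, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]


v = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]  # an eigenvector of A
lam = 3.0                                         # its eigenvalue
Av = matvec(A, v)                                 # same direction as v, scaled by 3

w = [1.0, 0.0]                                    # not an eigenvector of A
Aw = matvec(A, w)                                 # [2, 1]: the direction changes

print(Av, [lam * vi for vi in v])
print(Aw)
```

Multiplying by A only stretches v, while w gets rotated as well as stretched.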

# Let's see how we can compute the eigen vectors and values from a matrix
e_matrix_A = tf.random.uniform([2, 2], minval=3, maxval=10, dtype=tf.float32, name="matrixA")
print("Matrix A: \n{}\n\n".format(e_matrix_A))

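# Calculating the eigen values and vectors using tf.linalg.eigh, if you only want the values you can use eigvalsh.
# Note: eigh assumes a symmetric matrix and only reads its lower triangle, so for the random matrix above the
# result describes the symmetric matrix built from that triangle
eigen_values_A, eigen_vectors_A = tf.linalg.eigh(e_matrix_A)
print("Eigen Vectors: \n{} \n\nEigen Values: \n{}\n".format(eigen_vectors_A, eigen_values_A))

# Now lets plot our Matrix with the Eigen vector and see how it looks
Av = tf.tensordot(e_matrix_A, eigen_vectors_A, axes=1)  # matrix product A V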
vector_plot([tf.reshape(Av, [-1]), tf.reshape(eigen_vectors_A, [-1])], 10, 10)

Matrix A:
[[5.450138 9.455662]
 [9.980919 9.223391]]

Eigen Vectors:
[[-0.76997876 -0.6380696 ]
 [ 0.6380696  -0.76997876]]

Eigen Values:
[-2.8208985 17.494429 ]

If v is an eigenvector of A, then so is any rescaled vector sv for s ∈ ℝ, s ≠ 0.

# Let us multiply our eigenvectors by a scalar s (here s = 5) and plot the above graph again to see the rescaling
sv = tf.multiply(5.0, eigen_vectors_A)
vector_plot([tf.reshape(Av, [-1]), tf.reshape(sv, [-1])], 10, 10)

Suppose that a matrix A has n linearly independent eigenvectors v^(1), ..., v^(n) with corresponding eigenvalues λ_1, ..., λ_n. We may concatenate all the eigenvectors to form a matrix V with one eigenvector per column: V = [v^(1), ..., v^(n)]. Likewise, we can concatenate the eigenvalues to form a vector λ = [λ_1, ..., λ_n]^T. The eigendecomposition of A is then given by

A = V diag(λ)V^(-1)
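As a small worked check of this formula, here is a sketch in plain Python for a non-symmetric 2 × 2 matrix, A = [[5, 1], [3, 3]]. The eigenpairs below were worked out by hand (λ = 6 with v = (1, 1), and λ = 2 with v = (1, -3)), and the helpers `matmul` and `inverse_2x2` are names I made up for this illustration. Note that the eigenvectors do not need to be unit length for the formula to hold.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def inverse_2x2(M):
    """Invert a 2x2 matrix using the closed-form adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Eigenpairs of A = [[5, 1], [3, 3]], worked out by hand
eigenvalues = [6.0, 2.0]
V = [[1.0, 1.0],
     [1.0, -3.0]]  # one eigenvector per column
diag_lambda = [[eigenvalues[0], 0.0],
               [0.0, eigenvalues[1]]]

# A = V diag(lambda) V^(-1)
reconstructed = matmul(matmul(V, diag_lambda), inverse_2x2(V))
print(reconstructed)  # [[5.0, 1.0], [3.0, 3.0]]
```

Because this A is not symmetric, V is not orthogonal, so the inverse of V really is needed here and cannot be replaced by its transpose.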

# Creating a matrix A to find its decomposition
# tf.linalg.eigh assumes a symmetric matrix and reads only the lower triangle, so we make A symmetric explicitly
eig_matrix_A = tf.constant([[5, 3], [3, 3]], dtype=tf.float32)
new_eigen_values_A, new_eigen_vectors_A = tf.linalg.eigh(eig_matrix_A)

print("Eigen Values of Matrix A: {} \n\nEigen Vector of Matrix A: \n{}\n".format(new_eigen_values_A, new_eigen_vectors_A))

# calculate the diag(lambda)
diag_lambda = tf.linalg.diag(new_eigen_values_A)
print("Diagonal of Lambda: \n{}\n".format(diag_lambda))

# Find the eigendecomposition of matrix A: V diag(lambda) V^(-1)
decomp_A = tf.tensordot(tf.tensordot(new_eigen_vectors_A, diag_lambda, axes=1), tf.linalg.inv(new_eigen_vectors_A), axes=1)

# Rounding hides the small float32 error so the result can be compared with A directly
print("The decomposition Matrix A: \n{}".format(tf.round(decomp_A)))

Eigen Values of Matrix A: [0.8377223 7.1622777]

Eigen Vector of Matrix A:
[[-0.5847103   0.81124216]
 [ 0.81124216  0.5847103 ]]

Diagonal of Lambda:
[[0.8377223 0.       ]
 [0.        7.1622777]]

The decomposition Matrix A:
[[5. 3.]
 [3. 3.]]

Not every matrix can be decomposed into eigenvalues and eigenvectors. In some cases, the decomposition exists but involves complex rather than real numbers.
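To see the complex case with your own eyes, here is a minimal sketch in plain Python that solves the characteristic polynomial of a 2 × 2 matrix. The helper `eigenvalues_2x2` and the two example matrices are my own illustrative choices. A 90 degree rotation turns every real vector, so no real vector can keep its direction, and its eigenvalues come out as the imaginary pair i and -i.

```python
import cmath


def eigenvalues_2x2(M):
    """Eigenvalues of a 2x2 matrix from its characteristic polynomial.

    lambda^2 - trace * lambda + det = 0, solved with the quadratic formula.
    cmath.sqrt is used so that a negative discriminant yields complex roots.
    """
    (a, b), (c, d) = M
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


rotation_90 = [[0.0, -1.0], [1.0, 0.0]]  # rotates every vector by 90 degrees
symmetric = [[2.0, 1.0], [1.0, 2.0]]     # real symmetric matrix

print("rotation :", eigenvalues_2x2(rotation_90))  # purely imaginary pair
print("symmetric:", eigenvalues_2x2(symmetric))    # imaginary parts are zero
```

For the symmetric matrix the imaginary parts are zero and the eigenvalues are 3 and 1.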

In this book, we usually need to decompose only a specific class of matrices that have a simple decomposition. Specifically, every real symmetric matrix can be decomposed into an expression using only real-valued eigenvectors and eigenvalues:

A = Q Λ Q^T

where Q is an orthogonal matrix composed of eigenvectors of A and Λ is a diagonal matrix. The eigenvalue Λ_{i,i} is associated with the eigenvector in column i of Q, denoted as Q_{:, i}. Because Q is an orthogonal matrix, we can think of A as scaling space by λ_i in direction v^(i).

# In section 2.6 we manually created a matrix to verify if it is symmetric, but what if we don't know the exact values and want to create a random symmetric matrix
new_matrix_A = tf.Variable(tf.random.uniform([2,2], minval=1, maxval=10, dtype=tf.float32))

# to create an upper triangular matrix from a square one
X_upper = tf.linalg.band_part(new_matrix_A, 0, -1)
sym_matrix_A = tf.multiply(0.5, (X_upper + tf.transpose(X_upper)))
print("Symmetric Matrix A: \n{}\n".format(sym_matrix_A))

# create orthogonal matrix Q from eigen vectors of A
eigen_values_Q, eigen_vectors_Q = tf.linalg.eigh(sym_matrix_A)
print("Matrix Q: \n{}\n".format(eigen_vectors_Q))

# putting eigen values in a diagonal matrix
new_diag_lambda = tf.linalg.diag(eigen_values_Q)
print("Matrix Lambda: \n{}\n".format(new_diag_lambda))

sym_RHS = tf.tensordot(tf.tensordot(eigen_vectors_Q, new_diag_lambda, axes=1), tf.transpose(eigen_vectors_Q), axes=1)

predictor = tf.reduce_all(tf.equal(tf.round(sym_RHS), tf.round(sym_matrix_A)))
def true_print(): print("It WORKS. \nRHS: \n{} \n\nLHS: \n{}".format(sym_RHS, sym_matrix_A))
def false_print(): print("Condition FAILED. \nRHS: \n{} \n\nLHS: \n{}".format(sym_RHS, sym_matrix_A))

tf.cond(predictor, true_print, false_print)

Symmetric Matrix A:
[[4.517448  3.3404353]
 [3.3404353 7.411926 ]]

Matrix Q:
[[-0.8359252 -0.5488433]
 [ 0.5488433 -0.8359252]]

Matrix Lambda:
[[2.3242188 0.       ]
 [0.        9.605155 ]]

It WORKS.
RHS:
[[4.5174475 3.340435 ]
 [3.340435  7.4119253]]

LHS:
[[4.517448  3.3404353]
 [3.3404353 7.411926 ]]

The eigendecomposition of a matrix tells us many useful facts about the matrix. The matrix is singular if and only if any of the eigenvalues are zero. The eigendecomposition of a real symmetric matrix can also be used to optimize quadratic expressions of the form f(x) = x^T Ax subject to ||x||_2 = 1.

This constrained problem can be understood as follows. If x is an eigenvector of A and λ is the corresponding eigenvalue, then Ax = λx, and therefore f(x) = x^T Ax = x^T (λx) = λ x^T x. Since ||x||_2 = 1 means x^T x = 1, this boils down to f(x) = λ.

So whenever x is equal to a unit eigenvector of A, f takes on the value of the corresponding eigenvalue. The maximum value of f within the constraint region is the maximum eigenvalue and its minimum value within the constraint region is the minimum eigenvalue.
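Here is a rough numerical check of that claim in plain Python. The matrix A = [[2, 1], [1, 2]] (eigenvalues 1 and 3) and the helper `quadratic_form` are my own choices for illustration. The sketch sweeps unit vectors around the circle and records the smallest and largest value of f.

```python
import math


def quadratic_form(A, x):
    """Compute f(x) = x^T A x for a 2x2 matrix A and a 2-vector x."""
    Ax = [A[0][0] * x[0] + A[0][1] * x[1],
          A[1][0] * x[0] + A[1][1] * x[1]]
    return x[0] * Ax[0] + x[1] * Ax[1]


A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric, eigenvalues 1 and 3

# Sweep unit vectors x = (cos t, sin t), so ||x||_2 = 1 by construction
values = [quadratic_form(A, (math.cos(t), math.sin(t)))
          for t in (2 * math.pi * k / 3600 for k in range(3600))]
print(round(min(values), 6), round(max(values), 6))  # 1.0 3.0

# At the unit eigenvectors, f equals the matching eigenvalue
s = 1 / math.sqrt(2)
print(round(quadratic_form(A, (s, s)), 6))   # 3.0
print(round(quadratic_form(A, (s, -s)), 6))  # 1.0
```

The sweep never leaves the interval between the smallest and the largest eigenvalue, and it touches both ends exactly at the eigenvector directions.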

A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero valued, it is negative semidefinite. Positive semidefinite matrices are interesting because they guarantee that ∀ x, x^T Ax ≥ 0. Positive definite matrices additionally guarantee that x^T Ax = 0 ⇒ x=0.
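These definitions translate almost directly into code. Below is a minimal sketch in plain Python for symmetric 2 × 2 matrices only, using the closed-form eigenvalues of [[a, b], [b, d]]. The function names, the tolerance `tol`, and the extra label "indefinite" (for a matrix with both positive and negative eigenvalues, a case the paragraph above does not name) are my own assumptions.

```python
import math


def symmetric_eigenvalues_2x2(A):
    """Real eigenvalues (smallest, largest) of a symmetric 2x2 matrix [[a, b], [b, d]]."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)  # always real for symmetric input
    return mean - radius, mean + radius


def classify(A, tol=1e-12):
    """Name the definiteness class of a symmetric 2x2 matrix from its eigenvalues."""
    low, high = symmetric_eigenvalues_2x2(A)
    if low > tol:
        return "positive definite"
    if low >= -tol:
        return "positive semidefinite"
    if high < -tol:
        return "negative definite"
    if high <= tol:
        return "negative semidefinite"
    return "indefinite"


print(classify([[2.0, 1.0], [1.0, 2.0]]))     # positive definite
print(classify([[1.0, 1.0], [1.0, 1.0]]))     # positive semidefinite
print(classify([[-2.0, 0.0], [0.0, -1.0]]))   # negative definite
print(classify([[-1.0, 1.0], [1.0, -1.0]]))   # negative semidefinite
print(classify([[1.0, 0.0], [0.0, -1.0]]))    # indefinite
```

The second matrix has eigenvalues 0 and 2, so x^T Ax can be zero for a nonzero x (for example x = (1, -1)), which is exactly what separates semidefinite from definite.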

This is section Seven of the Chapter on Linear Algebra with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.


Scalars, Vectors, Matrices and Tensors with Tensorflow 2.0

Mukesh Mithrakumar — Sat, 01 Jun 2019 22:22:18 +0000

Scalars: A scalar is just a single number. For example, temperature is denoted by just one number.

Vectors: A vector is an array of numbers. The numbers are arranged in order and we can identify each individual number by its index in that ordering. We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis. In simple terms, a vector is an arrow representing a quantity that has both magnitude and direction, wherein the length of the arrow represents the magnitude and the orientation tells you the direction. For example, wind has both a direction and a magnitude.

Matrices: A matrix is a 2D-array of numbers, so each element is identified by two indices instead of just one. If a real valued matrix A has a height of m and a width of n, then we say that A ∈ R^(m × n). We identify the elements of the matrix as A_(i,j) where i represents the row and j represents the column.

Tensors: In the general case, an array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We identify the element of a tensor A at coordinates (i, j, k) by writing A_(i, j, k).

But to truly understand tensors, we need to expand the way we think of vectors as only arrows with a magnitude and direction. Remember that a vector in three dimensions can be represented by three components, namely the x, y and z components, one along each basis vector. If you have a pen and a paper, let's do a small experiment: place the pen vertically on the paper, slant it by some angle, and shine a light from the top so that the shadow of the pen falls on the paper. This shadow represents the x component of the vector "pen", and the height from the paper to the tip of the pen is the y component.

Now, let's take these components to describe tensors. Imagine you are Indiana Jones or a treasure hunter, you are trapped in a cube, and there are three arrows flying towards you from three faces of the cube (representing the x, y and z axes) 😬. I know this will be the last thing you would think about in such a situation, but you can think of those three arrows as vectors pointing towards you from the three faces of the cube, and you can represent each of those vectors (arrows) by its x, y and z components. That is a rank 2 tensor (matrix) with 9 components. Remember that this is a very, very simple explanation of tensors. Following is a representation of a tensor:

We can add matrices to each other as long as they have the same shape, just by adding their corresponding elements:

C = A + B where C_(i,j) = A_(i,j) + B_(i,j)

If you have trouble viewing the equations in the browser you can also read the chapter in Jupyter nbviewer in its entirety. If not, let's continue.

In tensorflow a:

Rank 0 Tensor is a Scalar
Rank 1 Tensor is a Vector
Rank 2 Tensor is a Matrix
Rank 3 Tensor is a 3-Tensor
Rank n Tensor is an n-Tensor
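If the idea of rank still feels abstract, here is a minimal sketch in plain Python that treats nested lists as tensors and counts their axes. The helpers `shape_of` and `rank_of` are hypothetical names of mine (they only mimic what a tensor's shape tells you), and they assume the nested lists are regular, meaning every sub-list on the same level has the same length.

```python
def shape_of(nested):
    """Return the shape of a regular nested list as a tuple; () for a scalar."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)


def rank_of(nested):
    """Rank = number of axes = length of the shape."""
    return len(shape_of(nested))


scalar = 21.5
vector = [1, 2, 3]
matrix = [[1, 2, 3], [4, 5, 6]]
tensor_3 = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]]

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("3-tensor", tensor_3)]:
    print(name, "rank:", rank_of(value), "shape:", shape_of(value))
```

Each extra level of nesting adds one axis, so the rank is simply how many indices you need to pick out a single number.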

import tensorflow as tf

# let's create a ones 3x3 rank 2 tensor
rank_2_tensor_A = tf.ones([3, 3], name='MatrixA')
print("3x3 Rank 2 Tensor A: \n{}\n".format(rank_2_tensor_A))

# let's manually create a 3x3 rank two tensor and specify the data type as float
rank_2_tensor_B = tf.constant([[1, 2, 3], [4, 5, 6], [7, 8, 9]], name='MatrixB', dtype=tf.float32)
print("3x3 Rank 2 Tensor B: \n{}\n".format(rank_2_tensor_B))

# addition of the two tensors
rank_2_tensor_C = tf.add(rank_2_tensor_A, rank_2_tensor_B, name='MatrixC')
print("Rank 2 Tensor C with shape={} and elements: \n{}".format(rank_2_tensor_C.shape, rank_2_tensor_C))

3x3 Rank 2 Tensor A:
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]

3x3 Rank 2 Tensor B:
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

Rank 2 Tensor C with shape=(3, 3) and elements:
[[ 2.  3.  4.]
 [ 5.  6.  7.]
 [ 8.  9. 10.]]

# Let's see what happens if the shapes are not the same
two_by_three = tf.ones([2, 3])
try:
    incompatible_tensor = tf.add(two_by_three, rank_2_tensor_B)
except (tf.errors.InvalidArgumentError, ValueError):
    # eager execution raises InvalidArgumentError, graph construction raises ValueError
    print("""Incompatible shapes to add with two_by_three of shape {0} and 3x3 Rank 2 Tensor B of shape {1}
    """.format(two_by_three.shape, rank_2_tensor_B.shape))

Incompatible shapes to add with two_by_three of shape (2, 3) and 3x3 Rank 2 Tensor B of shape (3, 3)
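The rule C_(i,j) = A_(i,j) + B_(i,j) is easy to spell out by hand. Here is a minimal sketch in plain Python with nested lists standing in for matrices. The function `add_matrices` is my own illustrative helper, and it is deliberately stricter than Tensorflow: it insists on identical shapes and does no broadcasting at all.

```python
def add_matrices(A, B):
    """Element-wise sum C_(i,j) = A_(i,j) + B_(i,j); shapes must match exactly."""
    shape_a = (len(A), len(A[0]))
    shape_b = (len(B), len(B[0]))
    if shape_a != shape_b:
        raise ValueError("incompatible shapes {} and {}".format(shape_a, shape_b))
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


ones_3x3 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(add_matrices(ones_3x3, B))  # [[2, 3, 4], [5, 6, 7], [8, 9, 10]]

ones_2x3 = [[1, 1, 1], [1, 1, 1]]
try:
    add_matrices(ones_2x3, B)
except ValueError as error:
    print(error)  # incompatible shapes (2, 3) and (3, 3)
```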

We can also add a scalar to a matrix or multiply a matrix by a scalar, just by performing that operation on each element of a matrix:

D = a · B + c where D_(i,j) = a · B_(i,j) + c

# Create scalar a, c and Matrix B
rank_0_tensor_a = tf.constant(2, name="scalar_a", dtype=tf.float32)
rank_2_tensor_B = tf.constant([[1, 2, 3], [4, 5, 6], [7, 8, 9]], name='MatrixB', dtype=tf.float32)
rank_0_tensor_c = tf.constant(3, name="scalar_c", dtype=tf.float32)

# multiplying aB
multiply_scalar = tf.multiply(rank_0_tensor_a, rank_2_tensor_B)
# adding aB + c
rank_2_tensor_D = tf.add(multiply_scalar, rank_0_tensor_c, name="MatrixD")

print("""Original Rank 2 Tensor B: \n{0} \n\nScalar a: {1}
Rank 2 Tensor for aB: \n{2} \n\nScalar c: {3} \nRank 2 Tensor D = aB + c: \n{4}
""".format(rank_2_tensor_B, rank_0_tensor_a, multiply_scalar, rank_0_tensor_c, rank_2_tensor_D))

Original Rank 2 Tensor B:
[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

Scalar a: 2.0
Rank 2 Tensor for aB:
[[ 2.  4.  6.]
 [ 8. 10. 12.]
 [14. 16. 18.]]

Scalar c: 3.0
Rank 2 Tensor D = aB + c:
[[ 5.  7.  9.]
 [11. 13. 15.]
 [17. 19. 21.]]
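The same result can be reproduced element by element in plain Python, which makes the formula D_(i,j) = a · B_(i,j) + c explicit. The helper name `scale_and_shift` is mine and is only meant as a sketch.

```python
def scale_and_shift(a, B, c):
    """Apply D_(i,j) = a * B_(i,j) + c to every element of matrix B."""
    return [[a * b + c for b in row] for row in B]


B = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
D = scale_and_shift(2, B, 3)
print(D)  # [[5, 7, 9], [11, 13, 15], [17, 19, 21]]
```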

One important operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. We denote the transpose of a matrix A as A^T, and it is defined such that (A^T)_(i, j) = A_(j, i)

# Creating a Matrix E
rank_2_tensor_E = tf.constant([[1, 2, 3], [4, 5, 6]])
# Transposing Matrix E
transpose_E = tf.transpose(rank_2_tensor_E, name="transposeE")

print("""Rank 2 Tensor E of shape: {0} and elements: \n{1}\n
Transpose of Rank 2 Tensor E of shape: {2} and elements: \n{3}""".format(rank_2_tensor_E.shape, rank_2_tensor_E, transpose_E.shape, transpose_E))

Rank 2 Tensor E of shape: (2, 3) and elements:
[[1 2 3]
 [4 5 6]]

Transpose of Rank 2 Tensor E of shape: (3, 2) and elements:
[[1 4]
 [2 5]
 [3 6]]
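For comparison, here is a minimal sketch of the same operation in plain Python, where rows become columns with the help of `zip`. The function name `transpose` is my own, and the final check simply walks over the definition (A^T)_(i, j) = A_(j, i).

```python
def transpose(M):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*M)]


E = [[1, 2, 3], [4, 5, 6]]
E_T = transpose(E)
print(E_T)  # [[1, 4], [2, 5], [3, 6]]

# Check the definition element by element: (E^T)_(i,j) == E_(j,i)
print(all(E_T[i][j] == E[j][i] for i in range(3) for j in range(2)))  # True
```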

In deep learning we allow the addition of a matrix and a vector, yielding another matrix where C_(i, j) = A_(i, j) + b_(j). In other words, the vector b is added to each row of the matrix. This implicit copying of b to many locations is called broadcasting.

# Creating a vector b (a rank 1 tensor of shape (3,), so that it is added to each row of A)
rank_1_tensor_b = tf.constant([4., 5., 6.])
# Broadcasting a vector b to a matrix A such that it yields a matrix F = A + b
rank_2_tensor_F = tf.add(rank_2_tensor_A, rank_1_tensor_b, name="broadcastF")

print("""Rank 2 tensor A: \n{0}\n \nRank 1 Tensor b: \n{1}
\nRank 2 tensor F = A + b:\n{2}""".format(rank_2_tensor_A, rank_1_tensor_b, rank_2_tensor_F))

Rank 2 tensor A:
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]

Rank 1 Tensor b:
[4. 5. 6.]

Rank 2 tensor F = A + b:
[[5. 6. 7.]
 [5. 6. 7.]
 [5. 6. 7.]]
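Broadcasting can go along either axis, and which one you get depends on the shape of b, so it is worth seeing both side by side. Here is a minimal sketch in plain Python. The helpers `add_row_vector` and `add_column_vector` are hypothetical names of mine. The first follows the formula C_(i, j) = A_(i, j) + b_(j), which is what a vector of shape (3,) gives you, and the second uses C_(i, j) = A_(i, j) + b_(i), which is what a column of shape (3, 1) gives you.

```python
def add_row_vector(A, b):
    """C_(i,j) = A_(i,j) + b_(j): b is reused for every row of A."""
    if any(len(row) != len(b) for row in A):
        raise ValueError("b must have one entry per column of A")
    return [[a + b_j for a, b_j in zip(row, b)] for row in A]


def add_column_vector(A, b):
    """C_(i,j) = A_(i,j) + b_(i): b is reused for every column of A."""
    if len(A) != len(b):
        raise ValueError("b must have one entry per row of A")
    return [[a + b_i for a in row] for row, b_i in zip(A, b)]


A = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
b = [4.0, 5.0, 6.0]

print(add_row_vector(A, b))     # [[5.0, 6.0, 7.0], [5.0, 6.0, 7.0], [5.0, 6.0, 7.0]]
print(add_column_vector(A, b))  # [[5.0, 5.0, 5.0], [6.0, 6.0, 6.0], [7.0, 7.0, 7.0]]
```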

This is section two of the Chapter on Linear Algebra with Tensorflow 2.0 of the Book Deep Learning with Tensorflow 2.0.

You can read this section and the following topics:

02.01 — Scalars, Vectors, Matrices and Tensors
02.02 — Multiplying Matrices and Vectors
02.03 — Identity and Inverse Matrices
02.04 — Linear Dependence and Span
02.05 — Norms
02.06 — Special Kinds of Matrices and Vectors
02.07 — Eigendecomposition
02.08 — Singular Value Decomposition
02.09 — The Moore-Penrose Pseudoinverse
02.10 — The Trace Operator
02.11 — The Determinant
02.12 — Example: Principal Components Analysis

Deep Learning With TF 2.0: 01.00- Introduction

Mukesh Mithrakumar — Tue, 21 May 2019 13:48:22 +0000

01.00 — Preface

Deep Learning

The Deep Learning textbook is written by Ian Goodfellow, Yoshua Bengio and Aaron Courville, and it is intended to help students and practitioners enter the field of machine learning in general and deep learning in particular. It is an excellent, comprehensive textbook on deep learning, the best I have found so far, and it is presented elegantly and rigorously throughout. This is the book that explains what is going on and why, so that you will be able to make principled decisions and not just blindly implement things.

The book is mainly separated into three parts:

Part I: Applied Math and Machine Learning Basics
Part II: Modern Practical Deep Networks
Part III: Deep Learning Research

For a detailed view of the chapters see Index.

This book includes almost everything you need to know to understand deep learning algorithms, but it can be challenging for two reasons.

One, this is a highly theoretical book written as an academic text. Even though it devotes a whole part to the applied math background, it still requires additional math background, and the authors acknowledge that.

Two, the best way to learn these concepts is by practicing them, working on problems and solving programming examples, and after scouring the internet I found no complete set of exercises or programming guide for this great book.

This eventually led me to write Deep Learning with Tensorflow 2.0. My goal is to provide explanations for sections that may seem too complex, summarize those that are not, and finally provide programming examples in Tensorflow 2.0.

Tensorflow

The main bottleneck of Tensorflow 1.x was its steep learning curve: the declarative style of programming was not very intuitive to those who are used to programming in an imperative language like Python.

Tensorflow 2.0 comes with major changes, and because of these changes, if you are just starting out with Tensorflow, you are in the best place. You can jump right in and start learning without worrying about Tensorflow 1.x. But what is that old saying? Those who cannot remember the past are condemned to repeat it. Not to sound ominous, but knowing what changed from Tensorflow 1.x to Tensorflow 2.0 will help the new user understand and learn the framework better, and in case you end up working with Tensorflow 1.x code, you will need to know how to upgrade it to Tensorflow 2.0.

And for those users who had to struggle, like me, with Tensorflow 1.x to learn the framework: I am sorry my friends, you will have to re-learn how to use the new framework and rewrite your codebase. A small consolation is that Tensorflow 1.x users are not completely abandoned: the TensorFlow team has created the tf_upgrade_v2 utility to help transition legacy code to the new API. But conversion tools are not perfect, so you still might have to manually change some code. In short, TensorFlow 2.0 is backward-incompatible.

The main problem with TensorFlow 1.x was how difficult it was to learn, apply and debug, and TensorFlow 2.0 solves this by using:

Eager Execution: which is an imperative programming environment that evaluates operations immediately, without building graphs, unlike in Tensorflow 1.x which uses Python as a declarative metaprogramming tool for graphs. In short, the Graph and the graph runtime are both abstracted away, which means no session and no global graph state.
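If the difference between the two styles is hard to picture, here is a toy analogy in plain Python. To be clear, this is not Tensorflow's API: the class `DeferredAdd` is something I made up purely to illustrate the idea of describing a computation first and running it later, compared with getting a value back from every operation immediately.

```python
class DeferredAdd:
    """A toy 'graph node': it records an addition but does not compute it."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def run(self):
        """Evaluate the recorded expression, recursing into nested nodes."""
        left = self.left.run() if isinstance(self.left, DeferredAdd) else self.left
        right = self.right.run() if isinstance(self.right, DeferredAdd) else self.right
        return left + right


# Declarative style: build the whole expression first, evaluate it later
graph = DeferredAdd(DeferredAdd(1, 2), 3)
print(type(graph).__name__)  # DeferredAdd, there is no number yet
print(graph.run())           # 6

# Imperative (eager) style: every operation returns a concrete value at once
eager = (1 + 2) + 3
print(eager)                 # 6
```

In the deferred version you cannot inspect an intermediate result until something runs the expression, which is roughly why the declarative style felt harder to debug.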

You can read more about the changes between Tensorflow 1.x and Tensorflow 2.0 here.

If you didn’t understand what graph runtime means, I think this is your first time with Tensorflow so you are lucky.

Now that we have taken a look at the past, what’s in store for the future? Tensorflow has a comprehensive, flexible ecosystem of tools, including TensorFlow.js to create new machine learning models and deploy existing models with JavaScript, TensorFlow Lite to run inference on mobile and embedded devices like Android, iOS, Edge TPU, and Raspberry Pi, and TensorFlow Extended to deploy a production-ready machine learning pipeline for training and inference. This lets researchers push the state of the art in machine learning and lets developers easily build, deploy and scale machine learning powered applications. Note the word scale, and who better than Google to teach us about scale.

01.01 — Introduction

Humans have long dreamed of creating machines that think. The desire dates back at least to the time of ancient Greece, to figures like Daedalus and Hephaestus and to artificial life like Galatea, Talos, and the famous Pandora. Not the planet in Avatar but the Pandora's box mythos 😉.

And even before programmable computers were invented, people were dreaming about software to automate routine labor, understand speech or images, and make diagnoses in medicine.

The first reference to Artificial Intelligence in Hollywood was back in 1951, in the movie “The Day the Earth Stood Still”.

But this is really the world's first victim of automation:

From the early days until now, computers have excelled at tasks that humans find difficult. The true challenge remains the tasks that humans find easy and that feel automatic, like recognizing spoken words or driving.

This book is about a solution to these more intuitive problems.

Depending on your source for learning about Artificial Intelligence and Machine Learning, you may not even be familiar with Machine Learning and Deep Learning as separate subjects, because these phrases are often tossed around interchangeably.

Deep Learning is one of the approaches to AI. Read about where Deep Learning fits into AI here.

In short, we allow computers to learn from experience and understand the world in terms of a hierarchy of concepts. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers.

The following figure shows how a deep learning system can represent the concept of a person by combining simpler concepts, such as corners and contours which are in turn defined in terms of edges.

01.02 — Who should read this book

The book was initially written with two target audiences in mind:

University students (undergraduate or graduate) learning about machine learning, including those who are beginning a career in deep learning and artificial intelligence research.
Software engineers who do not have a machine learning or statistics background but want to rapidly acquire one and begin using deep learning in their product or platform.

My goal is to expand the audience to anyone interested in starting to learn Deep Learning with limited machine learning, statistics, Python, and Tensorflow background. Please note that I assume you have a basic understanding of Python. As we go deeper into the material, the problems we solve may become Python intensive, and in those sections I will refer you to further resources which you can use to learn Python.

Given below is the high-level organization of the book. An arrow from one chapter to another indicates that the former chapter is prerequisite material for understanding the latter.

If you are familiar with certain sections, feel free to skip those.

01.03 — A Short History of Deep Learning

Throughout history, deep learning has been called by many names. That sounds like the beginning of a mystery novel, and for the most part it was, and to an extent it still is, a mystery. During the 1940s-1960s deep learning was known as Cybernetics, during the 1980s-1990s people called it Connectionism, and it was resurrected in 2006 under the name Deep Learning.

The origin of Deep Learning can be roughly traced back to 1943, when Warren McCulloch and Walter Pitts published “A Logical Calculus of the Ideas Immanent in Nervous Activity”, which first outlined the computational model of a neural network, meaning a model of how learning happens or could happen in the brain. As a result, one of the names that deep learning has gone by is Artificial Neural Networks (ANNs).

So, if Deep Learning has been around since the 1940s, then why is it only now reaching the mainstream computing audience?

The two main reasons are the availability of enormous amounts of data and the increasing power of affordable graphics processing units (GPUs).

This is by no means a complete history of Deep Learning, which would take a book in itself, so I urge interested readers to read the chapter Introduction. But if you want to jump right in, let’s get started with Linear Algebra.

❤️ Next Post: Linear Algebra with Tensorflow 2.0

Read about Linear Algebra with Tensorflow 2.0.

I would love to hear from you, if you need more explanations, have any doubts or questions, you can comment below or reach out to me personally via Facebook.

This is Chapter 1 of my book Deep Learning with Tensorflow 2.0. I will be posting biweekly, so make sure to check out my blog for updates and GitHub for the code.
