In the previous article, we only touched on the idea of cross entropy; in the coming articles, we will go deeper.
Consider our previous neural network.
It takes petal and sepal width measurements as inputs and, using a softmax layer at the end, outputs predicted probabilities for the species of iris we measure.
Since we use softmax during training, we evaluate how well the neural network fits the data using cross entropy.
If we want to optimize the network parameters using backpropagation, we need to take the derivative of the cross-entropy equation with respect to the different weights and biases in the neural network.
To demonstrate the basic principles behind backpropagation with cross entropy, we will go through an example where we optimize the bias ( b_3 ).
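Concretely, if the observed species is Setosa, the chain rule lets us break that derivative into pieces. This is only a general sketch (each term gets worked out later in this series), and ( z_{Setosa} ) is just my shorthand for the raw output value for Setosa:

$$
\frac{\partial \, \text{CrossEntropy}}{\partial b_3}
= \frac{\partial \, \text{CrossEntropy}}{\partial \, p_{\text{Setosa}}}
\times \frac{\partial \, p_{\text{Setosa}}}{\partial \, z_{\text{Setosa}}}
\times \frac{\partial \, z_{\text{Setosa}}}{\partial \, b_3}
$$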
How the neural network makes predictions
- First, we feed values between 0 and 1 into the inputs for petal and sepal widths.
- These values are modified by weights and biases and passed through an activation function to create a blue bent surface.
- This surface is scaled by a weight of (-0.1).
- Similarly, we create an orange bent surface using another activation function.
- This surface is scaled by a weight of (1.5).
- The blue and orange surfaces are combined to form a green crinkled surface.
- Finally, we add a bias term ( b_3 ), which produces the final green crinkled surface.
The same process applies to Versicolor.
- Add the blue bent surface.
- Add the orange bent surface.
- Add a bias term.
- This results in a red crinkled surface.
Finally, to compute the raw output value for Virginica, we add the blue bent surface to the orange bent surface and include a bias term ( b_5 ). This produces a purple crinkled surface.
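To make the forward pass concrete, here is a minimal sketch in Python. The activation function, the parameter names, and most weight values are assumptions for illustration (only the -0.1 and 1.5 scalings mentioned above come from the article), not the actual fitted network.

```python
import numpy as np

def softplus(x):
    # The "bent surface" activation; softplus is an assumption here,
    # since the article does not name the activation function in this section.
    return np.log(1.0 + np.exp(x))

def raw_outputs(petal_width, sepal_width, p):
    # Hidden layer: the blue and orange bent surfaces.
    blue = softplus(p["w1"] * petal_width + p["w2"] * sepal_width + p["b1"])
    orange = softplus(p["w3"] * petal_width + p["w4"] * sepal_width + p["b2"])

    # Each species' raw output scales the two surfaces and adds its own bias.
    z_setosa = p["w5"] * blue + p["w6"] * orange + p["b3"]      # green crinkled surface
    z_versicolor = p["w7"] * blue + p["w8"] * orange + p["b4"]  # red crinkled surface
    z_virginica = p["w9"] * blue + p["w10"] * orange + p["b5"]  # purple crinkled surface
    return np.array([z_setosa, z_versicolor, z_virginica])

# Hypothetical parameter values; w5 = -0.1 and w6 = 1.5 echo the scalings above.
params = {"w1": 2.7, "w2": -1.1, "b1": 0.3, "w3": -0.5, "w4": 1.9, "b2": -0.6,
          "w5": -0.1, "w6": 1.5, "b3": 0.9, "w7": 0.4, "w8": -2.2, "b4": 0.0,
          "w9": 1.2, "w10": 0.8, "b5": -0.4}
print(raw_outputs(0.5, 0.25, params))  # three raw output values
```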
Softmax and cross entropy
The raw output values are then passed through the softmax function to obtain the final predicted probabilities for Setosa, Versicolor, and Virginica.
Given training data, we feed the inputs through the network to compute the raw output values, apply the softmax function, and obtain predicted probabilities. For example, a value of 0.57 may represent the predicted probability for Setosa.
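A minimal softmax sketch (the raw output values here are made up; only the idea that softmax turns raw outputs into predicted probabilities comes from the article):

```python
import numpy as np

def softmax(raw):
    # Subtracting the max first is a standard numerical-stability trick;
    # it does not change the resulting probabilities.
    e = np.exp(raw - np.max(raw))
    return e / e.sum()

raw = np.array([1.4, 0.6, -0.2])  # hypothetical raw outputs: Setosa, Versicolor, Virginica
probs = softmax(raw)
print(probs, probs.sum())  # three predicted probabilities that sum to 1.0
```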
To evaluate how good this prediction is, we plug the predicted probability into the cross-entropy equation. At this point, we know how to compute the cross-entropy value.
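For a single observation, this boils down to taking the negative log of the probability predicted for the species we actually observed. A sketch, using the 0.57 value from above and assuming the observed species is Setosa:

```python
import numpy as np

def cross_entropy(prob_for_observed_class):
    # Cross entropy for one observation: the negative log of the probability
    # the network assigned to the species we actually observed.
    return -np.log(prob_for_observed_class)

print(round(cross_entropy(0.57), 3))  # 0.562
```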
Now we examine how changes in the bias ( b_3 ) affect the cross-entropy loss.
The bias ( b_3 ) finalizes the green crinkled surface and therefore determines the raw output for Setosa. By changing ( b_3 ), we change the raw output value for Setosa, which directly affects its softmax probability.
This happens because the Setosa raw output appears in both the numerator and the denominator of its softmax expression. Changing ( b_3 ) also affects the softmax probabilities for Versicolor and Virginica, since the Setosa raw output appears in their denominators as well.
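Written out (again using ( z ) as shorthand for the raw output values), the predicted probability for Setosa makes this easy to see:

$$
p_{\text{Setosa}} = \frac{e^{z_{\text{Setosa}}}}{e^{z_{\text{Setosa}}} + e^{z_{\text{Versicolor}}} + e^{z_{\text{Virginica}}}}
$$

Since ( b_3 ) is baked into ( z_{Setosa} ), it shows up in both the top and the bottom of this fraction, but only in the bottom of the corresponding fractions for Versicolor and Virginica.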
In the next step, we will look at how these predicted probabilities are plugged into the cross-entropy equation for each class and how this leads to the gradient used in backpropagation.
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:
ipm install repo-name
… and you're done!
