Cross Entropy Derivatives, Part 2: Setting Up the Derivative with Respect to a Bias

#ai #machinelearning

In the previous article we reviewed the key ideas needed to work with derivatives of cross entropy.

In this article, we will set up the derivative step by step.

When we plug a predicted probability into the cross-entropy equation, the form of the equation depends on the observed species.

For example, when the observed species is Setosa, the corresponding predicted probability is the predicted probability for Setosa.

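Using the softmax function, and writing \( \text{Raw} \) for the raw output value of each species, the predicted probability for Setosa is

\[
p_{\text{Setosa}} = \frac{e^{\text{Raw}_{\text{Setosa}}}}{e^{\text{Raw}_{\text{Setosa}}} + e^{\text{Raw}_{\text{Versicolor}}} + e^{\text{Raw}_{\text{Virginica}}}}
\]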

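Substituting this into the cross-entropy equation gives

\[
\text{CE}_{\text{Setosa}} = -\log\left(p_{\text{Setosa}}\right) = -\log\left(\frac{e^{\text{Raw}_{\text{Setosa}}}}{e^{\text{Raw}_{\text{Setosa}}} + e^{\text{Raw}_{\text{Versicolor}}} + e^{\text{Raw}_{\text{Virginica}}}}\right)
\]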

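If the observed species were Virginica, then the softmax equation for Virginica would give the corresponding predicted probability:

\[
p_{\text{Virginica}} = \frac{e^{\text{Raw}_{\text{Virginica}}}}{e^{\text{Raw}_{\text{Setosa}}} + e^{\text{Raw}_{\text{Versicolor}}} + e^{\text{Raw}_{\text{Virginica}}}}
\]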

Each of these cases leads to a slightly different equation. As a result, we obtain different derivatives of the cross entropy with respect to the bias \( b_3 \).
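To make the dependence on the observed species concrete, here is a minimal Python sketch. The raw output values and the function names `softmax` and `cross_entropy` are assumptions chosen for illustration; they do not come from the network discussed in this series.

```python
import math

def softmax(raw_outputs):
    """Convert a dict of raw output values into predicted probabilities."""
    # Subtracting the maximum keeps exp() numerically stable
    # and does not change the resulting probabilities.
    biggest = max(raw_outputs.values())
    exps = {name: math.exp(value - biggest) for name, value in raw_outputs.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

def cross_entropy(raw_outputs, observed):
    """Cross entropy for one sample: -log of the probability of the observed species."""
    return -math.log(softmax(raw_outputs)[observed])

# Hypothetical raw output values, used only for illustration.
raw = {"Setosa": 1.43, "Versicolor": -0.40, "Virginica": 0.23}

# The same raw outputs give a different loss for each observed species,
# because a different predicted probability is plugged into -log(p).
for species in raw:
    print(species, round(cross_entropy(raw, species), 4))
```

The loss is smallest for the species with the largest raw output, since that species receives the highest predicted probability.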

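We can summarize the derivatives we need as follows, one for each possible observed species:

\[
\frac{d\,\text{CE}_{\text{Setosa}}}{d\,b_3}, \qquad \frac{d\,\text{CE}_{\text{Versicolor}}}{d\,b_3}, \qquad \frac{d\,\text{CE}_{\text{Virginica}}}{d\,b_3}
\]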

Derivative of cross entropy for Setosa with respect to \( b_3 \)

Let us begin with the derivative of the cross entropy for Setosa with respect to \( b_3 \).

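First, we examine the predicted probability for Setosa more closely. When the observed species is Setosa, the cross-entropy loss is defined as

\[
\text{CE}_{\text{Setosa}} = -\log\left(p_{\text{Setosa}}\right)
\]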

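The predicted probability for Setosa comes directly from the softmax function:

\[
p_{\text{Setosa}} = \frac{e^{\text{Raw}_{\text{Setosa}}}}{e^{\text{Raw}_{\text{Setosa}}} + e^{\text{Raw}_{\text{Versicolor}}} + e^{\text{Raw}_{\text{Virginica}}}}
\]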

The inputs to the softmax function are the raw output values for Setosa, Versicolor, and Virginica.

Only the green surface, which corresponds to the raw output for Setosa, is directly influenced by the bias \( b_3 \). This raw output is formed by combining the blue and orange surfaces and then adding \( b_3 \).

To optimize \( b_3 \) using gradient descent, we must compute the derivative of the cross entropy with respect to \( b_3 \). The cross-entropy loss is connected to \( b_3 \) through the predicted probability for Setosa and the raw output value for Setosa.
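Before deriving the derivative analytically, it can help to estimate it numerically. The sketch below is only an illustration under stated assumptions: the value `pre_bias_setosa` stands in for the combined blue and orange surfaces at one input point, and all numbers, function names, and the learning rate are hypothetical.

```python
import math

def cross_entropy_setosa(b3, pre_bias_setosa=0.5, raw_versicolor=-0.40, raw_virginica=0.23):
    """Cross entropy as a function of b3 when the observed species is Setosa.

    All default values are hypothetical and chosen only for illustration.
    """
    # b3 enters only through the raw output for Setosa.
    raw_setosa = pre_bias_setosa + b3
    denom = math.exp(raw_setosa) + math.exp(raw_versicolor) + math.exp(raw_virginica)
    p_setosa = math.exp(raw_setosa) / denom
    return -math.log(p_setosa)

def numerical_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

b3 = 0.0
slope = numerical_derivative(cross_entropy_setosa, b3)

# One gradient descent step with a hypothetical learning rate.
learning_rate = 0.1
new_b3 = b3 - learning_rate * slope

print("estimated slope:", slope)
print("updated b3:", new_b3)
```

A finite-difference estimate like this is too slow for real training, but it is a convenient check on an analytic derivative once one has been worked out.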

To compute this derivative, we apply the chain rule. The chain rule states that the derivative of the cross entropy with respect to \( b_3 \) is the product of three terms: the derivative of the cross entropy with respect to the predicted probability for Setosa, the derivative of that predicted probability with respect to the raw output for Setosa, and the derivative of the raw output for Setosa with respect to \( b_3 \).
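Written as an equation, the chain rule for this case looks as follows. The symbols \( \text{CE}_{\text{Setosa}} \), \( p_{\text{Setosa}} \), and \( \text{Raw}_{\text{Setosa}} \) are shorthand assumed here for the cross entropy, the predicted probability, and the raw output value for Setosa:

\[
\frac{d\,\text{CE}_{\text{Setosa}}}{d\,b_3} = \frac{d\,\text{CE}_{\text{Setosa}}}{d\,p_{\text{Setosa}}} \times \frac{d\,p_{\text{Setosa}}}{d\,\text{Raw}_{\text{Setosa}}} \times \frac{d\,\text{Raw}_{\text{Setosa}}}{d\,b_3}
\]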

We will compute each of these terms step by step in the next article.
