The goal of this blog post is to delve into the exploding gradients problem for neural networks and its causes. While the subject has extensive exposure through various courses and blog posts on the web, I found it difficult to find sources with the right kind of depth. Throughout this post, things will be explained by making an explicit link with the neural network's graph structure, something that a lot of people will be familiar with. Finally, two questions will be answered:
1) How exactly are exploding activations linked to exploding gradients?
2) When gradients explode, do they explode equally strongly throughout the network?
We will start by covering the background for the exploding/vanishing gradients problem. This builds on a lot of concepts and notation, which we cannot cover in a single post. In such cases, links will be added to relevant material. I also want to give a special mention to one specific source. It has a detailed overview of everything related to neural network initialization and, most importantly, it offers an amazing embedded visualisation tool.
A future post will build on this work to explain the specifics of He vs Xavier initialization.
Neural networks background
Let's use the following not-so-deep neural network as a leading example:
During training, we determine how to slightly nudge our weights, based on a cost function that measures how good of a prediction we made compared to some labeled samples. To determine how much each weight should be nudged, we look at how a tiny change in said weight would impact the cost function. Then we alter the weight in the direction that decreases the cost function for those samples. Applied iteratively, this is a layman's terms explanation of gradient descent. Modern day neural network optimizers, such as Adam, have a few more bells and whistles, but this covers the absolute basics. More information on the concepts that lead to Adam can be found here.
For this blog post, I'll simplify things further by assuming that we update weights for every single sample, which is called stochastic gradient descent. This is not a good idea in practice, as learning will be noisy and slow, but it doesn't really impact this discussion.
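To make that concrete, here is a minimal sketch of a per-sample update, assuming a single linear neuron with a squared-error cost (the setup and names are illustrative, not the network from the figure):

```python
import numpy as np

# Minimal sketch of per-sample (stochastic) gradient descent.
# Assumed setup: one linear neuron, squared-error cost J = 0.5 * (y_hat - y)^2.
rng = np.random.default_rng(0)
w = rng.normal(size=2)   # two weights, one per input
lr = 0.01                # learning rate: the size of each "nudge"

samples = [(np.array([1.0, 2.0]), 1.0),
           (np.array([0.5, -1.0]), 0.0)]

for x, y in samples:     # one update per sample = stochastic gradient descent
    y_hat = x @ w                  # forward pass: linear neuron, no activation
    grad = (y_hat - y) * x         # dJ/dw for the squared-error cost
    w -= lr * grad                 # nudge the weights against the gradient
```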
In the figure, the gradient, or partial derivative with respect to w00[2], tells us how to update that weight for the next iteration. Nodes and edges that are not greyed out all contribute to the impact of a weight change on the final cost function. In order to calculate this partial derivative, we use the chain rule of derivatives, which breaks it up into a multiplication (note that we use J(x) to represent some cost function).
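In generic notation (my own, since the figure's exact symbols aren't reproduced here), writing $z^{[k]}$ for pre-activations and $a^{[k]}$ for activations, the split looks like:

$$
\frac{\partial J}{\partial w_{00}^{[2]}}
= \underbrace{\frac{\partial J}{\partial z_{0}^{[2]}}}_{\text{backward part}}
\cdot
\underbrace{\frac{\partial z_{0}^{[2]}}{\partial w_{00}^{[2]}}}_{\text{forward part}}
$$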
The last part, which we will call the forward part, can be expanded as:
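Assuming node 0 of layer 1 is itself computed from the inputs $x_i$ through weights $w_{0i}^{[1]}$ and an activation function $g$ (again my notation, not necessarily the figure's), the forward part is:

$$
\frac{\partial z_{0}^{[2]}}{\partial w_{00}^{[2]}}
= a_{0}^{[1]}
= g\!\left(\sum_i w_{0i}^{[1]} x_i\right)
$$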
Writing it out explicitly like this shows an important property:
We can rewrite this as a sum over products, where each term of the sum represents a path from an input node through w00[2] up to the output layer. Each term has a number of factors that scales with the depth of the neural network (roughly 2L, where L is the number of layers in the network). The total number of paths going through w00[2] is the product of the number of nodes in each subsequent layer, where K is the current layer and ni the number of nodes in layer i.
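In symbols (assuming layers are numbered 1 through L, with layer L being the output):

$$
\#\text{paths through } w_{00}^{[K]} \;=\; \prod_{i=K+1}^{L} n_i
$$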
Each path also has a number of factors that scales with the depth of the network. For the forward part, we have one factor per layer plus one factor for the input. For the backward part, we have two factors per layer, because we have both the linear computation of the neuron and the non-linear activation function. Finally, the derivative of the cost function adds one additional factor.
Exploding gradients
The sample network has just 5 layers in total. In our sum over products view, that would give
- 2*2*1 = 4 paths going through w00[2] (the final layer has just one neuron)
- individual terms (i.e. paths) consisting of 9 factors each (2 for the forward part, 1 for the cost derivative, and 6 from the 3 layers in the backwards path)
If we disregard the activation functions for a moment (note that the expressive power of the network then becomes equivalent to that of linear regression), we get paths consisting of 6 factors.
Let's explore the effects of changing some network parameters. Suppose both inputs, as well as every weight, are one. Note that the derivative of the cost function with respect to the final activation is the same for each path; we will call it dc.
- If we choose 1 for all weights, the gradient for w00[2] is 4*dc.
- If we decide to change the width of the hidden layers to 100, our calculation becomes 2 * 100 * 1 * dc = 200*dc.
- If we change the depth to 100 total layers, our calculation becomes 2^97 * 1 * dc. In this case, changing the depth causes the number of paths, and with it the gradient, to explode. The sketch below makes these counts concrete.
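Here is a small sketch of the path counting under the simplifications above (no activations, all weights and inputs equal to one); the list convention for layer widths is my own:

```python
from math import prod

def num_paths(widths, k):
    # Paths through a weight into layer k (1-indexed) = the product of
    # the widths of all layers after layer k.
    return prod(widths[k:])

shallow = [2, 2, 2, 2, 1]     # the 5-layer sample network
deep = [2] * 99 + [1]         # 100 layers in total, same widths otherwise

print(num_paths(shallow, 2))  # 4     -> gradient is 4 * dc
print(num_paths(deep, 2))     # 2**97 -> the gradient explodes with depth
```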
The key takeaway is that changing the depth has a compounding, exponential effect: the number of paths (terms) grows exponentially with depth, and the number of factors per term grows with depth as well. Depth is directly tied to the importance of a good initialization.
What if we bring the activation functions back into the equation? First note the following property:
The partial derivatives with respect to weights in earlier layers are impacted more by activation functions. More precisely, each path has L-K activation factors, where K is the current layer and L the total number of layers (including the input layer).
Let's examine the effect of both ReLU and sigmoid activations.
- When using only ReLU activations, each activation factor is either 1 (doing nothing) or 0 (dropping the contribution of a path entirely when its input is negative). Surviving paths are never scaled down, which keeps the partial derivatives for weights in earlier layers relatively large.
- When using only sigmoid activations, every factor shrinks the contribution of a path, whether it is positive or negative, since the sigmoid derivative is at most 0.25. The partial derivative will move closer to 0. The sketch below contrasts the two.
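A small standalone sketch of the factors each activation contributes to a path (plain functions, not tied to any framework):

```python
import numpy as np

def relu_grad(z):
    # ReLU derivative: 1 where the input is positive, 0 elsewhere.
    return (z > 0).astype(float)

def sigmoid_grad(z):
    # Sigmoid derivative s * (1 - s); it peaks at 0.25 when z == 0.
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

z = np.linspace(-4.0, 4.0, 9)
print(relu_grad(z))     # factors are exactly 0 or 1: drop a path or pass it through
print(sigmoid_grad(z))  # every factor is <= 0.25 ...
print(0.25 ** 3)        # ... so three sigmoid layers already shrink a path by 64x or more
```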
Conclusion
Now we have all the elements to answer the initial questions.
1) How exactly are exploding activations linked to exploding gradients?
Activations appear directly as factors in the gradient formula, so exploding activations produce exploding gradients. The further to the right of the network a weight sits, the more its gradient correlates with the activations.
2) When gradients explode, do they explode equally strongly at the back and the front of the network?
As a consequence of the activations, no. The precise effect depends on the activation functions used. If the final layer uses a sigmoid and all the hidden layers use ReLU (a common setup), both positive and negative paths can be dropped (depending on the sign of the weights in the final layer).
The next question is: can we initialize the weights in a systematic way, so that we avoid exploding or vanishing gradients? In the next post of this series, we will explore how to do this by using statistics to keep the variance and mean of the activations in check, and consequently (as we have explored in this post) the gradients as well. We will see how this results in different initialization methods for each activation function, more precisely He and Xavier initialization.