In the previous article, we understood the concept of unrolling a network. In this article, we will examine the downsides of doing this in real-world scenarios.
The Problem with Unrolling
One big problem is that the more we unroll a recurrent neural network, the harder it is to train.
This problem is called the vanishing or exploding gradient problem.
In this example, the vanishing or exploding gradient problem comes from the weight on the recurrent connection, the one that is copied each time we unroll the network.
To make this easier to understand, we will ignore the other weights and biases and focus only on w2.
As we did earlier, when we optimize a neural network using backpropagation, we first compute the derivative, or gradient, of the loss with respect to each parameter.
Then we plug those gradients into the gradient descent algorithm to find the parameter values that minimize the loss function, such as the sum of squared residuals.
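As a rough sketch, a single gradient descent update for one parameter looks like this. The gradient value and learning rate are made-up illustrative numbers, not values derived from the network in the article:

```python
# A minimal sketch of one gradient descent update for a single
# parameter. The gradient and learning rate are hypothetical
# illustrative values, not computed from the article's network.
def gradient_descent_step(param, gradient, learning_rate=0.1):
    # Step against the gradient to reduce the loss locally.
    return param - learning_rate * gradient

w2 = 2.0
grad = 4.0  # hypothetical gradient of the loss with respect to w2
print(gradient_descent_step(w2, grad))  # 2.0 - 0.1 * 4.0 = 1.6
```

Repeating this step many times is what gradually moves the parameters toward values that minimize the loss.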
Understanding Exploding Gradients
Now we will see how a gradient can explode.
In this example, the gradient will explode when we set w2 to any value greater than 1.
So let us set w2 = 2.
When we use Input1 and unroll the network, we calculate:
Input1 × 2 × 2 × 2 × 2
Since we unrolled it four times, we multiply by 2 four times.
So this becomes:
Input1 × 2⁴
We can express this more generally as:
Input × w2^num_unroll
So the input is amplified 16 times before it reaches the final copy of the network.
Now think in more realistic terms, such as 50 days of stock market data. Then we would have to unroll the network 50 times.
According to the formula, the amplification would be 2⁵⁰, which is a huge number.
This huge number is why we call it the exploding gradient problem.
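To see how quickly the factor w2^num_unroll grows, here is the arithmetic in plain Python (nothing network-specific, just the numbers from the example):

```python
# The amplification factor from the article: the input is multiplied
# by w2 once per unroll, so the factor is w2 ** num_unroll.
w2 = 2
print(w2 ** 4)   # 16: four unrolls amplify the input 16 times
print(w2 ** 50)  # 1125899906842624: 50 unrolls, roughly 1.1e15
```

Going from 4 unrolls to 50 takes the factor from 16 to over a quadrillion, which is why long sequences make the problem so severe.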
That large value will make its way into the gradients, making it difficult to take small steps and find the optimal weights and biases.
When trying to find the parameter values that give the lowest value for the loss function, we usually want to take relatively small steps.
But when the gradient contains a huge number, we take relatively large steps.
Instead of finding the optimal parameters, we end up bouncing around.
Understanding Vanishing Gradients
One way to prevent the exploding gradient problem is to make the value of w2 less than 1.
However, this leads to another issue: the vanishing gradient problem.
Let us set w2 = 0.5.
If we have 50 days of data and unroll the network 50 times, then the input value becomes:
Input × 0.5⁵⁰
This will be a very small number, extremely close to 0.
This is called the vanishing gradient problem.
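The same arithmetic shows the shrinking side, again using only the numbers from the example:

```python
# With w2 = 0.5, each unroll halves the signal; after 50 unrolls
# the factor is vanishingly small.
w2 = 0.5
factor = w2 ** 50
print(factor)  # about 8.88e-16, effectively zero
```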
Now, when optimizing a parameter, instead of taking steps that are too large, we take steps that are too small.
As a result, we may reach the maximum number of allowed steps before finding the optimal value.
To address the vanishing and exploding gradient problems, a solution called Long Short-Term Memory (LSTM) networks was developed, which we will explore in the next article.



