Rijul Rajesh
Understanding LSTMs – Part 4: How LSTM Decides What to Forget

In the previous article, we worked through the first stage of the LSTM and computed its output.

Let us continue.

Earlier, when the input was 1, we obtained the following result:

Now, if we change the input to a relatively large negative number, such as −10, then after calculating the x-axis value, the output of the sigmoid activation function will be close to 0.

The long-term memory will be completely forgotten, because anything multiplied by 0 is 0.

Since the sigmoid activation function converts any input into a value between 0 and 1, its output determines what percentage of the long-term memory is retained.
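We can verify this behavior with a few lines of Python. This is just a sketch of the sigmoid function itself, not the full LSTM cell:

```python
import math

def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1 / (1 + math.exp(-x))

# A modestly positive input retains most of the long-term memory...
print(sigmoid(1))    # ≈ 0.731, i.e. keep about 73%

# ...while a large negative input forgets almost all of it.
print(sigmoid(-10))  # ≈ 0.0000454, i.e. keep almost nothing

long_term_memory = 2.0
print(long_term_memory * sigmoid(-10))  # ≈ 0, memory effectively erased
```

Whatever the input, the sigmoid output stays strictly between 0 and 1, which is exactly why it works as a "percentage retained".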

So, the first stage of the LSTM determines what percentage of the long-term memory is remembered. This part is called the forget gate.
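Putting the pieces together, the forget gate can be sketched as a weighted sum of the short-term memory and the current input, passed through a sigmoid, and then multiplied into the long-term memory. The weights and bias below are made-up illustrative values, not the ones from the article's diagrams:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forget_gate(long_term, short_term, x, w_short=1.0, w_input=1.0, bias=0.0):
    # Weighted sum of short-term memory and current input (weights are illustrative)
    fraction_kept = sigmoid(w_short * short_term + w_input * x + bias)
    # The sigmoid output is the fraction of long-term memory that survives
    return long_term * fraction_kept

print(forget_gate(2.0, 0.0, 1))    # keeps ~73% of the long-term memory
print(forget_gate(2.0, 0.0, -10))  # wipes the long-term memory to ~0
```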

Now that we understand what the first stage does, let us explore the second stage.

In the second stage, the block on the right combines the short-term memory and the input to create a potential long-term memory.

The block on the left then determines what percentage of that potential memory should be added to the long-term memory.

Let us plug in the numbers to see how a potential memory is created and how much of it is added to the long-term memory.
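As a preview of that calculation, the second stage can be sketched the same way: the right block uses tanh to produce a candidate memory, the left block uses sigmoid to decide what fraction of it to add. Again, the weights here are placeholder values for illustration only:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def input_stage(long_term, short_term, x):
    # Right block: candidate long-term memory (tanh output lies in -1..1).
    # Weights of 1.0 and bias of 0.0 are illustrative placeholders.
    candidate = math.tanh(1.0 * short_term + 1.0 * x + 0.0)
    # Left block: what fraction of the candidate gets added (0..1)
    fraction = sigmoid(1.0 * short_term + 1.0 * x + 0.0)
    # Update the long-term memory with the scaled candidate
    return long_term + fraction * candidate

print(input_stage(long_term=0.0, short_term=0.0, x=1))  # ≈ 0.557
```

Notice the division of labor: tanh proposes *what* the new memory could be, while sigmoid decides *how much* of it to actually commit.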

We will continue exploring this in the next article.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
