Vishwajeet Pratap Singh

Understanding LSTMs, GRUs and Attention Blocks

Traditional neural networks fail when it comes to handling sequence problems, because they have no memory. To resolve this issue, the concept of RNNs (Recurrent Neural Networks) was introduced. An RNN is essentially a fully connected layer with a loop: the output of each time step is fed back in along with the next input.

But RNNs are poor at handling long-term dependencies, so in practice they fail on long sentences.
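
Here is a minimal NumPy sketch of that "fully connected layer with a loop". The weight names (W_xh, W_hh, b_h) and the dimensions are purely illustrative, not taken from any library:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a vanilla RNN over a sequence of input vectors and return all hidden states."""
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)          # initial hidden state -- the network's "memory"
    hidden_states = []
    for x_t in inputs:                # the loop over time steps
        # The same fully connected layer is reused at every step,
        # fed with the current input and the previous hidden state.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Toy usage with random weights (illustrative dimensions only).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
sequence = [rng.normal(size=input_dim) for _ in range(5)]
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(len(states), states[-1].shape)  # 5 (3,)
```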

LSTMs

This issue was addressed by LSTM (Long Short-Term Memory) networks. LSTMs are a special kind of RNN that can remember long-term dependencies.

Their overall flow is similar to that of RNNs; the difference lies inside the LSTM cell.

An LSTM cell has certain gates that help in maintaining the memory of the network (a code sketch follows this list).

  1. Forget gate - This layer decides which parts of the previous cell state to discard, so that only information relevant to the long-term dependency is kept.

  2. Input gate - This decides what new information from the current input should be added to the cell state.

  3. Cell state update - This combines the previous two steps: the old cell state, scaled by the forget gate, plus the new candidate information, scaled by the input gate.

  4. Output gate - This decides which part of the updated cell state is exposed as the new hidden state vector.
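
Putting the four steps above together, a single LSTM cell can be sketched in plain NumPy as below. The gate weights are stored in a dict keyed by gate name; this layout and the names are my own simplification for illustration, not a framework API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts keyed by gate: 'f', 'i', 'c', 'o' (illustrative layout).

    Each W[k] has shape (hidden_dim, hidden_dim + input_dim) and acts on the
    concatenation of the previous hidden state and the current input.
    """
    z = np.concatenate([h_prev, x_t])

    f = sigmoid(W['f'] @ z + b['f'])        # forget gate: what to drop from the old cell state
    i = sigmoid(W['i'] @ z + b['i'])        # input gate: how much new information to let in
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # candidate values to add to the cell state
    c = f * c_prev + i * c_tilde            # cell state update: keep some old, add some new
    o = sigmoid(W['o'] @ z + b['o'])        # output gate: what part of the cell state to expose
    h = o * np.tanh(c)                      # new hidden state
    return h, c

# Toy usage with random weights.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W = {k: rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for k in 'fico'}
b = {k: np.zeros(hidden_dim) for k in 'fico'}
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in [rng.normal(size=input_dim) for _ in range(5)]:
    h, c = lstm_cell(x_t, h, c, W, b)
```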

GRU

GRU (Gated Recurrent Unit) is a variant of the LSTM and has the following modifications (a code sketch follows this list).

  • It combines the forget and input gates into a single "update gate".
  • It also merges the cell state and the hidden state into a single state vector.
  • It gates the memory twice: once with the reset gate (which decides how much of the old state to mix with the new input when forming a candidate state) and once with the update gate (which blends the old state and the candidate into the final output).
  • The old hidden state (together with the input) is therefore used both for computing its own update and for deciding how much of it feeds into the candidate.
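
Under the same assumptions as before (plain NumPy, illustrative weight names, no framework API), a GRU cell looks like this; note the single state vector and the two gates:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_cell(x_t, h_prev, W, b):
    """One GRU step. W and b are dicts keyed by 'z' (update), 'r' (reset), 'h' (candidate).

    Each W[k] has shape (hidden_dim, hidden_dim + input_dim); the layout is illustrative.
    """
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W['z'] @ hx + b['z'])   # update gate: merged forget + input gate
    r = sigmoid(W['r'] @ hx + b['r'])   # reset gate: how much of the old state feeds the candidate
    h_tilde = np.tanh(W['h'] @ np.concatenate([r * h_prev, x_t]) + b['h'])  # candidate state
    h = (1 - z) * h_prev + z * h_tilde  # final output: blend old state and candidate
    return h
```

Usage mirrors the LSTM cell above, except there is no separate cell state to carry along.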

Attention in Encoder-Decoder Architecture

Sequence-to-sequence models consist of an encoder-decoder architecture. The encoder processes each item in the input sequence, compiles what it has seen into a single vector (also called the context vector) and passes it to the decoder, which then produces the output sequence. The issue with these models is that this single context vector becomes a bottleneck.

Attention models handle this issue: the encoder passes all of its hidden states to the decoder instead of only the last one.
The decoder then scores each encoder hidden state, turns the scores into weights with a softmax, and multiplies every hidden state by its weight, so hidden states with high scores are amplified and those with low scores are diminished.
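
Here is a minimal sketch of that weighting step, assuming simple dot-product scores between the decoder state and each encoder hidden state (other scoring functions are possible); the names are illustrative:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Weight every encoder hidden state by a softmax score and sum them up.

    decoder_state:  shape (hidden_dim,)
    encoder_states: shape (seq_len, hidden_dim) -- all encoder hidden states, not just the last.
    """
    # Dot-product scores: how relevant is each encoder state to the current decoder state?
    # (Assumed scoring function for this sketch; other scoring functions exist.)
    scores = encoder_states @ decoder_state   # (seq_len,)
    weights = softmax(scores)                 # high scores amplified, low scores diminished
    context = weights @ encoder_states        # weighted sum -> context vector for this decoding step
    return context, weights

# Toy usage.
rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))   # 6 encoder time steps, hidden size 8
dec = rng.normal(size=8)        # current decoder hidden state
context, weights = attention_context(dec, enc)
print(weights.round(2), context.shape)   # weights sum to 1, context has shape (8,)
```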
