<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vishwajeet Pratap Singh</title>
    <description>The latest articles on DEV Community by Vishwajeet Pratap Singh (@vpsingh22).</description>
    <link>https://dev.to/vpsingh22</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F524962%2F5620d07c-a9d3-4dd0-86c3-2e5735f35b07.jpeg</url>
      <title>DEV Community: Vishwajeet Pratap Singh</title>
      <link>https://dev.to/vpsingh22</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vpsingh22"/>
    <language>en</language>
    <item>
      <title>Detailed Explanation of 'Attention Is All You Need'</title>
      <dc:creator>Vishwajeet Pratap Singh</dc:creator>
      <pubDate>Thu, 04 Feb 2021 18:45:50 +0000</pubDate>
      <link>https://dev.to/vpsingh22/detailed-explanation-to-attention-is-all-you-need-1ff4</link>
      <guid>https://dev.to/vpsingh22/detailed-explanation-to-attention-is-all-you-need-1ff4</guid>
      <description>&lt;p&gt;This blog is an attempted explanation to paper &lt;b&gt;Attention is all you need&lt;/b&gt;. You can find the paper &lt;a href="https://arxiv.org/pdf/1706.03762.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find the source code &lt;a href="https://github.com/vpsingh22/END/blob/main/12%20-%20Attention%20is%20all%20you%20need/Attention_is_all_you_need.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;RNNs and CNNs have long dominated sequence-to-sequence tasks, but using attention has produced much better results.&lt;/p&gt;

&lt;p&gt;Here we will focus on self-attention, where we essentially find the correlation between the embeddings of a sequence.&lt;br&gt;
But before moving to the attention part, let's take a look at word embeddings.&lt;/p&gt;

&lt;p&gt;A &lt;b&gt;word embedding&lt;/b&gt; is a vector representation of a particular word, so even a word with multiple meanings has a single embedding.&lt;br&gt;
Take the word 'apple': it could be a fruit or the name of a company.&lt;br&gt;
The embedding is the same for both meanings. Let's assume the embedding for 'apple' is the 50-dimensional vector below:&lt;br&gt;
[0.2, -0.7, 0.6 ... 0.1, 0.3]&lt;br&gt;
What could this vector possibly represent?&lt;/p&gt;

&lt;p&gt;The values in the vector represent concepts: the value at index 1 may represent the concept of a fruit, the value at index 35 may represent something related to electronics, or a combination of multiple indices may represent such concepts.&lt;/p&gt;
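
&lt;p&gt;As a minimal PyTorch sketch (the vocabulary and indices here are made up for illustration), an embedding table returns the same vector for 'apple' regardless of the intended meaning:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

# A toy vocabulary; the indices are made up for illustration.
vocab = {'apple': 0, 'pie': 1, 'iphone': 2}
embedding = nn.Embedding(num_embeddings=3, embedding_dim=50)

# The same 50-dim vector comes back for 'apple' in 'apple pie'
# and 'apple iphone' alike; context is not part of the lookup.
apple_vec = embedding(torch.tensor(vocab['apple']))
print(apple_vec.shape)  # torch.Size([50])
&lt;/code&gt;&lt;/pre&gt;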

&lt;p&gt;When RNNs and CNNs were used, the networks couldn't decide which concepts to focus on. This is where self-attention makes the difference: it is able to find the correlation between concepts.&lt;/p&gt;

&lt;p&gt;For example, if the network encounters the sequence 'apple iphone', it is able to identify the context of an electronic device, and if it encounters something like 'apple pie', it catches the context of something edible.&lt;/p&gt;

&lt;p&gt;Let's look at the architecture of the network. In the paper it is trained for machine translation.&lt;br&gt;
The network is an encoder-decoder architecture built from attention layers, or blocks.&lt;br&gt;
We will look into each block one by one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwzsvp3degro6qb89wmiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwzsvp3degro6qb89wmiw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The left and right blocks represent the encoder and decoder respectively.&lt;/p&gt;

&lt;p&gt;Let's study the encoder first.&lt;/p&gt;

&lt;p&gt;&lt;b&gt; Encoder &lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flxu3wcyl5eai5bzrh1ap.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flxu3wcyl5eai5bzrh1ap.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input to the encoder is the source sentence. The source is converted to an embedding.&lt;/li&gt;
&lt;li&gt;The embedding is summed element-wise with a positional encoding. Let's call this the input embedding.&lt;/li&gt;
&lt;li&gt;The input embedding is passed to three different fully connected layers, and the three outputs are passed to the multi-head attention block. (We will learn attention and then multi-head attention in detail.)&lt;/li&gt;
&lt;li&gt;These outputs are named query, key and value. Using three different fully connected layers proved to give better results than using a single fully connected layer and repeating its output thrice.&lt;/li&gt;
&lt;li&gt;We add the input embedding to the output of the multi-head attention layer as a skip connection and normalize the sum.&lt;/li&gt;
&lt;li&gt;We pass this output to a feed-forward layer, which is simply a stack of fully connected layers. We again add a skip connection and normalize the output before calling it the encoder output (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
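
&lt;p&gt;Below is a rough PyTorch sketch of one encoder block, assuming the paper's sizes (512-dim model, 8 heads, 2048-dim feed-forward). Note that &lt;code&gt;nn.MultiheadAttention&lt;/code&gt; applies the three query/key/value projections internally:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder block: multi-head self-attention, add and normalize,
    feed-forward, add and normalize. Sizes follow the paper."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        # The query/key/value fully connected layers live inside this module.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Skip connection around attention, then normalize the sum.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Skip connection around the feed-forward layer, then normalize.
        return self.norm2(x + self.ff(x))

out = EncoderLayer()(torch.randn(1, 4, 512))  # (batch, seq_len, d_model)
&lt;/code&gt;&lt;/pre&gt;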

&lt;p&gt;That's essentially the work of the encoder. Now let's look at the attention mechanism.&lt;/p&gt;

&lt;p&gt;&lt;b&gt; Self-Attention &lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9hya2q7cykppioqedhnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9hya2q7cykppioqedhnk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take an example input sequence: 'walk by river bank'. As discussed earlier, the embeddings contain concepts for various meanings. Here the embedding of 'bank' should relate to a river bank. Let's see how self-attention identifies the context.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input to the attention block is the query, key and value, as discussed.&lt;/li&gt;
&lt;li&gt;&lt;b&gt;Note: the input embedding is passed to three fully connected layers first. That is not shown in the diagram above, for simplicity.&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;The embeddings in the diagram are N-dimensional vectors.&lt;/li&gt;
&lt;li&gt;The first thing attention does is find the correlation between embeddings. If the embeddings relate to similar concepts, the correlation value is higher; otherwise it is lower.&lt;/li&gt;
&lt;li&gt;We take the query and key, which are essentially the same, and find the correlation between every pair of words.&lt;/li&gt;
&lt;li&gt;This gives us a square matrix of correlation values: the result of multiplying the query matrix with the transposed key matrix.&lt;/li&gt;
&lt;li&gt;Here we will see that words which are conceptually related have higher values. For example, the words 'river' and 'bank' both relate to the concept of a water body.
The dimensions that represent the context of a water body multiply together and contribute to a higher correlation value.&lt;/li&gt;
&lt;li&gt;Then we take a softmax over the correlation scores so that the network focuses more on the required concepts. It makes the bright blocks brighter and the dark blocks darker.&lt;/li&gt;
&lt;li&gt;Every time we scale or normalize, we do so to keep the values in a range where the gradients stay healthy.&lt;/li&gt;
&lt;li&gt;The value is then multiplied with the correlation matrix to output the contextualized embeddings.&lt;/li&gt;
&lt;li&gt;These contextualized embeddings are different from the input: each embedding no longer represents just its own concepts but also attends to similar words that share those concepts (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
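
&lt;p&gt;Here is a minimal sketch of the steps above in PyTorch, reusing the 50-dimensional embeddings from earlier. The scaling by sqrt(d) is the paper's; the three fully connected projections are omitted, as in the diagram:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import math
import torch

def self_attention(query, key, value):
    """Scaled dot-product attention; query, key, value: (seq_len, d)."""
    d = query.size(-1)
    # Correlation between every pair of words: a (seq_len, seq_len) matrix.
    scores = query @ key.transpose(0, 1) / math.sqrt(d)
    # Softmax sharpens the scores: bright blocks brighter, dark ones darker.
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the values gives the contextualized embeddings.
    return weights @ value

x = torch.randn(4, 50)         # 'walk by river bank', 50-dim embeddings
out = self_attention(x, x, x)  # same shape as x, but context-aware
&lt;/code&gt;&lt;/pre&gt;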

&lt;p&gt;This is called self-attention.&lt;/p&gt;

&lt;p&gt;Now let's take a look at the multi-head attention layer. This should be more intuitive now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F37e33rg5934fzzkmvp5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F37e33rg5934fzzkmvp5i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It uses the same mechanism as above, except that the input embedding is divided into multiple equal-sized blocks called heads.&lt;/li&gt;
&lt;li&gt;This is done so that the network focuses more on the concepts that are required. Heads that focus on concepts which do not contribute are suppressed by the network (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
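
&lt;p&gt;A minimal sketch of the head split, assuming a 512-dim embedding divided into 8 heads of 64 dims each:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

seq_len, d_model, n_heads = 4, 512, 8
d_head = d_model // n_heads  # 64 dims per head

x = torch.randn(seq_len, d_model)
# Each head sees its own 64-dim slice and runs attention independently;
# heads whose concepts do not contribute get down-weighted in training.
heads = x.view(seq_len, n_heads, d_head).transpose(0, 1)
print(heads.shape)  # torch.Size([8, 4, 64]): (heads, seq_len, d_head)
&lt;/code&gt;&lt;/pre&gt;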

&lt;p&gt;Now you may reread the encoder part for a better understanding of the entire flow.&lt;/p&gt;

&lt;p&gt;&lt;b&gt; Decoder &lt;/b&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0vu24rfs7n9890hl9vsi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0vu24rfs7n9890hl9vsi.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The decoder is almost identical to the encoder, with the major difference that it has one more multi-head attention layer, which takes input from the encoder.&lt;/li&gt;
&lt;li&gt;The input to the decoder is the target sentence.&lt;/li&gt;
&lt;li&gt;The steps are the same as what we did earlier in the encoder.&lt;/li&gt;
&lt;li&gt;In the first multi-head attention layer we only consider the lower triangle of the correlation matrix, because we don't want the network to see the concepts of the next word of the sequence; that is what we want the network to predict. Think again on this part for a better understanding (see the masking sketch after this list).&lt;/li&gt;
&lt;li&gt;In the next multi-head attention layer, the encoder output provides the key and value, and the target representation is used as the query.&lt;/li&gt;
&lt;li&gt;This output is passed to fully connected layers, as we did in the encoder. This is basically to increase the capacity of the network.&lt;/li&gt;
&lt;li&gt;Finally, predictions are made by passing the output through the softmax layer.&lt;/li&gt;
&lt;/ul&gt;
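
&lt;p&gt;A small sketch of that lower-triangle masking in PyTorch (the sequence length is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)  # query-key correlation matrix

# Keep only the lower triangle: position i may attend to positions 0..i,
# so the network never sees the word it is supposed to predict.
mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
scores = scores.masked_fill(~mask, float('-inf'))
weights = torch.softmax(scores, dim=-1)  # upper triangle becomes 0
&lt;/code&gt;&lt;/pre&gt;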

</description>
      <category>transformer</category>
      <category>attention</category>
      <category>nlp</category>
      <category>machinetranslation</category>
    </item>
    <item>
      <title>Convolutions in Sequence to Sequence networks</title>
      <dc:creator>Vishwajeet Pratap Singh</dc:creator>
      <pubDate>Thu, 28 Jan 2021 21:27:46 +0000</pubDate>
      <link>https://dev.to/vpsingh22/convolutions-in-sequence-to-sequence-networks-o7j</link>
      <guid>https://dev.to/vpsingh22/convolutions-in-sequence-to-sequence-networks-o7j</guid>
      <description>&lt;p&gt;Sequence to Sequence networks often use RNNs and its derivatives along with attention to solve problems in machine translation, question-answer models, text sumarization etc.&lt;/p&gt;

&lt;p&gt;Solving the same problems using convolutions has produced better results. Here we will take a deep dive into how to design a convolutional sequence-to-sequence network.&lt;/p&gt;

&lt;p&gt;You can find the source code &lt;a href="https://github.com/vpsingh22/END/blob/main/11%20-%20Convolutional%20SeqtoSeq/Convolutional_Seqtoseq.ipynb" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's get started.&lt;/p&gt;

&lt;p&gt;The network uses an encoder-decoder architecture along with attention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzgzp5xqmiyj7qt8ykk4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzgzp5xqmiyj7qt8ykk4g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The components of the model are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encoder&lt;/li&gt;
&lt;li&gt;Decoder&lt;/li&gt;
&lt;li&gt;Attention Block&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's see them one by one.&lt;/p&gt;

&lt;p&gt;But before moving into these, let's see how convolution works here.&lt;/p&gt;

&lt;p&gt;In RNNs we would take a hidden state or a context vector from the previous time step, but that only makes the model memorize the next word for a given sequence; it captures very little contextual information or contextual relationship between the words.&lt;/p&gt;

&lt;p&gt;CNNs, on the other hand, try to capture the spatial information hidden within the sentences.&lt;br&gt;
The model is trained in such a way that each dimension of the embedding stores a certain concept.&lt;br&gt;
For example, let's say we have a 256-dimensional embedding and we are given two sentences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I am using apple.&lt;/li&gt;
&lt;li&gt;I am eating apple.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here we see that the first sentence has the context of a device and the second has the context of an edible item.&lt;br&gt;
When convolutions run over the embeddings of 'using apple' and 'eating apple', they try to capture the respective contexts (with the help of attention).&lt;br&gt;
Let's assume the 50th dimension of the embedding carries the context of a device and the 100th dimension that of an edible item. The attention only weights them up when the context matches, and therefore the model learns better.&lt;/p&gt;

&lt;p&gt;Now let's move to the encoder.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Encoder&lt;/b&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F67ttu1tdwnlxdjhpx66b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F67ttu1tdwnlxdjhpx66b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we pad the sentence at both ends.&lt;/li&gt;
&lt;li&gt;The encoder takes the token embedding along with the positional embedding and performs their element-wise sum.&lt;/li&gt;
&lt;li&gt;This is fed to a fully connected layer to convert it to the desired size. The output of the FC layer is fed to the convolutional blocks.&lt;/li&gt;
&lt;li&gt;The output of the convolutional blocks is passed into another FC layer; its output is called the conved output.&lt;/li&gt;
&lt;li&gt;Here we add a skip connection with the element-wise sum (positional and token embedding) to give a final output, called the combined output.&lt;/li&gt;
&lt;li&gt;We send two outputs to the decoder: the conved output and the combined output (conved + embedded), as in the sketch after this list.&lt;/li&gt;
&lt;/ul&gt;
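
&lt;p&gt;A rough PyTorch sketch of the two encoder outputs (the sizes are illustrative, and &lt;code&gt;nn.Identity&lt;/code&gt; stands in for the convolutional blocks described next):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

emb_dim, hid_dim, seq_len = 256, 512, 10
tok_plus_pos = torch.randn(1, seq_len, emb_dim)  # token + positional sum

fc_in = nn.Linear(emb_dim, hid_dim)   # embedding size to hidden size
conv_blocks = nn.Identity()           # placeholder for the conv blocks
fc_out = nn.Linear(hid_dim, emb_dim)  # hidden size back to embedding size

conved = fc_out(conv_blocks(fc_in(tok_plus_pos)))
combined = conved + tok_plus_pos      # skip connection with the embeddings
# Both conved and combined are sent to the decoder.
&lt;/code&gt;&lt;/pre&gt;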

&lt;p&gt;&lt;b&gt;Encoder Convolution Blocks&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5w9bsyspnqkrb3tx3p9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5w9bsyspnqkrb3tx3p9t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input to the convolution blocks is the embedded sequence, one embedding per token of the sentence.&lt;/li&gt;
&lt;li&gt;We use an odd-sized kernel and pad the inputs at each convolution.&lt;/li&gt;
&lt;li&gt;We use the GLU activation function (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
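
&lt;p&gt;A minimal sketch of one such block (sizes illustrative). The convolution doubles the channels because GLU halves them again: one half is the output, the other half gates it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, seq_len = 512, 10
x = torch.randn(1, hid_dim, seq_len)  # Conv1d wants (batch, channels, seq)

# Odd-sized kernel with padding keeps the sequence length unchanged.
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size=3, padding=1)
out = F.glu(conv(x), dim=1)  # back to (1, hid_dim, seq_len)
&lt;/code&gt;&lt;/pre&gt;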

&lt;p&gt;&lt;b&gt;Decoder&lt;/b&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftivkjyv3xg8u9x3sxkb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftivkjyv3xg8u9x3sxkb9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The text input to the decoder is handled similarly to the encoder's: we take the element-wise sum of the token and positional embeddings.&lt;/li&gt;
&lt;li&gt;This sum is passed to an FC layer and further passed to the convolutional blocks.&lt;/li&gt;
&lt;li&gt;This time the convolutional blocks accept the element-wise sum as a skip connection, and also the conved output and combined output from the encoder.&lt;/li&gt;
&lt;li&gt;The output of the convolutional blocks is again passed to an FC layer and then sent out to make the predictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Decoder Convolutional Blocks&lt;/b&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqbqqjzknd9rnjogenunr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqbqqjzknd9rnjogenunr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The difference in the decoder convolutional block is that the input is padded twice at the front.&lt;br&gt;
The logic behind this is straightforward: we are using a kernel of size 3, and padding the sentence only at the front prevents the decoder from catching the context of the next prediction. Without this padding, the decoder would fail to learn.&lt;/p&gt;
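
&lt;p&gt;A minimal sketch of that front-only padding (sizes illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

kernel_size = 3
x = torch.randn(1, 512, 10)  # (batch, channels, target sequence length)

# Pad kernel_size - 1 positions at the front only, so a size-3 kernel at
# position i covers tokens i-2, i-1, i and never the token to be predicted.
x_padded = F.pad(x, (kernel_size - 1, 0))  # (left pad, right pad)
&lt;/code&gt;&lt;/pre&gt;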

&lt;p&gt;&lt;b&gt;Attention&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;The attention module is used to calculate weights over the features so that the model is able to focus on the necessary ones.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The module takes the conved output from the encoder and the combined embedded output from the decoder and performs a weighted sum over them (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
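
&lt;p&gt;A toy sketch of that weighted sum (the shapes are illustrative; the FC projections around the attention step are omitted):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

enc_conved = torch.randn(10, 256)    # encoder conved output
enc_combined = torch.randn(10, 256)  # encoder conved + embedded
dec_combined = torch.randn(7, 256)   # decoder conved + embedded

# Score each decoder position against each encoder position ...
scores = torch.softmax(dec_combined @ enc_conved.transpose(0, 1), dim=-1)
# ... then take the weighted sum over the encoder representations.
attended = scores @ enc_combined     # (7, 256)
&lt;/code&gt;&lt;/pre&gt;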

</description>
      <category>attention</category>
      <category>convolution</category>
      <category>seqtoseq</category>
      <category>machinetranslation</category>
    </item>
    <item>
      <title>Understanding LSTMs, GRUs and Attention Blocks</title>
      <dc:creator>Vishwajeet Pratap Singh</dc:creator>
      <pubDate>Fri, 04 Dec 2020 00:24:27 +0000</pubDate>
      <link>https://dev.to/vpsingh22/understanding-lstms-grus-and-attention-blocks-308c</link>
      <guid>https://dev.to/vpsingh22/understanding-lstms-grus-and-attention-blocks-308c</guid>
      <description>&lt;p&gt;Traditional neural networks failed when it comes to handling sequence problems, because they didn't have memory. To resolve this issue the concept of RNNs (Recurrent Neural Networks were introduced). RNNs are simply a fully connected layer with a loop.&lt;/p&gt;

&lt;p&gt;But RNNs are so poor at handling long-term dependencies that they practically fail for long sentences.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;LSTMs&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;This issue was addressed by LSTM (Long Short-Term Memory) networks. LSTMs are special RNNs that can remember long-term dependencies.&lt;/p&gt;

&lt;p&gt;Their flow is similar to that of RNNs, but the difference lies in the LSTM cells.&lt;/p&gt;

&lt;p&gt;An LSTM cell has certain gates that help maintain the network's memory:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Forget gate - This layer decides which information to discard from the cell state, keeping only what is required to maintain the long-term dependency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Input gate - This helps decide what new information is to be added to the context vector.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update step - Combines the old, partially forgotten cell state with the new candidate information to form the updated cell state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Output gate - The output gate produces the new hidden state vector (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
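
&lt;p&gt;A didactic sketch of one LSTM step in PyTorch (the weight layout is an assumption made for brevity: W, U and b stack the four gates along their first dimension):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; x is the input, h the hidden state, c the cell state."""
    gates = W @ x + U @ h + b
    i, f, g, o = gates.chunk(4)
    f = torch.sigmoid(f)           # forget gate: what to drop from memory
    i = torch.sigmoid(i)           # input gate: what new info to admit
    c = f * c + i * torch.tanh(g)  # update: blend old memory with new
    h = torch.sigmoid(o) * torch.tanh(c)  # output gate: new hidden state
    return h, c
&lt;/code&gt;&lt;/pre&gt;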

&lt;p&gt;&lt;b&gt;GRU&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;The GRU (Gated Recurrent Unit) is a variant of the LSTM with the following modifications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It combines the forget and input gates into a single "update gate".&lt;/li&gt;
&lt;li&gt;It also merges the cell state and hidden state.&lt;/li&gt;
&lt;li&gt;It gates the memory twice: first via the reset gate (using the old state and new input) and then via the update gate (to produce the final output).&lt;/li&gt;
&lt;li&gt;The old hidden state (together with the input) is used both for computing its own update and for deciding how much of that update to keep (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
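
&lt;p&gt;And the corresponding GRU step, again a sketch with the three gates' weights assumed stacked along the first dimension:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def gru_step(x, h, W, U, b):
    """One GRU step; note there is only a single merged state h."""
    Wx, Uh = W @ x, U @ h
    z = torch.sigmoid(Wx[0] + Uh[0] + b[0])   # update gate (forget + input)
    r = torch.sigmoid(Wx[1] + Uh[1] + b[1])   # reset gate: how much old state
    n = torch.tanh(Wx[2] + r * Uh[2] + b[2])  # candidate from reset old state
    return (1 - z) * n + z * h                # blend old state and candidate
&lt;/code&gt;&lt;/pre&gt;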

&lt;p&gt;&lt;b&gt;Attention in Encoder-Decoder Architecture&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Sequence-to-sequence models consist of an encoder-decoder architecture. The encoder processes each item, compiles it into a vector (also called the context) and passes it to the decoder. The decoder then produces the output sequence. The issue with these models is that the context vector is a bottleneck.&lt;/p&gt;

&lt;p&gt;Attention models handle this issue. The encoder passes all the hidden states to the decoder instead of passing only the last hidden state.&lt;br&gt;
The decoder also multiplies each hidden state by a softmax score used as a weight, so that hidden states with high scores are amplified and those with low scores are diminished.&lt;/p&gt;
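
&lt;p&gt;A toy sketch of that weighting (dot-product scoring is one common choice; the sizes are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

enc_states = torch.randn(10, 256)  # one hidden state per source token
dec_state = torch.randn(256)       # current decoder hidden state

scores = torch.softmax(enc_states @ dec_state, dim=0)  # one score per state
# High-scoring hidden states are amplified, low-scoring ones diminished.
context = scores @ enc_states      # (256,) attention context vector
&lt;/code&gt;&lt;/pre&gt;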

</description>
    </item>
  </channel>
</rss>
