<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sdev2030</title>
    <description>The latest articles on DEV Community by sdev2030 (@sdev2030).</description>
    <link>https://dev.to/sdev2030</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F529615%2F745fc82d-5b9d-4175-b750-3c8dcc43d8c4.png</url>
      <title>DEV Community: sdev2030</title>
      <link>https://dev.to/sdev2030</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sdev2030"/>
    <language>en</language>
    <item>
      <title>Sequence to Sequence model using Multi-head Attention Transformer Architecture</title>
      <dc:creator>sdev2030</dc:creator>
      <pubDate>Thu, 04 Feb 2021 20:16:10 +0000</pubDate>
      <link>https://dev.to/sdev2030/sequence-to-sequence-model-using-multi-head-attention-transformer-architecture-2jhf</link>
      <guid>https://dev.to/sdev2030/sequence-to-sequence-model-using-multi-head-attention-transformer-architecture-2jhf</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JIKEk8Bv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JIKEk8Bv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this blog we describe the different components of the transformer model used for sequence-to-sequence tasks such as language translation. It was introduced in the 2017 landmark paper &lt;a href="https://arxiv.org/abs/1706.03762"&gt;"Attention is all you need"&lt;/a&gt; and is widely adopted by machine learning practitioners to achieve SOTA results on various tasks.&lt;/p&gt;

&lt;p&gt;The diagram at the beginning of this blog shows the model's encoder-decoder architecture. The left side of the diagram is the encoder and the right side is the decoder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoder:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jl7uoNEa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer-encoder.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jl7uoNEa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer-encoder.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The input data is passed through an embedding layer and the result is sent as the Query, Key and Value tensors into a stack of encoder layers, each containing a multi-head attention layer. In each encoder layer block, the output of the multi-head attention layer is added to the layer's input (a residual connection) and fed into feed-forward layers to produce the encodings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization Function:
&lt;/h3&gt;

&lt;p&gt;Parameters accepted for instantiating the encoder object are input_dim (input dimension), hid_dim (hidden layer dimension), n_layers (number of encoder layers), n_heads (number of attention heads per encoder layer), pf_dim (position-wise feed-forward layer input dimension), dropout (dropout probability), device (device type to be used) and max_length (maximum length of the input sequence).&lt;/p&gt;

&lt;p&gt;Set the device to the device passed into the encoder. Set the scale value to the square root of hid_dim and move it to the device. This will be used to scale the token embeddings so that the variance of the summed embeddings stays in a reasonable range.&lt;br&gt;
Instantiate the token embedding using nn.Embedding with input_dim as the input and hid_dim as the output dimension. Instantiate the position embedding using nn.Embedding with max_length as the input and hid_dim as the output dimension. Both the token and position embeddings have the same size, hid_dim, because they will be added together.&lt;/p&gt;

&lt;p&gt;Instantiate the layers attribute as an nn.ModuleList containing n_layers EncoderLayer blocks. EncoderLayer takes hid_dim, n_heads, pf_dim, dropout and device as input parameters. Instantiate dropout as an nn.Dropout layer with the dropout value passed to the function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forward Function:
&lt;/h3&gt;

&lt;p&gt;Parameters accepted are the source (src) tensor and its mask (src_mask).&lt;/p&gt;

&lt;p&gt;We get batch_size from the src tensor using its shape attribute. Index “0” gives the batch size because the batch-first option is set in the iterator definition. The source length (src_len) is also obtained from the src tensor's shape; index “1” gives the source length.&lt;/p&gt;

&lt;p&gt;Using the source length (src_len), create a position tensor that assigns the numbers 0 to src_len - 1 to the word positions. Repeat this batch_size times to get the final position tensor (pos). Create embeddings for the source (src) and position (pos) tensors, scale the token embedding, and combine the two embeddings by element-wise summation. Apply dropout to the result (embedded) since we are building the embedding from scratch.&lt;/p&gt;

&lt;p&gt;This will result in a tensor of shape [batch size, src len, hid dim]. This will be used as input to EncoderLayer blocks.&lt;/p&gt;
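The embedding steps above can be sketched as follows; the dimension values and dropout rate are illustrative assumptions, and the names follow the parameters described earlier:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the real values come from the encoder's parameters.
input_dim, hid_dim, max_length = 100, 8, 50
tok_embedding = nn.Embedding(input_dim, hid_dim)
pos_embedding = nn.Embedding(max_length, hid_dim)
dropout = nn.Dropout(0.1)
scale = torch.sqrt(torch.FloatTensor([hid_dim]))

src = torch.randint(0, input_dim, (4, 10))  # [batch size, src len]
batch_size, src_len = src.shape[0], src.shape[1]

# positions 0 .. src_len - 1, repeated for every example in the batch
pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1)

# scale the token embedding, add the position embedding, apply dropout
embedded = dropout(tok_embedding(src) * scale + pos_embedding(pos))
print(embedded.shape)  # torch.Size([4, 10, 8])
```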

&lt;h3&gt;
  
  
  Encoder Layer:
&lt;/h3&gt;

&lt;p&gt;Parameters accepted are hid_dim (hidden dimension), n_heads (number of heads for multi-head attention), pf_dim (position-wise feed-forward dimension), dropout (dropout fraction) and device (device to be used).&lt;/p&gt;

&lt;p&gt;During initialization, a normalization layer is defined with nn.LayerNorm using the hid_dim parameter. The attention layer is defined with MultiHeadAttention using the hid_dim, n_heads, dropout and device parameters. The positionwise_feedforward layer is defined with PositionwiseFeedForward using hid_dim, pf_dim and dropout.&lt;/p&gt;

&lt;p&gt;In the forward function of each encoder layer block, apply the multi-head attention layer to the encoder layer input, passing the same tensor as the query, key and value along with the source mask. Add the result from the attention layer to the encoder layer input, then apply the layer norm function to the result to get _src. Apply the position-wise feed-forward layer to _src to get the result. Add the result to _src and apply layer norm to get the final embedding.&lt;/p&gt;
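A minimal sketch of one encoder layer under these assumptions, using PyTorch's built-in nn.MultiheadAttention as a stand-in for the custom multi-head attention layer described in the next section:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Sketch: self-attention -> residual + layer norm ->
    # position-wise feed-forward -> residual + layer norm.
    def __init__(self, hid_dim, n_heads, pf_dim, dropout):
        super().__init__()
        self.attn = nn.MultiheadAttention(hid_dim, n_heads,
                                          dropout=dropout, batch_first=True)
        self.attn_norm = nn.LayerNorm(hid_dim)
        self.ff = nn.Sequential(nn.Linear(hid_dim, pf_dim), nn.ReLU(),
                                nn.Dropout(dropout), nn.Linear(pf_dim, hid_dim))
        self.ff_norm = nn.LayerNorm(hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_key_padding_mask=None):
        # the same tensor is used as query, key and value (self-attention)
        attn_out, _ = self.attn(src, src, src,
                                key_padding_mask=src_key_padding_mask)
        src = self.attn_norm(src + self.dropout(attn_out))
        src = self.ff_norm(src + self.dropout(self.ff(src)))
        return src

layer = EncoderLayer(hid_dim=8, n_heads=2, pf_dim=32, dropout=0.1)
out = layer(torch.randn(4, 10, 8))  # [batch size, src len, hid dim]
print(out.shape)  # torch.Size([4, 10, 8])
```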

&lt;h2&gt;
  
  
  Multi-Head Attention:
&lt;/h2&gt;

&lt;p&gt;Attention is a mechanism that allows a model to focus on the necessary parts of the input sequence as per the demands of the task at hand. &lt;/p&gt;

&lt;p&gt;Researchers at Google like to look at everything as an information retrieval problem. Therefore the &lt;a href="https://arxiv.org/abs/1706.03762"&gt;"Attention is all you need"&lt;/a&gt; paper frames attention in terms of "Queries", "Keys" and "Values". A search engine accepts a "Query" and tries to match it against indices (i.e. keys) in order to retrieve the appropriate values as results for the query. Similarly, one can think of attention as a mechanism in which the query and key vectors work together to produce the right attention weights over the values.&lt;/p&gt;

&lt;p&gt;When multiple channels(or heads) of attention are applied in parallel to a single source, it is known as multi-headed attention. This increases the learning capacity of the model and therefore leads to better results. &lt;/p&gt;

&lt;p&gt;We define a MultiHeadAttentionLayer class that is responsible for applying the multi-head attention mechanism within the transformer. It accepts the query, key, and value tensors as input and preprocesses them with fully connected layers. These are then split into multiple heads along the hid_dim axis to give the per-head queries, keys, and values. The attention energies are generated by taking the dot product of the queries and keys, scaled by the square root of the head dimension. These energies are then passed through a softmax function to produce attention weights, which are applied to the values tensor. This helps the model focus on the necessary parts of the values tensor. The result is reshaped back to its original dimension and returned as the output of the multi-head attention operation.&lt;/p&gt;
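The core computation can be sketched as scaled dot-product attention; the scaling by the square root of the head dimension follows the paper, and the tensor shapes are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    # query, key, value: [batch size, n heads, seq len, head dim]
    energy = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.shape[-1])
    if mask is not None:
        # masked positions get a large negative energy, i.e. ~zero weight
        energy = energy.masked_fill(mask == 0, -1e10)
    attention = torch.softmax(energy, dim=-1)  # weights over the keys
    return torch.matmul(attention, value), attention

q = k = v = torch.randn(2, 4, 10, 16)  # 2 sentences, 4 heads, 10 tokens
out, attention = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 10, 16])
```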

&lt;h2&gt;
  
  
  Decoder:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hgn7D79y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer-decoder.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hgn7D79y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/transformer-decoder.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the decoder, the output data is first shifted right, then passed through the output embedding layer, and the result is sent as Query, Key and Value tensors into a stack of decoder layers containing multi-head attention (MHA) layers. In each decoder layer block, similar to the encoder block, the output of the MHA layer is added to the layer's input and fed into feed-forward layers. The main difference is that the decoder block has two MHA layers: one is the normal MHA layer, the other a masked MHA layer. The target is fed to the masked MHA layer. For the normal MHA layer, we use the decoder's masked MHA layer output as the query and the encoder's output as the key and value.&lt;/p&gt;

&lt;p&gt;The reason for having two different attention layers is to model both self-attention (attention between target words) and encoder attention (attention between target and input words).&lt;/p&gt;

&lt;h3&gt;
  
  
  Initialization Function:
&lt;/h3&gt;

&lt;p&gt;Parameters accepted for instantiating the decoder object are output_dim (output dimension), hid_dim (hidden layer dimension), n_layers (number of decoder layers), n_heads (number of heads for the decoder MHA layers), pf_dim (position-wise feed-forward layer input dimension), dropout (dropout probability), device (device type to be used) and max_length (maximum length of the target sequence).&lt;/p&gt;

&lt;p&gt;Similar to the encoder, we set the device to the device passed into the decoder. Set the scale value to the square root of hid_dim and move it to the device; this will be used to scale the token embeddings.&lt;br&gt;
As noted, the decoder is for the most part the same as the encoder, so we again instantiate token embeddings using nn.Embedding, but with output_dim as the input and hid_dim as the output dimension, and similarly position embeddings using nn.Embedding with max_length (of the target sequence) as the input and hid_dim as the output dimension. As in the encoder, the token and position embeddings in the decoder block are of size hid_dim because they will be added together.&lt;/p&gt;

&lt;p&gt;Instantiate the layers attribute as an nn.ModuleList to create n_layers DecoderLayer blocks. Like EncoderLayer, DecoderLayer takes hid_dim, n_heads, pf_dim, dropout and device as parameters. Instantiate dropout as an nn.Dropout layer with the dropout value passed to the function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forward Function:
&lt;/h3&gt;

&lt;p&gt;Parameters accepted are the target (trg) tensor and its mask (trg_mask), along with the encoded source (enc_src) tensor and its mask (src_mask).&lt;/p&gt;

&lt;p&gt;We get batch_size from the trg tensor using its shape attribute. Index “0” gives the batch size because the batch-first option is set in the iterator definition. The target length (trg_len) is also obtained from the trg tensor's shape; index “1” gives the target length.&lt;/p&gt;

&lt;p&gt;Following what we did in the encoder, we use the target length (trg_len) to create a position tensor that assigns the numbers 0 to trg_len - 1 to the word positions, and repeat it batch_size times to get the final position tensor (pos). Create embeddings for the target (trg) and position (pos) tensors. Scale up the tok_embedding and then combine the two embeddings by element-wise summation. We apply dropout to get our target embeddings of shape [batch size, trg len, hid dim]. This will be used as input to the decoder layer blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoder Layer:
&lt;/h3&gt;

&lt;p&gt;Parameters accepted are hid_dim (hidden dimension), n_heads (number of heads for the MHA layers), pf_dim (position-wise feed-forward dimension), dropout (dropout fraction) and device (device to be used). As previously mentioned, the decoder block is mostly the same as the encoder block except for the two attention layers, so let's see what changes are required for those layers. The first MHA layer is a masked variant of the normal MHA layer used for self-attention, similar to the encoder but over the target sequence. For this, the target embedding is used as the query, key and value.&lt;/p&gt;

&lt;p&gt;After this we apply dropout, and the target embedding is added back via a residual connection followed by layer normalization. This layer uses the trg_mask to prevent the decoder from cheating: it stops the decoder from paying attention to words that come after the current word.&lt;br&gt;
Another thing to note is how we feed enc_src into the decoder. We feed it to the second MHA layer, in which the queries are the output of the masked MHA layer and the keys and values are the encoder output. The src_mask is used to prevent this MHA layer from attending to the padding tokens in the source. We again apply dropout, and the output of the masked MHA layer is added back via a residual connection followed by layer normalization.&lt;/p&gt;

&lt;p&gt;This is then passed through the position-wise feed-forward layer, followed by dropout; we then add the output of the MHA layer via a residual connection and apply layer normalization.&lt;/p&gt;
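As a sketch of how such a target mask can be built (the pad index and the [batch, heads, query, key] broadcasting layout are assumptions, not the author's exact code):

```python
import torch

def make_trg_mask(trg, pad_idx):
    # pad mask: True where the target token is not <pad>
    pad_mask = (trg != pad_idx).unsqueeze(1).unsqueeze(2)  # [batch, 1, 1, trg len]
    trg_len = trg.shape[1]
    # subsequent mask: lower triangle lets position i attend only to
    # positions 0..i, never to later (future) tokens
    sub_mask = torch.tril(torch.ones((trg_len, trg_len))).bool()
    return pad_mask & sub_mask  # broadcasts to [batch, 1, trg len, trg len]

trg = torch.tensor([[5, 7, 9, 0]])  # last token is <pad>, assuming pad_idx = 0
mask = make_trg_mask(trg, pad_idx=0)
print(mask.shape)  # torch.Size([1, 1, 4, 4])
```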

</description>
    </item>
    <item>
      <title>Sequence to Sequence using Convolution</title>
      <dc:creator>sdev2030</dc:creator>
      <pubDate>Thu, 28 Jan 2021 18:46:34 +0000</pubDate>
      <link>https://dev.to/sdev2030/sequence-to-sequence-using-convolution-i1f</link>
      <guid>https://dev.to/sdev2030/sequence-to-sequence-using-convolution-i1f</guid>
      <description>&lt;p&gt;We will walk through the encoder-decoder architecture for sequence to sequence model using convolutional layers and attention mechanism. This blog has three main sections explaining Encoder, Attention mechanism and decoder functions.&lt;/p&gt;

&lt;h1&gt;
  
  
  Encoder:
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lHB9KuYp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/bentrevett/pytorch-seq2seq/raw/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lHB9KuYp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://github.com/bentrevett/pytorch-seq2seq/raw/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Encoder Initialization :
&lt;/h3&gt;

&lt;p&gt;Parameters accepted for instantiating the encoder object are input_dim (input dimension), emb_dim (embedding dimension), hid_dim (hidden layer dimension), n_layers (number of convolution layers), kernel_size (kernel size used for the convolutions), dropout (dropout probability), device (device type to be used) and max_length (maximum length of the input sequence).&lt;/p&gt;

&lt;p&gt;First, we check that the kernel size is an odd number: divide the kernel size by 2 and make sure the remainder equals 1. For the encoder the kernel size must be odd so that padding equally on both sides preserves the sequence length, whereas for the decoder it can be odd or even.&lt;/p&gt;

&lt;p&gt;Set the device to the device passed into the encoder. Set the scale value to the square root of 0.5 and move it to the device. This will be used in the convolution layers' residual connections to keep the variance of the outputs in check.&lt;/p&gt;

&lt;p&gt;Instantiate token embedding function using nn.Embedding with input_dim as input and emb_dim as output dimension. Instantiate position embedding function using nn.Embedding with max_length as input and emb_dim as output dimension. Both the token and position embedding will have the same size emb_dim as they will be added together.&lt;/p&gt;

&lt;p&gt;Instantiate the embedding to hidden as nn.Linear layer with emb_dim as input size and hid_dim as output size. Instantiate the hidden to embedding as nn.Linear layer with hid_dim as input size and emb_dim as output size.&lt;/p&gt;

&lt;p&gt;Instantiate the convolution block with nn.Conv1d layers to match the input parameter n_layers. Instantiate dropout as an nn.Dropout layer with the dropout value passed to the function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Encoder Forward :
&lt;/h3&gt;

&lt;p&gt;Parameter accepted is the source(src) tensor.&lt;/p&gt;

&lt;p&gt;We get batch_size from the src tensor using its shape attribute. Index “0” gives the batch size because the batch-first option is set in the iterator definition. The source length (src_len) is also obtained from the src tensor's shape; index “1” gives the source length.&lt;br&gt;
Using the source length (src_len), create a position tensor that assigns the numbers 0 to src_len - 1 to the word positions. Repeat this batch_size times to get the final position tensor (pos). Create embeddings for the source (src) and position (pos) tensors. Combine the two embeddings by element-wise summation. Apply dropout to the result (embedded) since we are building the embedding from scratch.&lt;/p&gt;

&lt;p&gt;Pass the embedding through a linear layer to convert from the embedding dimension to the hidden dimension. This results in a tensor of shape [batch size, src len, hid dim]. Swap the 1st and 2nd dimensions to get a final tensor of shape [batch size, hid dim, src len]; nn.Conv1d expects the channel dimension before the sequence dimension. This will be used as input to the convolution blocks.&lt;/p&gt;
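A short sketch of this reshaping step (the sizes are illustrative):

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 16, 32
emb2hid = nn.Linear(emb_dim, hid_dim)

embedded = torch.randn(4, 10, emb_dim)    # [batch size, src len, emb dim]
conv_input = emb2hid(embedded)            # [batch size, src len, hid dim]
# nn.Conv1d expects the channel dimension before the sequence dimension
conv_input = conv_input.permute(0, 2, 1)  # [batch size, hid dim, src len]
print(conv_input.shape)  # torch.Size([4, 32, 10])
```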

&lt;h3&gt;
  
  
  Encoder Convolution block:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HPLz8Ya---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HPLz8Ya---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each convolution block, apply dropout to the convolution input, pass the result through the convolution layer, and apply the GLU activation function. Add the result to the convolution input (a residual connection) and scale the sum with the scaling factor defined in the &lt;code&gt;__init__&lt;/code&gt; function. Assign the result to the convolution input for the next iteration.&lt;/p&gt;

&lt;p&gt;After all the convolution layers, swap the src len and hid dim dimensions back and apply a linear transformation to convert the resulting hid dim tensor to an emb dim tensor. The resulting tensor (conved) will have shape [batch size, src len, emb dim].&lt;br&gt;
Element-wise sum the output (conved) and the input (embedded) to produce the combined tensor used for attention.&lt;/p&gt;
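The loop over the convolution blocks can be sketched as follows; each conv outputs 2 * hid_dim channels because the GLU activation halves the channel dimension (the sizes and dropout rate are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, kernel_size, n_layers = 32, 3, 2
scale = torch.sqrt(torch.FloatTensor([0.5]))
dropout = nn.Dropout(0.1)
convs = nn.ModuleList([nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                                 padding=(kernel_size - 1) // 2)
                       for _ in range(n_layers)])

conv_input = torch.randn(4, hid_dim, 10)  # [batch size, hid dim, src len]
for conv in convs:
    conved = conv(dropout(conv_input))    # [batch size, 2 * hid dim, src len]
    conved = F.glu(conved, dim=1)         # back to hid_dim channels
    # residual connection, scaled by sqrt(0.5) to keep the variance stable
    conv_input = (conved + conv_input) * scale
print(conv_input.shape)  # torch.Size([4, 32, 10])
```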

&lt;h1&gt;
  
  
  Attention Mechanism :
&lt;/h1&gt;

&lt;p&gt;The attention layer is applied at the end of each convolution block in the decoder.&lt;br&gt;
The attention mechanism accepts the following inputs:&lt;br&gt;
1) the encoder's conved and combined outputs.&lt;br&gt;
2) the decoder's input embedding and the GLU activations from the current conv block.&lt;br&gt;
The decoder's GLU activations are transformed into an embedding of size equal to that of the decoder input embedding. The GLU activation embedding and decoder input embedding are then combined and scaled down to prevent an explosion of values within the decoder network.&lt;br&gt;
This combined decoder embedding is then multiplied with the encoder's conved output to form the energy values of the attention mechanism. These energy values are passed into the softmax function to give the attention weights, which specify how important each source token is with respect to a given decoder prediction.&lt;br&gt;
The attention weights are multiplied with the combined encoder output to give attention encodings, which are then transformed into embeddings of size equal to that of the GLU activations vector from the current conv block. The combination of this attention encoding vector and the GLU activations vector is the final output of the attention operation applied at the end of each conv block within the decoder.&lt;/p&gt;
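The energy, softmax and weighting steps described above can be sketched as follows; the shapes are illustrative and the linear transformations before and after are omitted for brevity:

```python
import torch

batch_size, trg_len, src_len, emb_dim = 2, 5, 7, 16
combined = torch.randn(batch_size, trg_len, emb_dim)      # decoder side
encoder_conved = torch.randn(batch_size, src_len, emb_dim)
encoder_combined = torch.randn(batch_size, src_len, emb_dim)

# energy: similarity between each target position and each source token
energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))
attention = torch.softmax(energy, dim=2)        # [batch size, trg len, src len]
attended = torch.matmul(attention, encoder_combined)  # [batch size, trg len, emb dim]
print(attention.shape, attended.shape)
```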

&lt;h1&gt;
  
  
  Decoder
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CAv8r_Qm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CAv8r_Qm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoder Initialization :
&lt;/h3&gt;

&lt;p&gt;The Decoder block takes in output_dim (output dimensions), emb_dim (embedding dimensions), hid_dim (hidden dimensions), n_layers (number of layers in the decoder convolution block), kernel_size (size of the 1D convolution kernel), dropout (dropout probability), trg_pad_idx (token index of the padding token), device (GPU/CPU) and max_length (maximum length of the sequence, used for the positional embedding).&lt;/p&gt;

&lt;p&gt;We start by setting the kernel size for the convolution operations, which are essentially filters that slide across the tokens of the sequence to encode information. After this we set trg_pad_idx to the pad token value, which can be found using vocab.stoi, a mapping from words to token indices. We also set the scale value to the square root of 0.5; this is used to ensure that the variance throughout the network does not change dramatically.&lt;/p&gt;

&lt;p&gt;Similar to the encoder, we now instantiate the token embedding function using nn.Embedding with output_dim as the input and emb_dim as the output dimension. Similarly, we create the position embedding using nn.Embedding with max_length (the maximum target sequence length) as the input and emb_dim as the output dimension. As in the encoder, both the token and position embeddings have size emb_dim because they will be added together. After this we create a linear layer for converting embeddings to hidden representations using nn.Linear with emb_dim as the input size and hid_dim as the output size. We also need another linear layer for converting hidden representations back to embeddings, using nn.Linear with hid_dim as the input size and emb_dim as the output size.&lt;/p&gt;

&lt;p&gt;Now we define the decoder convolution block. We create the convolutions with nn.Conv1d layers and then stack n_layers of them using nn.ModuleList. One thing to note: in the decoder block the padding differs from the encoder block, as we pad only at the beginning (i.e. the initial positions) rather than equally on both sides. In the encoder we padded equally on each side to ensure the length of the sentence stays the same throughout; in the decoder we pad only at the beginning of the sentence. Because we process all of the targets simultaneously in parallel, rather than sequentially, we need a way of allowing the filter translating token i to look only at tokens before token i. If it were allowed to look at token i+1 (the token it should be outputting), the model would simply learn to output the next word in the sequence by copying it, without actually learning to translate. Finally, we also add the dropout layer with nn.Dropout.&lt;/p&gt;
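This one-sided padding can be sketched as follows (the shapes are illustrative):

```python
import torch

kernel_size, hid_dim = 3, 8
conv_input = torch.randn(2, hid_dim, 5)  # [batch size, hid dim, trg len]
# pad kernel_size - 1 zero columns at the *start* of the sequence only,
# so a filter producing position i never sees tokens after position i
padding = torch.zeros(2, hid_dim, kernel_size - 1)
padded = torch.cat((padding, conv_input), dim=2)
print(padded.shape)  # torch.Size([2, 8, 7])
```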

&lt;h3&gt;
  
  
  Decoder Forward
&lt;/h3&gt;

&lt;p&gt;Now we define the forward pass for the decoder block. For this we use trg (target), encoder_conved (the conved output from the encoder) and encoder_combined (the combined output from the encoder) as the parameters.&lt;/p&gt;

&lt;p&gt;First we get the batch_size from the trg tensor using its shape attribute; index “0” gives the batch size. Similarly, we get the target length (trg_len) from the trg tensor's shape; index “1” gives the length. We use the target length (trg_len) to create a position tensor that assigns the numbers 0 to trg_len - 1 to the word positions, and then repeat this batch_size times to get the final position tensor. Later in the code we also get the hid_dim size from conv_input using its shape attribute, where index “1” gives the hidden dimension. After that we create embeddings for the target (trg) and position (pos) tensors, combine the two embeddings by element-wise summation, and apply dropout to the result (embedded). We then pass this embedding through the linear layer to convert from the embedding dimension to the hidden dimension. This results in a tensor of shape [batch size, trg len, hid dim].&lt;/p&gt;

&lt;p&gt;After this we swap the first and second dimensions to get a final tensor of shape [batch size, hid dim, trg len]. This will be used as input to the decoder convolution blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoder Convolution block :
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gLQk8K4r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gLQk8K4r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/9479fcb532214ad26fd4bda9fcf081a05e1aaf4e/assets/convseq2seq4.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The decoder convolution block is similar to the encoder convolution block with some minor changes. As in the encoder, for each convolution block we apply dropout to conv_input. But in the decoder, for padding, we concatenate a zero tensor with conv_input rather than padding equally on both sides, so that the model can't cheat by copying the next word in the sequence.&lt;br&gt;
After this we pass the padded conv_input through the convolution layer (which is similar to the conv layer in the encoder). The output is then fed to the GLU activation function and passed through the attention block along with the embedding and the outputs from the encoder, i.e. encoder_conved and encoder_combined. We then add the result to the convolution input (a residual connection) and scale the sum with the scaling factor defined in the &lt;code&gt;__init__&lt;/code&gt; function. This becomes the convolution input for the next loop iteration.&lt;/p&gt;

&lt;p&gt;As in the encoder, after all the convolution layers we swap the trg_len and hid_dim dimensions back and apply a linear transformation (hid2emb) to convert the resulting hid dim tensor to an emb dim tensor. The resulting tensor (conved) will have shape [batch size, trg len, emb dim]. After applying dropout, this output is fed to a final linear layer to produce the predictions.&lt;/p&gt;

&lt;p&gt;Thanks to Ben Trevett's &lt;a href="https://github.com/bentrevett/pytorch-seq2seq/blob/master/5%20-%20Convolutional%20Sequence%20to%20Sequence%20Learning.ipynb"&gt;notebook&lt;/a&gt; for explaining the intuition behind using convolutional layers for sequence to sequence modelling. &lt;/p&gt;

</description>
    </item>
    <item>
      <title>LSTM, GRU and Attention Mechanism</title>
      <dc:creator>sdev2030</dc:creator>
      <pubDate>Thu, 03 Dec 2020 19:52:26 +0000</pubDate>
      <link>https://dev.to/sdev2030/lstm-gru-and-attention-mechanism-203g</link>
      <guid>https://dev.to/sdev2030/lstm-gru-and-attention-mechanism-203g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Recurrent Neural Networks (RNN) are a class of artificial neural networks that&lt;br&gt;
are helpful in modelling sequence data. The output from the previous step is fed&lt;br&gt;
as input to the current step, which helps the network learn about dependencies in&lt;br&gt;
sequential data. They produce predictive results on sequential data that other&lt;br&gt;
algorithms can’t.&lt;/p&gt;

&lt;p&gt;Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two advanced&lt;br&gt;
implementations of RNNs which help to overcome the problem of vanishing&lt;br&gt;
gradients during back propagation in RNNs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long Short Term Memory (LSTM)
&lt;/h2&gt;

&lt;p&gt;In LSTM there are four layers instead of just one layer present in a standard&lt;br&gt;
RNN. These layers and the corresponding operations help LSTM to keep or forget&lt;br&gt;
information.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k0ReTKtA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h0um4mhnku55s6qwklbz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k0ReTKtA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/h0um4mhnku55s6qwklbz.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The core concept is the memory cell state (Ct) that runs through the whole&lt;br&gt;
process and helps to keep the required information throughout the processing of&lt;br&gt;
the sequence. The four main processing steps performed in an LSTM cell are as&lt;br&gt;
follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  Forget Gate:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;In this step, as the name suggests, the layer decides which information should&lt;br&gt;
  be kept or discarded from the cell state. It takes as input the concatenation of&lt;br&gt;
  the previous hidden state (ht-1) and the current input (xt), and passes it through a&lt;br&gt;
  sigmoid function to get an output between 0 and 1. Closer to 0 means forget and&lt;br&gt;
  closer to 1 means keep.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mcMAmXNt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mfzggu4kjbyscfc5x9vy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mcMAmXNt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/mfzggu4kjbyscfc5x9vy.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Input Gate:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;In this step, we decide what needs to be updated into the cell state. There&lt;br&gt;
  are two layers used to achieve this. First, the previous hidden state and&lt;br&gt;
  current input are passed through a sigmoid function to decide which values&lt;br&gt;
  are important. Second, the previous hidden state and current input are passed&lt;br&gt;
  through a tanh function to squish the values between -1 and 1. Then we multiply the&lt;br&gt;
  tanh output with the sigmoid output to decide which information to keep.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-n-HG43--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/16vkpbrfzctty76vqnh8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-n-HG43--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/16vkpbrfzctty76vqnh8.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Update Cell State:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;In this step, we calculate the new cell state. First, we multiply the
  previous cell state by the output of the forget gate. Then we add the output
  of the input gate to get the new cell state.&lt;/p&gt;
&lt;/blockquote&gt;
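&lt;p&gt;The update is a single line of arithmetic (shown with scalars for clarity; in a real LSTM it is applied element-wise to vectors):&lt;/p&gt;

```python
# c_t = f_t * c_{t-1} + (i_t * c~_t)
# forget_out scales down the old cell state; input_out adds new information.
def update_cell_state(c_prev, forget_out, input_out):
    return forget_out * c_prev + input_out
```

For example, with a half-open forget gate (0.5) on an old state of 1.0 and an input-gate contribution of 0.2, the new cell state is 0.7.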

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--r9Tw1FDQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d8kuh4ympb8upe8gu1c0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--r9Tw1FDQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d8kuh4ympb8upe8gu1c0.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Output Gate:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;In this step, we calculate the next hidden state, which is also used as the
  output for predictions. First, we pass the previous hidden state and current
  input through a sigmoid function. Then we pass the new cell state through a
  tanh function. Finally, we multiply the tanh output by the sigmoid output to
  derive the new hidden state.&lt;/p&gt;
&lt;/blockquote&gt;
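&lt;p&gt;A scalar sketch of the output gate (w_o, u_o, b_o are hypothetical learned parameters): the sigmoid decides how much of the cell state to expose, and the tanh keeps the exposed value bounded.&lt;/p&gt;

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def output_gate(h_prev, x_t, c_t, w_o=0.5, u_o=0.5, b_o=0.0):
    o = sigmoid(w_o * h_prev + u_o * x_t + b_o)  # how much of the cell state to expose
    return o * math.tanh(c_t)                    # new hidden state h_t, in (-1, 1)
```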

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tpOOKCJv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uu8hyk9k5zeuqp98ll9p.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tpOOKCJv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uu8hyk9k5zeuqp98ll9p.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Gated Recurrent Unit (GRU)
&lt;/h2&gt;

&lt;p&gt;The GRU is a simplified version of the LSTM. It has no separate cell state
and instead uses the hidden state to transfer information. It has only two gates
instead of three: the forget and input gates are combined into a single update
gate, and a reset gate decides how much of the previous hidden state to discard
when computing the new candidate state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GLPIUEL1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qnh41we6wkr90j7ey7ip.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GLPIUEL1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qnh41we6wkr90j7ey7ip.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y_IcYZVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rppcb6kg7qrb9bdc0t1s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y_IcYZVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rppcb6kg7qrb9bdc0t1s.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Attention Mechanism
&lt;/h2&gt;

&lt;p&gt;In Seq2Seq RNN models, we introduce an attention mechanism between the
encoder and decoder networks to help the decoder focus on specific parts of the
encoded inputs at each of its time steps. The following diagram illustrates this
concept for the first time step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zQ-bUVPn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/niyr089uvl1wfpj5pfpb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zQ-bUVPn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/niyr089uvl1wfpj5pfpb.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The attention mechanism learns the attention weights i01, i02, i03 and i04
using a fully connected (FC) layer followed by a softmax function. It takes as
input the initial state vector S0 of the decoder network and the outputs h0, h1,
h2 and h3 from the encoder network. The context vector C1 is then calculated as
the weighted sum of the hidden states h0, h1, h2 and h3, weighted by i01, i02,
i03 and i04 respectively.&lt;/p&gt;

</description>
      <category>lstm</category>
      <category>gru</category>
      <category>attention</category>
    </item>
  </channel>
</rss>
