Sequence to Sequence model using Multi-head Attention Transformer Architecture

In this blog we describe the different components of a transformer model used for sequence-to-sequence tasks such as language translation. The architecture was introduced in the 2017 landmark paper "Attention Is All You Need" and is widely adopted by machine learning practitioners to achieve state-of-the-art (SOTA) results on various tasks.

The diagram at the beginning of this blog shows its encoder-decoder architecture. The left side of the diagram is the encoder and the right side is the decoder.

Encoder:

The input data is passed through an embedding layer, and the result is sent as the query, key and value tensors into a stack of encoder layers, each containing a multi-head attention layer. In each encoder layer block, the output of the multi-head attention layer is added to the layer's input embedding and fed into feed-forward layers to produce the encodings.

Initialization Function:

Parameters accepted for instantiating the encoder object are input_dim (input dimension), hid_dim (hidden layer dimension), n_layers (number of encoding layers), n_heads (number of heads for the encoding layers), pf_dim (position-wise feed-forward layer input dimension), dropout (dropout as a decimal fraction), device (device type to be used) and max_length (maximum length of the input sequence).

Set the device to the input device passed to the encoder. Set the scale value to the square root of hid_dim and move it to the device; this scaling factor is used so that the values fed into the attention layers do not grow too large.
Instantiate the token embedding using nn.Embedding with input_dim as the input and hid_dim as the output dimension. Instantiate the position embedding using nn.Embedding with max_length as the input and hid_dim as the output dimension. Both the token and position embeddings have the same size, hid_dim, as they will be added together.

Instantiate layers, which will hold the EncoderLayer blocks, using the nn.ModuleList function to match the input parameter n_layers. EncoderLayer takes hid_dim, n_heads, pf_dim, dropout and device as input parameters. Instantiate dropout with an nn.Dropout layer using the dropout value passed to the function.
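To make this concrete, here is a minimal sketch of how such an Encoder initialization could look in PyTorch. The names mirror the description above (EncoderLayer is sketched later in this post); the exact code in the original notebook may differ slightly:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, n_heads,
                 pf_dim, dropout, device, max_length=100):
        super().__init__()
        self.device = device
        # token and position embeddings share size hid_dim so they can be summed
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        # n_layers identical encoder blocks
        self.layers = nn.ModuleList([EncoderLayer(hid_dim, n_heads, pf_dim, dropout, device)
                                     for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)
        # scaling factor sqrt(hid_dim), kept on the same device
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
```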

Forward Function:

Parameters accepted are the source (src) tensor and its mask (src_mask).

We get batch_size from the src tensor using its shape attribute; index 0 gives the batch size, as the batch-first parameter is set in the iterator definition. The source length (src_len) is also obtained from the src tensor's shape; index 1 gives the source length.

Using the source length (src_len), create a position tensor that assigns the numbers 0 to src_len - 1 to the word positions. Repeat this batch_size times to get the final position tensor (pos). Create embeddings for the source (src) and position (pos) tensors, and combine the two embeddings by element-wise summation. Apply dropout to the result (embedded), since we are building the embedding from scratch.

This results in a tensor of shape [batch size, src len, hid dim], which is used as input to the EncoderLayer blocks.
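Continuing the Encoder sketch above, the forward pass could look roughly like this. As a sketch, it multiplies the token embedding by the scale factor defined in the initialization (the same scaling mentioned later for the decoder) and then loops over the encoder layers described next:

```python
    def forward(self, src, src_mask):
        # src = [batch size, src len], src_mask = [batch size, 1, 1, src len]
        batch_size = src.shape[0]
        src_len = src.shape[1]
        # positions 0 .. src_len - 1, repeated for every example in the batch
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # scale token embeddings, add position embeddings, then apply dropout
        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        # src = [batch size, src len, hid dim]
        for layer in self.layers:
            src = layer(src, src_mask)
        return src
```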

Encoder Layer:

Parameters accepted are hid_dim (hidden dimension), n_heads (number of heads for multi-head attention), pf_dim (position-wise feed-forward dimension), dropout (dropout fraction) and device (device to be used).

During initialization, the normalization layer is defined with nn.LayerNorm using the hid_dim parameter. The attention layer is defined with MultiHeadAttention using the hid_dim, n_heads, dropout and device parameters. The position-wise feed-forward layer is defined with PositionwiseFeedForward using hid_dim, pf_dim and dropout.

In the forward function, for each encoder layer block, apply the multi-head attention layer to the encoder layer input by passing that same input as the query, key and value tensors, along with the source mask. Add the result from the attention layer to the encoder layer input, then apply layer normalization to get _src. Apply the position-wise feed-forward layer to _src to get the result. Add the result to _src and apply layer normalization to get the final embedding.
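A sketch of such an encoder layer, together with a minimal position-wise feed-forward layer, is shown below. It assumes the MultiHeadAttentionLayer sketched in the next section, and applies dropout to each sub-layer output before the residual connection, as is standard in transformer implementations:

```python
class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # self-attention: the same tensor acts as query, key and value
        _src, _ = self.self_attention(src, src, src, src_mask)
        # residual connection followed by layer normalization
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        # position-wise feed-forward, then another residual connection + layer norm
        _src = self.positionwise_feedforward(src)
        src = self.ff_layer_norm(src + self.dropout(_src))
        return src


class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)
        self.fc_2 = nn.Linear(pf_dim, hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # expand to pf_dim, apply ReLU and dropout, project back to hid_dim
        return self.fc_2(self.dropout(torch.relu(self.fc_1(x))))
```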

Multi-Head Attention:

Attention is a mechanism that allows a model to focus on the necessary parts of the input sequence as per the demands of the task at hand.

Researchers at Google like to look at everything as an information retrieval problem, so the "Attention is all you need" paper frames attention in terms of queries, keys and values. A search engine accepts a query and matches it against indices (i.e. keys) in order to return the appropriate values as results. Similarly, one can think of attention as a mechanism in which the query and key vectors work together to produce attention weights that are then applied to the values.

When multiple channels(or heads) of attention are applied in parallel to a single source, it is known as multi-headed attention. This increases the learning capacity of the model and therefore leads to better results.

We define a MultiHeadAttentionLayer class that is responsible for applying the multi-headed attention mechanism within the transformer. It accepts the query, key, and value tensors as input and uses fully connected layers to preprocess them. These are then split into multiple heads along the 'hid_dim' axis to give rise to the required queries, keys, and values for applying multi-headed attention. The attention energies are generated by multiplying the multi-head queries and keys together. These energies are then passed into a softmax function to give rise to attention weights that are applied to the values tensor. This helps the model focus on the necessary aspects of the values tensor. This value vector is resized to its original dimension and returned as an output of the Multi-Head Attention operation.
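A sketch of such a MultiHeadAttentionLayer follows. The names and tensor shapes are illustrative; the division by the square root of the per-head dimension is the standard scaled dot-product attention:

```python
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        assert hid_dim % n_heads == 0, "hid_dim must be divisible by n_heads"
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        # fully connected layers to preprocess the query, key and value tensors
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        self.fc_o = nn.Linear(hid_dim, hid_dim)   # output projection
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        # project, then split hid_dim into n_heads heads of head_dim each
        Q = self.fc_q(query).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = self.fc_k(key).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = self.fc_v(value).view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # attention energies: scaled dot product of queries and keys
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        # softmax over the key dimension gives the attention weights
        attention = torch.softmax(energy, dim=-1)
        # apply the weights to the values and merge the heads back into hid_dim
        x = torch.matmul(self.dropout(attention), V)
        x = x.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.hid_dim)
        return self.fc_o(x), attention
```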

Decoder:

In the decoder, the output data is first shifted right and then passed through the output embedding layer; the result is sent as the query, key and value tensors into a stack of decoder layers containing multi-head attention (MHA) layers. In each decoder layer block, similar to the encoder block, the output of the MHA layer is added to the output embedding and fed into feed-forward layers. The difference is that the decoder block has two MHA layers: one is the normal MHA layer and the other is a masked MHA layer. The target is fed to the masked MHA layer, while for the normal MHA layer we use the decoder's masked MHA layer output as the query and the encoder's output as the key and value.

The reason for having two different attention layers is to capture both self-attention (between target words) and encoder attention (attention between target and input words).

Initialization Function:

Parameters accepted for instantiating the decoder object are output_dim (output dimension), hid_dim (hidden layer dimension), n_layers (number of decoding layers), n_heads (number of heads for the decoding MHA layers), pf_dim (position-wise feed-forward layer input dimension), dropout (dropout as a decimal fraction), device (device type to be used) and max_length (maximum length of the target sequence).

Similar to the encoder, we set the device to the input device passed to the decoder. Set the scale value to the square root of hid_dim and move it to the device; as in the encoder, this keeps the values fed into the attention layers from growing too large.
As said, the decoder is for the most part the same as the encoder, so we again instantiate token embeddings using nn.Embedding, but with output_dim as the input and hid_dim as the output dimension, and similarly position embeddings using nn.Embedding with max_length (of the target sequence) as the input and hid_dim as the output dimension. As in the encoder, the token and position embeddings in the decoder block are of size hid_dim, as they will be added together.

Instantiate layers, which will hold the DecoderLayer blocks, using the nn.ModuleList function to create n_layers decoder blocks. Similar to EncoderLayer, DecoderLayer takes hid_dim, n_heads, pf_dim, dropout and device as parameters. Instantiate dropout with an nn.Dropout layer using the dropout value passed to the function.
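A minimal sketch of this decoder initialization is shown below. The final linear layer fc_out, which projects hid_dim back to the output vocabulary, is not discussed above and is an assumption of this sketch so the decoder can produce predictions; DecoderLayer is sketched at the end of this post:

```python
class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, n_layers, n_heads,
                 pf_dim, dropout, device, max_length=100):
        super().__init__()
        self.device = device
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, n_heads, pf_dim, dropout, device)
                                     for _ in range(n_layers)])
        # assumed output projection from hid_dim to the target vocabulary size
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)
```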

Forward Function:

Parameters accepted are the target (trg) tensor and its mask (trg_mask), along with the encoded source (enc_src) tensor and its mask (src_mask).

We get batch_size from the trg tensor using its shape attribute; index 0 gives the batch size, as the batch-first parameter is set in the iterator definition. The target length (trg_len) is also obtained from the trg tensor's shape; index 1 gives the target length.

Again following what we did in the encoder, using the target length (trg_len) we create a position tensor that assigns the numbers 0 to trg_len - 1 to the word positions. Repeat this batch_size times to get the final position tensor (pos). Create embeddings for the target (trg) and position (pos) tensors. Scale up the tok_embedding and then combine the two embeddings by element-wise summation. We apply dropout to get our target embeddings of shape [batch size, trg len, hid dim]. This is used as input to the DecoderLayer blocks.
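Continuing the Decoder sketch, a rough forward pass might look like this (the argument order and the returned attention tensor are assumptions of the sketch):

```python
    def forward(self, trg, enc_src, trg_mask, src_mask):
        # trg = [batch size, trg len], enc_src = [batch size, src len, hid dim]
        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        # positions 0 .. trg_len - 1, repeated for every example in the batch
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # scale token embeddings, add position embeddings, apply dropout
        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
        # trg = [batch size, trg len, hid dim]
        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        # project to the output vocabulary
        output = self.fc_out(trg)
        return output, attention
```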

Decoder Layer:

Parameters accepted are hid_dim (hidden dimension), n_heads (number of heads for the MHA layers), pf_dim (position-wise feed-forward dimension), dropout (dropout fraction) and device (device to be used). As previously mentioned, the decoder block is mostly the same as the encoder block except for its two attention layers, so let's see what is specific to those layers. The first MHA layer is a masked variant of the normal MHA layer and performs self-attention, similar to the encoder, but over the target sequence. For this, the target embedding is used as the query, key and value.

After this we apply dropout, and the target embedding is added back via a residual connection, followed by layer normalization. This layer uses the trg_mask to prevent the decoder from cheating: it constrains the decoder to attend only to the current and earlier words, never to words that come after the current one.
Another thing to note is how we feed enc_src into the decoder. We feed it to the second MHA layer, in which the queries are the output from the masked MHA layer and the keys and values are the encoder output. The src_mask is used to prevent this MHA layer from attending to the padding tokens in the source. We again apply dropout, and the output from the masked MHA layer is added back via a residual connection, followed by layer normalization.

This is then passed through the position-wise feed-forward layer, followed again by dropout; we then add the output of the previous MHA sub-layer via a residual connection and apply layer normalization.
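Putting the decoder layer together, here is a sketch of the two attention sub-layers and the feed-forward sub-layer described above, reusing the MultiHeadAttentionLayer and PositionwiseFeedforwardLayer sketches from the encoder sections:

```python
class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, enc_src, trg_mask, src_mask):
        # masked self-attention over the target sequence (trg_mask hides future words)
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
        # encoder attention: queries from the decoder, keys/values from the encoder output
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
        # position-wise feed-forward with residual connection and layer norm
        _trg = self.positionwise_feedforward(trg)
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        return trg, attention
```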
