<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Shashank-Holla</title>
    <description>The latest articles on DEV Community by Shashank-Holla (@shashankholla_10).</description>
    <link>https://dev.to/shashankholla_10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F529203%2F0b64ca99-8cfb-43e2-b22b-95c58cfbe563.png</url>
      <title>DEV Community: Shashank-Holla</title>
      <link>https://dev.to/shashankholla_10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shashankholla_10"/>
    <language>en</language>
    <item>
      <title>Attention in Seq2Seq models</title>
      <dc:creator>Shashank-Holla</dc:creator>
      <pubDate>Thu, 04 Feb 2021 19:39:24 +0000</pubDate>
      <link>https://dev.to/shashankholla_10/attention-in-seq2seq-models-11ik</link>
      <guid>https://dev.to/shashankholla_10/attention-in-seq2seq-models-11ik</guid>
      <description>&lt;p&gt;In this post, we will discuss about transformer model, an attention based model which has significant boost in model training speed. In this sequence processing model, there is no recurrent layers or convolution layers being used. Instead, it is made of attention and fully connected layers.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Convolutional Sequence to Sequence learning - A Closer Look</title>
      <dc:creator>Shashank-Holla</dc:creator>
      <pubDate>Thu, 28 Jan 2021 19:17:34 +0000</pubDate>
      <link>https://dev.to/shashankholla_10/convolutional-sequence-to-sequence-learning-a-closer-look-bdn</link>
      <guid>https://dev.to/shashankholla_10/convolutional-sequence-to-sequence-learning-a-closer-look-bdn</guid>
      <description>&lt;p&gt;In this post we take a closer look into Convolution based Sequence to Sequence machine translation. &lt;/p&gt;

&lt;p&gt;In simple words, machine translation is the translation of text from one language to another. Sequence to sequence (Seq2Seq), an encoder-decoder based architecture, is used to convert sequences in the source language into sequences in the target language. The most widely known approach to sequence-to-sequence translation uses recurrent neural networks. Compared to RNNs, convolution-based networks are less common but have certain advantages. In this post, we will look at the inner workings of the encoder and decoder modules, using a PyTorch implementation of German-to-English translation as the running example.&lt;/p&gt;

&lt;p&gt;Source code for this is available &lt;a href="https://github.com/Shashank-Holla/TSAI-END-Program/tree/main/11-%20HandsOn%205"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Compared to recurrent networks, convolutional models create fixed-size context representations, which can be made larger by stacking convolutional layers on top of each other. This gives the CNN control over the maximum length of the dependencies to be modeled. Just as multiple convolution blocks in the image domain capture edges and gradients, then textures, patterns and parts of objects, a multi-layered CNN creates a hierarchical representation of the input sentence in which nearby input elements interact at lower layers and distant elements interact at higher layers. During training, the computations in a convolutional network are applied over all elements in parallel, which exploits GPU hardware.&lt;/p&gt;
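
&lt;p&gt;As a rough, illustrative aside (not from the linked implementation): with kernel size k and stride 1, stacking n convolutional layers covers n*(k-1)+1 input tokens, so depth directly controls how distant the modeled dependencies can be.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rough sketch: receptive field of stacked 1D convolutions (stride 1).
# Each extra layer widens the window over the input by (kernel_size - 1) tokens.
def receptive_field(num_layers, kernel_size=3):
    return num_layers * (kernel_size - 1) + 1

for layers in (1, 3, 5, 10):
    print(layers, "layers cover", receptive_field(layers), "input tokens")
&lt;/code&gt;&lt;/pre&gt;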

&lt;p&gt;The Seq2Seq model employs an encoder-decoder architecture for translation. The encoder's role is to encode the input sentence, which is in the source language, into context vectors. The encoder produces two context vectors per token, so if the input German sentence has 4 tokens, it produces 8 context vectors. The two context vectors produced by the encoder are the 'conved vector' and the 'combined vector'. The conved vector is produced by passing each token through a few fully connected layers and the convolutional block. The combined vector is the elementwise sum of the conved vector and the embedding of the token.&lt;br&gt;
The decoder's role is to use the context vectors to produce the output sentence in the target language. Unlike recurrent models, the decoder predicts all the tokens in the target sentence in parallel. We'll look into the encoder's and decoder's workings separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encoding process
&lt;/h2&gt;

&lt;p&gt;Below is the architecture diagram of the encoder block. We'll look into each segment of this block in detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zncTPnRD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/98ufl810tjhvz15xd0x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zncTPnRD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/98ufl810tjhvz15xd0x7.png" alt="Encoder"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Embedding vector
&lt;/h3&gt;

&lt;p&gt;During data pre-processing, the input sentence in the source language was tokenized and indexed. In the encoding layer, these tokens are passed through an embedding layer to create word embeddings. Unlike recurrent networks, which process each token sequentially, the CNN-based model processes all the tokens simultaneously, so the model has no inherent information about the position of the tokens within the sequence. To rectify this, information about the position of each token is passed along with the token embedding: the position of the token is passed through a second embedding layer to create a positional embedding. The token embedding and positional embedding are then summed elementwise to create the embedding vector.&lt;/p&gt;
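
&lt;p&gt;A minimal sketch of this step (the sizes and the names tok_embedding and pos_embedding are mine, not from the linked code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

# Illustrative sizes; the real model's vocabulary and dimensions will differ.
vocab_size, max_len, emb_dim = 8000, 100, 256
tok_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_len, emb_dim)

src = torch.randint(0, vocab_size, (1, 7))          # [batch, src_len] token indices
pos = torch.arange(src.shape[1]).unsqueeze(0)       # [1, src_len] position indices
embedded = tok_embedding(src) + pos_embedding(pos)  # elementwise sum, [batch, src_len, emb_dim]
&lt;/code&gt;&lt;/pre&gt;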

&lt;h3&gt;
  
  
  2. Fully connected layer-1
&lt;/h3&gt;

&lt;p&gt;The embedding vector is now passed through a fully connected layer. This adds capacity to the model as well as transforms the embedding vector to the required hidden dimension size.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Convolutional block
&lt;/h3&gt;

&lt;h4&gt;
  
  
  a) Convolution
&lt;/h4&gt;

&lt;p&gt;During convolution, the kernel takes in as many words from the sentence as its kernel size and convolves over them to produce a feature map. On its own, this reduces the length of the sentence by (kernel size - 1). To keep the length of the sentence after convolution the same as before convolution, the input sentence is padded with a padding element on each end, with the padding amount on each side equal to (kernel size - 1) / 2. This is passed as an argument to the convolution function.&lt;/p&gt;
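
&lt;p&gt;A small sketch of this, assuming kernel size 3 and an illustrative hidden dimension (values are mine, not necessarily those of the linked code):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

hid_dim, kernel_size, src_len = 512, 3, 7
conv = nn.Conv1d(in_channels=hid_dim,
                 out_channels=2 * hid_dim,        # doubled so GLU can halve it again (see next step)
                 kernel_size=kernel_size,
                 padding=(kernel_size - 1) // 2)  # keeps the sequence length unchanged

x = torch.randn(1, hid_dim, src_len)  # [batch, hid_dim, src_len]
conved = conv(x)                      # [batch, 2 * hid_dim, src_len], same src_len
&lt;/code&gt;&lt;/pre&gt;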

&lt;h4&gt;
  
  
  b) GLU activation
&lt;/h4&gt;

&lt;p&gt;The convolution output is then passed through an activation function called Gated Linear Units (GLU). GLU splits the input evenly into two tensors, applies a sigmoid to the second tensor and multiplies it elementwise with the first. As per the authors, this gating mechanism allows the selection of words or features that are important for predicting the next word. Because the split halves the hidden dimension, the hidden dimension is doubled during convolution so that it is preserved through the GLU activation.&lt;/p&gt;
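
&lt;p&gt;In PyTorch, this step can be sketched with the built-in GLU function, continuing the illustrative tensor from the convolution above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

# GLU splits the channel dimension in half, applies a sigmoid to the second half
# and multiplies it elementwise with the first, halving the hidden dimension.
conved = torch.randn(1, 2 * 512, 7)  # illustrative convolution output
gated = F.glu(conved, dim=1)         # [1, 512, 7]
&lt;/code&gt;&lt;/pre&gt;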

&lt;h4&gt;
  
  
  c) Residual Addition
&lt;/h4&gt;

&lt;p&gt;Similar to the residual path in ResNet, the output from the GLU activation is summed elementwise with the same vector as it was before being passed through the convolution layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Fully connected layer-2
&lt;/h3&gt;

&lt;p&gt;The vector from the convolutional block is now fed into a fully connected layer. This again adds capacity to the model and transforms the vector back from the hidden dimension to the embedding dimension. This vector is called the conved vector.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Residual layer
&lt;/h3&gt;

&lt;p&gt;The conved vector is summed elementwise with the embedding vector via a residual connection that bypasses the convolutional block. This new vector is called the combined vector.&lt;/p&gt;

&lt;p&gt;A conved vector and a combined vector are generated for each token in the input sentence.&lt;/p&gt;
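
&lt;p&gt;Putting the encoder outputs together, a tiny sketch (tensor contents are placeholders) of how the two context vectors per token relate:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

embedded = torch.randn(1, 4, 256)  # token + positional embedding, [batch, src_len, emb_dim]
conved   = torch.randn(1, 4, 256)  # output of fully connected layer-2, back in emb_dim

combined = conved + embedded       # residual sum that bypasses the convolutional block
# For a 4-token sentence the encoder hands the decoder 4 conved and 4 combined vectors.
&lt;/code&gt;&lt;/pre&gt;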

&lt;h2&gt;
  
  
  Decoder process
&lt;/h2&gt;

&lt;p&gt;During training, the decoder takes in the actual target tokens and tries to predict them. Shared below are the segments that make up the decoder block.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vgIsI0gy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lz0l80qprvetzk8o1ph6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vgIsI0gy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lz0l80qprvetzk8o1ph6.png" alt="Decoder"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Embedding vector
&lt;/h3&gt;

&lt;p&gt;As in the encoder, embeddings are calculated for the target tokens and their positions, then summed elementwise.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Fully connected layer-1
&lt;/h3&gt;

&lt;p&gt;The embedding vector is passed through a fully connected layer, which converts the input from the embedding dimension to the hidden dimension.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Convolution block
&lt;/h3&gt;

&lt;h4&gt;
  
  
  a) Convolution
&lt;/h4&gt;

&lt;p&gt;Unlike in the encoder, where padding was applied equally on both ends of the sentence, padding in the decoder is applied only at the beginning of the sentence. This ensures the kernel only looks at the current and previous words and prevents it from looking at the next word (the token that needs to be predicted), which keeps the model from simply copying the next word instead of learning to translate. Apart from this change, the convolution is the same as the encoder's.&lt;/p&gt;
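
&lt;p&gt;A sketch of the decoder-side padding, assuming kernel size 3 (variable names and the use of zeros as the pad embedding are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

hid_dim, kernel_size, trg_len = 512, 3, 5
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)  # no symmetric padding here

x = torch.randn(1, hid_dim, trg_len)
front_pad = torch.zeros(1, hid_dim, kernel_size - 1)  # placeholder; a pad-token embedding is typically used
x_padded = torch.cat((front_pad, x), dim=2)           # pad only the beginning of the sentence
conved = conv(x_padded)                               # [batch, 2 * hid_dim, trg_len]: no peeking at future tokens
&lt;/code&gt;&lt;/pre&gt;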

&lt;h4&gt;
  
  
  b) GLU activation and attention
&lt;/h4&gt;

&lt;p&gt;The GLU activation is similar to the one applied in the encoder. After the GLU activation, however, attention is calculated using the encoder output as well as the embedding of the current word. The convolution output's dimension is transformed by a fully connected layer and then summed with its embedding through a residual connection. Attention is then calculated on this combination by checking how closely it matches the encoder's convolution output: first the energy is computed by taking a dot product of the combination with the encoder's convolution output, and then a softmax is applied to turn the energies into attention weights. Taking the weighted sum of the encoder outputs with these attention weights gives the decoder extra information about the relevant tokens of the encoded sequence, which is very useful in making the prediction.&lt;/p&gt;
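
&lt;p&gt;A rough sketch of this attention step with placeholder tensors (shapes and names are mine; implementations such as the linked tutorial commonly take the weighted sum over the encoder's combined vectors):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn.functional as F

batch, trg_len, src_len, emb_dim = 1, 5, 7, 256
decoder_combined = torch.randn(batch, trg_len, emb_dim)  # decoder conved output projected to emb_dim plus its embedding
encoder_conved   = torch.randn(batch, src_len, emb_dim)
encoder_combined = torch.randn(batch, src_len, emb_dim)

# Energy: how well each decoder position matches each encoder position.
energy = torch.matmul(decoder_combined, encoder_conved.permute(0, 2, 1))  # [batch, trg_len, src_len]
attention = F.softmax(energy, dim=2)                                      # weights sum to 1 over src_len

# Weighted sum of the encoder representations, used for the prediction.
attended = torch.matmul(attention, encoder_combined)                      # [batch, trg_len, emb_dim]
&lt;/code&gt;&lt;/pre&gt;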

&lt;h2&gt;
  
  
  References:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Seq2Seq PyTorch tutorial - &lt;a href="https://github.com/bentrevett/pytorch-seq2seq"&gt;https://github.com/bentrevett/pytorch-seq2seq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Convolutional Sequence to Sequence Learning - &lt;a href="https://arxiv.org/pdf/1705.03122.pdf"&gt;https://arxiv.org/pdf/1705.03122.pdf&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Call for your attention! Do you remember LSTM?</title>
      <dc:creator>Shashank-Holla</dc:creator>
      <pubDate>Thu, 03 Dec 2020 18:40:29 +0000</pubDate>
      <link>https://dev.to/shashankholla_10/remember-lstm-3mla</link>
      <guid>https://dev.to/shashankholla_10/remember-lstm-3mla</guid>
      <description>&lt;p&gt;&lt;em&gt;This is my attempt to jot down idea behind Recurrent Neural network variants such as LSTM and GRU and the thought process behind attention mechanism.&lt;/em&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Idea behind RNN
&lt;/h3&gt;

&lt;p&gt;Deep learning has provided many solutions to interesting problems in the computer vision, speech and audio domains. One such weapon from deep learning's arsenal for tackling sequence data such as text and audio is the recurrent neural network. Recurrent neural networks allow us to extract the gist of text data (sentiment analysis), annotate sequences (image captioning) or even generate new sequences (language translation). A recurrent network possesses the ability to persist information, which allows us to operate on a sequence of inputs to produce a sequence of output vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Need for LSTMs and GRUs
&lt;/h3&gt;

&lt;p&gt;The persistence of information in an RNN is made possible by loops in the network. When an RNN makes its prediction, it considers the current input as well as the learnings from past inputs. To see this, the RNN can be viewed as an unrolled version of the same network, with the learning from each time step passed to the next.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TjMZjkBY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/l3e6sw1ekg8ioa6bx7ah.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TjMZjkBY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/l3e6sw1ekg8ioa6bx7ah.jpg" alt="RNN"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An RNN performs really well on problems with short contexts. But as the context gets longer and the number of time steps increases, the network fails to carry forward the learnings from the initial steps. The network effectively suffers from short-term memory, caused by vanishing gradients. Remember that the network learns by calculating gradients and adjusting its internal weights; when the calculated gradients become very small (hence, vanishing gradients), the network fails to learn anything meaningful. As an example, consider a long sentence that says:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The General's method is to have his troops ready by dawn.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our recurrent network might find it hard to decide the context of the word 'General' and whether it is used as an adjective or a noun in the sentence. LSTM and GRU try to solve these problems.&lt;/p&gt;
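
&lt;p&gt;To make the unrolled view above concrete before moving on, here is a toy sketch (names and sizes are mine) of a vanilla RNN cell applied step by step, carrying its hidden state forward:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn

input_dim, hidden_dim, seq_len = 16, 32, 10
cell = nn.RNNCell(input_dim, hidden_dim)

inputs = torch.randn(seq_len, 1, input_dim)  # [time, batch, input_dim]
h = torch.zeros(1, hidden_dim)
for x_t in inputs:        # the "unrolled" view: the same cell is applied at every time step
    h = cell(x_t, h)      # gradients must flow back through every earlier step, and may vanish
&lt;/code&gt;&lt;/pre&gt;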

&lt;h3&gt;
  
  
  Talking about LSTM
&lt;/h3&gt;

&lt;p&gt;The core idea behind LSTM (Long Short-Term Memory) is the cell state, a pathway that transfers relevant information all the way down the sequence chain. Its working is somewhat similar to the skip connections used in &lt;strong&gt;ResNet&lt;/strong&gt; models: it is very easy for information to travel along it unchanged through the sequence. LSTM provides means to add or remove information from the cell state through structures called gates. LSTM has three such gates - the input gate, the forget gate and the output gate - to learn over time which information is important. The cell state and hidden state from the previous time step, together with the current input, are processed by these gates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5sGvgxu1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7r4btprmc6q1mtzg0c3n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5sGvgxu1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7r4btprmc6q1mtzg0c3n.jpg" alt="LSTM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first contributor in LSTM is the forget gate. Its role is to decide which part of the cell state's information needs to be thrown away and which part kept. A sigmoid function is used in this gate; it acts like a selector/controller that selectively removes features from the embedding vector. In our example, the forget gate looks at the previous hidden state h(t-1) and the input x(t) and might decide to forget the adjective sense of the word 'General'.&lt;/p&gt;

&lt;p&gt;The second contributor is the input gate, which decides which part of the input's information needs to be added to the cell state. Here, a sigmoid function selects the part of the input that is to be updated. Next, a tanh function creates a new candidate vector from the inputs. By the characteristics of tanh, the new vector is squashed into the range -1 to 1, which regulates the network and prevents possible gradient explosions. By elementwise multiplication, the sigmoid output decides which features of the new candidate vector are important. In our 'General' example, the model might decide the word is being used as a noun and add this part of speech to the cell state.&lt;br&gt;
With the forget gate deciding which part of the previous cell state is to be forgotten and the input gate deciding which part of the input is to be added, the new cell state is calculated.&lt;/p&gt;

&lt;p&gt;The last contributor is the output gate, which decides the next hidden state. The output gate provides a filtered version of the new cell state. It again uses a sigmoid function to decide the important features of the cell state and a tanh function to squash the new cell state into the desired range. In the 'General' example, this would amount to the model deciding that noun is the right part of speech and passing it on to assist the following time steps.&lt;/p&gt;
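
&lt;p&gt;The three gates can be summarized in a short sketch (a simplified form of the standard LSTM cell equations; the stacked weight layout is illustrative, not code from any particular library):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """Simplified LSTM step; W, U, b hold the stacked weights of the four internal networks."""
    gates = x_t @ W + h_prev @ U + b
    i, f, g, o = gates.chunk(4, dim=-1)
    f = torch.sigmoid(f)            # forget gate: what to drop from the cell state
    i = torch.sigmoid(i)            # input gate: which parts of the candidate to add
    g = torch.tanh(g)               # candidate vector, squashed to the range -1 to 1
    o = torch.sigmoid(o)            # output gate: filtered view of the new cell state
    c_new = f * c_prev + i * g      # new cell state
    h_new = o * torch.tanh(c_new)   # new hidden state
    return h_new, c_new
&lt;/code&gt;&lt;/pre&gt;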

&lt;h3&gt;
  
  
  Talking about GRU
&lt;/h3&gt;

&lt;p&gt;The Gated Recurrent Unit (GRU) is similar to LSTM but with a few modifications. GRU doesn't have a cell state and uses only the hidden state to transfer information. It uses two gates - an update gate and a reset gate - rather than the three gates used in LSTM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dRBgH7DF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wbckmis7bdotupu4r6wu.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dRBgH7DF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wbckmis7bdotupu4r6wu.jpg" alt="GRU"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first gate is the update gate, which acts similarly to the forget gate and input gate of LSTM. It helps the model determine how much of the past information needs to be passed along.&lt;/p&gt;

&lt;p&gt;The second gate is the reset gate, which decides how much of the past information to forget.&lt;/p&gt;
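
&lt;p&gt;For comparison, a simplified sketch of a GRU step (again illustrative, not library code; conventions for the update gate vary slightly between references):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

def gru_step(x_t, h_prev, W, U, b, W_h, U_h, b_h):
    """Simplified GRU step; only a hidden state, no separate cell state."""
    z, r = torch.sigmoid(x_t @ W + h_prev @ U + b).chunk(2, dim=-1)  # update and reset gates
    h_cand = torch.tanh(x_t @ W_h + (r * h_prev) @ U_h + b_h)        # candidate hidden state
    return (1 - z) * h_prev + z * h_cand
&lt;/code&gt;&lt;/pre&gt;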

&lt;h3&gt;
  
  
  Differences and which is better?
&lt;/h3&gt;

&lt;p&gt;GRU doesn't possess an internal memory (cell state), unlike LSTM. GRU also has three internal neural networks compared to LSTM's four, so it has fewer tensor operations and is a little speedier to train. However, neither model outweighs the other. When training on less data, or when speed matters, GRU can be considered; for longer sequences where long-distance context must be maintained, LSTM might be preferred.&lt;/p&gt;
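
&lt;p&gt;A quick way to see the size difference is to compare parameter counts of PyTorch's built-in layers (dimensions here are arbitrary):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256)  # 4 internal weight sets
gru = nn.GRU(input_size=128, hidden_size=256)    # 3 internal weight sets

print("LSTM parameters:", count_params(lstm))
print("GRU parameters: ", count_params(gru))
&lt;/code&gt;&lt;/pre&gt;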

&lt;h2&gt;
  
  
  Attention
&lt;/h2&gt;

&lt;p&gt;For problems such as machine translation, which require many-to-many mapping, encoder-decoder based RNN models are used. In the traditional RNN encoder-decoder architecture, the encoder provides a single context vector to the decoder. The problem with this approach is that it requires the encoder to squeeze all the information the decoder needs into that one vector, which is a difficult task for both the encoder and the decoder.&lt;/p&gt;

&lt;p&gt;Attention in RNNs is a mechanism that lets the decoder focus on certain parts of the input sequence when predicting certain parts of the output. This part of the post focuses on the idea behind attention and how it tries to solve the above problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QWmnRpg7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/spn8u4yf9unnwfmslgin.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QWmnRpg7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/spn8u4yf9unnwfmslgin.jpg" alt="attentionmechanism"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the attention mechanism, all the hidden states from the encoder are offered for the decoder's consideration. In our example, the hidden vectors are [h1, h2, h3, h4], while s(i-1) is the state of the decoder from the previous time step, which is used for the current time step's calculation. The attention mechanism begins with the calculation of attention weights: s(i-1) is concatenated with each hidden state vector and fed to a shallow fully connected layer, and a softmax function is applied to the output so that the weights sum to 1. The hidden state vectors [h1, h2, h3, h4] are then scaled by the attention weights and summed to produce the context vector. Because of the softmax, the weights capture the degree of relevance of each hidden vector: if a weight is close to 1, the decoder is heavily influenced by that particular hidden state. With this approach, the encoder is relieved of the burden of encoding all information into a single hidden vector, and the decoder is given greater context for its predictions.&lt;/p&gt;
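
&lt;p&gt;A compact sketch of this weight calculation with placeholder tensors (sizes, and the single-layer scorer attn_fc, are illustrative choices of mine):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, src_len, batch = 64, 4, 1
attn_fc = nn.Linear(hid_dim * 2, 1)                     # shallow fully connected scorer

encoder_hiddens = torch.randn(batch, src_len, hid_dim)  # [h1, h2, h3, h4]
s_prev = torch.randn(batch, hid_dim)                    # previous decoder state s(i-1)

s_rep = s_prev.unsqueeze(1).repeat(1, src_len, 1)       # pair s(i-1) with every hidden state
scores = attn_fc(torch.cat((s_rep, encoder_hiddens), dim=2)).squeeze(2)  # [batch, src_len]
weights = F.softmax(scores, dim=1)                      # attention weights sum to 1
context = torch.bmm(weights.unsqueeze(1), encoder_hiddens).squeeze(1)    # weighted sum of h1..h4
&lt;/code&gt;&lt;/pre&gt;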

</description>
    </item>
  </channel>
</rss>
