Shashank-Holla

Convolutional Sequence to Sequence learning - A Closer Look

In this post, we take a closer look at convolution-based sequence to sequence machine translation.

In simple words, machine translation is the translation of text from one language to another. Sequence to sequence (Seq2Seq), an encoder-decoder based architecture, is used to convert sequences in the source language into sequences in the target language. The widely known approach to sequence to sequence translation is via recurrent neural networks. Compared to RNNs, convolution-based networks are less common but have certain advantages. In this post, we will look at the inner workings of the encoder and decoder modules, using a PyTorch-based implementation of German to English translation as the running example.

Source code for this is available here.

Compared to recurrent networks, convolutional models create fixed-size context representations, and these can be made larger by stacking convolutional layers on top of each other. This gives the CNN control over the maximum length of the dependencies to be modeled. Similar to how successive convolution blocks in the image domain capture edges and gradients, then textures, then patterns, then parts of objects, a multi-layered CNN creates a hierarchical representation of the input sentence in which nearby input elements interact at lower layers and distant elements interact at higher layers. Computation in a convolutional network is applied over all elements in parallel during training, which exploits the GPU hardware.

The seq2seq model employs an encoder-decoder architecture for translation. The encoder's role is to encode the input sentence, which is in the source language, into a context vector. The encoder produces two context vectors per token, so if the input German sentence has 4 tokens, it produces 8 context vectors. The two context vectors produced by the encoder are the 'conved vector' and the 'combined vector'. The conved vector is produced by passing each token through a few fully connected layers and the convolutional block. The combined vector is the sum of the conved vector and the embedding of the token.
The decoder's role is to use these context vectors to produce the output sentence in the target language. Unlike recurrent models, the decoder predicts all the tokens in the target sentence in parallel. We'll look into the encoder's and decoder's workings separately.

Encoding process

Below is the architecture diagram of the encoder block. We'll look into each segment of this block in detail.

[Encoder architecture diagram]

1. Embedding vector

During data pre-processing, the input sentence in the source language was tokenized and indexed. Now, in the encoder, these tokens are passed through an embedding layer to create word embeddings. Unlike recurrent networks, which process each token sequentially, the CNN-based model processes all the tokens simultaneously. Therefore, the model does not inherently possess information about the position of the tokens within the sequence. To rectify this, information about the position of each token is passed along with the token embedding. This is done by passing the position of the token through a separate embedding layer to create a positional embedding. The token embedding and positional embedding are then elementwise summed to create the embedding vector.
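Here is a minimal sketch of this step. The vocabulary size, embedding dimension, maximum sentence length and batch shape below are hypothetical values chosen only for illustration; the actual values depend on the dataset and model configuration.

```python
import torch
import torch.nn as nn

# hypothetical sizes, for illustration only
vocab_size, emb_dim, max_len = 1000, 256, 100
tok_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_len, emb_dim)

src = torch.randint(0, vocab_size, (8, 12))                             # [batch size, src len]
pos = torch.arange(src.shape[1]).unsqueeze(0).repeat(src.shape[0], 1)   # 0, 1, 2, ... for each sentence

# elementwise sum of token and positional embeddings
embedded = tok_embedding(src) + pos_embedding(pos)                      # [batch size, src len, emb dim]
```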

2. Fully connected layer-1

The embedding vector is now passed through a fully connected layer. This adds capacity to the model as well as transforms the vector from the embedding dimension to the required hidden dimension size.
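A small sketch of this transformation, again with hypothetical dimensions; the permute at the end is only needed because `nn.Conv1d` in the next step expects the channel dimension second.

```python
import torch
import torch.nn as nn

# hypothetical dimensions for illustration
emb_dim, hid_dim = 256, 512
fc_emb2hid = nn.Linear(emb_dim, hid_dim)

embedded = torch.randn(8, 12, emb_dim)       # embedding vector, [batch size, src len, emb dim]
conv_input = fc_emb2hid(embedded)            # [batch size, src len, hid dim]

# Conv1d expects the channel dimension second, so permute before the convolutional block
conv_input = conv_input.permute(0, 2, 1)     # [batch size, hid dim, src len]
```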

3. Convolutional block

a) Convolution

During convolution, the kernel takes in n words from the sentence, where n equals the kernel size, and convolves over them to produce a feature map. Without padding, the length of the sentence after convolution reduces by (kernel size - 1). To keep the length of the sentence after convolution the same as before convolution, the sentence is padded with (kernel size - 1)/2 padding elements on each end. This is passed as the padding argument to the convolution function.
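A minimal sketch of such a convolution, assuming a hidden dimension of 512 and a kernel size of 3 (both hypothetical). Note the doubled output channels, which anticipate the GLU activation described next.

```python
import torch
import torch.nn as nn

hid_dim, kernel_size = 512, 3                # kernel size must be odd for symmetric padding

# output channels are doubled here because the GLU activation in the next step halves them
conv = nn.Conv1d(in_channels=hid_dim,
                 out_channels=2 * hid_dim,
                 kernel_size=kernel_size,
                 padding=(kernel_size - 1) // 2)

x = torch.randn(8, hid_dim, 12)              # [batch size, hid dim, src len]
out = conv(x)                                # [batch size, 2 * hid dim, src len] - length preserved
```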

b) GLU activation

The convolution output is then passed through an activation function called Gated Linear Units (GLU). The GLU function splits the input evenly into two tensors, applies a sigmoid to the second tensor and multiplies it elementwise with the first tensor. As per the authors, the gating mechanism allows selection of the words or features that are important for predicting the next word. Because of this split, GLU reduces the hidden dimension by a factor of 2. Therefore, the hidden dimension size is doubled during convolution so as to maintain the hidden dimension size through the GLU activation.
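PyTorch provides this activation directly as `F.glu`; a short sketch with hypothetical shapes shows how the channel dimension is halved back to the hidden dimension:

```python
import torch
import torch.nn.functional as F

hid_dim = 512
conv_out = torch.randn(8, 2 * hid_dim, 12)   # convolution output, [batch size, 2 * hid dim, src len]

# glu splits the tensor along dim=1 into two halves (a, b) and returns a * sigmoid(b),
# so the channel dimension is back to hid_dim
glu_out = F.glu(conv_out, dim=1)             # [batch size, hid dim, src len]
```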

c) Residual Addition

Similar to the residual connections in ResNet, the output from the GLU activation is elementwise summed with the same vector as it was before being passed through the convolution layer.

4. Fully connected layer-2

The vector from the convolutional block is now fed into a fully connected layer. This again adds capacity to the model and transforms the vector back from the hidden dimension to the embedding dimension. This vector is called the conved vector.

5. Residual layer

The conved vector is elementwise summed with the embedding vector via a residual connection that bypasses the convolutional block. This new vector is called the combined vector.

A conved vector and a combined vector are generated for each token in the input sentence.
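The sketch below ties these last two steps together with hypothetical tensors standing in for the real intermediate outputs. (The reference implementation also scales such residual sums by a constant, which is omitted here for clarity.)

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 256, 512
fc_hid2emb = nn.Linear(hid_dim, emb_dim)

conv_block_out = torch.randn(8, 12, hid_dim)  # output of the convolutional block, [batch size, src len, hid dim]
embedded = torch.randn(8, 12, emb_dim)        # embedding vector from step 1

conved = fc_hid2emb(conv_block_out)           # conved vector, [batch size, src len, emb dim]
combined = conved + embedded                  # combined vector, [batch size, src len, emb dim]
```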

Decoding process

The decoder takes in the actual target words and tries to predict them. Shared below are the segments that make up the decoder block.

[Decoder architecture diagram]

1. Embedding vector

As in the encoder, embeddings are calculated for the target tokens and their positions, and then elementwise summed.

2. Fully connected layer-1

The embedding vector is passed through a fully connected layer, which transforms it from the embedding dimension to the hidden dimension.

3. Convolution block

a) Convolution

Unlike the encoder, where padding was applied equally on both ends of the sentence, padding in the decoder is applied only at the beginning of the sentence. This ensures the kernel only looks at the current and previous words, and prevents it from looking at the next word (the token that needs to be predicted). This keeps the model from simply copying the next word instead of learning to translate. Apart from this change, the convolution is the same as the encoder's.
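A minimal sketch of this one-sided ("causal") padding, with hypothetical shapes; zero padding stands in here for whatever padding value the reference implementation uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hid_dim, kernel_size = 512, 3
conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)   # no symmetric padding this time

x = torch.randn(8, hid_dim, 10)                       # [batch size, hid dim, trg len]

# pad only at the beginning of the sequence so position t never sees tokens after t
x_padded = F.pad(x, (kernel_size - 1, 0))             # [batch size, hid dim, trg len + kernel size - 1]
out = conv(x_padded)                                  # [batch size, 2 * hid dim, trg len]
```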

b) GLU activation and attention

The GLU activation is similar to the one applied in the encoder. But after the GLU activation, attention is calculated using the encoder outputs as well as the embedding of the current word. The convolution output's dimension is first transformed by a fully connected layer and then summed with the target embedding through a residual connection. Attention is then calculated on this combination by checking how well it matches the encoder's conved output. This is done by first calculating the energy as the dot product of this combination with the encoder's conved output. A softmax function is then applied to the energy to obtain the attention weights. The dot product of the attention weights with the encoder's combined output then gathers information about the specific tokens of the encoded sequence that are most useful for making the prediction.
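A minimal sketch of the attention computation, with random tensors standing in for the real model outputs and hypothetical shapes; the scaling and projection layers of the reference implementation are omitted for brevity.

```python
import torch
import torch.nn.functional as F

# hypothetical tensors standing in for the real model outputs
batch, trg_len, src_len, emb_dim = 8, 10, 12, 256
decoder_combined = torch.randn(batch, trg_len, emb_dim)  # decoder conved output (projected to emb dim) + target embedding
encoder_conved   = torch.randn(batch, src_len, emb_dim)  # encoder's conved vectors
encoder_combined = torch.randn(batch, src_len, emb_dim)  # encoder's combined vectors

# energy: how well each decoder position matches each encoder position
energy = torch.matmul(decoder_combined, encoder_conved.permute(0, 2, 1))  # [batch, trg len, src len]
attention = F.softmax(energy, dim=2)

# attention-weighted sum over the encoder's combined vectors
attended = torch.matmul(attention, encoder_combined)     # [batch, trg len, emb dim]
```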

References:

1. Seq2Seq PyTorch tutorial - https://github.com/bentrevett/pytorch-seq2seq
2. Convolutional Sequence to Sequence Learning - https://arxiv.org/pdf/1705.03122.pdf
