Vishwajeet Pratap Singh

Convolutions in Sequence to Sequence networks

Sequence to sequence networks often use RNNs and their derivatives, along with attention, to solve problems in machine translation, question answering, text summarization, etc.

Solving the same problems using convolutions has shown better results. Here we will take a deep dive into how to design a convolutional sequence to sequence network.

You can find the source code here

Let's get started.

The network uses an encoder-decoder architecture along with attention.

[Figure: the convolutional sequence-to-sequence architecture]

The components of the model are -

  • Encoder
  • Decoder
  • Attention block

Let's see them one by one.

But before moving to these, let's see how convolution works here.

In RNNs we take a hidden state or a context vector from the previous time step, but that only makes the model memorize the next word for a given sequence; it captures very little contextual information or contextual relation between the words.

On the other hand, CNNs try to capture the spatial information hidden within the sentences.
The model is trained in such a way that each dimension of the embedding stores a certain concept within itself.
For example, let's say we have a 256-dimensional embedding and we are given two sentences:

  1. I am using apple.
  2. I am eating apple.

Here we see that the first sentence has the context of a device and the second has the context of an edible item.
When convolutions run over the embeddings of 'using apple' and 'eating apple', they try to capture their respective contexts (with the help of attention).
Let's assume the 50th dimension of the embedding holds the context of a device and the 100th dimension holds the context of an edible. The attention will only weight them highly if the context matches, and therefore the model learns better.
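
To make this concrete, here is a minimal sketch (assuming PyTorch) of a 1D convolution sliding over word embeddings; the tensor sizes are illustrative only.

```python
import torch
import torch.nn as nn

emb_dim = 256      # embedding dimension from the example above
sent_len = 4       # e.g. "I am eating apple"

embedded = torch.randn(1, sent_len, emb_dim)   # [batch, seq_len, emb_dim]

# Conv1d expects [batch, channels, seq_len], so the embedding dimension
# becomes the channel dimension.
conv = nn.Conv1d(in_channels=emb_dim, out_channels=emb_dim,
                 kernel_size=3, padding=1)     # odd kernel, length preserved

conved = conv(embedded.permute(0, 2, 1)).permute(0, 2, 1)

# Each output position now mixes a 3-word window, so the features computed
# at 'apple' differ depending on whether 'using' or 'eating' precedes it.
print(conved.shape)   # torch.Size([1, 4, 256])
```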

Now let's move to the encoder.

Encoder
[Figure: encoder architecture]

  • First, we pad the sentence at both ends.
  • The encoder takes the token embedding along with the positional embedding and performs their element-wise sum.
  • This sum is fed to a fully connected (FC) layer to convert it into an embedding of the desired size. The output of the FC layer is fed to the convolutional blocks.
  • The output of the convolutional blocks is passed into another FC layer; its output is called the conved output.
  • Here we add a skip connection with the element-wise sum (of the positional and token embeddings) to give the combined output (conved + embedded).
  • We send both outputs to the decoder: the conved output and the combined output. A minimal sketch of this flow follows the list.
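
Here is a minimal sketch of the encoder flow, assuming PyTorch; the names (Encoder, emb2hid, hid2emb) and the 0.5 residual scaling are my own illustrative choices, so see the linked source for the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers,
                 kernel_size, max_len=100):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, emb_dim)
        self.pos_embedding = nn.Embedding(max_len, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)   # FC into the conv blocks
        self.hid2emb = nn.Linear(hid_dim, emb_dim)   # FC out of the conv blocks
        self.convs = nn.ModuleList([
            nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                      padding=(kernel_size - 1) // 2)  # odd kernel, same length
            for _ in range(n_layers)])

    def forward(self, src):                          # src: [batch, src_len]
        pos = (torch.arange(src.shape[1], device=src.device)
                    .unsqueeze(0).repeat(src.shape[0], 1))
        # element-wise sum of token and positional embeddings
        embedded = self.tok_embedding(src) + self.pos_embedding(pos)
        conv_input = self.emb2hid(embedded).permute(0, 2, 1)
        for conv in self.convs:
            conved = F.glu(conv(conv_input), dim=1)   # GLU halves the channels
            conv_input = (conved + conv_input) * 0.5  # residual inside the block
        conved = self.hid2emb(conv_input.permute(0, 2, 1))
        combined = (conved + embedded) * 0.5          # skip with the embeddings
        return conved, combined                       # both go to the decoder
```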

Encoder Convolution Blocks

[Figure: encoder convolution blocks]

  • The input to the convolution blocks is the embedding for the full length of the sentence.
  • We use an odd-sized kernel and pad the inputs at each convolution, so the output keeps the sentence length.
  • We use the GLU activation function, as in the sketch below.
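
A minimal sketch of one such block, assuming PyTorch with illustrative sizes: padding of (kernel_size - 1) / 2 on each side keeps the output the same length as the input, and the GLU halves the doubled channels back down.

```python
import torch
import torch.nn.functional as F

hid_dim, kernel_size, seq_len = 512, 3, 10
conv = torch.nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                       padding=(kernel_size - 1) // 2)

x = torch.randn(1, hid_dim, seq_len)   # [batch, hid_dim, seq_len]
gated = F.glu(conv(x), dim=1)          # GLU: a * sigmoid(b), back to hid_dim
out = (gated + x) * 0.5                # residual connection (scaled)
print(out.shape)                       # torch.Size([1, 512, 10]), length kept
```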

Decoder
[Figure: decoder architecture]

  • The text input to the decoder is handled like the encoder's: we take the element-wise sum of the token and positional embeddings.
  • This sum is passed to an FC layer and then on to the convolutional blocks.
  • This time the convolutional blocks accept the element-wise sum as a skip connection, along with the conved output and combined output from the encoder.
  • The output of the convolutional blocks is again passed to an FC layer and then sent out to make the predictions, as in the sketch below.
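
A minimal sketch of the decoder, again assuming PyTorch with illustrative names; the attention call inside the loop is left as a comment here and sketched separately further below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, n_layers,
                 kernel_size, max_len=100):
        super().__init__()
        self.kernel_size = kernel_size
        self.tok_embedding = nn.Embedding(vocab_size, emb_dim)
        self.pos_embedding = nn.Embedding(max_len, emb_dim)
        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)
        self.convs = nn.ModuleList([
            nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)  # no built-in padding
            for _ in range(n_layers)])

    def forward(self, trg, encoder_conved, encoder_combined):
        pos = (torch.arange(trg.shape[1], device=trg.device)
                    .unsqueeze(0).repeat(trg.shape[0], 1))
        embedded = self.tok_embedding(trg) + self.pos_embedding(pos)
        conv_input = self.emb2hid(embedded).permute(0, 2, 1)
        for conv in self.convs:
            # pad only at the front so a position never sees future tokens
            padded = F.pad(conv_input, (self.kernel_size - 1, 0))
            conved = F.glu(conv(padded), dim=1)
            # (the attention over encoder_conved / encoder_combined is
            #  applied to `conved` here; see the attention sketch below)
            conv_input = (conved + conv_input) * 0.5
        conved = self.hid2emb(conv_input.permute(0, 2, 1))
        return self.out(conved)   # per-position predictions over the vocab
```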

Decoder Convolutional Blocks
[Figure: decoder convolution blocks]
The difference in the decoder convolutional block is that the input is padded twice at the front.
The logic behind this is straightforward: we are using a kernel of size 3, and padding the sentence before the token prevents the decoder from catching the context of the next prediction. Without this padding, the decoder would fail to learn, as the sketch below illustrates.
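
A minimal sketch of that front padding, assuming PyTorch: with a kernel of size 3 we prepend two positions, so the output at step t only depends on tokens up to t.

```python
import torch
import torch.nn.functional as F

kernel_size = 3
x = torch.randn(1, 512, 10)              # [batch, hid_dim, trg_len]
padded = F.pad(x, (kernel_size - 1, 0))  # pad the front only: length 10 -> 12
print(padded.shape)                      # torch.Size([1, 512, 12])
# A kernel-size-3 convolution brings the length back to 10, and position t
# never convolves over the token at t+1, so the decoder cannot peek ahead.
```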

Attention

The attention module is used to calculate weights for the features so that the model is able to focus on the necessary features.

  • The module takes the conved output from the encoder and the combined embedded output from the decoder; it matches them to produce attention weights and then performs a weighted sum over the encoder's combined output, as in the sketch below.
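
A minimal sketch of that step, assuming PyTorch; the function name and the plain dot-product matching are my own simplifications of what the source implements.

```python
import torch
import torch.nn.functional as F

def attend(decoder_combined, encoder_conved, encoder_combined):
    # decoder_combined: [batch, trg_len, emb_dim]
    # encoder_conved / encoder_combined: [batch, src_len, emb_dim]
    energy = torch.matmul(decoder_combined,
                          encoder_conved.permute(0, 2, 1))  # [batch, trg, src]
    attention = F.softmax(energy, dim=2)    # weight per source token
    # weighted sum over the encoder's combined (conved + embedded) output
    attended = torch.matmul(attention, encoder_combined)
    return attention, attended   # [batch, trg, src], [batch, trg, emb_dim]
```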
