
Muhammad Saim


Introduction to NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE

Introduction

Neural machine translation has proven more effective than traditional statistical modeling for translating sentences. This paper introduces the concept of attention in neural machine translation. Standard neural translators compress the whole source sentence into a single fixed-length vector, which causes them to struggle with longer sentences. Instead, this paper lets the decoder draw on a variable-length set of encoder vectors, which handles longer sentences much better.

NMT typically uses an encoder-decoder architecture in which the source sentence is encoded into a fixed-length vector. The model must compress all of the information into this single vector, which is a difficult task, and the performance of the basic encoder-decoder drops as the input sentence grows longer.

To address this issue, the authors introduce an extension of the encoder-decoder that learns to align and translate jointly. Each time the proposed model generates a target word, it searches the source sentence for the positions where the most relevant information is concentrated. The model then predicts the target word based on a context vector built from those positions and on all previously generated target words.

The Encoder-Decoder Framework in Neural Machine Translation

Before neural models, machine translation relied on statistical techniques: given a source sentence x, the model learns to find the translation y that maximizes the conditional probability, i.e. arg max_y p(y | x). The RNN encoder-decoder has two components: an encoder that maps the variable-length source sentence to a fixed-length vector, and a decoder that generates the variable-length target sentence from that vector.
In the encoder-decoder framework, the encoder reads the input sentence, a sequence of vectors x = (x_1, x_2, ..., x_Tx), into a vector c:
h_t = f(x_t, h_{t-1})
c = q({h_1, ..., h_Tx})
where h_t is the hidden state at time t. The decoder is trained to predict the next word y_t given the context vector c and all the previously generated words {y_1, ..., y_{t-1}}. In other words, the decoder decomposes the joint probability into ordered conditionals:
p(y) = ∏_t p(y_t | {y_1, ..., y_{t-1}}, c)
p(y_t | {y_1, ..., y_{t-1}}, c) = g(y_{t-1}, s_t, c)
where g is a non-linear function that outputs the probability of y_t, and s_t is the hidden state of the decoder RNN.
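As a rough illustration (not the paper's actual implementation), here is a minimal PyTorch sketch of this basic encoder-decoder, where the encoder's last hidden state plays the role of the fixed-length vector c; the class name, layer sizes, and the choice of GRU cells are assumptions for the example.

```python
# Minimal sketch of the basic encoder-decoder with a fixed-length vector c.
# Layer sizes, GRU cells, and names are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)  # h_t = f(x_t, h_{t-1})
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)                   # scores for p(y_t | ...)

    def forward(self, src, tgt):
        _, c = self.encoder(self.src_emb(src))              # c = q({h_1, ..., h_Tx}) = h_Tx here
        dec_states, _ = self.decoder(self.tgt_emb(tgt), c)  # s_t, conditioned on c and y_{t-1}
        return self.out(dec_states)                         # logits over the target vocabulary
```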

Learning to Align and Translate

NMT approaches before this paper used a standard RNN encoder. Here, the authors use a bidirectional RNN as the encoder, together with a decoder that emulates searching through the source sentence while decoding the translation.
In the proposed architecture, each conditional probability is defined as:
p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i)
where y_{i-1} is the previously generated target word, s_i is the decoder's RNN hidden state for step i, and c_i is a context vector computed separately for each target word, unlike the single fixed vector c above.
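The context vector c_i comes from the attention (alignment) mechanism: each source annotation h_j is scored against the previous decoder state s_{i-1}, the scores e_ij are normalized with a softmax into weights α_ij, and c_i is the weighted sum Σ_j α_ij h_j. Below is a minimal PyTorch sketch of that additive scoring function; the layer names (W, U, v) and sizes are assumptions for illustration.

```python
# Minimal sketch of additive ("align") attention building c_i = sum_j alpha_ij * h_j.
# Layer names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dec_dim=512, enc_dim=1024, attn_dim=256):
        super().__init__()
        self.W = nn.Linear(dec_dim, attn_dim)   # projects the previous decoder state s_{i-1}
        self.U = nn.Linear(enc_dim, attn_dim)   # projects the encoder annotations h_j
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, dec_dim), annotations: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(self.W(s_prev).unsqueeze(1) + self.U(annotations)))  # e_ij
        alpha = torch.softmax(scores, dim=1)         # alignment weights alpha_ij over source positions
        context = (alpha * annotations).sum(dim=1)   # context vector c_i
        return context, alpha.squeeze(-1)
```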

Bidirectional RNN

The usual RNN reads the input sequence x in order, from x_1 to x_Tx. In this paper, the annotation of each word should summarize not only the preceding words but also the following ones, so a bidirectional RNN is used. The forward RNN reads the input (x_1, ..., x_Tx) and computes the forward hidden states (h_1, ..., h_Tx); the backward RNN then reads the sequence in reverse order (x_Tx, ..., x_1) and computes the backward hidden states (h_Tx, ..., h_1). The annotation h_j of each word x_j is obtained by concatenating its forward and backward hidden states.
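A short sketch of how such a bidirectional encoder can be obtained in PyTorch; the annotation for word j is simply the concatenation of the forward and backward states at position j (sizes here are assumptions):

```python
# Minimal sketch of the bidirectional encoder producing annotations h_j.
# Embedding and hidden sizes are illustrative assumptions.
import torch
import torch.nn as nn

emb_dim, hid_dim = 256, 512
encoder = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

x = torch.randn(1, 7, emb_dim)       # one embedded source sentence of 7 words
annotations, _ = encoder(x)          # shape: (1, 7, 2 * hid_dim)
# annotations[:, j, :hid_dim]  -> forward state at j  (summary of x_1 .. x_j)
# annotations[:, j, hid_dim:]  -> backward state at j (summary of x_j .. x_Tx)
```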

The model maintains a good BLEU score even on longer sentences; the score of RNNsearch-50, in particular, holds up well as sentence length grows.

Results

Two kinds of models are compared: RNNsearch (the proposed model) and RNNencdec (the basic encoder-decoder). RNNencdec has 1000 hidden units in both its encoder and its decoder. The encoder of RNNsearch consists of forward and backward RNNs with 1000 hidden units each, and its decoder also has 1000 hidden units. Both models are trained to maximize the conditional probability of the correct translation.
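For context, maximizing the conditional probability during training amounts to minimizing the cross-entropy (negative log-likelihood) of the reference translation; a tiny sketch with made-up shapes:

```python
# Training objective sketch: maximize log p(y | x), i.e. minimize cross-entropy.
# The logits would come from a model like the sketches above; shapes are made up.
import torch
import torch.nn.functional as F

vocab = 30000
logits = torch.randn(2, 5, vocab, requires_grad=True)   # (batch, tgt_len, vocab)
targets = torch.randint(0, vocab, (2, 5))                # reference target word ids
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()   # gradient descent on this loss = gradient ascent on log-likelihood
```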
