The GRU Model That Stopped Learning at BLEU 18
Our German-to-English translation model plateaued at BLEU 18.3 after 50 epochs. The loss kept dropping, but the translations were garbage: wrong verb conjugations, scrambled word order, and anything past 30 tokens degenerating into repetitive nonsense. I'd been using a fairly standard GRU-based seq2seq with attention, the same architecture as Bahdanau et al. (2015). It worked fine on short sentences. Then someone tried feeding it a paragraph from a legal document.
The model translated "Der Vertrag wird am ersten Tag des Folgemonats wirksam" (The contract becomes effective on the first day of the following month) into "The contract will be on the day of the month." Half the semantic content just vanished.
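For readers who haven't seen that baseline up close: Bahdanau-style attention scores each encoder hidden state against the current decoder state with a small feed-forward network, then takes a softmax-weighted sum of the encoder states as the context vector. Here's a minimal PyTorch sketch of that scoring step (module and dimension names are illustrative, not the actual training code):

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score(s, h_j) = v^T tanh(W_s s + W_h h_j)."""

    def __init__(self, dec_dim: int, enc_dim: int, attn_dim: int):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)  # project decoder state
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)  # project encoder states
        self.v = nn.Linear(attn_dim, 1, bias=False)          # collapse to a scalar score

    def forward(self, dec_state: torch.Tensor, enc_states: torch.Tensor):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.W_s(dec_state).unsqueeze(1) + self.W_h(enc_states)
        ))                                           # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)       # one weight per source token
        context = (weights * enc_states).sum(dim=1)  # (batch, enc_dim)
        return context, weights.squeeze(-1)
```

The context vector is recomputed at every decoder step, so the attention weights themselves aren't the bottleneck; the recurrence around them is, which is what the rest of this post digs into.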
This post walks through migrating that RNN-based seq2seq to a Transformer, the real errors that popped up along the way, and how the final model hit BLEU 51.2 on the WMT14 De-En test set using the same training data.
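The post doesn't say which BLEU implementation was used; sacrebleu is the usual choice for WMT test sets, and that's what this sketch assumes. Scoring model outputs against aligned references looks like this:

```python
import sacrebleu

# Model outputs and reference translations, aligned line by line.
hypotheses = ["The contract becomes effective on the first day of the following month."]
references = [["The contract takes effect on the first day of the following month."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```

The command-line equivalent, which also fetches the official test set, is `sacrebleu -t wmt14 -l de-en -i translations.txt`.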
Why RNN Seq2Seq Chokes on Long Sequences