TildAlice

Posted on • Originally published at tildalice.io

LSTM Attention vs Self-Attention: How Bahdanau Evolved

The Gap Between Bahdanau (2014) and Vaswani (2017)

Most people think self-attention appeared out of nowhere in "Attention Is All You Need." But the path from RNN-based attention to pure self-attention involved three years of incremental fixes to one core problem: how do you let a decoder focus on the right input tokens without serializing computation over the entire sequence?

Bahdanau et al.'s 2014 paper (Neural Machine Translation by Jointly Learning to Align and Translate) introduced attention as a fix for encoder-decoder bottlenecks in seq2seq models. The idea was simple: instead of compressing the entire source sentence into a single fixed-size context vector, compute a weighted sum over all encoder hidden states at each decoding step. The weights (attention scores) tell the model which input tokens matter most for the current output token.

The context vector is a weighted sum:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

where $c_i$ is the context vector for decoder timestep $i$, $h_j$ are encoder hidden states, and $\alpha_{ij}$ are attention weights computed via:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$e_{ij} = a(s_{i-1}, h_j)$$

Here $s_{i-1}$ is the previous decoder state, and $a$ is a small feedforward network (the alignment model) that scores how well input position $j$ matches output position $i$.
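The three formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact implementation: the alignment model $a$ is written in Bahdanau's additive form $v^\top \tanh(W_s s_{i-1} + W_h h_j)$, and the parameter names `W_s`, `W_h`, `v` and all dimensions are illustrative choices.

```python
import numpy as np

def additive_attention(s_prev, H, W_s, W_h, v):
    """One step of Bahdanau-style additive attention.

    s_prev : previous decoder state, shape (d_s,)
    H      : encoder hidden states, shape (T_x, d_h)
    W_s, W_h, v : alignment-model parameters (illustrative names)

    Returns the context vector c_i and attention weights alpha_i.
    """
    # Alignment scores e_ij = v^T tanh(W_s s_{i-1} + W_h h_j), one per input position
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v      # shape (T_x,)

    # Softmax over input positions gives the attention weights alpha_ij
    alpha = np.exp(e - e.max())                   # subtract max for numerical stability
    alpha = alpha / alpha.sum()

    # Context vector c_i = sum_j alpha_ij * h_j
    c = alpha @ H                                 # shape (d_h,)
    return c, alpha
```

At each decoding step the decoder would feed `c` (concatenated with its own state) into the next RNN update, so the weighted sum is recomputed once per output token.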


Continue reading the full article on TildAlice
