Seq2Seq and Encoder-Decoder: the one-vector bottleneck that led to attention

#ai #nlp #machinelearning #deeplearning

Before Transformers, before modern chatbots, there was a beautifully simple idea for turning one sequence into another: read the whole thing, squeeze it into a single vector, then unroll that vector back out into a new sequence. That is sequence-to-sequence learning, and it powered the first wave of neural machine translation. It also had a flaw so obvious in hindsight that fixing it led to attention, and from there to the Transformer.

I built an interactive page where you can watch the whole thing happen, step by step, on a toy task. Here is the walkthrough.

The problem: variable in, variable out

Most of the models people meet first are classifiers: one input, one label. But a huge class of real problems maps a whole sequence to a different whole sequence, and the lengths do not match. Translate three French words into five English ones. Turn a long article into a two-line summary. Transcribe audio into text. A plain RNN that emits one output per input step cannot do this, because it forces the output to be exactly as long as the input. We need something that can read everything first, then generate an output of whatever length it wants.

The trick: two RNNs in series

The seq2seq answer is disarmingly clean. Use two networks.

The encoder is an RNN that reads the input one token at a time, updating a hidden state at every step with a genuine recurrence: h_t = tanh(Wx·x_t + Wh·h_{t-1} + b). After the last token, its final hidden state is taken as the context vector (some people call it the "thought vector"). In principle that one fixed-size vector now holds everything the model needs to know about the input.
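To make the recurrence concrete, here is a minimal NumPy sketch of the encoder loop. The embedding table E, the sizes, and the random untrained weights are assumptions for illustration, not the demo's actual code.

```python
import numpy as np

def encode(tokens, E, Wx, Wh, b):
    """Run a vanilla RNN over token ids; return (all hidden states, final state)."""
    h = np.zeros(Wh.shape[0])                 # h_0 starts at zero
    states = []
    for t in tokens:
        x_t = E[t]                            # embedding lookup (assumed input encoding)
        h = np.tanh(Wx @ x_t + Wh @ h + b)    # h_t = tanh(Wx·x_t + Wh·h_{t-1} + b)
        states.append(h)
    return np.stack(states), h

# Hypothetical sizes and random, untrained weights.
rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 12, 4, 8
E = rng.normal(size=(vocab_size, embed_dim))
Wx = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
Wh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

states, context = encode([3, 1, 4], E, Wx, Wh, b)
print(states.shape, context.shape)            # (3, 8) (8,)
```

The function returns every per-step state, but plain seq2seq hands only the last one to the decoder as the context vector.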

The decoder is a second RNN whose initial hidden state is set to that context. It generates the output one token at a time: take the previous token, update the state, project to a probability distribution over the vocabulary, pick a token, feed it back in. Decoupling reading from writing is exactly what lets the two sides have different lengths and even different vocabularies.
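A single decoder step can be sketched like this. The names Wo and bo for the output projection are hypothetical, the weights are random and untrained, and the greedy argmax is just one possible way to "pick a token".

```python
import numpy as np

def softmax(z):
    z = z - z.max()                        # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def decoder_step(prev_token, h, E, Wx, Wh, b, Wo, bo):
    """Previous token and state in; vocabulary distribution and new state out."""
    x = E[prev_token]                      # embed the previous output token
    h = np.tanh(Wx @ x + Wh @ h + b)       # same form of recurrence, decoder's own weights
    probs = softmax(Wo @ h + bo)           # project the state onto the vocabulary
    return probs, h

# Hypothetical sizes and random, untrained weights.
rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 12, 4, 8
E = rng.normal(size=(vocab_size, embed_dim))
Wx = rng.normal(scale=0.5, size=(hidden_dim, embed_dim))
Wh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
Wo = rng.normal(scale=0.5, size=(vocab_size, hidden_dim))
bo = np.zeros(vocab_size)

context = rng.normal(size=hidden_dim)      # stand-in for the encoder's final state
probs, h = decoder_step(0, context, E, Wx, Wh, b, Wo, bo)
next_token = int(np.argmax(probs))         # greedy pick
print(probs.shape, next_token)
```

Because the decoder has its own embedding table and output projection, nothing ties its vocabulary or its number of steps to the encoder's.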

Starting and stopping: SOS and EOS

Since the output length is not fixed, the model needs markers. A special start-of-sequence token <SOS> is the decoder's very first input, kicking off generation from nothing. A special end-of-sequence token <EOS> is something the decoder can emit to say "I am done." Decoding loops until <EOS> shows up (or a max length is hit). Without <EOS> the model would never know when to stop.
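The loop itself is short. In this sketch the token ids for <SOS> and <EOS>, the digit-to-id mapping, and scripted_step (a stand-in that replays a fixed answer instead of a trained decoder) are all assumptions made so the example runs on its own.

```python
SOS, EOS = 0, 1          # assumed ids for the special tokens
VOCAB_SIZE = 12          # assumed: 2 specials + digits 0-9 mapped to ids 2..11

def greedy_decode(step_fn, context, max_len=20):
    """Feed <SOS>, then loop until the decoder emits <EOS> or max_len is hit."""
    token, h = SOS, context
    output = []
    for _ in range(max_len):
        probs, h = step_fn(token, h)
        token = max(range(len(probs)), key=probs.__getitem__)   # argmax
        if token == EOS:
            break                                               # model says "I am done"
        output.append(token)
    return output

def scripted_step(prev_token, remaining):
    """Hypothetical stand-in for a trained decoder: replays a fixed sequence, then <EOS>."""
    nxt = remaining[0] if remaining else EOS
    probs = [0.0] * VOCAB_SIZE
    probs[nxt] = 1.0
    return probs, remaining[1:]

print(greedy_decode(scripted_step, (6, 3, 5)))             # [6, 3, 5]
print(greedy_decode(scripted_step, (6, 3, 5), max_len=2))  # [6, 3]
```

The second call shows the max-length guard: the loop stops after two tokens even though <EOS> never appeared.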

Teacher forcing vs free-running

This part trips people up, so the demo has a toggle for it.

During training you already know the correct output, so at each decoder step you feed the true previous target token rather than the model's own guess. This is teacher forcing, and it stops early mistakes from snowballing across the whole sequence, which makes learning fast and stable.

At inference there is no ground truth, so the decoder runs autoregressively: it feeds its own previous prediction back in as the next input. Same network, decoded two different ways. Flip the toggle in the demo and watch the decoder's input stream switch from true targets to its own outputs.
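The only difference between the two modes is which token becomes the next input. Here is a small sketch of that switch; toy_model is a deliberately wrong, hypothetical stand-in for a decoder (it ignores hidden state entirely), and the token ids are assumptions.

```python
SOS, EOS = 0, 1                      # assumed ids for the special tokens

def run_decoder(predict_fn, targets, teacher_forcing):
    """Return (inputs fed to the decoder, its predictions) for one target sequence."""
    inputs, preds = [], []
    prev = SOS
    for t in range(len(targets)):
        inputs.append(prev)
        pred = predict_fn(prev)
        preds.append(pred)
        # Teacher forcing feeds the true target; free-running feeds the model's own guess.
        prev = targets[t] if teacher_forcing else pred
    return inputs, preds

def toy_model(prev):
    """Hypothetical bad decoder: always predicts (previous token + 2) mod 12."""
    return (prev + 2) % 12

targets = [6, 3, 5, EOS]
print(run_decoder(toy_model, targets, teacher_forcing=True))
# ([0, 6, 3, 5], [2, 8, 5, 7])   inputs are <SOS> plus the true targets
print(run_decoder(toy_model, targets, teacher_forcing=False))
# ([0, 2, 4, 6], [2, 4, 6, 8])   inputs are the model's own earlier outputs
```

With teacher forcing, each wrong prediction is discarded before the next step. In free-running mode the first wrong guess becomes the next input, so every later step builds on it.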

The bottleneck you can feel

Here is the fatal flaw, and the reason the page exists. No matter how long the input is, the encoder must cram all of it into one fixed-size context vector. Five tokens? Fine. Fifty tokens? Same-size vector, and information gets crushed, especially about the tokens read early on. Empirically, early encoder-decoder translation systems showed quality dropping sharply once sentences grew past roughly twenty to thirty words.
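A quick way to see the fixed-size constraint is to encode inputs of very different lengths and look at what comes out. This sketch assumes one-hot digit inputs, an 8-dimensional state, and random untrained weights. It only demonstrates the shape constraint, not the loss of accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab_size = 8, 10              # assumed sizes: 8-dim state, digits 0-9
Wx = rng.normal(scale=0.5, size=(hidden_dim, vocab_size))
Wh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def final_state(digits):
    """Encode a digit sequence and return only the last hidden state."""
    h = np.zeros(hidden_dim)
    for d in digits:
        x = np.eye(vocab_size)[d]           # one-hot encoding of the digit
        h = np.tanh(Wx @ x + Wh @ h + b)
    return h

for n in (3, 9, 50):
    digits = rng.integers(0, 10, size=n)
    print(n, final_state(digits).shape)     # the shape is (8,) for every length
```

Whether the input has 3 digits or 50, the decoder receives the same eight numbers to work from.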

The demo makes this visceral. It runs a real encoder-decoder on a toy task: reverse a digit string, so 3 1 4 should come back as 4 1 3. The hidden state you see updating is genuine recurrence math, not a scripted cartoon. With a short input it reverses perfectly. Then you drag the length slider up, and the very same machine that nailed a 3-digit string starts mangling a 9-digit one, because its small 8-dimensional context can no longer hold nine tokens reliably. A little "pressure" meter turns from green to red as you overload it.

Why this matters: attention and the Transformer

Staring at that bottleneck, someone asked the obvious question. If one vector is too small, why force the decoder to use only the encoder's last state? Attention (2014) keeps every encoder hidden state and lets the decoder, at each output step, compute a weighted blend over all of them, focusing on the input positions most relevant to the token it is about to produce. The decoder no longer depends on a single summary vector, and translation quality on long sentences improved markedly.
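The weighted blend is only a few lines. This sketch uses dot-product scoring for brevity, which is a simplification: the original 2014 formulation scored each position with a small learned network. The encoder states and the decoder state below are hand-made values chosen so the result is easy to read.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """Blend all encoder states, weighted by similarity to the current decoder state."""
    scores = encoder_states @ decoder_state        # one score per input position
    scores = scores - scores.max()                 # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ encoder_states             # weighted sum, shape (hidden_dim,)
    return context, weights

# Four hand-made encoder states (rows), each 8-dimensional and mutually orthogonal.
encoder_states = np.eye(4, 8)
# A decoder state that points strongly toward input position 2.
decoder_state = 5.0 * encoder_states[2]

context, weights = attend(decoder_state, encoder_states)
print(weights.round(3))          # most of the weight lands on position 2
print(int(np.argmax(weights)))   # 2
```

A fresh context is computed at every output step, so the decoder can look at different input positions for different output tokens.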

Attention worked so well that researchers asked a bolder question: do we even need the RNN? The Transformer (2017) threw out recurrence entirely and built everything from self-attention. Dropping the sequential loop lets all positions of a sequence be processed in parallel during training, which is why it scales so well on GPUs. Nearly every modern LLM descends from this line: seq2seq, then attention, then Transformer.

The encoder-decoder shape you build here still underlies translation, summarisation, speech, image captioning, and some LLMs (many others keep only the decoder half). Learn it and you have learned the skeleton that everything after it refines.

Play with the encoder, decoder, teacher-forcing toggle and the length slider here:

https://dev48v.infy.uk/dl/day24-seq2seq.html