Vanilla RNNs forget the distant past because gradients vanish over long sequences. LSTMs address this with a protected "cell state" and three gates. Drop a memory early, feed noise, and watch the LSTM still hold it 20 steps later while a plain RNN bleeds it away.
🚪 See who remembers: https://dev48v.infy.uk/dl/day11-lstm.html
The cell state: a conveyor belt
The LSTM adds a separate cell state c that flows down the sequence almost untouched. It is edited only by gentle, elementwise gated operations, rather than being pushed through a weight matrix and a squashing nonlinearity at every step. As a result, information (and gradients) can survive long gaps.
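To make the contrast concrete, here is a minimal scalar sketch of the two update rules. It is a toy, not a trained network: the recurrent weight `0.5`, the forget value `0.99`, and the function names `plainRnnTrace` and `cellStateTrace` are all assumptions chosen for illustration.

```javascript
// Hypothetical scalar toy: only the two update rules, no learning involved.
function plainRnnTrace(h0, w, steps) {
  // Each step re-multiplies the state by the recurrent weight and squashes it.
  const trace = [h0];
  let h = h0;
  for (let t = 0; t < steps; t++) {
    h = Math.tanh(w * h);
    trace.push(h);
  }
  return trace;
}

function cellStateTrace(c0, forget, steps) {
  // The cell state is only scaled by the forget gate: no weight, no squashing.
  const trace = [c0];
  let c = c0;
  for (let t = 0; t < steps; t++) {
    c = forget * c;
    trace.push(c);
  }
  return trace;
}

const rnn = plainRnnTrace(1.0, 0.5, 20);    // assumed recurrent weight 0.5
const cell = cellStateTrace(1.0, 0.99, 20); // assumed forget gate 0.99

console.log("plain RNN after 20 steps:", rnn[20].toExponential(2));
console.log("cell state after 20 steps:", cell[20].toFixed(3)); // 0.818
```

With these assumed numbers the plain RNN state shrinks to a tiny fraction of its starting value, while the gated cell keeps about 82% of it. A recurrent weight above 1 would behave differently, so treat this as an illustration of the decay case only.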
The three gates (each a sigmoid 0→1)
const f = sigmoid(Wf·[h,x]); // FORGET gate: keep (~1) or erase (~0)
c = f * c;
const i = sigmoid(Wi·[h,x]); // INPUT gate: how much to write
const g = tanh(Wg·[h,x]);    // candidate values to write
c = c + i * g;               // write new info selectively
const o = sigmoid(Wo·[h,x]); // OUTPUT gate
h = o * tanh(c);             // reveal a filtered view
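- Forget gate ≈ 1 → the memory rides along untouched (no decay). The gradient flowing back through c is scaled by that same gate value, so it passes nearly unchanged too. That's what counters vanishing gradients.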
- Input gate decides what new fact to latch (a subject, a flag) and what to ignore.
- Output gate lets the cell hold a fact quietly for many steps and surface it only when relevant.
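The three gates above can be sketched as a runnable scalar step. This is a hedged illustration rather than a trained model: every weight in `params` is hand-picked (hypothetical) so that the forget gate stays near 1 and the input gate only opens for a large input, and the "noise" is a small deterministic sine wave so the run is reproducible.

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// One scalar LSTM step. Each gate has hypothetical weights {wh, wx, b}
// applied to the previous hidden state h and the current input x.
function lstmStep(p, h, c, x) {
  const pre = (gate) => gate.wh * h + gate.wx * x + gate.b;
  const f = sigmoid(pre(p.forget));  // FORGET: how much old memory to keep
  const i = sigmoid(pre(p.input));   // INPUT: how much of the candidate to write
  const g = Math.tanh(pre(p.cand));  // candidate value
  const o = sigmoid(pre(p.output));  // OUTPUT: how much of the memory to reveal
  const cNext = f * c + i * g;
  const hNext = o * Math.tanh(cNext);
  return { h: hNext, c: cNext, f, i, o };
}

// Hand-picked, untrained parameters (assumptions for this demo only).
const params = {
  forget: { wh: 0, wx: 0, b: 6 },   // bias keeps f close to 1
  input: { wh: 0, wx: 10, b: -5 },  // opens for x near 1, nearly shut for small x
  cand: { wh: 0, wx: 2, b: 0 },
  output: { wh: 0, wx: 0, b: 2 },
};

// One strong input to memorize, then 20 steps of small deterministic "noise".
const inputs = [1.0];
for (let t = 1; t <= 20; t++) inputs.push(0.1 * Math.sin(1.7 * t));

let state = { h: 0, c: 0 };
const cTrace = [];
for (const x of inputs) {
  state = lstmStep(params, state.h, state.c, x);
  cTrace.push(state.c);
}

console.log("c right after the write:", cTrace[0].toFixed(3));
console.log("c 20 noisy steps later:", cTrace[20].toFixed(3));
```

In a real LSTM the gates are vectors and the weights are learned. The point of fixing them by hand here is only to show that, once the forget gate sits near 1 and the input gate stays mostly shut, the stored value barely moves.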
And GRU
A leaner cousin: two gates (update and reset), with the cell and hidden state merged into one. It is often as good as an LSTM with less compute. LSTMs and GRUs powered translation, speech, and text generation for years, and they were the bridge from RNNs to attention (the Transformer).
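For comparison, a scalar GRU step might look like the sketch below. It assumes one common formulation, where the update gate z blends the old state with a candidate as `(1 - z) * h + z * hCand`; some libraries swap the roles of `z` and `1 - z`. The weights in `gruParams` are hand-picked and hypothetical, as is the deterministic sine "noise".

```javascript
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// One scalar GRU step with hypothetical weights {wh, wx, b} per gate.
function gruStep(p, h, x) {
  const z = sigmoid(p.update.wh * h + p.update.wx * x + p.update.b); // UPDATE gate
  const r = sigmoid(p.reset.wh * h + p.reset.wx * x + p.reset.b);    // RESET gate
  // The reset gate scales how much of the old state feeds the candidate.
  const hCand = Math.tanh(p.cand.wh * (r * h) + p.cand.wx * x + p.cand.b);
  // A single state h plays the role of both memory and output.
  const hNext = (1 - z) * h + z * hCand;
  return { h: hNext, z, r };
}

// Hand-picked, untrained parameters (assumptions for this demo only).
const gruParams = {
  update: { wh: 0, wx: 10, b: -5 }, // opens for x near 1, nearly shut for small x
  reset: { wh: 0, wx: 0, b: 0 },
  cand: { wh: 0, wx: 2, b: 0 },
};

// Same experiment as before: one strong input, then 20 small noisy inputs.
let h = 0;
const hTrace = [];
for (let t = 0; t <= 20; t++) {
  const x = t === 0 ? 1.0 : 0.1 * Math.sin(1.7 * t);
  h = gruStep(gruParams, h, x).h;
  hTrace.push(h);
}

console.log("h right after the write:", hTrace[0].toFixed(3));
console.log("h 20 noisy steps later:", hTrace[20].toFixed(3));
```

Here the update gate does double duty: a value near 0 both preserves the old state and blocks the new candidate, which is how the GRU gets by with one fewer gate and no separate cell state.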
The takeaway
A gated conveyor belt: keep, write, reveal — memory that lasts. Test it.