"The present contains nothing more than the past, and what is found in the effect was already in the cause."
— Henri Bergson
Everything We Built Assumed a Snapshot
Look back at every network we've built so far.
A perceptron takes a fixed input vector and draws a line. An MLP stacks layers to bend that line into curves. A CNN slides filters across an image to detect spatial patterns. Even with all that sophistication (the convolutions, the pooling, the skip connections), every single one treats the input as a static snapshot. Feed it in, get a prediction out. The order of inputs doesn't matter. There's no before or after.
That assumption works perfectly for images. A digit is a digit regardless of what came before it.
But language isn't a snapshot. Neither is audio, or time series, or any signal where what came before changes the meaning of what comes after.
"My teacher said I was slow, but he didn't know I was just getting started."
What does "he" refer to? The teacher, obviously. But only because you held "my teacher" in mind while reading the rest. You carried context forward — unconsciously, effortlessly.
Every architecture we've built so far would fail this test. None of them has a mechanism for carrying anything forward.
That's the gap RNNs were built to fill.
Learning to Read — Letter by Letter
I remember learning to read. Not the fluent reading I do now, but the early, effortful kind.
Each letter had to be identified consciously. Then combined with the next to form a sound. Then sounds stitched into a word. Then words assembled into meaning. It was slow, sequential, and exhausting. And crucially, by the time I reached the end of a long sentence, I'd often forgotten how it started.
That's a vanilla RNN.
It processes sequences one step at a time, maintaining a hidden state (a running summary of everything seen so far) and updating it at each step:
# At each step t:
hidden(t) = tanh( W_h × hidden(t-1) + W_x × input(t) )
output(t) = W_o × hidden(t)
The hidden state is the memory. It blends the new input with what came before. The same weights are reused at every step; the network doesn't learn separate rules for position 1 versus position 50. One set of weights, applied repeatedly across time.
h(0) ──► h(1) ──► h(2) ──► h(3) ──► ...
▲ ▲ ▲ ▲
│ │ │ │
x(0) x(1) x(2) x(3)
Elegant. And it works, for short sequences. Just like the early reader who handles a short word fine but loses the thread of a long sentence.
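To make the recurrence concrete, here's a minimal NumPy sketch of those two equations. The dimensions, random weights, and five-step toy sequence are illustrative assumptions, not the playground's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 8, 16, 4  # illustrative sizes

# One set of weights, reused at every time step
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_o = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_forward(inputs):
    """Run the recurrence over a sequence; returns the per-step outputs."""
    h = np.zeros(hidden_dim)
    outputs = []
    for x in inputs:                      # one step per element, in order
        h = np.tanh(W_h @ h + W_x @ x)    # hidden(t): blend past and present
        outputs.append(W_o @ h)           # output(t): read off the hidden state
    return outputs

sequence = [rng.normal(size=input_dim) for _ in range(5)]
outs = rnn_forward(sequence)
print(len(outs), outs[0].shape)  # 5 (4,)
```

Notice that reordering the sequence changes every hidden state after the swap point — the "before and after" finally matters.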
Training It: Backprop Through Time
Training uses the same backpropagation from Post 3 — unrolled across time steps. To compute how much each weight contributed to the final loss, you trace gradients backward through every step:
x(0)→[RNN]→h(0)→[RNN]→h(1)→[RNN]→h(2)→[RNN]→h(3)→ Loss
│
gradients flow backward ◄─┘
through every time step
Same chain rule. Just applied across time instead of across layers. The depth is now temporal rather than architectural.
And here's where the familiar problem returns.
The Long Sentence Problem
Remember the vanishing gradient from Post 7? Gradients shrink as they travel backward through many layers: multiply enough numbers less than 1 together and you approach zero.
The same thing happens here, but across time steps instead of layers.
At each step backward, the gradient gets multiplied by the recurrent weight matrix W_h (and by the derivative of tanh, which is at most 1). For a sequence of 50 words, that's 50 multiplications. The gradient reaching step 1 is effectively zero.
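You can watch this happen numerically. In the sketch below, the matrix scale and the constant standing in for the tanh derivative are illustrative assumptions; the point is only the repeated multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16

# A recurrent weight matrix with modest entries (scale is illustrative)
W_h = rng.normal(scale=0.15, size=(hidden_dim, hidden_dim))

# Backprop through time multiplies the gradient by the step Jacobian at
# every step: roughly W_h transposed, scaled by the tanh derivative (<= 1).
grad = np.ones(hidden_dim)
norms = []
for _ in range(50):
    grad = W_h.T @ grad * 0.5   # 0.5 stands in for a typical tanh derivative
    norms.append(np.linalg.norm(grad))

print(f"after 1 step: {norms[0]:.3f}, after 50 steps: {norms[-1]:.2e}")
```

After 50 steps the gradient norm has collapsed by many orders of magnitude — whatever happened at the start of the sequence can no longer influence the weights.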
Like my early reading days: by the end of a long sentence, I'd forgotten how it started.
In Post 7, skip connections fixed vanishing gradients by adding a direct additive path that bypassed the layers. We need the same idea, but for time.
LSTM: Learning to Read Fluently
Think about what changes when reading becomes fluent.
You stop processing letter by letter. You chunk into words, phrases, meaning. More importantly, you become selective. You don't hold every word in memory with equal weight. You retain what matters: the subject, the tension, the unresolved question. You discard the filler. And you do this automatically, without thinking.
That selectivity is exactly what the Long Short-Term Memory network (Hochreiter & Schmidhuber, 1997) introduced.
An LSTM has two states instead of one: a hidden state h (what it's currently working with) and a cell state c (long-term memory). The cell state is the key innovation: it runs through the sequence with only small, controlled modifications. Like the skip connection in ResNets, it's an additive path that lets gradients flow backward without decaying at every step.
Three gates control what happens to memory at each step:
Forget gate: should I clear out old memory?
f = sigmoid( W_f × [h(t-1), x(t)] )
Input gate: is this new input worth remembering?
i = sigmoid( W_i × [h(t-1), x(t)] )
candidate = tanh( W_c × [h(t-1), x(t)] )
Output gate: what should I act on right now?
o = sigmoid( W_o × [h(t-1), x(t)] )
Update:
cell: c(t) = f × c(t-1) + i × candidate
hidden: h(t) = o × tanh( c(t) )
The sigmoid gates output values between 0 and 1 — soft switches. A forget gate near 1 means "keep everything." Near 0 means "wipe it." The network learns when to remember and when to forget, based on what the task requires.
The cell state update, c(t) = f × c(t-1) + i × candidate, is additive. Old memory plus new information. That additive structure is what saves the gradient. Instead of multiplying through a squashing function at every step, gradients flow backward through the cell state with far less decay.
Same intuition as the ResNet skip connection. Different problem, same fix.
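Putting the five equations together, one LSTM step can be sketched in NumPy like this. Dimensions and random weights are illustrative, and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
concat = hidden_dim + input_dim

# One weight matrix per gate, each acting on the concatenation [h(t-1), x(t)]
W_f = rng.normal(scale=0.1, size=(hidden_dim, concat))
W_i = rng.normal(scale=0.1, size=(hidden_dim, concat))
W_c = rng.normal(scale=0.1, size=(hidden_dim, concat))
W_o = rng.normal(scale=0.1, size=(hidden_dim, concat))

def lstm_step(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ hx)              # forget gate: keep or wipe old memory
    i = sigmoid(W_i @ hx)              # input gate: admit new information
    candidate = np.tanh(W_c @ hx)      # proposed memory content
    o = sigmoid(W_o @ hx)              # output gate: what to act on now
    c = f * c_prev + i * candidate     # additive cell update: the gradient highway
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(hidden_dim)
for x in [rng.normal(size=input_dim) for _ in range(5)]:
    h, c = lstm_step(h, c, x)
print(h.shape, c.shape)  # (16,) (16,)
```

The only multiplication touching old memory is the elementwise forget gate, not a full matrix — that's the whole trick.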
GRU: Fluency With Less Overhead
Once reading becomes fluent, you don't consciously run through all three questions at every word. Most decisions are automatic: keep reading, update the picture, move on.
The Gated Recurrent Unit is that streamlined version. It merges the cell state and hidden state into one, and uses two gates instead of three:
Reset gate: how much past context to use for the new candidate
r = sigmoid( W_r × [h(t-1), x(t)] )
Update gate: how much to blend old state with new candidate
z = sigmoid( W_z × [h(t-1), x(t)] )
Update:
candidate: h̃ = tanh( W × [r × h(t-1), x(t)] )
hidden: h(t) = (1-z) × h(t-1) + z × h̃
Fewer parameters, similar performance. The update gate does double duty, controlling both forgetting and writing in one operation. In practice, LSTMs and GRUs perform comparably: GRUs train faster; LSTMs have slightly more expressive memory. Most practitioners try both.
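The corresponding GRU step, in the same illustrative NumPy style (random weights, biases omitted), shows how two gates replace three:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
concat = hidden_dim + input_dim

W_r = rng.normal(scale=0.1, size=(hidden_dim, concat))  # reset gate weights
W_z = rng.normal(scale=0.1, size=(hidden_dim, concat))  # update gate weights
W   = rng.normal(scale=0.1, size=(hidden_dim, concat))  # candidate weights

def gru_step(h_prev, x):
    hx = np.concatenate([h_prev, x])
    r = sigmoid(W_r @ hx)                                    # how much past to use
    z = sigmoid(W_z @ hx)                                    # how much to blend
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                    # one blend, no cell state

h = np.zeros(hidden_dim)
for x in [rng.normal(size=input_dim) for _ in range(5)]:
    h = gru_step(h, x)
print(h.shape)  # (16,)
```

One state, one blend: keeping the old state and writing the new one are complementary sides of the same gate z.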
Layer Normalization: The Normalization That Fits Sequences
In Post 7, batch normalization stabilized deep networks by normalizing across the batch. But RNNs have a problem with batch norm. Sequences have variable lengths, and the hidden state carries information across steps. Normalizing across a batch of sequences at each time step is unstable.
Layer normalization fixes this by normalizing across the features of each individual sample, not across the batch. Same idea, different axis. Completely independent of batch size and sequence length.
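In code, the difference really is just which axis you normalize over. A minimal sketch, with the learnable scale and shift that production implementations add omitted for clarity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the feature axis of each sample, not across the batch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=5.0, size=(4, 16))  # 4 samples, 16 features
normed = layer_norm(batch)

# Each row is normalized independently: per-sample mean ~0, std ~1,
# no matter how many samples or time steps the batch contains.
print(np.allclose(normed.mean(axis=-1), 0, atol=1e-6))  # True
print(np.allclose(normed.std(axis=-1), 1, atol=1e-3))   # True
```

Because each sample normalizes itself, the statistics never depend on its neighbors in the batch — exactly what variable-length sequences need.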
This is why layer norm (or a close variant) became the standard for sequence models, and why every modern LLM uses it. When we get to Transformers in Post 10, it'll be everywhere.
What Clicked for Me
The reading analogy didn't just help me explain RNNs — it helped me understand what the hidden state actually is.
It's not a recording of the past. It's a compressed summary of the parts of history that seem relevant for predicting what comes next. Just like a fluent reader doesn't remember the exact words from three pages ago, but does remember that the detective is suspicious of the butler.
Interactive Playground
cd 08-rnn
streamlit run rnn_playground.py
Train both models, then pick a sentence length and watch the confidence bars update word by word. You'll see exactly the step where the vanilla RNN changes its mind and the LSTM doesn't.
What's Next
RNNs gave networks memory. But they process sequences step by step: slow, sequential, and still limited by how far gradients can travel, even with LSTMs.
There's a deeper problem too. The hidden state has to compress everything seen so far into a fixed-size vector. For long sequences, that bottleneck loses information no matter how good the gating is.
Post 9 introduces Attention Mechanisms: a way for the network to directly look back at any part of the input sequence it needs, regardless of distance. No compression bottleneck. No sequential processing. No hoping the gradient survives 100 time steps.
It's the idea that made RNNs obsolete — and made Transformers possible.
References
- Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
- Cho, K., et al. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP.