Nilavukkarasan R
Attention Mechanisms: Stop Compressing, Start Looking Back

"The art of being wise is the art of knowing what to overlook."
William James


The Bottleneck We Didn't Notice

In my last post, we gave networks memory. An LSTM reads a sentence word by word, maintaining a hidden state that carries context forward. It solved the forgetting problem that plagued vanilla RNNs.

But there are three problems LSTM still doesn't solve. And I didn't fully understand them until I thought about my own experience learning English.

I studied in Tamil medium all the way through school. English was a subject, not a language I lived in. When I started my first job 20 years ago, I had to learn to actually speak it and, more terrifyingly, write it. Client emails. Professional communication. Things that would be read, judged, and replied to.

My strategy was the only one I knew: compose the sentence in Tamil first, then translate it word by word into English.

It worked for simple things. It broke down in three very specific ways. Those three breakdowns map exactly onto the three problems that attention was built to solve.


Problem 1: The Compressed Summary

The first breakdown happened with long emails.

I'd compose a full paragraph in Tamil mentally: three or four sentences, a complete thought. Then I'd try to hold that entire paragraph in my head while translating it into English. By the time I was writing the third sentence in English, the first one had blurred. I'd lose the subject I'd introduced. I'd forget the condition I'd set up. The English output would drift from the original Tamil thought.

The problem wasn't that I forgot individual words. It was that I was trying to carry a compressed summary of a long paragraph in my working memory and that summary wasn't big enough to hold everything.

This is exactly what an RNN encoder does.

It reads the entire input sequence and compresses it into a single fixed-size vector, the final hidden state. Then the decoder uses only that compressed summary to generate the output. For short sentences, fine. For long ones, that summary has to hold everything: the subject, the verb, the object, the tone, the nuance. Something always gets lost.

Bahdanau's Fix (2014)

The fix came from Bahdanau, Cho, and Bengio. The idea is simple in principle: don't compress. Keep every hidden state the encoder produced, one per input word, and let the decoder look back at any of them when needed.

Instead of one compressed summary, the decoder has access to the full sequence of encoder states. When generating each output word, it computes a weighted sum over all of them, attending more to the ones that are relevant right now and less to the ones that aren't.

Without attention:  decoder sees only h_final (compressed summary of everything)
With attention:     decoder sees h₁, h₂, ..., hₙ and decides what to focus on

Bahdanau's original formulation used a small neural network to compute how well each encoder state matched the decoder's current need: a learned compatibility function. It worked remarkably well. Translation quality on long sentences improved dramatically.

Your brain does this too. When you're answering a question about something you read, you don't reconstruct it from a compressed summary; you mentally flip back to the relevant section. The original is still accessible. Attention gives the network the same ability.
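To make the weighted-sum idea concrete, here's a minimal NumPy sketch of Bahdanau-style additive attention. The names (W1, W2, v) and the toy dimensions are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Bahdanau-style additive attention: a small network scores each
    encoder state against the current decoder state; softmax turns the
    scores into weights; the context is the weighted sum."""
    # score_i = v . tanh(W1 @ s + W2 @ h_i)  for each encoder state h_i
    scores = np.array([v @ np.tanh(W1 @ decoder_state + W2 @ h)
                       for h in encoder_states])
    weights = softmax(scores)            # one weight per input word
    context = weights @ encoder_states   # weighted sum, same dim as each h
    return context, weights

rng = np.random.default_rng(0)
d = 4                                       # toy hidden size
encoder_states = rng.normal(size=(5, d))    # h1..h5, one per input word
decoder_state = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

context, weights = additive_attention(decoder_state, encoder_states, W1, W2, v)
print(weights)  # five weights that sum to 1 (up to float error)
```

Instead of one h_final, the decoder gets a fresh `context` vector at every output step, recomputed from all five encoder states.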


Problem 2: Word Order

The second breakdown was more embarrassing. It happened in individual sentences, not long paragraphs.

Tamil is a verb-final language. The verb comes at the end. When I wanted to write "Can you send the report by tomorrow?", the Tamil structure in my head was roughly: "நாளைக்குள் அந்த report-ஐ அனுப்ப முடியுமா?" — "Tomorrow-by that report send can-you?" Subject implied. Object before verb.

I'd start translating from the beginning of the Tamil sentence. "Tomorrow-by" → "By tomorrow". OK so far. "That report" → "the report". Fine. "Send" → "send". And then I'd realize I'd already written "By tomorrow the report send" and I was confused where to put "Can you."

What appeared perfectly correct in Tamil didn't map cleanly to English word by word. The structures are different. A literal left-to-right translation produces nonsense.

This is the word order problem — and it's where attention does its real work.

An RNN decoder, even with access to all encoder states, still generates output left to right, one word at a time. But attention lets the decoder look at any encoder position in any order. When generating "Can", it attends to the Tamil modal at position 5. When generating "send", it attends to the Tamil verb at position 4. When generating "tomorrow", it attends back to position 1.

Tamil:    நாளைக்குள்  அந்த  report-ஐ  அனுப்ப  முடியுமா
              h₁        h₂      h₃       h₄       h₅
           (by tmrw)  (that) (report)  (send)  (can you?)

English output → attention focus:
"Can"      → h₅  (முடியுமா — the modal)
"you"      → h₅
"send"     → h₄  (அனுப்ப — the verb)
"the"      → h₂ + h₃
"report"   → h₃  (report-ஐ — the object)
"by"       → h₁  (நாளைக்குள் — the time marker)
"tomorrow" → h₁

The attention weights form a matrix, one row per English output word, one column per Tamil input word. You can literally see the reordering: the decoder jumping from position 5 back to position 4, then to 3, then to 1. It's not following the Tamil order. It's following the English order, looking back at whatever Tamil position it needs.

This is what the Q/K/V formulation captures cleanly:

  • Query (Q): what the decoder is currently asking — "what do I need to generate this word?"
  • Key (K): what each encoder position offers — a description of what's available there
  • Value (V): the actual content retrieved when you attend to that position
Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V

The √d scaling keeps dot products in a stable range as the dimension grows; without it, the softmax saturates and gradients vanish. Same instability problem we saw in deep networks, same fix.
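The whole formula fits in a few lines of NumPy. This is a toy sketch with random Q, K, V, just to show the shapes and the scaling:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n_queries, n_keys) compatibility matrix
    weights = softmax(scores)      # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.normal(size=(n, d))  # what each decoder step is asking for
K = rng.normal(size=(n, d))  # what each encoder position offers
V = rng.normal(size=(n, d))  # the content actually retrieved
out, weights = attention(Q, K, V)
print(out.shape)  # one context vector per query
```

Each row of `weights` is one row of the attention matrix described above: how much this query attends to each position.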


Problem 3: Speed

The third breakdown was the slowest to notice, because it wasn't about a single sentence. It was about conversation.

Word-by-word translation is sequential by nature. I'd think in Tamil, translate, speak. Then listen to the reply in English, translate it back to Tamil to understand it, formulate a Tamil response, translate that to English, speak. Every exchange had this full round-trip happening in my head.

For a simple two-line exchange, manageable. For a fast-moving technical discussion with multiple people, completely unworkable. By the time I'd finished translating the last thing someone said, the conversation had moved on two turns.

The bottleneck wasn't comprehension. It was that the process was sequential. Each step had to wait for the previous one to finish.

This is the parallelism problem — and it's what self-attention solves.

An RNN processes a sequence one step at a time. Step 2 can't start until step 1 is done. For a sentence of length 100, that's 100 sequential operations. You can't parallelize across time steps because each hidden state depends on the previous one.

Self-attention breaks this dependency entirely. Instead of processing word by word, it computes relationships between all positions simultaneously in a single matrix operation. There's no sequential chain. The entire sequence is processed at once.

When you start thinking directly in English, something similar happens. It's not a sequential process anymore: grammar, meaning, and context are processed in parallel, automatically, without conscious effort.

Self-attention is the architectural version of that shift.


Self-Attention: Every Word Sees Every Other Word

So far, attention was between two sequences: Tamil input, English output. The decoder attends to the encoder. But the same mechanism applies within a single sequence, and this turns out to be even more powerful.

Consider: "The report that the client who called yesterday requested is ready."

What is "ready"? The report. Which report? The one the client requested. Which client? The one who called yesterday. These connections span many positions in the same sentence. An RNN would need to carry all of this through its hidden state, step by step, hoping nothing gets lost.

Self-attention resolves them in one shot: every word attends to every other word in the same sequence, regardless of distance.

"ready"     → attends back to "report" (subject of the predicate)
"requested" → attends to "client" (who did the requesting)
"who"       → attends to "client" (relative clause anchor)

No sequential processing. No hidden state bottleneck. One operation, all connections at once.
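As a sketch: self-attention is the same scaled dot-product operation, except Q, K, and V are all projections of one input sequence X. The projection matrices and sizes below are arbitrary toy values:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Self-attention: Q, K, V all come from the SAME sequence X, so
    every position attends to every other position in one matrix op."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # note: no loop over time steps anywhere

rng = np.random.default_rng(1)
n_words, d = 11, 16      # e.g. the 11-word "report" sentence above
X = rng.normal(size=(n_words, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # every word updated in one shot
```

Contrast this with an RNN, where computing position 11 requires first computing positions 1 through 10. Here the only loop is inside the matrix multiplications, which hardware parallelizes.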

Your brain does this effortlessly when reading fluently. It's only when you're translating word by word processing sequentially, one token at a time that you lose these long-range connections.


Multi-Head Attention: Noticing Multiple Things at Once

There's one more piece. A single attention operation computes one set of weights. It can only "look for" one type of relationship at a time. But language has many simultaneous relationships.

In "The cat sat on the mat because it was tired", the word "it" has:

  • A syntactic relationship with "sat" (subject of the clause)
  • A coreference relationship with "cat" (what "it" refers to)
  • A semantic relationship with "tired" (property being attributed)

A single attention head would have to pick one. Multi-head attention runs several attention operations in parallel, each with different learned projections:

head_i = Attention(Q·Wᵢ_Q, K·Wᵢ_K, V·Wᵢ_V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O

Each head learns to notice different relationships simultaneously. One head might track grammatical alignment. Another might track semantic similarity. Another might track coreference: which pronoun refers to which noun.

The standard Transformer uses 8 heads. Each head operates on a smaller slice of the representation (dimension d/8 instead of d), so the total computation is the same as a single large attention — but the network gets 8 different perspectives instead of one.
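A minimal sketch of the head-splitting, again with toy random weights. Slicing the full d-dimensional projections into h pieces is equivalent to giving each head its own smaller Wᵢ matrices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=8):
    """Multi-head attention: split d into n_heads slices, run attention
    in each slice independently, concatenate, and mix with W_O."""
    n, d = X.shape
    dh = d // n_heads                    # per-head dimension d/8
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        s = slice(i * dh, (i + 1) * dh)  # this head's slice of d
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(...) . W_O

rng = np.random.default_rng(2)
n, d = 6, 64
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # same shape in and out: (6, 64)
```

Real implementations vectorize the head loop into one batched matmul, but the arithmetic is the same: eight small attentions instead of one big one, at the same total cost.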


What Clicked for Me

The compressed summary problem is the bottleneck of trying to hold a whole paragraph in working memory before translating. The word order problem is the mismatch between SOV and SVO that makes literal translation fail. The sequential processing problem is the reason real-time conversation was impossible while I was still translating word by word.

The shift from "translate word by word" to "think in English" is the shift from RNN to attention. It's not an optimization. It's a different way of processing.


Interactive Playground

cd 09-attention
streamlit run attention_playground.py

GitHub Repository

This playground is different from the previous ones. No training loops, no waiting. Five concept demos that follow the blog post narrative — every slider updates instantly because it's all just matrix math under the hood.


What's Next

Attention solves the bottleneck. But the architecture we've built so far still has an RNN encoder underneath — it's still sequential at its core.

Post 10 asks: what if we removed the RNN entirely? What if the whole architecture was just attention, stacked?

That's the Transformer. Attention without recurrence. Parallel processing of the entire sequence at once. Positional encodings to restore order information. And a feed-forward network to add non-linearity between attention layers.

It's the architecture behind every modern language model — GPT, BERT, T5, and everything that came after. And it's built entirely from pieces we already understand.


Deep Dive

For the full mathematical treatment — dot-product attention, scaled attention, the Q/K/V framework, self-attention, multi-head attention, masking, gradient flow, and worked numerical examples — see ATTENTION_MATH_DEEP_DIVE.md.


References

  1. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate
  2. Luong, M., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation
  3. Vaswani, A., et al. (2017). Attention Is All You Need
