Nilavukkarasan R
Attention Mechanisms: Stop Compressing, Start Looking Back

"The art of being wise is the art of knowing what to overlook."
William James


The Bottleneck We Didn't Notice

In my last post, we gave networks memory. An LSTM reads a sentence word by word, maintaining a hidden state that carries context forward. It solved the forgetting problem that plagued vanilla RNNs.

But there are three problems LSTM still doesn't solve. And I didn't fully understand them until I thought about my own experience learning English.

I studied in Tamil medium all the way through school. English was a subject, not a language I lived in. When I started my first job 20 years ago, I had to learn to actually speak it and, more terrifyingly, write it. Client emails. Professional communication. Things that would be read, judged, and replied to.

My strategy was the only one I knew: compose the sentence in Tamil first, then translate it word by word into English.

It worked for simple things. It broke down in three very specific ways. Those three breakdowns map exactly onto the three problems that attention was built to solve.


Problem 1: The Compressed Summary

The first breakdown happened with long emails.

I'd compose a full paragraph in Tamil mentally: three or four sentences, a complete thought. Then I'd try to hold that entire paragraph in my head while translating it into English. By the time I was writing the third sentence in English, the first one had blurred. I'd lose the subject I'd introduced. I'd forget the condition I'd set up. The English output would drift from the original Tamil thought.

The problem wasn't that I forgot individual words. It was that I was trying to carry a compressed summary of a long paragraph in my working memory and that summary wasn't big enough to hold everything.

This is exactly what an RNN encoder does.

It reads the entire input sequence and compresses it into a single fixed-size vector, the final hidden state. Then the decoder uses only that compressed summary to generate the output. For short sentences, fine. For long ones, that summary has to hold everything: the subject, the verb, the object, the tone, the nuance. Something always gets lost.

Bahdanau's Fix (2014)

The fix came from Bahdanau, Cho, and Bengio. The idea is simple in principle: don't compress. Keep every hidden state the encoder produced, one per input word, and let the decoder look back at any of them when needed.

Instead of one compressed summary, the decoder has access to the full sequence of encoder states. When generating each output word, it computes a weighted sum over all of them, attending more to the ones that are relevant right now and less to the ones that aren't.

Without attention:  decoder sees only h_final (compressed summary of everything)
With attention:     decoder sees h₁, h₂, ..., hₙ and decides what to focus on

Bahdanau's original formulation used a small neural network to compute how well each encoder state matched the decoder's current need: a learned compatibility function. It worked remarkably well. Translation quality on long sentences improved dramatically.

Your brain does this too. When you're answering a question about something you read, you don't reconstruct it from a compressed summary; you mentally flip back to the relevant section. The original is still accessible. Attention gives the network the same ability.
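To make the weighted-sum idea concrete, here's a minimal NumPy sketch of Bahdanau-style additive attention. The names (W1, W2, v) and the toy dimensions are illustrative choices, not the paper's exact setup:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(decoder_state, encoder_states, W1, W2, v):
    """Bahdanau-style additive attention: a small network scores each
    encoder state against the current decoder state; softmax turns the
    scores into weights; the context is the weighted sum."""
    # score_i = v . tanh(W1 @ s + W2 @ h_i)  for each encoder state h_i
    scores = np.array([v @ np.tanh(W1 @ decoder_state + W2 @ h)
                       for h in encoder_states])
    weights = softmax(scores)            # one weight per input word
    context = weights @ encoder_states   # weighted sum, same dim as each h
    return context, weights

rng = np.random.default_rng(0)
d = 4                                       # toy hidden size
encoder_states = rng.normal(size=(5, d))    # h1..h5, one per input word
decoder_state = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

context, weights = additive_attention(decoder_state, encoder_states, W1, W2, v)
print(weights)  # five weights that sum to 1 (up to float error)
```

Instead of one h_final, the decoder gets a fresh `context` vector at every output step, recomputed from all five encoder states.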


Problem 2: Word Order

The second breakdown was more embarrassing. It happened in individual sentences, not long paragraphs.

Tamil is a verb-final language. The verb comes at the end. When I wanted to write "Can you send the report by tomorrow?", the Tamil structure in my head was roughly: "நாளைக்குள் அந்த report-ஐ அனுப்ப முடியுமா?" — "Tomorrow-by that report send can-you?" Subject implied. Object before verb.

I'd start translating from the beginning of the Tamil sentence. "Tomorrow-by" → "By tomorrow". OK so far. "That report" → "the report". Fine. "Send" → "send". And then I'd realize I'd already written "By tomorrow the report send" and I was confused where to put "Can you."

What appeared perfectly correct in Tamil didn't map cleanly to English word by word. The structures are different. A literal left-to-right translation produces nonsense.

This is the word order problem — and it's where attention does its real work.

An RNN decoder, even with access to all encoder states, still generates output left to right, one word at a time. But attention lets the decoder look at any encoder position in any order. When generating "Can", it attends to the Tamil modal at position 5. When generating "send", it attends to the Tamil verb at position 4. When generating "tomorrow", it attends back to position 1.

Tamil:    நாளைக்குள்  அந்த  report-ஐ  அனுப்ப  முடியுமா
              h₁        h₂      h₃       h₄       h₅
           (by tmrw)  (that) (report)  (send)  (can you?)

English output → attention focus:
"Can"      → h₅  (முடியுமா — the modal)
"you"      → h₅
"send"     → h₄  (அனுப்ப — the verb)
"the"      → h₂ + h₃
"report"   → h₃  (report-ஐ — the object)
"by"       → h₁  (நாளைக்குள் — the time marker)
"tomorrow" → h₁

The attention weights form a matrix, one row per English output word, one column per Tamil input word. You can literally see the reordering: the decoder jumping from position 5 back to position 4, then to 3, then to 1. It's not following the Tamil order. It's following the English order, looking back at whatever Tamil position it needs.

This is what the Q/K/V formulation captures cleanly:

  • Query (Q): what the decoder is currently asking — "what do I need to generate this word?"
  • Key (K): what each encoder position offers — a description of what's available there
  • Value (V): the actual content retrieved when you attend to that position
Attention(Q, K, V) = softmax(Q·Kᵀ / √d) · V

The √d scaling keeps dot products in a stable range as the dimension grows; without it, the softmax saturates and gradients vanish. Same instability problem we saw in deep networks, same fix.
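The whole formula fits in a few lines of NumPy. This is a toy sketch with random Q, K, V, just to show the shapes and the scaling:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # (n_queries, n_keys) compatibility matrix
    weights = softmax(scores)      # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 5, 8
Q = rng.normal(size=(n, d))  # what each decoder step is asking for
K = rng.normal(size=(n, d))  # what each encoder position offers
V = rng.normal(size=(n, d))  # the content actually retrieved
out, weights = attention(Q, K, V)
print(out.shape)  # one context vector per query
```

Each row of `weights` is one row of the attention matrix described above: how much this query attends to each position.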


Problem 3: Speed

The third breakdown was the slowest to notice, because it wasn't about a single sentence. It was about conversation.

Word-by-word translation is sequential by nature. I'd think in Tamil, translate, speak. Then listen to the reply in English, translate it back to Tamil to understand it, formulate a Tamil response, translate that to English, speak. Every exchange had this full round-trip happening in my head.

For a simple two-line exchange, manageable. For a fast-moving technical discussion with multiple people, completely unworkable. By the time I'd finished translating the last thing someone said, the conversation had moved on two turns.

The bottleneck wasn't comprehension. It was that the process was sequential. Each step had to wait for the previous one to finish.

This is the parallelism problem — and it's what self-attention solves.

An RNN processes a sequence one step at a time. Step 2 can't start until step 1 is done. For a sentence of length 100, that's 100 sequential operations. You can't parallelize across time steps because each hidden state depends on the previous one.

Self-attention breaks this dependency entirely. Instead of processing word by word, it computes relationships between all positions simultaneously in a single matrix operation. There's no sequential chain. The entire sequence is processed at once.

When you start thinking directly in English, something similar happens. It's not a sequential process anymore: grammar, meaning, and context are processed in parallel, automatically, without conscious effort.

Self-attention is the architectural version of that shift.


Self-Attention: Every Word Sees Every Other Word

So far, attention was between two sequences: Tamil input, English output. The decoder attends to the encoder. But the same mechanism applies within a single sequence, and this turns out to be even more powerful.

Consider: "The report that the client who called yesterday requested is ready."

What is "ready"? The report. Which report? The one the client requested. Which client? The one who called yesterday. These connections span many positions in the same sentence. An RNN would need to carry all of this through its hidden state, step by step, hoping nothing gets lost.

Self-attention resolves them in one shot: every word attends to every other word in the same sequence, regardless of distance.

"ready"     → attends back to "report" (subject of the predicate)
"requested" → attends to "client" (who did the requesting)
"who"       → attends to "client" (relative clause anchor)

No sequential processing. No hidden state bottleneck. One operation, all connections at once.
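As a sketch: self-attention is the same scaled dot-product operation, except Q, K, and V are all projections of one input sequence X. The projection matrices and sizes below are arbitrary toy values:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Self-attention: Q, K, V all come from the SAME sequence X, so
    every position attends to every other position in one matrix op."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V  # note: no loop over time steps anywhere

rng = np.random.default_rng(1)
n_words, d = 11, 16      # e.g. the 11-word "report" sentence above
X = rng.normal(size=(n_words, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # every word updated in one shot
```

Contrast this with an RNN, where computing position 11 requires first computing positions 1 through 10. Here the only loop is inside the matrix multiplications, which hardware parallelizes.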

Your brain does this effortlessly when reading fluently. It's only when you're translating word by word processing sequentially, one token at a time that you lose these long-range connections.


Multi-Head Attention: Noticing Multiple Things at Once

There's one more piece. A single attention operation computes one set of weights. It can only "look for" one type of relationship at a time. But language has many simultaneous relationships.

In "The cat sat on the mat because it was tired", the word "it" has:

  • A syntactic relationship with "sat" (subject of the clause)
  • A coreference relationship with "cat" (what "it" refers to)
  • A semantic relationship with "tired" (property being attributed)

A single attention head would have to pick one. Multi-head attention runs several attention operations in parallel, each with different learned projections:

head_i = Attention(Q·Wᵢ_Q, K·Wᵢ_K, V·Wᵢ_V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O

Each head learns to notice different relationships simultaneously. One head might track grammatical alignment. Another might track semantic similarity. Another might track coreference: which pronoun refers to which noun.

The standard Transformer uses 8 heads. Each head operates on a smaller slice of the representation (dimension d/8 instead of d), so the total computation is the same as a single large attention — but the network gets 8 different perspectives instead of one.
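A minimal sketch of the head-splitting, again with toy random weights. Slicing the full d-dimensional projections into h pieces is equivalent to giving each head its own smaller Wᵢ matrices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=8):
    """Multi-head attention: split d into n_heads slices, run attention
    in each slice independently, concatenate, and mix with W_O."""
    n, d = X.shape
    dh = d // n_heads                    # per-head dimension d/8
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        s = slice(i * dh, (i + 1) * dh)  # this head's slice of d
        scores = Q[:, s] @ K[:, s].T / np.sqrt(dh)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(...) . W_O

rng = np.random.default_rng(2)
n, d = 6, 64
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # same shape in and out: (6, 64)
```

Real implementations vectorize the head loop into one batched matmul, but the arithmetic is the same: eight small attentions instead of one big one, at the same total cost.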


What Clicked for Me

The compressed summary problem is the bottleneck of trying to hold a whole paragraph in working memory before translating. The word order problem is the mismatch between SOV and SVO that makes literal translation fail. The sequential processing problem is the reason real-time conversation was impossible while I was still translating word by word.

The shift from "translate word by word" to "think in English" is the shift from RNN to attention. It's not an optimization. It's a different way of processing.


Interactive Playground

cd 09-attention
streamlit run attention_playground.py

GitHub Repository

This playground is different from the previous ones. No training loops, no waiting. Five concept demos that follow the blog post narrative — every slider updates instantly because it's all just matrix math under the hood.


What's Next

Attention solves the bottleneck. But the architecture we've built so far still has an RNN encoder underneath — it's still sequential at its core.

Post 10 asks: what if we removed the RNN entirely? What if the whole architecture was just attention, stacked?

That's the Transformer. Attention without recurrence. Parallel processing of the entire sequence at once. Positional encodings to restore order information. And a feed-forward network to add non-linearity between attention layers.

It's the architecture behind every modern language model — GPT, BERT, T5, and everything that came after. And it's built entirely from pieces we already understand.


Deep Dive

For the full mathematical treatment — dot-product attention, scaled attention, the Q/K/V framework, self-attention, multi-head attention, masking, gradient flow, and worked numerical examples — see ATTENTION_MATH_DEEP_DIVE.md.


References

  1. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate
  2. Luong, M., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation
  3. Vaswani, A., et al. (2017). Attention Is All You Need
