In the previous articles, we explored Seq2Seq models. On the path toward transformers, we need to understand one more concept: Attention.
In a basic encoder–decoder model, the encoder unrolls its LSTMs over the input and compresses the entire sentence into a single context vector.
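A toy sketch of this compression, assuming randomly initialized weights and a plain tanh recurrent cell standing in for the LSTM (all names and sizes here are made up for illustration):

```python
import numpy as np

np.random.seed(0)

hidden_size, embed_size = 4, 3  # tiny, illustrative dimensions

# Toy weights (randomly initialized, not trained)
W_xh = np.random.randn(hidden_size, embed_size) * 0.1
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1

def encode(embedded_words):
    """Run a simple recurrent cell over the input; return the final
    hidden state -- the single 'context vector' the decoder receives."""
    h = np.zeros(hidden_size)
    for x in embedded_words:
        h = np.tanh(W_xh @ x + W_hh @ h)  # each word updates the same h
    return h

# "Let's go" -> two toy word embeddings
sentence = [np.random.randn(embed_size), np.random.randn(embed_size)]
context = encode(sentence)
print(context.shape)  # (4,) -- the whole sentence squeezed into 4 numbers
```

No matter how long the input is, the decoder only ever sees those few numbers, which is exactly the bottleneck this article is about.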
This works fine for short phrases like "Let's go".
But if we had a bigger input vocabulary with thousands of words, then we could input longer and more complicated sentences, like "Don't eat the delicious-looking and smelling pasta".
For longer phrases, even with LSTMs, words that are input early on can be forgotten.
In this case, if we forget the first word, "Don't", the sentence becomes:
"eat the delicious-looking and smelling pasta"
Forgetting that one word flips the meaning of the whole sentence, so remembering early words can be critical.
Basic RNNs had problems with long-term memory because they ran both long- and short-term information through a single path.
The main idea of Long Short-Term Memory (LSTM) units is that they solve this problem by providing separate paths for long- and short-term memory.
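Those two paths can be sketched as a single LSTM step, assuming toy dimensions and untrained random weights (a simplified sketch, not a production implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. c is the long-term path (cell state),
    h is the short-term path (hidden state)."""
    z = W @ x + U @ h_prev + b      # compute all four gates at once
    f, i, o, g = np.split(z, 4)     # forget, input, output, candidate
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g          # long-term path: mostly additive updates
    h = o * np.tanh(c)              # short-term path: a gated read of c
    return h, c

np.random.seed(1)
hidden, inputs = 4, 3
W = np.random.randn(4 * hidden, inputs) * 0.1
U = np.random.randn(4 * hidden, hidden) * 0.1
b = np.zeros(4 * hidden)

h = c = np.zeros(hidden)
for x in [np.random.randn(inputs) for _ in range(5)]:
    h, c = lstm_step(x, h, c, W, U, b)
```

The key line is `c = f * c_prev + i * g`: because the cell state is updated additively rather than repeatedly squashed, information can survive over many more steps than in a basic RNN.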
Even with separate paths, if we have a lot of data, both paths still have to carry a large amount of information.
So, a word at the start of a long phrase, like "Don't", can still get lost.
The main idea of attention is to add multiple new paths from the encoder to the decoder, one per input value, so that each step of the decoder can directly access the relevant input values.
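A minimal sketch of those per-input paths, using simple dot-product scoring (one common way to compute attention; the shapes here are toy values for illustration):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Dot-product attention: score every encoder state against the
    current decoder state, softmax the scores, return the weighted mix."""
    scores = encoder_states @ decoder_state   # one score per input word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax -> weights sum to 1
    context = weights @ encoder_states        # weighted sum of input states
    return context, weights

np.random.seed(2)
encoder_states = np.random.randn(3, 4)   # one state per input word
decoder_state = np.random.randn(4)       # current decoder hidden state
context, weights = attention(decoder_state, encoder_states)
print(weights)  # how much each input word contributes at this step
```

Instead of relying on a single fixed context vector, the decoder recomputes this weighted mix at every step, so even the first word of a long sentence stays directly reachable.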
We will explore more about attention in the next article.
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:
ipm install repo-name
… and you’re done! 🚀
