Rijul Rajesh
Cosine Similarity vs Dot Product in Attention Mechanisms

To compare the hidden states of the encoder and decoder, attention needs a similarity score.

Two common approaches to calculate this are:

  • Cosine similarity
  • Dot product

Cosine Similarity

It takes the dot product of the two vectors and then divides by the product of their magnitudes, so the result depends only on the angle between the vectors, not their lengths.

Example

Encoder output:

[-0.76, 0.75]

Decoder output:

[0.91, 0.38]

Cosine similarity ≈ -0.39

  • Close to 1 → very similar → strong attention
  • Close to 0 → not related
  • Negative → opposite → low attention

This is useful when:

  • Values can vary a lot in size
  • You want a consistent scale (-1 to 1)

The drawback is cost. It requires extra calculations (division, square roots), and in attention we don’t always need that normalization.
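To make the calculation concrete, here is a minimal sketch of cosine similarity applied to the encoder and decoder vectors from the example (plain Python, no libraries assumed):

```python
import math

def cosine_similarity(a, b):
    # Dot product: multiply corresponding values and add them up
    dot = sum(x * y for x, y in zip(a, b))
    # Normalize by the product of the two vector magnitudes
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

encoder = [-0.76, 0.75]  # encoder output from the example
decoder = [0.91, 0.38]   # decoder output from the example

print(round(cosine_similarity(encoder, decoder), 2))  # -0.39
```

The square roots and division are exactly the extra work the dot product avoids.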


Dot Product

Dot product is much simpler. It does the following:

  • Multiply corresponding values
  • Add them up

Example

(-0.76 × 0.91) + (0.75 × 0.38) = -0.41

Dot product is preferred in attention because:

  • It’s fast
  • It’s simple
  • It gives good relative scores

Even if the numbers are not normalized, the model can still figure out:

  • Which words are more important
  • Which words to ignore
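The same two vectors scored with a plain dot product look like this (a minimal sketch, using the example values above):

```python
def dot_product(a, b):
    # Multiply corresponding values, then add them up
    return sum(x * y for x, y in zip(a, b))

encoder = [-0.76, 0.75]  # encoder output from the example
decoder = [0.91, 0.38]   # decoder output from the example

print(round(dot_product(encoder, decoder), 2))  # -0.41
```

Note how close -0.41 is to the cosine score of -0.39: the relative ordering of scores is preserved, which is all attention needs.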

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
