Rijul Rajesh
Cosine Similarity vs Dot Product in Attention Mechanisms

To compare the hidden states of the encoder and decoder, attention needs a similarity score.

Two common approaches to calculate this are:

  • Cosine similarity
  • Dot product

Cosine Similarity

It takes the dot product of the two vectors and then divides by the product of their magnitudes, so the result depends only on the angle between the vectors, not their lengths.

Example

Encoder output:

[-0.76, 0.75]

Decoder output:

[0.91, 0.38]

Cosine similarity ≈ -0.39

  • Close to 1 → very similar → strong attention
  • Close to 0 → not related
  • Negative → opposite → low attention

This is useful when:

  • Values can vary a lot in size
  • You want a consistent scale (-1 to 1)

The drawback is cost. It requires extra calculations (division, square roots), and in attention we don’t always need that normalization.
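To make the calculation concrete, here is a minimal sketch of cosine similarity applied to the encoder and decoder vectors from the example (plain Python, no libraries assumed):

```python
import math

def cosine_similarity(a, b):
    # Dot product: multiply corresponding values and add them up
    dot = sum(x * y for x, y in zip(a, b))
    # Normalize by the product of the two vector magnitudes
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

encoder = [-0.76, 0.75]  # encoder output from the example
decoder = [0.91, 0.38]   # decoder output from the example

print(round(cosine_similarity(encoder, decoder), 2))  # -0.39
```

The square roots and division are exactly the extra work the dot product avoids.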


Dot Product

Dot product is much simpler. It does the following:

  • Multiply corresponding values
  • Add them up

Example

(-0.76 × 0.91) + (0.75 × 0.38) = -0.41

Dot product is preferred in attention because:

  • It’s fast
  • It’s simple
  • It gives good relative scores

Even if the numbers are not normalized, the model can still figure out:

  • Which words are more important
  • Which words to ignore
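The same two vectors scored with a plain dot product look like this (a minimal sketch, using the example values above):

```python
def dot_product(a, b):
    # Multiply corresponding values, then add them up
    return sum(x * y for x, y in zip(a, b))

encoder = [-0.76, 0.75]  # encoder output from the example
decoder = [0.91, 0.38]   # decoder output from the example

print(round(dot_product(encoder, decoder), 2))  # -0.41
```

Note how close -0.41 is to the cosine score of -0.39: the relative ordering of scores is preserved, which is all attention needs.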

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run:

ipm install repo-name

… and you’re done! 🚀


🔗 Explore Installerpedia here
