In the previous article, we explored the concept of self-attention in transformers. In this article, we will go deeper into how the comparisons between words are performed.
Building Query and Key Values
Let’s go back to our example.
We have already added positional encoding to the words “Let’s” and “go”.
Creating Query Values
The first step is to multiply the position-encoded values for the word “Let’s” by a set of weights, which gives us a single value.
Next, we repeat the same process with a different set of weights, which gives us a second value (for example, 3.7).
We do this twice because we started with two position-encoded values representing the word “Let’s”.
These resulting values together represent “Let’s” in a new form.
In transformer terminology, these are called query values.
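The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the article’s actual numbers: the position-encoded values and the weights below are made up (only the 3.7 example value comes from the text), and each column of the weight matrix plays the role of one “set of weights.”

```python
import numpy as np

# Two position-encoded values representing "Let's" (illustrative numbers).
lets_encoded = np.array([1.16, 0.23])

# One set of weights per query value: each column of W_Q turns the two
# position-encoded inputs into one output value. Weights are made up here;
# in a real transformer they are learned during training.
W_Q = np.array([[0.54, -0.17],
                [0.93,  0.65]])

# Multiplying by both sets of weights at once gives the two query values
# that represent "Let's" in its new form.
query_lets = lets_encoded @ W_Q
print(query_lets)
```

We get two query values out because we started with two position-encoded values and used two sets of weights, matching the process described above.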
Creating Key Values
Now, we use these query values to measure similarity with other words, such as “go”.
To do this, we first create a new set of values for each word, similar to how we created the query values.
- We generate two values for “Let’s”
- And two values for “go”
These new values are called key values.
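Creating the key values works the same way, just with a different weight matrix. Again a hedged sketch: the position-encoded numbers and the key weights `W_K` below are invented for illustration. The important point is that every word is pushed through the *same* key weights, so each word ends up with its own pair of key values.

```python
import numpy as np

# Position-encoded values for each word (illustrative numbers).
encoded = {
    "Let's": np.array([1.16, 0.23]),
    "go":    np.array([0.57, 1.36]),
}

# A separate weight matrix, W_K, produces the key values.
# Its weights are unrelated to the query weights.
W_K = np.array([[0.63,  0.26],
                [-0.46, 0.79]])

# Each word gets two key values from the same set of weights.
keys = {word: vec @ W_K for word, vec in encoded.items()}
for word, key in keys.items():
    print(word, key)
```

With two key values per word and two query values for “Let’s”, everything is in place for the similarity comparison the next section previews.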
What’s Next?
We will use these key values along with the query values to calculate how similar “Let’s” is to “go”.
We will explore how this similarity is calculated in the next article.
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:
ipm install repo-name
… and you’re done! 🚀
The query/key intuition is one of the hardest pieces to land cleanly — most explanations either go full math or stay too hand-wavy. The framing that helped me click was treating queries as 'what am I looking for' and keys as 'what do I offer', with the dot product measuring relevance match. Once that lands, the scaling factor (sqrt(d_k)) makes sense as keeping variance bounded so softmax doesn't collapse. Looking forward to part 6 — the multi-head extension is where I see folks lose the thread.