Jyoti Prajapati

Transformer Series - Blog #2 Embeddings: Turning Tokens into Vectors

This is the second post in a series where we build Transformers and Vision Transformers from the ground up.
In the previous post, we covered the neural network fundamentals required for Transformers.
In this post, we focus on embeddings: the step that makes attention possible.

Why Embeddings Matter More Than Attention

Attention often gets all the credit in Transformers.

But attention operates on vectors, not on words, tokens, or pixels.

Before a Transformer can reason, compare, or attend to anything, we must answer a simple question:

How do we represent discrete symbols in a neural network?

The answer is embeddings. Without embeddings, attention has nothing to work on.

Neural Networks Don’t Understand Tokens

Neural networks operate on:

  • Real-valued vectors
  • Continuous spaces
  • Matrix multiplication

They do not understand words, characters, tokens, or categories.

If we feed raw token IDs directly into a Transformer, the model will assume numeric relationships that do not exist.

For example:

  • Token ID 10 is not "twice" token ID 5
  • Token ID 1000 is not "larger" than token ID 10 in any semantic sense

We need a representation that removes this artificial ordering.

One-Hot Encoding: The Simplest Idea

The most basic way to represent tokens is one-hot encoding. If the vocabulary size is 𝑉:

  • Each token is a vector of length 𝑉
  • Exactly one position is 1, the rest are 0

Example:
"cat" β†’ [0, 0, 1, 0, 0, ...]
"dog" β†’ [0, 1, 0, 0, 0, ...]

Problems with One-Hot Encoding

One-hot vectors:

  • Are extremely sparse
  • Do not encode similarity
  • Scale poorly with large vocabularies

From a neural network’s perspective, "cat" and "dog" are just as different as "cat" and "car".

This is not useful for learning language.
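
You can verify this directly: the dot product between any two distinct one-hot vectors is zero, so every pair of different tokens looks equally unrelated. A tiny sketch (vocabulary size and token positions made up for illustration):

```python
import numpy as np

V = 5                                                      # toy vocabulary size
cat, dog, car = np.eye(V)[2], np.eye(V)[1], np.eye(V)[3]   # one-hot rows

# Every pair of distinct one-hot vectors is orthogonal,
# so all pairwise "similarities" are identically zero.
print(np.dot(cat, dog))  # 0.0 -- "cat" vs "dog"
print(np.dot(cat, car))  # 0.0 -- "cat" vs "car": exactly the same score
```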

Dense Embeddings: Learning Meaningful Representations

Instead of fixed one-hot vectors, we learn dense embeddings.

An embedding maps each token to a vector in a lower-dimensional continuous space:

token → 𝑅^𝑑

Where:
𝑑 is the embedding dimension (e.g. 128, 512, 768)

Example:
"cat" β†’ [0.12, -0.31, 0.78, ...]
"dog" β†’ [0.15, -0.28, 0.74, ...]

Now:

  • Similar words have similar vectors
  • Distance and direction carry meaning
  • Neural networks can operate naturally
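
A small numeric sketch makes this concrete. The "cat" and "dog" values below come from the example above, while "car" is a hand-picked toy vector; none of these are outputs of a trained model, but they show how cosine similarity can now separate related tokens from unrelated ones:

```python
import numpy as np

# Hand-picked toy embeddings (not from a real model), d = 3 for readability.
cat = np.array([0.12, -0.31, 0.78])
dog = np.array([0.15, -0.28, 0.74])
car = np.array([-0.60, 0.45, 0.05])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, dog))  # close to 1.0 -> similar
print(cosine(cat, car))  # much lower  -> dissimilar
```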

Figure 1: One-hot vectors are sparse and encode no similarity. Dense embeddings are compact and capture semantic relationships.

Embedding Layer = Lookup Table + Learning

An embedding layer is conceptually simple.
It is just a matrix:

𝐸 ∈ 𝑅^(𝑉×𝑑)

where:
𝑉: vocabulary size

𝑑: embedding dimension

Given a token ID 𝑖, the embedding is:

π‘₯ = 𝐸[𝑖]

That’s it.
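
As a minimal sketch (assuming PyTorch here; the post itself doesn’t commit to a framework), the entire layer is a weight matrix plus row indexing:

```python
import torch
import torch.nn as nn

V, d = 10_000, 512                        # vocabulary size, embedding dimension
emb = nn.Embedding(V, d)                  # E in R^(V x d), randomly initialised

token_ids = torch.tensor([42, 7, 42])     # arbitrary token IDs for illustration
x = emb(token_ids)                        # shape: (3, 512)

# The forward pass is literally a row lookup into the weight matrix:
print(torch.equal(x[0], emb.weight[42]))  # True
```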

During training:

  • Gradients flow into this matrix
  • Embeddings are learned end-to-end
  • Meaning emerges from task optimization

There is no separate "semantic training phase": embeddings learn meaning because the model needs it.
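
A tiny sketch of that training behaviour (again assuming PyTorch, with a made-up stand-in loss): after a backward pass, only the rows of 𝐸 that were actually looked up receive gradient.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)       # tiny E: 10 tokens, 4 dimensions
ids = torch.tensor([2, 7])      # only tokens 2 and 7 appear in this "batch"

loss = emb(ids).sum()           # stand-in for a real task loss
loss.backward()

# Gradients reach only the rows that were used in the forward pass.
print(emb.weight.grad[2])       # non-zero
print(emb.weight.grad[0])       # all zeros -- token 0 was never looked up
```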

Why Attention Needs Embeddings

Attention computes similarity using dot products:

Attention(𝑄, 𝐾, 𝑉) ∝ 𝑄𝐾^𝑇

Dot products only make sense if:

  • Inputs live in the same vector space
  • Dimensions align
  • Geometry is meaningful

Embeddings provide exactly that.
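
Here is what that looks like in code: a minimal sketch (PyTorch assumed; the projection matrices are random placeholders, just to show that queries and keys derived from embeddings live in the same space):

```python
import torch

torch.manual_seed(0)
seq_len, d = 4, 8                 # toy sequence length and embedding dimension

X = torch.randn(seq_len, d)       # token embeddings (random stand-ins)
W_q = torch.randn(d, d)           # query projection, randomly initialised
W_k = torch.randn(d, d)           # key projection, randomly initialised

Q, K = X @ W_q, X @ W_k           # queries and keys share the same d-dimensional space
scores = Q @ K.T / d ** 0.5       # scaled dot-product similarities
print(scores.shape)               # torch.Size([4, 4]) -- one score per token pair
```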

Without embeddings:

  • Queries and keys would be meaningless
  • Similarity scores would be arbitrary
  • Attention would collapse

Figure 2: Embeddings map discrete tokens into a shared vector space where similarity and attention become meaningful.

Token Embeddings Are Only Part of the Story

So far, embeddings only tell us what the token is.

They do not tell us:

  • Where the token appears in the sequence
  • Whether it comes before or after another token

Transformers solve this using positional information, which we’ll cover in the next post.

For now, remember:

  • Token embeddings encode identity and meaning
  • Position must be injected separately

Embeddings Are Learned, Not Hardcoded

A common misconception is that embeddings are fixed or predefined.

In Transformers:

  • Embeddings are trainable parameters
  • They adapt to the task
  • They change during fine-tuning

This is why:

  • The same word has different embeddings in different models
  • Domain-specific fine-tuning changes representation quality

Key Takeaways

  • Neural networks cannot operate on discrete symbols
  • One-hot encoding is insufficient for learning semantics
  • Dense embeddings map tokens into continuous vector spaces
  • Embedding layers are trainable lookup tables
  • Attention relies entirely on embeddings to function

What’s Next?

In the next post, we’ll cover Positional Encoding: how Transformers understand order without recurrence or convolution.
