
🧠 Tokens in Transformers — Developer Notes

🔹 What is a Token?

A token is the smallest unit of text that a transformer model processes.

It is created by a tokenizer and then converted into numerical IDs before entering the model.

⚠️ Important: a token is not always a full word.


🔹 What Can Be a Token?

Depending on the tokenizer, a token may be:

  • a whole word
  • a subword (most common)
  • a character
  • punctuation
  • a special symbol

✅ Modern transformers mainly use subword tokenization.


🔹 Example

Sentence:

```
I like eating apples
```

Possible subword tokens:

```
[I] [like] [eat] [##ing] [apple] [##s]
```
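
You can reproduce a split like this in code. A minimal sketch using the Hugging Face transformers library; bert-base-uncased is an assumed example checkpoint, and the exact pieces depend on its vocabulary:

```python
# pip install transformers
from transformers import AutoTokenizer

# Example checkpoint (an assumption): any tokenizer works here,
# and different vocabularies will split the sentence differently.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("I like eating apples"))
# WordPiece output; subword continuations are marked with "##"
```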

🔹 Transformer Processing Pipeline

```
Raw Text → Tokenizer → Tokens → Token IDs → Embeddings → Transformer
```

Neural networks only understand numbers, so tokens must be converted to IDs and then to vectors.
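
Each arrow in the pipeline maps to a concrete call in the Hugging Face transformers API. A sketch (bert-base-uncased is only an illustrative checkpoint):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "I like eating apples"                  # raw text
tokens = tokenizer.tokenize(text)              # tokenizer -> tokens
ids = tokenizer.convert_tokens_to_ids(tokens)  # tokens -> token IDs
inputs = tokenizer(text, return_tensors="pt")  # IDs as tensors, plus special tokens
outputs = model(**inputs)                      # embeddings -> transformer layers
```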


🔹 Why Tokenization Is Needed

Tokenization helps to:

  • reduce vocabulary size
  • handle unknown words
  • capture morphology
  • improve generalization
  • enable efficient training
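
The "handle unknown words" point is easy to demonstrate: a word missing from the vocabulary is broken into known subwords instead of being dropped. A sketch assuming a WordPiece tokenizer such as bert-base-uncased:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare word falls back to subword pieces instead of a single [UNK] token
print(tokenizer.tokenize("untranslatability"))
```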

🔹 Special Tokens (Encoder Models)

Typical encoder input:

```
[CLS] I like apples [SEP]
```

Roles:

  • [CLS] → sentence-level representation
  • [SEP] → separator between sentences
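
Encoder tokenizers add these special tokens automatically. A quick way to see them (again assuming bert-base-uncased):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer("I like apples")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Something like ['[CLS]', 'i', 'like', 'apples', '[SEP]'], depending on the vocab
```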

🧠 Token in Vector Form

🔹 Hidden Size Rule

If a Transformer model has:

```
hidden_size = 768
```

✅ Every token is represented by a 768-dimensional vector inside the model.

  • Fixed by the architecture
  • Independent of sentence length
  • Same dimension for all tokens
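
The hidden size is part of the model configuration, so it can be read without loading any weights. Sketch (bert-base-uncased is an assumed example; its base variant uses 768):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size)  # 768 for the base variant
```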

🔹 For n Tokens

If tokenization produces n tokens, the representation matrix is:

```
H ∈ ℝ^(n × 768)
```

Meaning:

  • n → number of tokens (rows)
  • 768 → hidden dimension (columns)
  • Each row → contextual vector of one token
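
This matrix is what an encoder returns as last_hidden_state, with a leading batch dimension in front of the n × 768 block. A sketch, assuming bert-base-uncased:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I like eating apples", return_tensors="pt")
H = model(**inputs).last_hidden_state
print(H.shape)  # torch.Size([1, n, 768]): batch of 1, n tokens, 768 dims each
```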

🔹 Example

Sentence → tokenized into 7 subword tokens.

Then:

```
H ∈ ℝ^(7 × 768)
```

✔ 7 token vectors
✔ each vector size = 768


🔹 Important Clarification

❌ Tokens do NOT multiply with 768
✅ Token vectors are stacked

Correct view:

```
token₁ → ℝ^768
token₂ → ℝ^768
...
token₇ → ℝ^768
```

Stacked as:

```
H = [token₁_vec
     token₂_vec
     ...
     token₇_vec]  ∈ ℝ^(7 × 768)
```
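
In tensor terms this is a stack, not a product. A minimal PyTorch sketch with random placeholder vectors:

```python
import torch

# One 768-dim vector per token (random values stand in for real embeddings)
token_vecs = [torch.randn(768) for _ in range(7)]

H = torch.stack(token_vecs)  # rows are stacked; nothing is multiplied
print(H.shape)               # torch.Size([7, 768])
```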

🔹 Shape Through Encoder Layers

In encoder models (like DeBERTa):

```
Input shape  = (n × 768)
After layer1 = (n × 768)
After layer2 = (n × 768)
...
```

✅ Shape stays constant
✅ Only values become more contextual
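
You can verify this by requesting every intermediate state from the model. A sketch assuming bert-base-uncased; DeBERTa checkpoints expose the same interface:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I like eating apples", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)

# Embedding output plus one entry per encoder layer, all with the same shape
for i, h in enumerate(outputs.hidden_states):
    print(i, tuple(h.shape))  # every entry: (1, n, 768)
```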


🔹 Mental Model

Number of tokens → number of rows
Hidden size → number of columns

In models like DeBERTa-base, each token is mapped to a fixed 768-dimensional vector, so a sentence with n tokens produces a representation matrix of shape (n × 768).


🔹 Important Interview Points

  • Token ≠ word
  • Most transformers use subword tokens
  • The tokenizer is model-specific
  • Tokens are converted to IDs before embeddings
  • Similar tokens → similar vectors (in context)

🔹 Common Tokenizer Types

| Model Family   | Tokenizer     |
| -------------- | ------------- |
| BERT / DeBERTa | WordPiece     |
| GPT family     | BPE           |
| LLaMA          | SentencePiece |
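
The difference between these schemes shows up if you run the same word through two tokenizers. A sketch with public checkpoints; the exact splits depend on each vocabulary:

```python
from transformers import AutoTokenizer

# WordPiece (BERT) vs. byte-level BPE (GPT-2)
for name in ["bert-base-uncased", "gpt2"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.tokenize("unbelievable"))
```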

🔹 One-Line Mental Model

Token = smallest text unit the transformer understands.


🔹 Ultra-Short Interview Answer

A token is the smallest textual unit produced by a tokenizer and converted into numerical form so that a transformer model can process input text.
