🧠 Tokens in Transformers: Developer Notes
🔹 What is a Token?
A token is the smallest unit of text that a transformer model processes.
It is created by a tokenizer and then converted into numerical IDs before entering the model.
⚠️ Important: a token is not always a full word.
🔹 What Can Be a Token?
Depending on the tokenizer, a token may be:
- whole word
- subword (most common)
- character
- punctuation
- special symbol
✅ Modern transformers mainly use subword tokenization.
🔹 Example
Sentence:
I like eating apples
Possible subword tokens:
[I] [like] [eat] [##ing] [apple] [##s]
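
The exact splits depend on the tokenizer's vocabulary, and the quickest way to see real output is to run a tokenizer directly. A minimal sketch using the Hugging Face transformers library and the bert-base-uncased checkpoint (both are illustrative assumptions, not requirements of these notes):

```python
from transformers import AutoTokenizer

# Assumption: transformers is installed and bert-base-uncased can be downloaded.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words may stay whole; rarer words are split into subwords marked with "##".
print(tokenizer.tokenize("I like eating apples"))
print(tokenizer.tokenize("tokenization"))  # typically something like ['token', '##ization']
```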
🔹 Transformer Processing Pipeline
Raw Text → Tokenizer → Tokens → Token IDs → Embeddings → Transformer
Neural networks only understand numbers, so tokens must be converted to IDs and then to vectors.
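
A sketch of the same pipeline in code (again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Raw Text -> Tokens -> Token IDs
inputs = tokenizer("I like eating apples", return_tensors="pt")
print(inputs["input_ids"])      # integer IDs, shape (1, n)

# Token IDs -> Embeddings (lookup in the model's input embedding table)
with torch.no_grad():
    embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)         # (1, n, hidden_size)
```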
🔹 Why Tokenization Is Needed
Tokenization helps to:
- reduce vocabulary size
- handle unknown words
- capture morphology
- improve generalization
- enable efficient training
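
Two of these points are easy to check directly. A quick sketch reusing the tokenizer from the examples above (the invented word is purely illustrative):

```python
# The subword vocabulary is small and fixed, yet an invented word is still
# typically covered by known subword pieces rather than collapsing to [UNK].
print(len(tokenizer))                             # ~30k entries for bert-base-uncased
print(tokenizer.tokenize("hyperquantumization"))  # falls back to known subwords
```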
🔹 Special Tokens (Encoder Models)
Typical encoder input:
[CLS] I like apples [SEP]
Roles
- [CLS] → sentence-level representation
- [SEP] → separator between sentences
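
The tokenizer inserts these special tokens automatically; a small sketch, again assuming the BERT tokenizer from the earlier examples:

```python
# Encoding a sentence pair shows both [CLS] and the [SEP] separators.
encoded = tokenizer("I like apples", "They are sweet")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected pattern: ['[CLS]', ..., '[SEP]', ..., '[SEP]']
```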
Token in Vector Form
🔹 Hidden Size Rule
If a Transformer model has:
hidden_size = 768
→ Every token is represented by a 768-dimensional vector inside the model.
- Fixed by architecture
- Independent of sentence length
- Same dimension for all tokens
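
The hidden size can be read straight from the model config; a sketch assuming the microsoft/deberta-base checkpoint:

```python
from transformers import AutoConfig

# hidden_size is fixed by the architecture, not by the input text.
config = AutoConfig.from_pretrained("microsoft/deberta-base")
print(config.hidden_size)   # 768 for the base model
```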
🔹 For n Tokens
If tokenization produces n tokens, the representation matrix is:
H ∈ ℝ^(n × 768)
Meaning
- n → number of tokens (rows)
- 768 → hidden dimension (columns)
- Each row → contextual vector of one token
🔹 Example
Sentence → tokenized into:
7 subword tokens
Then:
H ∈ ℝ^(7 × 768)
→ 7 token vectors
→ each vector size = 768
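
The same shape shows up on a real forward pass. A sketch reusing the BERT model and inputs from the pipeline example; the exact token count n depends on how the sentence is split and includes [CLS] and [SEP]:

```python
# Reuses `torch`, `model`, and `inputs` from the pipeline sketch above.
with torch.no_grad():
    outputs = model(**inputs)

# (batch=1, n, 768): one 768-dimensional contextual vector per token.
print(outputs.last_hidden_state.shape)
```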
🔹 Important Clarification
❌ The tokens are NOT multiplied by 768; (7 × 768) is a matrix shape
✅ Token vectors are stacked row by row
Correct view:
token₁ ∈ ℝ^768
token₂ ∈ ℝ^768
...
token₇ ∈ ℝ^768
Stacked as:
H = [token₁_vec
     token₂_vec
     ...
     token₇_vec] ∈ ℝ^(7 × 768)
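
The stacking itself is nothing more than placing the per-token vectors into the rows of one matrix; an illustration with random vectors (PyTorch assumed):

```python
import torch

# Illustration only: 7 per-token vectors of size 768, stacked row by row.
token_vecs = [torch.randn(768) for _ in range(7)]
H = torch.stack(token_vecs)   # rows = tokens, columns = hidden dimensions
print(H.shape)                # torch.Size([7, 768])
```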
🔹 Shape Through Encoder Layers
In encoder models (like DeBERTa):
Input shape = (n × 768)
After layer 1 = (n × 768)
After layer 2 = (n × 768)
...
✅ Shape stays constant
✅ Only the values become more contextual
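
This is easy to verify by asking the model for every layer's output; a sketch reusing the model and inputs from the earlier forward pass:

```python
# Reuses `torch`, `model`, and `inputs` from the pipeline sketch above.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states = embedding output plus one tensor per encoder layer.
# Every entry keeps the same (1, n, 768) shape; only the values change.
for i, h in enumerate(outputs.hidden_states):
    print(i, h.shape)
```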
🔹 Mental Model
Number of tokens → number of rows
Hidden size → number of columns
In models like DeBERTa-base, each token is mapped to a fixed 768-dimensional vector, so a sentence with n tokens produces a representation matrix of shape (n × 768).
🔹 Important Interview Points
- Token ≠ word
- Most transformers use subword tokens
- Tokenizer is model-specific
- Tokens are converted to IDs before embeddings
- Similar tokens → similar vectors (in context)
🔹 Common Tokenizer Types
| Model Family | Tokenizer |
|---|---|
| BERT / DeBERTa | WordPiece |
| GPT family | BPE |
| LLaMA | SentencePiece |
🔹 One-Line Mental Model
Token = smallest text unit the transformer understands.
🔹 Ultra-Short Interview Answer
A token is the smallest textual unit produced by a tokenizer and converted into numerical form so that a transformer model can process input text.