Have you ever wondered how a computer knows that a cat is more like a dog than a car?
To a machine, words are just strings of characters or arbitrary ID numbers. But in the world of Natural Language Processing, we’ve found a way to give words a home in a multi-dimensional space, where a word’s neighbors are its semantic relatives.
In this post, we’ll explore the fascinating world of word embeddings. We’ll start with the intuition (no deep technical dives), build up a clear understanding of what word embeddings really are (with code along the way), and see how they enable AI systems to capture meaning and relationships in human language.
The Magic of Word Math: Static Embeddings
Imagine if you could do math with ideas. The classic example in the world of embeddings is:
King - Man + Woman ≈ Queen
This isn’t just a clever trick! This is the power of Static Embeddings like GloVe (Global Vectors for Word Representation).
GloVe works by scanning massive amounts of text and counting how often words appear near each other. It then assigns each word a single, fixed numerical vector. Because these vectors capture meaning from those co-occurrence statistics, semantically similar words end up close together in the vector space.

King is closer to queen than to man or woman.
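If you want to try this word math yourself, here is a quick sketch using gensim’s downloader and its pre-packaged GloVe vectors (the package name glove-wiki-gigaword-100 is just one of the available GloVe downloads, not part of the original setup, so treat it as an example choice):

```python
# A minimal sketch of the "word math" above, using pre-trained GloVe vectors
# from gensim-data (the exact package name is an example choice).
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first run

# king - man + woman ≈ ?
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Semantic similarity: related vs. unrelated words
print(glove.similarity("king", "queen"))  # relatively high
print(glove.similarity("king", "car"))    # relatively low
```

With these vectors, the top result for the analogy is typically “queen”, and the similarity scores show “king” sitting much closer to “queen” than to an unrelated word like “car”.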
The Bank Problem: When One Vector Isn’t Enough
As powerful as static models like GloVe are, they have a blind spot called polysemy: words with multiple meanings.
Think about the word “bank”:
- I need to go to the bank to deposit some money. (A financial institution).
- We sat on the bank of the river. (The edge of a river).

Bank vs river bank (two different meanings of one word).
In a static model like GloVe, the word “bank” has one single, fixed vector. That vector is effectively an average over all the contexts the model saw during training, so the model can’t truly distinguish between a place where you keep your savings and the grassy side of a river.
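You can see this blind spot directly with the same GloVe vectors. The sketch below (reusing the glove object loaded in the earlier snippet) shows that the lookup for “bank” ignores the sentence entirely, and its nearest neighbours hint at which sense dominates that single averaged vector:

```python
# Sketch: a static model has exactly one vector per word, no matter the context
# (reuses the `glove` vectors loaded in the previous snippet).
vec_in_money_sentence = glove["bank"]  # "... bank to deposit money."
vec_in_river_sentence = glove["bank"]  # "... bank of the river."

# The lookup never sees the sentence, so the two vectors are identical.
print((vec_in_money_sentence == vec_in_river_sentence).all())  # True

# The nearest neighbours reveal which sense dominates the single averaged vector.
print(glove.most_similar("bank", topn=5))
```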
The Solution: Contextual Embeddings with BERT
This is where Dynamic or Contextual Embeddings, like BERT (Bidirectional Encoder Representations from Transformers), have changed the game. Unlike GloVe, BERT doesn’t just look up a word in a fixed dictionary. It looks at the entire sentence to generate a unique vector for a word every single time it appears.
When BERT processes our two bank sentences, it recognizes the surrounding words (like “river” or “deposit”) and generates two completely different vectors. It understands that the context changes the core identity of the word.
Here is a simple example of using BERT with PyTorch to inspect these contextual embeddings:
```python
import torch
from transformers import BertTokenizer, BertModel

# Load tokenizer and model from a local copy of BERT
# (a Hub name such as 'bert-base-uncased' would download the weights instead)
tokenizer = BertTokenizer.from_pretrained('./bert_model')
model_bert = BertModel.from_pretrained('./bert_model')

# Switch to evaluation mode (disables dropout; we are not training)
model_bert.eval()

def print_token_embeddings(sentence, label):
    """
    Tokenizes a sentence, runs it through BERT,
    and prints the first 5 values of each token's embedding.
    """
    # Tokenize input
    inputs = tokenizer(sentence, return_tensors="pt")

    # Forward pass without gradient tracking
    with torch.no_grad():
        outputs = model_bert(**inputs)

    # Contextual embeddings for each token
    embeddings = outputs.last_hidden_state[0]

    # Convert token IDs back to readable tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    # Print token + partial embedding
    print(f"\n--- {label} ---")
    for token, vector in zip(tokens, embeddings):
        print(f"{token:<12} {vector.numpy()[:5]}")

# Example sentences
sentence1 = "I went to the bank to deposit money."
sentence2 = "We sat on the bank of the river."

# Compare contextual embeddings
print_token_embeddings(sentence1, "Sentence 1")
print_token_embeddings(sentence2, "Sentence 2")
```
Output
The output shows that BERT assigns different vectors to the word bank based on its surrounding context.
Sentence 1: I went to the bank to deposit money.

Sentence 2: We sat on the bank of the river.
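To go beyond eyeballing the first few dimensions, here is a sketch that reuses the tokenizer and model_bert objects from the code above, pulls out the “bank” token from each sentence, and compares the two vectors with cosine similarity. It assumes “bank” appears as a single token, which holds for standard BERT vocabularies:

```python
# Sketch: quantify how different the two "bank" embeddings are
# (reuses `tokenizer` and `model_bert` from the code above).
import torch

def get_bank_embedding(sentence, target_token="bank"):
    """Return the contextual embedding of the first occurrence of target_token."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model_bert(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(target_token)          # position of "bank" in the sentence
    return outputs.last_hidden_state[0, idx]  # its contextual vector

bank_money = get_bank_embedding("I went to the bank to deposit money.")
bank_river = get_bank_embedding("We sat on the bank of the river.")

similarity = torch.nn.functional.cosine_similarity(bank_money, bank_river, dim=0)
print(f"Cosine similarity between the two 'bank' vectors: {similarity.item():.3f}")
```

A value noticeably below 1.0 confirms that the two occurrences of “bank” really do receive different embeddings.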
Which Model Should You Use?
Choosing the right embedding depends entirely on your specific task and your available computational resources.
Static Embeddings (like GloVe) are the best choice when you need a fast, computationally lightweight solution with a small memory footprint. They are perfect for straightforward tasks like document classification, where the broader meaning of words is usually sufficient.
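As a concrete sketch of that lightweight setup (reusing the gensim GloVe vectors loaded earlier), you can represent a whole document as the average of its word vectors and feed that fixed-size vector into any standard classifier:

```python
# Sketch: a lightweight document representation for classification,
# built by averaging static GloVe vectors (reuses `glove` from earlier).
import numpy as np

def document_vector(text):
    """Average the GloVe vectors of all in-vocabulary words in the text."""
    words = [w for w in text.lower().split() if w in glove]
    if not words:
        return np.zeros(glove.vector_size)
    return np.mean([glove[w] for w in words], axis=0)

doc = "the central bank raised interest rates again this quarter"
features = document_vector(doc)  # one fixed-size vector per document
print(features.shape)            # ready to feed into any standard classifier
```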
On the other hand, Contextual Embeddings (such as BERT) are necessary when your task requires a deep understanding of language and ambiguity, such as question answering or advanced chatbots. They excel at handling words with multiple meanings, which is often the key to an application’s success. However, keep in mind that they require more computational power and a larger memory footprint.
Wrapping Up
Embeddings are the foundation of how AI reads and processes our human world. Whether you are using a pre-trained model like BERT or building a simple embedding model from scratch using PyTorch’s nn.Embedding layer, you are essentially building a bridge between human thought and machine calculation.
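For the from-scratch route, an embedding layer is just a trainable lookup table. Here is a minimal sketch; the vocabulary size, embedding dimension, and word IDs are arbitrary placeholder values:

```python
# Minimal sketch of a trainable embedding layer in PyTorch.
# Vocabulary size, embedding dimension, and word IDs are placeholder values.
import torch
import torch.nn as nn

vocab_size = 10_000   # number of words in your vocabulary
embedding_dim = 50    # size of each word vector

embedding = nn.Embedding(vocab_size, embedding_dim)

# A "sentence" as word IDs (normally produced by your own tokenizer/vocabulary)
word_ids = torch.tensor([12, 345, 7, 99])
vectors = embedding(word_ids)  # shape: (4, 50)
print(vectors.shape)

# These vectors start out random and are learned during training,
# unlike the pre-trained GloVe or BERT vectors discussed above.
```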
What do you think? If you were training a model from scratch today, what specific vocabulary or niche topic would you want it to learn first? Let me know in the comments 👇.
Note: All illustrations in this post were generated using DALL·E 3.
Quick Quiz
Let’s test your understanding. Share your answer in the comments 👇.
How does text data differ from image data in machine learning?
References
- Devlin, J. et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
- Stanford NLP Group. GloVe: Global Vectors for Word Representation.
- Spot Intelligence. GloVe Embeddings Explained.

