Akhilesh

81. BERT: Understanding Language Deeply

Google Search used to work by matching keywords.

You type "jaguar speed." You get pages about the Jaguar car, because "speed" and "jaguar" appear together on car performance pages. The fact that you might mean the animal does not matter. Keywords do not carry context.

In 2019, Google upgraded its search to use BERT. Now when you type "can you get medicine for someone pharmacy," the model understands that "for someone" means you are picking up a prescription for another person, not buying it for yourself. That context completely changes the relevant results.

BERT is a transformer encoder pretrained on 3.3 billion words using a clever self-supervised objective: predict randomly masked words. No labels required. The pretraining forces the model to build deep contextual understanding of language. Then you fine-tune on your specific task with a small labeled dataset.

The results shifted the entire field. In November 2018, BERT achieved state-of-the-art on 11 NLP benchmarks simultaneously. A single model, pretrained once, beating specialized models on every task.


How BERT Was Pretrained

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from transformers import (BertTokenizer, BertModel, BertForSequenceClassification,
                           BertForMaskedLM, pipeline, AutoTokenizer, AutoModel)
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

torch.manual_seed(42)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

print("BERT Pretraining Objectives:")
print()
print("1. MASKED LANGUAGE MODEL (MLM)")
print("   Randomly mask 15% of tokens.")
print("   Predict the masked tokens.")
print("   Uses BOTH left and right context simultaneously.")
print("   This is the 'bidirectional' in Bidirectional Encoder Representations.")
print()
print("2. NEXT SENTENCE PREDICTION (NSP)")
print("   Given sentence A and sentence B:")
print("   Predict whether B follows A in the original text.")
print("   Trains the model to understand sentence relationships.")
print()

sentence = "The capital of France is Paris and it is a beautiful city."
tokens   = tokenizer.tokenize(sentence)
print(f"Original: '{sentence}'")
print(f"Tokens:   {tokens}")
print()

masked_tokens = tokens.copy()
np.random.seed(42)
mask_indices = np.random.choice(len(tokens), size=int(0.15 * len(tokens)), replace=False)
for idx in mask_indices:
    masked_tokens[idx] = "[MASK]"

print(f"Masked:   {masked_tokens}")
print(f"BERT must predict the masked tokens using all surrounding context.")
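
To see the masked language model objective in action, here is a minimal sketch using the fill-mask pipeline (assuming bert-base-uncased downloads successfully; the exact predictions may differ on your machine):

# Sketch: let pretrained BERT fill in a masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased", device=-1)

masked_sentence = "The capital of France is [MASK]."
predictions = fill_mask(masked_sentence, top_k=5)

print(f"Masked input: '{masked_sentence}'")
for pred in predictions:
    # Each prediction carries the proposed token and its probability.
    print(f"  {pred['token_str']:<12} {pred['score']:.3f}")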

BERT's Architecture

bert_configs = {
    "bert-base-uncased": {"layers": 12, "d_model": 768, "n_heads": 12, "params": "110M"},
    "bert-large-uncased": {"layers": 24, "d_model": 1024, "n_heads": 16, "params": "340M"},
    "bert-base-multilingual": {"layers": 12, "d_model": 768, "n_heads": 12, "params": "179M"},
    "distilbert-base-uncased": {"layers": 6,  "d_model": 768, "n_heads": 12, "params": "66M"},
}

print("BERT Model Variants:")
print(f"{'Model':<30} {'Layers':>8} {'d_model':>8} {'Heads':>8} {'Params':>10}")
print("=" * 68)
for name, cfg in bert_configs.items():
    print(f"{name:<30} {cfg['layers']:>8} {cfg['d_model']:>8} "
          f"{cfg['n_heads']:>8} {cfg['params']:>10}")

print()
bert = BertModel.from_pretrained("bert-base-uncased")
print(f"\nActual parameter count: {sum(p.numel() for p in bert.parameters()):,}")
print()

print("BERT Input Format:")
print("  [CLS] sentence_A [SEP] sentence_B [SEP]")
print()
text_a = "The cat sat on the mat."
text_b = "The dog played in the yard."
encoded = tokenizer(text_a, text_b)
tokens  = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(f"  Tokens: {tokens}")
print(f"  Type IDs: {encoded['token_type_ids']}")
print(f"    0 = sentence A, 1 = sentence B")
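
That sentence-pair encoding is exactly what the NSP head consumes. A minimal sketch with BertForNextSentencePrediction (the pretrained NSP head ships with bert-base-uncased; the example pairs are made up):

# Sketch: score whether sentence B plausibly follows sentence A.
from transformers import BertForNextSentencePrediction

nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
nsp_model.eval()

pairs = [
    ("The cat sat on the mat.", "It purred and fell asleep."),         # plausible continuation
    ("The cat sat on the mat.", "The stock market crashed in 2008."),  # unrelated
]

for sent_a, sent_b in pairs:
    enc = tokenizer(sent_a, sent_b, return_tensors="pt")
    with torch.no_grad():
        logits = nsp_model(**enc).logits
    # Index 0 = "B follows A", index 1 = "B is a random sentence".
    prob_is_next = torch.softmax(logits, dim=-1)[0, 0].item()
    print(f"P(is next) = {prob_is_next:.3f}  '{sent_a}' -> '{sent_b}'")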

Getting BERT Embeddings

bert.eval()

sentences = [
    "The bank is on the river.",
    "I deposited money at the bank.",
    "The cat sat on the mat.",
    "Dogs are friendly animals.",
]

all_cls_embeddings = []

for sentence in sentences:
    inputs  = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)

    cls_embedding = outputs.last_hidden_state[:, 0, :]
    all_cls_embeddings.append(cls_embedding.squeeze().numpy())

all_cls_embeddings = np.array(all_cls_embeddings)

from sklearn.decomposition import PCA
pca     = PCA(n_components=2)
emb_2d  = pca.fit_transform(all_cls_embeddings)

fig, ax = plt.subplots(figsize=(9, 6))
colors  = ["steelblue", "steelblue", "coral", "coral"]
for i, (sentence, color) in enumerate(zip(sentences, colors)):
    ax.scatter(emb_2d[i, 0], emb_2d[i, 1], color=color, s=120, zorder=5)
    ax.annotate(sentence[:30] + "...", (emb_2d[i, 0], emb_2d[i, 1]),
                fontsize=9, xytext=(8, 5), textcoords="offset points")

ax.set_title("BERT [CLS] Embeddings: Sentence Similarity in 2D", fontsize=12)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.savefig("bert_embeddings.png", dpi=150)
plt.show()

from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(all_cls_embeddings)
print("Cosine similarity between sentences:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f"  {sim_matrix[i,j]:.3f}  '{s1[:30]}''{s2[:30]}'")
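
Vanilla BERT's [CLS] vector was never trained for sentence similarity, so these scores can look surprisingly flat. A common alternative, sketched below, is mean pooling over the token embeddings using the attention mask (the idea popularized by sentence-transformers):

# Sketch: mean-pooled sentence embeddings, often better than raw [CLS] for similarity.
def mean_pool_embedding(sentence, model, tokenizer):
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()   # (1, seq_len, 1)
    # Average only over real tokens, ignoring padding.
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.squeeze().numpy()

mean_embeddings = np.array([mean_pool_embedding(s, bert, tokenizer) for s in sentences])
mean_sim = cosine_similarity(mean_embeddings)
print("Cosine similarity with mean pooling:")
print(np.round(mean_sim, 3))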

Fine-Tuning BERT for Text Classification

newsgroups = fetch_20newsgroups(
    subset="all",
    categories=["sci.space", "rec.sport.hockey",
                "talk.politics.guns", "comp.graphics"],
    remove=("headers", "footers", "quotes")
)

texts  = newsgroups.data[:1200]
labels = newsgroups.target[:1200]

texts  = [t[:512] for t in texts]
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

print(f"Training samples: {len(X_tr)}")
print(f"Test samples:     {len(X_te)}")
print(f"Classes: {newsgroups.target_names}")
print()

class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts     = texts
        self.labels    = labels
        self.tokenizer = tokenizer
        self.max_len   = max_len

    def __len__(self): return len(self.texts)

    def __getitem__(self, i):
        enc = self.tokenizer(
            self.texts[i],
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        return {
            "input_ids":      enc["input_ids"].squeeze(),
            "attention_mask": enc["attention_mask"].squeeze(),
            "label":          torch.tensor(self.labels[i], dtype=torch.long)
        }

train_ds = NewsDataset(X_tr, y_tr, tokenizer)
test_ds  = NewsDataset(X_te, y_te, tokenizer)

train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=32, shuffle=False)
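
A quick sanity check on one batch before training (shapes only; the values depend on the random split):

# Confirm the tensors coming out of the DataLoader have the expected shapes.
batch = next(iter(train_loader))
print(f"input_ids:      {batch['input_ids'].shape}")       # (16, 128)
print(f"attention_mask: {batch['attention_mask'].shape}")  # (16, 128)
print(f"labels:         {batch['label'].shape}")           # (16,)
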
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

bert_clf = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4
)
bert_clf = bert_clf.to(device)

optimizer = optim.AdamW(bert_clf.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0,
                                          end_factor=0.0, total_iters=len(train_loader)*3)

print("Fine-tuning BERT for news classification:")
print(f"{'Epoch':>6} {'Train Loss':>12} {'Train Acc':>10} {'Test Acc':>10}")
print("=" * 42)

for epoch in range(3):
    bert_clf.train()
    total_loss = correct = total = 0

    for batch in train_loader:
        input_ids      = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels_b       = batch["label"].to(device)

        optimizer.zero_grad()
        outputs = bert_clf(input_ids=input_ids,
                           attention_mask=attention_mask,
                           labels=labels_b)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(bert_clf.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()
        correct    += outputs.logits.argmax(1).eq(labels_b).sum().item()
        total      += labels_b.size(0)

    bert_clf.eval()
    t_correct = t_total = 0
    with torch.no_grad():
        for batch in test_loader:
            out = bert_clf(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device)
            )
            t_correct += out.logits.argmax(1).eq(batch["label"].to(device)).sum().item()
            t_total   += batch["label"].size(0)

    print(f"{epoch+1:>6} {total_loss/len(train_loader):>12.4f} "
          f"{correct/total:>10.2%} {t_correct/t_total:>10.2%}")

Output:

Fine-tuning BERT for news classification:
 Epoch   Train Loss  Train Acc   Test Acc
==========================================
     1       0.4823     86.77%     90.42%
     2       0.1912     94.12%     92.08%
     3       0.0934     97.19%     93.33%

93% accuracy after 3 epochs of fine-tuning on 960 training examples. Compare this to the Naive Bayes and TF-IDF classifiers from Phase 6, which typically land around 80-85% on similar tasks.
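
To reproduce that comparison on the identical train/test split, here is a minimal TF-IDF + LogisticRegression baseline sketch (the max_features and ngram_range settings are arbitrary choices, and the exact number will vary):

# Sketch: classical baseline on the same split for a fair comparison.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
X_tr_tfidf = vectorizer.fit_transform(X_tr)
X_te_tfidf = vectorizer.transform(X_te)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_tr_tfidf, y_tr)
print(f"TF-IDF + LogisticRegression accuracy: "
      f"{accuracy_score(y_te, baseline.predict(X_te_tfidf)):.2%}")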


Using the HuggingFace Pipeline API

print("Quick BERT usage with HuggingFace Pipelines:")
print()

sentiment_pipeline = pipeline("sentiment-analysis",
                                model="distilbert-base-uncased-finetuned-sst-2-english",
                                device=-1)

test_sentences = [
    "This movie was absolutely incredible!",
    "I hated every minute of that film.",
    "The food was okay, nothing special.",
    "Best experience I have ever had.",
    "Complete waste of money and time.",
]

print("Sentiment Analysis:")
for sentence in test_sentences:
    result = sentiment_pipeline(sentence)[0]
    print(f"  {result['label']:<10} {result['score']:.3f}  '{sentence}'")

Output:

Sentiment Analysis:
  POSITIVE   0.999  'This movie was absolutely incredible!'
  NEGATIVE   0.998  'I hated every minute of that film.'
  NEGATIVE   0.576  'The food was okay, nothing special.'
  POSITIVE   0.999  'Best experience I have ever had.'
  NEGATIVE   0.998  'Complete waste of money and time.'
ner_pipeline = pipeline("ner",
                          model="dbmdz/bert-large-cased-finetuned-conll03-english",
                          aggregation_strategy="simple",
                          device=-1)

ner_text = "Elon Musk founded SpaceX in Hawthorne, California in 2002."
entities = ner_pipeline(ner_text)

print(f"\nNamed Entity Recognition:")
print(f"Text: '{ner_text}'")
print("Entities found:")
for ent in entities:
    print(f"  {ent['entity_group']:<6} '{ent['word']}'  (score={ent['score']:.3f})")

Output:

Named Entity Recognition:
Text: 'Elon Musk founded SpaceX in Hawthorne, California in 2002.'
Entities found:
  PER    'Elon Musk'  (score=0.999)
  ORG    'SpaceX'     (score=0.998)
  LOC    'Hawthorne'  (score=0.987)
  LOC    'California' (score=0.996)

BERT for Semantic Search

def encode_sentences(sentences, model, tokenizer, device="cpu"):
    model.eval()
    embeddings = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt",
                            truncation=True, max_length=128,
                            padding=True).to(device)
        with torch.no_grad():
            out = model(**inputs)
        cls_emb = out.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
        embeddings.append(cls_emb)
    return np.array(embeddings)

from sklearn.metrics.pairwise import cosine_similarity

document_corpus = [
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "The stock market crashed in 2008 causing global recession.",
    "Python is the most popular programming language for data science.",
    "Transformers revolutionized natural language processing in 2017.",
    "The Amazon rainforest produces 20% of the world's oxygen.",
    "BERT stands for Bidirectional Encoder Representations from Transformers.",
    "Random forests combine many decision trees to improve accuracy.",
]

query = "How do neural networks understand text?"

doc_embeddings   = encode_sentences(document_corpus, bert, tokenizer)
query_embedding  = encode_sentences([query], bert, tokenizer)

similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
ranked       = sorted(enumerate(similarities), key=lambda x: -x[1])

print(f"Semantic Search: '{query}'")
print()
print(f"{'Rank':>5} {'Score':>8}  Document")
print("-" * 65)
for rank, (idx, score) in enumerate(ranked[:4], 1):
    print(f"{rank:>5} {score:>8.4f}  {document_corpus[idx]}")

What BERT Cannot Do

print("BERT limitations:")
print()
print("1. CANNOT GENERATE TEXT")
print("   BERT is an encoder. It understands, it does not write.")
print("   Bidirectional attention means no causal left-to-right generation.")
print()
print("2. FIXED CONTEXT WINDOW")
print("   bert-base: max 512 tokens.")
print("   Longer documents must be chunked or truncated.")
print()
print("3. EXPENSIVE INFERENCE")
print("   110M parameters for every prediction.")
print("   DistilBERT (66M) is 60% smaller and 40% faster with ~97% performance.")
print()
print("4. PRETRAINING DOMAIN MATTERS")
print("   bert-base trained on books + Wikipedia.")
print("   For medical text: BioBERT.")
print("   For legal text: LegalBERT.")
print("   For code: CodeBERT.")
print("   General BERT underperforms on specialized domains.")
print()
print("5. REQUIRES FINE-TUNING PER TASK")
print("   BERT with no task head outputs embeddings, not answers.")
print("   GPT-style models can answer questions with zero or few-shot prompting.")
print("   This prompted the shift toward GPT-style models for general AI assistants.")
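
For limitation 2, the usual workaround is splitting a long document into overlapping windows. A minimal sketch using the tokenizer's built-in striding (this needs the fast tokenizer from AutoTokenizer; the 200x repeated sentence is just filler to make the text long):

# Sketch: split a long document into overlapping 512-token windows.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

long_text = " ".join(["BERT can only attend to 512 tokens at a time."] * 200)
chunks = fast_tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,                        # 64-token overlap between consecutive windows
    return_overflowing_tokens=True,
    return_tensors="pt",
)
print(f"Document split into {chunks['input_ids'].shape[0]} windows of 512 tokens each")
# Each window is encoded separately; the per-window embeddings are then pooled (e.g., averaged).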

A Resource Worth Reading

The original BERT paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. (2018) from Google AI is essential reading. It clearly describes both pretraining objectives, the fine-tuning approach, and results on 11 benchmarks. One of the most cited papers in NLP history. Search "Devlin BERT pre-training deep bidirectional transformers 2018 arxiv."

Jay Alammar's "The Illustrated BERT, ELMo, and co." at jalammar.github.io visualizes the masking strategy, the input format, and the fine-tuning approach with characteristic clarity. The companion piece to his transformer visualization. Mandatory reading for understanding how pretraining works visually. Search "Jay Alammar illustrated BERT ELMo."


Try This

Create bert_practice.py.

Part 1: semantic similarity. Load BERT. Encode 10 pairs of sentences (5 similar, 5 unrelated). Compute cosine similarity for each pair. Do the similar pairs consistently score higher? Plot as a bar chart.

Part 2: masked language model. Load BertForMaskedLM. Take three sentences and mask different words (important nouns, verbs, adjectives). Print the top 5 predictions for each mask. Does BERT predict contextually appropriate words?

Part 3: fine-tuning. Fine-tune BertForSequenceClassification on the IMDB sentiment dataset (or any binary classification dataset). Train for 3 epochs. Report accuracy, precision, recall, and F1. Compare to TF-IDF + LogisticRegression from Phase 6.

Part 4: feature extraction. Use BERT as a frozen feature extractor. Extract [CLS] embeddings for all samples. Train an sklearn LogisticRegression on top. How does this compare to full fine-tuning in accuracy and training time?


What's Next

BERT reads text from both directions simultaneously. It understands language. But it cannot generate language.

GPT flips the design: decoder only, causal attention, trained to predict the next token. This one change is responsible for ChatGPT, text generation, code completion, and every generative AI application. Next post.
