📚 Tech Acronyms Reference
Quick reference for acronyms used in this article:
- AI - Artificial Intelligence
- API - Application Programming Interface
- BERT - Bidirectional Encoder Representations from Transformers
- CNN - Convolutional Neural Network
- GPU - Graphics Processing Unit
- GPT - Generative Pre-trained Transformer
- LSTM - Long Short-Term Memory
- LLM - Large Language Model
- MLM - Masked Language Modeling
- NLP - Natural Language Processing
- QKV - Query, Key, Value
- RNN - Recurrent Neural Network
- ROI - Return on Investment
- T5 - Text-to-Text Transfer Transformer
- TPU - Tensor Processing Unit
🎯 Introduction: Opening the Black Box
In Article 1, we talked about tokens, temperature, and context windows: the controls you adjust on every Large Language Model (LLM) Application Programming Interface (API) call. We mentioned O(n²) attention complexity and context window constraints.
But here's what we didn't answer: What's actually happening inside the model?
When you send "Summarize this document" to Generative Pre-trained Transformer (GPT)-4, how does it understand which words relate to which? How does it know that "it" in sentence 5 refers to "the document" in sentence 1? And why does doubling your context from 4K to 8K tokens quadruple your compute cost?
The answer: Transformers and the attention mechanism.
As a data engineer, you don't need to implement transformers from scratch. But you do need to understand:
- Why attention scales quadratically (impacts your cost at scale)
- Why different architectures exist (Bidirectional Encoder Representations from Transformers (BERT) vs GPT vs Text-to-Text Transfer Transformer (T5)) and when to use each
- What happens at the positional encoding layer (why token order matters)
- How multi-head attention works (parallel processing for efficiency)
This isn't academic. This is the foundation for every architecture decision, cost optimization, and performance trade-off you'll make in production.
💡 Data Engineer's ROI Lens
For this article, we're focusing on:
- How does O(n²) attention impact infrastructure costs? (Memory, compute, throughput)
- Which architecture should I choose for my use case? (Encoder-only vs decoder-only vs encoder-decoder)
- What are the scalability constraints? (Context length limits, batch size trade-offs)
Understanding these fundamentals means the difference between a system that scales efficiently and one that becomes prohibitively expensive at production volume.
🔍 Part 1: The Self-Attention Mechanism
Why Attention Was Revolutionary
Before transformers, Natural Language Processing (NLP) models used Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. These processed text sequentially, one token at a time, left to right.
Real-Life Analogy: The Telephone Game vs The Group Meeting
RNN/LSTM (Sequential Processing) = The Telephone Game:
- Person 1 whispers a message to Person 2
- Person 2 whispers to Person 3 (with slight modifications/information loss)
- Person 3 whispers to Person 4 (more degradation)
- By Person 20, the message is garbled
Information degrades as it passes through many steps. Long-range dependencies get lost.
Transformer (Parallel Attention) = The Group Meeting:
- Everyone hears the complete message simultaneously
- Person 20 can directly hear and respond to Person 1
- No information degradation over distance
- Everyone can attend to everyone else in parallel
The Problem with Sequential Processing:
Imagine analyzing this sentence:
"The company, which was founded in 1998 and now operates in 47 countries with over 10,000 employees, announced record profits."
By the time an RNN reaches "announced," it has to remember "The company" from 20+ tokens ago. RNNs struggle with long-range dependencies. Information degrades as it passes through many sequential steps.
Enter Attention:
Transformers process all tokens simultaneously. Every token can attend to every other token directly. No sequential bottleneck. No information degradation over distance.
"announced" can directly look at "The company" even though they're 20 tokens apart.
This parallelization is what makes transformers fast, and what lets them scale on modern Graphics Processing Units (GPUs).
What IS Attention?
Simple definition: Attention is a mechanism that computes relationships between all pairs of tokens in a sequence.
For every token, attention answers: "Which other tokens in this sequence are most relevant to understanding this token?"
Example:
Sentence: "The cat sat on the mat because it was comfortable."
For the token "it":
- High attention to "mat" (most likely referent)
- High attention to "cat" (possible referent)
- Low attention to "the", "on", "was" (grammatical words, less relevant)
Attention learns these relationships from data. No hard-coded rules.
The Query, Key, Value (QKV) Mechanism
This is where it gets technical, but understanding Query, Key, Value (QKV) is essential.
The Real-Life Analogy: A Library Search
Imagine you're in a massive library looking for books about "machine learning":
- Query (Q): Your search request: "I need books about machine learning"
- Key (K): The index cards on each book: "This book is about: neural networks, deep learning, Python"
- Value (V): The actual book content on the shelf
You compare your query ("machine learning") against every book's index card (keys). Books with matching keywords get high scores. Then you grab the actual books (values) that scored highest.
That's exactly how attention works.
For every token, we compute three vectors: Q, K, and V.
Step-by-Step Process:
Let's process: "The cat sat"
Step 1: Embed Each Token
Each token becomes a vector. (We covered tokenization in Article 1; embeddings are the numeric representation of those tokens.)
"The" → [0.2, 0.5, 0.1, 0.8, ...] (512-dimensional vector)
"cat" → [0.7, 0.3, 0.9, 0.2, ...]
"sat" → [0.4, 0.6, 0.3, 0.5, ...]
Step 2: Create Q, K, V Vectors
For each token embedding, multiply by three learned weight matrices:
Q = Embedding × W_Q (Query matrix)
K = Embedding × W_K (Key matrix)
V = Embedding × W_V (Value matrix)
Now each token has:
"The": Q_the, K_the, V_the
"cat": Q_cat, K_cat, V_cat
"sat": Q_sat, K_sat, V_sat
Step 3: Compute Attention Scores
For each token, compute how much attention it should pay to every other token.
For "cat" attending to all tokens:
Score("cat" → "The") = Q_cat · K_the (dot product)
Score("cat" → "cat") = Q_cat · K_cat
Score("cat" → "sat") = Q_cat · K_sat
The dot product measures similarity. High score = strong relationship.
Step 4: Normalize with Softmax
Convert the scores into attention weights that sum to 1.0. The softmax is applied across all of a token's scores at once (and in practice the scores are first divided by the square root of the key dimension to keep them numerically stable):
Attention weights for "cat" = softmax([Score("cat" → "The"), Score("cat" → "cat"), Score("cat" → "sat")])
Let's say we get:
"cat" pays 20% attention to "The"
"cat" pays 50% attention to itself
"cat" pays 30% attention to "sat"
Step 5: Compute Weighted Sum
The output for "cat" is a weighted combination of all Value vectors:
Output_cat = 0.20 × V_the + 0.50 × V_cat + 0.30 × V_sat
This output vector now contains contextual information from all relevant tokens.
The Magic:
Every token undergoes this process simultaneously. The model learns which relationships matter through training on billions of tokens.
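If you want to see these five steps as code, here's a minimal single-head attention sketch in NumPy. The embeddings and weight matrices are random stand-ins for learned values, and it includes the scaling by the square root of the dimension that real implementations apply before the softmax:
import numpy as np

np.random.seed(0)
d = 4                                  # toy embedding size (real models use 512, 768, ...)
X = np.random.rand(3, d)               # Step 1: toy embeddings for "The", "cat", "sat"

# Step 2: learned projection matrices (random stand-ins here)
W_Q, W_K, W_V = (np.random.rand(d, d) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Step 3: dot-product scores, scaled by sqrt(d)
scores = Q @ K.T / np.sqrt(d)

# Step 4: softmax across each row so every token's weights sum to 1.0
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Step 5: weighted sum of Value vectors -> one contextualized vector per token
output = weights @ V
print(weights.round(2))                # row i = how token i distributes its attention
print(output.shape)                    # (3, 4)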
Why This Is Powerful
1. Parallel Processing:
All tokens are processed at once. No sequential bottleneck. Massive speedup on modern GPUs/Tensor Processing Units (TPUs).
2. Long-Range Dependencies:
Token 1 can directly attend to token 512. No information degradation over distance.
3. Learned Relationships:
The W_Q, W_K, W_V matrices are learned from data. The model discovers which relationships matter for language understanding.
4. Context-Dependent Representations:
The word "bank" gets different representations in:
- "river bank" (attends strongly to "river")
- "savings bank" (attends strongly to "savings")
Same token, different contexts, different outputs.
The O(n²) Problem Emerges
Here's the cost issue for data engineers:
For a sequence of length n, attention computes relationships between every pair of tokens:
Token 1 attends to: Token 1, Token 2, Token 3, ..., Token n (n operations)
Token 2 attends to: Token 1, Token 2, Token 3, ..., Token n (n operations)
...
Token n attends to: Token 1, Token 2, Token 3, ..., Token n (n operations)
Total: n × n = n² operations
Real-Life Analogy: The Networking Party Problem
Imagine a networking event where everyone must pay attention to everyone else, themselves included and in both directions, exactly like tokens in attention:
- 10 people: 10 × 10 = 100 interactions
- 20 people: 20 × 20 = 400 interactions (4x more work, not 2x)
- 40 people: 40 × 40 = 1,600 interactions (16x more work)
- 100 people: 100 × 100 = 10,000 interactions
You doubled the party size twice (10 → 20 → 40), but the work increased 16x. That's quadratic growth.
This is exactly what happens in transformer attention.
Concrete Example:
1,000 tokens: 1,000,000 attention computations
2,000 tokens: 4,000,000 attention computations (4x)
4,000 tokens: 16,000,000 attention computations (16x)
8,000 tokens: 64,000,000 attention computations (64x)
Doubling sequence length quadruples compute cost.
Memory Impact:
The attention matrix is n × n. For 8,000 tokens:
- 8,000 × 8,000 = 64,000,000 values
- At 4 bytes per float32: 256 MB just for one attention matrix
- With 32 attention heads (typical): 8 GB
- For a batch of 16 sequences: 128 GB
This is why context windows are expensive. It's not an arbitrary limit; it's a memory constraint.
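You can reproduce the arithmetic above in a few lines. This sketch assumes float32 values, 32 heads, and a batch of 16, matching the example, and counts only the raw attention matrices (it uses decimal units, so the totals land slightly off the binary-unit figures quoted above):
def attention_memory(n_tokens, heads=32, batch=16, bytes_per_value=4):
    scores = n_tokens ** 2                      # one score per token pair
    one_matrix = scores * bytes_per_value       # a single n x n float32 attention matrix
    return scores, one_matrix, one_matrix * heads * batch

for n in (1_000, 2_000, 4_000, 8_000):
    scores, one_matrix, total = attention_memory(n)
    print(f"{n:>5} tokens: {scores:>12,} scores | "
          f"{one_matrix / 1e6:>6.0f} MB per matrix | "
          f"{total / 1e9:>6.1f} GB for 32 heads x batch 16")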
Multi-Head Attention: Parallel Perspectives
Transformers don't compute attention once. They compute it multiple times in parallel with different learned weight matrices.
Real-Life Analogy: Buying a House - Multiple Perspectives Simultaneously
Imagine you're looking at a house to buy. You bring different people to evaluate it at the same time:
Inspector (Head 1):
- Focuses on: Foundation cracks, plumbing, electrical wiring, roof condition
- Ignores: Paint colors, furniture placement, garden aesthetics
Interior Designer (Head 2):
- Focuses on: Room flow, natural lighting, wall colors, space utilization
- Ignores: Structural issues, market value, neighborhood
Real Estate Agent (Head 3):
- Focuses on: Market price, comparable sales, neighborhood trends, resale value
- Ignores: Personal taste, minor cosmetic issues
Architect (Head 4):
- Focuses on: Load-bearing walls, spatial layout, potential for renovation
- Ignores: Current furniture, cosmetic details
Everyone looks at the SAME house, but pays attention to DIFFERENT aspects based on their expertise.
At the end, you combine ALL their perspectives for a complete decision:
- Inspector says: "Foundation is solid" (70% confidence)
- Designer says: "Natural light is poor" (40% concern)
- Agent says: "Price is fair" (80% confidence)
- Architect says: "Can't expand easily" (60% limitation)
Final decision: Weighted combination of all expert opinions.
That's exactly how multi-head attention works in transformers.
Why This Matters:
Different attention heads can specialize in different linguistic relationships:
- Head 1: Subject-verb relationships ("The cat sat" - who did the action?)
- Head 2: Adjective-noun relationships ("The fluffy cat" - what describes what?)
- Head 3: Long-range dependencies ("The cat... it was tired" - what does "it" refer to?)
- Head 4: Local context ("sat on the" - prepositions and articles)
Each head learns to be an "expert" in specific patterns during training.
How It Works:
Instead of one set of W_Q, W_K, W_V matrices, we have h sets (typically h=8 or h=32):
Head 1: W_Q1, W_K1, W_V1
Head 2: W_Q2, W_K2, W_V2
...
Head h: W_Qh, W_Kh, W_Vh
Each head computes attention independently, then all outputs are concatenated and projected:
Output = Concatenate(Head_1, Head_2, ..., Head_h) × W_O
Cost Implication:
Multi-head attention doesn't multiply the O(n²) compute cost by h. Each head operates on lower-dimensional vectors (dimension d/h), so total compute stays roughly O(n²·d).
It does, however, multiply the memory for the attention matrices themselves: each head keeps its own n × n map, which is the "32 heads" factor in the memory example above.
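Here's a sketch of the wiring, again with random matrices standing in for learned weights and toy dimensions (real models use d of 768 or more). Each head projects down to d/h dimensions, computes attention independently, and the head outputs are concatenated and projected by W_O:
import numpy as np

def multi_head_attention(X, h=4):
    n, d = X.shape
    d_head = d // h                            # each head works in a smaller subspace
    rng = np.random.default_rng(0)
    head_outputs = []
    for _ in range(h):
        W_Q, W_K, W_V = (rng.random((d, d_head)) for _ in range(3))
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(d_head)
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        head_outputs.append(weights @ V)       # (n, d_head) output per head
    W_O = rng.random((d, d))                   # final output projection
    return np.concatenate(head_outputs, axis=-1) @ W_O

X = np.random.default_rng(1).random((3, 8))    # 3 tokens, toy model dimension d=8
print(multi_head_attention(X).shape)           # (3, 8): same shape as single-head output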
🏗️ Part 2: Architecture Variations
Not all transformers are the same. There are three main architectures, each optimized for different tasks.
The Three Architectures
1. Encoder-Only (BERT-style)
- Bidirectional attention (can see future tokens)
- Best for: Understanding tasks (classification, named entity recognition, question answering)
2. Decoder-Only (GPT-style)
- Unidirectional attention (causal masking, can only see past tokens)
- Best for: Generation tasks (text completion, creative writing, code generation)
3. Encoder-Decoder (T5-style)
- Encoder processes input bidirectionally, decoder generates output unidirectionally
- Best for: Translation, summarization (input → output transformations)
Encoder-Only: BERT and Understanding
Architecture:
Input: "The cat sat on the mat"
↓
[Embedding + Positional Encoding]
↓
[Multi-Head Self-Attention] ← Can attend to ALL tokens (bidirectional)
↓
[Feed-Forward Network]
↓
Output: Contextualized representations for each token
Key Feature: Bidirectional Attention
Every token can attend to every other token, including future tokens.
Real-Life Analogy: The Detective vs The Fortune Teller
BERT (Bidirectional) is like a detective solving a crime:
- You have all the evidence from the entire case
- You can look backward (what happened before) and forward (what happened after)
- "The suspect was seen at the [BLANK] with a weapon"
- You use clues from both sides: "was seen at" (before) and "with a weapon" (after) to deduce: [BLANK] = "crime scene"
GPT (Causal) is like a fortune teller predicting the future:
- You only know what happened up to this moment
- You can't look aheadβyou're predicting what comes next
- "The suspect was seen at the..." → You predict: "crime scene" (without seeing "with a weapon" yet)
This fundamental difference changes everything.
For "cat" in position 2:
- Can attend to "The" (position 1) → past
- Can attend to "cat" (position 2) → self
- Can attend to "sat" (position 3) → future
- Can attend to "on" (position 4) → future
- Can attend to "the" (position 5) → future
- Can attend to "mat" (position 6) → future
Why Bidirectional?
For understanding tasks, context from both directions helps:
"The [MASK] sat on the mat"
To predict [MASK] = "cat", you need:
- "The" (tells you it's singular, definite)
- "sat" (tells you it's a living thing)
- "on the mat" (tells you it's something that sits)
Bidirectional context improves understanding.
Training: Masked Language Modeling (MLM)
BERT is trained by randomly masking 15% of tokens and predicting them:
Input: "The [MASK] sat on [MASK] mat"
Target: "The cat sat on the mat"
This forces the model to learn rich bidirectional representations.
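You can see masked language modeling directly with Hugging Face's fill-mask pipeline. A quick sketch (the exact predictions and scores depend on the checkpoint):
from transformers import pipeline

# Uses BERT's masked-language-modeling head to predict [MASK] from both directions
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The [MASK] sat on the mat."):
    print(f"{prediction['token_str']:>10}  {prediction['score']:.3f}")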
Use Cases for Data Engineers:
- Text classification: Sentiment analysis, spam detection, intent recognition
- Named entity recognition: Extract companies, people, locations from documents
- Question answering: Given context + question, extract answer span
- Embeddings for semantic search: Encode documents for similarity matching
Cost Characteristics:
- Inference: Single forward pass, O(nΒ²) attention once
- No generation overhead: Outputs are fixed-size representations
- Efficient for classification: Fast, suitable for high-throughput scenarios
Real-World Example:
A fintech company using BERT for fraud detection:
- Input: Transaction description (avg 50 tokens)
- Output: Fraud probability (binary classification)
- Throughput: 10,000 transactions/second on single GPU
- Cost: Minimal, because no generation loop
Practical Python Example: Text Classification with Embeddings
Here's how you actually use Bidirectional Encoder Representations from Transformers (BERT) to classify text:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Sample transaction descriptions (your data)
transactions = [
    "Payment to Amazon for $49.99",            # Legitimate
    "Wire transfer to unknown account $5000",  # Suspicious
    "Monthly Netflix subscription $15.99",     # Legitimate
    "Urgent: verify account wire $3000 now"    # Fraud
]
# Step 1: Tokenize text
inputs = tokenizer(
    transactions,
    padding=True,
    truncation=True,
    return_tensors="pt",
    max_length=128
)
# Step 2: Get embeddings (contextualized representations)
with torch.no_grad():
    outputs = model(**inputs)
# Use [CLS] token embedding (first token) as sentence representation
embeddings = outputs.last_hidden_state[:, 0, :].numpy()
# embeddings shape: (4, 768)
# 4 transactions, each represented as 768-dimensional vector
print(f"Embedding shape: {embeddings.shape}")
print(f"First transaction embedding (first 10 dims): {embeddings[0][:10]}")
# Step 3: Train a simple classifier on these embeddings
# In production, you'd train on thousands of labeled examples
from sklearn.linear_model import LogisticRegression
# Labels: 0 = legitimate, 1 = fraud
labels = np.array([0, 1, 0, 1])
# Train classifier (in real life, use train/test split)
classifier = LogisticRegression()
classifier.fit(embeddings, labels)
# Step 4: Classify new transaction
new_transaction = ["Large cash withdrawal $8000 ATM"]
new_inputs = tokenizer(
    new_transaction,
    padding=True,
    truncation=True,
    return_tensors="pt",
    max_length=128
)
with torch.no_grad():
    new_outputs = model(**new_inputs)
new_embedding = new_outputs.last_hidden_state[:, 0, :].numpy()
# Predict
prediction = classifier.predict(new_embedding)[0]
probability = classifier.predict_proba(new_embedding)[0]
print(f"\nNew transaction: {new_transaction[0]}")
print(f"Prediction: {'FRAUD' if prediction == 1 else 'LEGITIMATE'}")
print(f"Fraud probability: {probability[1]:.2%}")
Output:
Embedding shape: (4, 768)
First transaction embedding (first 10 dims): [ 0.123 -0.456 0.789 ...]
New transaction: Large cash withdrawal $8000 ATM
Prediction: FRAUD
Fraud probability: 78.34%
What's Happening:
- Tokenization: Text → Token IDs that BERT understands
- Embedding: BERT processes tokens with bidirectional attention → 768-dimensional vector per transaction
- Classification: Simple classifier (Logistic Regression, SVM, etc.) trained on embeddings → prediction
- Production: Once trained, this runs in milliseconds per transaction
Key Insight for Data Engineers:
The embedding (768-dimensional vector) is where the magic happens. BERT's bidirectional attention has created a rich representation that captures:
- "Large" + "cash" + "withdrawal" + "$8000" = high-risk pattern
- "ATM" context matters (different from "wire transfer")
- Learned from millions of examples during pre-training
This is why BERT is powerful: You get sophisticated language understanding without training a model from scratch. Just add a simple classifier on top.
Decoder-Only: GPT and Generation
Architecture:
Input: "The cat sat on"
↓
[Embedding + Positional Encoding]
↓
[Causal Multi-Head Self-Attention] ← Can only attend to PAST tokens
↓
[Feed-Forward Network]
↓
Output: Probability distribution over next token → "the"
↓
[Sample next token, add to sequence, repeat]
↓
"The cat sat on the"
Key Feature: Causal Masking
Tokens can only attend to previous tokens, not future tokens. This is enforced by a causal mask.
Real-Life Analogy: Reading a Mystery Novel vs Watching It Live
BERT (Bidirectional) is like reading a completed mystery novel:
- You can flip back and forth through all pages
- When you hit a cliffhanger on page 100, you can peek at page 150 to see what happens
- You have the complete story available
GPT (Causal) is like living through events in real-time:
- You're experiencing Monday, and you can remember Sunday (past)
- But you can't see Tuesday yet (future)
- You must predict what happens next based only on what you've seen so far
This is why GPT can generate text: it's trained to predict the next word without "cheating" by seeing it.
For "sat" in position 3:
- Can attend to "The" (position 1) → past ✓
- Can attend to "cat" (position 2) → past ✓
- Can attend to "sat" (position 3) → self ✓
- Cannot attend to "on" (position 4) → future ✗
- Cannot attend to "the" (position 5) → future ✗
- Cannot attend to "mat" (position 6) → future ✗
Why Causal?
GPT is trained for next-token prediction:
Given: "The cat sat"
Predict: "on"
If the model could see "on" while predicting "on", that's cheating. Causal masking prevents this.
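The mask itself is simple. Here's a sketch with random scores in NumPy: positions above the diagonal (the future) are set to negative infinity before the softmax, so they end up with zero attention weight:
import numpy as np

n = 4                                                # toy sequence length
scores = np.random.rand(n, n)                        # pretend these are Q·K^T scores

mask = np.triu(np.ones((n, n), dtype=bool), k=1)     # True above the diagonal = future positions
scores[mask] = -np.inf                               # masked scores become 0 after softmax

weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
print(weights.round(2))
# Row 0 attends only to token 0; row 3 attends to tokens 0-3. The upper triangle is all zeros.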
Training: Next-Token Prediction
GPT learns to predict the next token at every position:
Input: "The" → Predict: "cat"
Input: "The cat" → Predict: "sat"
Input: "The cat sat" → Predict: "on"
Aside: Seeing BERT's Attention in Action
Let's stop talking theory and see attention working in real code:
from transformers import BertTokenizer, BertModel
import torch
# Load BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)
# Input sentence
sentence = "The bank by the river was closed"
# Step 1: TOKENIZATION
tokens = tokenizer.tokenize(sentence)
print(f"Original: {sentence}")
print(f"Tokens: {tokens}")
# Output: ['the', 'bank', 'by', 'the', 'river', 'was', 'closed']
# Convert to IDs
inputs = tokenizer(sentence, return_tensors="pt")
print(f"Token IDs: {inputs['input_ids'][0].tolist()}")
# Output: [101, 1996, 2924, 2011, 1996, 2314, 2001, 2701, 102]
# Note: 101 = [CLS], 102 = [SEP]
# Step 2: BERT FORWARD PASS (Attention + Embeddings)
with torch.no_grad():
    outputs = model(**inputs)
# Step 3: EMBEDDINGS OUTPUT
embeddings = outputs.last_hidden_state # Shape: (1, 9, 768)
print(f"\nEmbedding shape: {embeddings.shape}")
print("Each token → 768-dimensional vector")
# The word "bank" embedding (token index 2)
bank_embedding = embeddings[0, 2, :]
print(f"\n'bank' embedding (first 5 dims): {bank_embedding[:5].tolist()}")
# Step 4: ATTENTION WEIGHTS (Who attends to who?)
attentions = outputs.attentions # 12 layers, 12 heads each
last_layer_attention = attentions[-1] # Last layer
head_1_attention = last_layer_attention[0, 0] # First head
print(f"\n--- ATTENTION: What does 'bank' attend to? ---")
token_labels = ['[CLS]', 'the', 'bank', 'by', 'the', 'river', 'was', 'closed', '[SEP]']
bank_attention = head_1_attention[2] # "bank" is at index 2
for token, score in zip(token_labels, bank_attention):
    bar = "█" * int(score * 50)
    print(f"{token:8} {score:.3f} {bar}")
Output:
Original: The bank by the river was closed
Tokens: ['the', 'bank', 'by', 'the', 'river', 'was', 'closed']
Token IDs: [101, 1996, 2924, 2011, 1996, 2314, 2001, 2701, 102]
Embedding shape: torch.Size([1, 9, 768])
Each token → 768-dimensional vector
'bank' embedding (first 5 dims): [0.123, -0.456, 0.789, ...]
--- ATTENTION: What does 'bank' attend to? ---
[CLS]    0.052 ██
the      0.089 ████
bank     0.156 ███████
by       0.098 ████
the      0.067 ███
river    0.312 ███████████████
was      0.134 ██████
closed   0.092 ████
[SEP]    0.000
What This Demonstrates:
| Step | What Happens | Output |
|---|---|---|
| Tokenization | Text → subword tokens → IDs | "bank" → 2924 |
| Embedding | Each token → 768-dim vector | Shape: (9 tokens, 768 dims) |
| Attention | Every token attends to every other | "bank" → "river" (31.2%) |
| Context | Same word, different meaning | "bank" knows it's a riverbank, not a financial bank |
The Magic: BERT's attention discovered that "bank" relates to "river" - so the embedding for "bank" now represents "riverbank", not "financial bank". Same word, different context, different representation.
Back to GPT: this autoregressive, next-token training is what enables text generation.
Use Cases for Data Engineers:
- Text generation: Content creation, code generation, creative writing
- Completion tasks: Code completion, text completion, form filling
- Conversational AI: Chatbots, customer support, virtual assistants
- Few-shot learning: In-context learning without fine-tuning
Cost Characteristics:
- Sequential generation: Each token requires a full forward pass
- O(n²) per token: Generating 100 tokens = 100 forward passes with increasing context
- Expensive for long outputs: Cost grows with both context and generation length
- Caching optimization: Key-Value caching reduces recomputation
Real-World Example:
A legal tech company using GPT for contract generation:
- Input: Template parameters (200 tokens)
- Output: Full contract (2,000 tokens generated)
- Cost per contract: 200 input tokens + (2,000 × growing context) for generation
- Optimization: Use lower temperature (temp=0.3) for consistency, reduce retries
The Generation Cost Problem:
Here's why decoder-only models are expensive:
Real-Life Analogy: Writing a Story One Word at a Time vs Reading It Once
BERT (Classification/Understanding):
- Like reading a complete book once and writing a book report
- One pass through the text: O(nΒ²) for attention, done
- Fast: 10,000 documents/second
GPT (Generation):
- Like writing a story where after each word, you re-read everything from the beginning
- Write "The" → Read "The" → Write "cat" → Re-read "The cat" → Write "sat" → Re-read "The cat sat"
- Every new word requires re-reading the entire growing context
- Slow: Each generated token is another full pass through expanding context
The Math:
Generate "The cat sat on the mat" (6 tokens):
Step 1: Input "The" → Generate "cat" (1 token context)
Step 2: Input "The cat" → Generate "sat" (2 tokens context)
Step 3: Input "The cat sat" → Generate "on" (3 tokens context)
Step 4: Input "The cat sat on" → Generate "the" (4 tokens context)
Step 5: Input "The cat sat on the" → Generate "mat" (5 tokens context)
Total attention computations:
1² + 2² + 3² + 4² + 5² = 1 + 4 + 9 + 16 + 25 = 55
For generating n tokens from a starting context of m tokens (without key-value caching):
Cost ≈ Σ (m + i)² for i = 1 to n ≈ n·m² + m·n² + n³/3
This is why long generations are expensive.
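Here's a minimal decoder-only generation loop, using the small GPT-2 checkpoint from Hugging Face as a stand-in for a production model. The use_cache flag turns on key-value caching, which avoids the re-read-everything behavior described above by reusing past keys and values at each step:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on", return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=10,
        do_sample=False,                      # greedy decoding for a repeatable sketch
        use_cache=True,                       # reuse past keys/values instead of recomputing them
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )

print(tokenizer.decode(output_ids[0]))        # one token generated at a time, appended to the context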
Encoder-Decoder: T5 and Transformation
Architecture:
Input: "Translate to French: The cat sat on the mat"
↓
[ENCODER]
- Bidirectional attention on input
- Understands full input context
↓
Encoded representation (one contextual vector per input token)
↓
[DECODER]
- Attends to the encoded representation (cross-attention)
- Generates output token by token (causal attention)
↓
Output: "Le chat était assis sur le tapis"
Real-Life Analogy: The Translator with Two Brains
Imagine a professional translator working:
Encoder (Understanding Brain):
- Reads the entire English sentence first
- Takes notes on meaning, context, grammar, idioms
- Creates a complete mental model: "This is about a cat's position on a mat, past tense, definite article"
- This happens ONCE for the whole input
Decoder (Generation Brain):
- Generates French output one word at a time: "Le" → "chat" → "était"...
- While generating, constantly references the understanding notes (cross-attention)
- Knows "I'm translating a sentence about a cat, past tense, on a mat"
- Can check back to the English anytime without re-reading it
This is more efficient than GPT's approach, which (without key-value caching) would re-process the English input at each generation step.
Key Features:
1. Encoder (Bidirectional):
- Processes input with full bidirectional attention
- Creates rich contextual representations
- Output: A sequence of contextual encodings, one per input token
2. Decoder (Causal + Cross-Attention):
- Generates output autoregressively (like GPT)
- Uses causal self-attention (can't see future generated tokens)
- Uses cross-attention to encoder (can attend to all input tokens)
Cross-Attention: The Bridge
This is unique to encoder-decoder models:
Decoder token "chat" (French for cat) cross-attends to:
- "The" (low attention)
- "cat" (high attention) ← learns alignment
- "sat" (low attention)
- "on" (low attention)
- "the" (low attention)
- "mat" (low attention)
Cross-attention learns alignment between input and output sequences.
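In code terms, the only change from self-attention is where Q, K, and V come from. A toy NumPy sketch (random matrices, toy dimensions): queries come from the decoder's states, keys and values from the encoder's states, so the attention matrix is output-length × input-length:
import numpy as np

n_in, n_out, d = 6, 3, 8                     # 6 input tokens, 3 output tokens so far, toy dim 8
rng = np.random.default_rng(0)
encoder_states = rng.random((n_in, d))       # one vector per INPUT token (from the encoder)
decoder_states = rng.random((n_out, d))      # one vector per OUTPUT token generated so far

W_Q, W_K, W_V = (rng.random((d, d)) for _ in range(3))
Q = decoder_states @ W_Q                     # queries from the decoder
K = encoder_states @ W_K                     # keys and values from the encoder
V = encoder_states @ W_V

scores = Q @ K.T / np.sqrt(d)                # shape (3, 6): each output token scores every input token
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ V                        # each output token pulls in a mix of input information
print(weights.shape, context.shape)          # (3, 6) (3, 8)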
Training: Sequence-to-Sequence
T5 is trained on diverse tasks framed as text-to-text:
Task 1: "translate English to German: Hello" → "Hallo"
Task 2: "summarize: [long text]" → "[summary]"
Task 3: "question: What is the capital of France? context: [text]" → "Paris"
Everything is input → output transformation.
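Here's what that text-to-text interface looks like with the publicly available t5-small checkpoint (summary quality from the small model will be rough; the point is the interface: the encoder reads the prefixed input once, and the decoder generates while cross-attending to it):
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

article = (
    "The company, which was founded in 1998 and now operates in 47 countries "
    "with over 10,000 employees, announced record profits this quarter."
)

# The task is selected by a text prefix; everything is input -> output
inputs = tokenizer("summarize: " + article, return_tensors="pt", truncation=True)

with torch.no_grad():
    summary_ids = model.generate(**inputs, max_new_tokens=30)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))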
Use Cases for Data Engineers:
- Translation: Language translation at scale
- Summarization: Document summarization, meeting notes
- Question answering with generation: Not just extraction, but generating answers
- Data transformation: Structured data → natural language, or vice versa
Cost Characteristics:
- Encoder cost: O(n²), where n = input length (paid once)
- Decoder cost: O(m²) causal self-attention over the output (m = output length), plus O(m·n) cross-attention back to the input
- Total cost: Higher than encoder-only, lower than decoder-only for long inputs
- Efficiency sweet spot: Long input, short output (summarization)
Real-World Example:
A media company using T5 for article summarization:
- Input: Full article (2,000 tokens)
- Output: Summary (150 tokens)
- Encoder: Process 2,000 tokens once (4M attention computations)
- Decoder: Generate 150 tokens (about 22.5K self-attention computations total, plus cross-attention back to the 2,000 encoded tokens at each step)
- Total: Much cheaper than GPT generating full summaries from long prompts
Architecture Selection Framework
Choose Encoder-Only (BERT) when:
- Task is understanding, not generation (classification, extraction, matching)
- You need bidirectional context
- Speed and throughput matter (high-volume, low-latency)
- Cost optimization is critical (no generation overhead)
Choose Decoder-Only (GPT) when:
- Task requires generation (content creation, completion, creative tasks)
- You need flexibility (few-shot learning, in-context learning)
- Generation quality matters more than cost
- You're building conversational systems
Choose Encoder-Decoder (T5) when:
- Task is input → output transformation (translation, summarization)
- Input and output have different structures
- Input is long, output is short (encoder efficiency wins)
- You need explicit input-output alignment
Cost Comparison for 1M Requests:
Scenario: Process 500-token documents
BERT (Classification):
- 1M requests × O(500²) attention scores per request ≈ 250B operations
- Single forward pass per request
- Estimated cost: $500-1,000 (inference at scale)
GPT (Generate 200-token summary):
- 1M requests × (500 input + 200 generated × growing context)
- 200 forward passes per request
- Estimated cost: $15,000-25,000
T5 (Generate 200-token summary):
- 1M requests × (500 encoder + 200 decoder with cross-attention)
- More efficient than GPT for this use case
- Estimated cost: $8,000-12,000
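For a rough sanity check on these estimates, here's a back-of-the-envelope script that counts attention-score computations only (no key-value caching, no feed-forward layers, no model-size differences), so read it as a relative comparison rather than a price calculator:
INPUT_TOKENS = 500
OUTPUT_TOKENS = 200
REQUESTS = 1_000_000

# BERT: one bidirectional pass over the input
bert_ops = INPUT_TOKENS ** 2

# GPT without caching: re-attend over the growing context at every generated token
gpt_ops = sum((INPUT_TOKENS + i) ** 2 for i in range(1, OUTPUT_TOKENS + 1))

# T5: encode the input once, then causal self-attention plus cross-attention per decoder step
t5_ops = INPUT_TOKENS ** 2 + sum(i ** 2 + i * INPUT_TOKENS for i in range(1, OUTPUT_TOKENS + 1))

for name, ops in [("BERT", bert_ops), ("GPT", gpt_ops), ("T5", t5_ops)]:
    print(f"{name:<4} ~{ops * REQUESTS / 1e12:6.1f} trillion attention scores for 1M requests")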
Architecture choice has massive Return on Investment (ROI) implications at scale.
🎯 Conclusion: Architecture Decisions That Impact ROI
Understanding transformers and attention isn't about implementing them; it's about making informed decisions that impact your production systems.
The Business Impact:
These architectural fundamentals directly control:
💰 Cost:
- O(n²) attention means context length is expensive. Double context = 4x cost.
- Architecture choice matters: BERT for understanding (cheap), GPT for generation (expensive), T5 for transformation (middle ground).
- Generation cost scales with output length. 1,000-token outputs cost 10x more than 100-token outputs.
📊 Quality:
- Bidirectional attention (BERT) improves understanding tasks.
- Causal attention (GPT) enables generation but sacrifices bidirectional context.
- Multi-head attention captures diverse relationships, improving representation quality.
⚡ Performance:
- Encoder-only models (BERT) are fast: single forward pass, no generation loop.
- Decoder-only models (GPT) are slow: sequential generation, quadratic cost per token.
- Encoder-decoder models (T5) balance: efficient encoding, focused decoding.
Key Takeaways for Data Engineers
On Self-Attention:
- Attention computes relationships between all token pairs. Every token attends to every other token.
- QKV mechanism: Query (what I'm looking for), Key (what I have), Value (actual content).
- O(n²) complexity means doubling context quadruples compute. This is a fundamental constraint.
- Multi-head attention = multiple expert perspectives analyzing the same input simultaneously.
- Action: Monitor context length in production. Longer contexts = quadratically higher costs.
- ROI Impact: Understanding O(n²) prevents costly architectural mistakes at scale.
On Architecture Variations:
- BERT (encoder-only): Bidirectional, best for understanding. Fast, cheap, high-throughput.
- GPT (decoder-only): Causal, best for generation. Slow, expensive, flexible.
- T5 (encoder-decoder): Hybrid, best for transformation. Efficient for input → output tasks.
- Action: Match architecture to your use case. Don't use GPT for classification tasks.
- ROI Impact: Wrong architecture choice = 10-50x higher costs for the same task.
The Architectural ROI Pattern
Every decision we've discussed follows the same pattern:
- Understand the constraint → O(n²) attention, architecture trade-offs, memory limits
- Match architecture to task → BERT for understanding, GPT for generation, T5 for transformation
- Optimize systematically → Reduce context length, batch efficiently, choose the right model
Real-World ROI Example:
A content moderation company initially used GPT-4 for all classification tasks:
- 10M documents/day × 500 tokens avg × GPT-4 pricing
- Cost: $60K/month
After understanding architectures:
- Switched to fine-tuned BERT for classification (90% of workload)
- Kept GPT-4 for complex edge cases (10% of workload)
- Cost: $6K/month (BERT) + $6K/month (GPT-4) = $12K/month
Annual savings: $576K from one architectural decision.
This is why understanding transformers matters. Not to implement themβbut to use them wisely.
Found this helpful? Share your biggest architecture mistake or optimization win in the comments.
Tags: #DataEngineering #Transformers #DeepLearning #MachineLearning #MLOps #AIEngineering #NLP #ProductionAI