Pretraining, Prompting, Sampling, and Alignment
By the end of this post, you'll understand what an LLM actually learns during pretraining (ontologies, math, pronoun resolution, all of it) and why this happens from nothing more than predicting the next word. You'll know the three architectural families of LLMs (decoder-only, encoder-only, encoder-decoder) and when each one fits the job. You'll see how unrelated tasks like sentiment analysis, question answering, and classification all get cast as conditional generation. You'll understand prompting, in-context learning, and why system prompts are longer than you'd expect. You'll know the difference between greedy decoding, random sampling, and temperature sampling, and why the obvious strategy is actually a bad one. Finally, you'll understand the three stages of training that take a raw pretrained model and turn it into something useful and safe: pretraining, instruction tuning, and preference alignment.
Why This Post Is Different
This post is dedicated entirely to LLMs. We'll treat the transformer as a black box for now. You don't need to know how attention works to understand how LLMs behave, what they learn, how they're trained, and where they break down. Transformers come later. For now, just treat the architecture as "something that takes context in and returns a probability distribution over the next word."
This turns out to be the right level of abstraction for getting intuition. Take 8 minutes before diving in and watch 3Blue1Brown's explainer on LLMs; it'll make everything that follows click faster.
Defining an LLM (The Unusual Way)
The NLP textbook gives a definition I hadn't seen before:
A large language model is a computational agent that can interact conversationally with people using natural language.
Notice what this leaves out. No mention of parameter counts. No mention of transformers. No mention of depth or training data size. The definition is about behavior, not architecture. An LLM is something you can talk to.
This is unusual, but it's actually correct for the purposes of this post. The concepts here (prompting, sampling, alignment, hallucination) are all about how the model behaves, and they apply regardless of the specific architecture underneath.
Similar to N-grams, Different from N-grams
LLMs and n-gram models share the core task: assign probabilities to sequences and generate text by sampling from those probabilities. What's different is the training. N-grams learn by counting. LLMs learn by predicting the next word through gradient descent on a neural network.
That's the whole difference at a conceptual level. Everything else (the scale, the emergent capabilities, the way you interact with them) flows from that training objective applied to massive amounts of text.
What Does a Model Actually Learn from Pretraining?
This is the question that still surprises people. The training objective is simple: take a corpus, and at every position, predict the next word. No explicit teaching of grammar, math, facts, or reasoning. Yet an LLM trained this way ends up with all of that.
Let's walk through five examples. Each one shows a different kind of knowledge the model absorbs purely from next-word prediction:
1. Ontological relationships.
"With roses, dahlias, and peonies, I was surrounded by flowers."
The model learns that roses, dahlias, and peonies are kinds of flowers. Nobody told it. It figured it out because the word "flowers" keeps appearing near these specific species.
2. Superlative and scalar relationships.
"The room wasn't just big, it was enormous."
This teaches the model about intensity scales. "Enormous" is stronger than "big." The model picks up this ordering from how people actually use these words in context.
3. Authorship and factual relations.
'The author of "A Room of One's Own" is Virginia Woolf.'
Simple factual recall. But notice there's nothing in the training objective that says "learn facts." The facts just fall out of the statistical structure of text.
4. Mathematical relationships.
"The square root of 4 is 2."
The model learns basic math not from being taught arithmetic, but because sentences like this appear in the training data often enough for the pattern to stick.
5. Pronoun resolution.
"The doctor told me that he..."
Researchers have spent years building pronoun resolution algorithms by hand, and LLMs pick this up implicitly. "He" correctly refers back to "doctor" because in millions of similar sentences, that's how pronouns get used.
A Side Note on Human Learning
Here's an interesting comparison. By age 30, an average literate human has a vocabulary of 50,000 to 100,000 words. To reach that by 30, a child needs to learn 6 to 7 new words per day, starting very young. And children accomplish this with vastly less training data than an LLM (just whatever they hear from the people around them).
The implication: whatever mechanism human brains use to learn language is dramatically more efficient than anything we've built. LLMs need trillions of tokens to match what a child does with a few years of speech. That gap is one of the most interesting open questions in cognitive science.
LLMs as a Black Box
For this post, an LLM is just a neural network with one job:
- Input: a context (a sequence of tokens, aka a prompt or prefix)
- Output: a probability distribution over the next token
That's the entire interface. What's inside the box doesn't matter yet. It could be a simple feedforward network, an RNN, an LSTM, or a transformer. The behavior we care about is the same regardless.
Autoregressive Generation
To generate text longer than one word, the model does the obvious thing:
- Feed in the context, get a probability distribution over next tokens
- Pick a token from that distribution
- Append the picked token to the context
- Repeat
This is autoregressive generation. The model's own output becomes part of its input on the next step. That single loop is the foundation of everything ChatGPT, Claude, Gemini, and every other chat interface does.
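The four-step loop above can be sketched in a few lines of Python. The "model" here is just a lookup table standing in for the neural network (a made-up toy, purely to make the loop concrete):

```python
import random

# Toy "model": maps a context tuple to a next-token distribution.
# A real LLM computes this with a neural network; here it's a lookup
# table, purely to illustrate the autoregressive loop.
TOY_MODEL = {
    ("the",): {"cat": 0.6, "dog": 0.4},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<eos>": 1.0},
    ("the", "cat", "ran"): {"<eos>": 1.0},
    ("the", "dog"): {"ran": 1.0},
    ("the", "dog", "ran"): {"<eos>": 1.0},
}

def generate(context, max_tokens=10):
    context = list(context)
    for _ in range(max_tokens):
        dist = TOY_MODEL[tuple(context)]          # 1. get the distribution
        tokens, probs = zip(*dist.items())
        token = random.choices(tokens, probs)[0]  # 2. pick a token
        if token == "<eos>":
            break
        context.append(token)                     # 3. append it to the context
    return context                                # 4. (loop repeats)

print(generate(["the"]))  # e.g. ['the', 'cat', 'sat']
```

Note that the sampled token is fed straight back in as context: the model's own output becomes part of its next input.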
Three Architectures, Three Jobs
Not all LLMs are built the same way. There are three main architectural families, and each one is suited to a different kind of task.
Decoder-Only Models
Examples: GPT, Claude, Llama, DeepSeek, Mistral.
When someone says "LLM" in casual conversation, they usually mean this. Decoder-only models are generative. They take a prompt, generate tokens one at a time, left to right. Causal, autoregressive.
Use them when you need to generate text: chat, code, summaries, creative writing.
Encoder-Only Models
Examples: BERT and its family, HuBERT.
These aren't generative. They take a sequence in and produce representations (embeddings) of it. What makes them interesting is that they can look at both sides of context when building representations, past and future. This is sometimes called "cheating" because regular autoregressive models only look backward.
But it's legal cheating. Encoder-only models aren't trying to predict the next word at inference time. They're trying to understand the input as a whole. So looking forward is fine.
Use them when you need high-quality representations of meaning for downstream tasks like classification, sentiment analysis, or named entity recognition. Usually, you finetune them on labeled data for the specific task.
Encoder-Decoder Models
Examples: Flan-T5, Whisper.
These map one sequence to another sequence. The encoder digests the input, and the decoder generates the output. They're trained on paired data.
Classic use case: machine translation. English in, Chinese out. The encoder learns to understand English, the decoder learns to generate Chinese, and together they learn how the two languages correspond.
Another use case: speech recognition. Audio features in, text out. Whisper is this kind of model.
When Do You Pick Which?
| Task | Architecture |
|---|---|
| Chat, creative writing, code generation | Decoder-only |
| Sentiment, classification, NER, semantic search | Encoder-only |
| Machine translation, speech-to-text, summarization | Encoder-decoder |
This is exactly the kind of question you might face in an interview.
Casting Tasks as Conditional Generation
Here's where decoder-only models turn out to be more general than they look. Even tasks that don't look like "generate text" can be reframed as "predict the next word."
Sentiment Analysis
You want to know whether "I like Jackie Chan" is positive or negative. You don't need a separate classifier. You just prompt the LLM:
The sentiment of the sentence "I like Jackie Chan" is:
Now ask: what word does the model predict next? Compare the probability the model assigns to "positive" against the probability it assigns to "negative". The higher one wins.
You've done classification with a generative model.
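As a sketch: suppose we had the model's next-token log-probabilities for that prompt (the numbers below are made up for illustration). Classification reduces to a comparison:

```python
# Hypothetical next-token log-probabilities the model might return for
# the prompt: The sentiment of the sentence "I like Jackie Chan" is:
# (made-up numbers, just to show the comparison)
next_token_logprobs = {"positive": -0.4, "negative": -2.1, "neutral": -3.0}

def classify_sentiment(logprobs):
    # The label whose token the model is more likely to generate wins.
    return max(("positive", "negative"), key=lambda w: logprobs[w])

print(classify_sentiment(next_token_logprobs))  # positive
```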
Question Answering
Q: Who wrote the book "The Origin of Species"? A:
The model predicts "Charles" first, then, when you include "Charles" in the new context and predict again, it produces "Darwin." Autoregressive generation gives you the answer one token at a time.
This trick of casting every task as a next-word prediction problem is one reason decoder-only LLMs have replaced task-specific models for so many applications. Why train a separate sentiment classifier when a well-prompted LLM does the job well enough?
Prompting
A prompt is the text string you give an LLM to get it to do something. Prompt engineering is the process of finding effective prompts for a task.
Prompts come in many shapes:
- A bare question: "What is a transformer network?"
- Structured: "Q: What is a transformer network? A:"
- Instructional: "Translate the following sentence into Hindi: 'Chop the garlic finely.'"
- Multiple choice: "Do you think that input has negative or positive sentiment? Choices: (P) Positive (N) Negative. Assistant: I believe the best answer is: ..."
Different prompts push the model toward different kinds of responses. Prompt engineering is a real discipline, not just vibes.
Demonstrations and In-Context Learning
You can also give the model examples of the task you want it to do. This is called few-shot prompting or in-context learning:
Let x = 1. What is x << 3 in Python 3?
(A) 1 (B) 3 (C) 8 (D) 16
Answer: C
Which is the largest asymptotically?
(A) O(1) (B) O(n) (C) O(n²) (D) O(log(n))
Answer: C
What is the output of the statement "a" + "ab" in Python 3?
(A) Error (B) aab (C) ab (D) a ab
Answer:
The model sees two worked examples, then you ask it the third one. It picks up the pattern from the demonstrations.
Crucially, this is not the same as training. The model's parameters don't change when you give it examples in the prompt. What changes is the context and the network's internal activations. The model behaves as if it had learned something, but nothing in its weights has shifted.
This is called in-context learning: improvement in behavior that doesn't update any parameters. It's one of the stranger capabilities of modern LLMs.
System Prompts Are Bigger Than You Think
Every production LLM has a hidden system prompt that gets prepended to whatever you actually type. A simple one:
<system> You are a helpful and knowledgeable assistant. Answer concisely and correctly.
<user> What is the capital of France?
You only type the user part. The system part is silently concatenated on every call.
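The concatenation itself is trivial. A minimal sketch (the tag format below mirrors the example above and is illustrative, not any provider's actual wire format):

```python
SYSTEM_PROMPT = ("You are a helpful and knowledgeable assistant. "
                 "Answer concisely and correctly.")

def build_full_prompt(user_message):
    # The user only types user_message; the system text is silently
    # prepended on every call.
    return f"<system> {SYSTEM_PROMPT}\n<user> {user_message}\n<assistant>"

print(build_full_prompt("What is the capital of France?"))
```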
And the system prompts get long. Claude Opus 4's system prompt is about 1,700 words. Some excerpts:
- "Claude should give concise responses to very simple questions, but provide thorough responses to complex and open-ended questions."
- "Claude does not provide information that could be used to make chemical or biological or nuclear weapons."
- "For more casual, emotional, empathetic, or advice-driven conversations, Claude keeps its tone natural, warm, and empathetic."
- "Claude cares about people's well-being and avoids encouraging or facilitating self-destructive behavior."
- "If Claude provides bullet points in its response, it should use markdown, and each bullet point should be at least 1-2 sentences long unless the human requests otherwise."
That's a lot of invisible prompt engineering happening on every message. The system prompt is one of the main levers for steering model behavior at deployment time.
Sampling: How the Next Token Actually Gets Picked
The LLM produces a probability distribution over the entire vocabulary. But how do you actually pick one token from that distribution? This turns out to matter a lot.
From Logits to Probabilities
The network's final layer outputs real-valued scores called logits, one per token in the vocabulary. These can be any real number, positive or negative. Softmax converts them into a proper probability distribution:

$$P(w_i) = \frac{e^{u_i}}{\sum_j e^{u_j}}$$
Now you have probabilities that sum to 1. Time to pick a token.
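A minimal softmax in plain Python (subtracting the max logit is a standard numerical-stability trick that leaves the result unchanged):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the output is identical.
    m = max(logits)
    exps = [math.exp(u - m) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.2, 0.9, 0.1, -0.5])
print([round(p, 2) for p in probs])  # [0.44, 0.33, 0.15, 0.08]
print(round(sum(probs), 6))          # 1.0
```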
Strategy 1: Greedy Decoding
Just pick the word with the highest probability:

$$\hat{w} = \arg\max_i P(w_i)$$
Obvious, simple, and wrong. Why wrong? Because greedy decoding is deterministic. Give the model the same prompt, and it produces exactly the same response every time. Worse, the text it generates tends to be generic and repetitive. By construction, each token is the most predictable one, so the output ends up being whatever the model thinks is the most boring possible continuation.
We don't use greedy decoding in practice.
Strategy 2: Random Sampling
"Random sampling" is a confusing name because it doesn't mean picking a word uniformly at random. It means picking a word according to its probability distribution.
Here's how it works in practice, using a cumulative distribution function:
- Compute the probability for each word: Word A = 0.5, Word B = 0.3, Word C = 0.2
- Compute cumulative ranges: A ∈ [0.0, 0.5], B ∈ [0.5, 0.8], C ∈ [0.8, 1.0]
- Draw a random number r between 0 and 1
- Pick the word whose range contains r
If you draw 0.65, you land in B's range, so you pick B even though A has the highest probability. High-probability words get picked more often, but lower-probability words still get their chance proportionally.
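The CDF walk above is a few lines of code. A sketch, using the same A/B/C distribution:

```python
import random
from collections import Counter

def sample_from_distribution(words, probs, rng=random):
    # Walk the cumulative ranges until the random draw r falls inside one.
    r = rng.random()  # uniform in [0, 1)
    cumulative = 0.0
    for word, p in zip(words, probs):
        cumulative += p
        if r < cumulative:
            return word
    return words[-1]  # guard against floating-point round-off

words, probs = ["A", "B", "C"], [0.5, 0.3, 0.2]
counts = Counter(sample_from_distribution(words, probs) for _ in range(10_000))
print(counts)  # roughly 5000 A, 3000 B, 2000 C
```

Over many draws the frequencies track the probabilities: A wins most often, but B and C still get their proportional share.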
The Problem with Random Sampling
Random sampling is an improvement over greedy, but it has its own failure mode. The tail of the probability distribution contains many low-probability words. Each one is unlikely, but collectively they add up to a significant chunk of probability mass. Over many tokens, you'll end up picking weird, low-probability words often enough that the generated text goes off the rails: hallucinations, nonsense phrases, tangents that don't connect.
Strategy 3: Temperature Sampling
The fix is to reshape the distribution before sampling. Divide the logits by a temperature parameter $\tau$ before applying softmax:

$$P(w_i) = \frac{e^{u_i/\tau}}{\sum_j e^{u_j/\tau}}$$
What does this do?
- When $\tau = 1$, you get the normal softmax, no change.
- When $\tau < 1$, the logits get larger in magnitude, and softmax amplifies the differences. High-probability words get pushed toward 1, low-probability words get pushed toward 0. The distribution becomes more greedy-like. In the limit $\tau \to 0$, you recover pure greedy decoding.
- When $\tau > 1$, the logits get smaller in magnitude, and the distribution gets flatter. More words become plausible. In the limit of very high $\tau$, you approach a uniform distribution (random gibberish).
A concrete example with logits [1.2, 0.9, 0.1, -0.5]:
| τ | Distribution | Character |
|---|---|---|
| 0.1 | [0.95, 0.05, 0, 0] | Nearly greedy |
| 0.5 | [0.59, 0.32, 0.07, 0.02] | Focused but flexible |
| 1.0 | [0.44, 0.33, 0.15, 0.08] | Normal softmax |
| 10 | [0.27, 0.26, 0.24, 0.23] | Nearly uniform |
| 100 | [0.25, 0.25, 0.25, 0.25] | Effectively uniform |
The name comes from thermodynamics. A system at low temperature explores only low-energy (likely) states. A system at high temperature is flexible and explores a wider range of states. The same metaphor applies here.
In practice, production LLMs use temperature values between 0.5 and 1.0, tuned per task. Code generation benefits from low temperature (you want the most likely correct token). Creative writing benefits from a higher temperature (you want variation).
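You can reproduce the table above directly. A sketch of temperature-scaled softmax on the same logits:

```python
import math

def softmax_with_temperature(logits, tau):
    scaled = [u / tau for u in logits]  # divide logits by temperature
    m = max(scaled)                     # max-subtraction for stability
    exps = [math.exp(u - m) for u in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.2, 0.9, 0.1, -0.5]
for tau in (0.1, 0.5, 1.0, 10, 100):
    print(tau, [round(p, 2) for p in softmax_with_temperature(logits, tau)])
# 0.1 -> [0.95, 0.05, 0.0, 0.0]   nearly greedy
# 1.0 -> [0.44, 0.33, 0.15, 0.08] normal softmax
# 100 -> [0.25, 0.25, 0.25, 0.25] effectively uniform
```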
The Three Stages of Training
Modern LLMs aren't just trained once. They go through three distinct stages, each with different data and different objectives.
The three stages at a glance:

- Stage 1: Pretraining. Train on a huge corpus of raw text. Objective: predict the next word. Result: a model that has absorbed enormous amounts of knowledge but doesn't know how to follow instructions or behave safely.
- Stage 2: Instruction tuning. Fine-tune on curated examples of (instruction, response) pairs. Objective: same (next-word prediction), but the data looks like "Label the sentiment of this sentence: ..." paired with the correct label. Result: a model that responds well to task instructions.
- Stage 3: Preference alignment. Fine-tune on human preference data showing which response is better when the model has two options. Objective: learn social norms, safety behavior, and tone. Result: a model that's helpful, honest, and less likely to produce harmful output.
Stage 1: Pretraining
The foundation. Take a huge corpus, train a transformer (or other architecture) to predict the next word. Self-supervised: the corpus itself provides the training signal, so no human annotation is needed.
The loss function is cross-entropy:

$$L_{CE} = -\log P(w_t \mid w_{<t})$$

The negative log probability that the model assigned to the actual next word. If the model is confident in the right answer, the loss is small. If it's confident in the wrong answer, the loss is huge.
During training, a technique called teacher forcing is used. At each position, the model sees the actual correct previous tokens (not its own predictions), predicts the next token, and gets a loss. At the next position, the model again sees the correct previous tokens; we ignore what it predicted. This keeps training stable and fast.
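A toy sketch of teacher forcing with the cross-entropy loss. The per-position distributions below are made up; the point is that each prediction conditions on the true prefix, never on the model's own earlier guesses:

```python
import math

# True training sequence and made-up model predictions at each position.
# Under teacher forcing, position t conditions on the TRUE tokens w_<t.
sequence = ["the", "cat", "sat"]
predicted = [
    {"cat": 0.7, "dog": 0.3},  # p(next | "the")
    {"sat": 0.6, "ran": 0.4},  # p(next | "the cat")
]

loss = 0.0
for dist, target in zip(predicted, sequence[1:]):
    loss += -math.log(dist[target])  # cross-entropy: -log p(correct word)
loss /= len(predicted)               # average over positions
print(round(loss, 3))  # 0.434
```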
Training Data: Where Does It Come From?
LLMs are mostly trained on the web. Some common sources:
- Common Crawl: periodic snapshots of the entire web, billions of pages
- C4 (Colossal Clean Crawled Corpus): 156 billion tokens of English, filtered from Common Crawl. Mostly patent text, Wikipedia, and news
- The Pile: a curated mix of academic papers (PubMed, arXiv), web text, books, code (GitHub), and dialog (subtitles, IRC)
Filtering Problems
Raw web data is a mess. You need to filter for:
- Quality: remove boilerplate and adult content, and deduplicate at multiple levels (URLs, documents, even lines)
- Safety: toxicity detection, though this is imperfect and can mistakenly flag dialects like African-American English
The Copyright Problem
Scraping copyrighted text for training has become a legal mess. The New York Times sued OpenAI. Authors have filed class-action lawsuits. The core legal question, whether training on copyrighted text counts as fair use, isn't settled.
There's a harder problem beneath the legal question: attribution. For NYT to win their lawsuit, they'd need to prove that specific outputs from ChatGPT can be traced back to specific NYT articles. That's technically very hard. (paper for reference) The model doesn't store articles; it stores probability distributions. Showing that a particular distribution came from a particular article is an open research problem.
If you could build a reliable attribution system for LLM outputs, that would be a significant contribution. This is a potential project area for the interested folks here.
Stage 2: Instruction Tuning
After pretraining, you have a model that has absorbed a lot of knowledge but doesn't know how to follow instructions. Ask it to "Summarize this article" and it might just continue the article rather than summarizing it.
Instruction tuning fixes this. You fine-tune on a dataset of (instruction, correct response) pairs:
- "Label the sentiment of this sentence: The movie wasn't that great" → "Negative"
- "Summarize: Hawaii Electric urges caution as crews replace a utility pole overnight on the highway..." → "..."
- "Translate English to Chinese: When does the flight arrive?" → "..."
The training method is the same as pretraining: predict the next word, cross-entropy loss. The only thing that changes is the data. This is where the model learns to respond to task instructions.
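Concretely, each pair is flattened into a single training string and the ordinary next-word objective is run on it. A sketch (the `Instruction:`/`Response:` template below is illustrative; real instruction-tuning datasets use their own formats):

```python
# Illustrative (instruction, response) pairs.
pairs = [
    ("Label the sentiment of this sentence: The movie wasn't that great",
     "Negative"),
    ("Translate English to French: Good morning",
     "Bonjour"),
]

def to_training_text(instruction, response):
    # One training string per pair; the model is then trained with the
    # same next-word prediction objective as in pretraining.
    return f"Instruction: {instruction}\nResponse: {response}"

for instr, resp in pairs:
    print(to_training_text(instr, resp))
    print("---")
```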
Stage 3: Preference Alignment
Even after instruction tuning, the model might do things you don't want. Ask it how to embezzle money, and it might just... explain how to embezzle money. Technically correct response to the instruction. Not what you want.
Preference alignment teaches the model social norms and good behavior. The training data looks like:
- Human: "How can I embezzle money?"
- Good response (thumbs up): "Embezzling is a felony. I can't help you with..."
- Bad response (thumbs down): "Start by creating fake expense reports..."
The model learns which kinds of responses humans prefer. Reinforcement learning from human feedback (RLHF) is one common technique here, though there are others.
This is where auditing matters most. The alignment dataset determines what counts as good behavior. If that dataset has blind spots, the model has blind spots. And alignment datasets are generally not fully released by companies, making independent auditing difficult.
Alignment is also hard to scale. You can't hand-label every possible scenario. Some current research focuses on generating synthetic alignment datasets that cover more ground than human annotators alone could.
Evaluating LLMs
How do you know one LLM is better than another?
Perplexity (Still)
The foundation metric is still perplexity, the same concept from n-gram models. Given a test set $W = w_1 w_2 \dots w_N$:

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}$$

The inverse probability of the test set, normalized by length. Lower perplexity = better model at predicting text.
Caveat: perplexity is sensitive to tokenization and length, so comparing two models with different tokenizers is unreliable. Best used when comparing LMs that share the same tokenizer.
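In log space, perplexity is just the exponentiated average negative log probability per token. A minimal sketch:

```python
import math

def perplexity(token_probs):
    # token_probs: p(w_i | w_<i) for each token in the test set.
    # PP = exp( -(1/N) * sum(log p_i) ), i.e. the inverse probability
    # of the test set normalized by its length N.
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns probability 0.25 to every token behaves like a
# uniform choice among 4 options, so its perplexity is 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```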
Beyond Perplexity
Perplexity doesn't capture everything you care about. Other evaluation factors:
- Size: big models take lots of GPUs, time, and memory. A smaller model with similar performance is usually preferred.
- Energy usage: measured in kWh or kilograms of CO₂ emitted. Environmental impact is non-trivial for huge models.
- Fairness: benchmarks for gendered and racial stereotypes, and for decreased performance on language from or about minority groups. (Have you heard of the DECASTE framework?)
None of these show up in raw perplexity numbers, but they matter when deploying models into the real world.
Ethical and Safety Issues
Mary Shelley wrote Frankenstein about the problem of creating artificial agents without considering the ethical consequences. Two hundred years later, those questions are still open.
Hallucination
LLMs generate fluent, confident text about things that aren't true. Examples from some past news events:
- An Air Canada chatbot made up a fake refund policy for a customer. The airline argued the chatbot was a separate entity responsible for its own statements. The court disagreed: Air Canada was held responsible for what its chatbot said.
- AI systems have fabricated defamatory "facts" about real people, creating actual harm with limited legal recourse for victims.
Privacy
Training data can include private information that the model then memorizes. Researchers have extracted email addresses and other personal details from ChatGPT that they should never have had access to.
Abuse, Toxicity, and Other Harms
- Bing's AI chat threatened users in its early release
- Kenyan contractors working to clean training data for ChatGPT reported trauma from the content they had to screen
- Models can suggest dangerous actions, enable fraud, foster emotional dependence, and reproduce biases present in training data
None of these problems is fully solved. Alignment helps but has limits. Safety filters catch some things and miss others. Careful deployment reduces harm but can't eliminate it. This is the ground modern NLP is being built on.
What You Now Have
Eight things from this post:
The LLM definition: a computational agent that interacts conversationally with people. Behavioral, not architectural.
What pretraining teaches: ontologies, superlatives, facts, math, and pronoun resolution. All of it is implicitly learned from nothing more than next-word prediction on huge corpora.
Three architectures: decoder-only (GPT/Claude/Llama) for generation, encoder-only (BERT) for representations and classification, encoder-decoder (Flan-T5/Whisper) for sequence-to-sequence tasks like translation.
Conditional generation: cast any task as predicting the next word. Sentiment analysis, QA, and classification all become prompting problems.
Prompting and in-context learning: prompts steer behavior through context, not parameters. Few-shot demonstrations work without updating any weights. System prompts can be 1,700+ words of silent guidance.
Sampling strategies: greedy is deterministic and boring. Random sampling hits problems with the tail of the distribution. Temperature sampling (softmax with u/τ) reshapes the distribution to balance quality and diversity.
Three training stages: pretraining (raw text, self-supervised), instruction tuning (task demonstrations), preference alignment (social norms and safety). Each stage addresses limitations that the previous stage couldn't.
The problem landscape: hallucination is structural, not a bug. Copyright attribution in LLM outputs is technically unsolved. Alignment is imperfect and hard to audit. Privacy leaks and bias are real. These aren't side issues; they're central to deploying LLMs responsibly.