1. (Interview Question 1) What is the fundamental difference between pre-training and fine-tuning in Large Language Models?
Key Concept: Training Objectives & Learning Stages
Standard Answer:
Pre-training and fine-tuning represent two distinct but complementary phases in building high-performance Large Language Models. Pre-training is the stage where the model learns general linguistic, semantic, and world knowledge from massive, diverse, and mostly unlabeled datasets. This stage uses self-supervised learning objectives like masked language modeling or next-token prediction. The key idea is that the model builds broad capabilities: grammar understanding, reasoning structures, world context, and token-level statistical relationships.
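To make the pre-training objective concrete, here is a minimal sketch of next-token prediction. It assumes a toy `model` that maps a batch of token ids to per-position logits; it is an illustration of the idea, not any particular lab's training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Illustrative pre-training step: predict token t+1 from tokens <= t.

    Assumes `model` maps (batch, seq_len) token ids to
    (batch, seq_len, vocab_size) logits; `token_ids` is a LongTensor.
    """
    inputs = token_ids[:, :-1]   # everything except the last token
    targets = token_ids[:, 1:]   # the same sequence shifted left by one
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten batch and time
        targets.reshape(-1),                  # matching flat targets
    )
```

The key point is that the labels are just the input shifted by one position, so raw text supervises itself.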
Fine-tuning, in contrast, adapts the broadly capable model to a more specific domain, task, or behavior. Instead of trillions of tokens, fine-tuning usually uses a much smaller, curated dataset designed to shape the model toward specific outputs — such as answering in a certain tone, following instructions, or performing well on a domain like medical Q&A or financial reasoning.
The most important conceptual difference is breadth vs. specialization. Pre-training is where the model becomes intelligent; fine-tuning is where the model becomes useful. During pre-training, the model learns latent representations through an enormous number of gradient updates over trillions of tokens, optimizing a generic objective. During fine-tuning, the model’s representations are adjusted so it can perform tasks that the base model was not explicitly trained on.
From an engineering perspective, pre-training is extremely resource-intensive. It requires distributed training setups, complex data pipelines, and large-scale orchestration. Fine-tuning, on the other hand, can be done with modest hardware using techniques like LoRA, QLoRA, or parameter-efficient adapters. While pre-training defines the model’s foundation, fine-tuning determines its final form in practical applications. Without pre-training, fine-tuning has nothing to refine; without fine-tuning, pre-training is too general to provide task-aligned behavior.
Possible Follow-up Questions:
- Why can’t we skip pre-training and only fine-tune a smaller model?
- How does the dataset design differ between pre-training and fine-tuning?
- Which stage has a larger impact on model hallucination rates?
2. (Interview Question 2) Why is self-supervised learning essential for LLM pre-training?
Key Concept: Self-supervision and Unlabeled Data
Standard Answer:
Self-supervised learning is essential for LLM pre-training primarily because it allows models to learn from enormous amounts of text without requiring human-labeled data. Modern language models are trained on hundreds of billions to trillions of tokens, and manually labeling datasets of this scale is impractical. Self-supervised objectives — such as predicting the next token or reconstructing masked text — turn every piece of text into its own training label, enabling the model to extract structure and meaning from raw data.
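For the masked-text variant, a minimal sketch looks like the following; it assumes a hypothetical `mask_token_id` and a toy model returning per-position logits. The point is that the labels come from the text itself, with no human annotation.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(model, token_ids, mask_token_id, mask_prob=0.15):
    """Illustrative masked-language-modeling step: hide ~15% of tokens
    and train the model to reconstruct them from context."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob    # choose positions to hide
    corrupted = token_ids.masked_fill(mask, mask_token_id)
    labels[~mask] = -100                              # ignore unmasked positions in the loss
    logits = model(corrupted)                         # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```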
Another reason self-supervision is critical is that it encourages the model to infer missing information, detect patterns, and learn long-range dependencies. Predicting the next token forces a model to understand syntax, semantics, reasoning cues, and pragmatic context. When a model trains on such objectives repeatedly, it gradually internalizes a latent world model — the statistical structure of human communication.
Self-supervision also provides robustness and generalization. Because the model sees data from many domains — books, articles, code, web pages — it learns patterns that transfer to downstream tasks even if those tasks were never part of the training. This is why a pre-trained model can perform translation, summarization, question answering, or code generation before being fine-tuned for any of them.
Finally, self-supervision aligns well with transformer architectures. Transformers thrive when the objective allows dense contextual understanding, and predicting missing tokens across long sequences is a perfect match. This synergy between architecture and training objective is a major reason modern LLMs scaled so effectively.
Possible Follow-up Questions:
- What alternative training objectives could replace next-token prediction?
- How does self-supervision differ from reinforcement learning in LLM training?
- What are the risks of using noisy or low-quality data in self-supervised training?
3. (Interview Question 3) What types of datasets are typically used in pre-training vs fine-tuning?
Key Concept: Data Composition & Curation Strategy
Standard Answer:
Pre-training datasets are designed to maximize diversity, scale, and linguistic coverage. They are typically composed of large text corpora pulled from books, web pages, scientific papers, news articles, forums, and code repositories. The goal is to expose the model to as many writing styles, topics, and knowledge domains as possible. Since pre-training is self-supervised, the dataset does not require labels — only raw text.
The dataset construction process focuses on scale and distribution. Engineers aggressively deduplicate, strip boilerplate and low-quality noise, filter out toxic or illegal content, and ensure multilingual or domain coverage depending on model goals. A pre-training dataset may contain hundreds of billions to trillions of tokens, often grouped into mixtures such as 40% web, 20% books, 20% code, and so on. The emphasis is on providing rich natural variation and minimizing data artifacts that could induce bias.
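The percentages above are illustrative, but sampling from such a mixture is straightforward to sketch. The source names and weights below are hypothetical:

```python
import random

# Hypothetical mixture weights for a pre-training run (illustrative only).
MIXTURE = {"web": 0.40, "books": 0.20, "code": 0.20, "papers": 0.10, "forums": 0.10}

def sample_source(rng=random):
    """Pick the data source for the next training batch in proportion to its weight."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Usage: over many batches the realized proportions approach the target mixture.
batch_sources = [sample_source() for _ in range(10)]
```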
Fine-tuning datasets, by contrast, are task-specific and carefully curated. They may include instruction-response pairs, domain-specific Q&A, customer service transcripts, legal documents, or medical explanations. Unlike pre-training, fine-tuning datasets are usually much smaller — often between 50K and 5M samples. Quality matters far more than quantity because fine-tuning shapes model behavior, not underlying knowledge.
Fine-tuning data must be consistent, aligned with desired behavior, and often labeled with explicit targets. For example, instruction fine-tuning requires prompts paired with ideal responses; RLHF requires human preference rankings; domain fine-tuning requires accurate domain-knowledge outputs.
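A hedged sketch of what such records often look like; the field names vary by project and are illustrative rather than a standard schema:

```python
# Instruction fine-tuning example: a prompt paired with an ideal response.
sft_example = {
    "instruction": "Summarize the key risks in this loan agreement.",
    "input": "<document text>",
    "output": "The main risks are ...",
}

# RLHF preference example: two candidate answers ranked by a human.
preference_example = {
    "prompt": "Explain photosynthesis to a 10-year-old.",
    "chosen": "Plants use sunlight to turn air and water into food ...",
    "rejected": "Photosynthesis is the process by which C3 and C4 pathways ...",
}
```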
In short, pre-training datasets make the model capable; fine-tuning datasets make the model aligned and specialized. This separation ensures that the model has both strong foundational knowledge and the ability to behave reliably in real-world applications.
Possible Follow-up Questions:
- How do you prevent a fine-tuning dataset from introducing bias?
- What happens if the fine-tuning dataset contradicts pre-training knowledge?
- Should fine-tuning data include negative examples?
4. (Interview Question 4) Why do pre-trained models sometimes perform poorly on specific tasks without fine-tuning?
Key Concept: Alignment Gap & Task-Specific Optimization
Standard Answer:
Even though pre-trained models learn broad knowledge, they often perform poorly on specific tasks because they lack task alignment — meaning the prompts, expected outputs, and reasoning workflows required by a task may not naturally emerge from generic next-token prediction. Pre-training optimizes for likelihood, not correctness or task structure. For example, a base model might complete a sentence fluently but still fail to follow step-by-step instructions or provide accurate domain-specific reasoning.
One major issue is instruction following. Pre-training does not teach the model the concept of "tasks." Without fine-tuning, a pre-trained LLM does not understand that it should answer concisely, reason stepwise, format outputs, or adhere to a user’s explicit request. This is why early GPT models before instruction tuning behaved unpredictably and often ignored instructions.
Additionally, domain adaptation is a major factor. A model trained on general text may not know highly specialized vocabulary or precise workflows required for tasks like legal analysis, medical triage, or financial modeling. Even if the raw knowledge is present in pre-training, the model has not been shaped to apply it in task-specific formats.
Fine-tuning resolves these gaps by providing curated examples of how the model should behave. It teaches interaction patterns (e.g., Q&A structure), enforces correctness, and aligns the model with practical workflows. This is why instruction-tuned or domain-tuned models significantly outperform base models in real applications.
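One concrete mechanism behind this: during instruction tuning, the loss is often computed only on the response tokens, so the model learns the Q&A interaction pattern rather than simply continuing text. A minimal sketch, assuming the prompt and response have already been tokenized:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids, response_ids):
    """Illustrative SFT step: concatenate prompt + response, but only
    penalize the model on the response portion of the sequence."""
    input_ids = torch.cat([prompt_ids, response_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100   # do not train on prompt tokens
    logits = model(input_ids[:, :-1])        # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),           # shift labels to align with predictions
        ignore_index=-100,
    )
```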
In summary, poor performance without fine-tuning is not a failure of pre-training — it is a natural outcome of optimizing for broad language modeling instead of specialized task execution.
Possible Follow-up Questions:
- Could prompt engineering alone solve task alignment without fine-tuning?
- In what cases can a base model perform well without fine-tuning?
- Why do base models hallucinate more often?
5. (Interview Question 5) How does Instruction Fine-tuning differ from RLHF?
Key Concept: Supervised Alignment vs Human Preference Optimization
Standard Answer:
Instruction fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are two complementary techniques used to align LLMs with human expectations. Although both aim to improve behavior, they differ in purpose, methodology, and training signals.
Instruction fine-tuning is supervised. The model is trained on a dataset of prompts paired with ideal answers. The loss directly measures how closely the model’s output matches the target response. This stage teaches the model to follow instructions, handle structured tasks, and respond helpfully and coherently. It is essentially pattern learning from curated exemplars.
RLHF, however, is preference-based rather than example-based. Instead of giving the model a single “correct” answer, humans compare multiple model outputs and rank them. A reward model is trained on these human preferences, and the LLM is then optimized (typically via PPO) to maximize that reward; related methods such as DPO skip the explicit reward model and optimize directly on the preference pairs. RLHF teaches nuanced behavior such as politeness, safety, adherence to values, and reduced hallucinations.
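For intuition, the reward-model training step in classic RLHF can be sketched with a pairwise loss: the score of the human-preferred answer should exceed that of the rejected one. This assumes a hypothetical `reward_model` that maps token ids to a scalar score per example.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Illustrative Bradley-Terry loss: push the reward of the preferred
    response above the reward of the rejected one."""
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar rewards
    r_rejected = reward_model(rejected_ids)  # (batch,) scalar rewards
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```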
The relationship between them is hierarchical:
- SFT provides basic instruction-following.
- RLHF refines behavior using human values.
SFT shapes what the model does; RLHF shapes how the model behaves.
Another key difference is failure modes. Over-aggressive SFT may cause the model to overfit to training examples or ignore alternative reasoning paths. RLHF can lead to reward hacking or excessive refusal if the reward model is not well-designed. Both require careful dataset construction to ensure balanced alignment.
In practice, modern LLMs use a combination of pre-training → SFT → RLHF → post-training steps like distillation. Each stage builds upon the previous one to create a model that is knowledgeable, helpful, aligned, and safe for real-world deployment.
Possible Follow-up Questions:
- What are typical failure cases caused by poorly designed reward models?
- Can RLHF be used without instruction fine-tuning?
- Why is human preference modeling more scalable than traditional supervision?
6. (Interview Question 6) What are Parameter-Efficient Fine-Tuning (PEFT) methods and why are they important?
Key Concept: LoRA, Adapters, QLoRA
Standard Answer:
Parameter-Efficient Fine-Tuning (PEFT) methods allow developers to fine-tune LLMs without updating all model weights. Instead, they introduce small additional modules — like adapter layers or low-rank matrices — and optimize only those. This drastically reduces computational costs and makes fine-tuning feasible on consumer-grade GPUs.
For example, LoRA (Low-Rank Adaptation) injects trainable low-rank matrices into attention layers. During fine-tuning, only these matrices are updated, while the original model weights remain frozen. This reduces gradient computation, memory footprint, and storage requirements, enabling large-scale customization at a fraction of the cost.
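A minimal from-scratch sketch of the idea (not the official peft implementation): the frozen weight W is augmented with a trainable low-rank product B·A, scaled by alpha/r.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the original weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # Original projection plus the low-rank correction B @ A @ x.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Because `lora_b` starts at zero, the wrapped layer initially behaves exactly like the frozen original, and only the low-rank correction is learned.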
QLoRA goes even further by quantizing the base model to 4-bit precision while keeping the LoRA modules in higher precision. This allows fine-tuning of 7B–70B models on a single GPU with minimal loss in quality. PEFT methods therefore democratize LLM fine-tuning, making it accessible to researchers, startups, and applied engineering teams.
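A hedged sketch of a QLoRA-style setup using the Hugging Face transformers and peft libraries; the model name and hyperparameters are placeholders, and exact options may vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Illustrative settings; "my-org/base-7b" is a placeholder checkpoint name.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize the frozen base model to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained("my-org/base-7b", quantization_config=bnb_cfg)
model = prepare_model_for_kbit_training(model)

lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)     # only the LoRA parameters remain trainable
model.print_trainable_parameters()
```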
Adapters serve a similar purpose by inserting small bottleneck layers inside transformer blocks, typically after the attention or feed-forward sublayers. These layers can encode task-specific behavior and be swapped in and out without retraining the entire model.
The importance of PEFT extends beyond cost savings. It allows organizations to maintain one central base model while supporting multiple fine-tuned variants for different tasks. This modularity reduces deployment overhead and makes versioning and rollback simpler.
Overall, PEFT techniques have become essential in modern LLM development because they combine efficiency, flexibility, and performance, allowing fine-tuning at scale without the massive infrastructure typically required for full-parameter updates.
Possible Follow-up Questions:
- How does LoRA compare to full fine-tuning in terms of performance trade-offs?
- In what situations is full fine-tuning still necessary?
- How do PEFT approaches help prevent overfitting?
7. (Interview Question 7) How does catastrophic forgetting occur during fine-tuning, and how can we prevent it?
Key Concept: Knowledge Retention & Regularization
Standard Answer:
Catastrophic forgetting happens when fine-tuning overwrites the model’s previously learned representations, causing it to lose general knowledge acquired during pre-training. Because fine-tuning datasets are usually much smaller and narrower, the model may become overly specialized and lose the ability to handle diverse tasks or reasoning patterns.
The issue arises from gradient update dynamics. When the model is trained on a narrow dataset, the gradients strongly adjust weights in directions that optimize for the fine-tuning task. However, many of those same weights also encode general-purpose linguistic or semantic information. Aggressive updates can therefore disrupt pre-trained representations.
Preventing catastrophic forgetting involves multiple strategies:
- Freezing Layers: only a subset of the model’s parameters (typically the higher layers) is updated, protecting core representations learned during pre-training.
- PEFT Techniques: methods like LoRA avoid modifying the original weights entirely, preserving the pre-trained knowledge.
- Regularization: techniques such as L2-SP penalize deviations from the pre-trained weight state, preventing radical shifts in parameter space (a minimal sketch follows this list).
- Replay Buffers: mixing a portion of general text from pre-training sources into the fine-tuning data helps maintain balance.
- Low Learning Rates: controlled updates reduce the risk of overwriting essential pre-trained patterns.
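A minimal sketch of the L2-SP idea: keep a copy of the pre-trained weights and penalize how far the fine-tuned weights drift from them. Names and the coefficient are illustrative, not taken from a specific paper's code.

```python
import torch

def l2_sp_penalty(model, pretrained_state, coeff=0.01):
    """Penalize squared distance from the pre-trained weights (L2-SP)."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + (param - pretrained_state[name]).pow(2).sum()
    return coeff * penalty

# Usage sketch: snapshot the weights before fine-tuning, then add the
# penalty to the task loss at every step.
# pretrained_state = {n: p.detach().clone() for n, p in model.named_parameters()}
# loss = task_loss + l2_sp_penalty(model, pretrained_state)
```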
In certain domains, catastrophic forgetting is desirable — for example, when redefining model behavior or removing dangerous knowledge. But most often, maintaining pre-trained capabilities is crucial for robust generalization. Modern fine-tuning pipelines usually apply PEFT and careful learning rate schedules to mitigate this problem.
Possible Follow-up Questions:
- Why does catastrophic forgetting happen more frequently in small datasets?
- How does LoRA specifically help mitigate forgetting?
- Can catastrophic forgetting ever improve performance?
8. (Interview Question 8) How do scaling laws influence the pre-training strategy of LLMs?
Key Concept: Compute–Data–Model Trade-offs
Standard Answer:
Scaling laws describe predictable relationships between model size, dataset size, compute budget, and performance. These laws show that improving performance requires balanced scaling across all three factors. Pre-training is fundamentally influenced by these scaling dynamics because an imbalance — such as a model that is too large for the available data — leads to suboptimal learning and wasted compute.
For example, scaling laws indicate that doubling the dataset size yields diminishing but measurable improvements, while scaling model parameters without increasing data can lead to overfitting. Similarly, compute requirements grow superlinearly as model size increases, meaning training a trillion-parameter model requires not just more data, but significantly more compute.
Because of scaling laws, engineers design pre-training runs with specific ratios — such as the Chinchilla optimum — which recommend increasing dataset size proportionally to model size to maximize training efficiency.
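As a rough, hedged illustration of the Chinchilla rule of thumb (on the order of 20 training tokens per parameter, with training compute approximated by C ≈ 6·N·D FLOPs):

```python
def chinchilla_budget(params: float, tokens_per_param: float = 20.0):
    """Back-of-the-envelope planning under the Chinchilla heuristic."""
    tokens = tokens_per_param * params   # recommended number of training tokens
    flops = 6.0 * params * tokens        # C ~= 6 * N * D approximation
    return tokens, flops

tokens, flops = chinchilla_budget(70e9)  # e.g. a 70B-parameter model
print(f"~{tokens / 1e12:.1f}T tokens, ~{flops:.2e} FLOPs")
```

For a 70B-parameter model this gives roughly 1.4T tokens, which matches the scale reported for Chinchilla itself.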
These laws guide decisions like:
- How many tokens should we train on?
- How many training steps should we use?
- Should we spend compute on a larger model or train a smaller one longer?
- Which mixture of data sources yields the best scaling efficiency?
Ignoring scaling laws results in models that are either compute-inefficient or undertrained relative to their parameter count. Modern LLM research increasingly optimizes toward data-efficient and compute-efficient strategies informed by scaling law research.
Ultimately, scaling laws make pre-training predictable and engineering-driven, rather than trial-and-error. They provide a roadmap for allocating resources intelligently to achieve the best possible performance at a given scale.
Possible Follow-up Questions:
- How did the Chinchilla paper change the industry’s understanding of scaling?
- What happens if a model is undertrained according to scaling laws?
- How do scaling laws differ between language and vision models?
9. (Interview Question 9) What is the role of tokenization during pre-training and fine-tuning?
Key Concept: Vocabulary, Subword Units, Efficiency
Standard Answer:
Tokenization determines how raw text is broken into units that the model can understand. It plays a crucial role in both pre-training and fine-tuning because it defines the granularity at which the model learns representations. Most modern LLMs use subword-based tokenizers like Byte Pair Encoding (BPE) or SentencePiece, which split text into manageable units that balance expressiveness and efficiency.
During pre-training, the tokenizer shapes the entire learning process. If the vocabulary is too small, sequences become long, lowering efficiency. If it is too large, rare words fragment the embedding space and reduce generalization. A well-designed tokenizer produces consistent, compact token sequences that cover languages and domains with minimal ambiguity.
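A quick way to see this granularity in practice, assuming the Hugging Face tokenizers for GPT-2 and BERT are available locally; the exact splits depend on each tokenizer's training corpus:

```python
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")                      # byte-level BPE
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")   # WordPiece

text = "Pharmacokinetics of monoclonal antibodies"
print(bpe.tokenize(text))        # rare technical words split into several subwords
print(wordpiece.tokenize(text))  # a different vocabulary yields different splits
```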
Because pre-training involves trillions of tokens, even small improvements in tokenization efficiency can save millions of dollars in compute. The tokenizer also determines how effectively the model can represent rare words, technical language, and multilingual content.
During fine-tuning, tokenization ensures consistency between pre-training and downstream tasks. Using a different tokenizer would break compatibility, since token embeddings and positional mappings are tied to the original vocabulary. Fine-tuning datasets must be tokenized identically to avoid introducing errors.
Some fine-tuning workflows also rely on special tokens — system tokens, formatting tokens, or domain-specific markers — to guide model behavior. For example:
```
<System>
Follow the structure below:
</System>
```
These special tokens must be added carefully because expanding the vocabulary also requires resizing the model’s embedding tables.
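With the Hugging Face APIs, for instance, registering new markers looks roughly like this; "gpt2" is a placeholder base model and the token names are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the new markers so they are never split into subwords.
tokenizer.add_special_tokens({"additional_special_tokens": ["<System>", "</System>"]})
model.resize_token_embeddings(len(tokenizer))       # grow the embedding table to match
```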
In summary, tokenization is not a minor preprocessing step: it is a foundational design choice that influences efficiency, accuracy, multilingual support, and downstream adaptability.
Possible Follow-up Questions:
- Should tokenizers be updated periodically as new domains emerge?
- How does BPE differ from WordPiece?
- What are the trade-offs of using byte-level tokenization?
10. (Interview Question 10) How does continual training differ from fine-tuning, and when should each be used?
Key Concept: Model Expansion vs Domain Adaptation
Standard Answer:
Continual training and fine-tuning serve different purposes within the lifecycle of an LLM. Fine-tuning is typically a small, targeted update that aligns the model with a specific task or behavior. Continual training, however, extends the pre-training stage by exposing the model to additional large-scale data, often to update world knowledge, incorporate new domains, or expand multilingual capabilities.
In continual training, the model learns new general-purpose representations without narrowing its focus. This process is usually compute-heavy and resembles pre-training more than fine-tuning. It can involve hundreds of billions of tokens from new sources, enabling the model to stay up to date with evolving information.
Fine-tuning, by contrast, optimizes the model for a specific use case — such as customer support dialogue, medical Q&A, or reasoning tasks. It is typically applied on much smaller datasets and shapes the model’s behavior rather than expanding its core knowledge.
When to use each:
- Use continual training when the model lacks essential knowledge or needs global improvements (e.g., new languages, new time-sensitive information).
- Use fine-tuning when the model needs domain alignment, task-specific structure, or behavior refinement.
Another difference lies in risk. Continual training can accidentally overwrite important pre-trained patterns if not executed carefully. Fine-tuning carries the risk of catastrophic forgetting if performed too aggressively. The right strategy often involves combining both: continually updating the base model periodically, then fine-tuning variants for specific deployments.
Ultimately, continual training expands what the model knows; fine-tuning refines what the model does. Understanding the interplay between them is essential for designing scalable, maintainable LLM ecosystems.
Possible Follow-up Questions:
- When does continual training introduce more risk than benefit?
- How do we select datasets for continual training?
- Can continual training fully replace fine-tuning?