Note: This article was originally written in April 2023. Even though I've updated parts of it, some parts may feel a bit dated by today's standards. However, most of the key ideas about LLMs remain just as relevant today.
So… Are LLMs "Foundation Models"?
Short answer: kind of, yes, but with caveats.
Long answer: let's walk through why a model trained to predict the next token can still feel like a general-purpose engine for NLP tasks.
Quick recap: what a Language Model actually does
Collect a massive amount of text.
Show it to a language model.
Train it to predict the next token (word/subword).
Feed the model's own output back into the input (auto-regressive) to generate long sequences.
In other words, a Language Model predicts the next token; stretched out over many steps, it writes. Now, does making that model large turn it into an NLP foundation model, a base you can adapt to many downstream tasks? A strict yes/no is tough, but in practice: LLMs do a surprisingly good job today and are the closest thing we have so far. Better approaches may come; for now, LLMs are the front-runners.
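To make that loop concrete, here is a minimal sketch of auto-regressive generation, assuming the Hugging Face transformers library and the small GPT-2 checkpoint as a stand-in for a much larger model. It uses plain greedy decoding; real systems typically sample with temperature or top-p.

```python
# Minimal auto-regressive loop: predict the next token, append it, repeat.
# Assumes `transformers` and `torch` are installed; GPT-2 is just a small stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of South Korea is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                                   # generate 10 more tokens
        logits = model(input_ids).logits                  # [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)         # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)  # feed it back in

print(tokenizer.decode(input_ids[0]))
```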
Why can "just next-token prediction" look like general intelligence?
Two big reasons.
1) The breadth of data
A student who has read widely writes better than one who hasn't. Same for LLMs: while "knowledge" in a philosophical sense is debatable, training across diverse, large-scale text exposes the model to patterns, facts, styles, and structures from many domains. That breadth makes next-token prediction look powerful across tasks.
Think of it this way: someone who's read a mountain of detective novels can mimic the genre well, even if they've never solved a case.
2) The power of Transformers
It's not just the data. Transformers learn statistical relationships across tokens efficiently. They don't build an explicit knowledge graph with named entities and edges, but self-attention lets the model connect distant parts of text and maintain coherence over long spans. Multi-head self-attention is why an LLM can hold a thread instead of getting lost mid-paragraph. (Unlike some blog posts that shall remain unnamed…)
If LLMs are a kind of foundation, how do we use them?
Fine-tuning
Assume the LLM already "knows" general language. You then fine-tune it on your task (sentiment, NER, classification, etc.) with labeled data.
Another way to view it: use the LLM as an initialization rather than starting from random weights. Starting near a good solution can converge faster and more reliably.
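As a rough illustration of what "LLM as initialization" looks like in code, here is a hedged sketch using the Hugging Face Trainer on a sentiment task. The model and dataset names (distilbert-base-uncased, imdb) are illustrative choices, not the only option, and the tiny training slice is just to keep the sketch short.

```python
# Fine-tuning sketch: start from pretrained weights, train on labeled data.
# Assumes `transformers` and `datasets` are installed; names below are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)          # pretrained weights as the starting point

dataset = load_dataset("imdb")                        # labeled sentiment data

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),  # small slice for the sketch
)
trainer.train()
```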
Reality check: As models grew, full fine-tuning became slow and expensive. That's why we often reach for the next thing…
In-Context Learning (ICL): zero-shot & few-shot
Instead of changing the model, change the input at inference time.
Zero-shot: "What's the capital of South Korea?"
Few-shot: Provide patterns first:
USA -> Washington, D.C.
Japan -> Tokyo
China -> Beijing
South Korea -> ?
The model isn't "answering a question" so much as continuing the pattern in the text you gave it. With strong base models, few-shot prompts (and often zero-shot ones) are already useful without retraining.
Fine-tuning can still win on accuracy for some tasks. But for many practical cases, ICL gives you "good enough" without training cost.
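Here is what the few-shot setup above looks like as code: you build the pattern as plain text and let the model continue it, with no weight updates. The pipeline call and GPT-2 checkpoint are just stand-ins for whatever base model you actually use.

```python
# Few-shot in-context learning: the "training" lives entirely in the prompt.
from transformers import pipeline

examples = [("USA", "Washington, D.C."), ("Japan", "Tokyo"), ("China", "Beijing")]
prompt = "\n".join(f"{country} -> {capital}" for country, capital in examples)
prompt += "\nSouth Korea ->"

generator = pipeline("text-generation", model="gpt2")   # stand-in for a stronger base model
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```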
Prompt Engineering
LLMs are pattern completers. So how you phrase the input matters.
Worse: What's the capital of South Korea?
Better: You're a system that answers world-capital questions concisely.
Question: What is the capital of South Korea?
Answer:
The second prompt supplies role, format, and intent, which biases the completion toward what you want.
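A tiny template makes this repeatable; the wording below is just one way to phrase the role and format, not a canonical recipe.

```python
# Hypothetical prompt template: fix the role, format, and intent, vary the question.
def capital_prompt(question: str) -> str:
    return (
        "You're a system that answers world-capital questions concisely.\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(capital_prompt("What is the capital of South Korea?"))
```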
Conversational LLMs
Base LMs aren't chatbots. But they act like one if:
1) They see lots of dialog data during pre-training or fine-tuning, and
2) We wrap user input with a dialog-style prompt before sending it to the model.
How is context maintained? We keep a running transcript:
User: What's the capital of South Korea?
Assistant: Seoul.
User: And Japan?
The model gets the whole history (up to the context window, measured in tokens), then continues it. That's it: no magic, just careful prompt construction and truncation when the history gets too long.
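A hedged sketch of that transcript bookkeeping, assuming a GPT-2 tokenizer purely for token counting and an illustrative context limit:

```python
# Keep a running transcript, render it as one prompt, and drop the oldest
# turns when the rendered prompt no longer fits the token budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in tokenizer for counting tokens
MAX_CONTEXT_TOKENS = 1024                           # illustrative context-window limit

history = []                                        # list of (speaker, text) turns

def build_prompt(history):
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append("Assistant:")                      # cue the model to continue as the assistant
    return "\n".join(lines)

def add_turn(speaker, text):
    history.append((speaker, text))
    # Truncate from the front until the transcript fits the context window.
    while len(tokenizer(build_prompt(history)).input_ids) > MAX_CONTEXT_TOKENS:
        history.pop(0)

add_turn("User", "What's the capital of South Korea?")
add_turn("Assistant", "Seoul.")
add_turn("User", "And Japan?")
print(build_prompt(history))                        # this full transcript is what the model sees
```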
Steering with RLHF
Left alone, a base LM will happily produce anything it thinks "fits" the next-token distribution, including unsafe or unhelpful text. Enter Reinforcement Learning from Human Feedback (RLHF): humans rank model responses; a reward model learns those preferences; and the LM is then optimized to produce safer, more helpful, better-aligned outputs.
Important: RLHF doesnât grant new raw capabilities; it steers behavior. Sometimes raw benchmark scores even dip slightly while helpfulness/safety improve.
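For a sense of where the human rankings enter, here is a minimal sketch of the reward-model objective commonly used in RLHF pipelines (a pairwise, Bradley-Terry style loss). The reward_model here is a hypothetical callable that returns one scalar score per response; the RL optimization step that follows is omitted.

```python
# Pairwise preference loss: push the score of the human-preferred response
# above the score of the rejected one.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)       # scalar score for the preferred response
    r_rejected = reward_model(rejected_ids)   # scalar score for the rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```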
Challenges we shouldn't hand-wave away
Concentration of power
Data access is improving (especially for English), but compute is the new bottleneck. Training frontier models requires huge GPU clusters and budgets, which risks consolidation among a few players. Open weights, shared preference datasets, and efficient training methods can helpâbut itâs an ongoing tension.
Carbon footprint
Training and serving LLMs consume significant energy. Estimates for a single large run can be hundreds of tons of CO₂-equivalent. The field is working on efficiency (better hardware, algorithms, and scheduling) and reporting emissions more transparently, but this is a real externality.
Hallucinations
LLMs will invent details when the next-token distribution "leans that way." The prose looks confident, which makes fact-checking hard. Mitigations include (a minimal RAG sketch follows this list):
Retrieval-augmented generation (RAG) to ground answers in external sources,
Better prompts and system rules,
Task-specific fine-tuning or adapters,
Structured output and verification steps.
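To make the first mitigation concrete, here is a toy RAG sketch: retrieve the most relevant snippets, then build a prompt that tells the model to answer only from those snippets. Retrieval below is naive word overlap purely for illustration; real systems use embedding search, and the final generation call is left out.

```python
# Toy retrieval-augmented generation: ground the prompt in retrieved sources.
def retrieve(question, documents, k=2):
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(question, documents):
    context = "\n".join(f"- {s}" for s in retrieve(question, documents))
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say you don't know.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    "Seoul is the capital and largest city of South Korea.",
    "Tokyo is the capital of Japan.",
    "The Han River flows through Seoul.",
]
print(grounded_prompt("What is the capital of South Korea?", docs))
```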
Open questions
Do LLMs "reason"?
One camp: LLMs just do massive pattern matching.
Another: human reasoning might itself be pattern completion over experience.
Truth likely sits between: techniques like chain-of-thought, tool use, and self-consistency push LLMs to perform surprisingly well on reasoning-like tasks, yet they still fail in distinctly non-human ways.
Arthur C. Clarke had a line for this: "Any sufficiently advanced technology is indistinguishable from magic." We're somewhere along that curve: impressive, but not magic.
Will LLMs replace doctors or lawyers?
Passing an exam ≠ practicing the profession. Real-world work involves clients, tools, procedures, accountability, and context. Today's LLMs won't replace entire professions, but they already automate slices of knowledge work (drafting, summarizing, retrieval, brainstorming). The trajectory points toward AI-augmented professionals, not wholesale replacement, at least for now.
Bottom line
Are LLMs the foundation model for NLP? Today, they're the best we've got.
Are they perfect? No.
Can we adapt them to many tasks? Absolutely, and that's why they feel foundational.