Note: This article was originally written in April 2023. Even though I've updated parts of it, some parts may feel a bit dated by today's standards. However, most of the key ideas about LLMs remain just as relevant today.
So… Are LLMs "Foundation Models"?
Short answer: kind of, yes, but with caveats.
Long answer: let's walk through why a model trained to predict the next token can still feel like a general-purpose engine for NLP tasks.
Quick recap: what a Language Model actually does
Collect a massive amount of text.
Show it to a language model.
Train it to predict the next token (word/subword).
Feed the model's own output back into the input (auto-regressive) to generate long sequences.
In other words, a Language Model predicts the next token; stretched out over many steps, it writes. Now, does making that model large turn it into an NLP foundation model, a base you can adapt to many downstream tasks? A strict yes/no is tough, but in practice: LLMs do a surprisingly good job today and are the closest thing we have so far. Better approaches may come; for now, LLMs are the front-runners.
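To make that loop concrete, here is a minimal sketch of auto-regressive generation, assuming the Hugging Face transformers library and the small GPT-2 checkpoint as a stand-in for a much larger model. It uses plain greedy decoding; real systems typically sample with temperature or top-p.

```python
# Minimal auto-regressive loop: predict the next token, append it, repeat.
# Assumes `transformers` and `torch` are installed; GPT-2 is just a small stand-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of South Korea is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                                   # generate 10 more tokens
        logits = model(input_ids).logits                  # [batch, seq_len, vocab_size]
        next_id = logits[:, -1, :].argmax(dim=-1)         # greedy pick of the next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)  # feed it back in

print(tokenizer.decode(input_ids[0]))
```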
Why can "just next-token prediction" look like general intelligence?
Two big reasons.
1) The breadth of data
A student who has read widely writes better than one who hasn't. Same for LLMs: while "knowledge" in a philosophical sense is debatable, training across diverse, large-scale text exposes the model to patterns, facts, styles, and structures from many domains. That breadth makes next-token prediction look powerful across tasks.
Think of it this way: someone who's read a mountain of detective novels can mimic the genre well, even if they've never solved a case.
2) The power of Transformers
It's not just the data. Transformers learn statistical relationships across tokens efficiently. They don't build an explicit knowledge graph with named entities and edges, but self-attention lets the model connect distant parts of text and maintain coherence over long spans. Multi-head self-attention is why an LLM can hold a thread instead of getting lost mid-paragraph. (Unlike some blog posts that shall remain unnamed…)
If LLMs are a kind of foundation, how do we use them?
Fine-tuning
Assume the LLM already "knows" general language. You then fine-tune it on your task (sentiment, NER, classification, etc.) with labeled data.
Another way to view it: use the LLM as an initialization rather than starting from random weights. Starting near a good solution can converge faster and more reliably.
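As a rough illustration of what "LLM as initialization" looks like in code, here is a hedged sketch using the Hugging Face Trainer on a sentiment task. The model and dataset names (distilbert-base-uncased, imdb) are illustrative choices, not the only option, and the tiny training slice is just to keep the sketch short.

```python
# Fine-tuning sketch: start from pretrained weights, train on labeled data.
# Assumes `transformers` and `datasets` are installed; names below are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)          # pretrained weights as the starting point

dataset = load_dataset("imdb")                        # labeled sentiment data

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=dataset["train"].shuffle(seed=42).select(range(2000)),  # small slice for the sketch
)
trainer.train()
```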
Reality check: As models grew, full fine-tuning became slow and expensive. That's why we often reach for the next thing…
In-Context Learning (ICL): zero-shot & few-shot
Instead of changing the model, change the input at inference time.
Zero-shot: "What's the capital of South Korea?"
Few-shot: Provide patterns first:
USA -> Washington, D.C.
Japan -> Tokyo
China -> Beijing
South Korea -> ?
The model isn't "answering a question" so much as continuing the pattern in the text you gave it. With strong base models, few-shot prompts (and often zero-shot ones) are already useful without retraining.
Fine-tuning can still win on accuracy for some tasks. But for many practical cases, ICL gives you "good enough" without training cost.
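Here is what the few-shot setup above looks like as code: you build the pattern as plain text and let the model continue it, with no weight updates. The pipeline call and GPT-2 checkpoint are just stand-ins for whatever base model you actually use.

```python
# Few-shot in-context learning: the "training" lives entirely in the prompt.
from transformers import pipeline

examples = [("USA", "Washington, D.C."), ("Japan", "Tokyo"), ("China", "Beijing")]
prompt = "\n".join(f"{country} -> {capital}" for country, capital in examples)
prompt += "\nSouth Korea ->"

generator = pipeline("text-generation", model="gpt2")   # stand-in for a stronger base model
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```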
Prompt Engineering
LLMs are pattern completers. So how you phrase the input matters.
Worse: What's the capital of South Korea?
Better: You're a system that answers world-capital questions concisely.
Question: What is the capital of South Korea?
Answer:
The second prompt supplies role, format, and intent, which biases the completion toward what you want.
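A tiny template makes this repeatable; the wording below is just one way to phrase the role and format, not a canonical recipe.

```python
# Hypothetical prompt template: fix the role, format, and intent, vary the question.
def capital_prompt(question: str) -> str:
    return (
        "You're a system that answers world-capital questions concisely.\n"
        f"Question: {question}\n"
        "Answer:"
    )

print(capital_prompt("What is the capital of South Korea?"))
```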
Conversational LLMs
Base LMs aren't chatbots. But they act like one if:
1) They see lots of dialog data during pre-training or fine-tuning, and
2) We wrap user input with a dialog-style prompt before sending it to the model.
How is context maintained? We keep a running transcript:
User: What's the capital of South Korea?
Assistant: Seoul.
User: And Japan?
The model gets the whole history (up to the context window, measured in tokens), then continues it. That's it: no magic, just careful prompt construction and truncation when the history gets too long.
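A hedged sketch of that transcript bookkeeping, assuming a GPT-2 tokenizer purely for token counting and an illustrative context limit:

```python
# Keep a running transcript, render it as one prompt, and drop the oldest
# turns when the rendered prompt no longer fits the token budget.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in tokenizer for counting tokens
MAX_CONTEXT_TOKENS = 1024                           # illustrative context-window limit

history = []                                        # list of (speaker, text) turns

def build_prompt(history):
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append("Assistant:")                      # cue the model to continue as the assistant
    return "\n".join(lines)

def add_turn(speaker, text):
    history.append((speaker, text))
    # Truncate from the front until the transcript fits the context window.
    while len(tokenizer(build_prompt(history)).input_ids) > MAX_CONTEXT_TOKENS:
        history.pop(0)

add_turn("User", "What's the capital of South Korea?")
add_turn("Assistant", "Seoul.")
add_turn("User", "And Japan?")
print(build_prompt(history))                        # this full transcript is what the model sees
```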
Steering with RLHF
Left alone, a base LM will happily produce anything it thinks "fits" the next-token distribution, including unsafe or unhelpful text. Enter Reinforcement Learning from Human Feedback (RLHF): humans rank model responses; a reward model learns those preferences; and the LM is then optimized to produce safer, more helpful, better-aligned outputs.
Important: RLHF doesnât grant new raw capabilities; it steers behavior. Sometimes raw benchmark scores even dip slightly while helpfulness/safety improve.
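For a sense of where the human rankings enter, here is a minimal sketch of the reward-model objective commonly used in RLHF pipelines (a pairwise, Bradley-Terry style loss). The reward_model here is a hypothetical callable that returns one scalar score per response; the RL optimization step that follows is omitted.

```python
# Pairwise preference loss: push the score of the human-preferred response
# above the score of the rejected one.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    r_chosen = reward_model(chosen_ids)       # scalar score for the preferred response
    r_rejected = reward_model(rejected_ids)   # scalar score for the rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```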
Challenges we shouldn't hand-wave away
Concentration of power
Data access is improving (especially for English), but compute is the new bottleneck. Training frontier models requires huge GPU clusters and budgets, which risks consolidation among a few players. Open weights, shared preference datasets, and efficient training methods can helpâbut itâs an ongoing tension.
Carbon footprint
Training and serving LLMs consume significant energy. Estimates for a single large run can be hundreds of tons of CO₂-equivalent. The field is working on efficiency (better hardware, algorithms, and scheduling) and reporting emissions more transparently, but this is a real externality.
Hallucinations
LLMs will invent details when the next-token distribution "leans that way." The prose looks confident, which makes fact-checking hard. Mitigations include (a minimal RAG sketch follows this list):
Retrieval-augmented generation (RAG) to ground answers in external sources,
Better prompts and system rules,
Task-specific fine-tuning or adapters,
Structured output and verification steps.
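To make the first mitigation concrete, here is a toy RAG sketch: retrieve the most relevant snippets, then build a prompt that tells the model to answer only from those snippets. Retrieval below is naive word overlap purely for illustration; real systems use embedding search, and the final generation call is left out.

```python
# Toy retrieval-augmented generation: ground the prompt in retrieved sources.
def retrieve(question, documents, k=2):
    q_words = set(question.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(question, documents):
    context = "\n".join(f"- {s}" for s in retrieve(question, documents))
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say you don't know.\n"
        f"Sources:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

docs = [
    "Seoul is the capital and largest city of South Korea.",
    "Tokyo is the capital of Japan.",
    "The Han River flows through Seoul.",
]
print(grounded_prompt("What is the capital of South Korea?", docs))
```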
Open questions
Do LLMs "reason"?
One camp: LLMs just do massive pattern matching.
Another: human reasoning might itself be pattern completion over experience.
Truth likely sits between: techniques like chain-of-thought, tool use, and self-consistency push LLMs to perform surprisingly well on reasoning-like tasks, yet they still fail in distinctly non-human ways.
Arthur C. Clarke had a line for this: "Any sufficiently advanced technology is indistinguishable from magic." We're somewhere along that curve: impressive, but not magic.
Will LLMs replace doctors or lawyers?
Passing an exam ≠ practicing the profession. Real-world work involves clients, tools, procedures, accountability, and context. Today's LLMs won't replace entire professions, but they already automate slices of knowledge work (drafting, summarizing, retrieval, brainstorming). The trajectory points toward AI-augmented professionals, not wholesale replacement, at least for now.
Bottom line
Are LLMs the foundation model for NLP? Today, they're the best we've got.
Are they perfect? No.
Can we adapt them to many tasks? Absolutely, and that's why they feel foundational.