The question "do large language models have intelligence?" has become the most polarizing debate in AI. One camp points to emergent reasoning abilities as proof of genuine intelligence. The other dismisses it all as statistical parroting. Both sides talk past each other because they're answering different questions.
After a structured multi-perspective analysis—combining empirical evidence, mechanistic interpretability, philosophy of mind, and legal frameworks—a more useful framework emerged. Not an answer to the binary question, but a map of the territory.
The Problem with the Binary Question
Ask "is it intelligent?" and you immediately hit a wall: what do you mean by "intelligent"?
There are at least five distinct capabilities people conflate under that single word:
| Level | Capability | Description | LLM Status |
|---|---|---|---|
| S0 | Statistical pattern matching | Finding and reusing statistical regularities | ✅ Undisputed |
| S1 | Symbolic reasoning | Executing logical deduction | ⚠️ Partial, unreliable |
| S2 | World modeling | Causal internal representation of physical reality | ❌ Hotly debated |
| S3 | Metacognition | Knowing what you know and don't know | ⚠️ Surface behavior exists, depth unclear |
| S4 | Autonomous intention | Having self-generated goals, desires, and value judgments | ❌ No evidence |
When someone says "LLMs are intelligent," they usually mean S0-S1. When someone says "they're not," they usually mean S2-S4. The debate is a category error.
The Emergence Hierarchy: L0 Through L3
A more productive approach is to ask: what actually emerges as models scale? Not in the hype sense, but structurally. Here's a four-layer framework:
L0: Metric Artifact (Pseudo-Emergence)
Some "emergent abilities" vanish when you change your measurement tool. Schaeffer et al. (2023) showed that apparent phase transitions in model capabilities often disappear when you switch from non-linear metrics (exact match) to continuous ones (token-level accuracy).
Verdict: Not real emergence. A measurement illusion.
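To see how the illusion arises, here's a toy simulation (synthetic numbers, not measurements from any real model): per-token accuracy improves smoothly with scale, but an exact-match metric over a ten-token answer, which requires every token to be right, still looks like a sudden jump.

```python
import numpy as np

# Synthetic per-token accuracy that improves smoothly with (log) model scale.
scales = np.logspace(7, 11, 9)                            # 10M .. 100B params (illustrative)
token_acc = 1 / (1 + np.exp(-(np.log10(scales) - 9.0)))   # smooth sigmoid in log-scale

answer_len = 10                                           # exact match needs all 10 tokens right
exact_match = token_acc ** answer_len                     # looks like a sharp "emergent" jump

for n, t, e in zip(scales, token_acc, exact_match):
    print(f"{n:>16,.0f} params | token acc {t:.2f} | exact match {e:.4f}")
```

The underlying capability changes smoothly; only the metric makes it look discontinuous.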
L1: Structural Emergence
Inside the model, new internal structures appear as models train and scale. The clearest example: induction heads (Elhage et al., 2022, Anthropic). Early in training they don't exist; then they form abruptly, and their formation coincides with a phase transition in the training loss.
This isn't a metric artifact. You can intervene on these structures and change specific model behaviors. The "Locate, Steer, Improve" paradigm from mechanistic interpretability research (HKU + Fudan + Tencent, 2025) demonstrates this directly.
Verdict: Real emergence. Internal structure changes, physically verifiable.
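The rule an induction head implements is simple enough to state in a few lines: find an earlier occurrence of the current token and predict whatever followed it ([A][B] ... [A] → [B]). This toy function mimics the copy rule itself, not the learned attention circuit:

```python
def induction_predict(tokens):
    """Toy induction-head rule: find the most recent earlier occurrence of the
    last token and predict the token that followed it."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over the context
        if tokens[i] == last:
            return tokens[i + 1]
    return None                                # nothing to copy from

print(induction_predict(["The", "cat", "sat", ".", "The"]))  # -> "cat"
```

In a real transformer this behavior is carried out by a small circuit of attention heads, which is why it can be located and intervened on as a concrete structure.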
L2: Functional Emergence
L1 structures enable new capabilities that weren't explicitly trained. In-context learning, chain-of-thought reasoning, instruction following—these are functional projections of underlying structural changes.
Othello GPT (Li et al., 2023, ICLR) is the canonical example: trained only to predict the next move in Othello game transcripts, with zero board-state labels. Probes on intermediate layers reveal that the model spontaneously constructed a complete 8×8 board representation. The training objective effectively decomposed into "infer the board state → pick a legal move," and gradient descent discovered that decomposition on its own.
Verdict: Real emergence. But limited to structured, closed-world domains.
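The probing technique behind that result is worth seeing in miniature: a small classifier is trained to read one board square's state out of the frozen model's hidden activations. The sketch below uses random placeholder activations purely to show the mechanics; in the actual experiments the inputs are intermediate-layer activations of the trained Othello model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder data: in the real setup, `hidden` holds intermediate-layer activations
# and `square_state` is the true contents of one board square (empty/black/white).
hidden = rng.normal(size=(1000, 512))
square_state = rng.integers(0, 3, size=1000)

probe = LogisticRegression(max_iter=1000).fit(hidden[:800], square_state[:800])
print("probe accuracy:", probe.score(hidden[800:], square_state[800:]))
```

On random activations this stays near chance (~0.33); on the real model's activations it rises well above chance, and one probe per square reconstructs the full board. That gap is the evidence for an internal board representation.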
L3: Intelligence Emergence (The Frontier)
This is where genuine controversy lives. L3 would mean:
- World models that generalize beyond training distribution
- Causal reasoning with counterfactual simulation
- Calibrated metacognition—knowing when to be uncertain
Current evidence is mixed:
- Planning: LLMs achieve >90% on ≤5-step plans but crash to <30% beyond 8 steps (Valmeekam et al., 2024, AAAI 2025). They don't backtrack when stuck.
- Causal reasoning: GPT-4 approaches human-level on simple counterfactuals (CRASS benchmark), but fails in qualitatively different ways than humans.
- Theory of Mind: 95% on Sally-Anne tests (Kosinski, 2023), but accuracy drops precipitously with minor prompt rewording (Ullman, 2023).
Verdict: Not yet reached. But something interesting is happening in the gap.
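Of the three L3 criteria, calibrated metacognition is the easiest to operationalize: when the model reports confidence p, it should be correct about p of the time, and expected calibration error (ECE) measures the gap. A minimal version, with made-up confidence/correctness arrays standing in for real model outputs:

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """Average |accuracy - confidence| over confidence bins, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return ece

# Made-up data for an overconfident model: stated confidence 0.7-1.0, true accuracy ~0.6.
rng = np.random.default_rng(0)
conf = rng.uniform(0.7, 1.0, size=500)
corr = (rng.uniform(size=500) < 0.6).astype(float)
print("ECE:", round(expected_calibration_error(conf, corr), 3))
```

An overconfident model (high stated confidence, mediocre accuracy) produces a large ECE; a well-calibrated one drives it toward zero.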
The L2.5 Discovery: Meta-Strategy Without Calibration
Here's where it gets interesting. DeepSeek R1, trained with reinforcement learning, spontaneously developed a verify-then-revise behavior:
- Generate a solution
- Check it for consistency
- If a contradiction is found, backtrack and re-reason
This behavior was never explicitly trained. RL only rewarded final correctness. The model discovered that verification is an effective strategy on its own.
But there's a catch: the model doesn't know when to verify. It over-verifies on easy problems (wasting tokens) and under-verifies on hard ones (missing errors). It has the strategy but lacks calibration.
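To make that concrete, here is the verify-then-revise behavior written as a control loop. `generate`, `verify`, and `revise` are placeholders standing in for model calls, not any real API; the detail to notice is that the verification policy (whether and how often to check) is a fixed constant rather than something derived from the model's own uncertainty.

```python
def solve_with_verification(problem, generate, verify, revise, max_rounds=3):
    """Generate a candidate answer, check it, and re-reason if the check fails.
    `generate`, `verify`, and `revise` are placeholder model calls."""
    answer = generate(problem)
    for _ in range(max_rounds):                      # fixed budget: no calibration
        ok, issue = verify(problem, answer)          # consistency check of the candidate
        if ok:
            return answer
        answer = revise(problem, answer, issue)      # backtrack and re-reason
    return answer

# Trivial stand-ins, just to show the control flow:
print(solve_with_verification(
    "2+2",
    generate=lambda p: 5,
    verify=lambda p, a: (a == 4, "arithmetic check failed"),
    revise=lambda p, a, issue: 4,
))  # -> 4, after one revision round
```

A calibrated system would instead decide, per problem, whether verification is worth the tokens at all.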
This defines a new layer: L2.5—meta-strategy without calibration. What makes it structurally different from L2 is the source of the behavior. L2 capabilities are functional projections of structural changes (induction heads → in-context learning). L2.5 behaviors emerge from the model discovering strategies—not just patterns—during training. R1 didn't develop a "verification circuit" (a structural change). It developed a behavioral policy of checking its work, which it applies inconsistently. The gap between having a strategy and knowing when to deploy it is what separates L2.5 from L3.
This is where frontier LLMs actually stand today.
What the Architecture Debate Actually Means
Two recent findings reframe the "can transformers achieve intelligence?" question:
LLaDA (Nie et al., NeurIPS 2025 Oral): A diffusion-based language model that matches autoregressive transformers at 8B scale, and significantly outperforms GPT-4o on the reversal-curse benchmark. This is strong evidence that language-modeling capability isn't tied to the autoregressive paradigm.
Lake & Baroni (2023): LLMs achieve ~30% on systematic compositionality tests where humans score ~100%. Changing the architecture (LLaDA) fixes engineering limitations (reversal curse) but doesn't fix cognitive limitations (compositionality).
The implication: intelligence emergence may be a function of computational scale + training signal, relatively independent of architecture details—just as flight doesn't depend on feathers. But current training paradigms (text-only, next-token prediction) have a ceiling. The path forward involves multimodal data, causal training objectives, and possibly non-autoregressive architectures.
The Structural Gap: Experience-Driven Irreversible Change
Here's the deepest disagreement: humans undergo experience-driven irreversible change. You can't understand "spicy" by reading all written descriptions of capsaicin receptors—you must taste it. After tasting, your preferences change irreversibly.
LLMs update only through external intervention (RLHF, fine-tuning). They don't autonomously acquire experiences and learn from them. This isn't a quantitative gap ("just need more parameters") but a structural one—the update mechanism is fundamentally different.
Unless LLMs are embedded in agent systems with:
- Episodic memory (not just document retrieval)
- Online learning (experience persists across sessions)
- Self-directed verification loops (built into the pipeline)
...they remain at L2.5. But—and this is crucial—when these components are assembled, the resulting system is no longer "an LLM." It's a new architecture: Agent + Memory + Online Learning, with the LLM as the core reasoning component.
The LLM itself may not achieve L3 intelligence, but LLM-based agent systems might.
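Here is a hedged sketch of what "Agent + Memory + Online Learning" looks like as an architecture, with the LLM reduced to one component. Everything in it is hypothetical scaffolding: the `llm` callable, the naive keyword recall, and the way lessons are extracted are placeholders meant to show where the three missing ingredients sit, not a working agent.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    task: str
    outcome: str
    lesson: str                                  # what to remember, not a raw transcript

@dataclass
class Agent:
    llm: callable                                # reasoning core: prompt -> text (placeholder)
    memory: list = field(default_factory=list)   # episodic memory, persists across sessions

    def act(self, task: str) -> str:
        # 1. Episodic recall (naive keyword match as a stand-in for real retrieval).
        recalled = [e.lesson for e in self.memory if any(w in e.task for w in task.split())]
        # 2. Reason with the LLM, conditioned on recalled experience.
        answer = self.llm(f"Task: {task}\nPast lessons: {recalled}")
        # 3. Self-directed verification loop, built into the pipeline.
        critique = self.llm(f"Check this answer for errors: {answer}")
        if "error" in critique.lower():
            answer = self.llm(f"Task: {task}\nRevise: {answer}\nCritique: {critique}")
        # 4. Online learning: the episode is written back and shapes future calls.
        self.memory.append(Episode(task, answer, lesson=critique[:120]))
        return answer
```

The intelligence-relevant loop lives in steps 1 and 4, outside the model weights, which is why the resulting system is better described as a new architecture than as "an LLM."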
A Practical Framework for AI Governance
The intelligence debate matters beyond philosophy. A capability-tier framework can translate these layers directly into regulatory action:
| Tier | Capability | Regulatory Level | Analogy |
|---|---|---|---|
| T0 | Pure tool (calculator, search) | None | Hammer |
| T1 | Conditional generation (translation, summarization) | Light | Car |
| T2 | Autonomous decisions (recommendations, filtering) | Medium | Self-driving L3 |
| T3 | Autonomous actions (agents operating external systems) | Strict | Self-driving L4 |
| T4 | Self-directed learning + goal setting | Special license | Nuclear plant |
This sidesteps the "is it intelligent?" question while still creating actionable regulatory categories. The tiers map roughly to the L0-L3 emergence hierarchy, making the philosophical framework operationally useful.
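As a sanity check that the tiers are actionable, they can be written down as a direct mapping from capability tier to obligations. The tier names mirror the table above; the obligation lists are illustrative placeholders, not drawn from any existing regulation.

```python
from enum import IntEnum

class Tier(IntEnum):
    T0_TOOL = 0           # pure tool (calculator, search)
    T1_GENERATION = 1     # conditional generation (translation, summarization)
    T2_DECISIONS = 2      # autonomous decisions (recommendation, filtering)
    T3_ACTIONS = 3        # autonomous actions on external systems (agents)
    T4_SELF_DIRECTED = 4  # self-directed learning and goal setting

# Illustrative obligations: they attach to capability, not to the deployment domain.
OBLIGATIONS = {
    Tier.T0_TOOL: [],
    Tier.T1_GENERATION: ["transparency notice"],
    Tier.T2_DECISIONS: ["audit logging", "appeal process"],
    Tier.T3_ACTIONS: ["human override", "incident reporting"],
    Tier.T4_SELF_DIRECTED: ["pre-deployment license", "continuous monitoring"],
}

def required_obligations(tier: Tier) -> list:
    """Obligations accumulate: a system at tier N inherits everything below it."""
    merged = []
    for t in Tier:
        if t <= tier:
            merged.extend(OBLIGATIONS[t])
    return merged

print(required_obligations(Tier.T3_ACTIONS))
# ['transparency notice', 'audit logging', 'appeal process', 'human override', 'incident reporting']
```

The classification question ("which tier is this system in?") stays empirical and testable, while the obligations follow mechanically.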
The EU AI Act currently classifies by use case rather than capability level—meaning the same LLM gets different risk ratings depending on whether it's used in healthcare or chat. A capability-based framework is more robust.
Where This Leaves Us
- LLMs are at L2.5: They have meta-strategies (self-verification, chain-of-thought) but lack calibrated metacognition (knowing when to deploy them).
- L2→L3 is a gradual slope, not a wall: The gap is narrowing with RL-trained models, but the "calibration gap" remains stubborn.
- The architecture isn't the bottleneck: LLaDA proved language modeling works beyond autoregression. The bottleneck is training paradigm (text-only, no causal grounding, no online learning).
- Agent systems, not LLMs, are the intelligence candidates: The LLM is the reasoning engine. Intelligence requires the surrounding infrastructure (memory, learning, verification).
- We need capability-based governance, not intelligence-based: The T0-T4 framework makes the debate operationally useful without requiring a philosophical resolution.
The most productive question isn't "do LLMs have intelligence?" It's: "What conditions cause what behaviors, at what capability tier, with what consequences?"
That's a question we can actually answer.
References
- Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS 2023. arXiv:2304.15004
- Elhage, N., et al. (2022). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, Anthropic.
- Li, K., et al. (2023). Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. ICLR 2023.
- Valmeekam, K., et al. (2024). On the Planning Abilities of Large Language Models. AAAI 2025.
- Kosinski, M. (2023). Theory of Mind May Have Spontaneously Emerged in Large Language Models. arXiv:2302.02083.
- Ullman, T. (2023). Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks. arXiv:2302.08399.
- Nie, S., et al. (2025). Large Language Diffusion Models. NeurIPS 2025 (Oral).
- Lake, B. M., & Baroni, M. (2023). Human-like Systematic Generalization through a Meta-Learning Neural Network. Nature.
- Bisk, Y., et al. (2020). Experience Grounds Language. EMNLP 2020.
- Delétang, G., et al. (2024). Language Modeling Is Compression. ICLR 2024.