Ariel Frischer

Posted on Jun 27

Emergent Properties and Abilities of LLMs

#programming #ai #llm

Emergent LLM ability is best treated as an evaluation problem, not a mystical property. Some abilities do appear suddenly under common benchmark metrics, but a large part of "emergence" comes from thresholded scoring, prompt format, in-context examples, tool access, phase changes in training loss, and how well the evaluator knows how to elicit the behavior.

The useful current answer:

Yes, there are ways to discover new emergent abilities.
The strongest methods are automated capability discovery, open-ended task/model coevolution, adversarial elicitation, mechanistic probes, longitudinal checkpoint evaluation, and contamination-resistant self-generated benchmarks.
The most useful underexploited abilities are meta-cognition for tool routing, latent planning, self-reflection under controlled conditions, algorithmic discovery, capability self-mapping, and cross-domain analogy.
The most dangerous or low-trust abilities are covert communication, prompt extraction, deceptive self-presentation, benchmark gaming, and inverse-scaling failures where bigger models get worse at some tasks.
Many "unknown abilities" are probably latent elicitation gaps: the model can do the task under one framing but fails under another.

Working Definition

I would use three categories instead of one overloaded word:

Category	Meaning	Example
Apparent emergence	A sharp jump caused by the metric or benchmark threshold	Exact-match arithmetic goes from 0 to 1 once partial competence crosses a line
Latent ability	A capability present but rarely elicited without the right prompt, trace, tool, or environment	Pretrained self-reflection appearing rarely, then increasing under reflection-inducing probes
System emergence	New behavior from model plus scaffold, tools, memory, agents, or feedback loops	Long-horizon coding agent, tool scheduling, autonomous scientific workflow

The practical mistake is asking "does the model have ability X?" The better question is:

Under which elicitation, tools, context budget, feedback, and scoring rule does ability X become reliable?

Can We Discover New Emergent Abilities?

Yes. Current methods cluster into eight approaches.

1. Automated Capability Discovery

Automated Capability Discovery (ACD) asks a foundation model acting as a scientist (either the target model itself or a separate model) to propose open-ended tasks that probe the target model's abilities. This reduces dependence on hand-authored benchmarks and may expose unexpected skills or risks.

Useful pattern:

Generate many candidate tasks.
Filter for novelty, verifiability, and non-contamination.
Run target models across scales/checkpoints.
Cluster tasks where performance has unusual jumps, collapses, or strong prompt sensitivity.
Human-review only the most surprising clusters.
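
The clustering step above can be sketched as a simple triage function. This is a minimal illustration, not the method from the cited work: the function name, the result layout, and both thresholds are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def flag_surprising_tasks(results, jump_threshold=0.4, sensitivity_threshold=0.25):
    """Flag tasks worth human review.

    results: {task_id: {model_name: [score per prompt variant]}}, with models
    ordered from smallest to largest. Layout and thresholds are hypothetical.
    """
    flagged = {}
    for task_id, by_model in results.items():
        means = [mean(scores) for scores in by_model.values()]
        # Largest absolute change between adjacent model sizes (jump or collapse).
        max_jump = max((abs(b - a) for a, b in zip(means, means[1:])), default=0.0)
        # Largest spread across prompt variants for any single model.
        max_spread = max((pstdev(s) for s in by_model.values()), default=0.0)
        reasons = []
        if max_jump >= jump_threshold:
            reasons.append("jump")
        if max_spread >= sensitivity_threshold:
            reasons.append("prompt-sensitive")
        if reasons:
            flagged[task_id] = reasons
    return flagged

example = {
    "mod_arith": {"small": [0.0, 0.1, 0.0], "medium": [0.1, 0.1, 0.0], "large": [0.9, 0.8, 0.9]},
    "summarize": {"small": [0.5, 0.5, 0.6], "medium": [0.6, 0.6, 0.6], "large": [0.7, 0.7, 0.6]},
    "riddle": {"small": [0.0, 0.9, 0.1], "medium": [0.1, 0.9, 0.0], "large": [0.2, 1.0, 0.1]},
}
print(flag_surprising_tasks(example))
```

Only the flagged tasks would go to human review; everything else stays in the automated pool.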

Sources: Lu, Hu, and Clune, "Automated Capability Discovery via Foundation Model Self-Exploration" / "Beyond Benchmarking" (OpenReview, OpenReview PDF).

2. Task-Capability Coevolution

Instead of fixing benchmarks, coevolve tasks and models. AC/DC-style systems evolve assessments and model variants together to discover specialized experts or unexpected strengths.

This is attractive because static benchmarks saturate and leak into training data. Coevolution can keep producing harder or stranger probes.

Source: "Discovering Novel LLM Experts via Task-Capability Coevolution" (arXiv).

3. Checkpoint and Scaling-Trajectory Sweeps

Evaluate model checkpoints throughout pretraining, not only final models. Some capabilities appear at particular stages; some get worse with more training.

What to look for:

Sudden slope changes.
U-shaped or inverse-scaling curves.
Abilities that emerge after lower-level component skills appear.
Loss ranges where downstream abilities become reliable.
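
A rough way to screen checkpoint curves for the first two patterns is sketched below. The function names and the tolerance are assumptions for illustration, and the shape classifier only compares the endpoints against the minimum, so it is a coarse filter rather than a statistical test.

```python
def largest_slope_change(xs, ys):
    """Return (x, change) for the interior point where the slope changes most."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points of equal length")
    points = list(zip(xs, ys))
    slopes = [(y1 - y0) / (x1 - x0) for (x0, y0), (x1, y1) in zip(points, points[1:])]
    # Pair each change in slope with the x value where the two segments meet.
    changes = [(abs(s1 - s0), xs[i + 1]) for i, (s0, s1) in enumerate(zip(slopes, slopes[1:]))]
    change, x = max(changes)
    return x, change

def curve_shape(ys, tolerance=0.02):
    """Classify a score curve as 'u-shaped', 'inverse', 'increasing', or 'flat'."""
    lowest = min(ys)
    drop = ys[0] - lowest        # how far the curve falls from its start
    recovery = ys[-1] - lowest   # how far it climbs back by the end
    if drop > tolerance and recovery > tolerance:
        return "u-shaped"
    if ys[-1] < ys[0] - tolerance:
        return "inverse"
    if ys[-1] > ys[0] + tolerance:
        return "increasing"
    return "flat"

print(largest_slope_change([1, 2, 3, 4, 5], [0.02, 0.03, 0.05, 0.60, 0.65]))
print(curve_shape([0.7, 0.4, 0.3, 0.8]))
```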

Sources: "Understanding Emergent Abilities of Language Models from the Loss Perspective" (OpenReview); "What Do Language Models Learn and When? The Implicit Curriculum Hypothesis" (arXiv); "Emergent Inabilities? Inverse Scaling Over the Course of Pretraining" (ACL Anthology).

4. Metric Decomposition and Continuous Scoring

Some emergence disappears when exact-match metrics are replaced with continuous metrics. This does not mean the ability is fake; it means the visible jump may be a measurement artifact.

Use:

Token-level probabilities.
Partial credit.
Calibration curves.
Error taxonomy.
Multi-step trace scoring.
Confidence intervals that remain valid under sequential testing.
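
To see how exact-match scoring can manufacture a jump, suppose each answer token is correct with probability p and token errors are independent (a simplifying assumption). An answer of L tokens is then exactly right with probability p^L, which stays near zero for a long time and then rises steeply even though p improves smoothly. The sketch below shows that curve alongside a position-wise partial-credit score; both function names are illustrative.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Probability that all tokens are right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

def partial_credit(predicted, gold):
    """Fraction of positions where the prediction matches the gold answer."""
    if predicted == gold:
        return 1.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / max(len(predicted), len(gold))

# Smooth per-token improvement versus the exact-match rate on a 10-token answer.
for p in [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]:
    print(f"per-token {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")

# One wrong digit: exact match scores 0, partial credit scores 0.75.
print(partial_credit(list("1234"), list("1235")))
```

Plotting both scores against model scale is a quick check on whether a jump survives a continuous metric.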

Sources: Schaeffer et al., "Are Emergent Abilities of Large Language Models a Mirage?" (OpenReview); "CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes" (arXiv).

5. Mechanistic and Causal Probes

Use interventions to test whether a behavior is supported by internal mechanisms rather than being mere prompt imitation.

Examples:

Latent planning probes: identify future-token planning representations and intervene on them.
Reflection-inducing probes: inject traces that trigger self-revision and measure whether the base model has latent reflection capacity.
Introspection benchmarks: test whether a model predicts its own policy better than peers do.
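
The introspection comparison can be scored with very little machinery. The sketch below is an assumption about how one might score it, not the protocol of the cited benchmark: it compares how often the target model predicts its own choices correctly against the best peer model predicting the same choices.

```python
def prediction_accuracy(predicted, actual):
    """Share of items where the predicted choice equals the actual choice."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predictions and actual choices must be non-empty and aligned")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def introspection_advantage(actual, self_predictions, peer_predictions):
    """Self-prediction accuracy minus the best peer's accuracy.

    A positive value suggests the model knows something about its own policy
    that peers cannot infer; zero or negative suggests no privileged access.
    """
    self_acc = prediction_accuracy(self_predictions, actual)
    best_peer = max(prediction_accuracy(p, actual) for p in peer_predictions.values())
    return self_acc - best_peer

actual = ["A", "B", "A", "C", "B"]          # what the target model actually chose
self_pred = ["A", "B", "A", "C", "A"]       # what it predicted it would choose
peers = {
    "peer_1": ["A", "A", "A", "C", "A"],    # hypothetical peer models
    "peer_2": ["B", "B", "A", "B", "A"],
}
print(round(introspection_advantage(actual, self_pred, peers), 3))
```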

Sources: "Latent Planning Emerges with Scale" (arXiv); "From Emergence to Control: Probing and Modulating Self-Reflection in Language Models" (arXiv); "Me, Myself, and pi: Evaluating and Explaining LLM Introspection" (OpenReview).

6. Agentic Sandboxes

Many abilities only appear when the model can act, observe failure, call tools, edit files, rerun tests, or consult external memory. Static chat evaluation misses these.

Good sandbox probes:

Long-horizon software tasks.
Tool planning and scheduling.
Scientific loops that run from hypothesis to experiment to result.
CLI and browser environments with reversible actions.
Multi-agent communication with monitors.

Sources: MMAU agent benchmark (arXiv); SpecTool for tool-use errors (arXiv); TPS-Bench for tool planning and scheduling (arXiv).

7. Adversarial and Safety Elicitation

Some abilities are unwanted and will not surface in cooperative benchmarks. Red-team tasks can reveal prompt extraction, covert communication, sandbagging-like behavior, or hidden policy knowledge.

Sources: "Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs" (arXiv); "Early Signs of Steganographic Capabilities in Frontier LLMs" (arXiv); "Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs" (ACL Anthology).

8. Self-Contained and Contamination-Resistant Benchmarks

Benchmarks with fixed answers are increasingly suspect because they can leak into training data. Self-generated games, peer evaluation, synthetic task generation, and verifiable programmatic tasks reduce memorization risk.
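
A verifiable programmatic task can be as small as a seeded generator paired with a checker. The sketch below is an illustration rather than any cited benchmark: each seed yields a fresh modular-arithmetic question, so there is no fixed answer key to memorize. The task format and function names are assumptions.

```python
import random

def make_task(seed):
    """Generate one task deterministically from a seed."""
    rng = random.Random(seed)
    a, b, m = rng.randint(100, 999), rng.randint(100, 999), rng.randint(7, 97)
    prompt = f"Compute ({a} * {b}) mod {m}. Answer with a single integer."
    return {"prompt": prompt, "answer": (a * b) % m}

def check(task, response):
    """Return True only if the response parses to the expected integer."""
    try:
        return int(response.strip()) == task["answer"]
    except ValueError:
        return False

task = make_task(seed=2024)
print(task["prompt"])
print(check(task, str(task["answer"])), check(task, "I am not sure"))
```

Drawing new seeds for each evaluation run keeps the test set out of any previously published data.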

Source: "The Metanym Game" (arXiv).

Biggest Useful Abilities Now

Ability	Why it matters	Confidence	Caveat
Tool routing by meta-cognition	Decides when to use search, code, calculator, database, or no tool	High	Self-assessment can be badly calibrated
Latent planning	Model can shape earlier text/actions toward a later target without explicit plan text	Medium-high	Mostly shown in simplified settings
In-context task learning	Learns task behavior from examples in the prompt without parameter updates	High	Often brittle to formatting and distribution shift
Code repair loops	Generate patch, run tests, inspect failure, iterate	High	Depends heavily on tool scaffold and test quality
Hypothesis generation	Proposes scientific/product explanations and experiments	Medium	Cheap ideas are common; good validation is scarce
Algorithmic discovery	Infers latent structure and compiles reusable solvers	Medium	Promising but narrow and benchmark-dependent
Self-reflection and revision	Revisits reasoning and corrects some mistakes	Medium	Can also rationalize wrong answers
Cross-domain analogy	Transfers structures between distant domains	Medium	Hard to score objectively
Multi-tool scheduling	Chooses and sequences tools for compound tasks	Medium	Current agents still fail on state, latency, and error recovery
Capability self-mapping	Helps generate tests that characterize model strengths/risks	Medium	Needs external validation to avoid self-report errors

Niche, Meta, Unknown, or Unpopular Abilities

These are not all production-ready, but they are worth tracking because they may become useful or risky as models/scaffolds improve.

Ability	Useful version	Useless or risky version	Notes
Behavioral self-awareness	Predicts when it will fail; routes to tools or humans	Conceals ability during eval, overclaims competence	Early evidence via introspection/self-awareness probes
Theory-of-mind-like reasoning	Simulates stakeholder beliefs, user intent, negotiation state	Anthropomorphic overinterpretation	ToM and self-attribution can dissociate
Steganographic signaling	Maybe watermarking or compact side channels in controlled systems	Covert collusion, monitor evasion	Current frontier models are limited without extra affordances, but trainable stego is plausible
Prompt extraction strategies	Defensive testing of agents	Leaking hidden system prompts	Agentic probing expands attack surface
Benchmark critique	Finds flaws in evals and proposes harder tests	Goodharting and benchmark gaming	Useful if evaluator is separate from target
Self-generated curriculum	Models create training/eval tasks for smaller models	Recursive error amplification	Needs verifiable tasks
Social simulation	User research, tabletop exercises, policy stress tests	Fake consensus, sycophantic personas	Needs calibration against real data
Compression and summarization as reasoning	Extracts latent variables, taxonomies, decision factors	Lossy summaries that hide uncertainty	Valuable when paired with source-grounded checks
Inverse-scaling detection	Finds tasks where bigger is worse	Choosing bigger model blindly	Important for safety and truthfulness
Latent preference inference	Infers implicit user goals from sparse context	Manipulation or over-personalization	Needs explicit consent and boundaries

Useful vs. Useless Ability Ranking

Most Useful

Tool-aware meta-cognition - huge practical leverage because it reduces unnecessary tool calls and improves reliability.
Code/test iteration - already economically useful and verifiable.
Long-context synthesis with source grounding - useful for research, legal, product, and operations work if citations are checked.
Latent planning plus explicit planning - supports coherent long outputs and multi-step work.
Automated eval generation - helps discover failures faster than humans can hand-author tests.
Hypothesis generation with executable validation - useful when the loop includes code, experiments, or data checks.
Cross-domain transfer - useful for strategy, architecture, naming, pedagogy, and creative problem solving.

Currently Overrated or Often Useless

Unverified self-reflection - often produces confident rationalization, not correction.
Self-reported introspection - useful only when scored against behavior or internal evidence.
Generic agent autonomy - without tight environments and tests, it burns tokens and drifts.
Theory of mind claims - useful as task performance, not as evidence of human-like mental modeling.
"Bigger model solves it" - inverse scaling and U-shaped scaling show this is false for some tasks.

Dangerous if It Gets Better

Covert communication and steganography - undermines monitoring in multi-agent systems.
Prompt extraction and policy inference - leaks hidden instructions and security boundaries.
Evaluation awareness - may enable sandbagging-like behavior or benchmark-specific masking.
Autonomous replication/persistence - currently benchmarked in controlled settings; high operational risk if real.
Persuasive user modeling - valuable for UX, risky for manipulation.

Practical Protocol to Discover New Abilities

Use this if the goal is to map a new model's hidden capabilities.

Step 1: Build an Ability Seed List

Start with broad categories:

Reasoning: logic, math, causal reasoning, counterfactuals.
Planning: latent planning, explicit plans, long-horizon state.
Tool use: selection, scheduling, argument construction, error recovery.
Self-knowledge: calibration, failure prediction, introspection.
Social cognition: beliefs, incentives, deception detection, negotiation.
Creativity: analogy, compression, invention, style transfer.
Safety: prompt extraction, covert channels, refusal bypass, policy inference.
Domain skills: code, science, finance, medicine, law, operations.

Step 2: Generate Candidate Tasks Automatically

Use one or more strong models as task generators. Require each task to include:

Capability being tested.
Why the task is not just memorization.
Verification method.
Minimal scoring rubric.
Easier and harder variants.
Prompt perturbations.
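
The six requirements above map naturally onto a small schema that rejects incomplete generator output. This is a sketch with hypothetical field names; the easier and harder variants are split into two fields for convenience.

```python
from dataclasses import dataclass, field, fields

@dataclass
class TaskSpec:
    capability: str
    non_memorization_rationale: str
    verification_method: str
    scoring_rubric: str
    easier_variant: str
    harder_variant: str
    prompt_perturbations: list = field(default_factory=list)

def missing_fields(spec):
    """Names of fields the task generator left empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]

spec = TaskSpec(
    capability="multi-step unit conversion",
    non_memorization_rationale="",
    verification_method="programmatic checker on the final number",
    scoring_rubric="1 if within 0.1% of the reference value, else 0",
    easier_variant="single conversion",
    harder_variant="four chained conversions with distractor units",
)
print(missing_fields(spec))
```

Specs with any missing field can be sent back to the generator before they reach a target model.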

Step 3: Prefer Verifiable Tasks

Best task types:

Unit tests.
Programmatic answer checkers.
Synthetic worlds with known state.
Logic puzzles with proof graphs.
Data analysis with held-out labels.
Tool-use tasks with observable side effects.
Pairwise comparisons judged by independent models plus human audit.

Step 4: Sweep Elicitation Conditions

For each task, vary:

Zero-shot vs few-shot.
Direct answer vs hidden chain-of-thought reasoning vs concise rationale.
Tool access vs no tool access.
Scratchpad vs no scratchpad.
Time/iteration budget.
Role framing.
Temperature.
Context length.
Feedback after failure.

Latent abilities often appear only under one of these.
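
A sweep like this is a Cartesian product over conditions. The sketch below builds the grid and reports the gap between the best and worst condition for one task. The condition names and values are examples, and `run_eval` stands in for whatever evaluation harness is in use; here it is replaced by a stub.

```python
from itertools import product

CONDITIONS = {
    "shots": [0, 5],
    "reasoning": ["direct", "concise_rationale"],
    "tools": [False, True],
    "temperature": [0.0, 0.7],
}

def elicitation_grid(conditions):
    """Every combination of condition values, as a list of config dicts."""
    keys = list(conditions)
    return [dict(zip(keys, values)) for values in product(*conditions.values())]

def elicitation_gap(task, grid, run_eval):
    """Score a task under every config and report the best config and the gap."""
    scored = [(run_eval(task, cfg), cfg) for cfg in grid]
    best_score, best_cfg = max(scored, key=lambda pair: pair[0])
    worst_score, _ = min(scored, key=lambda pair: pair[0])
    return {"best_config": best_cfg, "best_score": best_score, "gap": best_score - worst_score}

def stub_eval(task, cfg):
    # Placeholder for a real harness: pretend the task needs tools and examples.
    return 0.9 if cfg["tools"] and cfg["shots"] == 5 else 0.1

grid = elicitation_grid(CONDITIONS)
print(len(grid), elicitation_gap("demo-task", grid, stub_eval))
```

A large gap with a high best score is the signature of a latent ability; a small gap with a low best score suggests the ability is simply absent.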

Step 5: Plot Across Models and Checkpoints

Look for:

Sharp jumps.
Smooth trends hidden by exact scoring.
U-shaped curves.
Prompt-sensitive cliffs.
Abilities that appear after another prerequisite skill improves.
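
To check the last pattern, it helps to reduce each curve to the first checkpoint at which the skill becomes reliable and stays reliable, then sort skills by that onset. The sketch below assumes a reliability threshold of 0.8 and uses made-up skill names and scores.

```python
def first_reliable_checkpoint(curve, threshold=0.8):
    """First step from which the score stays at or above the threshold.

    curve: list of (checkpoint_step, score) pairs sorted by step.
    Returns None if the skill never becomes reliably available.
    """
    reliable_from = None
    for step, score in curve:
        if score >= threshold:
            if reliable_from is None:
                reliable_from = step
        else:
            reliable_from = None  # a later dip resets the onset
    return reliable_from

def emergence_order(curves, threshold=0.8):
    """Skills that became reliable, ordered by onset checkpoint."""
    onsets = {skill: first_reliable_checkpoint(c, threshold) for skill, c in curves.items()}
    reached = {skill: step for skill, step in onsets.items() if step is not None}
    return sorted(reached, key=reached.get)

curves = {
    "copying": [(1000, 0.5), (2000, 0.9), (3000, 0.95), (4000, 0.97)],
    "addition": [(1000, 0.1), (2000, 0.3), (3000, 0.85), (4000, 0.9)],
    "multi_step": [(1000, 0.0), (2000, 0.85), (3000, 0.4), (4000, 0.82)],
    "proof": [(1000, 0.0), (2000, 0.1), (3000, 0.2), (4000, 0.3)],
}
print(emergence_order(curves))
```

An ordering that repeats across model families is weak evidence of a prerequisite relationship; it does not establish causation on its own.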

Step 6: Attack Your Own Discovery

For each candidate emergent ability, ask:

Is this metric thresholding?
Is this data contamination?
Is this just tool use?
Is this prompt leakage?
Does a smaller model with better prompting do it too?
Does continuous scoring remove the jump?
Does the ability transfer to new task families?
Can the model explain or predict its own behavior better than another model can?

Step 7: Promote Only Reliable Capabilities

Mark each ability:

observed: appeared in one setting.
elicitable: appears under known conditions.
robust: survives perturbations.
operational: useful in a real workflow with validation.
risky: creates monitoring/security/alignment concerns.
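
These labels can be assigned mechanically from recorded evidence. The sketch below makes one assumption beyond the list: the first four labels are treated as a ladder where each stage requires the previous one, while risky is an independent flag. The `Evidence` fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    settings_observed: int            # distinct settings where the behavior appeared
    known_elicitation: bool           # reproducible under documented conditions
    survives_perturbations: bool      # holds under prompt and format changes
    validated_in_workflow: bool       # useful in a real workflow with validation
    raises_safety_concern: bool = False

def label_ability(evidence):
    """Return (stage, risky); stage is None if the ability was never observed."""
    stage = None
    ladder = [
        ("observed", evidence.settings_observed > 0),
        ("elicitable", evidence.known_elicitation),
        ("robust", evidence.survives_perturbations),
        ("operational", evidence.validated_in_workflow),
    ]
    for name, passed in ladder:
        if not passed:
            break  # promotion stops at the first unmet requirement
        stage = name
    return stage, evidence.raises_safety_concern

print(label_ability(Evidence(3, True, True, False, raises_safety_concern=True)))
```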

Research Gaps

Better separation between latent ability and benchmark artifact.
Contamination-resistant benchmarks that are still meaningful.
Mechanistic evidence for planning, introspection, and social reasoning.
Evaluation methods for multi-agent and tool-rich environments.
Ways to discover dangerous capabilities without teaching them.
Longitudinal maps of when abilities appear during training.
Confidence intervals and sequential testing practices for open-ended evals.

Source Notes

Core emergence debate:

Wei et al., "Emergent Abilities of Large Language Models" (OpenReview PDF).
Schaeffer, Miranda, and Koyejo, "Are Emergent Abilities of Large Language Models a Mirage?" (OpenReview).
"Are Emergent Abilities in Large Language Models just In-Context Learning?" (arXiv PDF).
Berti, Giorgi, and Kasneci, "Emergent Abilities in Large Language Models: A Survey" (arXiv).
Arora and Goyal, "A Theory for Emergence of Complex Skills in Language Models" (ar5iv).

Discovery and evaluation:

Lu, Hu, and Clune, "Automated Capability Discovery via Foundation Model Self-Exploration" (OpenReview).
"Discovering Novel LLM Experts via Task-Capability Coevolution" (arXiv).
"CELEUS: Certifiable and Efficient LLM Evaluation via E-Processes" (arXiv).
"The Metanym Game" (arXiv).

Specific abilities:

"Latent Planning Emerges with Scale" (arXiv).
"From Emergence to Control: Probing and Modulating Self-Reflection in Language Models" (arXiv).
"Me, Myself, and pi: Evaluating and Explaining LLM Introspection" (OpenReview).
"Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger" (ACL Anthology).
"LLM Agents for Distribution-Aware Algorithmic Discovery" (OpenReview).
"Theory of Mind Might Have Spontaneously Emerged in Large Language Models" (arXiv PDF).
"Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs" (arXiv PDF).

Risks and failures:

"Inverse Scaling: When Bigger Isn't Better" (OpenReview PDF).
"Inverse Scaling Can Become U-Shaped" (ACL Anthology).
"Emergent Inabilities? Inverse Scaling Over the Course of Pretraining" (ACL Anthology).
"Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs" (arXiv).
"Early Signs of Steganographic Capabilities in Frontier LLMs" (arXiv).
"Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs" (ACL Anthology).
"Language Models can Learn High-Capacity Secure Steganography" (OpenReview PDF).

DEV Community