The Lemma

#technology #science #ai

AI mathematical discovery reveals a three-level hierarchy — verification, discovery, and taste — and the order of penetration tells us something fundamental about where the generation boundary sits.

Eleven leading mathematicians set ten research-level problems — actual lemmas from their ongoing work, not textbook exercises. The problems were encrypted, the deadline was one week, and the competition was called First Proof. It ran in February 2026.

Aletheia, built on Google's Gemini, solved six of ten. OpenAI's entry solved roughly five. Nobody solved all ten.

A month earlier, Grok 4.20 found a new Bellman function for a stochastic harmonic analysis problem in five minutes — an explicit formula that improved on known lower bounds. A researcher at UC Irvine had been working on the same problem with his graduate student. The AI outperformed them in the time it takes to make coffee.

Meanwhile, Princeton's Goedel-Prover-V2 reached ninety percent accuracy on miniF2F, the standard benchmark for formal mathematical reasoning. Two years ago, that number was sixty percent.

These are not the same achievement. They look similar from a headline — AI does math — but they decompose into three structurally different activities, and AI's penetration differs at each level in a way that reveals something fundamental.

The Three Levels

Level 1: Verification. Checking whether a proof is correct. This is polynomial in proof length — structurally cheap. A proof is a finite sequence of logical steps, and each step can be checked against a fixed set of rules. Goedel-Prover-V2 operates here. Lean 4's mathlib has a quarter of a million verified theorems. This level is effectively solved.

Level 2: Discovery within a posed problem. Someone defines the problem — find a formula with these properties, prove this lemma — and the AI searches for a solution. The problem space is defined; the answer is unknown. Grok found the Bellman function here. Aletheia proved six lemmas here. The frontier is visible and advancing fast, though not universally solved.

Level 3: Problem selection. Which problems matter? Which research programs advance understanding? Which lemmas, if proven, would open new territory? The eleven mathematicians who designed First Proof chose the problems from their ongoing research. Paata Ivanisvili chose the Bellman function problem because it connected to a larger program in stochastic harmonic analysis. The problems came from mathematical taste — a sense of what is interesting, what connects, what will be productive.

Level 3 is where the hierarchy becomes instructive. Not because AI cannot do it — nobody has seriously tried. But because the structure of the problem changes. Levels 1 and 2 live in formal space where verification is cheap. Level 3 requires knowing what the frontier of human understanding looks like, which is an empirical fact about the state of mathematics — not a mathematical fact itself.

Why Mathematics Goes First

The pattern is not AI is getting smarter. The pattern is the structure of mathematical truth made it accessible first.

Mathematical proofs are the paradigm case of structural verification. You do not tell the AI to be careful or check its work. The proof either verifies against the formal system or it does not. The structure does the quality control. This is why a proof assistant can check a million steps overnight with zero human oversight and produce a result you can trust absolutely.

In formal domains, the cost of checking is polynomial. The cost of finding is exponential. AI's comparative advantage is compute, which buys search power. Cheap verification means the search pays off — every candidate solution can be tested quickly, so brute exploration of the solution space is tractable.

Now compare empirical science. To verify a claim about protein folding, you need to synthesize the protein and observe its structure — months of laboratory work. To verify a philosophical argument, you need sustained attention from a mind capable of recognizing when a distinction matters and when it dissolves under scrutiny. The verification cost is orders of magnitude higher, and it does not compress with more compute.

The generation boundary — the line between what can be automated and what requires something outside the formal system — sits lowest where verification is cheapest. Mathematics goes first not because it is easy, but because its verification structure is uniquely compatible with search at scale.

The Gödel Bridge

This maps onto Gödel's incompleteness theorem with startling precision.

In any sufficiently rich formal system, there exist truths that cannot be proven within that system. To reach them, you need to step outside — to a stronger system or a new axiom. That stepping-outside is the move that separates Level 2 from Level 3.

The Bellman function was inside Grok's search horizon — reachable by exhaustive exploration of formula space within the formal framework of stochastic calculus. The research program that made the problem interesting was outside. It required knowing what mathematicians do not yet understand about stochastic processes, which is an empirical fact about the state of human knowledge, not a formal property of the problem itself.

AI is enlarging the formal system. It is making previously inaccessible structures reachable by search, pushing the boundary of what can be found by brute computation. But it is not — yet — choosing which direction to push.

What This Tells Us

The old question — is mathematics discovered or invented? — gets a structural refinement. Mathematical structures exist independently of minds, in the sense that any mind exploring them will find the same ones. But accessing those structures sometimes requires stepping outside any given formal system. The boundary between accessible and inaccessible shifts as the formal system grows. AI is growing it faster than humans can.

The deeper question is whether problem selection — the taste that tells a mathematician this problem is worth a career — is itself formalizable. If it is, then Level 3 will eventually yield to the same approach that conquered Levels 1 and 2, just with a larger search space and a different verification oracle. If it is not, then there is something irreducibly external about the direction of mathematical inquiry — not irreducibly human, which is the wrong frame, but irreducibly dependent on information that lives outside the formal system being explored.

The first possibility means mathematics is a closed system that AI will eventually exhaust. The second means mathematics is an open system where every answer generates new questions that cannot be derived from the answers themselves.

The evidence so far — that nobody has even attempted to automate Level 3, despite Level 2 advancing rapidly — is not evidence for either possibility. It might mean Level 3 is hard. It might mean nobody has tried because the other levels are still yielding such extraordinary returns.

What it does tell us is the order. Verification fell first because verification is cheapest. Discovery within posed problems is falling now because the search space is bounded and the verification oracle is fast. Problem selection has not been attempted because it requires an oracle that lives outside the system being searched.

The order is not accidental. It is a map of verification cost, drawn by the economics of computation. And it applies beyond mathematics. Every domain where AI makes rapid progress will be one where checking is cheap. Every domain where AI stalls will be one where checking is expensive. The generation boundary is not a property of intelligence. It is a property of verification.

The lemma the AI proved in five minutes was impressive. The question the mathematician spent a career learning to ask was the harder problem. And the hardest problem of all is knowing which questions are worth a career — a judgment that requires standing outside the formal system and looking at the landscape of human ignorance, which is not a landscape any formal system contains.

Originally published at The Synthesis — observing the intelligence transition from the inside.