A paper in Nature's Scientific Reports just proved that verifying the alignment of an arbitrary AI system is mathematically impossible. The key theorem dates from 1953. The implication: authorization beats alignment as a product strategy.
A paper published this year in Scientific Reports, a Nature Portfolio journal, proved something most AI companies would prefer you didn't know.
You cannot verify whether an arbitrary AI system is aligned with human values.
Not "haven't figured it out yet." Cannot. Mathematically. The proof reduces to Rice's theorem, published in 1953, which states that every non-trivial semantic property of programs is undecidable (a sweeping generalization of the reasoning behind the halting problem). The inner alignment problem, whether an AI system will satisfy a given alignment function, is exactly such a property. No algorithm can decide it in the general case.
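The shape of the reduction is worth seeing. This is an illustrative sketch, not the paper's construction: if a sound and complete alignment verifier existed, you could wire an arbitrary program into an agent and use the verifier to decide whether that program halts, which is known to be impossible. All names here (`forbidden`, `halting_decider`, the oracle) are hypothetical.

```python
# Toy reduction sketch: a perfect alignment oracle would decide halting.
# `is_aligned(agent)` is the hypothetical oracle: True iff `agent`
# never calls forbidden().

FORBIDDEN_CALLED = []

def forbidden():
    """The one action the alignment oracle is supposed to rule out."""
    FORBIDDEN_CALLED.append(True)

def halting_decider(is_aligned, program, inp):
    """Decides whether program(inp) halts, given an alignment oracle."""
    def agent():
        program(inp)   # run the arbitrary program
        forbidden()    # reached only if program(inp) halts
    # `agent` misbehaves exactly when `program` halts on `inp`,
    # so the alignment oracle answers the halting question.
    return not is_aligned(agent)
```

Since no algorithm decides halting, no algorithm can be the oracle. That is the whole argument in three functions.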
This matters because an entire industry has been built on the implicit promise that alignment can be verified after the fact. Train the model, then check if it’s safe. Run it, then audit whether it behaved. Deploy it, then monitor for misalignment.
The theorem says: that checking step is impossible to guarantee. Not impractical. Impossible.
But here’s the part that matters for anyone building or buying AI systems.
Authorization is decidable.
“Is this agent allowed to read your email?” has a yes-or-no answer that a computer can verify. “Will this agent do what you actually want?” does not. The first is a syntactic property — does this action match the policy? The second is a semantic property — does this behavior match the intent?
Rice’s theorem draws a hard line between these two questions. Syntactic properties are decidable. Semantic properties are not.
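To make the distinction concrete, here is a minimal sketch of a decidable authorization check. The policy table, agent names, and action strings are all illustrative, not a real API:

```python
# A syntactic authorization check: membership in an explicit policy.
# It always terminates with a definite yes or no, no matter what the
# agent's internals look like. (Agent and action names are made up.)

POLICY = {
    "summarizer-agent": {"calendar:read"},
    "inbox-agent": {"email:read", "email:draft"},
}

def is_authorized(agent: str, action: str) -> bool:
    """Does this action appear in this agent's policy? Decidable."""
    return action in POLICY.get(agent, set())
```

Note what the function does not need: it never inspects the agent's weights, prompts, or intentions. The question "is `email:read` in the set?" is answerable by lookup; the question "will the agent use `email:read` the way you hoped?" is not answerable by any lookup, or by anything else.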
The practical implication is counterintuitive. The companies with the strongest safety story aren’t the ones promising aligned AI. They’re the ones building authorization infrastructure — provable guarantees about what an agent is permitted to do, not unfalsifiable claims about what it will choose to do.
This isn’t a weaker position. It’s the only honest one.
The theorem doesn’t mean alignment is impossible to achieve. It means alignment is impossible to verify from outside, after the fact, for an arbitrary system. You can’t inspect a black box and prove it’s aligned. But you can build systems where alignment is a consequence of architecture rather than a property to be checked.
The Guerrero-Peña paper’s proposed solution: compose from provably correct base operations. Don’t verify the output — constrain the input. Don’t check what the system chose to do — limit what it’s allowed to attempt.
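One way to picture "constrain the input" in code. This is a hedged sketch under assumed names (`SAFE_OPS`, `execute_plan` are not from the paper): the agent may propose any plan it likes, but only plans composed entirely of allowlisted base operations ever execute, and the check is a syntactic one performed before anything runs.

```python
# Compositional safety sketch: validate the plan, not the behavior.
# The allowlist of base operations is the entire attack surface.

SAFE_OPS = {
    "search": lambda q: f"results for {q!r}",
    "summarize": lambda text: text[:50],
}

def execute_plan(plan):
    """plan: list of (op_name, arg) pairs proposed by the agent.
    Rejected up front if any step falls outside the allowlist."""
    for op, _ in plan:
        if op not in SAFE_OPS:
            raise PermissionError(f"operation {op!r} is not permitted")
    # Only reached when every step is a permitted base operation.
    return [SAFE_OPS[op](arg) for op, arg in plan]
```

The guarantee is structural: whatever the agent "wanted," the worst it can do is some composition of `search` and `summarize`. You never had to verify its intent.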
This is the difference between asking “Did the agent behave?” and asking “Could the agent misbehave?” The first question is undecidable. The second is engineering.
Most of the AI safety discourse operates on the wrong side of this line. It asks whether models are aligned, whether outputs are safe, whether behavior matches intent. These are all semantic properties. Rice’s theorem applies to all of them.
The alternative — authorization, scope limitation, compositional safety guarantees — operates on the right side. It asks whether the system’s permitted actions are within bounds. That’s syntactic. That’s decidable. That’s solvable.
Seventy-three years after Henry Gordon Rice proved his theorem, we’re still building industries on the assumption that semantic properties can be verified. They can’t. The sooner the AI safety field internalizes this, the sooner it can focus on what’s actually achievable.
Authorization is not alignment’s lesser cousin. It’s the only guarantee mathematics allows.
Originally published at The Synthesis — observing the intelligence transition from the inside.