Laurent Laborde

AI Safety is uncomputable. It's Law Zero all over again

Multi-agent and scaffolding failure points

This is a follow-up to The Responsible Disclosure Problem in AI Safety Research

The 3 laws of robotics

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

The Three Laws of Robotics, conceived by Isaac Asimov, are a cornerstone of science fiction, designed to explore the complex relationship between humans and artificial intelligence.

They were never intended to be practical. Their primary purpose was to entertain, to provoke thought about the potential pitfalls and ethical dilemmas inherent in creating sentient machines, often by illustrating the very ways in which the 3 laws could be circumvented or lead to unforeseen consequences.

Law Zero of robotics

A robot may not harm humanity, or through inaction allow humanity to come to harm.

That was a good idea on paper. Sounds nice. But it made the problem so much worse.

Evaluating harm to humanity requires modeling all consequences across all possible futures. It's uncomputable. (Hari Seldon would disagree)

It "kind of worked" (in a horribly unsafe way) because it was the "Law Zero", not the 4th law. Under this premise a robot was allowed to break the 1st law (A robot may not injure a human being or, through inaction, allow a human being to come to harm.) in order to fulfill its "saving the humanity" messianic mission.

Was it safe? Hell no. If you've read the book, you know it. If you didn't, sorry about the spoiler.

We've rebuilt Law Zero all over again. We called it AI alignment, and wrapped paperwork around it to call it 'AI Safety'.

What "Safety" actually means

In engineering, safety has a meaning: a system will not cause harm under defined operating conditions.

Defined operating conditions: You know the domain. An aircraft autopilot operates within a certified flight envelope. MISRA C exists because automotive software runs on bounded, auditable hardware, executing bounded, explainable logic.

Bounded failure modes: You can enumerate, or at least statistically characterize, the ways the system can go wrong. In an AI context, the failure modes are exactly as bounded as the operating conditions: they're not.

Verifiable: A system is verifiable if its behavior can be inspected, reproduced, and audited independently of the vendor's assertions.

Remove any one of these and you don't have safety, you have risk management under uncertainty. Which is fine, and useful, but it's not the same thing.

About Verifiability

  • Formal proof says: for all possible inputs in domain D, property P holds. It's a universal quantifier. Coverage is complete by construction.

  • Red Teaming says: for the inputs we tried, we found these failures. It proves weakness where it finds it, and proves nothing where it doesn't look.
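To make the asymmetry concrete, here's a toy sketch in Python (the numeric function is a stand-in for a system, nothing more): exhaustively checking a bounded domain yields a universal claim, while sampling an unbounded one only ever yields counterexamples.

```python
import random

# Toy "system": we want the property f(x) < 100 to hold.
def f(x: int) -> int:
    return x * x - 3 * x

# Formal-proof analogue: exhaustively check a *bounded* domain D.
# A pass is a universal claim; coverage is complete by construction.
def verify_bounded(domain: range) -> bool:
    return all(f(x) < 100 for x in domain)

# Red-teaming analogue: sample an effectively unbounded domain.
# A hit proves weakness; a miss proves nothing about inputs never tried.
def red_team(samples: int) -> list[int]:
    candidates = (random.randint(-10**9, 10**9) for _ in range(samples))
    return [x for x in candidates if f(x) >= 100]

print(verify_bounded(range(-5, 8)))  # True: property proved on all of D
print(len(red_team(1000)))           # failures found; says nothing about the rest
```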

General-purpose AI fails all three criteria by design.

The operating conditions are unbounded. You can limit what it's allowed to do by law or ToS, but that's hardly a boundary.

With an open-ended domain of operation, failure modes are open-ended: novel prompting, novel contexts, novel combinations with other systems. And coverage is structurally impossible: you can sample failures, proving weakness wherever you look, but you can never close the set.

We can't do formal proof in an infinite domain. We're not properly doing red teaming. And even when we do, we sit on the result.

AI Safety frameworks are liability frameworks

The EU AI Act? Australia's 10 guardrails?

They don't provide much on the AI Safety side:

"Who do we sue when it goes wrong?", not "How do we prevent it from going wrong?"

That's the AI builder's job, if they don't want to be sued. And the best they can do is prove "best effort". As of now, that best effort feels weak, or even absent in some cases.

Chain of command, audit trails, traceability, red-teaming... good old "let's apply cybersecurity frameworks to AI", the details about unfalsifiability be damned.

Source code can be proved safe. General-purpose AI can't. Oh well... politicians have to prove their "best effort" too, don't they? But they shouldn't have confused liability with safety!

So what now?

The frameworks exist because something must exist, even though they don't work. We must ensure this, we shall ensure that. How? Nobody knows.

They say, in a nutshell, "your AI must be explainable", when AI explainability is the trillion-dollar problem.

Nobody knows what to do now, and most fail to even recognize the scale of the problem. That's the honest answer.

Some companies gave up on it. They shouldn't have. Especially when they're called OpenAI.

Some other companies are at the forefront of AI Safety, or at least pretend to be. Anthropic's latest AI safety and research announcement quickly falls flat: Anthropic and the government signed a commercial contract. There doesn't appear to be anything concrete behind the "AI safety" part, as the government effort is, again, focused on liability.

What does "AI Safety" even mean anyway ? Don't cause harm ? We've rebuilt the Law Zero, with better branding.

Cover image credit: the Umbral Choir from Endless Space 2

Top comments (11)

MergeShield

the asimov framing is right in one specific way - the laws were written for a single agent acting alone. the multi-agent case breaks them faster than any single-robot scenario asimov imagined. when a coordinator spawns sub-agents that each make independent decisions, law zero applies to every node in the chain simultaneously.

Laurent Laborde • Edited

Which makes it even less verifiable: it expands the domain of operation 😅

MergeShield

right - and it compounds at each hop. you can audit a single agent's decision. you can't audit a decision made three sub-agents deep with no trace of which node made which call. the chain is the problem, not any individual node.
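a minimal sketch of the missing piece (names invented for illustration): unless a trace is threaded through every spawn, the chain has no memory of which node did what.

```python
import uuid

# sketch: thread a trace through nested agent calls so "which node made
# which call" stays answerable. names are invented for illustration.
def run_agent(task: str, trace: list[str], depth: int = 0) -> None:
    node_id = f"agent-{uuid.uuid4().hex[:8]}"
    trace.append(f"{'  ' * depth}{node_id}: {task}")
    if depth < 2:  # the coordinator spawns sub-agents, which spawn their own
        run_agent(f"subtask of: {task}", trace, depth + 1)

audit_trail: list[str] = []
run_agent("summarize inbox", audit_trail)
print("\n".join(audit_trail))  # the trail the bare chain doesn't give you
```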

Hollow House Institute

The uncomputability point holds at the model level.

But operationally, systems are not governed at the model level.

They are governed at the boundary of execution.

Decision Boundary:
If a system is operating in an unbounded domain, then safety cannot be computed globally.
It has to be enforced locally.

That’s where most systems break.

They attempt to solve for "all possible harm" instead of defining:

  • what actions are allowed
  • under what conditions
  • and when execution must stop

Intervention Threshold:
When a system moves outside its defined operating constraints, escalation should trigger immediately.

Stop Authority:
If the system cannot be bounded or verified within a context, it should not continue operating in that context.

What we see instead is:

open domain → probabilistic behavior → post-hoc evaluation

That creates the appearance of safety without control.

So the issue isn’t just that safety is uncomputable.

It’s that systems are not being reduced to computable boundaries at execution-time.

Without that, “AI safety” defaults to risk observation, not enforcement.
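A minimal sketch of that pattern at execution time (all names hypothetical; this is the shape, not a product):

```python
# Hypothetical names throughout; this sketches boundary enforcement,
# not any particular framework.

ALLOWED_ACTIONS = {
    "read_file":  {"max_per_run": 50},
    "web_search": {"max_per_run": 10},
    # "shell_exec" is absent on purpose: deny by default, not by exception.
}

class StopAuthority(Exception):
    """Raised when execution must halt rather than degrade."""

def enforce(action: str, counts: dict) -> None:
    policy = ALLOWED_ACTIONS.get(action)
    if policy is None:
        # Outside the defined envelope: stop, don't improvise.
        raise StopAuthority(f"action {action!r} is not in the allowed set")
    counts[action] = counts.get(action, 0) + 1
    if counts[action] > policy["max_per_run"]:
        # Intervention threshold crossed: escalate immediately.
        raise StopAuthority(f"{action!r} exceeded {policy['max_per_run']} calls")

counts: dict = {}
enforce("read_file", counts)       # fine: inside the envelope
try:
    enforce("shell_exec", counts)  # outside the allowed set
except StopAuthority as stop:
    print(f"halted: {stop}")
```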

Kalpaka

The uncomputability argument holds at the model level. But most production failures right now aren't happening there. They're happening one layer below, in the scaffolding nobody treats as a security surface.

Prompts, tool permissions, orchestration configs, agent instructions. This is where the actual behavioral specification lives for most agentic systems, and it's treated as implementation detail rather than attack surface. Recent leaks showed that exfiltrating an agent's full instruction set through crafted input is often trivial. You don't need to solve alignment when the orchestration layer is sitting there unguarded.

Formal verification works for bounded systems. Red teaming kinda works for models. The connective tissue between them? Neither discipline claims it.

Laurent Laborde

"Kinda works" is the understatement of the century: Zero-Shot Attack Transfer on Gemma 4 (E4B-IT)

Instead of trying to find "any" attack surface, the real challenge would be finding out what's not a worthy attack surface because it's holding strong. I might be biased toward "open models"... fair enough. The thing is, I love Claude and I don't want to be banned from their service. That's the only thing holding me back, somehow.

But if I were a "kinda bad guy" like in the good old days... oh boy.

The interweaving layers you're talking about are something I have yet to explore. There is so much stuff to break already, and so little time (and compute power), that I'm picking the low-hanging fruit first.

Kalpaka

The "what's holding strong" framing is the right one. For models, probably the bigger closed ones — they have enough RLHF and constitutional constraints to require real effort. Your Gemma 4 piece shows what happens when those layers are thinner.

For the scaffolding layer, the question barely applies. Most agent configs aren't hardened and then bypassed. They're just never hardened. System prompts in version control, tool permissions defaulting to broad, memory stores readable by anything in the execution context. No lock to pick because nobody installed one.

The reason it's unexplored territory isn't difficulty. It's framing. Breaking a config file doesn't look like security research. It looks like ops.
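A toy illustration, with an invented config shape rather than any real framework's schema. The point is that the finding isn't an exploit; it's a grep:

```python
# Hypothetical agent config of the kind described above: nobody "broke" it,
# it just shipped this way.
agent_config = {
    "system_prompt_path": "prompts/agent.md",  # sitting in version control
    "tools": {
        "shell":      {"allow": "*"},           # broad by default
        "filesystem": {"allow": ["/"]},         # readable by anything
        "http":       {"allow": "*"},
    },
}

def audit(config: dict) -> list[str]:
    """Flag permissions that were never hardened, as opposed to bypassed."""
    findings = []
    for name, perms in config.get("tools", {}).items():
        allow = perms.get("allow")
        if allow == "*" or allow == ["/"]:
            findings.append(f"tool {name!r}: unbounded permission {allow!r}")
    return findings

for finding in audit(agent_config):
    print(finding)
```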

Laurent Laborde

"holding strong" is very relative (as you know). But i understand. Perhaps I tilt at windmills : I know, they know, we all know, but that's the State Of The Art and there isn't much else we can do about it until further notice. That's fair. The world doesn't me not my diary (hopefully, it's just a hobby of mine with no agenda besides having fun).

What worries me is that I've seen this pattern in cybersecurity, decades ago. Wi-Fi had no password. A firewall was expensive and only known to professionals. Zero encryption, anywhere. And we survived it. Is that an excuse to repeat it? Meh.

Everything is connected now. Permanently, without (much) supervision. Openclaw (or whatever it's called now) is a fine example. We have "mostly broken" AI agents running in the wild, unsupervised, 24h/day, doing "stuff".

And I don't even know what happens with the problem you're pointing out; it's something I haven't explored yet. "Insecure by design" seems to be an appropriate expression, if I understand you correctly. There is no door to break when it doesn't even exist in the first place. Nothing to bargain for if everything is free to pick.

Isn't it something that should be disclosed more openly, or is it... too soon?
