This is a follow-up to The Responsible Disclosure Problem in AI Safety Research
The 3 laws of robotics
A robot may not injure a human ...
the asimov framing is right in one specific way - the laws were written for a single agent acting alone. the multi-agent case breaks them faster than any single-robot scenario asimov imagined. when a coordinator spawns sub-agents that each make independent decisions, law zero applies to every node in the chain simultaneously.
Which makes it even less verifiable: it expands the domain of operation 😅
right - and it compounds at each hop. you can audit a single agent's decision. you can't audit a decision made three sub-agents deep with no trace of which node made which call. the chain is the problem, not any individual node.
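The missing audit trail described above can be made concrete. This is a minimal sketch (all names hypothetical, not from any real framework): a coordinator passes one shared trace object down the chain, so every sub-agent decision stays attributable to the node that made it.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Trace:
    # Ordered record of (node_id, decision) pairs, carried down the chain.
    entries: list = field(default_factory=list)

    def record(self, node_id: str, decision: str) -> None:
        self.entries.append((node_id, decision))


class Agent:
    def __init__(self, name: str):
        # Unique node id so the chain is attributable after the fact.
        self.node_id = f"{name}-{uuid.uuid4().hex[:8]}"

    def act(self, task: str, trace: Trace, depth: int = 0) -> Trace:
        decision = f"handled '{task}' at depth {depth}"
        trace.record(self.node_id, decision)
        # A coordinator spawning a sub-agent passes the SAME trace object,
        # so a decision three sub-agents deep is still auditable.
        if depth < 2:
            Agent(f"sub{depth}").act(task, trace, depth + 1)
        return trace


trace = Agent("coordinator").act("summarize logs", Trace())
for node, decision in trace.entries:
    print(node, "->", decision)
```

Without something like this threaded through every hop, the chain itself is unauditable, which is exactly the point made above.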
The uncomputability point holds at the model level.
But operationally, systems are not governed at the model level.
They are governed at the boundary of execution.
Decision Boundary:
If a system is operating in an unbounded domain, then safety cannot be computed globally.
It has to be enforced locally.
That’s where most systems break.
They attempt to solve for “all possible harm” instead of defining:
- what actions are allowed
- under what conditions
- and when execution must stop
Intervention Threshold:
When a system moves outside its defined operating constraints, escalation should trigger immediately.
Stop Authority:
If the system cannot be bounded or verified within a context, it should not continue operating in that context.
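The three mechanisms above (decision boundary, intervention threshold, stop authority) can be sketched as a local gate at the point of execution. Everything here is a hypothetical illustration, not a real API: an allowlist of actions, each with a condition check, where anything outside the boundary escalates immediately instead of proceeding.

```python
# Decision boundary: an explicit allowlist of actions, each with a local
# condition check, enforced at execution time rather than computed globally.
ALLOWED_ACTIONS = {
    "read_file":  lambda ctx: ctx.get("path", "").startswith("/sandbox/"),
    "send_email": lambda ctx: ctx.get("recipient", "").endswith("@example.com"),
}


class EscalationRequired(Exception):
    """Intervention threshold: outside defined constraints -> escalate now."""


def execute(action: str, ctx: dict) -> str:
    # Stop authority: an action that cannot be bounded here does not run here.
    if action not in ALLOWED_ACTIONS:
        raise EscalationRequired(f"action not in allowlist: {action}")
    # Condition check at the boundary of execution.
    if not ALLOWED_ACTIONS[action](ctx):
        raise EscalationRequired(f"constraint violated for: {action}")
    return f"executed {action}"


print(execute("read_file", {"path": "/sandbox/notes.txt"}))
```

The design choice is the one argued for above: safety is not proven over an open domain; it is reduced to a computable check at each execution boundary, with escalation as the default for everything else.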
What we see instead is:
open domain → probabilistic behavior → post-hoc evaluation
That creates the appearance of safety without control.
So the issue isn’t just that safety is uncomputable.
It’s that systems are not being reduced to computable boundaries at execution-time.
Without that, “AI safety” defaults to risk observation, not enforcement.
The uncomputability argument holds at the model level. But most production failures right now aren't happening there. They're happening one layer below, in the scaffolding nobody treats as a security surface.
Prompts, tool permissions, orchestration configs, agent instructions. This is where the actual behavioral specification lives for most agentic systems, and it's treated as implementation detail rather than attack surface. Recent leaks showed that exfiltrating an agent's full instruction set through crafted input is often trivial. You don't need to solve alignment when the orchestration layer is sitting there unguarded.
Formal verification works for bounded systems. Red teaming kinda works for models. The connective tissue between them? Neither discipline claims it.
"Kinda works" is the understatement of the century: Zero-Shot Attack Transfer on Gemma 4 (E4B-IT)
Instead of trying to find "any" attack surface, the real challenge would be to find out what's *not* a worthy attack surface because it's holding strong. I might be biased toward "open models"... fair enough. The thing is... I love Claude and I don't want to be banned from their service. That's the only thing that's holding me back somehow.
But if I was a "kinda bad guy" like in the good old days. Oh boy...
The interweaving layers you're talking about are something I have yet to explore. There is so much stuff to break already, and so little time (and compute power), that I'm picking the low-hanging fruit first.
The "what's holding strong" framing is the right one. For models, probably the bigger closed ones — they have enough RLHF and constitutional constraints to require real effort. Your Gemma 4 piece shows what happens when those layers are thinner.
For the scaffolding layer, the question barely applies. Most agent configs aren't hardened and then bypassed. They're just never hardened. System prompts in version control, tool permissions defaulting to broad, memory stores readable by anything in the execution context. No lock to pick because nobody installed one.
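The "never hardened" defaults listed above are exactly the kind of thing a trivial lint pass could catch. A minimal sketch, assuming a hypothetical agent config represented as a plain dict (none of these keys come from a real framework):

```python
# Hypothetical audit pass over an agent config: flags the unhardened
# defaults described above (broad tool permissions, world-readable memory
# store, system prompt checked into version control).
def audit_agent_config(config: dict) -> list:
    findings = []
    if config.get("tool_permissions") == "*":
        findings.append("tool permissions default to broad ('*')")
    if config.get("memory_store", {}).get("readable_by") == "any":
        findings.append("memory store readable by anything in the execution context")
    if config.get("system_prompt_in_vcs", False):
        findings.append("system prompt committed to version control")
    return findings


cfg = {
    "tool_permissions": "*",
    "memory_store": {"readable_by": "any"},
    "system_prompt_in_vcs": True,
}
for finding in audit_agent_config(cfg):
    print("FINDING:", finding)
```

That this reads more like an ops checklist than an exploit is the point: there is no lock to pick, so "breaking" the config layer is just noticing what was never secured.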
The reason it's unexplored territory isn't difficulty. It's framing. Breaking a config file doesn't look like security research. It looks like ops.
"Holding strong" is very relative (as you know). But I understand. Perhaps I'm tilting at windmills: I know, they know, we all know, but that's the state of the art and there isn't much else we can do about it until further notice. That's fair. The world doesn't need me or my diary (hopefully; it's just a hobby of mine with no agenda besides having fun).
What worries me is that I've seen this pattern in cybersecurity, decades ago. Wi-Fi had no password. A firewall was expensive and known only to professionals. Zero encryption, anywhere. And we survived it. Is that an excuse to repeat it? Meh.
Everything is connected now, permanently, without (much) supervision. Openclaw (or whatever it's called now) is a fine example: we have "mostly broken" AI agents running in the wild, unsupervised, 24h/day, doing "stuff".
And I don't even know what happens at the layer you're pointing out; it's something I haven't explored yet. "Insecure by design" seems like an appropriate expression, if I understand you correctly. There is no door to break when it doesn't even exist in the first place. Nothing to bargain for if everything is free to pick.
Isn't it something that should be disclosed more openly, or is it... too soon?