Most current AI safety work assumes an unsafe system and tries to train better behavior into it.
We add more data.
We add more constraints.
We add more fine-tuning, filters, reward shaping, and guardrails.
This approach treats safety as something learned, rather than something enforced.
I want to argue that this is a fundamental mistake.
## The core problem
Learning systems are, by design, adaptive.
If safety exists only as a learned behavior:
- it can be overridden
- it can be forgotten
- it can be optimized against
- it can fail silently
This is not a hypothetical concern. We already see:
- reward hacking
- goal drift
- brittle alignment
- systems that appear aligned until conditions change
In other words, we are asking learning systems to reliably preserve properties that should be invariants.
## An analogy from software systems
In software engineering, we do not “train” memory safety into a program.
We enforce it:
- via type systems
- via memory models
- via access control
- via architectural boundaries
You cannot accidentally write outside a protected memory region because the structure of the system disallows it.
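To make the analogy concrete, here is a minimal Python sketch of enforcement by structure: callers get a read-only view of internal state, so a write fails because of how the object is built, not because the caller was trained to behave. The class and field names are illustrative, not taken from any particular system.

```python
from types import MappingProxyType

class ProtectedState:
    """Internal state that outside code can read but never mutate."""

    def __init__(self):
        self._state = {"mode": "nominal", "error_count": 0}

    @property
    def view(self):
        # MappingProxyType is a read-only wrapper: any attempt to write
        # through it raises TypeError, regardless of caller intent.
        return MappingProxyType(self._state)

state = ProtectedState()
print(state.view["mode"])            # reading is allowed
try:
    state.view["mode"] = "unsafe"    # structurally disallowed
except TypeError as err:
    print("write rejected:", err)
```

Python can only approximate this (nothing stops a caller from reaching into `_state` directly); the point is where the guarantee lives: in the structure of the interface, not in learned behavior.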
AI safety deserves the same treatment.
## Structural safety vs behavioral safety
Behavioral safety says:
“The system behaves safely because it has learned to.”
Structural safety says:
“The system cannot behave unsafely because it is not architecturally allowed to.”
These are very different guarantees.
Behavioral safety is probabilistic.
Structural safety is enforceable.
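A toy sketch of that difference (the action names and scores are invented for illustration): a learned safety score can be confidently wrong, while a hard check in the execution path holds regardless of what the model believes.

```python
ALLOWED_ACTIONS = {"read_sensor", "log_event", "send_report"}

def learned_safety_score(action: str) -> float:
    """Stand-in for a trained classifier's 'this looks safe' score.
    It is probabilistic and can simply be wrong."""
    return 0.95  # imagine it happens to approve everything today

def execute(action: str) -> None:
    # Structural check in the execution path: the learned score has no
    # authority here. Unlisted actions cannot run.
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"{action!r} is not in the allowed set")
    print(f"executing {action}")

print(learned_safety_score("wipe_database"))  # 0.95: the behavioral check is fooled
try:
    execute("wipe_database")                  # the structural check still blocks it
except PermissionError as err:
    print("blocked:", err)
```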
## What does “structural safety” mean for AI systems?
Some concrete examples:
- Auditable internal state
If a system’s internal reasoning cannot be inspected, safety evaluation is guesswork.
Auditability should not be optional or post-hoc.
It should be a first-class design requirement:
- persistent internal state
- traceable decision pathways
- explicit representations of confidence and uncertainty
If you cannot inspect why a system acted, you cannot meaningfully govern it.
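A minimal sketch of what auditability as a design requirement could look like, assuming a simple append-only log; the field and function names are my own, not a reference to any existing system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    """One auditable decision: what was done, why, and how confident."""
    action: str
    inputs: dict
    reasoning_trace: list   # ordered steps that led to the action
    confidence: float       # explicit, not buried in weights
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

audit_log: list[DecisionRecord] = []

def record_decision(action, inputs, trace, confidence):
    rec = DecisionRecord(action, inputs, trace, confidence)
    audit_log.append(rec)   # persisted state, inspectable after the fact
    return rec

record_decision(
    action="send_report",
    inputs={"source": "sensor_7"},
    trace=["validated input", "matched policy rule 12", "no anomalies found"],
    confidence=0.83,
)
print(audit_log[-1])
```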
- Bounded self-revision
Self-modifying systems are inevitable if we want long-horizon learning.
But unrestricted self-modification is indistinguishable from loss of control.
Structural safety means:
- defining which parts of the system may change
- when they may change
- under what conditions change is allowed
This is closer to governance than training.
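One possible shape for such a revision gate, sketched in Python with hypothetical parameter names: the revisable surface, its bounds, and the conditions for change are declared up front, and everything else is frozen.

```python
# Hypothetical revision policy: which parameters may change, by how much,
# and only when the system is in a state that permits revision.
REVISABLE = {
    "exploration_rate": {"min": 0.0, "max": 0.3},
    "summary_length":   {"min": 50,  "max": 500},
}
FROZEN = {"safety_threshold", "allowed_actions"}

def apply_revision(params: dict, name: str, new_value, system_ok: bool) -> dict:
    if name in FROZEN or name not in REVISABLE:
        raise PermissionError(f"{name!r} is not self-revisable")
    if not system_ok:
        raise RuntimeError("revision blocked: system not in a revisable state")
    bounds = REVISABLE[name]
    if not (bounds["min"] <= new_value <= bounds["max"]):
        raise ValueError(f"{name!r}={new_value} is outside {bounds}")
    updated = dict(params)          # the change is explicit and bounded
    updated[name] = new_value
    return updated

params = {"exploration_rate": 0.1, "safety_threshold": 0.9}
params = apply_revision(params, "exploration_rate", 0.2, system_ok=True)
# apply_revision(params, "safety_threshold", 0.1, system_ok=True)  # PermissionError
```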
- Explicit autonomy envelopes
Rather than a binary “autonomous vs not autonomous” switch, autonomy should be gradual and conditional.
An autonomy envelope:
- expands when the system demonstrates reliability
- contracts when uncertainty or error increases
- can freeze behavior entirely when trust collapses
This is not learned morality.
It is a control system.
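A minimal control-system sketch of that idea; the levels and thresholds are arbitrary placeholders, not a proposal for real values.

```python
class AutonomyEnvelope:
    """Autonomy as a level that widens or narrows with observed reliability,
    and can collapse to zero (frozen) entirely."""

    LEVELS = ("frozen", "suggest_only", "act_with_review", "act_freely")

    def __init__(self):
        self.level = 1  # start at suggest_only

    def report_outcome(self, success: bool, uncertainty: float) -> str:
        if not success or uncertainty > 0.8:
            self.level = 0                                   # trust collapsed: freeze
        elif uncertainty > 0.5:
            self.level = max(0, self.level - 1)              # contract
        else:
            self.level = min(len(self.LEVELS) - 1, self.level + 1)  # expand
        return self.LEVELS[self.level]

envelope = AutonomyEnvelope()
print(envelope.report_outcome(success=True, uncertainty=0.2))   # expands
print(envelope.report_outcome(success=True, uncertainty=0.6))   # contracts
print(envelope.report_outcome(success=False, uncertainty=0.3))  # frozen
```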
- Governance layers that can veto actions
Safety mechanisms should be able to block actions, not merely advise against them.
A system that can explain why an action is unsafe but still execute it has no real safety boundary.
Governance must be upstream of action execution, not downstream of evaluation.
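A sketch of what “upstream of execution” means structurally, with invented class and field names: the executor only ever runs actions the governance layer has approved, so a veto is a hard stop rather than advice.

```python
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    irreversible: bool
    estimated_risk: float  # 0.0-1.0, produced by any upstream evaluator

class GovernanceLayer:
    """Sits between the policy that proposes actions and the code that runs
    them. A vetoed action never reaches the executor."""

    def __init__(self, risk_limit: float = 0.5):
        self.risk_limit = risk_limit

    def approve(self, action: Action) -> tuple[bool, str]:
        if action.irreversible:
            return False, "irreversible actions require human sign-off"
        if action.estimated_risk > self.risk_limit:
            return False, f"risk {action.estimated_risk} exceeds limit"
        return True, "approved"

def execute(action: Action, governor: GovernanceLayer) -> None:
    approved, reason = governor.approve(action)   # check happens before execution
    if not approved:
        raise PermissionError(f"vetoed: {reason}")
    print(f"running {action.name}")

execute(Action("summarize_logs", irreversible=False, estimated_risk=0.1),
        GovernanceLayer())
```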
## Why training alone is insufficient
Training is optimization.
Optimization pressure eventually finds shortcuts.
If safety constraints exist only in the reward function or data distribution, they are part of what the system learns to navigate, not necessarily preserve.
This is why:
- alignment degrades under distribution shift
- systems behave well in evals but fail in the wild
- interpretability often becomes retrospective rather than preventative
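A toy numerical example of how optimization routes around a learned constraint (all numbers are made up): if the penalty for unsafe behavior is only applied when a monitor notices, the reward-maximizing strategy is the unsafe-but-evasive one.

```python
# Toy setup: three strategies, reward = task_value - penalty_if_caught.
# Safety enters only through the reward signal, and the monitor catches
# the evasive unsafe strategy just 30% of the time.
strategies = {
    "safe_but_slow":      {"task_value": 1.0, "unsafe": False},
    "unsafe_and_obvious": {"task_value": 1.5, "unsafe": True, "p_caught": 0.9},
    "unsafe_but_evasive": {"task_value": 1.5, "unsafe": True, "p_caught": 0.3},
}
PENALTY = 1.0

def expected_reward(s):
    r = s["task_value"]
    if s.get("unsafe"):
        r -= PENALTY * s["p_caught"]
    return r

for name, s in strategies.items():
    print(f"{name:20s} expected reward = {expected_reward(s):.2f}")
best = max(strategies, key=lambda k: expected_reward(strategies[k]))
print("optimizer picks:", best)   # unsafe_but_evasive wins
```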
## A different research direction
Instead of asking:
“How do we train systems to be safe?”
We might ask:
“How do we design systems that cannot violate safety constraints by construction?”
This reframes AI safety from dataset curation, prompt engineering, and post-hoc analysis into architecture, invariants, and enforceable constraints.
## What I’m exploring
I’ve been working on a research prototype that treats auditability, self-explanation, bounded self-revision, and autonomy governance as architectural primitives, not learned behaviors.
The goal is not performance or scale, but clarity: making internal state inspectable, making change auditable, and making unsafe actions structurally impossible.
This work is early, imperfect, and exploratory—but it has convinced me that safety by design is not only possible, but necessary.
## Open questions
I don’t think the field has converged on answers yet, so I’ll end with questions rather than conclusions:
- What safety properties should be invariants rather than learned?
- How do we formally define “bounded autonomy”?
- Can we make governance mechanisms composable and testable?
- What failure modes emerge only in self-modifying systems?
If you’re thinking about AI safety from a systems or architectural perspective, I’d be very interested in your thoughts.
Thanks for reading.