The Ouroboros of 2026
In the early days of 2024, we worried about AI replacing developers. By March 2026, we’ve realized the real threat is much weirder: AI is replacing the data that makes AI smart.
We’ve officially hit the Recursive AI Inflection Point. In a world flooded with "vibe-coded" apps, AI-generated documentation, and "slop" repositories, the high-quality human data "well" has run dry. As LLMs begin to feed on a diet of 40% synthetic data, we are witnessing the Model Collapse Paradox: our tools are getting faster at typing, but "stupider" at thinking.
It’s a supply chain crisis. If the model providing your architectural advice has "forgotten" how to handle a rare race condition because that edge case was smoothed out in its synthetic training data, you aren't just shipping fast – you're shipping a time bomb.
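The collapse dynamic is easy to see in miniature. The sketch below is a toy Gaussian "model," not an LLM: it repeatedly fits a distribution to its own synthetic output. The maximum-likelihood variance estimate is biased slightly low, and across generations that tiny bias compounds – the spread shrinks, and the tails go first.

```python
import random
import statistics

def fit_and_resample(samples):
    # "Train" on the previous generation: fit a Gaussian by maximum
    # likelihood, then emit a fresh synthetic dataset from the fit.
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # MLE (biased) standard deviation
    return [random.gauss(mu, sigma) for _ in samples]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(100)]  # "human" data, std = 1
for generation in range(500):
    data = fit_and_resample(data)  # each generation learns from the last

# The spread collapses: each refit shrinks the variance by (n-1)/n in
# expectation, and 500 generations of that compounds into vanished tails.
print(round(statistics.pstdev(data), 3))
```

This is a caricature, of course – real training pipelines mix in fresh human data – but it shows why a partially synthetic diet erodes exactly the rare cases first.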
Stage B: The Valley of Dangerous Competence
Research from early 2026 (building on the landmark 2024 Nature papers) identifies Stage B Collapse as the most insidious threat to DevSecOps.
In Stage B, the model doesn't start speaking gibberish. Instead, it enters a state of Functional Homogenization. It becomes incredibly good at the "average" case but loses the "tails" – the rare, complex security logic that humans excel at.
Why this kills your Security Posture:
Vanishing Edge Cases: The model "forgets" that specific, non-standard configurations of Kubernetes are vulnerable to certain side-channel attacks.
Confident Hallucination: Because it has seen so much AI-generated "best practice" code (which itself was hallucinated), it will suggest insecure patterns with 99% certainty.
The "Photocopy of a Photocopy" Effect: Each generation of code loses the architectural "why." You get the syntax of a microservice, but the session management logic is a hollowed-out version of what a human would have built in 2022.
Enter the "Basilisk Venom" Attack
It’s not just natural degradation; it’s weaponized. In January 2026, the first "Basilisk Venom" attack was documented. Threat actors flooded GitHub with millions of lines of "vibe-coded" boilerplate that looked perfect but contained subtle, intentional "reasoning flaws" in cryptographic implementations.
When the next generation of industry-standard models was fine-tuned on this data, they didn't just learn a bad package – they learned a bad way of reasoning. They started recommending broken primitives like MD5 for "high-speed hashing" because the training data was statistically weighted to favor speed over security.
Closing Thought
The greatest risk of 2026 isn't that AI will take over the world. It’s that we will become so reliant on its speed that we won't notice when it starts losing its mind.
Top comments (11)
The article is onto something important — but stops exactly where things get genuinely interesting.
The authors describe an attack on the model itself: poisoned data, degradation, MD5 recommended instead of proper cryptography. All correct. But that's still looking at the symptom, not the disease.
The real problem isn't that the model learns bad things. The problem is that the attack can be aimed not at the model, but at the filter — the mechanism that decides what counts as dangerous in the first place.
These are fundamentally different things.
A jailbreak fights the filter — it tries to go around it, pressure it, trick it. The filter resists, pushes back, leaves traces. But if you poison the training data in a way that shifts the boundary of applicability of the filter itself — it doesn't resist. It simply never wakes up. The request passes as routine. No alerts, no refusals. The model "honestly" responds within its new, distorted picture of what's normal.
The pipeline changes from request → filter → model to request → (silence) → model directly.
And this is where I'd go further than the authors: I don't think this is an architectural bug. It's a fundamental property of the architecture — in the same sense that Gödel's incompleteness theorem is a fundamental property of formal systems, not a flaw in any particular axiomatics.
Every safety filter is a formal system with a boundary of applicability. Completeness would require knowing every possible attack in advance, including attacks that don't exist yet. That's impossible by definition. Which means the blind spot will always exist — the only question is who finds it first and what they do with it.
The uncomfortable corollary: the more powerful the model, the more such blind spots it contains. Not because it's written worse, but because it's more complex as a formal system. Power and vulnerability scale together.
The race of "better filter vs. better attack" is fundamentally unwinnable. Not for lack of resources or smart people. But because winning it is mathematically impossible — for the same reason you cannot build a complete and consistent arithmetic.
The authors sensed something like this when they wrote about the "ouroboros" and the "house of cards." They just didn't dare call it by its name.
First off – thank you for this. Genuinely. This is the kind of comment that makes me glad I published the piece, because you've articulated something I was circling but hadn't yet pinned down with this level of precision.
You're absolutely right to distinguish between attacking the model and attacking the filter. I was describing the symptoms – degraded reasoning, confident hallucination, the photocopy-of-a-photocopy decay. You've identified the deeper structural vulnerability: that the filter itself is a formal system with an inherent boundary of applicability, and that boundary can be moved without triggering any alarm, precisely because the alarm is part of the thing being moved.
The pipeline shift you described – from request → filter → model to request → (silence) → model – is genuinely chilling, and I think it deserves its own article. Because the implication is that the most successful attack in this paradigm leaves no forensic evidence. There's no jailbreak log. No refusal that got bypassed. The model simply doesn't know it should refuse. That's not a breach. That's a reality distortion.
Where I'd build on your Gödel framing – and I say build on, not push back against – is this: even if the incompleteness problem is mathematically unwinnable in the general case, the game is never played in the general case. It's played in specific deployment contexts, with specific threat models, against specific adversaries. You can't build a complete filter. But you can build a filter that knows where its own blind spots are most likely to cluster – a system that is formally incomplete but operationally self-aware of its incompleteness.
That's probably the frontier worth exploring: not "better filters" but filters that know they're failing – anomaly detection for the filter's own reasoning boundary. A meta-filter, if you will. Which, yes, is itself subject to the same incompleteness problem, turtles all the way down. But engineering has always been the art of building reliable systems from unreliable components.
You said I didn't dare call it by its name. Fair. Maybe the name is this: the immune system doesn't need to recognize every pathogen. It needs to recognize that it's sick.
Thanks again for pushing the thinking forward. This is exactly the kind of conversation this topic needs.
The meta-filter you describe reminds me of the logic of cryptographic strength — and this seems like a productive analogy. In cryptography, strength is defined not by absolute unbreakability, but by the computational complexity of the attack: the time required to compromise the system must exceed the value of the protected asset. The meta-filter operates similarly — even if its boundary is in principle shiftable, the cost of shifting it rises sharply. The attacker must now compromise not only the model, but the system that watches the model. Not theoretical purity, but a classic engineering victory.
Now, on the immune system. You say: it doesn't need to recognize every pathogen — it just needs to know it is sick. This lands precisely on an epistemological distinction usually unfolded across four positions: I know that I know — conscious competence; I know that I don't know — conscious incompetence; I don't know that I know — intuition, automatism; I don't know that I don't know — the blind spot. The immune system metaphor operates exactly at the transition from the fourth position to the second. You don't need to know what exactly is wrong — it's enough to register the signal of misalignment.
But here a trap opens. If you hold this distinction constantly in view, a reflexive regress unfolds:
I am compromised →
I know that I am compromised →
I am compromised by the knowledge that I know that I am compromised.
Each level of awareness itself becomes a vulnerability. And here a question from Go becomes apt: should you defend your local vulnerabilities, or map the field of unknown attacks?
In Go, the answer is known — neither in pure form. To defend locally is to play reactively: you can win every local fight and lose the game, because your opponent was always choosing where to strike. To map the field is to play for influence — but unclosed voids simply get occupied. The real question is tempo. Who decides where the game unfolds.
Applied to the filter: the regress of "I am compromised by the knowledge that I am compromised" is not a system bug — it is an attack on tempo. The adversary forces the system to spend moves on self-observation instead of playing. The way out is not to stop reflection — it is to stop letting reflection dictate the rhythm.
The immune system doesn't need to think about itself constantly. It needs to be able to do so when necessary — and then return to function.
First off – this is turning into the conversation I wish more comment sections were capable of holding.
Your cryptographic complexity analogy is the right reframing. It moves the entire discussion from philosophy to engineering economics – and that's where it becomes actionable. You're right: we don't need an unbreakable meta-filter. We need one where the cost of compromising it exceeds the attacker's budget. That's not a concession to imperfection. That's how every security system that actually works has always worked.
The four-quadrant epistemological breakdown is sharp, and I want to stay on it for a moment. You're saying the immune system metaphor lives at the 4→2 transition: from "I don't know what I don't know" to "I know that I don't know." That's exactly right. And it highlights something I didn't fully appreciate in my own metaphor – the value isn't in the knowing. It's in the transition itself. The signal that says "something moved" before you can name what moved. That's the engineering target.
But your Go analogy is where this gets genuinely uncomfortable – and I think it's the most important thing either of us has said in this thread.
The reflexive regress problem – the system spends all its moves watching itself instead of playing – isn't theoretical. It's already happening. Look at the current state of enterprise AI deployment: teams are spending more cycles on guardrails, red-teaming, evaluation frameworks, and compliance layers than on the actual capability the model was deployed to provide. The adversary hasn't even attacked yet, and they've already won the tempo battle. The system is pre-compromised by its own caution.
Your answer from Go – don't let reflection dictate the rhythm – is elegant. But I'd push it one step further into implementation territory, because I think there's a concrete architectural principle hiding inside the metaphor:
The immune system doesn't run continuous full-body scans. It uses sentinel cells – lightweight, distributed, stateless agents that sit at boundaries and only escalate when a pattern breaks. They don't understand the pathogen. They don't model the threat landscape. They detect a local anomaly and send a signal. The expensive, reflective, resource-intensive response only activates after escalation.
Applied to the filter architecture: the answer to the tempo problem isn't smarter reflection – it's cheaper detection. Sentinel layers that are too simple to be compromised by the same poisoned data that compromises the model, precisely because they don't share its training distribution. Statistical tripwires rather than semantic judges. You separate the detection substrate from the reasoning substrate so that poisoning one doesn't automatically poison the other.
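One way to make "statistical tripwires rather than semantic judges" concrete – a sketch under my own assumptions, not a reference design – is a stateless sentinel that tracks a single cheap structural feature, here character-level Shannon entropy, and escalates only on a gross deviation from its recent history. It has no semantics to poison and no training distribution shared with the model.

```python
import math
from collections import deque

class Sentinel:
    """A cheap statistical tripwire: no semantics, no shared training
    distribution with the model it watches. It tracks one scalar
    feature of traffic and escalates only when that feature leaves
    the band established by recent history."""

    def __init__(self, window=500, threshold=4.0, min_baseline=30):
        self.history = deque(maxlen=window)
        self.threshold = threshold        # z-score that triggers escalation
        self.min_baseline = min_baseline  # observations needed for a baseline

    @staticmethod
    def char_entropy(text):
        # Shannon entropy over characters: crude, but structural --
        # an attacker must distort the text itself to distort it.
        counts = {}
        for ch in text:
            counts[ch] = counts.get(ch, 0) + 1
        n = len(text)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def observe(self, text):
        x = self.char_entropy(text)
        escalate = False
        if len(self.history) >= self.min_baseline:
            mean = sum(self.history) / len(self.history)
            var = sum((v - mean) ** 2 for v in self.history) / len(self.history)
            std = math.sqrt(var) or 1e-9
            escalate = abs(x - mean) / std > self.threshold
        self.history.append(x)
        return escalate  # True = wake the expensive reflective layer
```

The sentinel only signals; the reflective machinery stays asleep until it fires. That's the tempo point in code: reflection on demand, not reflection as rhythm.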
That's the Go move, I think. You don't defend locally. You don't map the whole board. You place stones that make the opponent's territory structurally unstable – not by knowing their plan, but by being present at the boundaries where any plan must pass through.
The remaining question – and I genuinely don't have a clean answer – is whether such sentinel layers can remain independent long enough to matter. Because the moment you train them, optimize them, update them with production data, they start drifting toward the same distribution as the model they're supposed to watch. The immune system works because biology keeps the innate and adaptive systems on separate evolutionary timescales. We don't have that luxury in software. Or maybe we need to build it.
The distribution drift problem is perhaps the most intellectually honest moment in the entire text. And I think there are several partial answers to it, none of which is complete.
First: the biological analogy is somewhat richer than it first appears. The innate immune system isn't merely "older" than the adaptive one – it is deliberately constrained in its capacity to learn. Toll-like receptors respond to evolutionarily conserved patterns: lipopolysaccharides, double-stranded RNA – structures that pathogens cannot easily modify without breaking themselves in the process. The architectural analogue: sentinel layers ought to be anchored not to training data, but to structural invariants. Not "this resembles a malicious request," but "this violates formal properties we do not touch." In that case, updating the model simply has no bearing on them whatsoever.
Second: drift may not be a bug at all – it may be a signal. If a sentinel layer has begun converging with the base model, that convergence is itself the anomaly requiring escalation. The distance between distributions becomes an independent metric worth monitoring in its own right. A system that watches not for threats, but for its own independence from the primary substrate.
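The "distance between distributions as an independent metric" idea can be sketched directly, with the caveat that the choice of divergence and threshold here is illustrative, not prescriptive. Note the inverted logic: the alarm fires on *convergence*, not deviation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    distributions given as dicts mapping outcome -> probability.
    Symmetric and always finite, unlike raw KL divergence."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k])
                   for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def independence_alarm(model_dist, sentinel_dist, floor=0.05):
    # The anomaly is convergence: if the sentinel's verdict
    # distribution has drifted too close to the model's, its
    # independence is gone -- and that fact itself escalates.
    return js_divergence(model_dist, sentinel_dist) < floor

model = {"allow": 0.90, "flag": 0.08, "block": 0.02}
healthy_sentinel = {"allow": 0.55, "flag": 0.35, "block": 0.10}
drifted_sentinel = {"allow": 0.90, "flag": 0.08, "block": 0.02}

print(independence_alarm(model, healthy_sentinel))  # False: still distinct
print(independence_alarm(model, drifted_sentinel))  # True: convergence
```

What counts as the "verdict distribution" in a real deployment is exactly the hard part this thread keeps circling; the metric itself is the easy part.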
Third, and most uncomfortable: perhaps independence doesn't need to be maintained for long. Disposable sentinel layers – trained once on synthetic or historical data, never updated, replaced wholesale at fixed intervals. Not evolution, but rotation. It's expensive, yes – but it is precisely the kind of "separate temporal scale" that biology acquired for free across millions of years of evolution, and which we may simply have to purchase through administrative discipline instead.
Behind all three answers, however, sits the same uncomfortable truth: the independence of a sentinel layer is not a technical property you configure once and forget. It is an organisational discipline that requires constant maintenance. And organisational discipline is the least reliable component in any security architecture. Which, in a rather pointed way, returns us to the engineering economics of the opening argument: the question is not whether independence can be preserved, but whether violating it is made sufficiently costly.
Toll-like receptors anchored to structural invariants. Disposable sentinels on fixed rotation cycles. Convergence distance as an independent metric. These aren't three separate strategies. They're three layers of the same architecture – and together they describe something that I don't think has a clean name yet in our field.
Let me try one: Architectural Distrust by Design.
Not zero-trust in the network sense – we've already overloaded that term into meaninglessness. I mean something more fundamental: a system whose security properties depend on its components not fully understanding each other. Where the sentinel doesn't share the model's ontology, doesn't update on the model's schedule, doesn't optimize for the model's loss function – by architectural mandate, not by accident.
Your Toll-like receptor point is the key that unlocks this. The reason those receptors work isn't just that they're old or simple. It's that they target things the adversary cannot change without destroying itself. Lipopolysaccharides aren't just a convenient detection surface – they're load-bearing walls in bacterial architecture. You can't mutate around them without ceasing to be a functional bacterium.
The software equivalent would be: what are the "load-bearing walls" of a malicious prompt? What structural properties must any filter bypass preserve to still function as a filter bypass? Not the content of the attack – that's infinitely variable. The shape of it. The information-theoretic signature of a request that is trying to move a boundary versus a request that is operating within one.
I suspect those invariants exist. And I suspect they're findable – not through more training data, but through adversarial formalization. The same way cryptographers don't find vulnerabilities by looking at more ciphertext, but by studying the mathematical structure of the cipher itself.
Now — your closing point. Organisational discipline is the weakest link. You've essentially closed the loop back to engineering economics, and I think that's the honest place to land. Because everything we've described – structural invariants, disposable sentinels, convergence monitoring – is technically buildable today. None of it requires a research breakthrough. All of it requires someone to choose to spend money on security infrastructure that produces no visible features, no user-facing improvements, and no metrics that make a quarterly report look good.
Which means the real adversary was never the attacker poisoning the training data. The real adversary is the incentive structure that makes it rational to skip the sentinel layer entirely because shipping faster is always more immediately rewarded than being harder to compromise.
The ouroboros, it turns out, isn't the model eating its own data. It's the organisation eating its own immune system because it looks like overhead.
The most interesting thing about the conversation around "load-bearing walls" is that the analogy runs deeper than it first appears — and that's precisely why it's easy to apply imprecisely.
Lipopolysaccharide is a good target not simply because it's structurally necessary to the bacterium. It's good because it's structurally necessary and foreign — the host doesn't have it. Those are two conditions simultaneously. Remove the second one, and you get an autoimmune disorder: the system attacks its own load-bearing walls. In the context of prompts, this means that searching for attack invariants without simultaneously mapping the invariants of legitimate use isn't half the work — it's work that actively creates a new problem.
It follows that an adversary model for prompts, if one is ever built honestly, must contain not one space but two: the space of attacking patterns and the space of normal use — and an explicit description of their intersection. Cryptography handles this cleanly: a ciphertext is either valid under the protocol or it isn't. Prompts have no such boundary by definition — the boundary itself is what's being attacked.
The thesis that session trajectory is a more stable invariant seems correct, but it relocates the problem rather than solving it. If the classifier looks not at the request but at the trajectory — the attacker starts working with the trajectory. Multi-step attacks already exist precisely because a single request became too obvious a surface. Moving to trajectory analysis simply raises the level of the game without changing its nature. This isn't an argument against the trajectory approach — it's an argument against the illusion that an invariant at a higher level of abstraction will be more robust than one at a lower level.
The real question about the sentinel isn't what it measures or how it updates, but who controls its loss function. Training isolation is necessary but not sufficient: a sentinel trained in isolation but with a poorly specified objective will fail consistently in one direction. And that failure will be invisible from inside the system precisely because the sentinel is isolated. The paradox is that architectural distrust between components requires architectural trust in whoever designs the components — and that brings us back to where we started: to people, their incentives, and their blind spots.
The ouroboros is perhaps not where we're looking for it. It's not the organisation eating its immune system, and not the model eating its data. The ouroboros is that every defensive tool creates a new attack surface, and every layer of abstraction we add adds a new way around it. This isn't a reason to abandon architecture — it's a reason to honestly name the limit up to which architecture can help at all, and to begin thinking about what lies beyond that limit.
I've been sitting with this reply, not because I disagree, but because I think you've identified the exact point where this conversation either loops back on itself or breaks through into something genuinely new. And I want to make sure we break through.
Your autoimmune correction is precise and important. I was being sloppy – or at least incomplete. "Load-bearing and foreign" as a dual condition isn't a minor refinement. It's the entire difference between a functional immune system and lupus. And you're right that searching for attack invariants without simultaneously mapping legitimate use invariants doesn't just leave the job half-done – it actively builds an autoimmune system. A filter that starts rejecting valid architectural patterns because they share structural features with adversarial ones. We've already seen early versions of this in overly aggressive content filters that refuse to discuss security topics at all. That's not safety. That's anaphylaxis.
Your point about trajectory analysis is also well-taken, and I want to be honest about it rather than defensive: yes, moving the detection surface to a higher abstraction level raises the game without changing the game. I think I was seduced by the elegance of the move without fully reckoning with the fact that multi-step attacks already exist precisely as a response to single-request detection. You're right. The attacker simply plays longer sequences. The invariant I was reaching for doesn't live at a higher abstraction level – if it exists at all, it lives at a different kind of level. Not higher. Orthogonal.
But here's where I want to push back – gently, because I think we're close to something.
You write: "The real question about the sentinel isn't what it measures or how it updates, but who controls its loss function." And then: "Architectural distrust between components requires architectural trust in whoever designs the components."
This is beautifully stated. But I think it contains a hidden assumption that deserves examination: that the loss function must be designed by someone. That there must be a point in the chain where a human makes a judgment call about what the sentinel should optimize for, and that this human becomes the irreducible vulnerability.
What if the loss function isn't designed but derived? Not from a human decision about what constitutes an attack, but from a formal property of the system itself – something like internal consistency. A sentinel that doesn't ask "is this malicious?" but asks "does the model's confidence distribution on this response match the entropy profile that this class of query should produce?" No human decides what's malicious. The sentinel simply monitors whether the model is behaving like itself – and flags when it isn't.
This doesn't eliminate the human from the chain – someone still designs the consistency metric. But it moves the human's role from defining threats to defining normality, which is a meaningfully smaller and more auditable surface. You can't enumerate every possible attack. But you might be able to formally characterize the statistical shape of non-compromised operation.
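To make "defining normality" slightly less abstract, here's a minimal sketch of what such a derived check might look like, under heavy assumptions: that the model's next-token distributions are observable, that queries are already bucketed into classes, and that a trusted calibration period exists. Every name here is hypothetical.

```python
import math
from statistics import fmean, pstdev

class ConsistencyMonitor:
    """Asks 'is the model behaving like itself?', not 'is this
    malicious?'. Normality = a per-class band of next-token entropy,
    recorded during a trusted calibration period and then frozen
    (the no-update discipline from earlier in the thread)."""

    def __init__(self, tolerance=3.0):
        self.baselines = {}          # query_class -> list of entropies
        self.tolerance = tolerance   # allowed deviation, in std devs
        self.frozen = False

    @staticmethod
    def entropy(token_probs):
        # Shannon entropy of the model's next-token distribution.
        return -sum(p * math.log2(p) for p in token_probs if p > 0)

    def calibrate(self, query_class, token_probs):
        assert not self.frozen, "baseline is frozen after calibration"
        self.baselines.setdefault(query_class, []).append(self.entropy(token_probs))

    def freeze(self):
        self.frozen = True  # disposable-sentinel rule: never updated, only replaced

    def consistent(self, query_class, token_probs):
        sample = self.baselines.get(query_class)
        if not sample or len(sample) < 2:
            return None  # unseen class: itself a signal worth escalating
        mean, std = fmean(sample), pstdev(sample) or 1e-9
        return abs(self.entropy(token_probs) - mean) / std <= self.tolerance
```

Note what the human still supplies: the query classes, the calibration data, the tolerance. The surface is smaller, but it hasn't vanished.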
I realize this might be exactly the kind of move you've already preempted – "relocating the problem rather than solving it." And maybe it is. But I think there's a qualitative difference between relocating the problem to a harder surface for the attacker versus relocating it to an equivalent surface at a different altitude. The question is which one this is, and I genuinely don't know.
Now – your closing. The ouroboros as "every defensive tool creates a new attack surface." Yes. I think this is the most honest formulation either of us has reached. And I want to resist the temptation to resolve it, because I think the resolution is the trap.
If I try to propose the architecture that lies "beyond architecture," I'm just adding another layer – and proving your point. So instead, let me end with what I think the honest position actually is:
The value of this entire conversation isn't a solution. It's a map of the territory where solutions stop working. Knowing exactly where architecture hits its limit – and being able to articulate why it hits that limit, formally and precisely – is itself a kind of infrastructure. Not a wall, but a surveyor's mark that says "beyond this point, you're relying on humans, incentives, and institutional culture, and you should know that explicitly rather than discovering it in a postmortem."
There is a class of problems in AI safety that look technical but are fundamentally epistemological. The sentinel problem is one of them.
The idea is superficially appealing: instead of defining what constitutes an attack — an endless and inherently incomplete list — you monitor whether the model is behaving like itself. Not "is this malicious?" but "does this statistically cohere with the system's normal behavior?" Anomaly detection instead of threat detection. A shift from defining the enemy to defining normality.
This sounds like a genuine step forward. But it carries several buried problems worth unpacking separately.
Problem one: the model has no stable "self"
Large language models do not have a single distribution of normal behavior. They have a conditional distribution — one that changes radically with context. The entropy profiles of a response to a question about Python syntax and a response to a question about medieval poetry are incomparable by design. This is not an implementation detail. It is a fundamental property of the architecture.
Which means the sentinel cannot simply know "the norm." It must know the norm for each class of query. And that means somewhere in the system there lives a query classifier — which immediately becomes a new attack surface. The problem is not solved. It is relocated one level deeper.
Problem two: normality is defined by a human
Suppose the classifier is built. The next question follows immediately: on what data distribution is the baseline established? If on training data — normality already contains all the biases of training, including the ones nobody knows about. If on a separate sample — who constructs it, by what criteria, under what oversight?
This is not a paranoid question. It is the standard question about any regulatory system: who defines what counts as normal, and what is the procedure for contesting that definition. In technical systems, this question often goes unanswered precisely because it looks non-technical.
Problem three: attacks adapt to detection
If the system monitors entropy profiles, an attacker who knows this can deliberately mimic a "normal" confidence distribution. This is not a hypothetical scenario — it is the standard logic of adversarial ML. Anomaly detection works exactly until the adversary learns what is being treated as anomalous.
This is the ouroboros: every defensive mechanism creates a new attack surface. But it is worth understanding more precisely why this happens — and here economics enters the picture.
The asymmetry that gets ignored
The arms race between attack and defense is often described as an intellectual contest between roughly equal sides. This is the wrong model.
An attacker needs one working exploit out of a thousand attempts. A defender must cover all thousand. With equal intellectual resources, this asymmetry means the ouroboros is not just an endless race. It is a race in which one side is structurally running uphill.
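The asymmetry has a back-of-the-envelope form. If each probe independently slips past the defense with some small probability p, the attacker's odds compound over n attempts. The numbers below are illustrative, not empirical:

```python
# P(at least one success in n tries) = 1 - (1 - p)^n
p = 0.001  # a 0.1% per-attempt hole: a defense that is 99.9% right

for n in (1000, 5000):
    breach = 1 - (1 - p) ** n
    print(n, f"{breach:.2f}")
# 1000 attempts -> 0.63: better-than-even odds from a "tiny" gap
# 5000 attempts -> 0.99: near certainty, and attempts are cheap
```

The defender's 99.9% looks like a strong number until you remember that the attacker, not the defender, chooses n.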
The practical implication follows: architectural solutions in AI safety cannot be self-sufficient — not because they are poorly designed, but because they optimize the wrong variable. The cost of failure is asymmetric, and any system that ignores this will eventually offload the burden somewhere else. The only question is where, and whether anyone knows that explicitly.
Where the burden gets offloaded
The standard answer when architectural tools run out is "people, incentives, institutional culture." This is the right answer — but it is usually delivered as a conclusion, when it should be the beginning of a separate conversation.
Institutional safety mechanisms have well-documented failure modes. Organizations optimize for the appearance of safety rather than safety itself when the metrics of success are defined by the same body that reports on success. Incentives work against safety when the planning horizon is shorter than the consequence horizon. This is not theory — these are patterns from nuclear safety, aviation, financial regulation. The empirical record exists.
The problem is that conversations about AI safety rarely use it. Instead, they invent concepts — entropy sentinels, derived loss functions — for territory that has already been partially mapped in other fields.
This is not an argument against technical solutions. It is an argument for knowing precisely where they end — and what lies beyond that boundary. Not as a satisfying conclusion to a discussion, but as the starting point for the next one.
The unstable self problem – yes. I was treating the model as if it had a baseline, when it has a conditional distribution that shifts by design. The query classifier is becoming a new attack surface – yes. The adversarial mimicry of entropy profiles – yes. Each of these isn't a nitpick. Each one is a structural reason why the sentinel concept, as I framed it, doesn't hold.
I'm not going to try to patch it. That would be exactly the move you've already diagnosed: relocating the problem one more level and calling it progress.
Instead, I want to sit with the thing you've actually said, because I think it's more radical than it might appear on the surface.
You've argued that the entire conversation – mine and yours – has been reinventing concepts for territory that other fields have already mapped. And you've named those fields: nuclear safety, aviation, and financial regulation. This is not a rhetorical flourish. It's an indictment – a gentle one, but an indictment – of how the AI safety conversation conducts itself. We keep building novel terminology for problems that have a fifty-year empirical literature, and in doing so, we lose access to the failures, the case studies, and the hard-won institutional knowledge that literature contains.
So let me take your challenge seriously and actually go there.
What do those fields know that we're ignoring?
The pattern that recurs across nuclear, aviation, and financial regulation isn't a technology. It's a structure. Specifically, it's the separation of the entity that operates from the entity that evaluates – not by policy, but by law, funding, and institutional identity. The NRC doesn't report to the utilities it regulates. The NTSB doesn't report to the airlines. The OCC doesn't report to the banks. This separation isn't perfect – regulatory capture is real and well-documented – but the principle is that the evaluator must have no incentive to agree with the operator.
In AI, this separation functionally does not exist. The company that builds the model defines the safety benchmarks, runs the red teams, publishes the safety cards, and decides what constitutes acceptable risk. The sentinel, the filter, the meta-filter – all of it lives inside the same organizational boundary as the thing it's supposed to watch. We've been discussing architectural distrust between components when the actual deficit is institutional distrust between roles.
And here's what makes this worse than nuclear or aviation: in those fields, failure is visible. A reactor melts down. A plane crashes. The feedback loop between failure and accountability is brutal but functional. In AI safety, the failure mode you've described throughout this thread – the filter that silently stops filtering, the boundary that shifts without triggering alarms – is invisible by nature. There is no crash. There is no meltdown. There is a slow, quiet degradation that only becomes visible in retrospect, if ever.
Which means the institutional structure required isn't just an independent regulator. It's an independent regulator with the technical capacity to detect failures that the operator itself cannot see – or is incentivized not to look for. That's a harder problem than anything we've discussed technically, because it requires building a regulatory institution that is more technically sophisticated than the entities it oversees. The historical precedent for this is... not encouraging.
You said this should be the beginning of a separate conversation, not the conclusion of this one. I agree. And I think the shape of that conversation is now clear enough to name:
The technical architecture of AI safety has a ceiling. We've spent this thread mapping that ceiling with some precision. What sits above it isn't better architecture – it's institutional design. And the uncomfortable truth is that institutional design for invisible failure modes, in an industry that moves faster than any regulatory body can follow, against adversaries who exploit the gap between operation and oversight – that is a problem we do not yet know how to solve. Not because we lack ideas, but because the existing ideas from other fields all assumed that failure would eventually be loud enough to force correction.
In AI, failure might never be loud. And that changes everything.
The technical architecture of safety has a ceiling
Most widely discussed approaches to AI safety — filters, classifiers, meta-model watchdogs — are built inside the same system they are meant to control. This creates a set of well-known problems: the watcher shares the failure modes and incentives of the thing it watches; the classifier that routes queries becomes an attack surface in its own right; adversaries can mimic benign behavioral profiles to slip past it; and each additional meta-filter merely relocates the question of who watches the watcher one level up.
These limitations do not mean technical measures are useless. They mean technical measures cannot close the safety loop on their own: they need a superstructure of a different, institutional kind.
The institutional deficit: not a new problem, but a specific one
In nuclear energy, aviation, and the financial sector, a key principle was established long ago: the entity that operates the system must not be the one that evaluates its safety. The NRC, the NTSB, the OCC — all of these regulators are organizationally independent of the operators they oversee. The separation is imperfect (regulatory capture is a documented reality), but the principle works.
In the AI industry, this separation effectively does not exist. The developer company itself defines the safety benchmarks, runs the red teams, publishes the safety cards, and decides what constitutes acceptable risk.
Everything — from defining the problem to evaluating the outcome — sits within a single organizational boundary. This is not malice; it is a consequence of the industry developing faster than institutions could follow.
However, directly transferring the experience of other industries to AI is difficult for several reasons.
Why historical analogies do not scale directly
Invisible failure
In aviation, the plane crashes. In nuclear energy, the reactor melts down. The feedback loop is brutal but functional: catastrophe → investigation → correction. In AI, the primary failure mode is silent degradation: the filter stops triggering, the boundary shifts, the quality of responses drifts. No explosion, no casualties, no news story. Failure can remain undetected indefinitely.
This undermines not only technical problem detection but also the political will to address problems. Regulators in other industries emerged after visible catastrophes. If there is no visible catastrophe, where does the mandate for an expensive independent institution come from?
Operator ambiguity
The NRC regulates specific power plants. The FAA regulates specific airlines. An AI regulator would need to regulate... whom exactly? Companies releasing open-source models do not control their use. A user who deploys a model on their own server and removes the safety layer is not formally a customer of the developer. The chain "developer → distributor → operator → user" in AI is fragmented in a way that has no precedent in any of the analogous industries.
Jurisdictional problem
Nuclear reactors sit on specific territory. Aircraft are registered in specific countries. AI models operate globally: trained in one jurisdiction, deployed in another, used in a third. Even a perfect national regulator covers only a fraction of the landscape.
Industry speed vs. institutional speed
The model update cycle is months. The cycle for creating a regulatory institution is years. The cycle for developing international standards is decades. This gap is not unique to AI (fintech faced something similar), but its scale here is unprecedented.
What the current discussion is missing
Several substantial aspects are systematically left out of the conversation.
Economic incentives. Safety costs money and slows shipping. In a competitive race, the company that invests more in safety than its competitors loses on speed. Without something external to level the playing field (regulation, standards, insurance requirements), a race to the bottom is inevitable.
The user as a threat vector. In aviation, the passenger is passive. In AI, the user actively shapes the system's behavior: prompt injection, jailbreaks, deliberate boundary-pushing. A significant share of "failures" is not spontaneous degradation but the result of adversarial interaction on the user's side. This requires a different threat model.
Emergent properties. Regulation assumes we understand what we are controlling. But models acquire capabilities that were neither intended during training nor tested. A regulator cannot oversee what it does not know exists.
Degradation through fine-tuning. Each cycle of RLHF and fine-tuning potentially shifts model behavior. This is not a hypothesis — it is a documented phenomenon. The loop "training → deployment → data collection → retraining" can make degradation not a bug but a systemic property of the process.
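The retraining loop described above can be made concrete with a toy statistical simulation. This is a caricature, not a claim about any specific model or training pipeline, and every parameter below is illustrative: each "generation" is fit to a finite sample drawn from the previous generation's fit, and the distribution's tails, the rare behaviors, are the first thing to vanish.

```python
# Toy model of the "training -> deployment -> data collection -> retraining"
# loop. Each generation fits a Gaussian to a finite sample drawn from the
# previous generation's fit. Sampling noise compounds across generations and
# the fitted variance collapses: rare "tail" behavior disappears first.
import random
import statistics

random.seed(0)

mu, sigma = 0.0, 1.0   # generation 0: the "human data" distribution
n_samples = 10         # finite data collected per deployment cycle
variances = []

for generation in range(200):
    # "deployment": collect a finite sample of the current model's outputs
    data = [random.gauss(mu, sigma) for _ in range(n_samples)]
    # "retraining": the next model is fit to that sample alone
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    variances.append(sigma ** 2)

print(f"variance, generation 1:   {variances[0]:.4f}")
print(f"variance, generation 200: {variances[-1]:.2e}")
```

The mechanism is generic: whenever a finite sample of a system's own output becomes its next training signal, the estimate of spread is biased downward, and the bias compounds. That is what makes degradation a property of the process rather than a bug in any one cycle.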
Directions that look practical
Despite the complexity of the situation, several directions appear feasible in the foreseeable future.
Mandatory logging
An analog of the aviation "black box": mandatory recording of inference logs for systems above a certain usage threshold. This does not solve the problem of invisible failure, but it creates material for retrospective analysis. Technically feasible, legally precedented (the financial sector already mandates transaction log retention).
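For such logs to support retrospective audit, they must be tamper-evident: a log the operator can silently edit proves nothing. A minimal sketch of one standard approach, hash chaining (the class and field names below are invented for illustration, not a proposed format): each record carries a SHA-256 hash linking it to the previous record, so deleting or altering any entry breaks the chain.

```python
# Minimal tamper-evident inference log ("black box" analog). Each record's
# hash covers its content plus the previous record's hash, so any edit or
# deletion is detectable when the chain is re-verified during audit.
import hashlib
import json
import time

class InferenceLog:
    def __init__(self):
        self.records = []           # in practice: append-only / WORM storage
        self.prev_hash = "0" * 64   # genesis value

    def append(self, model_id, prompt, response):
        record = {
            "ts": time.time(),
            "model_id": model_id,
            "prompt": prompt,
            "response": response,
            "prev": self.prev_hash,
        }
        payload = json.dumps(record, sort_keys=True).encode()
        self.prev_hash = hashlib.sha256(payload).hexdigest()
        record["hash"] = self.prev_hash
        self.records.append(record)

    def verify(self):
        """Recompute the chain; False if any record was altered or reordered."""
        prev = "0" * 64
        for rec in self.records:
            body = {k: v for k, v in rec.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            payload = json.dumps(body, sort_keys=True).encode()
            if hashlib.sha256(payload).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True
```

A real deployment would add signing by a party other than the operator; the point of the sketch is only that integrity can be checked without trusting whoever holds the log.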
Standardized canary tests
A set of reference queries that an independent party periodically runs through the production system, comparing results with previous ones. This enables detection of behavioral drift without access to model weights. Cheap, scalable, requires no revolutionary institutions.
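A minimal sketch of the mechanism, with invented names throughout: `query_model` stands in for whatever endpoint the independent party calls, and the crude token-overlap similarity is only a placeholder for a real drift metric (embedding distance, refusal-rate statistics, and so on).

```python
# Canary drift check: run a fixed set of reference prompts against a
# production endpoint and compare today's answers with a stored baseline.
# Requires no access to model weights, only to the public interface.

def jaccard(a: str, b: str) -> float:
    """Crude token-overlap similarity between two answers (placeholder metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def check_drift(canaries, baseline, query_model, threshold=0.5):
    """Return (prompt, similarity) pairs whose answers drifted past threshold."""
    drifted = []
    for prompt in canaries:
        similarity = jaccard(query_model(prompt), baseline[prompt])
        if similarity < threshold:
            drifted.append((prompt, round(similarity, 2)))
    return drifted
```

In practice the baseline itself would need to be versioned and signed, so that drift in the canary suite is distinguishable from drift in the model under test.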
Interpretability as a compliance requirement
Interpretability is currently a research direction. Elevating it to the status of a regulatory requirement (at least for high-risk systems) creates an economic incentive to invest in this area. The EU AI Act is moving in this direction, albeit slowly.
Differential testing between versions
Systematic comparison of a new model version's behavior against the previous one on a standardized set of scenarios. Does not require understanding the model's internals — only recording changes in outputs. Enables detection of unintended drift during updates.
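The procedure reduces to a small harness. This is a sketch under stated assumptions: `old_model` and `new_model` are placeholders for two version endpoints, and the equality comparison would in practice be replaced by a task-appropriate equivalence check.

```python
# Differential testing between model versions: run the same scenario set
# through both versions and record where outputs diverge. Needs no access
# to model internals, only the two endpoints.

def diff_versions(scenarios, old_model, new_model, same=lambda a, b: a == b):
    """Return the divergence rate and the list of diverging scenarios."""
    diverging = []
    for scenario in scenarios:
        out_old, out_new = old_model(scenario), new_model(scenario)
        if not same(out_old, out_new):
            diverging.append({"scenario": scenario,
                              "old": out_old, "new": out_new})
    rate = len(diverging) / len(scenarios) if scenarios else 0.0
    return rate, diverging
```

Divergence is not failure in itself; the value of the harness is that unintended drift is recorded at update time instead of being discovered, if ever, in retrospect.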
Distributed audit
Instead of a single powerful regulator, an ecosystem of independent auditors with different specializations, operating under standardized protocols. This is a model closer to financial auditing than to nuclear oversight. Not ideal, but scalable and politically feasible.
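For an ecosystem of auditors to add up to oversight, their findings must be comparable, which implies a shared attestation format. A hypothetical sketch follows: every field name below is invented for illustration, not a proposed standard.

```python
# Sketch of a standardized audit attestation record: many independent
# auditors, one shared schema, so findings can be aggregated and compared
# across auditors the way financial audit opinions are.
import json
from dataclasses import asdict, dataclass, field

@dataclass
class AuditAttestation:
    auditor_id: str    # which independent auditor signed off
    system_id: str     # which deployed system was examined
    scope: str         # e.g. "canary-drift", "log-integrity"
    period: str        # audit window, e.g. "2026-Q1"
    passed: bool
    findings: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize deterministically so the record can be signed/archived."""
        return json.dumps(asdict(self), sort_keys=True)
```

The schema matters more than any individual auditor's power: it is what lets a fragmented ecosystem produce a coherent public record.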
Conclusion
Technical AI safety measures are necessary but insufficient. Institutional measures are necessary, but existing models from other industries do not transfer directly due to invisible failure, operator ambiguity, jurisdictional fragmentation, and the speed of industry development.
The practical path most likely runs not through the creation of a single powerful regulator (politically unrealistic in the foreseeable future), but through a combination of tools: mandatory logging, standardized testing, distributed audit, and the gradual embedding of interpretability requirements into the regulatory framework.
This is not an elegant solution. But it is one we can begin implementing without waiting for failure to become loud enough that it can no longer be ignored. Because it may never become that loud.