Disesdi Susanna Cox and Niklas Bunzel's recent paper, "Quantifying the Risk of Transferred Black Box Attacks," marks an important milestone in adversarial risk research. By foregrounding the challenge of transferability and proposing surrogate-model testing guided by Centered Kernel Alignment (CKA), the authors provide organizations with a pragmatic framework for quantifying risk in compliance-driven environments.
Yet the very insight their work surfaces—that adversarial subspaces are high-dimensional, transferable, and computationally intractable to map exhaustively—points to a deeper structural issue. Current neural architectures lack any cryptographic or state-integrity boundary to constrain how those subspaces evolve. Because transformers expose their reasoning surface through embeddings, timing, attention distributions, and cross-call correlations, adversarial behavior does not remain confined to a "subspace." It propagates, recursively widening the attackable manifold.
This dynamic echoes the Ouroboros-style recursive inference collapse seen in systems trained on their own outputs: perturbations introduced during evaluation become part of the model's persistent reasoning space. Without cryptographic anchoring, adversarial signals risk being absorbed into the model's state, expanding vulnerabilities over time.
The surrogate-model approach is essential, but the field must also explore topology-level defenses: mechanisms that constrain how adversarial subspaces can evolve within the architecture itself. Otherwise, we risk mapping adversarial subspaces while the subspaces themselves continue to widen beneath us.

Static adversarial subspace (left) vs. recursive expansion through model reasoning space (right). Current defenses attempt to map the star while the hexagons multiply.
How Adversarial Subspaces Expand: Three Propagation Vectors
To understand why architectural defenses matter, we need to trace how perturbations move through transformer models. Unlike traditional software vulnerabilities that remain localized, adversarial signals in neural architectures propagate through multiple channels simultaneously.

Three vectors of adversarial propagation: attention manipulation (left), embedding cascade (center), timing leakage (right). Each widens the attackable manifold.
1. Attention Weight Manipulation
Attention mechanisms determine which parts of the input the model considers relevant at each layer. An adversarial input can cause attention heads to weight irrelevant tokens or contexts, effectively misdirecting the model's "focus." The angular geometry of this vector is deliberate: attention flows where it shouldn't, crossing boundaries it should respect.
What makes this particularly insidious is persistence. When a model learns to attend incorrectly during one inference, that pattern can influence subsequent queries if the model maintains any form of context or state. The misdirection doesn't stay confined to a single forward pass—it reshapes how the model allocates attention across related inputs.
This becomes especially problematic in production systems where models process sequential queries from the same user or domain. Each misdirected attention pattern slightly alters the model's effective reasoning surface, creating a feedback loop that amplifies the initial perturbation.
2. Embedding Space Cascade
Adversarial perturbations don't just affect individual layers—they reshape the geometric relationships between embeddings as information flows through the network. When attention manipulation occurs in early layers, it distorts which embedding neighborhoods become activated downstream.
Think of embeddings as occupying a high-dimensional space where semantic relationships are encoded as distances and angles. A perturbation that moves one embedding slightly can cause cascading effects as subsequent layers process that shifted representation. The crossed vertical structures in the glyph capture this: layers intersecting, perturbations propagating downward through architectural strata, each transformation amplifying or redirecting the distortion.
Unlike attention manipulation, which is relatively localized, embedding cascades affect the entire geometric structure the model uses for reasoning. This is why adversarial examples often transfer between models: they're not just fooling specific attention patterns, they're exploiting shared geometric properties of how neural networks represent information.
The cascade effect means that even small perturbations at early layers can produce significant behavioral changes by the time information reaches the output layer. And because these geometric distortions are continuous rather than discrete, they're difficult to detect through simple input validation or output checking.
3. Timing Side Channels
Even when adversarial inputs don't successfully manipulate model outputs, they often leave observable traces in execution timing. Different attention patterns take different amounts of time to compute. Certain embedding retrievals are faster than others depending on cache locality. Token generation speeds vary based on the confidence distribution over the vocabulary.
These timing variations reveal which computational paths the model activated—effectively creating a side channel that leaks information about the model's internal reasoning process. The flowing, rhythmic pattern in the glyph represents this: temporal oscillations and cadences that betray what's happening beneath the surface.
For an adversary, timing channels provide reconnaissance data. They can probe the model systematically, observing timing patterns, and use those observations to refine their understanding of the model's decision boundaries. Each query reveals a little more about the internal geometry, making subsequent attacks more precise.
More subtly, timing variations can become part of the adversarial signal itself. If a model learns that certain input patterns correlate with particular timing profiles, and those timing profiles correlate with successful attacks, the model may inadvertently encode timing as a feature. This creates yet another feedback channel through which adversarial patterns can propagate.
Why This Matters for Organizations Deploying AI
The technical dynamics described above translate into concrete organizational risks that existing security frameworks are not equipped to address.
Most compliance and risk management approaches treat AI systems as static artifacts with fixed attack surfaces. You perform surrogate model testing (as Cox and Bunzel recommend), identify high-risk transfer scenarios, and implement controls around those specific vulnerabilities. This is valuable and necessary work—but it assumes the adversarial subspace remains constant.
In reality, production AI systems accumulate adversarial signals over time. If your chatbot learns from user interactions, it may be incorporating manipulated attention patterns into its behavior. If your content moderation system processes adversarial examples, those geometric distortions can shift its decision boundaries. If your recommendation engine is probed repeatedly, timing leakage may be revealing its internal logic to attackers.
Traditional software security relies on cryptographic boundaries: authentication gates, encrypted channels, isolated execution contexts. These mechanisms prevent adversarial signals from propagating freely through the system. But neural architectures lack equivalent boundaries. There's no "authentication" step between transformer layers. There's no cryptographic commitment that embeddings haven't been subtly shifted. There's no isolation preventing adversarial learning from contaminating the model's long-term behavior.
This architectural gap creates a category of risk that audits and penetration testing cannot fully capture. You can test your model against known adversarial examples today, but those tests don't reveal whether the model is quietly widening its own attack surface through interaction with a hostile environment.
For organizations deploying AI in security-critical contexts—content moderation, fraud detection, medical diagnosis, autonomous systems—this represents an underappreciated threat vector. The model may be functioning correctly today while simultaneously absorbing adversarial patterns that will manifest as vulnerabilities tomorrow.
Three Topology-Level Defenses
Addressing these dynamics requires moving beyond attack-surface mapping toward architectural mechanisms that constrain how adversarial signals can propagate and persist.
1. Sealed Telemetry: Cryptographic Isolation of Reasoning Surfaces
The timing side channel problem illustrates a broader vulnerability: neural models leak information about their internal state through observable signals. Attention distributions, embedding queries, token generation speeds, confidence scores—all of these constitute a "reasoning surface" that adversaries can probe.
Sealed telemetry treats model internals as privileged state that must be cryptographically isolated from external observation. Similar to how TLS encrypts transport-layer data to prevent eavesdropping, we need protocols that prevent reconnaissance of model reasoning.
This doesn't mean making models completely opaque (which would harm interpretability and debugging). Rather, it means implementing authenticated channels for accessing internal state. If an application needs to observe attention weights for explainability purposes, that access should be mediated through a cryptographic protocol that prevents unauthorized probing while allowing legitimate monitoring.
The technical implementation would involve securing the interfaces between the model and its execution environment—essentially treating each query to the model as a request that must be authenticated and its responses logged in a tamper-evident way. This prevents adversaries from running silent reconnaissance: every probe leaves a cryptographic trace that can be audited.
For organizations, this means deploying AI systems with the same security discipline applied to API gateways or database connections: explicit access control, audit logging, and rate limiting on queries that might reveal internal model geometry.
2. Dynamic Cryptographic Boundaries Between Layers
The embedding cascade problem stems from the fact that perturbations flow freely through transformer layers without any integrity checking. Information enters at the input layer, transforms through multiple attention and feedforward operations, and emerges at the output—with no verification that intermediate states haven't been subtly corrupted.
Dynamic cryptographic boundaries introduce state commitments at layer boundaries. Before a representation propagates from one layer to the next, the system verifies that it satisfies certain geometric or statistical properties. If a perturbation has distorted embeddings beyond acceptable bounds, the transition is rejected or flagged.
Think of each transformer layer as a security domain with authenticated transitions. Just as microservice architectures use mutual TLS to verify that services are communicating with authorized peers, neural architectures could use cryptographic commitments to verify that layer-to-layer information flow hasn't been adversarially manipulated.
The challenge here is defining what "acceptable bounds" means without destroying the model's ability to process novel inputs. This likely requires statistical techniques: establishing a baseline distribution of embedding geometries during training, then detecting statistical anomalies during inference. Cryptographic commitments ensure that these checks can't be bypassed—the model literally cannot propagate representations that fail integrity verification.
For practical deployment, this could be implemented as a middleware layer that wraps the model, performing integrity checks on intermediate activations without requiring modifications to the model architecture itself. Organizations could apply different security policies to different layers based on risk profiles: tighter constraints on early layers where perturbations have cascading effects, looser constraints on later layers where the model needs flexibility to produce diverse outputs.
3. Reasoning Space Isolation: Preventing Persistent Contamination
The Ouroboros problem—models trained on their own outputs experiencing quality collapse—has a parallel in adversarial learning. If a model processes adversarial inputs and those inputs influence the model's future behavior, the adversarial signal becomes persistent. The model has effectively learned to be vulnerable.
Reasoning space isolation implements mechanisms to prevent adversarial signals from contaminating the model's long-term behavior. The core principle is that each inference call should operate in an ephemeral context that doesn't leak into subsequent calls.
For stateless models (those that don't explicitly learn from production traffic), this means ensuring that internal state doesn't persist across queries in ways that could encode adversarial patterns. Attention caches, embedding indexes, and other optimization structures should be wiped or cryptographically reinitialized between unrelated queries.
For models that do learn from interactions—chatbots, recommendation systems, adaptive interfaces—this requires more sophisticated techniques. One approach is maintaining cryptographic separation between "trusted" training data and "observed" interaction data, with explicit human-in-the-loop review before observed data influences model behavior. Another is implementing anomaly detection on the model's learning signals: if interaction patterns suggest systematic probing or manipulation, those signals are quarantined rather than incorporated.
The practical implication for organizations is treating AI systems more like dynamic security targets than static software. Just as intrusion detection systems monitor network traffic for attack patterns, AI deployments need monitoring systems that watch for adversarial learning signals and prevent them from becoming embedded in model behavior.
Complementing the Mapping Approach
Cox and Bunzel's framework for quantifying transfer risk is essential. Organizations need practical methods to assess whether adversarial examples crafted against surrogate models will transfer to their production systems. CKA-guided surrogate testing provides exactly that: a way to measure representational similarity and predict transferability risk.
But the dynamics explored here suggest that mapping alone is insufficient. As organizations deploy AI systems at scale, those systems operate in hostile environments where adversarial signals propagate through multiple channels. Without architectural mechanisms to constrain that propagation, the adversarial subspace will continue to expand—not just in theory, but in practice as production models accumulate subtle corruptions.
The path forward requires both approaches: rigorous testing to map current vulnerabilities, and architectural defenses to prevent those vulnerabilities from growing unbounded. We need to quantify transfer risk and build systems where adversarial learning doesn't compound over time.
This is particularly urgent as AI systems become more capable and more integrated into critical infrastructure. The same qualities that make large language models powerful—their ability to learn from context, adapt to new domains, and generalize across tasks—also make them susceptible to adversarial contamination. Each capability that improves model utility also widens the potential attack surface.
By combining transfer risk quantification with topology-level defenses, we can build AI systems that are not just tested against known adversarial examples, but architecturally resistant to adversarial evolution. That's the standard of security required for deploying AI in contexts where failures carry real consequences—and it's the standard organizations should be working toward now, before deployment patterns become ossified and harder to change.
For Organizations Implementing AI Security
The architectural challenges outlined here require both theoretical frameworks and practical implementation guidance. I've developed several resources for teams working to secure AI systems:
Myth-Tech Framework for AI/ML Security - A comprehensive threat modeling approach that maps security patterns to operational clarity
SMB AI Security Kit - Forensic-first implementation guide for resource-constrained teams
Both designed for organizations without dedicated AI security staff, emphasizing defensible architecture over compliance theater.
This analysis is part of ongoing work on human-centered AI security at [Soft Armor Labs]. The visual frameworks referenced here are available as educational tools through the Cybersecurity Witwear methodology.
Top comments (0)