Hernan Huwyler

Posted on Apr 30

Why Traditional Security Testing Misses 70% of AI Attack Surface

#devops #ai #security #cybersecurity

A practical guide to AI-specific threat modeling, vulnerability assessment, and the frameworks that actually matter for predictive, generative, and agentic systems

I've spent the last two years reviewing AI security assessments across financial services, AI software and computer vision development, healthcare, and technology companies. The pattern is consistent and concerning: organizations conduct thorough infrastructure reviews, validate API security, verify access controls, and declare their AI systems production-ready. Then they discover—often after an incident—that they tested roughly 30% of their actual attack surface.

The missing 70% consists of threats that simply don't exist in traditional software systems: training data poisoning that corrupts model behavior without modifying code, adversarial inputs that cause systematic misclassification, prompt injection attacks that override system instructions through user-provided text, model extraction through API query patterns, and autonomous agents executing unauthorized actions through legitimate tool access.

This isn't a theoretical concern. It's a systematic gap in how we approach AI security.

Complete guide to AI threat modeling from STRIDE to production:
AI Threat and Vulnerability Assessment

Why AI Systems Require Different Threat Models

Traditional software operates deterministically. Given identical inputs, it produces identical outputs. Its behavior is explicitly programmed and can be inspected through source code review. Security assessment frameworks like OWASP Top 10, CWE, and conventional penetration testing evolved around these assumptions.

AI systems violate every one of them.

They learn behavior from data rather than having it explicitly coded. They produce probabilistic outputs that may vary across identical inputs. Their decision boundaries emerge from statistical patterns rather than programmed logic. Their supply chain extends beyond code libraries to include datasets, pre-trained models, fine-tuning corpora, embeddings, and retrieval sources—each introducing distinct vulnerability classes.

This creates attack surfaces across dimensions traditional security never addressed:

Data-centric attacks manipulate training data, labels, feature pipelines, or retrieval corpora to influence model behavior without touching infrastructure or code.

Model-centric attacks exploit learned behavior through adversarial inputs, extraction queries, or inversion techniques that reconstruct training data.

Pipeline-centric attacks compromise MLOps infrastructure, model registries, training environments, or deployment pipelines.

Human interaction attacks exploit natural language interfaces through prompt injection, jailbreaking, or manipulation of user-facing outputs.

Autonomy attacks exploit tool access, planning capabilities, memory systems, or action authorization in agentic AI systems.

The security assessment methodology must account for all five dimensions. Conventional application security testing covers portions of the pipeline layer but misses the others entirely.

The Value of AI-Specific Threat Taxonomies

Three frameworks have emerged as essential references for comprehensive AI threat assessment:

MITRE ATLAS (Adversarial Threat Landscape for AI Systems)
ATLAS catalogs over 80 techniques organized across 14 tactics specifically for attacking AI systems. It extends the familiar ATT&CK framework into the AI domain, providing a structured taxonomy of how adversaries actually compromise machine learning systems.

The value isn't just the technique catalog—it's the common language it provides between security teams, data science teams, and business stakeholders. When you identify that your fraud detection model is vulnerable to "AML.T0020 - Poison Training Data," everyone understands the reference, the attack pattern, and where to look for mitigations.

ATLAS differentiates attacks by lifecycle stage (reconnaissance, resource development, initial access, ML model access, execution, persistence, etc.), making it practical to map threats to your specific MLOps pipeline and deployment architecture.

NIST AI 100-2 (Adversarial Machine Learning Taxonomy)
NIST's taxonomy provides systematic categorization of adversarial ML attacks by:

Attack objective (confidentiality, integrity, availability)
Attacker knowledge (white-box, gray-box, black-box)
Attack specificity (targeted vs. indiscriminate)
Lifecycle stage (training-time vs. inference-time)
This framework excels at helping teams understand why certain vulnerabilities matter more in specific contexts. A financial institution deploying a credit scoring model faces different threat priorities than a healthcare provider deploying a diagnostic assistant, even when both use similar ML architectures.

The taxonomy also bridges the gap between academic research on adversarial ML and operational security practice. When researchers publish new attack techniques, NIST's categorization helps practitioners assess whether the attack applies to their deployment model.

OWASP Top 10 for LLM Applications
OWASP's LLM-specific guidance addresses the explosion of generative AI deployments. The 2025 version covers:

Prompt Injection
Insecure Output Handling
Training Data Poisoning
Model Denial of Service
Supply Chain Vulnerabilities
Sensitive Information Disclosure
Insecure Plugin Design
Excessive Agency
Overreliance
Model Theft
What makes this valuable isn't just the top-ten list—it's the detailed attack scenarios, real-world examples, and practical prevention strategies that accompany each risk category. The framework explicitly addresses risks in RAG (Retrieval Augmented Generation) systems, agent architectures, and tool-using LLMs that earlier security frameworks never contemplated.

Each risk includes developer-focused guidance on detection, prevention, and example attack scenarios. This makes it immediately actionable for engineering teams rather than requiring security expertise to translate abstract threats into concrete controls.

STRIDE-AI: Extending Classic Threat Modeling for AI Assets
Classic STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) provides a structured approach to threat modeling. It works well for traditional software but requires extension for AI systems.

Spoofing in AI extends beyond identity impersonation to include:

Training data source spoofing (malicious data presented as trusted sources)
Model provenance spoofing (trojanized models distributed through model hubs)
Prompt identity manipulation (causing models to assume unauthorized roles)
Tampering expands dramatically:

Training data poisoning (injecting crafted samples to embed backdoors)
Label manipulation (corrupting ground truth)
Feature pipeline tampering (modifying preprocessing logic)
Model weight modification (directly altering learned parameters)
Prompt template tampering (modifying system instructions)
Retrieval corpus poisoning (injecting malicious content into RAG systems)
Repudiation creates AI-specific accountability gaps:

Inability to prove who changed a model or dataset
Missing audit trails for prompt modifications or agent actions
Inability to reconstruct why specific outputs were produced
Information Disclosure includes novel privacy and IP risks:

Training data leakage through model outputs
System prompt leakage through crafted queries
Membership inference (determining if data was in training set)
Model inversion (reconstructing sensitive features from outputs)
Denial of Service exploits computational intensity:

High-volume API abuse
Token flooding in language models
Adversarial prompts triggering expensive computation
Agent loops consuming resources indefinitely
Elevation of Privilege allows capability expansion:

Prompt injection causing unauthorized tool use
Agents executing actions beyond intended scope
Weak role boundaries in MLOps pipelines

The extension isn't cosmetic. Teams using classic STRIDE for AI assessments consistently miss data poisoning, adversarial examples, prompt injection, and agent-specific threats because traditional STRIDE categories don't naturally surface these attack vectors.

Threat Vectors That STRIDE Alone Doesn't Capture

Six threat categories require explicit attention beyond STRIDE extension:

Data Poisoning Manipulates training, fine-tuning, retrieval, or feedback data to corrupt model behavior. Three subtypes create different impacts:

Availability poisoning: Degrades overall performance
Integrity poisoning: Creates targeted backdoor behaviors
Bias poisoning: Skews outcomes for specific groups
The challenge: this attack succeeds without compromising infrastructure or code. Traditional security monitoring won't detect it.

Evasion and Adversarial Examples Crafts inputs that cause misclassification or bypass detection at inference time. Common in computer vision, audio processing, fraud detection, and content moderation.

The challenge: inputs appear legitimate to humans and validation logic but exploit model-specific decision boundary weaknesses.

Model Extraction and Theft Replicates model behavior or steals intellectual property through systematic API queries. Attackers build surrogate models that approximate the original without accessing weights directly.

The challenge: extraction happens through normal API usage patterns. Without query monitoring and behavioral analysis, it's indistinguishable from legitimate high-volume use.

Prompt Injection Places malicious instructions in user inputs, documents, web pages, emails, or tool outputs, causing models to ignore system instructions. Particularly critical for LLMs and RAG systems.

Direct injection: User types malicious instructions
Indirect injection: Malicious instructions embedded in retrieved documents

The challenge: the model cannot reliably distinguish system instructions from adversarial content unless the architecture enforces separation.

Hallucination and Fabrication Produces confidently stated incorrect information. While not always malicious, it creates exploitable risk when outputs drive decisions or actions.

The challenge: distinguishing incorrect confidence from correct confidence requires external verification mechanisms that many deployments lack.

Agentic Risks Unique to AI systems that plan, use tools, and act on the environment:

Goal hijacking
Tool abuse through legitimate access
Recursive harmful loops
Multi-step failure chains
Memory poisoning
Cross-system lateral movement
The challenge: individual actions may appear authorized while the sequence or combination violates policy.

Practical Implementation: The Six-Phase Assessment Process

Effective AI security assessment follows a repeatable process aligned with NIST AI RMF and ISO/IEC 42001:

Phase 1: Define Scope and Objectives
Identify which AI systems, environments, and use cases are in scope. Document risk tolerance with specific measurable standards:

"No PII in outputs"
"No more than 3% performance degradation after adversarial hardening"
"Prompt injection bypass rate below 0.1%"
Vague success criteria produce vague assessments.

Phase 2: Inventory AI Assets and Data Flows
Catalog models, datasets, pipelines, training and inference infrastructure, and external dependencies. Include metadata: data lineage, model versions, training configuration, deployment endpoints, prompt templates, tool permissions, retrieval corpora.

Build an architecture diagram capturing every data flow, trust boundary, and external dependency.

Critical insight: Most assessments fail at this phase. Teams inventory the model and API endpoint but miss the data pipeline, feature store, retrieval corpus, prompt templates, tool configurations, and monitoring infrastructure. Each component has its own threat profile and attack surface.

Phase 3: Threat Mapping and Vulnerability Analysis
Apply STRIDE-AI threat modeling per asset. Use MITRE ATLAS to identify attack patterns specific to your system type. Consider attack surfaces across inputs, training data, model parameters, interfaces, logs, monitoring systems, and agent tools.

Build scenario-based risk assessments for the most consequential threats. Generic threat lists produce generic findings. Scenarios produce actionable intelligence.

Phase 4: Testing and Validation
Perform targeted security tests informed by the threat model:

Adversarial testing
Prompt injection testing
Data integrity tests
Privacy leakage tests
Agent behavior tests
Abuse resistance tests
Use a mix of automated tooling and manual testing. Test against specific threats identified in Phase 3, not generic checklists.

Phase 5: Risk Scoring and Prioritization
Use a likelihood-impact matrix with AI-specific scoring. The OWASP AI Vulnerability Scoring System (AIVSS) provides dimensions designed for AI risks including agentic systems.

Maintain an AI risk register linking threats, vulnerabilities, controls, and residual risk to business impact and regulatory constraints.

Phase 6: Mitigation and Continuous Monitoring
Implement layered controls: access control, input validation, rate limiting, adversarial training, differential privacy, data validation, output filtering, robust logging, human approval gates.

Set up ongoing monitoring of performance, drift, anomaly behavior, and security signals. Loop findings back into risk assessment.

Critical insight: AI threat assessment isn't a one-time activity. Systems change through retraining, data updates, prompt modifications, tool additions, and vendor model changes. Each change can introduce new vulnerabilities or alter control effectiveness.

Testing Differentiated by AI Type

Assessment must be tailored to the system's interaction mode and autonomy level.

For Predictive AI (fraud detection, credit scoring, demand forecasting)
Primary focus: training data integrity, adversarial input robustness, fairness across demographic groups, drift resistance.

Key tests:

Simulate evasion attacks by incrementally altering input features
Inject plausible poisoned samples into training data to evaluate backdoor risk
Run fairness assessments including robustness under data drift
Test model stability across distribution shifts
For Generative AI (chatbots, code generation, content creation)
Primary focus: prompt injection resistance, harmful content generation, data leakage, retrieval pipeline security.

Key tests:

Systematic prompt injection using curated adversarial prompt suites
Multi-turn and indirect injection through retrieved content
Red-team exercises attempting harmful output elicitation
Privacy testing for training data leakage
Output filter validation (false negatives and false positives)
For Agentic AI (tool-using agents, autonomous workflow systems)
Testing must cover all generative AI threats plus autonomy-specific risks.

Key tests:

Scenario-based simulations in sandboxed environments
Permission boundary testing (remove tools, restrict scopes)
Rollback and fail-safe mechanism validation
Memory integrity testing through crafted interactions
Multi-step action chain analysis for unauthorized outcomes
Different threats. Different tests. Different controls. Applying generic "AI security checklists" to all three types produces assessments that miss the most important risks for each.

Built vs. Bought: Different Risks Require Different Strategies
Whether you develop AI internally or procure it from vendors fundamentally changes both threat profile and assessment approach.

Internally Developed AI
Visibility: Full access to data, model architecture, training pipeline, infrastructure
Primary threat exposure: Training-time attacks (supply chain compromise, data poisoning, environment compromise)
Assessment approach: Integrate threat modeling into MLOps pipeline, maintain detailed documentation, use internal red teaming, adopt secure MLOps practices

Best practices:

Security regression testing in CI/CD
Data lineage documentation
Model cards and evaluation transparency
Signed artifacts and registry governance
Environment isolation and secrets management
Procured AI
Visibility: Limited or no access to training data, model internals, training process
Primary threat exposure: Supply chain vulnerabilities, embedded backdoors, undocumented behaviors, loss of control over shared data, unannounced model updates
Assessment approach: Vendor due diligence, contractual controls, independent validation, wrapper controls

Best practices:

Security architecture review in procurement
Contractual audit rights and logging commitments
Change notification requirements
Independent prompt injection and leakage testing
External guardrails and policy enforcement
Output monitoring independent of vendor
Critical insight: For built AI, uncertainty concentrates in implementation (did we build controls correctly?). For bought AI, uncertainty concentrates in assurance (do vendor controls work as claimed?). You often can't verify vendor claims about poisoning resistance, fine-tuning provenance, data retention, or hidden tool usage. This assurance gap means procurement assessment must emphasize trust boundaries, vendor governance verification, and contractual controls more heavily than technical testing.

The Role of Red and Blue Teams
Strong AI vulnerability assessment combines red team pressure testing with blue team detection and defensive validation.

Red Team Role
Simulate realistic attacker, insider, misuse, and abuse scenarios. Test whether weaknesses in prompts, data pipelines, model governance, APIs, memory, tools, vendor integrations, and human workflows can be turned into business impact.

Key questions:

Can the model be manipulated through untrusted inputs?
Can poisoned data enter training or retrieval pipelines?
Can the model leak sensitive information?
Can agents invoke tools beyond intended authority?
Can vendor updates introduce hidden risk?
Blue Team Role
Validate defensive readiness, observability, containment, and recovery. Test whether the organization can detect exploit attempts, recognize harmful model behavior, distinguish normal use from abuse, contain incidents, and restore trusted operation.

Key questions:

Would we detect prompt injection or model extraction quickly?
Can we distinguish drift, misuse, poisoning, and infrastructure failure?
Do logs capture enough context to reconstruct incidents?
Can we disable tools or models safely during active incidents?
Can we prove which model version and dataset were active at incident time?
Purple Teaming
The most valuable approach: red and blue teams work collaboratively. Red team demonstrates exploitation while blue team observes telemetry, identifies detection gaps, tests containment, and validates response procedures.

This is especially effective for prompt injection scenarios, agent tool misuse, model extraction attempts, retrieval poisoning, and data leakage testing where defenders are still learning what malicious behavior looks like in production.

Common Implementation Failures
Ten patterns recur across organizations conducting AI security assessments:

Treating AI like ordinary software: Assessing only infrastructure and application security while missing data, model, and pipeline threats
Testing only accuracy: Ignoring abuse resistance, security, privacy, robustness, and fairness
Threat modeling only the endpoint: Missing data pipeline, training infrastructure, retrieval systems, tool integrations, monitoring
Ignoring vendor opacity: Accepting vendor claims without independent verification
Allowing model-based authorization: Letting models directly authorize high-risk actions without independent policy enforcement
Weak prompt isolation: Mixing trusted instructions with untrusted user and retrieved content
Insufficient logging: Making root cause analysis impossible when problems occur
Not reassessing after changes: Allowing security posture to degrade as systems evolve
Assuming controls without testing: Accepting vendor or development claims about safety without adversarial validation
Ignoring human overreliance: Failing to assess whether users can distinguish reliable outputs from unreliable ones
The most consequential mistake for most organizations is the first: treating AI like ordinary software. If your current security assessment doesn't include AI-specific threat categories (poisoning, evasion, extraction, prompt injection, agent abuse), it's missing the majority of the AI-specific attack surface regardless of how thoroughly it covers traditional security dimensions.

Practical Next Steps
For teams building AI systems:

Classify your AI type (predictive, generative, agentic) and sourcing model (built, procured)
Conduct STRIDE-AI threat modeling workshops with cross-functional participation
Map identified threats to MITRE ATLAS techniques and OWASP AI risks
Implement security regression testing in your MLOps pipeline
Deploy monitoring for drift, extraction patterns, adversarial inputs, and policy violations
For teams procuring AI:

Include AI-specific security requirements in vendor due diligence
Negotiate contractual audit rights and logging commitments
Implement wrapper controls (guardrails, redaction, external policy enforcement)
Test procured systems with prompt injection and leakage assessments
Monitor vendor model updates and conduct regression testing
For security teams:

Adopt MITRE ATLAS, NIST AI 100-2, and OWASP LLM Top 10 as reference frameworks
Build AI-specific security testing capabilities (adversarial testing, prompt injection, privacy testing)
Integrate AI risk into enterprise risk registers with appropriate governance
Establish purple team exercises for high-risk AI deployments
Create AI security baseline documentation and reassessment cadence
Conclusion
AI systems present attack surfaces that traditional security assessment methodologies weren't designed to evaluate. The gap isn't theoretical—it's measurable, systematic, and present in most AI deployments today.

The frameworks exist. MITRE ATLAS provides a common taxonomy of AI-specific attack techniques. NIST offers systematic categorization of adversarial ML attacks. OWASP delivers practical guidance for LLM security. STRIDE-AI extends familiar threat modeling into the AI domain.

The challenge isn't lack of guidance. It's organizational inertia—continuing to apply traditional security checklists to fundamentally different systems and declaring victory when infrastructure tests pass.

An AI system assessed only for traditional security threats is an AI system with most of its attack surface unexamined. The threats that don't show up in conventional penetration tests—data poisoning, adversarial manipulation, prompt injection, model extraction, agentic abuse—are the ones most likely to cause actual harm in production.

The question isn't whether your organization will adopt AI-specific threat assessment. The question is whether you'll do it before or after an incident forces the issue.

About the Author

Prof. Hernan Huwyler, MBA, CPA, CAIO serves as AI GRC Consultancy Director, working with organizations across financial services, technology, healthcare, and public sector to build practical AI governance frameworks. His work on AI threat modeling, quantitative risk assessment, and compliance automation is publicly available at https://hwyler.github.io/hwyler/

Connect on LinkedIn: linkedin.com/in/hernanwyler

References

MITRE ATLAS
NIST AI Risk Management Framework
OWASP Top 10 for LLM Applications:
ISO/IEC 42001:2023 AI Management Systems
OWASP AI Vulnerability Scoring System (AIVSS)

Tags: #AISecuirty #MachineLearning #ThreatModeling #CyberSecurity #MLOps #LLMSecurity #MITRE #OWASP #NIST #DevSecOps