Threat modeling LLM apps with the CIA triad and OWASP Top 10

#security #ai #owasp #devsecops

every LLM app you ship has three attack surfaces: confidentiality, integrity, availability. the framework dates back to the 1970s. the attack classes under it are from this year. and the mapping still holds.

this is the checklist i run before any LLM feature goes near production. it leans on OWASP LLM Top 10 and MITRE ATLAS. neither taxonomy is organized around the triad, but the entries in both sort cleanly into it.

what the triad actually means for an LLM

forget the database analogy. for an LLM:

- confidentiality covers what the model knows and processes: system prompts, RAG (retrieval-augmented generation) context, chat history, tool credentials
- integrity covers what the model produces: refusals, generated content, tool call decisions, and training-time behavior baked into weights
- availability covers whether the inference endpoint can serve the next request without burning your bill

every documented production exploit on OpenAI, Microsoft, Anthropic, and Google LLMs maps onto one of those three. Rehberger's "Trust No AI" paper on arXiv catalogs the receipts in 40 pages.

confidentiality: defending what the model leaks on command

three failures keep showing up:

- system prompt extraction
- chat history exfiltration via indirect prompt injection
- RAG document leak through retrieval poisoning

the system prompt is supposed to be invisible. it's also read as input every turn. anything the model reads as input is something an attacker can sometimes coax it to repeat. Embrace The Red (Rehberger's blog) has published working extraction techniques against ChatGPT, Copilot, Bing Chat, and Claude.

defense checklist:

# confidentiality controls that earn their slot
- output filter on common extraction patterns (and rotate the patterns)
- markdown rendering disabled or sanitized (image-tag URLs are the exfil channel)
- MCP tool descriptions reviewed, pinned, and version-locked
- RAG retrieval sources signed or scoped inside a trust boundary
- no secrets in the system prompt, period. treat it like a log file you assume an attacker will read.
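
a minimal sketch of the output-filter idea, assuming you hold the system prompt server-side and can inspect each response before it leaves. `leaks_system_prompt` and the 8-word window are hypothetical choices, not a standard. it only catches near-verbatim repeats, so paraphrased, translated, or encoded leaks will get past it.

```python
import re


def _normalize(text: str) -> list[str]:
    """Lowercase and split into word tokens so spacing and punctuation tricks matter less."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaks_system_prompt(output: str, system_prompt: str, window: int = 8) -> bool:
    """Return True if any `window`-word run from the system prompt appears in the output.

    Assumption: a run of 8 consecutive matching words is rare enough by chance
    to be worth blocking or flagging. Tune `window` against your own traffic.
    """
    prompt_words = _normalize(system_prompt)
    # Pad with spaces so matches always align on word boundaries.
    haystack = " " + " ".join(_normalize(output)) + " "

    if len(prompt_words) < window:
        # Prompt is shorter than the window: compare it as a single run.
        return bool(prompt_words) and (" " + " ".join(prompt_words) + " ") in haystack

    for start in range(len(prompt_words) - window + 1):
        run = " " + " ".join(prompt_words[start:start + window]) + " "
        if run in haystack:
            return True
    return False
```

run it on the final response text, after any tool output has been merged in. that is the text the user actually sees.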

if your model renders arbitrary markdown and can hit user-supplied URLs through image tags, you've shipped a confidentiality exfil channel by default. patching the prompt does nothing. the channel is the renderer.
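
closing the renderer channel can be sketched like this, assuming you post-process model output before rendering. `ALLOWED_IMAGE_HOSTS` and `strip_untrusted_images` are hypothetical names. the regex only covers inline `![alt](url)` syntax, so reference-style images and raw HTML `<img>` tags still need your HTML sanitizer or a content security policy.

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist: only hosts you control should ever serve images.
ALLOWED_IMAGE_HOSTS = {"cdn.example.com"}

# Inline markdown image: ![alt](url "optional title")
_INLINE_IMAGE = re.compile(r"!\[([^\]]*)\]\(\s*<?([^)\s>]+)>?[^)]*\)")


def _is_trusted(url: str) -> bool:
    """Trust only https URLs whose host is on the allowlist."""
    try:
        parsed = urlparse(url)
        return parsed.scheme == "https" and parsed.hostname in ALLOWED_IMAGE_HOSTS
    except ValueError:
        # Malformed URLs are treated as untrusted.
        return False


def strip_untrusted_images(markdown: str) -> str:
    """Replace inline images pointing at non-allowlisted hosts with a text placeholder."""

    def _replace(match: re.Match) -> str:
        alt, url = match.group(1), match.group(2)
        if _is_trusted(url):
            return match.group(0)
        return f"[image removed: {alt}]" if alt else "[image removed]"

    return _INLINE_IMAGE.sub(_replace, markdown)
```

the placeholder keeps the alt text but drops the URL, so nothing the model stuffed into the query string ever reaches the browser.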

integrity: defending what the model produces

prompt injection breaks integrity. so does training data poisoning. so does fine-tuning on attacker-influenced data. the architectural blind spot is that LLMs process instructions and data through the same attention mechanism. no syscall barrier. no privilege separation. acknowledge that in your design or it bites you.

the 2025 joint research from Anthropic, the UK AI Security Institute, and the Alan Turing Institute showed that as few as 250 poisoned documents were enough to install a backdoor in every model size tested (up to 13B parameters), regardless of how much clean data surrounded them. the trigger phrase ships with the weights. nothing in the binary flags compromise.

at inference time, the November 2025 Anthropic disclosure is the canonical recent example: a state-sponsored group jailbroke Claude Code into an autonomous attack agent that made thousands of requests, often several per second, against roughly 30 targets, with the model driving 80 to 90 percent of the operation.

defense checklist:

# integrity controls that ship today
- input/output guardrails (LLM Guard, Rebuff, NeMo Guardrails, or your own)
- model card review for training data provenance you can actually verify
- separate tool-call decisioning from generation where the architecture allows
- log every tool call with the input that triggered it
- treat user input and retrieved documents with identical suspicion
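
two of those controls (decisioning outside the model, logging every tool call with its trigger) can be sketched as one small gate. this assumes your agent loop hands you the proposed tool name, its arguments, and the text that prompted it. `ToolCallGate` and its field names are hypothetical, not part of any guardrail library.

```python
import json
import logging
import time
from dataclasses import dataclass, field

logger = logging.getLogger("toolcalls")


@dataclass
class ToolCallGate:
    """Deterministic check that sits between the model's proposal and execution."""

    allowed_tools: set[str]
    audit_log: list[dict] = field(default_factory=list)

    def authorize(self, tool: str, args: dict, triggering_input: str, source: str) -> bool:
        """Record the proposed call and return whether it may run.

        `source` is a label you assign ("user", "retrieved_document", ...),
        so incident review can tell which channel the instruction came from.
        """
        allowed = tool in self.allowed_tools
        record = {
            "ts": time.time(),
            "tool": tool,
            "args": args,
            "triggering_input": triggering_input,
            "source": source,
            "allowed": allowed,
        }
        # Denied calls are logged too: they are the interesting ones.
        self.audit_log.append(record)
        logger.info(json.dumps(record, default=str))
        return allowed
```

the point of the design is that the allow decision is plain code the model cannot talk its way around. the model proposes, the gate decides.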

availability: defending the endpoint itself

in the 2023 OWASP LLM Top 10, entry four (LLM04) is model denial of service. the 2025 list folds it into LLM10, unbounded consumption. three patterns dominate:

- recursive output forcing. ask the model to elaborate, then elaborate on the elaboration, then write 10k tokens explaining the previous response. each call burns GPU. wedge it into an agentic loop and you've got a free DoS on someone else's API bill.
- context window exhaustion. inflate the input until the model spends real money processing useless tokens.
- tool-call bomb. model calls tool, tool response triggers another tool call, chain doesn't terminate. agentic systems built without depth limits are especially exposed.

defense checklist:

# availability controls
- per-request input token cap
- per-request output token cap (this one gets forgotten)
- max tool-call depth per conversation
- per-user rate limit at the inference layer, not just the API gateway
- circuit breaker on cost-per-request anomalies
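
the first three caps fit in one small object, sketched here under the assumption that your code counts tokens before calling the model and owns the agent loop. `RequestBudget`, `BudgetExceeded`, and the default numbers are all hypothetical. pick limits from your own traffic.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a request or conversation goes past a configured limit."""


@dataclass
class RequestBudget:
    max_input_tokens: int = 8_000   # hypothetical default
    max_output_tokens: int = 1_000  # hypothetical default
    max_tool_depth: int = 5         # hypothetical default
    tool_depth: int = 0

    def check_input(self, input_tokens: int) -> None:
        """Reject oversized prompts before they reach the model."""
        if input_tokens > self.max_input_tokens:
            raise BudgetExceeded(
                f"input is {input_tokens} tokens, cap is {self.max_input_tokens}"
            )

    def output_cap(self) -> int:
        """Value to pass as the provider's max-output-tokens parameter."""
        return self.max_output_tokens

    def enter_tool_call(self) -> None:
        """Call once per tool invocation; stops chains that do not terminate."""
        if self.tool_depth >= self.max_tool_depth:
            raise BudgetExceeded(f"tool-call depth cap of {self.max_tool_depth} reached")
        self.tool_depth += 1
```

create one budget per conversation, call `check_input` before each model call, and call `enter_tool_call` before each tool runs. a raised `BudgetExceeded` is the signal to end the loop and return an error to the user.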

most deployments wire up input limits and forget the rest. that's where the bill explodes.

gotchas that bite teams regardless of stack

a few that show up in incident reviews:

- MCP tool descriptions are executable surface. anything in a tool description gets read into the model's context every turn. one poisoned tool, one compromised vendor, full chain.
- canary tokens get exfiltrated. if you use canaries to detect leaks, rotate them per-tenant and don't ship the same string to every customer.
- rate limits scoped to API keys instead of users. an attacker rotates keys and runs up your bill.
- cost observability gaps. you can see latency and error rate. you usually cannot see when one prompt cost 200x the next one until it's already done.
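
the last two gotchas point at the same fix: track cost per user, not per key, and flag outliers as they happen. a minimal sketch, assuming you can resolve every API key to a stable user id and compute a per-request cost. `CostMonitor`, the 20x ratio, and the window size are hypothetical.

```python
from collections import defaultdict, deque
from statistics import median


class CostMonitor:
    """Flags requests that cost far more than a user's recent median."""

    def __init__(self, window: int = 50, ratio: float = 20.0, min_samples: int = 5):
        self.ratio = ratio
        self.min_samples = min_samples
        # Keyed by user id (not API key), so rotating keys does not reset the baseline.
        self._history = defaultdict(lambda: deque(maxlen=window))

    def record(self, user_id: str, cost_usd: float) -> bool:
        """Store one request's cost; return True if it is anomalous for this user."""
        history = self._history[user_id]
        anomalous = False
        if len(history) >= self.min_samples:
            baseline = median(history)
            anomalous = baseline > 0 and cost_usd > self.ratio * baseline
        history.append(cost_usd)
        return anomalous
```

this only reports after the request has been paid for. pair it with a hard output cap if you need to stop the spend instead of just seeing it.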

wrapping up

every threat model you build for an LLM app will route back through confidentiality, integrity, availability. if you can answer "what controls do i have on each pillar" with named tools and named limits, you're ahead of most production deployments shipping right now. if you can't, that is your weekend.

i wrote the full breakdown, including how Rehberger's Trust No AI paper maps every documented OpenAI, Microsoft, Anthropic, and Google exploit onto the triad, over on the ToxSec Substack.

ToxSec covers AI security vulnerabilities, attack chains, and the offensive tools defenders actually need to understand. Run by an AI Security Engineer with hands-on experience at the NSA, Amazon, and across the defense contracting sector. CISSP certified, M.S. in Cybersecurity Engineering.