Kuboid Secure Layer

Posted on Mar 13

RAG Security — How Attackers Poison AI Knowledge Bases and What to Do About It

#ai #security #hacking #webdev

TLDR: You built a RAG system to make your AI accurate and contextual. You fed it your documentation, your product data, your knowledge base. You tested it carefully for accuracy. But did you consider what happens if an attacker gets something into that knowledge base before your users query it? Research published at USENIX Security 2025 showed that just five carefully crafted documents among millions can achieve a 90% attack success rate. Poisoning just 0.04% of a corpus can achieve 98.2% attack success and 74.6% system failure. The most dangerous variant — the Phantom attack — stays dormant until a specific keyword is queried, evading every standard monitoring approach. Your knowledge base is an attack surface. Most teams building RAG systems don't treat it like one.

What RAG Is and Why It's Everywhere

Retrieval-Augmented Generation (RAG) has become the standard architecture for AI applications that need to work with private, current, or domain-specific knowledge. The core idea is simple: instead of relying entirely on what an LLM learned during training, you maintain an external knowledge base — your documentation, product data, internal policies, support tickets, whatever is relevant — and retrieve the most pertinent content each time a user asks a question. The LLM then generates its answer grounded in what was retrieved.

The result is an AI that can answer questions about your specific company, product, or domain without expensive retraining. It's why enterprise AI assistants, customer support bots, internal knowledge tools, and AI-powered search products almost all use RAG under the hood.

The security implication follows directly from the architecture: whatever ends up in your knowledge base ends up in your AI's answers. If an attacker can influence what gets into that knowledge base — however subtly — they can influence what your AI tells every single user who asks a related question.

The Attack Surface: Who Can Write to Your Knowledge Base?

Before considering how RAG poisoning works, consider a more basic question: who and what can add content to the data sources your AI indexes?

For most organisations, the honest answer is: more than you think. RAG systems typically index SharePoint sites, Confluence wikis, Google Drive folders, Slack channels, GitHub repositories, ticketing systems, and shared databases. In many deployments, the write permissions to these sources are governed by general content management policies — not security policies designed with AI indexing in mind.

A compromised employee account can write to your Confluence wiki. A contractor with document upload permissions can add files to SharePoint. An external vendor whose content you've indexed as a trusted source can update their documentation. In each case, content flows from that source into your vector database on the next indexing cycle — with no additional validation, no security review, and no record that anything changed.

The attack surface isn't the vector database itself. It's every content source upstream of it.

How RAG Poisoning Works

The PoisonedRAG research, accepted at USENIX Security 2025 and originally published in February 2024, is the most rigorous public study of knowledge corruption attacks against RAG systems. Its finding is striking: injecting five malicious texts into a knowledge database with millions of texts achieves a 90% attack success rate.

The attack works in two stages. First, the attacker crafts documents designed to be retrieved for specific target queries — using the same semantic similarity principles that make RAG retrieval work, but in reverse. Second, the crafted documents are written to steer the LLM toward an attacker-chosen answer when they appear in the retrieved context. Because the LLM is designed to trust and synthesise its retrieved context, it does.

The goals vary. Misinformation — making the AI state false facts as authoritative. Commercial manipulation — making the AI recommend one product or vendor over another. Credential or data exfiltration — making the AI surface sensitive information embedded in other retrieved documents. Or, combined with the indirect prompt injection techniques covered earlier in this series, triggering the AI to take real-world actions.

Poisoning merely 0.04% of a corpus can lead to a 98.2% attack success rate and 74.6% system failure. At that scale, a single well-placed document in a large enterprise knowledge base is a meaningful threat.

The Phantom Attack: Dormant Until Triggered

Standard defences against RAG poisoning focus on monitoring for degraded system accuracy — if the AI starts producing wrong answers, something may have been poisoned. The Phantom attack, introduced in late 2024, was specifically engineered to defeat this approach.

A Phantom attack injects a single malicious document that remains dormant during normal queries, maintaining system performance metrics, and activates selectively only when specific trigger keywords appear. The poisoned document causes no measurable degradation to general system accuracy and produces no unusual retrieval patterns. It sits in the knowledge base, invisible to standard monitoring, until the specific trigger condition is met — at which point it executes: generating harmful content, exfiltrating data, or triggering a downstream action.

This is significant because it directly undermines the most common approach organisations take to detecting RAG poisoning: monitoring output quality. A Phantom attack passes that test by design.

Cross-Document Injection: The Architectural Complication

Research into RAG defences has surfaced a subtler problem rooted in how LLMs process multiple retrieved documents simultaneously.

Under causal attention, tokens in one retrieved document can attend to — and be influenced by — tokens from other retrieved documents. When the retrieved document set contains conflicting information — correct information versus incorrect information injected by an attacker — cross-document attention can fall into adversarial patterns.

In practical terms: even if your poisoned document is retrieved alongside several legitimate documents, the poisoned content can influence how the model interprets and weights the legitimate content around it. The attacker doesn't need to dominate the retrieved context. They only need to introduce enough signal to shift the model's synthesis.

This is why simply adding more legitimate documents to "dilute" a poisoned one is not a reliable defence strategy.

Supply Chain Attacks on RAG Data Sources

Beyond direct injection into your own knowledge base, a more systemic risk is the supply chain of external content many RAG systems index.

Organisations frequently index third-party documentation, vendor knowledge bases, publicly accessible content, or RSS feeds as part of their RAG pipeline. Any of these represents a trust boundary — if the external source is compromised, updated maliciously, or simply accepts user-contributed content, the poisoning vector exists without the attacker ever touching your infrastructure.

The Slack AI vulnerability discovered in August 2024 combined RAG poisoning with social engineering — demonstrating that real-world exploits don't neatly fit single-category attack descriptions. Attackers who understand your RAG pipeline will probe every upstream source, not just the most obvious ones.

Defence: What Actually Works

Defending a RAG system against poisoning requires treating the knowledge base as a security-sensitive boundary, not just a content management problem.

Access control on write paths, not just read paths. Most RAG security thinking focuses on who can query the system. The more important question is who and what can write to the indexed sources. Applying strict, role-based write permissions to every content source your RAG pipeline ingests — and auditing those permissions regularly — is foundational.

Content validation on ingestion. Documents entering the knowledge base should be checked for anomalous patterns: embedded instruction-like text, unusual metadata, content that semantically conflicts with existing authoritative documents on the same topic. This won't catch sophisticated attacks, but it catches the naive ones and raises the cost for everything else.

Source trust levels and retrieval filtering. Not all sources in your knowledge base are equally trustworthy. Internal, admin-controlled documentation should be weighted differently from content contributed by external parties. Architecturally separating high-trust and low-trust content sources, and restricting what the model retrieves from each for different query types, reduces the blast radius of any poisoning that occurs in lower-trust tiers.

Output monitoring and anomaly detection. Log what your RAG system retrieves and what it generates. Establish baselines for expected retrieval patterns and output characteristics. Deviations — a document being retrieved unusually frequently, answers diverging from established content on the same topic, outputs including unexpected URLs or instructions — are detectable signals if you're looking for them.

Regular knowledge base audits. Periodically review what has been added to your indexed sources and when. Unexplained additions, especially in the period before an anomalous output is noticed, are worth investigating. Git history, SharePoint change logs, and document version tracking are all audit tools that most teams have but don't apply to their RAG pipeline.

Retrieval transparency for end users. Where your application surface allows, showing users which sources contributed to an answer creates a feedback mechanism. Users familiar with your knowledge base are sometimes the fastest way to detect an anomalous source appearing in citations.

What RAG Security Testing Involves

Testing a RAG system for poisoning vulnerabilities is distinct from traditional application security testing. It requires understanding the full data ingestion pipeline — every source, every update frequency, every permission model — and applying adversarial inputs at each stage.

A structured assessment covers: mapping every content source indexed by the RAG pipeline and their write permission models; attempting injection through each write-accessible source to test retrieval effectiveness; testing whether crafted documents can influence answers to target queries without triggering output anomalies; probing retrieval behaviour across trust boundaries; and evaluating whether output monitoring and access logging would detect a real poisoning event.

The goal is a clear picture of which injection vectors exist in your specific architecture, what an attacker could achieve through each one, and which defences are actually in place versus assumed.

How Kuboid Secure Layer Can Help

At Kuboid Secure Layer, RAG security assessments are part of our broader AI application security services. We work with engineering teams building RAG-based products to map their full knowledge ingestion pipeline, test each stage for poisoning vectors, and establish the access controls and monitoring that most RAG deployments are currently missing.

If you're building a RAG application — or running one in production that has never been assessed for these attack patterns — get in touch here. You can also read more about our approach to understand how we work with technical teams.

Final Thought

The reason RAG poisoning is underappreciated as a risk is the same reason it's effective: RAG systems are built to trust their knowledge base. That's the feature. The attack exploits the feature.

Five documents. Millions in the database. 90% success rate. The numbers from PoisonedRAG aren't a warning about future risks — the research was accepted at USENIX Security 2025, built on a technique first documented in early 2024. The attack methodology is mature. The defences, in most production RAG deployments, are not.

If your AI application reads from a knowledge base that anyone — internally or externally — can influence, you have an attack surface. The question is whether you understand its boundaries before an attacker does.

DEV Community