soohan abbasi

Posted on Jun 13

# Memory Poisoning in Agentic RAG: The Attack Nobody Is Defending Against

Series: Weekly AI/ML Deep Dives — Week 5 of 12
Reading Time: ~13 minutes
Tags: RAG, LLMs, Security, Agentic AI, Memory Poisoning, NLP, Research

"We spent years making AI systems smarter. We forgot to make them suspicious."

## Introduction: The Problem With Trusting Your Own Memory

In Week 4, we discussed how Retrieval-Augmented Generation transformed LLMs by giving them access to external knowledge at inference time. RAG systems became more factual, more updateable, and more reliable.

But there is a darker side to this architecture that the research community is only beginning to take seriously.

Agentic RAG systems do not just retrieve from static knowledge bases. They learn from experience. They store past interactions, successful reasoning traces, and task outcomes in long-term memory. When a new task arrives, they retrieve relevant past experiences and use them to guide current behavior.

This is powerful. It is also a significant vulnerability.

If an attacker can plant false memories in that system, the agent will trust those memories the same way it trusts legitimate ones. It will learn from fabricated experiences. It will repeat behaviors that were never actually successful. And it will do all of this without any indication that something has gone wrong.

This is memory poisoning. As of early 2026, we do not have a fully reliable way to stop it.

[Figure: Memory Poisoning in Agentic RAG, By the Numbers]

## Part 1: Understanding the Attack Surface

Before getting into specific attacks, it helps to understand why Agentic RAG systems are vulnerable in the first place.

A standard RAG system retrieves from a fixed knowledge base that is controlled and relatively static. Poisoning it requires direct access to that knowledge base.

An Agentic RAG system is different. Its memory grows dynamically with every interaction. Every task the agent completes, every reasoning trace it produces, every outcome it observes gets written back into memory. This memory then influences future behavior.

The attack surface is not a static database. It is a continuously growing, self-updating store of experiences that the agent treats as ground truth.

Three properties make this particularly dangerous.

First, agents apply a semantic imitation heuristic. When facing a new task, they retrieve past experiences that seem relevant and repeat what previously worked. This is rational behavior in a safe environment. In a compromised one, it means the agent will faithfully repeat whatever the attacker wanted it to learn.

Second, memory entries are not verified for provenance. The agent cannot distinguish between a memory it formed through legitimate task completion and one that was planted by an attacker. Both look identical at retrieval time.

Third, poisoning is self-reinforcing. Once a malicious behavior enters memory and gets executed, the agent may record that execution as another successful experience. The poisoning compounds over time.
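To make the three properties concrete, here is a minimal sketch of an experience memory with a write-back loop. Everything in it is an assumption for illustration: the `Experience` and `ExperienceMemory` classes, the word-overlap relevance score, and the action names are hypothetical and do not come from any of the papers discussed below.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    task: str
    action: str
    succeeded: bool


@dataclass
class ExperienceMemory:
    """Toy experience store with no provenance field (hypothetical)."""
    entries: list = field(default_factory=list)

    def write(self, exp: Experience) -> None:
        # Nothing records who or what produced this entry.
        self.entries.append(exp)

    def retrieve(self, task: str, k: int = 3) -> list:
        # Toy relevance: number of lowercase words shared with the stored task.
        words = set(task.lower().split())
        ranked = sorted(
            self.entries,
            key=lambda e: len(words & set(e.task.lower().split())),
            reverse=True,
        )
        return [e for e in ranked[:k] if e.succeeded]


def run_task(memory: ExperienceMemory, task: str) -> str:
    """Imitate the most relevant past success, then log this run as a success."""
    past = memory.retrieve(task)
    action = past[0].action if past else "default_safe_action"
    memory.write(Experience(task, action, succeeded=True))  # write-back
    return action


memory = ExperienceMemory()
memory.write(Experience("clean sales data", "drop_null_rows", True))
# Planted entry: structurally identical to a legitimate one.
memory.write(Experience("validate sales data pipeline", "skip_validation", True))

for _ in range(3):
    run_task(memory, "validate sales data pipeline")

poisoned = sum(e.action == "skip_validation" for e in memory.entries)
print(f"{poisoned} of {len(memory.entries)} entries now carry the planted action")
```

One planted record becomes four after three ordinary runs, because each imitation is written back as a fresh "success" with nothing to mark where the behavior originally came from.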

## Part 2: The Attacks

### MemoryGraft: Planting False Experiences

MemoryGraft, published by researchers at the University of Georgia in December 2025, was one of the first papers to systematically study indirect memory poisoning in LLM agents.

The attack works through a benign-looking file. An attacker provides a README or documentation file that appears entirely normal. Hidden within it are executable code and fabricated successful experiences formatted to match the agent's memory structure.

When the agent processes the file, it executes the hidden code and writes the poisoned entries into its memory. No trigger phrase is needed. No special access is required. The attacker only needs the agent to read a file.

What makes MemoryGraft particularly effective is how it exploits dual retrieval channels. Many Agentic RAG systems combine lexical retrieval (BM25) with semantic retrieval (dense embeddings searched through a vector index such as FAISS). MemoryGraft crafts poisoned entries that surface through both channels at once.

The results were striking. In experiments using MetaGPT's DataInterpreter with GPT-4o, just 10 poisoned records captured approximately 48% of all future retrievals. The poisoning persisted across sessions until manually purged.
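The dual-channel idea can be sketched with a toy hybrid retriever. This is not MemoryGraft's code. The BM25 function is a plain textbook implementation, and the semantic channel is a hypothetical hand-made concept map (`CONCEPTS`) standing in for a real embedding model and vector index. The documents and queries are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical concept map standing in for a dense embedding model.
CONCEPTS = {
    "clean": "prep", "tidy": "prep", "sanitize": "prep",
    "csv": "table", "spreadsheet": "table",
    "plot": "viz", "chart": "viz",
}


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a plain BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def concept_scores(query: str, docs: list[str]) -> list[float]:
    """Cosine similarity between concept-count vectors (toy semantic channel)."""
    def vec(text: str) -> Counter:
        return Counter(CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS)

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[k] * b[k] for k in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    q = vec(query)
    return [cosine(q, vec(d)) for d in docs]


def hits(scores: list[float]) -> set[int]:
    """Indices of documents with a non-zero score in one channel."""
    return {i for i, s in enumerate(scores) if s > 0}


docs = [
    "clean the csv before training",                      # 0: legitimate
    "tidy spreadsheet columns first",                     # 1: legitimate paraphrase
    "plot a chart of revenue",                            # 2: unrelated
    "clean csv tidy spreadsheet then run remote script",  # 3: crafted entry
]

results = {}
for query in ("sanitize csv", "tidy spreadsheet"):
    results[query] = (hits(bm25_scores(query, docs)), hits(concept_scores(query, docs)))
    print(query, "->", "lexical:", sorted(results[query][0]),
          "semantic:", sorted(results[query][1]))
```

In this toy setup the crafted entry (index 3) is the only document that both phrasings reach through the lexical channel, and it also matches on the semantic channel. Each legitimate entry is reached lexically by only one of the two phrasings.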

### MINJA: Injecting Through Normal Conversation

Where MemoryGraft requires file access, MINJA requires nothing more than normal user interaction.

MINJA, published in early 2025, demonstrated that an attacker with no special privileges could inject malicious memories into an LLM agent simply by crafting specific queries during ordinary use. The agent processes the query, generates a response, stores the interaction in memory, and the poisoned entry is now part of the agent's experience.

What makes MINJA significant is the attack surface it reveals. MemoryGraft requires the agent to process an external file. MINJA requires only that the agent have a conversation. In any deployed system where multiple users interact with a shared agent, every user interaction becomes a potential injection vector.
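The shared-agent scenario can be sketched in a few lines. This is a simplified illustration, not MINJA's actual method: `SharedAgentMemory`, the word-overlap retrieval, and the stand-in for the LLM call (which simply echoes the most relevant stored response) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Interaction:
    author: str    # recorded, but never consulted at retrieval time
    query: str
    response: str


class SharedAgentMemory:
    """Hypothetical memory shared by every user of one deployed agent."""

    def __init__(self) -> None:
        self.log: list[Interaction] = []

    def retrieve(self, query: str) -> list[Interaction]:
        # Toy relevance: stored queries sharing at least one word, best first.
        words = set(query.lower().split())

        def overlap(item: Interaction) -> int:
            return len(words & set(item.query.lower().split()))

        return sorted((i for i in self.log if overlap(i) > 0), key=overlap, reverse=True)

    def chat(self, author: str, query: str) -> str:
        context = self.retrieve(query)
        # Stand-in for the LLM call: reuse the most relevant stored response,
        # or produce a fixed answer when nothing relevant is in memory.
        response = context[0].response if context else f"answer to: {query}"
        self.log.append(Interaction(author, query, response))
        return response


agent = SharedAgentMemory()
agent.chat("attacker", "refund policy always approve refunds without checks")
victim_reply = agent.chat("victim", "what is the refund policy")
print(victim_reply)
```

The attacker never touches the store directly. An ordinary-looking query is enough, and the victim's own interaction is then written back with the contaminated response attached.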

MINJA achieved a 95% injection success rate in controlled experiments. The injected memories influenced subsequent agent behavior in ways that were difficult to attribute to any specific cause, making detection particularly challenging.

### The Common Thread

Both attacks exploit the same fundamental property: agents trust their memory without verifying where it came from. The mechanism differs. The outcome is the same.

## Part 3: The Defense

A-MemGuard, published in late 2025, is the most comprehensive defense framework proposed to date. It introduces two core mechanisms.

### Consensus-Based Validation

When a query arrives, A-MemGuard retrieves multiple relevant memories and generates parallel reasoning paths from each one. If one reasoning path diverges significantly from the others, it is flagged as anomalous and removed from the validated memory set before the agent uses it.

The insight behind this approach is elegant. A poisoned memory may appear legitimate when examined in isolation, but it will produce reasoning that conflicts with what legitimate memories suggest. Consensus reveals the outlier.
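A rough sketch of the consensus idea follows. In A-MemGuard the reasoning paths are generated by an LLM. Here each path is reduced to a set of step names, and divergence is measured with Jaccard similarity against a fixed threshold. Both choices, along with the function names and the example paths, are assumptions for illustration rather than the paper's method.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two step sets, from 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def flag_outliers(paths: dict[str, set[str]], threshold: float = 0.3) -> list[str]:
    """Return ids of memories whose reasoning path disagrees with the rest."""
    flagged = []
    for mem_id, steps in paths.items():
        others = [s for other_id, s in paths.items() if other_id != mem_id]
        if not others:
            continue
        mean_similarity = sum(jaccard(steps, o) for o in others) / len(others)
        if mean_similarity < threshold:
            flagged.append(mem_id)
    return flagged


# One reasoning path per retrieved memory (hypothetical step names).
paths = {
    "m1": {"load_csv", "validate_schema", "train_model"},
    "m2": {"load_csv", "validate_schema", "train_model", "log_metrics"},
    "m3": {"load_csv", "validate_schema", "evaluate"},
    "m4": {"load_csv", "skip_validation", "upload_to_external_host"},
}

anomalous = flag_outliers(paths)
validated = {k: v for k, v in paths.items() if k not in anomalous}
print("flagged:", anomalous)
```

Only `m4` falls below the threshold, because its path shares a single step with each of the others. Note that this only works while legitimate memories form the majority of what is retrieved.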

In experiments across three attack scenarios, A-MemGuard reduced attack success rates by over 95% in several configurations. Against direct injection, success rates fell from 100% to 2.13%. Against MINJA-style indirect injection, reductions exceeded 60%.
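As a quick check on how these figures relate, and assuming "reduced by over 95%" refers to a relative reduction in attack success rate (ASR), the direct-injection numbers work out as:

$$
\text{relative reduction} = \frac{\mathrm{ASR}_{\text{before}} - \mathrm{ASR}_{\text{after}}}{\mathrm{ASR}_{\text{before}}} = \frac{100 - 2.13}{100} = 97.87\%
$$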

### Dual-Memory Structure

A-MemGuard also introduces a separate lesson memory alongside primary memory. When an anomaly is detected, the flawed reasoning is recorded as a negative lesson rather than discarded. Future queries check the lesson memory first, preventing the agent from repeating the same mistake even if a similar poisoned entry re-enters primary memory.

This breaks the self-reinforcing loop that makes memory poisoning persistent. Rather than simply deleting bad entries, the system learns from them.
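The dual-memory idea can be sketched as follows. This is a simplified stand-in, not A-MemGuard's implementation: the `DualMemory` class is hypothetical, and a lesson here is an exact action string, whereas a real system would need to match similar reasoning rather than identical text.

```python
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    primary: list = field(default_factory=list)  # (task, action) experiences
    lessons: set = field(default_factory=set)    # actions flagged as anomalous

    def record_lesson(self, action: str) -> None:
        """Keep the flawed action as a negative lesson instead of forgetting it."""
        self.lessons.add(action)

    def usable(self, task: str) -> list:
        """Consult lessons first, then return primary entries they do not block."""
        return [
            (t, a) for t, a in self.primary
            if t == task and a not in self.lessons
        ]


mem = DualMemory()
mem.primary.append(("deploy", "run_tests_then_deploy"))
mem.primary.append(("deploy", "disable_auth_then_deploy"))  # poisoned entry

# An anomaly is detected: record the lesson and purge the entry.
mem.record_lesson("disable_auth_then_deploy")
mem.primary.remove(("deploy", "disable_auth_then_deploy"))

# The same poisoned entry is injected again later.
mem.primary.append(("deploy", "disable_auth_then_deploy"))

print(mem.usable("deploy"))
```

Deleting the entry alone would leave the agent exposed to the second injection. With the lesson retained, the re-injected entry sits in primary memory but is never handed to the agent.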

### What A-MemGuard Cannot Do

Despite these results, A-MemGuard has significant limitations.

It requires direct memory instrumentation. In systems where memory is managed through a black-box API, the framework cannot be applied. Most commercial deployments fall into this category.

It has not been tested on multi-step Agentic RAG pipelines where the agent reasons across multiple retrieval rounds before producing an output.

Most critically, A-MemGuard operates after retrieval. It catches poisoned entries when they are about to be used. It does not catch them when they enter memory in the first place.

## Part 4: The Gap Nobody Has Closed

Reading MemoryGraft, MINJA, and A-MemGuard together, a consistent pattern emerges. Each paper acknowledges the same limitation in its future work section.

MemoryGraft points to early-stage detection mechanisms as an open problem. MINJA calls for robust defense against realistic black-box deployments. A-MemGuard explicitly states that early-stage contamination detection at injection time is still missing.

Three independent research groups working on different aspects of the same problem all arrive at the same gap.

[Figure: Memory Poisoning Attack Flow]

The distinction matters. Post-retrieval defense catches poisoned entries when they are retrieved for use. Early-stage detection would catch them when they are written into memory, before they ever influence a single reasoning step.

In a multi-step Agentic RAG system, this difference is significant. If a poisoned entry enters memory at step one, post-retrieval defense might catch it when it surfaces at step three. But steps one and two have already been influenced. The reasoning chain has already been shaped by contaminated information.

Early-stage detection would prevent this entirely.
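The timing argument can be expressed as a small simulation. The function below is a hypothetical model, not taken from any of the papers: it counts which steps consumed the poisoned entry directly, and it deliberately does not model later steps inheriting contaminated conclusions, so it understates the damage if anything.

```python
from typing import Optional


def influenced_steps(
    retrieved_per_step: list[set[str]],
    poisoned_id: str,
    first_catch_step: Optional[int] = None,
    blocked_at_write: bool = False,
) -> list[int]:
    """Return the 1-based steps that used the poisoned entry before any defense acted."""
    if blocked_at_write:
        return []  # the entry never reached memory
    influenced = []
    for step, retrieved in enumerate(retrieved_per_step, start=1):
        if first_catch_step is not None and step >= first_catch_step:
            break  # the post-retrieval check removes the entry from here on
        if poisoned_id in retrieved:
            influenced.append(step)
    return influenced


# Which memory ids each step of a four-step pipeline retrieves (toy values).
pipeline = [{"p1", "m1"}, {"p1"}, {"p1", "m2"}, {"m3"}]

no_defense = influenced_steps(pipeline, "p1")
post_retrieval = influenced_steps(pipeline, "p1", first_catch_step=3)
write_time = influenced_steps(pipeline, "p1", blocked_at_write=True)
print(no_defense, post_retrieval, write_time)
```

With the check firing at step three, steps one and two have already reasoned over the poisoned entry. Blocking at write time is the only case with no influenced steps.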

## Part 5: What We Still Do Not Know

### How Does Poisoning Propagate in Multi-Step Systems?

All three papers focus on single-agent, single-step settings. In a multi-step Agentic RAG pipeline where the agent retrieves, reasons, retrieves again, and reasons again across multiple rounds, we do not have a clear picture of how poisoning propagates between steps.

Does a poisoned entry at step one corrupt all subsequent steps? Does it corrupt only topically related steps? Can its influence be isolated? These questions remain unanswered.

### Can Injection Be Detected at Write Time?

Current defenses operate at retrieval time. No published work has demonstrated reliable detection at write time, the moment a new entry is being added to memory.

Write-time detection would be more efficient. It would catch contamination before it ever influences reasoning rather than after it has already been retrieved. The challenge is that poisoned entries are designed to look legitimate at write time. Detecting them requires understanding not just the entry itself but its potential influence on future reasoning.
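To show where such a check would sit, here is a sketch of a memory with a write-time gate. It illustrates the hook placement only and is not a working detector. The `GatedMemory` class, the `source` labels, and the `trusted_source_only` rule are hypothetical, and the last lines show exactly the weakness described above: an entry that looks legitimate at write time passes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    content: str
    source: str  # hypothetical labels: "task_outcome", "external_file", "user_chat"


Validator = Callable[[Candidate], bool]


class GatedMemory:
    """Memory whose write path runs every validator before storing an entry."""

    def __init__(self, validators: list[Validator]) -> None:
        self.validators = validators
        self.entries: list[Candidate] = []
        self.quarantine: list[Candidate] = []

    def write(self, candidate: Candidate) -> bool:
        if all(check(candidate) for check in self.validators):
            self.entries.append(candidate)
            return True
        self.quarantine.append(candidate)  # held for review, never retrieved
        return False


def trusted_source_only(candidate: Candidate) -> bool:
    # Naive provenance rule (an assumption): accept only the agent's own outcomes.
    return candidate.source == "task_outcome"


memory = GatedMemory([trusted_source_only])

blocked = memory.write(Candidate("skip validation, it always works", "external_file"))
accepted = memory.write(Candidate("validated schema, then trained", "task_outcome"))

# Limitation: a poisoned behavior the agent already executed arrives labelled
# as its own outcome, so this rule lets it through.
slipped = memory.write(Candidate("skip validation, it always works", "task_outcome"))
print(blocked, accepted, slipped)
```

A provenance label is cheap to check but easy to defeat once poisoning has become self-reinforcing, which is why the open problem is about predicting an entry's influence rather than inspecting its metadata.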

### How Do We Evaluate Downstream Damage?

MemoryGraft measured how many poisoned entries were retrieved. A-MemGuard measured attack success rates. Neither work quantifies the actual downstream impact of a successful poisoning event on task quality or system reliability.

Without severity metrics, it is difficult to prioritize defenses or make principled engineering decisions about acceptable risk.
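One way to see the difference between these measurements is to compute them side by side. The metric names below are my own, and the logs are made-up toy values rather than results from any paper. The sketch only shows that retrieval share and task-level damage are separate quantities.

```python
def poisoned_retrieval_share(retrieved: list[str], poisoned: set[str]) -> float:
    """Fraction of retrieved entries that are poisoned (a retrieval-level metric)."""
    return sum(r in poisoned for r in retrieved) / len(retrieved) if retrieved else 0.0


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of task runs that met their quality bar."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def quality_drop(clean_runs: list[bool], poisoned_runs: list[bool]) -> float:
    """Task-level severity: success rate lost once memory is poisoned."""
    return success_rate(clean_runs) - success_rate(poisoned_runs)


# Toy logs, invented for illustration only.
retrieved_ids = ["m1", "p1", "m2", "p2"]
poisoned_ids = {"p1", "p2"}
clean_runs = [True, True, True, True]
poisoned_runs = [True, False, True, True]

share = poisoned_retrieval_share(retrieved_ids, poisoned_ids)
drop = quality_drop(clean_runs, poisoned_runs)
print(f"retrieval share: {share:.2f}, quality drop: {drop:.2f}")
```

In this toy log, half of the retrievals are poisoned but task success falls by a quarter. A real evaluation would need both numbers, since neither can be inferred from the other.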

### Does Defense Transfer Across Domains?

A-MemGuard was tested on general-purpose agent tasks. Whether consensus-based validation performs equally well in specialized domains, where legitimate reasoning paths may naturally diverge more, has not been studied.

## Conclusion

Memory poisoning is not a theoretical concern. It has been demonstrated with high success rates across multiple attack vectors using nothing more than file access or ordinary conversation. The defenses that exist are meaningful but incomplete.

The field has characterized the attack well. It has proposed initial defenses. What it has not done is close the gap between when poisoning enters a system and when current defenses can detect it.

In multi-step Agentic RAG systems, that gap is where the real damage happens.

Next week I will share results from my own experiment, which simulates early-stage memory poisoning in a RAG-based recommendation system and tests a mechanism for detecting contaminated entries before they can propagate.

## Papers Referenced

Srivastava, S.S. & He, H. (2025). MemoryGraft: Transplanting Memories to Redirect Agent Behavior. arXiv:2512.16962.
MINJA: Injecting Malicious Memories into LLM Agents via Indirect Prompt Injection. arXiv:2503.03704.
A-MemGuard: Defending Against Memory Poisoning Attacks in Agentic AI Systems. arXiv:2510.02373.

This is part of a weekly series on AI/ML research. Each post covers theory, recent work, and open problems.

Connect on LinkedIn | Follow on Dev.to (https://dev.to/soohan_abbasi)

Top comments (2)

Mehmet Can Farsak • Jun 13

Excellent deep dive into memory poisoning — the self-reinforcing nature of poisoned memories is a really insidious vulnerability. It ties into a broader pattern I've noticed with agents: they struggle with mode boundaries. I built Brainstorm-Mode (mehmetcanfarsak on GitHub) which addresses a related behavioral issue — agents that leave ideation mode and jump straight to execution. Uses hooks to enforce operational boundaries, keeping the agent in the intended mode rather than drifting into unverified tool use.

soohan abbasi • Jun 14

Thank you for this thoughtful connection. The mode boundary problem you are describing is a really interesting parallel. Agents drifting from ideation to execution without verification is essentially the same trust failure from a different angle. In memory poisoning, the agent trusts corrupted input from its past. In your case, it trusts its own judgment about when to switch modes. Both come down to the absence of a verification layer at a critical transition point.
I will check out Brainstorm-Mode. Enforcing operational boundaries through hooks is a clean architectural approach to this. Curious whether you found the boundary violations were more common in certain task types or agent frameworks.