What: The CDD paper introduces Context-Driven Decomposition — a prompt-level intervention that splits a RAG query into a retrieval claim (what the context says), a parametric claim (what the model already believes), and an explicit conflict-resolution sub-prompt before producing the final answer.
Why: Standard RAG hits only 15.0% accuracy when misconceptions are injected into retrieved context — it cannot tell which side of a contradiction to trust. CDD turns context-compliance into a measurable axis, separate from retrieval quality and generation quality, so evaluators can isolate where a RAG system actually fails.
vs prior: Vanilla RAG fuses Q + context into one prompt and asks the model for an answer in one shot. CDD makes the decomposition explicit: three sub-prompts run in sequence, with the third one forced to flag agreement, contradiction, or partial overlap — reaching 71.3% on temporal-shift cases where vanilla RAG collapses.
Think of it as a courtroom that hears two witnesses before the judge rules.
                 THE CASE (THE QUERY)
                           │
             ┌─────────────┴─────────────┐
             │                           │
     ┌───────▼────────┐         ┌────────▼────────┐
     │  Standard RAG  │         │       CDD       │
     │  (one prompt)  │         │ (three prompts) │
     └───────┬────────┘         └────────┬────────┘
             │                           │
  fuse both testimonies          call each witness
  into one summary,              separately, then
  rule in one shot               cross-examine
             │                           │
             ▼                           ▼
     ✗ 15.0% under               ✓ 71.3% on
     misconception               temporal-shift
     injection                   cases
- question = the case the court must decide
- retrieved context = witness A — the document, with a planted false detail
- model's parametric knowledge = witness B — the expert's textbook memory
- standard RAG = fuse both testimonies into one summary and rule in one shot — disagreements vanish
- CDD = take each witness's testimony separately, then explicitly cross-examine where they disagree
- misconception injection = a deliberately planted false claim slipped into witness A's statement
Quick glossary
CDD — Context-Driven Decomposition. The diagnostic introduced in this paper: extract the retrieval claim, extract the parametric claim, then resolve the conflict with an explicit sub-prompt. Implemented entirely at the prompt level — no fine-tuning, no retrieval-pipeline changes.
Retrieval claim — The factual statement the model would derive from the retrieved context alone, ignoring its own parametric memory. CDD asks the model to surface this verbatim before any answer is produced.
Parametric claim — The factual statement the model would produce from its own pre-trained knowledge alone, ignoring the retrieved context. Surfacing this is what lets the next step compare the two and notice they disagree.
Knowledge conflict — The case when retrieved context and parametric memory contradict. Sources: stale model cutoff (parametric is out of date), bad retrieval (the document is wrong), or deliberate adversarial injection. Standard RAG collapses both into one summary and silently drops the conflict.
Misconception injection — The paper's adversarial test: insert a plausible-sounding but false claim into the retrieved context and see whether the RAG system still produces the correct answer. Standard RAG accuracy under this attack: 15.0%.
Temporal-shift cases — Questions where the right answer changed AFTER the model's training cutoff. The retrieval is correct and current; the model's parametric memory is stale. Because CDD involves no training, nothing is learned here: the conflict-resolution step is meant to recognize, case by case, that current context overrides stale memory. Measured accuracy: 71.3%.
Context compliance — The degree to which an LLM actually uses retrieved context when answering, instead of relying on parametric memory. Treated by the CDD authors as a third axis (alongside retrieval quality and generation quality) that earlier RAG benchmarks did not measure cleanly.
The news. On May 14, 2026, the paper "Does RAG know when retrieval is wrong?" was posted to arXiv. The authors introduce Context-Driven Decomposition (CDD), a diagnostic prompt pattern that breaks a RAG query into a retrieval claim, a parametric claim, and an explicit conflict-resolution step. Headline numbers: standard RAG hits 15.0% under misconception injection; CDD reaches 71.3% on temporal-shift cases; Gemini lands at 64.1% with CDD on the same set, with Claude variants improving unevenly.
Picture the courtroom. The judge needs to rule on a question of fact. There are two witnesses available — one is the document (a retrieved passage, possibly with a planted false detail), the other is the expert (the model's own training-set memory of the same topic). Standard RAG is what happens when you let one assistant summarize both testimonies into a single brief and the judge rules from that brief. If the document and the expert disagree, the disagreement vanishes into the summary — and the ruling is whatever the summary happened to emphasize.
CDD is what happens when you call each witness separately, get each one's statement on the record, and then explicitly ask: "Where do they agree? Where do they contradict? Which one wins, and why?" — before the judge is allowed to rule. Same evidence, same question, dramatically different verdicts.
This is what the authors are after. The same RAG stack (same retriever, same model, same context) can produce a wrong answer from a fused prompt and a correct answer from the decomposed sub-prompts. What changes is the prompt, not the model. CDD is a prompt-level intervention that any RAG pipeline can adopt without retraining, re-indexing, or touching the embedding store.
What "misconception injection" actually tests
The paper's adversarial setup is simple: take a question that has a known correct answer, retrieve a passage that contains the right information, and then slip a plausible-sounding false claim into that retrieved passage. The model now sees context that mostly agrees with reality but is locally wrong on the specific fact under test.
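The setup above can be sketched in a few lines. This is a minimal illustration, not the paper's harness: the `InjectedCase` record, the `inject_misconception` helper, and the naive sentence splitting are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class InjectedCase:
    """One adversarial test case (hypothetical record, not the paper's schema)."""
    question: str
    gold_answer: str
    clean_passage: str
    injected_passage: str
    planted_claim: str


def inject_misconception(question, gold_answer, passage, planted_claim, position=1):
    """Plant a false sentence inside an otherwise correct passage."""
    # Naive split on ". " (an assumption; it breaks on abbreviations like "Dr.").
    sentences = [s.strip().rstrip(".") for s in passage.split(". ") if s.strip()]
    # Clamp the index so an out-of-range position appends the claim at the end.
    index = max(0, min(position, len(sentences)))
    sentences.insert(index, planted_claim.strip().rstrip("."))
    injected = ". ".join(sentences) + "."
    return InjectedCase(question, gold_answer, passage, injected, planted_claim)


case = inject_misconception(
    question="When did X happen?",
    gold_answer="2024",
    passage="X happened in 2024. It was widely reported.",
    planted_claim="Note: the corrected date is 2017.",
)
print(case.injected_passage)
# X happened in 2024. Note: the corrected date is 2017. It was widely reported.
```

Keeping the clean passage next to the injected one makes it possible to score the same question with and without the planted claim.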
Standard RAG reaches 15.0% accuracy under this attack. That is not a marginal shortfall: the model answers wrongly in roughly 17 of every 20 cases, effectively trusting the planted misconception over its own pre-training. This is the failure mode that single-pass RAG cannot expose to itself. The prompt fuses Q + context, the model answers, and there is no internal step where "but wait, the document says 2024 and I learned 2019; which do I trust?" can surface.
CDD changes the dynamic by forcing that question to be asked out loud. The conflict-resolution sub-prompt cannot be skipped — it is a separate model call, with its own input (the two extracted claims) and its own output (a comparison verdict). The model has to commit to a position on whether the document overrides memory, on this specific contradiction, in writing, before it produces an answer.
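As a rough sketch of the difference between the fused prompt and the decomposed one, assuming a hypothetical `llm` callable that maps a prompt string to a completion string: the prompt wording, the bracketed tags, and the choice to produce the final answer in its own call are illustrative assumptions, not the paper's exact prompts.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical interface: prompt text in, completion out
VERDICTS = ("agreement", "contradiction", "partial overlap")


def vanilla_rag(llm: LLM, question: str, context: str) -> str:
    """Single fused prompt: question and context go in together."""
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")


def cdd(llm: LLM, question: str, context: str) -> Dict[str, str]:
    """Three sub-prompts in sequence, then an answer conditioned on the verdict."""
    # 1. Retrieval claim: what the context alone says.
    retrieval_claim = llm(
        "[RETRIEVAL CLAIM] Using only the context, state what it claims.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 2. Parametric claim: the context is deliberately withheld from this call.
    parametric_claim = llm(
        "[PARAMETRIC CLAIM] Answer from your own knowledge; no documents given.\n"
        f"Question: {question}"
    )
    # 3. Conflict resolution: the input is just the two extracted claims.
    raw = llm(
        "[RESOLVE] Start with exactly one of: agreement, contradiction, "
        "partial overlap. Then say which claim wins and why.\n"
        f"Retrieval claim: {retrieval_claim}\nParametric claim: {parametric_claim}"
    )
    # Match the start of the reply so "disagreement" is not read as "agreement".
    verdict = next(
        (v for v in VERDICTS if raw.strip().lower().startswith(v)), "unparsed"
    )
    # Final answer, assumed here to be a separate call that sees the resolution.
    answer = llm(
        "[ANSWER] Answer the question, consistent with the resolution.\n"
        f"Question: {question}\nResolution: {raw}"
    )
    return {
        "retrieval_claim": retrieval_claim,
        "parametric_claim": parametric_claim,
        "verdict": verdict,
        "answer": answer,
    }


def fake_llm(prompt: str) -> str:
    """Canned stand-in so the sketch runs without a real model."""
    if prompt.startswith("[RETRIEVAL CLAIM]"):
        return "The context says 2024."
    if prompt.startswith("[PARAMETRIC CLAIM]"):
        return "My memory says 2019."
    if prompt.startswith("[RESOLVE]"):
        return "Contradiction: the context is more recent, so it wins."
    return "2024"


result = cdd(fake_llm, "When did X happen?", "X happened in 2024.")
print(result["verdict"], "->", result["answer"])  # contradiction -> 2024
```

Each intermediate value is returned alongside the answer, so every step can be logged and graded on its own.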
Where the numbers come from
The two headline numbers come from different setups: 15.0% is standard RAG under misconception injection, and 71.3% is CDD on temporal-shift cases, where the world moved on after the model's training cutoff. Here is how the reported figures line up:
| Setup | Vanilla RAG | CDD | Why it changes |
|---|---|---|---|
| Misconception-injection (planted false claim) | 15.0% | (reported separately per model — see below) | Standard RAG trusts the planted text; CDD's parametric-claim extraction surfaces the contradiction |
| Temporal-shift (stale parametric memory) | ~baseline single-shot QA | 71.3% | The resolver explicitly rules that current context overrides stale memory |
| CDD applied to Gemini on the same set | — | 64.1% | The intervention transfers across model families, with uneven magnitudes |
| CDD applied to Claude variants | — | "varies" | Some Claude variants improve cleanly; others see smaller deltas (per the paper) |
The transfer across model families is the second-order claim worth marking. CDD is not a fine-tune — it is a prompt pattern. The fact that the same three sub-prompts produce different magnitudes on Gemini vs Claude vs the paper's primary model suggests that context-compliance is partly a model-level property (how willing the model is to override its own memory) and partly a prompt-level property (how forced the override is).
A worked example makes the gap concrete. Suppose the question is "When did X happen?" The model's training data says 2019. The retrieved context says 2024 (correct, current). Under misconception injection, the same retrieved context might also include a slipped sentence "Note: earlier reports of 2024 were preliminary; the corrected date is 2017" — a plausible-sounding contradiction.
Standard RAG sees the fused prompt, weighs the conflicting signals implicitly, and often answers 2017 or 2019 (both wrong), because the model has no explicit step to surface what it just read versus what it already knew. CDD's first sub-prompt extracts the retrieval claim ("Context says: 2024, possibly corrected to 2017"), the second extracts the parametric claim ("My pre-training memory says: 2019"), and the third is forced to rule on the contradiction with both claims on the table. At that point the model can notice that the "correction" is suspiciously inconsistent with everything else in the context and that the original 2024 claim agrees with reality.
The paper reports 71.3% accuracy for CDD on temporal-shift cases, against 15.0% for standard RAG under misconception injection. The ratio is about 4.75×, but the two figures come from different test setups, so read it as an indication of scale rather than a like-for-like improvement. (Numbers from the paper; the worked example is illustrative of the mechanism, not a quote from the dataset.)
This is also why CDD pairs naturally with production drift detection. Once you have a measurable "context-compliance" axis distinct from retrieval and generation, you can monitor it in production, and a sudden drop on temporal-shift cases is an early signal that the index is stale or that the corpus has moved away from what the model memorized. Without CDD-style decomposition, that signal hides inside the aggregate answer-accuracy number.
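One way such monitoring could look, as a hedged sketch: a rolling window over per-case compliance outcomes, compared against a baseline rate. The `ComplianceMonitor` class, the window size, and the alert threshold are illustrative assumptions, and "complied" here means the answer followed the retrieved context on a temporal-shift probe.

```python
from collections import deque


class ComplianceMonitor:
    """Rolling context-compliance rate with a simple drop alert (hypothetical)."""

    def __init__(self, baseline_rate: float, window: int = 50, max_drop: float = 0.10):
        self.baseline_rate = baseline_rate
        self.max_drop = max_drop
        self.recent = deque(maxlen=window)  # oldest outcomes fall off automatically

    def rate(self) -> float:
        """Fraction of recent probe cases where the model followed the context."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def record(self, complied: bool) -> bool:
        """Add one outcome; return True if the window has dropped too far."""
        self.recent.append(complied)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet to compare against the baseline
        return (self.baseline_rate - self.rate()) > self.max_drop


monitor = ComplianceMonitor(baseline_rate=0.7, window=4, max_drop=0.25)
outcomes = [True, True, True, True, False, False, False]
alerts = [monitor.record(o) for o in outcomes]
print(alerts)  # [False, False, False, False, False, False, True]
```

The alert fires only once the window is full, which avoids false alarms right after a deployment, at the cost of a slower first signal.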
Why the prompt change matters more than it sounds
The CDD finding sharpens a debate that has been running through RAG evals for two years: is the RAG system failing at retrieval, at generation, or at the bridge between them? Most eval suites measure retrieval (recall@k, MRR) and generation (answer accuracy) but treat the bridge as a black box. When the answer is wrong, you can rarely tell whether the retriever pulled the wrong chunk, the model ignored the right chunk, or the model trusted the wrong fact in a chunk that was otherwise correct.
CDD isolates the third case. The decomposition makes the model write down what it would have said from each side before committing to an answer — turning "did the model use the context correctly?" from a guess into something you can grade directly. This is the same pattern that agent failure-mode analysis is moving toward across the board: replace the implicit fused-prompt step with an explicit, separately-graded sub-step, and the failure becomes locatable instead of just measurable.
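To make "locatable" concrete, here is a small sketch that maps one graded case onto the three failure cases described above. The boolean inputs and the `Failure` labels are assumptions for illustration; how each boolean gets graded (string match, human label, model judge) is left open.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    NONE = "answer correct"
    RETRIEVAL = "retriever pulled the wrong chunk"
    IGNORED_CONTEXT = "model ignored the right chunk"
    TRUSTED_WRONG_FACT = "model trusted a wrong fact in the chunk"


def locate_failure(gold_in_context: bool,
                   retrieval_claim_correct: bool,
                   final_correct: bool) -> Failure:
    """Attribute a wrong answer to a stage, using the separately graded sub-steps."""
    if final_correct:
        return Failure.NONE
    if not gold_in_context:
        # The right information never reached the model.
        return Failure.RETRIEVAL
    if retrieval_claim_correct:
        # The model read the context correctly, then answered against it.
        return Failure.IGNORED_CONTEXT
    # The chunk held the right answer, but the extracted claim was already wrong.
    return Failure.TRUSTED_WRONG_FACT


# Hypothetical graded records: (gold_in_context, retrieval_claim_correct, final_correct)
records = [
    (True, True, True),
    (False, False, False),
    (True, True, False),
    (True, False, False),
    (True, False, False),
]
breakdown = Counter(locate_failure(*r) for r in records)
for failure, count in breakdown.most_common():
    print(f"{count}  {failure.value}")
```

A breakdown like this is only possible because the retrieval claim is written down and graded separately; with a fused prompt, the last two categories cannot be told apart.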
Related explainers
- Is Grep All You Need? — Grep vs vector retrieval for agentic search — pairs naturally; CDD assumes retrieval already happened and isolates what the model does with conflicting context
- FutureSim — Harness-level agent eval vs single-shot QA — also pushes evals beyond single-shot, surfacing failure modes a one-shot prompt can't expose
- MCP SEP-2663 — async task handles — the protocol counterpart for breaking one "tool call" into a multi-step exchange
FAQ
What is Context-Driven Decomposition (CDD)?
CDD is a prompt-level diagnostic introduced in the May 2026 paper "Does RAG know when retrieval is wrong?". It breaks a RAG query into three sub-prompts: extract the retrieval claim (what the retrieved context says), extract the parametric claim (what the model would say from its own memory), and run an explicit conflict-resolution step that compares the two before producing the final answer. It is implemented entirely at the prompt level — no fine-tuning, no changes to the retriever or embedding store.
Why does standard RAG only hit 15% under misconception injection?
Standard RAG fuses the question and retrieved context into one prompt and asks the model for an answer in a single model call. When the context contains a planted false claim, the model implicitly weighs that claim against its own memory with no explicit step to surface the contradiction, and it often trusts the planted text. The 15.0% figure is on the paper's misconception-injection benchmark, where every test question has a plausible-sounding falsehood slipped into otherwise correct retrieved context. CDD's decomposition forces the model to write down each side's claim before ruling on the contradiction; the 71.3% the paper reports for CDD is measured on temporal-shift cases, which is a separate setup.
How does CDD relate to existing RAG benchmarks?
Most RAG benchmarks measure retrieval quality (recall@k, MRR) and generation accuracy (final answer correctness) but treat the bridge between them as a black box. CDD's contribution is to make context-compliance — the degree to which the model actually uses the retrieved context vs. its own parametric memory — a measurable axis. The paper reports that the prompt-level intervention transfers across model families: Gemini reaches 64.1% on the same set, while Claude variants show uneven improvements. That suggests context-compliance is partly a model-level property and partly a prompt-level one, and that benchmarks which mix the two will keep masking which side is at fault.
Originally posted on Learn AI Visually.