DEV Community: Stacey Schneider

Study: stale documents are RAG poisoning without the attacker

Stacey Schneider — Fri, 26 Jun 2026 15:29:37 +0000

RAG poisoning gets attention as a security problem — an attacker injects a bad fact into the retrieval index, the pipeline serves it confidently, the model answers from it.

Poisoning is the adversarial version of a problem every RAG system already has in production: stale retrieval. A document gets updated. The old version stays in the index. The pipeline retrieves it. The model answers correctly from wrong information.

While this may not be adversarial, it is ever-present and a far more inevitable threat.

Two architectures for private AI knowledge

Most teams building private AI — giving a model access to internal documents, policies, or product knowledge — end up on one of two architectures.

The first is vector RAG. Documents are chunked, embedded into a vector store, and retrieved via approximate nearest-neighbor search. The model gets whatever chunks scored highest against the query. This is the dominant approach and what most hosted "private GPT" products use under the hood.

The second is wiki-style context. Documents live as versioned markdown files with structured metadata. Retrieval is handled by a deterministic selector — you specify what the agent is allowed to read based on document identity, version, and eligibility. No embeddings, no approximate search. ContextNest uses this approach.

Both architectures solve the same problem: getting private knowledge into a model's context window. They diverge sharply on how they handle updates, access control, and auditability — which is exactly what this experiment measures.

In conjunction with Emory University and IBM Research, my company conducted an experiment to see just how often and certain of a threat this is, proving our thesis that governed context is the effective way to solve this pressing problem.

The setup

We created a medium sized 1,060-document vault. Then we tested against three retrieval configurations:

Dense vector retrieval with HNSW (production-configured)
BM25 sparse retrieval
A governed selector that gates retrieval on document eligibility — superseded versions are ineligible by definition, not filtered after the fact

For the stale-version test: updated a set of documents, marked old versions superseded, left them in the index (standard behavior — most pipelines don't delete), then ran 30 queries and measured answer quality pass rate.

This is structurally the same as a poisoning scenario: a bad document is present in the index, retrieval has no mechanism to exclude it, and predictably, the model runs afoul of the deprecated documents.

What we found

Retrieval method	Pass rate	Avg tokens/query
Governed selector (ctx)	97%	~215
BM25	90–93%	655–725
Dense + HNSW	below BM25	similar

BM25 outperforms dense under stale conditions because term matching is less susceptible to approximate similarity between a superseded document and its replacement. The governed selector reaches 97% because the superseded document isn't eligible for retrieval at all — the eligibility check runs before similarity scoring, not after.

That's also why it's a better poisoning defense. Filtering retrieved results after the fact means the poisoned document still reached the ranking step. Eligibility gating means it never did.

Dense retrieval had an additional finding: non-deterministic on 80% of queries (worst-case Jaccard 0.21). Under a poisoning scenario, that variance becomes an attack surface — you can't predict or audit which documents actually reached the model.

One gap to flag: we didn't benchmark a hybrid retrieval baseline (BM25 + dense combined), which is common in production. The paper isolates eligibility and determinism specifically, not general retrieval quality.

What eligibility gating looks like in practice

The unit of governance is the document version. Every document in ContextNest carries frontmatter:

---
title: API Rate Limiting Policy
type: document
status: published
version: 3
created_at: '2026-05-01T09:00:00.000Z'
updated_at: '2026-06-15T14:22:11.000Z'
checksum: 'sha256:9f4a2e...'
---

Rate limiting is enforced at 1,000 requests per minute per API key...

Version history lives in a separate history.yaml — one entry per version, hash-chained:

keyframe_interval: 10
versions:
  - version: 1
    keyframe: true
    edited_by: misha@promptowl.ai
    edited_at: '2026-05-01T09:00:00.000Z'
    published_at: '2026-05-01T09:00:00.000Z'
    content_hash: sha256:a3f2c1...
    chain_hash: sha256:b7c144...
  - version: 2
    edited_by: stacey@promptowl.ai
    edited_at: '2026-05-20T11:15:00.000Z'
    published_at: '2026-05-20T11:15:00.000Z'
    content_hash: sha256:e4d981...
    chain_hash: sha256:f8a203...
  - version: 3
    edited_by: misha@promptowl.ai
    edited_at: '2026-06-15T14:22:11.000Z'
    published_at: '2026-06-15T14:22:11.000Z'
    content_hash: sha256:9f4a2e...
    chain_hash: sha256:2c8b77...

The chain hash is computed as:

chain_hash[n] = SHA-256(chain_hash[n-1] : content_hash[n] : version[n] : author[n] : timestamp[n])

Retrieval serves the current version only — version 3 in this case. Versions 1 and 2 stay in history for audit and rollback, but agents never see them. There is no status: superseded flag to forget to set — old versions simply aren't the head.

This also closes the poisoning vector. A tampered or injected document that isn't the current approved version doesn't reach the retrieval surface. Any break in the chain hash is detectable — a silently edited version, a deleted entry, a backdated change.

Stewardship adds the approval layer. In governed mode, edits save as drafts — invisible to agents until a Reviewer steward approves them. Just like in GitHub CI/CD pipelines, authors can't approve their own work (enforced server-side). Stewards are scoped by document, tag, or nest, so a #compliance tag can route that document class to the legal team without touching anything else. Again, this process shields against poisoning responses by gating access control before retrieval, not after.

Running it yourself

ContextNest is an open source project that can run seamlessly behind tools you already use — like Claude, Cursor, or Antigravity. In fact, it can connect them with a unified context, finally pulling all your work together, and lessening the aggravation of running out of tokens for the next 5 hours.

The community edition is free and self-hosted. You'll need a free PromptOwl account to generate a license key — go to Overview → Community License → Create a Community License key. It starts with pk_ and takes about 30 seconds.

Once you have a key, spin up the server:

npx @promptowl/contextnest-community

That starts the server at http://localhost:3838. On first boot it drops you into a setup page where you paste the key. After that it's live — no restart required. To scaffold a new vault from scratch first:

npx @promptowl/contextnest-cli init

Claude Code and Cursor pick up the MCP server automatically. Plain markdown files, no hosted database, no embeddings service, no additional API keys.

The community edition source is on GitHub at github.com/PromptOwl/ContextNest.

Full methodology, tables, and reproducible experiment code: promptowl.ai/resources/verifiable-context-governance/

Disclosure: I work at PromptOwl. Joint research with Benn Konsynski (Emory/Goizueta) and Gabe Goodhart (AI Open Innovation).

Orchestration Is Officially A Commodity. Context and Governance Are the Moat.

Stacey Schneider — Thu, 30 Apr 2026 14:51:26 +0000

The industry is sending flares galore that orchestration is a commodity. Symphony dropped last week. An orchestration spec for Codex, handed to the community, with a note that OpenAI has no plans to keep it as a standalone product. You open-source the pipes. You don't open-source your moat.

GitHub's ACE is in technical preview. Warp's Oz ships with Claude, Codex, and Gemini on day one. Every major player in Agentic Workflow Orchestration (AWO) is landing on the same primitives at the same time. That's not competition. That's commoditization.

I've been building agentic systems for years at PromptOwl and watched this happen from the inside. A year ago we had a full drag-and-drop orchestration editor—think n8n. In practice it wasn't useful for agentic systems, especially for regular business users, so we buried it: first behind a logic block editor, then behind setup wizards. Orchestration became plumbing because that's what it is.

From my perch though, it is crystal clear that context and governance are the real battlegrounds. That's where the actual competition starts.

Context: The Thing That Makes Agents Useful

Most agents underwhelm. They lie, they forget, and they fall over. All fireable offenses for humans, but agents are still in kindergarten. We keep them around hoping to raise them up to be valuable workers.

Their problem isn't that the models are bad—frontier models are remarkably capable and even the generic models are extremely useful today. Agents keep doing all the wrong things because they don't know your business.

Orchestration platforms treat context like a config file. You paste a system prompt, maybe attach a few documents, and call it done. That's not context. That's a briefing. And briefings go stale.

Real context is institutional and organic in nature. It's how your team makes decisions when two good options conflict. It's what "done" means in your org versus what it means in the ticket description. It's the three approaches that were tried and abandoned before you joined. None of that is in a README. It lives in people's heads and it leaks out slowly over Slack threads and PR comments and post-mortems you forgot about.

Agents operating without that context will be productive in exactly the way an eager new hire with no onboarding is productive—they'll move fast, they'll complete tasks, and they'll occasionally make decisions that anyone who'd been around for six months would have caught immediately.

Winning here isn't about better routing. It's about maintaining a living, curated layer of institutional context that agents can draw on—and that teams actually own, update, and move.

Warning: Context portability is the canary in the lock-in coal mine. If your agent's intelligence lives in the platform's proprietary memory system, you're not building on a tool. You're feeding a database you don't own.

Governance: The Thing That Makes Agents Trustworthy

The second problem is trust—and it's why most enterprises are still in the stands watching the AI wars play out.

Governance at the agent layer isn't a dashboard feature. It's a standards problem. And the standards are starting to arrive.

AAuth—a new specification from Dick Hardt, the architect behind OAuth 2.0 and OpenID Connect—gives agents cryptographic identity. That's the foundation, but what it unlocks is more interesting than the spec itself: standard payloads that document what the agent did, under what authority, with what inputs. Not logs you search after something goes wrong. Structured, signed records of responsibility that travel with every action.

That same chain-of-custody logic needs to extend across the entire stack. Which version of the prompt ran this interaction? Was that version reviewed and approved before it went anywhere near production? What did the agent actually say, and can you prove it? Full traceability of interactions—signed—isn't an audit feature. It's the baseline for any organization that plans to let agents act autonomously on anything consequential. Evals belong here too: you can't trust an agent you've never tested, and test results need to be part of the record, not a spreadsheet someone ran once before launch.

The pattern is consistent regardless of stack or team size: organizations hit the context wall first, then the traceability wall, then realize they were the same wall the whole time. You can't govern what you can't trace, and you can't trace what was never signed.

Context needs the same rigor. Institutional knowledge that agents draw on—your SOPs, your decision frameworks, your accumulated organizational memory—has authors and owners. Who wrote it, who approved it, when was it last verified? Right now most teams treat this as vibes. It needs to be treated as provenance. If an agent makes a bad call because it was working from a policy that was outdated or never formally ratified, that's not a model problem. It's a stewardship problem.

AWO platforms building only for speed are going to hit this wall hard. Identity, traceability, and context stewardship aren't enterprise add-ons. They're the price of admission for anything consequential.

What to Ask Before You Hand an Agent the Keys

Forget asking whether your orchestration platform is multi-model. Warp already is. That question is over. Here's what to ask instead.

One. Where does your context live? Is your institutional knowledge—your prompts, your memory, your agent configuration—stored in a portable format you own? Or is it accumulating inside the platform's proprietary system, getting harder to move with every passing week?

Two. Can you audit what your agents did? Not just "did the task complete"—but what decisions did the agent make, what information did it act on, and who authorized it to do so? If you can't answer that after the fact, you can't run agents on anything that matters.

Three. Do you have approval flows for high-stakes actions? Before an agent merges a PR, sends an external communication, modifies production data, or spends money—is there a human checkpoint? Can you configure where that checkpoint sits without writing custom middleware?

Four. Can you scope agent access? Does the platform let you define what each agent can see and touch? Or is the default "everything the authenticated user can access," which in practice means your agent has more access than most of your employees.

Five. Can you roll back? If an agent makes a bad call—and eventually one will—what does recovery look like? Is it a config change, a button click, or a three-day archaeology project through logs nobody labeled?

How to Evaluate Platforms Right Now

Skip "how many agents can it run in parallel" and "which models does it support." Ask: does this platform make my agents smarter over time as they accumulate context about my business? And does it give me the controls I need to trust agents with decisions that have actual consequences?

A platform that routes tasks quickly and has no answer for context decay or governance is fast plumbing. Impressive at the demo. Underwhelming in production.

The orchestration race is already decided. Every major platform is converging on the same primitives and OpenAI just open-sourced the proof. What happens next is the interesting part—who owns the context layer, who can prove what their agents did and why, who built stewardship into the foundation instead of bolting it on after a compliance scare.

The pipes are done. Now comes the hard part.

Running AI on a Budget: 11 Tactics for Enterprise-Scale Efficiency

Stacey Schneider — Wed, 15 Apr 2026 19:56:58 +0000

At my company, PromptOwl, everyone coworks with AI for 90 to 100% of their work. It's everywhere, in every process, and increasingly connecting everything. Engineering, marketing, sales, leadership—AI is in every workflow, every day. Getting there took us just over a year, and taught us a lot about what that actually costs.

Running AI at that scale boils down to two optimization problems: money and time.

Money shows up on the monthly invoice, and if you are not paying attention it will floor you in costs. AI is expensive to run blindly. We expect to pay $2-300 per developer, per day on the frontier models, but we had to make sure we weren't wasting money on things that didn't matter.

Time is the second problem. It's the hours lost to waiting for generated responses, to rerunning prompts because bad context resulted in wrong outputs, and workflows that require constant manual intervention to function and not fall over.

Optimizing for time and money is an evergreen effort. Too much is evolving, and we will always need to adapt. But these eleven tactics are the foundation of how we run AI at scale.

The Setup

Get this right once. It pays back on every session.

1. Organize your Prime Documents

On our journey to 90%, the first obvious problem was re-explaining core business tenets to the models for every conversation. So, I started writing context files for the models—in markdown, to cut down on the size of the messages sent to the LLMs.

A Prime Document is a structured context file written specifically for AI use—not a deck you'd send to a colleague, but a document the model can actually use. Your brand brief, product spec, customer profile, team norms. Every function that runs AI regularly should have at least one.

Most teams don't have these. They improvise context in the chat window every time, and wonder why the outputs aren't consistent.

2. Write them for AI—shrink your files

I like to think that every character I send to AI lights up a GPU in some data center. And while I love a light show, I visualize a huge energy waste in padding the data. It also can contort the results, as the context window for each LLM is fixed. Your whole conversation has to fit in it. So every unnecessary character, eats away at what the models can do for you.

Uploading a full 60-page brand guide every time may provide detail, but it will cost you processing time, tokens—and potentially the right answer. The model re-reads everything it doesn't need, every single time. The more it has, the more it can confuse.

Your Prime Documents should contain only what's relevant, written in a format the model can parse efficiently—not formatted for a human presenting to a board. Ask it to distill these documents for itself, then start a new chat with it. Smaller, cleaner input means identical or better output quality, faster responses, and a lower cost per call.

3. Use a context management tool

Eventually, you will collect a nest of these machine optimized documents. And then maintaining them becomes the new burden.

We started to centralize the management of our Prime Documents, publishing them on our Google Drive. I became the arbiter of them. But since everything in AI moves so fast, our context needed near constant updating. This new task of resyncing them ended up taking so much of my time, I took on the title Chief Context Officer.

Plus the whole thing was such a drag. Literally. I was constantly dragging documents to every chat, plus reviewing and updating them if they needed. Only to have to do it all over again in a half hour when my conversation ran out of space.

The solution was to build a context management layer that operates over a wiki-style system of markdown files. Accessing it via a CLI meant I could still use Claude and Antigravity the exact same way, but the system actively improved the context by creating tags and links to live data and other docs. Now my chats have the ability to look at everything and pick up where the other left off, with no retraining of the conversation.

This pattern reduces tasks by 2-300% in my workflow, and mitigates the risk of having to start over if something goes sideways. I don't maintain it nearly as often either, as it usually just learns as I work.

This is the same theory that Andrej Karpathy wrote about a couple weeks ago. We published a ContextNest whitepaper on the details of how this works two months ago, too. And bonus! It's also an open-source project: ContextNest. There should be a desktop client any day now too.

4. Set up skills

When customers ask me what they should apply AI to first, my answer is always the same—build what you repeatedly spend the most time on. Sometimes folks do a time journal to determine where this happens. I have a bias for rapid results, so I think this can be conducted as a mental exercise. At least to find the low hanging fruit.

For me, I would spend 4-5 hours every week doing strategy work. I'd pull all the numbers, have AI distill all the standup notes, make sure action items were tracked and ticked off. Then I'd think about what to prioritize this week. Finally, I'd communicate it to the team so everyone is aligned and moving in unison.

At first, I attempted to automate the SOP for doing this with another markdown file. Then I built tools to handle the repeatable research processes, like summarizing the chatter on the engineering channel in Slack. These Skills are reusable, pre-configured AI behaviors you define once and improve through actual usage.

They can also be shared. In our case, that means that everyone who meets with a prospect, could create the same deep dive research brief and proposal based on our status today—what new features are out, what promotion is running, what the customer wanted to meet about in the first place. With very little coordination, we can create the same output.

Without shared skills, every person on your team reinvents the wheel and gets slightly different results. Skills create consistency, reduce token cost on repeated tasks, and make AI output auditable across the org.

The 7 Habits of Highly Successful Prompting

Now that we have the system, these are the habits that govern every session to get the most out of them.

1. Plan first. Build last.

The expensive moment isn't generating the final artifact. It's generating it three times because the spec wasn't clear—and losing 20 minutes each time to recreate it.

Before you ask for the web page, the strategy doc, or the campaign copy, use a few cheap messages to get alignment—figure out your structure, identify the edge cases, and establish naming conventions and expectations.

Thinking is cheap. Building is expensive. Iteration in planning costs almost nothing in tokens and almost always saves you time. Iteration in generation costs both.

2. Run a murderboard on the plan

One of the best things AI does is help you think through things like other people. Other professionals, your customers, or future prospects. This is your opportunity to learn from what they would predictably say. I have a markdown file of various personas that I pull from to review my plans, ruthlessly tear them apart, and find not just the holes, but their recommendations to satisfy them.

I call this group-think exercise a "murderboard," but you can just tell it to run a focus group antagonistically or to act like a customer and complain. As long as you try to use multiple perspectives and explicitly get the model to break its tendency for sycophancy, it will help you find problems before you codify them into production.

3. Tell the AI to ask you questions

A 600-word fully-specified prompt is often the most expensive way to get a mediocre result. With that much detail, the models think they should know everything and usually make some terrible assumptions.

Describe what you need, focusing on the results and how it will be used. Tell the model to ask clarifying questions before it starts. Each exchange costs a fraction of a re-generation. You get better output from a conversation than from a wall of text that leaves the model guessing where your spec was ambiguous.

4. Edit the message. Don't stack on top of it.

You sent a prompt, spotted a typo, realized you left out a constraint. Most people send a correction as a new message.

That's a mistake for two reasons.

Every new message adds to the context window—the model is now reading your original error and your correction simultaneously and trying to reconcile them. It has to think through this each time.

Plus, new messages can take time to ingest—or worse, distract the model from your original question.

Find the edit button. Replace the message. Then the next response doesn't carry your mistake forward, and you don't spend ten minutes untangling an output that went sideways because of a fixable prompt.

5. Turn off what you're not using

Web search has a cost. Extended thinking has a cost. Document connectors have a cost. Most of the time none of them are needed for the task at hand. Frontier models have huge costs.

If you can streamline what you don't need (especially if switching costs are low because you have a ContextNest), you can save a lot of time and tokens.

Enable them in the moment you actually need them—not as default-on settings running in the background of every request.

6. Work in smaller sections

Engineers called this out a long time ago. Models can not handle large codebases well. Using context and focusing on smaller sections at a time means models return results more quickly and have less risk of choking on the task.

The same is true for business efforts. Don't ask for the 5,000-word strategy document in one prompt. Ask for the outline. Then expand each section. Don't ask for the full function—ask for the structure first, then fill in each piece.

Smaller sections mean faster iteration, easier course correction, and lower cost when something needs to be redone. It also gives you natural checkpoints so the output doesn't drift through a long generation you can't easily fix at the end.

7. Match the model to the task

Every call to a frontier model that didn't need to be one is money you didn't have to spend.

As of April 15, the cost to generate 500 words on Opus 4.6 is about 1.67 cents. For that same 1.67 cents:

Sonnet 4.6 gives you ~835 words
Haiku 4.5 gives you ~2,500 words

This means Haiku is 5x cheaper than Opus for output. For content generation, listicles, and drafts—Haiku earns its place. Opus earns its place when nuance, analysis, or voice precision actually matters. Sonnet feels like the safe middle ground, but often the flash models are enough.

Route simple triage, summarization, and single-turn questions to a lightweight model. Save the heavy models for work that actually requires it—the analysis feeding a real decision, the writing carrying your company's voice, the code review that can't afford a miss. The right tool for the job is a standard engineering principle. Apply it here.

The foundation, not the ceiling

These eleven tactics cut both bills—the invoice and the productivity drain. But there is more to come, especially when you think about sharing a living context across a team or an organization.

This is the area I am studying now. I'll be writing (and releasing commercial software) about how workflow and tools are adapting across organizations in future posts. Stay tuned!