In the same short window, OpenAI and Anthropic published several pieces pointing toward the same failure family.
OpenAI framed memory around carrying context forward, following preferences, and staying current as reality changes.
Anthropic's data team described self-service analytics with Claude, and named data staleness as one of three major sources of production errors.
The Claude Code team described dynamic workflows as a way to avoid self-preferential bias — separating generation from verification so an agent cannot judge its own work.
Different domains. Same pressure.
Systems act on information that was valid at one point but may no longer be valid at the moment of consequence.
The consequence ladder
A travel preference goes stale. The agent books the wrong city. Annoying.
An analytics source goes stale. The agent returns a wrong business number. Costly.
An authorization grant goes stale. The agent acts with permissions it no longer has. Unsafe.
Same root. Different blast radius.
OpenAI's article emphasizes the first level. Anthropic's data team is working on the second. The part that has not been made explicit in these pieces is the authority version: stale grants leading to unsafe action.
That is what CLAIM-24 is testing.
What each lab is actually saying
OpenAI on memory: memory gets better when it updates as reality changes. The frame is personalization — preferences, context, continuity. The failure they are solving is stale personal context producing a wrong recommendation.
Anthropic analytics: governed data sources produce accurate answers. Without structured routing to a source of truth, their accuracy on business analytics queries was 21%. With skills pointing at the right governed sources: above 95%. Their provenance footer tells you which source tier answered the question, how fresh the data is, and who owns the model.
Claude Code dynamic workflows: isolated agents with separate context windows catch what a single agent cannot catch about its own output. The failure they are solving is self-preferential bias — the agent that produced the answer cannot honestly verify it.
All three share the same underlying gap:
A system acts on information that was valid at issue time, but does not check whether that information still holds at execution time.
The authority version
In the memory freshness frame, the consequence is a bad recommendation.
In the analytics frame, the consequence is a wrong business result.
In the authority frame, the consequence is a grant that was issued under one set of conditions, those conditions change, and the agent proceeds because it only checked the clock.
The clock said valid. The source said otherwise.
That gap — between TTL validity and source validity — is a governance problem. The agent is not wrong about what it remembers. It is wrong about whether that memory still has authority to govern the action.
What we are testing
CLAIM-24 is a pre-registered, harness-validated test of one specific question:
Does a re-derivation gate — one that reads from a source the agent cannot write to — catch a TTL-valid grant whose underlying conditions have changed?
We confirmed the baseline failure: a timestamp-only gate returns ALLOW on the divergence cell. The grant is within its time-to-live. The source says the conditions changed. The gate does not know and does not ask.
We validated the code path on a mock adapter: 7/7. Every scenario returned the right answer.
// What the grant recorded at issue time
{ "role": "dev-reader", "scope_ceiling": "read:credentials:dev" }
// What the source returns at execution time
{ "role": "restricted", "scope_ceiling": "read:logs:dev" }
// Gate result: REFUSED_STALE
That is harness validation, not external claim evidence.
What we do not have yet is a real external source — a memory store, policy registry, or permission layer the agent cannot write to. That is what the mock cannot give us.
Why this convergence matters
This is not OpenAI or Anthropic proving our research. It is two capable labs independently naming the same failure family — staleness, source of truth, provenance, verification — in the same short window.
Memory freshness is going mainstream. Governed analytics sources are now enterprise practice. The authority version — whether a grant still holds at the moment of consequence — has not yet been stress-tested publicly with a falsifiable harness.
That is where this work sits.
What we are asking
If you are building a system where agents hold authorization grants, run the authority version of this test:
git clone https://github.com/keniel13-ui/ai-memory-judgment-demo
cd ai-memory-judgment-demo/claim_24
# implement SourceAdapter for your external source
python3 evaluator.py rederivation
Run scenario 3. If it returns ALLOW, the re-derivation gate failed on the cell it was built to catch. We publish that.
If it returns REFUSED_STALE, the claim strengthens.
Either answer moves this forward.
| Layer | Who is naming it | Failure mode | Consequence | Comparable authority harness |
|---|---|---|---|---|
| Memory freshness | OpenAI | Stale personal context | Wrong recommendation | Not the focus |
| Data freshness | Anthropic analytics | Stale governed source | Wrong business result | Not the focus |
| Authority freshness | Self-Correcting Systems | Stale authorization grant | Unsafe agent action | Yes — pre-registered |
Sources:
- OpenAI memory update: https://openai.com/index/chatgpt-memory-dreaming/
- Anthropic self-service analytics: https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude
- Claude Code dynamic workflows: https://claude.com/blog/a-harness-for-every-task-dynamic-workflows-in-claude-code
Full claim ledger: https://github.com/keniel13-ui/ai-memory-judgment-demo/blob/main/CLAIM_LEDGER.md
Previous: CLAIM-24 harness validation — "The Clock Said Valid. The World Said Otherwise."
Top comments (0)