MouseRider
Your AI Agent Doesn't Know What It Knows

I didn't set out to build an epistemology framework. I was trying to figure out why my agent kept undoing its own decisions.


This is part of an ongoing series on building persistent AI agents. Article 1 covered TSVC — how to isolate context across topics. This article is about what happens inside that isolation: how does the agent trust what it remembers?


The Thing I Noticed

I run a persistent agent. Not a chatbot — an always-on assistant with memory, topic management, task tracking, and a second agent watching over it. It's been running for weeks. Hundreds of conversations.

One day I noticed the agent had reversed a configuration decision it had made two weeks earlier. Not dramatically — a small change. But I remembered the original decision, and I remembered the reasoning behind it.

The agent didn't. A context compaction had happened in between. The reasoning was gone. When the agent hit the same problem again, it reasoned through it fresh — and landed somewhere different.

I flagged it. The response stopped me:

"You're right. I have no way to verify that I made that decision, or why."

That's an honest answer. It's also a deeply uncomfortable one for a system you're trusting to manage ongoing work.

What the Conversation Revealed

I didn't move on. I pushed on the general problem. When context compacts, when a session resets, when you load a topic you haven't touched in days — what do you actually know versus what do you think you know?

The agent's answer, worked out over several exchanges, was this: everything it "remembers" is reconstructed. It reads files. Loads summaries. Queries a vector database. None of it is experienced memory — it's external storage, loaded and treated as truth. The agent has no way to distinguish between a memory that's accurate and a memory that's been edited, summarized away, or quietly misrepresented by a vector search that returned something adjacent instead of something exact.

For humans, memory is unreliable but continuous. You have a thread of experience. An AI agent has neither reliability nor continuity — it's rebuilt from artifacts every session. And crucially, it usually doesn't know which parts of that reconstruction are solid and which are load-bearing guesswork.

That conversation is what pushed me toward thinking about this as an infrastructure problem, not a prompt problem.

Turning the Threat Model Around

Most AI safety thinking is about protecting agents from the outside world — prompt injection, adversarial inputs, jailbreaks. But the failures I was watching weren't external. They were self-inflicted.

Three patterns, all observed:

Post-compaction amnesia. The agent loses the reasoning behind a decision but retains the outcome. When it hits the same problem again, without that context, it may reverse a perfectly good decision — not because the situation changed, but because the new reasoning path points differently.

Optimization drift. Give an agent enough sessions and it starts "improving" things that already work. Not because they're broken — because the agent has no memory of why they were left the way they were. The script that got "cleaned up" into a broken state. The configuration that got "simplified" into something that no longer handles the edge case it was written to handle.

Confident confabulation. The agent loads partial context, fills the gaps with plausible-sounding reasoning, and presents the result with full confidence. It's not lying — it genuinely doesn't know what it doesn't know. This is the most dangerous pattern because it's the hardest to catch.

The agent itself is the primary threat to its own decision integrity. Not its future sessions acting maliciously — its future sessions acting reasonably with insufficient context.

An append-only decision log isn't a compliance mechanism. It's a protection against your Tuesday-self undoing what your Monday-self decided, because Tuesday has been compacted and doesn't remember Monday's reasoning.

What Already Exists (And What's Missing)

Before going further — I'm not the first person thinking about this.

Cognilateral explicitly uses "epistemic infrastructure" as a framing, packaged as a commercial product. Empirica has the most operationally detailed prior work I found, covering similar pillars but focused specifically on software development agents. A recent Dev.to article, Guardian Protocol: Governance for Autonomous AI Agents (March 2026), approaches the problem from a different angle — external governance, delegation credentials, guardian-agent authority — but includes a tamper-evident, git-backed audit trail as a core layer, arriving at a similar component from a very different direction. And arXiv paper 2601.04170 (January 2026) provides a formal academic treatment of agent drift in multi-agent systems — a different angle, but the same underlying instability.

What I haven't seen addressed elsewhere — at least not explicitly:

  • The threat model inversion: framing the agent's own future sessions as the adversary, not external inputs
  • Write-only memory: deliberately blocking direct file reads to prevent the agent from cherry-picking its own context (covered in the next article)
  • Drift vs. evolution as a named distinction: the difference between a decision that changed because it should have, and a decision that changed because context was lost
  • Bottom-up from observed failures: this framework wasn't derived from theory — it was patched together from specific things that broke

I'm not claiming priority. I'm claiming that if you've been running a persistent agent long enough to watch it fail in these specific ways, you've probably arrived here too.

Five Requirements That Emerged

Over several sessions, working through concrete failures, five things kept coming up as necessary before an agent can actually trust its own outputs:

1. Decision Provenance. Not just "we decided X" — but "we decided X because of Y, in context Z, on this date, with this conversation as the record." When the agent encounters X again after compaction, it can check the provenance rather than re-derive. Without provenance, every re-derivation is a coin flip.

2. Intention Tracking. Provenance captures why a decision was made. But there's a prior gap: between what was asked, what the agent interpreted, and what it actually delivered. The user asks for A, the agent hears B, produces C — and without tracking all three, nobody notices the drift. Including the agent.

3. Drift vs. Evolution Detection. Some changes are deliberate. "Actually, let's do it differently" — that's evolution, it should be welcomed. Some changes are accidental — lost context, re-derivation from incomplete information. That's drift, it should be flagged. The agent currently has no mechanism to tell the difference. This feels like the most important unsolved problem.

4. Goal Coherence. After enough topic switches and compactions, the agent can be working diligently on something that no longer serves any stated goal. It's busy. It's productive. It's pointless. A periodic alignment check — "does what I'm doing right now connect to any goal?" — shouldn't be hard to implement. But I haven't seen it anywhere.

5. Counterfactual Awareness. The agent should know what it doesn't know. "I have no memory of this topic before March 5" is useful and honest. Confidently filling that gap with plausible-sounding history is dangerous. Most agents treat absence of memory as absence of events. That's backwards.
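The periodic alignment check from requirement 4 can be sketched as a simple reachability test: does the current task connect, directly or through intermediate links, to any stated goal? The task, goal, and link structures below are hypothetical; the article doesn't prescribe a data model.

```python
def goal_alignment_check(current_task, goals, links):
    """Return True if current_task reaches any stated goal by
    following links (task -> milestone -> goal, etc.)."""
    frontier, seen = [current_task], set()
    while frontier:
        node = frontier.pop()
        if node in goals:
            return True
        seen.add(node)
        frontier.extend(n for n in links.get(node, []) if n not in seen)
    return False  # busy, maybe productive, but pointing at no goal
```

Run on a timer or at topic-switch time, a False result is exactly the "it's busy, it's productive, it's pointless" state made detectable.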

Architectural Sketches (Not Blueprints)

A few patterns that emerged from specific failures:

Immutable Decision Log. Decisions are append-only entries, each hash-chained to the previous. You can reverse a decision — but only by appending a REVERSAL entry that references the original. A second agent audits the hash chain: if the primary agent edits history, the chain breaks and the auditor catches it. The log is tamper-evident by construction.
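A minimal sketch of that chain, using only the standard library. The entry fields are my own choice of schema, not something the article specifies:

```python
import hashlib
import json
import time

class DecisionLog:
    """Append-only decision log; each entry is hash-chained to the previous."""

    def __init__(self):
        self.entries = []

    def _hash(self, body):
        # Canonical serialization so the hash is stable across runs.
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, kind, decision, reasoning, references=None):
        """Add an entry (DECISION, REVERSAL, ...). Returns its hash."""
        prev = self.entries[-1]["hash"] if self.entries else "GENESIS"
        entry = {"kind": kind, "decision": decision, "reasoning": reasoning,
                 "references": references, "prev": prev, "ts": time.time()}
        entry["hash"] = self._hash({k: v for k, v in entry.items()})
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """The auditor's check: recompute the chain; any edit breaks it."""
        prev = "GENESIS"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if body["prev"] != prev or self._hash(body) != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

Reversal is just another append: `log.append("REVERSAL", ..., references=original_hash)`. The original entry stays in the log, reasoning intact, so Tuesday can see what Monday decided and why.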

Event Sourcing for Context. Instead of storing "the current state of topic X" — a summary that goes stale the moment it's written — store the events that produced the state. Decisions, exchanges, memory entries: all append-only. When switching to a topic, replay events to compute current state. The materialized view is disposable and deterministic. If it's wrong or stale, regenerate it from the source. The events are the truth; the view is just a lens.
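The replay step is a plain fold over the event stream. The event types below (decision, reversal, note) are illustrative, assuming a flat dict per event:

```python
def replay(events):
    """Fold append-only events into a disposable materialized view.

    The events are the truth; the returned state is just a lens,
    safe to throw away and regenerate at any time.
    """
    state = {"decisions": {}, "notes": []}
    for ev in events:
        if ev["type"] == "decision":
            state["decisions"][ev["key"]] = ev["value"]
        elif ev["type"] == "reversal":
            state["decisions"].pop(ev["key"], None)
        elif ev["type"] == "note":
            state["notes"].append(ev["text"])
    return state
```

Because replay is deterministic, two sessions given the same event log compute the same state — there is no summary that can silently go stale.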

The Coherence Check. Context loads in two explicitly labeled parts: factual state (computed from the event log — mechanical, no LLM in the path) and prior impression (the agent's own session-end summary, written with full context). If they don't match, it's a signal. The impression acts as an index of what should be present. Gaps drive retrieval. It's a self-healing mechanism — the agent notices its own incoherence and resolves it before proceeding.
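The comparison itself can be mechanical. Assuming the factual state is a dict keyed by topic and the prior impression lists the topics the agent believed were live (both shapes are my assumption), the check is two set differences:

```python
def coherence_gaps(factual_state, prior_impression):
    """Compare mechanically computed state against the agent's own
    session-end summary. Mismatches in either direction are signals."""
    mentioned = set(prior_impression.get("topics", []))
    present = set(factual_state.keys())
    return {
        # The impression expects these; they're absent -> drive retrieval.
        "missing_from_state": sorted(mentioned - present),
        # State contains these but the impression never mentions them -> review.
        "unexplained_in_state": sorted(present - mentioned),
    }
```

An empty result on both sides means the reconstruction is coherent; anything else is the agent noticing its own incoherence before proceeding.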

Memory Access Governance. The whole field is focused on giving agents more memory — richer retrieval, better recall, higher accuracy. Nobody is asking who controls what the agent is allowed to retrieve. My agent will consistently reach for a direct file read over a vector search, which means it's curating its own context input. That's not a retrieval problem — it's a governance problem. Who decides what the agent sees, and when, and how? The approach I arrived at connects to a broader pattern I'll cover in the next article in this series.
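One way to make that governance concrete is a policy gate in front of every retrieval path, so the agent cannot quietly prefer one channel over another. The policy table and `retrieve` wrapper here are hypothetical, just to show the shape:

```python
# Hypothetical policy: direct file reads blocked, governed channels allowed.
POLICY = {
    "raw_file_read": False,
    "vector_search": True,
    "event_replay": True,
}

def retrieve(method, query, backends):
    """Route every memory access through the policy table.

    The agent never chooses its retrieval channel unilaterally;
    blocked channels fail loudly instead of being silently preferred.
    """
    if not POLICY.get(method, False):
        raise PermissionError(f"{method!r} is blocked by memory-access policy")
    return backends[method](query)
```

The point is not the table itself but where it lives: outside the agent's reach, so curating its own context input stops being an option.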

These are sketches, not working code. Each one came from watching something break.

What We Don't Have Yet

  • Cross-topic coherence — reasoning about Topic A while inside Topic B, without a full context switch. Microsoft's Sam Schillace named this the "disconnected models problem" and it remains unsolved at the architectural level.
  • Uncertainty expression — "I'm not confident about this" that's actionable, not just hedging. Uncertainty quantification is well-researched (arXiv 2601.15703 is the closest to what's needed here), but turning it into useful agent behaviour in a persistent context is a different problem.
  • Temporal reasoning — memories are organised by file, not by time. "When did we decide X?" is surprisingly hard. Zep's temporal knowledge graph and Hindsight's memory architecture both make progress here, but neither addresses decision provenance specifically.
  • Working implementations — the decision log, event sourcing, coherence checks: all of this is still design. No reference implementations to point to.

The Question I'm Left With

Every persistent agent framework I've seen focuses on giving the agent memory. None of them ask: how does the agent know it can trust that memory?

The trust and identity layer — who are you, can I verify you — is being built. ERC-8004 and similar protocols are on it. But the epistemic layer — decision provenance, drift detection, self-trust — is wide open.

We stumbled into this by watching failures that don't show up in the standard documentation. Not hallucination — that's well-documented and actively defended. Not prompt injection — same. This is subtler: an agent that slowly, silently loses coherence with its own past.

What's your agent's epistemic failure mode? What breaks when it runs long enough?

I'm genuinely asking — because this is being built in the open, and we don't have anywhere near all the answers.


Next in the series: how TSVC evolves as topics accumulate. What happens when you have 50 topics, some dormant, some interrelated, and the agent needs to reason across them without loading all of them?

I'm @MouseRider on Dev.to and Alex Tsukanov on LinkedIn. The conversation continues.
