A lot of AI coding advice quietly assumes the same thing:
if the output is bad, you probably need a better model, a better prompt, or more tooling...
For further actions, you may consider blocking this person and/or reporting abuse
The CLAUDE.md approach helps with shallow local rules, but architectural decisions are a different category. A rule says "don't do X." A decision record says "we considered X, Y, and Z, chose Y because of constraints A and B, and X becomes valid if constraint A changes." Those aren't equivalent things to encode.
Rules suppress behavior. Decision context guides it. When the model re-encounters the problem in a later session, a rule might block the wrong path — but without the reasoning behind the decision, it can't recognize when circumstances have genuinely changed and the original decision no longer applies.
We've been building around this at Mneme — treating project memory as a first-class artifact, not something that gets flattened into a rules file. The continuity problem is real and underrated compared to the generation quality conversation.
Yes, this is exactly the distinction.
CLAUDE.mdhelps with guardrails and local rules, but it is the wrong shape forload-bearing architectural decisions. A rule can say "do not do X." A decision
record has to preserve why Y won, what constraints made it win, and what would
have to change before X becomes valid again.
That is the difference between suppressing behavior and guiding judgment. Once
the reasoning disappears, a later session can follow the rule mechanically and
still miss that the world around the decision has changed.
Treating project memory as a first-class artifact is the important move.
The "wakes up without durable context" framing is the most accurate description I've read of this failure mode. The code review or PR analogy doesn't quite capture it, because PRs at least carry the explicit history of decisions in their commit messages and review threads. The session boundary just erases all of that.
The pattern that's worked for me on a solo-dev monorepo is a decisions/ folder of single-page memos. Each memo ends with a short footer covering what changed, what's now load-bearing, and what not to touch without re-opening the memo. Then a CLAUDE.md rule that says "before editing $area, read decisions/$area.md." The model reads it. The retention isn't perfect, but architectural-rule violations dropped from around 30% to under 5% across a quarter.
this is the session boundary problem - and better prompts don’t fix it. treating decisions as artifacts you explicitly pass in each time actually helps. ADRs as context prepend, spec files. the model isn’t forgetting - it was never told.
Exactly. “The model isn’t forgetting, it was never told” is probably the cleanest way to put it. ADRs, spec files, and other durable artifacts are what make context portable across sessions instead of trapping it in one chat window.
right - but the format gap matters too. most ADRs are written for human audit, not model ingestion. context-optimized ADRs are shorter, lean on explicit why, skip rationale prose. same artifact class, different structure.
Agreed, and that is a real shift in how we think about specs/artifacts.
If AI is doing 85% of the work, it may actually be the primary audience. So maybe the artifact should optimize for model ingestion first, then generate the human-readable version when needed.
I tried this with ADRs: more structured, schema-like, harder for me to read directly, but easier for an agent to consume consistently. Then a skill can turn that durable context into a readable report for me periodically.
So the artifact stops being “documentation humans read” and becomes “decision memory the system can safely reuse.”
yeah that's the thing. same pattern with deployment docs - condensed into key/value blocks agents can parse, kept narrative version for humans in confluence. the weird part: the condensed version became easier for people to review too.
yep, there are a lot of fun diagnostics rules (for example never say NEVER, and DO NOT think of a pink elephant, not joking, these are generic failure modes on any llm.)
This is the exact mindset I have about Ai . From this post alone, you can tell you are very good at turning from what i consider a complex concept into a clear narrative. That also would indicate you have a lot of hands on “in the trenches “ experience. I don’t really know what im getting at but just want to say this is a legit post to read
Really appreciate that. A lot of this came from hitting the same failure mode over and over on real project work, so I wanted to explain it in plain terms instead of treating it like magic. Glad it landed for you.
Yes, this is exactly the failure mode I keep seeing. I really like your framing of decisions as records with provenance plus explicit supersedes links, because that turns memory into something queryable instead of something buried in old chat logs. Even a small SQLite decisions table gets a team surprisingly far.
It's entirely a tooling problem. If you don't have the proper memory or the proper insights, it's like bringing in a new developer each time and asking them to take the next step.
That “bringing in a new developer each time” analogy is spot on. The painful part is not that the model can’t code, it’s that it doesn’t inherit the expensive decisions unless the tooling gives it durable state. That’s the gap I was trying to point at.
This hits the nail on the head. No matter how high the AI's IQ, without deterministic state management, it’s just "stochastic mediocrity." The solution isn't blindly stacking compute, but building a persistent context that outlives the chat window. An Agent without memory is just a code monkey; one that preserves engineering decisions is a true digital partner.
Persistent context that outlives the chat window is the key. Once engineering decisions are preserved somewhere stable, the agent stops feeling like a fast stateless assistant and starts feeling much closer to a real collaborator.
Completely agree. Persistent context shifts AI from a stateless utility to a true local-first partner. When an agent retains our deep architectural decisions and past bug fixes beyond the session, it masters our logic. That's when it stops being a mere assistant and becomes a reliable, long-term collaborator.
Yes, I like your use of the term collaborator. Thats a good way to look at at, and the better we can make its memory long term, the better it will be at being a collaborator.
AI pair programming is basically:
Session 1:
“we must preserve backward compatibility at all costs.”
Session 7:
“cleaned up deprecated code for simplicity :)”
Production:
“interesting.”
yes and telling claude to "consider every possible error and make no mistake" doesn't work
AI context is expensive. Memory is expensive. That's why they enforce these limits. AI companies are bleeding money left and right and the revenue are not catching up from what I heard.
I think cost is definitely part of why context limits exist. The part I keep coming back to, though, is that even with a bigger window, the workflow still breaks if the important decisions are not externalized somewhere durable. More tokens help, but better state management helps more.
There are plenty of good memory systems out there that address this.
I did a little write up on how i manage context as well, maybe this might inspire you or give you ideas as well for your own workflow.
I think in addition to the memory system, and as you mentioned, good ADR docuementing in repos just for the purpose of collaboration more than anything else.
Interesting post, thanks for sharing!
I will check those out
Strong agreement on this, and the human-makes-the-decisions part is necessary but not sufficient. I build a large multi-chain backend this way, all architecture and infrastructure calls are mine, the AI implements. But I still hit exactly the drift you describe, because making the decision as a human doesn't persist why I made it. The diff survives in git. The reasoning doesn't, unless I write it down on purpose.
What fixed it for me: every session ends with a committed checkpoint plus a written handoff file, and that file records the load-bearing decisions and the reason each one exists, not just what changed. Next session starts by reading it. It turns "first contact every time" into something closer to continuity. Your support-pillar metaphor is exactly right, the fix is just making the pillars and their reasons readable to the next session instead of hoping it infers them from the diff.
Weak continuity breaks it far more than weak generation, in my experience. Generation quality stopped being my bottleneck a while ago. Decision memory still is.
The session boundary problem is real. Every new
context window starts with zero memory of why the
previous session made the choices it did.
This is the same problem I hit building consensus
infrastructure. The chain does not forget decisions.
Every state transition, every capability grant,
every delegation, every payment settlement is an
immutable record with full provenance. The protocol
is the architectural decision log.
That is the difference between application-layer
AI and protocol-layer AI. An agent's reasoning
resets every session. The chain's state persists
across every interaction the agent has ever had.
Identity, reputation, verification records, payment
history. None of it drifts because none of it lives
in a context window.
The "AI writes code but forgets decisions" problem is actually a storage architecture problem at its core. Code is easy to version. Decisions — the why behind an architecture choice, a trade-off made under a deadline, a constraint from a dependency — have no natural home in a codebase.
I've been working on an embedded AI project where this gets acute: sensor fusion code running on a drone, where every architectural decision has physical consequences. We ended up storing structured decision logs in a local Rust database (moteDB) alongside the code — essentially treating "why we built it this way" as first-class queryable data, not just comments that rot. Works better than I expected; the AI assistant can actually retrieve context about past choices before suggesting changes.
The harder question is retrieval quality — when do you surface an old decision vs. assume the codebase has moved on? Have you found a useful heuristic for that boundary?
AI codegen tools optimize for the wrong loss function. They produce code that runs but lose the reasoning chain behind why something is shaped a certain way. We hit this with an agent that kept rewriting our state machine because the LLM didn't see the latency constraint that forced explicit handoffs. The fix was attaching design-doc context to the system prompt as immutable preamble. Code-gen agents need architectural memory, and not just code memory
Very insightful post. The continuity problem is something I’m starting to notice more while building projects with AI tools as well.