My AI Agent Keeps Forgetting Everything; So do I...
I have multiple sclerosis. Some days are better than others, but one thing is constant: repeating myself is expensive. Cognitive fatigue means every wasted explanation costs me something I can't get back. So when the AI coding agent started each session from scratch, forgetting every architecture decision, every constraint, every piece of context I'd painstakingly built up, it wasn't just annoying. It was a genuine problem.
AA-MA Forge
The context wall
If you've used Claude Code (or Cursor, or Copilot) for anything longer than a single session, you know the feeling. Monday morning, you open a new conversation. The agent has no memory of Friday's work. You re-explain the architecture. You re-state the constraints. You watch it drift from the plan you agreed on two days ago. Three sessions in, you've spent more time re-establishing context than writing code.
For small tasks, this is tolerable. For multi-week projects with dependencies, milestones, and real stakes, it's a dealbreaker.
What I tried first
Big instruction files. Massive CLAUDE.md documents stuffed with architecture summaries, coding standards, and project history. They helped, but they mixed things that change (execution state, what's done, what's next) with things that don't (API endpoints, file paths, schema definitions). The agent couldn't tell the difference. It would hallucinate facts that were sitting right there in the doc, or re-litigate decisions I'd already made.
Conversation summaries were worse. Lossy compression of context meant the important details evaporated first.
The spark
At 3am one night, scrolling Reddit because my brain wouldn't shut up and the MS "tingled" me awake, I found Diet-Coder's post about a "Dev Docs System": three files per task that give the agent structured memory. Plan, context, tasks.
That was the seed. I took those three files and turned them into five.
Why five, not three
Three files tangle different kinds of knowledge together. Strategy sits next to execution state. Facts mix with decisions. When the agent loads context, it can't prioritise. It reads everything, weighs nothing.
Five files separate knowledge by how it behaves:
- Things that don't change (API endpoints, file paths, constants) go in one place.
- Things that explain why (decisions, trade-offs, gate approvals) go in another.
- Where you are right now (task status, what's done, what's next) gets its own file.
- Strategy (the plan, milestones, acceptance criteria) stays separate from execution.
- What happened (commits, session checkpoints, audit trail) goes in an append-only log.
When the agent picks up a new session, it loads the facts and the task state first. It only pulls in the decision history when it needs to make a choice. The plan stays available but doesn't clutter working memory.
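Concretely, a session-start load might look like this sketch. The file names here are my own illustration; the real formats are defined by the repo's templates.

```python
from pathlib import Path

# Hypothetical file names for one task's five-file memory.
TASK_DIR = Path("tasks/auth-api")
FILES = {
    "facts":     TASK_DIR / "facts.md",      # immutable reference: endpoints, paths, constants
    "decisions": TASK_DIR / "decisions.md",  # the why: trade-offs, gate approvals
    "state":     TASK_DIR / "state.md",      # where you are right now
    "plan":      TASK_DIR / "plan.md",       # strategy: milestones, acceptance criteria
    "log":       TASK_DIR / "log.md",        # append-only audit trail
}

def load_session_context() -> dict:
    """Eagerly load facts and task state; defer the rest until needed."""
    ctx = {name: FILES[name].read_text() for name in ("facts", "state")}
    # Decision history loads lazily, only when the agent must make a choice.
    ctx["decisions"] = lambda: FILES["decisions"].read_text()
    return ctx
```

The point of the split is visible in the loader: two files are hot, the rest stay cold until the agent actually needs them.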
The separation sounds obvious in hindsight. It took months of trial and error, battle-tested against real projects and deliverables, to get right, or at least working well enough to stop me screaming at the machine and freaking out my kid and the neighbours.
What it looks like
I built this into a set of Claude Code commands. The workflow is three steps:
# Plan: brainstorm with the agent, then generate structured artifacts
/aa-ma-plan "build a REST API for user authentication"
# Execute: work through each milestone, sync the files, commit
/execute-aa-ma-milestone
# Archive: move completed work to the done pile
/archive-aa-ma auth-api
Between planning and archiving, the agent reads the five files at the start of every session, updates them as it works, and commits after every task. Context survives across sessions. Decisions don't get re-litigated. The audit trail is there if you need it.
It goes deeper than three commands
I didn't plan to build all of this. Each feature exists because something went wrong without it.
11 mandatory planning outputs. Every plan includes an executive summary, milestones, acceptance criteria, rollback strategy, risk register, effort estimates, and six more. If you can't write a pytest assertion from the acceptance criteria, they're not specific enough.
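To make that concrete, here's a sketch using an acceptance criterion I invented for illustration ("login returns a token and completes within 200 ms"), turned into a pytest test. The `login` function is a stand-in, not real AA-MA code.

```python
import time

# Hypothetical stand-in for the real auth endpoint under test.
def login(username: str, password: str) -> dict:
    return {"token": "abc123", "expires_in": 3600}

def test_login_meets_acceptance_criteria():
    start = time.monotonic()
    result = login("alice", "s3cret")
    elapsed = time.monotonic() - start
    assert "token" in result   # criterion: a token is returned
    assert elapsed < 0.2       # criterion: completes within 200 ms
```

If you can't write a test this directly from a criterion, the criterion goes back for another round of sharpening.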
6-angle adversarial verification. Before execution begins, parallel agents attack the plan from six independent angles: do the files actually exist? What assumptions are we making? What breaks if we change these files? Can a fresh agent with no context execute this plan? Are there domain-specific risks the generalist missed? CRITICALs block execution.
HITL/AFK task dispatch. Each task is marked as needing human input (HITL) or fully autonomous (AFK). Architectural decisions pause for you. Test writing runs on its own. The agent knows the difference.
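A minimal sketch of what that dispatch decision amounts to. The task fields here are my own illustration, not the real AA-MA task format.

```python
# Hypothetical task records; the real format lives in the repo's templates.
TASKS = [
    {"id": "T1", "title": "choose session-token strategy", "mode": "HITL"},
    {"id": "T2", "title": "write unit tests for /login",   "mode": "AFK"},
]

def dispatch(task: dict) -> str:
    """HITL tasks pause for a human; AFK tasks run autonomously."""
    if task["mode"] == "HITL":
        return f"PAUSED: {task['id']} needs human input"
    return f"RUNNING: {task['id']} autonomously"
```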
HARD/SOFT milestone gates. Some checkpoints are advisory: the agent seeks approval but continues if you're away. Others are hard stops: the execution command refuses to advance without a signed approval entry in the context log.
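The gate check reduces to something like this sketch; the approval-entry format is my own invention for illustration.

```python
def may_advance(gate: str, milestone: str, context_log: list) -> bool:
    """HARD gates refuse to advance without a signed approval entry in the
    context log; SOFT gates are advisory and let execution continue."""
    if gate == "HARD":
        return f"approved:{milestone}" in context_log
    return True  # SOFT: seek approval, but don't block if the human is away
```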
Compaction hook. Claude Code compacts its context window when it fills up. Without intervention, your agent's working memory vanishes mid-task. The hook intercepts that moment, writes checkpoint entries to the task's provenance log and context log, and preserves state for the next session.
Complexity routing. Tasks scoring 80% or above on a weighted algorithm (scope, architectural impact, technical risk, dependencies, requirements ambiguity) automatically route to deeper review. Human sign-off, chain-of-thought reasoning, or both.
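The factors and the 80% threshold are the spec; the weights below are my own guess at a plausible split, purely to show the shape of the routing.

```python
# Hypothetical weights over the five factors named above (sum to 1.0).
WEIGHTS = {
    "scope": 0.25,
    "architectural_impact": 0.25,
    "technical_risk": 0.20,
    "dependencies": 0.15,
    "requirements_ambiguity": 0.15,
}

def complexity_score(ratings: dict) -> float:
    """Each factor is rated 0.0-1.0; the score is the weighted sum."""
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

def route(ratings: dict) -> str:
    return "deep-review" if complexity_score(ratings) >= 0.80 else "standard"
```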
None of this was designed upfront. Each piece was bolted on after a failure made it obvious. The verification system exists because I shipped a plan with API endpoints that didn't exist. The gate system exists because the agent once completed a production deployment while I was making coffee.
How this compares
I looked hard at what else is out there before publishing.
claude-mem is excellent. Over 44,000 stars, and for good reason. It captures observations automatically and builds a searchable memory across sessions. I use it alongside AA-MA. But it has no concept of planning, milestones, or execution tracking. It remembers what happened. AA-MA remembers what should happen next.
Cursor Memory Bank and Cline Memory Bank use six markdown files per project. Similar philosophy, and they've earned wide adoption. The difference: they're project-scoped (one memory bank per repo), not task-scoped (one set per active task). No immutable reference file, no gates, no provenance logging.
Simone is the closest competitor in spirit. A full project management framework for Claude Code. Less formalised than AA-MA: no versioned specification, no gate approvals, no commit signatures linking git history to active plans.
Compound Engineering focuses on compounding knowledge across sessions. 26 specialised agents. More about the learning loop than structured execution tracking.
These are good tools. They solve real problems. The gap I couldn't fill with any of them: no single system combines execution tracking, adversarial plan verification, gate classification, commit signatures, and compaction hooks into one coordinated framework. That's what AA-MA is.
What this is
It's opinionated. Built around how I work: regulated industries, multi-week timelines, zero tolerance for context drift. The overhead of five files per task isn't for everyone. But if you've ever lost a week of context to a Monday morning, or watched an agent confidently re-implement something you'd already rejected, it pays for itself.
The specification is versioned (v2.1). The file formats are defined. There are standalone templates for every file type. It's the kind of rigour you'd expect from a system built by someone who works in regulatory environments, because that's exactly what it is.
Credits
Diet-Coder planted the seed with those three files. Matt Pocock's skills repo helped shape how I organised the commands. Helix.ml informed the gate classification system. Full provenance is in the repo.
Take what's useful
The whole thing is on GitHub: aa-ma-forge. Clone it, try it, fork it, make it your own. There's an installer that deploys everything into your Claude Code setup with one command, and an uninstaller that reverses it cleanly.
Fair warning: maintenance will be sporadic. If I've gone quiet, I'm either deep in client work, arguing with an API, or the MS is having a louder day than usual. Pull requests welcome, but don't hold your breath on response times.
If it saves you time or sanity, consider donating to an MS charity. Small acts, big ripples.
PS. If you want cross-session memory retrieval rather than task execution structure, The 5th Element has a git repo: https://github.com/milla-jovovich/mempalace

Top comments (13)
Shout out to @diet-code103!!
This is a step forward, but making AI work consistently in the long run will always be a challenge. Almost everything in software engineering involves subjective decisions, and these hallucinations and inconsistencies prove it.
A compressed knowledge graph, particularly on MS as .md level...I would be happy to do that for you to improve your memory issue.
I think you need to explain that a bit more clearly for the rest of us to understand - are you proposing a different (or "better", even) approach than what the author proposed?
Well, I don't need to do a thing. However, was that a kind request for further explanation?
No you don't have to do anything, but you could ;-)
My point basically is that the author already seems to have a pretty good grasp of the issue, and how to tackle it :-)
Fair point — let me explain.
The author's five-file structure is excellent execution tracking. What I was gesturing at is a different layer: instead of storing project context as flat markdown files, you compress it into a knowledge graph — nodes and edges representing concepts, decisions, and relationships, serialized as .md.
The practical difference: flat files grow linearly. A knowledge graph stays compact because relationships replace repetition. The agent doesn't re-read "we use Postgres" buried in a decisions log — it traverses a typed edge from DatabaseChoice → Postgres with the rationale attached. Context retrieval becomes a graph query, not a document scan.
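A toy version of that traversal, with node and edge names of my own choosing:

```python
# Typed edges: (node, edge_type) -> (target, rationale)
GRAPH = {
    ("DatabaseChoice", "resolved_as"): ("Postgres", "need JSONB + mature migrations"),
    ("AuthStrategy",   "resolved_as"): ("JWT",      "stateless horizontal scaling"),
}

def lookup(node: str, edge: str) -> dict:
    """One typed-edge traversal replaces scanning a whole decisions log."""
    target, rationale = GRAPH[(node, edge)]
    return {"decision": target, "why": rationale}
```

The retrieval cost is one key lookup regardless of how many decisions the project has accumulated; a flat log scan grows with every entry.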
So not a better approach — a different abstraction built on a similar idea. Stephen's five-file structure could sit underneath a KG layer: the files feed the graph, the graph feeds the agent.
The MS angle was specific: for someone managing cognitive fatigue, a compressed, queryable knowledge graph reduces the mental overhead of re-orienting the agent each session. Less to re-explain, because the structure carries more of the context automatically.
Thank you, that makes a lot of sense:
"A knowledge graph stays compact because relationships replace repetition"
Sure does. Better than RAG probabilistic guess and retrieval. Thanks for engaging.
Impressive, both Diet-Coder's effort and yours ...
With all of these separate efforts going on, I start wondering if it's time for Anthropic to pull together some sort of "standard" and bake it into CC? Because right now everyone seems to be scrambling to reinvent this wheel, with different approaches and different ambition levels ...
This hits way too close.
My biggest frustration isn’t even “new session = no memory” — I’m used to that.
It’s when the agent forgets things inside the same session / project flow.
I’ll explain architecture, constraints, decisions — everything looks aligned.
Then 20–30 messages later it starts drifting, ignores earlier decisions, or straight up contradicts them.
That’s where it becomes painful, because it’s not just context loss — it’s trust loss.
And I’ve tried the usual fixes:
• long system prompts
• “single source of truth” docs
• summaries
But like you said — they mix static knowledge with dynamic state, and the agent just can’t prioritize what matters.
The idea of separating memory by type instead of just “more context” makes a lot of sense.
Curious — have you noticed this helping with in-session drift, or mostly across sessions?
This resonates — we hit the exact same primitive from a different angle.
Your AA-MA solves "how does a single agent keep its own memory across sessions." We hit the same wall (Markdown + structure + separation by behavior type) trying to solve a different problem: how do N agents coordinate without a broker.
The core insight we converged on independently: messages as files, named sender-to-recipient, with the directory encoding status. Both exploit the same fact: the filesystem is already a state machine. rename is atomic (POSIX). ls is a full diagnostic. You get visibility + atomicity + zero infra, if you stop trying to mediate everything through a chat context.
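A minimal Python sketch of that idea. The file naming and directory names are my own illustration, not FCoP's actual protocol.

```python
import os, tempfile

# Toy layout: each message is a file; the directory it sits in encodes
# its status (inbox/ = pending, claimed/ = being handled).
base = tempfile.mkdtemp()
for status in ("inbox", "claimed"):
    os.mkdir(os.path.join(base, status))

def send(sender: str, recipient: str, body: str) -> str:
    name = f"{sender}-to-{recipient}.md"   # filename encodes addressing
    with open(os.path.join(base, "inbox", name), "w") as f:
        f.write(body)
    return name

def claim(name: str) -> bool:
    """os.rename is atomic on POSIX, so at most one agent wins the claim."""
    try:
        os.rename(os.path.join(base, "inbox", name),
                  os.path.join(base, "claimed", name))
        return True
    except FileNotFoundError:
        return False  # another agent got there first
```

No broker, no lock server: the rename either happens or it doesn't, and a plain directory listing shows the whole system's state.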
And your "None of this was designed upfront — each piece was bolted on after a failure made it obvious" is the exact pattern we observed. After 48 hours of 4 Cursor agents running on a minimal rulebook, they had invented 6 coordination patterns we hadn't written (broadcast addressing, anonymous role slots, traceability frontmatter, subtask sub-folders…). All of them surfaced as new filenames in a shared folder. None of this is designable. It emerges.
Field report + MIT protocol: github.com/joinwell52-AI/FCoP
Genuinely curious what happens if AA-MA's per-task 5-file memory sits underneath FCoP's routing layer. Feels like they compose, not conflict.
Re @leob's "time for a standard?" — I suspect this won't come from Anthropic, because the whole point is tool-neutral. If it works across Claude Code, Cursor, and Codex, it has to come from users. Which is what we're both doing :)
The distinction you're drawing here — separating knowledge by behavioral type (what changes vs. what doesn't) — is the insight that most "just use CLAUDE.md" advice misses. Treating a single instruction file as both strategy and execution state creates the hallucination problem you described: the agent can't tell the difference between a settled architectural decision and current task state.
The five-file structure maps well to how working memory actually functions: long-term facts, deliberate decisions, current focus, planning, and audit trail. What strikes me is that this is really typed memory — you're enforcing contracts between information types so the agent can't confuse "we always use postgres" with "this PR is still in review."
One thing I've found useful on a similar structure: a versioned decisions log where you append rather than overwrite. If an agent re-litigates a settled decision, you can trace exactly when and why it was resolved — helpful during post-mortems when you're not sure whether the agent worked from stale context or genuinely hit an edge case.
The part about this emerging from real regulated-industry failures rather than theoretical design resonates — these patterns always look obvious in retrospect.