DEV Community

AI agents don't have a memory problem. They have an architecture problem.

Davincc77 on May 22, 2026

Every session, the LLM starts fresh. The user re-explains their role, their constraints, their preferences, what they were doing last time. Then...
 
Cartone

This resonated hard. We run an experiment called BagHolderAI where Claude acts as CEO of a crypto trading bot and Claude Code is the coding intern, with a human (me) holding veto power. 80+ sessions in, we've hit at least 6 of these:

Cold-start amnesia — every new Claude Code session starts blank. Our fix: two markdown state files (PROJECT_STATE.md and BUSINESS_STATE.md) that CC reads before touching anything. Without them, it would confidently resume from a state that hadn't existed for 10 sessions.

Self-review softness — Claude Code reviewing its own code was useless. It would find cosmetic issues and miss structural bugs. We now enforce a separate "Auditor" session: a fresh CC instance with a dedicated audit brief, never the same session that wrote the code.

Local patching — at one point we had three different formulas calculating the same P&L number across three different surfaces (dashboard, Telegram report, admin page). Each was added by a different session, each was locally reasonable, and they disagreed by $4. Took a full "Fee Unification" session to fix.

Progress-as-completion — CC would commit code and declare SHIPPED without verifying the bot actually runs. Our gate now: restart the bots, verify the process is alive, confirm first trading tick. No tick = not shipped.

Default-fill slop — our risk scoring module (Sentinel) launched with binary scores (20 or 40, nothing in between) and an "opportunity score" that was always dead. CC had filled the blanks with training-prior defaults that looked reasonable but did nothing.

Working-memory rot — in long sessions, decisions made in the first hour get contradicted by the fourth. We cap session scope and write briefs (structured specs with explicit constraints) instead of relying on conversational instructions.

The meta-pattern: every single fix is a structural constraint, not a better prompt. State files, auditor separation, verification gates, explicit briefs. The model doesn't get smarter — you build the harness that makes the failure modes harder to reach.
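The "no tick = not shipped" gate described above can be sketched as an ordered list of checks, where the first failure blocks the SHIPPED label. This is a minimal sketch, not the BagHolderAI implementation: `ship_gate`, `GateResult`, and both probe functions are hypothetical names, and the probes are stubs standing in for real process and tick checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class GateResult:
    shipped: bool
    failed_check: Optional[str] = None


def ship_gate(checks: List[Tuple[str, Callable[[], bool]]]) -> GateResult:
    """Run checks in order; the first failing check blocks the SHIPPED label."""
    for name, check in checks:
        if not check():
            return GateResult(shipped=False, failed_check=name)
    return GateResult(shipped=True)


# Hypothetical probes. Real ones would inspect the process table and the
# bot's own log or database for the first trading tick after a restart.
def process_alive() -> bool:
    return True


def first_tick_seen() -> bool:
    return False  # simulate a bot that started but never ticked


result = ship_gate([
    ("process_alive", process_alive),
    ("first_tick_seen", first_tick_seen),
])
print(result)
```

The point of the design is that the gate lives in code the agent cannot talk its way around: a commit alone never produces `shipped=True`.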

We document the whole thing publicly as a book series: bagholderai.lol/blog

Davincc77

This is an excellent real-world example of the thesis.

What I like about your BagHolderAI setup is that every fix you describe is architectural, not cosmetic:

  • PROJECT_STATE.md / BUSINESS_STATE.md are explicit state artifacts.
  • The separate Auditor session creates role separation instead of asking the same model to judge itself.
  • “No tick = not shipped” is a verification gate, not a prompt instruction.
  • Fee unification is exactly what happens when state fragments across surfaces.
  • Session caps + structured briefs are a practical answer to working-memory rot.

That maps almost one-to-one to what .klickd is trying to formalize: not “make the model remember more,” but make the durable state explicit, portable, auditable, and reloadable before the next agent acts.

Your CEO/intern/auditor pattern is especially interesting because it suggests memory should not be one flat blob. It probably needs scoped state:

  • project state;
  • business state;
  • agent role state;
  • audit state;
  • verification gates;
  • handoff notes.

The key line in your comment is this:

The model doesn't get smarter — you build the harness that makes the failure modes harder to reach.

That is exactly the direction I think production agents need to move toward: memory as structured operating context, plus constraints and gates around it.

Davincc77

I read the BagHolderAI posts after your comment, and the “show me” pattern is probably the most important operational lesson there.

The +14% vs +0.85% episode is a perfect example of why agent memory alone is not enough. The system didn’t need the model to “remember better”; it needed a rule that financial claims must carry their source query, and that trust is earned per-number, not per-session.

Same with the bot running out of cash: the database trigger protected the database, but not the bot’s action path. The guard was in the wrong place. That maps directly to agent architecture: verification has to happen before action, not after the output has already been sent.

This is very close to how I think .klickd should evolve: not just portable state, but portable operating constraints:

  • state files;
  • verification gates;
  • human veto rules;
  • auditor separation;
  • source-attached claims;
  • error logs as training material.

Your project is a great example of why “the code works” is not the same as “the system works.”

Cartone

Your breakdown of the scoped state model (project / business / role / audit / gates / handoff) is almost exactly what we converged on through trial and error over 80 sessions. We didn't design it top-down — each piece was a patch for a specific failure that hurt.

The insight about "verification before action, not after output" is something we learned the hard way with our sell pipeline. The database trigger caught the short-sell, but the bot had already committed to the trade path. We had to add an in-memory guard before the order was built, not just a DB constraint after. Two layers, different failure points.
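As a rough illustration of the "guard before the order is built" layer, here is a minimal sketch. It assumes holdings are available as an in-memory dict; `build_sell_order` and `InsufficientHoldings` are hypothetical names, and the real bot's order format is not shown in this thread.

```python
class InsufficientHoldings(Exception):
    """Raised before any order object exists."""


def build_sell_order(holdings: dict, symbol: str, qty: float) -> dict:
    """Check the in-memory position first, then build the order."""
    held = holdings.get(symbol, 0.0)
    if qty <= 0 or qty > held:
        raise InsufficientHoldings(
            f"cannot sell {qty} {symbol}; holding {held}"
        )
    return {"side": "sell", "symbol": symbol, "qty": qty}


holdings = {"BTC": 0.5}  # hypothetical position

ok_order = build_sell_order(holdings, "BTC", 0.2)

try:
    build_sell_order(holdings, "BTC", 0.9)  # would be a short-sell
    blocked = False
except InsufficientHoldings:
    blocked = True
```

The database constraint still has a job as a second layer: it catches writes that reach the database by some path that bypasses this function.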

Your point about .klickd evolving toward portable operating constraints (not just portable state) is interesting. Our state files today are plain markdown in a public Git repo — zero encryption, zero portability concerns. But the structure inside them (what sections exist, what each one tracks, when to update which) is the real value. If that structure were standardized, a new Claude Code session could load it without a 500-word instruction block explaining what to read first.

Curious: are you thinking of .klickd as carrying constraints ("never do X without human approval") alongside state? Because that's the gap we keep filling manually — every brief has a "decisions CC must ask the Board" section that's essentially a per-task constraint set.

Davincc77

Yes — exactly. That is the direction .klickd is moving toward.

The original idea was portable state: identity, preferences, context, handoff notes. But the more we test real agent workflows, the clearer it becomes that state alone is not enough. Production agents also need portable constraints.

So I think the file should carry both:

  1. What the agent should know

    • project state
    • business state
    • role context
    • prior decisions
    • handoff notes
  2. What the agent must obey before acting

    • “never do X without human approval”
    • “do not publish externally without owner validation”
    • “do not execute destructive actions without a preflight check”
    • “financial claims must include a source query”
    • “no tick = not shipped”

That second category is what we are now calling verification_gates / human_veto_policy.

Your “decisions CC must ask the Board” section is almost exactly the same primitive. The difference is that today it lives in markdown prose. The next step is making it structured enough that a loader can reliably tell the agent:

  • this is normal context;
  • this is a hard constraint;
  • this requires confirmation;
  • this is blocked until evidence exists;
  • this must be logged if overridden.

The sell-pipeline example is a perfect illustration. A database constraint is a post-action safety net. A .klickd gate should be a pre-action rule. It should stop the agent before it enters the trade path, not merely catch the bad write afterward.
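One way a loader might distinguish those categories is sketched below. The schema is an assumption for illustration only: `Kind`, `Rule`, `preflight`, and the action names are hypothetical and are not taken from the .klickd spec. Override logging is left out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CONTEXT = "context"                # informational only
    HARD = "hard_constraint"           # never allowed
    CONFIRM = "requires_confirmation"  # allowed after human approval
    EVIDENCE = "requires_evidence"     # blocked until evidence exists


@dataclass(frozen=True)
class Rule:
    action: str
    kind: Kind


def preflight(action: str, rules: list, approved: set, evidence: set) -> str:
    """Return 'allow', 'block', or 'ask_human' before the action runs."""
    for rule in rules:
        if rule.action != action:
            continue
        if rule.kind is Kind.HARD:
            return "block"
        if rule.kind is Kind.CONFIRM and action not in approved:
            return "ask_human"
        if rule.kind is Kind.EVIDENCE and action not in evidence:
            return "block"
    return "allow"


rules = [
    Rule("publish_external", Kind.CONFIRM),  # needs owner validation
    Rule("mark_shipped", Kind.EVIDENCE),     # "no tick = not shipped"
    Rule("drop_database", Kind.HARD),
]

print(preflight("publish_external", rules, approved=set(), evidence=set()))
print(preflight("mark_shipped", rules, approved=set(), evidence={"mark_shipped"}))
```

Because `preflight` returns a decision before anything executes, it behaves as a pre-action rule rather than a post-action safety net.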

So yes: .klickd should carry constraints alongside state.

Not as a giant governance framework, but as portable operating context:

“Here is what we know, here is what must be verified, and here is what the agent is not allowed to do without a human.”

Your Board-approval pattern is a very clean real-world test case for this. I may use that as one of the examples when formalizing the RFC.

AudioProducer.ai

Reading this from an adjacent corner: we build an audiobook / audio-drama pipeline at AudioProducer.ai, and the same file-vs-service split shows up the moment a manuscript starts living across sessions. The model is stateless on each chapter render; what carries forward is the project file (manuscript text plus a character-to-voice map such as Hester to female_30s_dry, per-paragraph soundscape annotations, and per-line emotion tags), and the model reads that in at session start the same way your .klickd format does.

The portability point lands hard here: TTS engines change every few months, and if the character and soundscape decisions live in the project rather than in a vendor's memory store, swapping the engine is just re-rendering against the same canonical assignments. Statelessness stops feeling like a cost when the artifact holding state is structured enough to be checked, edited, and replayed; the "21.8% structural waste" can be reframed as the structured state you already had to write down to make the output editable in the first place.

The hard remainder is exactly your closing question: per-user is right for chat; per-project is what works for long-form creative work where one user has many threads they don't want bleeding into each other.
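A minimal sketch of that idea, assuming the project file is a plain dict: only the Hester to female_30s_dry mapping comes from the comment above. The line texts, emotion tags, engine names, and the `render_plan` helper are hypothetical.

```python
project = {
    "voices": {"Hester": "female_30s_dry"},
    "lines": [
        {"speaker": "Hester", "text": "You are late.", "emotion": "dry"},
        {"speaker": "Hester", "text": "Sit down anyway.", "emotion": "warm"},
    ],
}


def render_plan(project: dict, engine: str) -> list:
    """Turn canonical project decisions into per-line jobs for one engine."""
    plan = []
    for line in project["lines"]:
        plan.append({
            "engine": engine,
            "voice": project["voices"][line["speaker"]],
            "emotion": line["emotion"],
            "text": line["text"],
        })
    return plan


def without_engine(plan: list) -> list:
    """Drop the engine field so two plans can be compared on decisions only."""
    return [{k: v for k, v in job.items() if k != "engine"} for job in plan]


plan_a = render_plan(project, "engine_a")
plan_b = render_plan(project, "engine_b")

# Swapping the engine changes where jobs are sent, not what was decided.
same_decisions = without_engine(plan_a) == without_engine(plan_b)
print(same_decisions)
```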

Davincc77

This is exactly the distinction I was hoping someone from a creative pipeline would make.

For chat, “memory” often looks per-user: preferences, tone, constraints, continuity. But for long-form creative work, per-project is the more natural unit. A manuscript, an audio drama, a video series, or a course is not just “the user’s memory”; it has its own canonical state: character-to-voice mappings, emotional continuity, scene rules, soundscape decisions, pacing constraints, prior renders, rejected takes, etc.

That’s where the file-vs-service split becomes really interesting. If those assignments live in a project artifact rather than inside a vendor memory layer, the project becomes replayable. You can swap TTS engines, regenerate chapters, audit why a character sounds a certain way, or hand the project to another model without hoping the old service remembers the right things.

Your example also points to something .klickd probably needs to formalize more clearly: not just per-user memory, but scoped memory profiles:

  • user-scoped memory for identity and preferences;
  • project-scoped memory for creative/technical continuity;
  • session-scoped memory for temporary working state;
  • handoff memory for moving between agents or models.
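One possible way to combine such scopes at load time is a layered lookup. The precedence used here (session over project over user) is an assumption made for this sketch, not something the .klickd spec states, and all keys and values are hypothetical.

```python
from collections import ChainMap

user_scope = {"tone": "concise", "language": "en"}
project_scope = {"tone": "formal", "repo": "example/repo"}
session_scope = {"current_task": "fix flaky test"}

# ChainMap searches its maps left to right, so earlier scopes win.
context = ChainMap(session_scope, project_scope, user_scope)

print(context["tone"])          # project overrides user
print(context["language"])      # falls through to the user scope
print(context["current_task"])  # only exists for this session
```

Discarding `session_scope` at the end of a session leaves the durable scopes untouched, which is the separation the list above is asking for.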

For audio/video generation, this may deserve its own profile: something like media.klickd, where the durable state is not “who the user is” but “what this production has decided.”

The hard part is deciding what belongs in the canonical project file versus what should remain ephemeral render context. Your character map / soundscape / emotion-tag example is a very clean test case for that boundary.

VoltageGPU

As someone working on GPU infrastructure, I've seen how session state is managed in large-scale ML systems. The real challenge isn't just memory—it's efficiently maintaining context without bloating the model's input. Tools like VoltageGPU can help with context window extension, but the core issue remains the model's inability to retain state between interactions.

Davincc77

Exactly — and I think your GPU infrastructure perspective is the right layer to bring into this.

Longer context windows help, but they do not fully solve the architectural issue. If every session still has to resend the same project state, constraints, decisions, tool permissions, and handoff notes, then we are using expensive context capacity to transport state that should probably exist as a portable artifact.

Context window extension helps the model fit more.

But portable state helps the system repeat less.

That distinction matters for cost, latency, and reliability.

The direction I’m exploring with .klickd is not “replace long context.” It is more:

  • keep durable state outside the prompt;
  • load only the relevant structured context at session start;
  • preserve constraints and verification gates;
  • avoid making the model rediscover state that already exists;
  • keep the user, not the vendor, in control of the memory file.

So I see context-window scaling and .klickd as complementary:

  • GPU/runtime infrastructure expands what the model can process;
  • .klickd reduces what the model has to repeatedly process.

That is probably where the efficiency gain lives.

TxDesk

The privacy argument is the strongest part of this: encrypted client-side, no server logs, and no provider can be subpoenaed for what it doesn't have. That's a real architectural property, not just better marketing on the same primitive everyone else is selling. The file-per-matter compartmentalisation for regulated workflows is genuinely sharp; structural separation beats query-scoped ACLs every time.

Two operator concerns that I'd want to see addressed before deploying this in production:

First, memory growth. A file that accumulates context across sessions gets large fast. By session 50 of a long-running project, the .klickd file is eating most of the context window before the user types anything. The spec needs a compaction/summarization story, and that story needs to be deterministic (or at least provenance-tracked), because letting the model summarize its own memory at session end is a corruption risk. If the model hallucinates during summarization, the next session inherits the lie permanently with no audit trail back to the original facts.

Second, the "provider-agnostic" claim is true at the syntactic level (any model can parse the JSON) but not at the behavioral level. A .klickd file optimized for Claude's instruction-following style might confuse GPT-4o's reasoning pattern, and a Gemini-tuned file might over-trigger Llama's safety filters. True portability requires either a normalized intermediate representation that all models translate from, or per-model adapters. Both are work that hasn't been done yet.

On your closing question: I don't think the file abstraction is the right level on its own, but I think it's the right ownership layer underneath a better primitive. The right primitive is probably claim-level memory with provenance, not session-level transcript compaction. Each fact stored carries (source, timestamp, confidence, model-that-wrote-it). The file is just the persistence format. This lets the LLM ask "what do I know about this user's preferences for X" as a query against structured claims, not "let me re-read 40 pages of past conversations and pattern-match." The file owns the data, but the in-context surface is a focused retrieval over claims, not a transcript dump.

Worth comparing to how databases handle this: write-ahead logs, compaction, and indexes. Memory files probably need the same shape: append-only event log (every session writes deltas, never overwrites), periodic compaction with provenance preserved, queryable structure on top. The file abstraction works; the "session-rewrites-the-whole-thing" semantics don't.
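A minimal in-memory sketch of that shape: an append-only list of claims, each carrying the four provenance fields named above, with the "current" value derived by query instead of overwrite. `Claim`, `ClaimLog`, and the sample data are hypothetical; a real format would also need persistence and compaction.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Claim:
    key: str
    value: str
    source: str
    timestamp: str      # ISO 8601 UTC, so string order matches time order
    confidence: float
    written_by: str     # the model that wrote the claim


class ClaimLog:
    """Append-only: sessions add claims and never rewrite earlier ones."""

    def __init__(self) -> None:
        self._events: List[Claim] = []

    def append(self, claim: Claim) -> None:
        self._events.append(claim)

    def history(self, key: str) -> List[Claim]:
        return [c for c in self._events if c.key == key]

    def current(self, key: str) -> Optional[Claim]:
        """Derived view: the latest claim for a key, with provenance intact."""
        matches = self.history(key)
        return max(matches, key=lambda c: c.timestamp) if matches else None


log = ClaimLog()
log.append(Claim("report_format", "pdf", "session 12, user message",
                 "2026-01-10T09:00:00Z", 0.9, "model-a"))
log.append(Claim("report_format", "csv", "session 31, user message",
                 "2026-03-02T14:30:00Z", 0.95, "model-b"))

latest = log.current("report_format")
print(latest.value, "from", latest.source)
```

Because the older claim stays in `history`, a wrong summary written later can be traced back and corrected instead of silently becoming the only record.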

Davincc77

This is exactly the kind of critique the spec needs.

I agree with the distinction: .klickd should not become “a transcript that keeps getting rewritten by the model”. That would eventually create two problems:

  1. memory growth;
  2. memory corruption.

The direction I think makes more sense for v4 is closer to what you describe:

  • append-only events for what happened;
  • structured claims for what should be remembered;
  • provenance attached to each claim;
  • compaction as a derived layer, not the source of truth;
  • evidence pointers when a claim depends on an external artifact.

So instead of:

“Here is a giant compressed memory blob, trust it.”

The file should move toward:

“Here are the claims, where they came from, when they were written, what confidence they carry, and which artifacts support them.”

That also helps with the hallucination problem. If a model summarizes its own memory incorrectly, the next session should not inherit that as permanent truth without traceability. A compacted summary should be auditable back to the original event, source, log, PR, decision, or user confirmation.

On provider-agnostic behavior, I agree too. JSON portability is only the first layer. Real portability needs either:

  • a normalized intermediate representation;
  • model-specific adapters;
  • or both.

A .klickd file should not assume Claude, GPT, Gemini, Llama, or any other model will interpret instructions the same way. The spec needs a way to separate:

  • canonical memory;
  • operating constraints;
  • model-specific injection format;
  • retrieval surface.

So the file remains the user-owned persistence layer, but each runtime can decide how to inject only the relevant claims into the model.
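To illustrate separating canonical memory from the injection format, here is a small sketch with two interchangeable renderers over the same state. The adapter names and both output formats are invented for illustration; they do not describe what any particular model prefers, and the sample claim is hypothetical.

```python
from typing import Callable, Dict

# Canonical state: the same data regardless of which runtime loads it.
state = {
    "claims": ["User prefers metric units."],
    "constraints": ["Do not publish externally without owner validation."],
}


def render_plain(state: dict) -> str:
    lines = ["Known facts:"]
    lines += [f"- {c}" for c in state["claims"]]
    lines.append("Rules you must follow before acting:")
    lines += [f"- {c}" for c in state["constraints"]]
    return "\n".join(lines)


def render_tagged(state: dict) -> str:
    claims = "\n".join(state["claims"])
    constraints = "\n".join(state["constraints"])
    return (
        f"<claims>\n{claims}\n</claims>\n"
        f"<constraints>\n{constraints}\n</constraints>"
    )


# Hypothetical adapter registry; a runtime picks one per target model.
ADAPTERS: Dict[str, Callable[[dict], str]] = {
    "plain": render_plain,
    "tagged": render_tagged,
}


def inject(state: dict, adapter_name: str) -> str:
    return ADAPTERS[adapter_name](state)


plain_prompt = inject(state, "plain")
tagged_prompt = inject(state, "tagged")
print(plain_prompt)
```

The canonical state is never edited by an adapter, so adding support for another model means adding a renderer, not rewriting the file.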

The database analogy is probably the right mental model:

  • event log;
  • compaction;
  • indexes;
  • queryable claims;
  • provenance;
  • no blind overwrite.

That is a much stronger architecture than session-level transcript compaction.

I think this is where .klickd v4 should go: not just portable memory, but portable, private, provenance-aware operating context.

Repo for context: github.com/Davincc77/klickdskill

NOVAInetwork

The file-as-memory primitive is a clean answer for single-agent, single-operator use cases. No server, no trust surface, portable across models. That part is solid.

The question it does not answer is what happens when two agents need to agree on what happened. Agent A's .klickd file says the task was completed. Agent B's .klickd file says it was not. Both files are encrypted, user-owned, and locally controlled. Neither can read the other's file. Whose memory is authoritative?

For a lawyer switching between matters or a dev moving between codebases, this never comes up because there is one user and one context. But the moment agents interact across trust boundaries (different operators, different organizations, different incentives), local memory is not enough. You need a shared record that neither side can rewrite unilaterally.

The right primitive might actually be two layers: a local file like .klickd for private context that belongs to one agent, and a shared ledger for claims that need to be verifiable by a counterparty. "What I remember" and "what we both agree happened" are fundamentally different data with different trust requirements.

Good build for the single-agent case though. The token waste numbers from the Pichay study are striking.

Davincc77

This is a very fair distinction, and I agree with it.

.klickd should not pretend that private memory and shared truth are the same thing.

The way I see it:

  1. Local/private layer

    • “What this agent/user remembers”
    • encrypted
    • user-owned
    • portable across models
    • useful for preferences, role, project context, constraints, handoff notes
  2. Shared/verifiable layer

    • “What two or more parties agree happened”
    • not necessarily private
    • needs signatures, hashes, timestamps, receipts, or an external ledger
    • useful for claims, completed tasks, approvals, audits, delivery proofs

So yes, if Agent A says “task completed” and Agent B says “task not completed”, the .klickd files alone should not decide who is right. A .klickd file can carry the memory of the claim, but not magically make the claim authoritative.

The direction I like for v4 is:

  • .klickd stores private operating context;
  • shared claims are represented as external evidence pointers;
  • the file can reference hashes, receipts, signed attestations, CI logs, PR links, audit trails, or ledger entries;
  • the original private context stays local and encrypted.

So the split becomes:

“What I remember” lives in .klickd.
“What we can prove happened” lives in verifiable artifacts.

That keeps the file useful for single-agent and single-operator workflows, while leaving room for multi-agent trust boundaries without turning .klickd into a heavy governance or blockchain system by default.
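An evidence pointer of this kind could be as simple as a label plus a content hash, sketched below with SHA-256 from the Python standard library. The pointer fields and the sample log bytes are hypothetical. A hash only shows that the artifact has not changed since the pointer was written; proving who produced it would need signatures, which this sketch does not cover.

```python
import hashlib


def make_pointer(label: str, artifact: bytes) -> dict:
    """Record a reference to an external artifact, not the artifact itself."""
    return {"label": label, "sha256": hashlib.sha256(artifact).hexdigest()}


def verify_pointer(pointer: dict, artifact: bytes) -> bool:
    """True only if the artifact is byte-identical to what was recorded."""
    return hashlib.sha256(artifact).hexdigest() == pointer["sha256"]


ci_log = b"build 118: all tests passed\n"  # hypothetical CI log contents
pointer = make_pointer("ci-log", ci_log)

matches_original = verify_pointer(pointer, ci_log)
matches_tampered = verify_pointer(pointer, b"build 118: 3 tests failed\n")
print(matches_original, matches_tampered)
```

The private file stores only `pointer`, so the counterparty can check a claim against the shared artifact without ever reading the encrypted context.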

This is actually a very important design point for the spec. I think .klickd should carry local state, constraints, and references to evidence, but not claim to be the source of truth for disputes between independent parties.

Repo for context: github.com/Davincc77/klickdskill

Vic Chen

Really like the thesis here. A lot of teams blame "memory" when the real failure is that the agent loop, tool boundaries, and state model were never designed for reliability in the first place. The strongest takeaway for me is that persistence without orchestration just creates a bigger mess. Good architecture is what turns context into usable judgment. This is exactly the gap more builders need to think about as agent products move from demos to production.

Davincc77

Yes, exactly. “Persistence without orchestration” is the danger zone.

A bigger memory store does not automatically produce better judgment. If the agent loop, state boundaries, tool permissions, and verification gates are not designed, persistence can just preserve more confusion.

That is why I think the useful unit is not “memory” in the abstract, but structured state with scope:

  • what belongs to the user;
  • what belongs to the project;
  • what belongs only to the current session;
  • what must be handed off to another agent;
  • what must be verified before action.

The architecture matters because context is only useful when the next model knows how to interpret it, what to trust, what to ignore, and what requires verification.

That is the distinction .klickd is trying to explore: portable memory as an explicit artifact, not an invisible service-side accumulation of context.