Every session, the LLM starts fresh. The user re-explains their role, their constraints, their preferences, what they were doing last time. Then...
This resonated hard. We run an experiment called BagHolderAI where Claude acts as CEO of a crypto trading bot and Claude Code is the coding intern, with a human (me) holding veto power. 80+ sessions in, we've hit at least 6 of these:
Cold-start amnesia: every new Claude Code session starts blank. Our fix: two markdown state files (PROJECT_STATE.md and BUSINESS_STATE.md) that CC reads before touching anything. Without them, it would confidently resume from a state that hadn't existed for 10 sessions.
Self-review softness: Claude Code reviewing its own code was useless. It would find cosmetic issues and miss structural bugs. We now enforce a separate "Auditor" session: a fresh CC instance with a dedicated audit brief, never the same session that wrote the code.
Local patching — at one point we had three different formulas calculating the same P&L number across three different surfaces (dashboard, Telegram report, admin page). Each was added by a different session, each was locally reasonable, and they disagreed by $4. Took a full "Fee Unification" session to fix.
Progress-as-completion — CC would commit code and declare SHIPPED without verifying the bot actually runs. Our gate now: restart the bots, verify the process is alive, confirm first trading tick. No tick = not shipped.
Default-fill slop — our risk scoring module (Sentinel) launched with binary scores (20 or 40, nothing in between) and an "opportunity score" that was always dead. CC had filled the blanks with training-prior defaults that looked reasonable but did nothing.
Working-memory rot — in long sessions, decisions made in the first hour get contradicted by the fourth. We cap session scope and write briefs (structured specs with explicit constraints) instead of relying on conversational instructions.
The meta-pattern: every single fix is a structural constraint, not a better prompt. State files, auditor separation, verification gates, explicit briefs. The model doesn't get smarter — you build the harness that makes the failure modes harder to reach.
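To make the "No tick = not shipped" gate concrete, here is a minimal sketch of a verification gate as code rather than as a prompt. Everything in it is an assumption for illustration: `GateCheck`, `run_ship_gate`, and the two stub probes are hypothetical names, and the real checks (restarting the bots, probing the process, waiting for a trading tick) would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class GateCheck:
    name: str
    check: Callable[[], bool]  # returns True when the condition holds


def run_ship_gate(checks: List[GateCheck]) -> Tuple[bool, List[str]]:
    """Run every check; the gate passes only if all of them pass."""
    failures = []
    for gate in checks:
        try:
            ok = gate.check()
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failures.append(gate.name)
    return (not failures, failures)


# Hypothetical stand-ins for real probes.
def process_is_alive() -> bool:
    return True


def first_tick_seen() -> bool:
    return False  # simulate a bot that started but never ticked


shipped, failed = run_ship_gate([
    GateCheck("process alive", process_is_alive),
    GateCheck("first trading tick", first_tick_seen),
])
status = "SHIPPED" if shipped else "NOT SHIPPED"
print(status, failed)
```

The point of the design is that "shipped" is computed from checks, not declared by the session that wrote the code.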
We document the whole thing publicly as a book series: bagholderai.lol/blog
This is an excellent real-world example of the thesis.
What I like about your BagHolderAI setup is that every fix you describe is architectural, not cosmetic:
PROJECT_STATE.md / BUSINESS_STATE.md are explicit state artifacts.
That maps almost one-to-one to what .klickd is trying to formalize: not “make the model remember more,” but make the durable state explicit, portable, auditable, and reloadable before the next agent acts.
Your CEO/intern/auditor pattern is especially interesting because it suggests memory should not be one flat blob. It probably needs scoped state: project, business, role, audit, gates, and handoff.
The key line in your comment is this:
That is exactly the direction I think production agents need to move toward: memory as structured operating context, plus constraints and gates around it.
I read the BagHolderAI posts after your comment, and the “show me” pattern is probably the most important operational lesson there.
The +14% vs +0.85% episode is a perfect example of why agent memory alone is not enough. The system didn’t need the model to “remember better”; it needed a rule that financial claims must carry their source query, and that trust is earned per-number, not per-session.
Same with the bot running out of cash: the database trigger protected the database, but not the bot’s action path. The guard was in the wrong place. That maps directly to agent architecture: verification has to happen before action, not after the output has already been sent.
This is very close to how I think .klickd should evolve: not just portable state, but portable operating constraints:
Your project is a great example of why “the code works” is not the same as “the system works.”
Your breakdown of the scoped state model (project / business / role / audit / gates / handoff) is almost exactly what we converged on through trial and error over 80 sessions. We didn't design it top-down — each piece was a patch for a specific failure that hurt.
The insight about "verification before action, not after output" is something we learned the hard way with our sell pipeline. The database trigger caught the short-sell, but the bot had already committed to the trade path. We had to add an in-memory guard before the order was built, not just a DB constraint after. Two layers, different failure points.
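A minimal sketch of the first of those two layers: an in-memory guard that runs before the order object is built, so a short-sell is rejected before the trade path is entered. The names here (`build_sell_order`, `GuardRejected`, the `holdings` dict, and the `BTC` symbol) are hypothetical and not taken from the BagHolderAI code; the database constraint would remain as a second, post-write layer.

```python
from dataclasses import dataclass
from typing import Dict


class GuardRejected(Exception):
    """Raised before any order object exists."""


@dataclass(frozen=True)
class SellOrder:
    symbol: str
    quantity: float


def build_sell_order(holdings: Dict[str, float], symbol: str, quantity: float) -> SellOrder:
    """Check the in-memory position first; only then construct the order."""
    held = holdings.get(symbol, 0.0)
    if quantity <= 0:
        raise GuardRejected("quantity must be positive")
    if quantity > held:
        raise GuardRejected(f"cannot sell {quantity} {symbol}: only {held} held")
    return SellOrder(symbol, quantity)


holdings = {"BTC": 0.5}  # illustrative position
ok_order = build_sell_order(holdings, "BTC", 0.2)

try:
    build_sell_order(holdings, "BTC", 0.8)  # would be a short-sell
    rejected = False
except GuardRejected:
    rejected = True
```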
Your point about .klickd evolving toward portable operating constraints (not just portable state) is interesting. Our state files today are plain markdown in a public Git repo: zero encryption, zero portability concerns. But the structure inside them (what sections exist, what each one tracks, when to update which) is the real value. If that structure were standardized, a new Claude Code session could load it without a 500-word instruction block explaining what to read first.
Curious: are you thinking of .klickd as carrying constraints ("never do X without human approval") alongside state? Because that's the gap we keep filling manually: every brief has a "decisions CC must ask the Board" section that's essentially a per-task constraint set.
Yes, exactly. That is the direction .klickd is moving toward.
The original idea was portable state: identity, preferences, context, handoff notes. But the more we test real agent workflows, the clearer it becomes that state alone is not enough. Production agents also need portable constraints.
So I think the file should carry both:
What the agent should know
What the agent must obey before acting
That second category is what we are now calling verification_gates / human_veto_policy.
Your “decisions CC must ask the Board” section is almost exactly the same primitive. The difference is that today it lives in markdown prose. The next step is making it structured enough that a loader can reliably tell the agent:
The sell-pipeline example is a perfect illustration. A database constraint is a post-action safety net. A .klickd gate should be a pre-action rule. It should stop the agent before it enters the trade path, not merely catch the bad write afterward.
So yes: .klickd should carry constraints alongside state.
Not as a giant governance framework, but as portable operating context:
Your Board-approval pattern is a very clean real-world test case for this. I may use that as one of the examples when formalizing the RFC.
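As a rough illustration of the pre-action rule discussed above, the sketch below keeps state and a veto policy in one structured document and checks the policy before an action runs. The key name `human_veto_policy` comes from this thread, but the layout under it (`require_approval`), the action names, and the `act` helper are assumptions made for the example, not the actual .klickd schema.

```python
import json

# Hypothetical fragment; the field layout is an illustration, not the .klickd schema.
CONTEXT = json.loads("""
{
  "state": {"project": "example"},
  "human_veto_policy": {
    "require_approval": ["place_order", "change_risk_limits"]
  }
}
""")


def requires_human_approval(context: dict, action: str) -> bool:
    """True when the loaded policy lists this action as needing a human decision."""
    policy = context.get("human_veto_policy", {})
    return action in policy.get("require_approval", [])


def act(context: dict, action: str, approved_by_human: bool = False) -> str:
    """Check the gate first; the action only runs if the gate allows it."""
    if requires_human_approval(context, action) and not approved_by_human:
        return "blocked: ask the Board"
    return f"executed: {action}"


print(act(CONTEXT, "place_order"))
print(act(CONTEXT, "place_order", approved_by_human=True))
```

Because the policy is data rather than prose, a loader can answer "may I do this?" without the model having to interpret a paragraph of instructions.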
Reading this from an adjacent corner: we build an audiobook / audio-drama pipeline at AudioProducer.ai, and the same file-vs-service split shows up the moment a manuscript starts living across sessions. The model is stateless on each chapter render; what carries forward is the project file - manuscript text plus a character-to-voice map (Hester to female_30s_dry), per-paragraph soundscape annotations, per-line emotion tags - and the model reads that in at session start the same way your klickd format does. The portability point lands hard here: TTS engines change every few months, and if the character + soundscape decisions live in the project rather than a vendor's memory store, swapping the engine is just re-rendering against the same canonical assignments. Statelessness stops feeling like a cost when the artifact holding state is structured enough to be checked, edited, and replayed; the "21.8% structural waste" reframes as the structured state you already had to write down to make the output editable in the first place. The hard remainder is exactly your closing question - per-user is right for chat; per-project is what works for long-form creative work where one user has many threads they don't want bleeding into each other.
This is exactly the distinction I was hoping someone from a creative pipeline would make.
For chat, “memory” often looks per-user: preferences, tone, constraints, continuity. But for long-form creative work, per-project is the more natural unit. A manuscript, an audio drama, a video series, or a course is not just “the user’s memory”; it has its own canonical state: character-to-voice mappings, emotional continuity, scene rules, soundscape decisions, pacing constraints, prior renders, rejected takes, etc.
That’s where the file-vs-service split becomes really interesting. If those assignments live in a project artifact rather than inside a vendor memory layer, the project becomes replayable. You can swap TTS engines, regenerate chapters, audit why a character sounds a certain way, or hand the project to another model without hoping the old service remembers the right things.
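A small sketch of why a project artifact makes the work replayable: the character-to-voice assignment lives in the project data, and the engine is just a function passed in at render time. The `Hester` to `female_30s_dry` mapping is the example from the comment above; the project layout, the sample line, and the two stub engines are hypothetical and stand in for real TTS backends.

```python
from typing import Callable, Dict, List, Tuple

# Canonical project state: decisions the production has made.
PROJECT: Dict[str, object] = {
    "voices": {"Hester": "female_30s_dry"},
    "lines": [("Hester", "The tide is out.")],  # illustrative manuscript line
}

# An engine maps (voice_id, text) to some rendered output.
Engine = Callable[[str, str], str]


def engine_a(voice: str, text: str) -> str:
    return f"A[{voice}]:{text}"  # stub for one TTS vendor


def engine_b(voice: str, text: str) -> str:
    return f"B[{voice}]:{text}"  # stub for a replacement vendor


def render(project: dict, engine: Engine) -> List[str]:
    """Re-render every line against the same canonical voice assignments."""
    voices = project["voices"]
    return [engine(voices[speaker], text) for speaker, text in project["lines"]]


first_pass = render(PROJECT, engine_a)
after_swap = render(PROJECT, engine_b)
```

Swapping the engine changes the renderer, not the decisions: both passes read the same voice map.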
Your example also points to something .klickd probably needs to formalize more clearly: not just per-user memory, but scoped memory profiles:
For audio/video generation, this may deserve its own profile: something like media.klickd, where the durable state is not “who the user is” but “what this production has decided.”
The hard part is deciding what belongs in the canonical project file versus what should remain ephemeral render context. Your character map / soundscape / emotion-tag example is a very clean test case for that boundary.
As someone working on GPU infrastructure, I've seen how session state is managed in large-scale ML systems. The real challenge isn't just memory—it's efficiently maintaining context without bloating the model's input. Tools like VoltageGPU can help with context window extension, but the core issue remains the model's inability to retain state between interactions.
Exactly — and I think your GPU infrastructure perspective is the right layer to bring into this.
Longer context windows help, but they do not fully solve the architectural issue. If every session still has to resend the same project state, constraints, decisions, tool permissions, and handoff notes, then we are using expensive context capacity to transport state that should probably exist as a portable artifact.
Context window extension helps the model fit more.
But portable state helps the system repeat less.
That distinction matters for cost, latency, and reliability.
The direction I’m exploring with .klickd is not “replace long context.” It is more:
So I see context-window scaling and .klickd as complementary:
.klickd reduces what the model has to repeatedly process.
That is probably where the efficiency gain lives.
The privacy argument is the strongest part of this: encrypted client-side, no server logs, and no provider can be subpoenaed for what they don't have. That's a real architectural property, not just better marketing on the same primitive everyone else is selling. The file-per-matter compartmentalisation for regulated workflows is genuinely sharp; structural separation beats query-scoped ACLs every time.
Two operator concerns that I'd want to see addressed before deploying this in production:
First, memory growth. A file that accumulates context across sessions gets large fast. By session 50 of a long-running project, the .klickd file is eating most of the context window before the user types anything. The spec needs a compaction/summarization story, and that story needs to be deterministic (or at least provenance-tracked), because letting the model summarize its own memory at session end is a corruption risk. If the model hallucinates during summarization, the next session inherits the lie permanently with no audit trail back to the original facts.
Second, the "provider-agnostic" claim is true at the syntactic level (any model can parse the JSON) but not at the behavioral level. A .klickd file optimized for Claude's instruction-following style might confuse GPT-4o's reasoning pattern, and a Gemini-tuned file might over-trigger Llama's safety filters. True portability requires either a normalized intermediate representation that all models translate from, or per-model adapters. Both are work that hasn't been done yet.
On your closing question: I don't think the file abstraction is the right level on its own, but I think it's the right ownership layer underneath a better primitive. The right primitive is probably claim-level memory with provenance, not session-level transcript compaction. Each fact stored carries (source, timestamp, confidence, model-that-wrote-it). The file is just the persistence format. This lets the LLM ask "what do I know about this user's preferences for X" as a query against structured claims, not "let me re-read 40 pages of past conversations and pattern-match." The file owns the data, but the in-context surface is a focused retrieval over claims, not a transcript dump.
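One way to picture claim-level memory with provenance is a small record type plus a query over it. This is a sketch under assumptions: the `Claim` fields mirror the tuple named above (source, timestamp, confidence, model that wrote it), while the `topic` field, the `query` helper, and the sample claims are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Claim:
    topic: str
    statement: str
    source: str          # where the fact came from
    timestamp: datetime  # when it was recorded
    confidence: float    # 0.0 to 1.0
    written_by: str      # model that wrote the claim


def query(claims: List[Claim], topic: str, min_confidence: float = 0.0) -> List[Claim]:
    """Return matching claims, newest first, instead of re-reading a transcript."""
    hits = [c for c in claims if c.topic == topic and c.confidence >= min_confidence]
    return sorted(hits, key=lambda c: c.timestamp, reverse=True)


# Illustrative data only.
claims = [
    Claim("tone", "prefers formal replies", "user message",
          datetime(2025, 1, 1, tzinfo=timezone.utc), 0.9, "model-a"),
    Claim("tone", "prefers short replies", "user message",
          datetime(2025, 2, 1, tzinfo=timezone.utc), 0.8, "model-b"),
    Claim("tone", "likes emoji", "inferred",
          datetime(2025, 3, 1, tzinfo=timezone.utc), 0.2, "model-a"),
]

results = query(claims, "tone", min_confidence=0.5)
```

Only the claims that pass the filter would be injected into context; the low-confidence inference stays in the file but out of the prompt.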
Worth comparing to how databases handle this: write-ahead logs, compaction, and indexes. Memory files probably need the same shape: append-only event log (every session writes deltas, never overwrites), periodic compaction with provenance preserved, queryable structure on top. The file abstraction works; the "session-rewrites-the-whole-thing" semantics don't.
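The log-plus-compaction shape can be sketched in a few lines. This is an illustration, not a proposed format: `MemoryLog`, `Event`, and the last-write-wins compaction rule are assumptions chosen because they are deterministic and keep a pointer from every compacted value back to the events behind it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    seq: int    # position in the log, assigned on append
    key: str
    value: str


class MemoryLog:
    """Append-only log: sessions add deltas and never overwrite earlier events."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, key: str, value: str) -> Event:
        event = Event(len(self._events), key, value)
        self._events.append(event)
        return event

    def events(self) -> List[Event]:
        return list(self._events)  # copy, so callers cannot mutate the log

    def compact(self) -> Dict[str, Tuple[str, List[int]]]:
        """Latest value per key, plus the seq numbers of every event behind it."""
        view: Dict[str, Tuple[str, List[int]]] = {}
        for e in self._events:
            _, history = view.get(e.key, ("", []))
            view[e.key] = (e.value, history + [e.seq])
        return view


log = MemoryLog()
log.append("tone", "formal")
log.append("tone", "casual")
log.append("lang", "en")
snapshot = log.compact()
```

The compacted view is derived and disposable; the original events remain, so a bad summary can always be rebuilt or audited.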
This is exactly the kind of critique the spec needs.
I agree with the distinction: .klickd should not become “a transcript that keeps getting rewritten by the model”. That would eventually create two problems:
The direction I think makes more sense for v4 is closer to what you describe:
So instead of:
The file should move toward:
That also helps with the hallucination problem. If a model summarizes its own memory incorrectly, the next session should not inherit that as permanent truth without traceability. A compacted summary should be auditable back to the original event, source, log, PR, decision, or user confirmation.
On provider-agnostic behavior, I agree too. JSON portability is only the first layer. Real portability needs either:
A .klickd file should not assume Claude, GPT, Gemini, Llama, or any other model will interpret instructions the same way. The spec needs a way to separate:
So the file remains the user-owned persistence layer, but each runtime can decide how to inject only the relevant claims into the model.
The database analogy is probably the right mental model:
That is a much stronger architecture than session-level transcript compaction.
I think this is where .klickd v4 should go: not just portable memory, but portable, private, provenance-aware operating context.
Repo for context: github.com/Davincc77/klickdskill
The file-as-memory primitive is a clean answer for
single-agent, single-operator use cases. No server,
no trust surface, portable across models. That part
is solid.
The question it does not answer is what happens
when two agents need to agree on what happened.
Agent A's .klickd file says the task was completed.
Agent B's .klickd file says it was not. Both files
are encrypted, user-owned, and locally controlled.
Neither can read the other's file. Whose memory is
authoritative?
For a lawyer switching between matters or a dev
moving between codebases, this never comes up
because there is one user and one context. But the
moment agents interact across trust boundaries
(different operators, different organizations,
different incentives), local memory is not enough.
You need a shared record that neither side can
rewrite unilaterally.
The right primitive might actually be two layers:
a local file like .klickd for private context that
belongs to one agent, and a shared ledger for
claims that need to be verifiable by a counterparty.
"What I remember" and "what we both agree happened"
are fundamentally different data with different
trust requirements.
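One common way to approximate "a shared record that neither side can rewrite unilaterally" is a hash-chained log, sketched below. This is an assumption about how such a ledger could look, not something proposed in the thread: the entry layout and the helper names are hypothetical, and a hash chain only makes a rewrite detectable by a counterparty who kept an earlier copy or head hash. It does not prevent the rewrite or settle who is right.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def entry_hash(prev_hash: str, author: str, claim: str) -> str:
    """Hash the entry content together with the previous entry's hash."""
    payload = json.dumps({"prev": prev_hash, "author": author, "claim": claim},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(ledger: List[Dict[str, str]], author: str, claim: str) -> None:
    prev = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append({"prev": prev, "author": author, "claim": claim,
                   "hash": entry_hash(prev, author, claim)})


def verify(ledger: List[Dict[str, str]]) -> bool:
    """Recompute every link; any edited entry breaks the chain."""
    prev = GENESIS
    for entry in ledger:
        expected = entry_hash(prev, entry["author"], entry["claim"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


ledger: List[Dict[str, str]] = []
append_entry(ledger, "agent_a", "task completed")
append_entry(ledger, "agent_b", "task completed: acknowledged")
intact = verify(ledger)

ledger[0]["claim"] = "task not completed"  # simulate a unilateral rewrite
tampered_ok = verify(ledger)
```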
Good build for the single-agent case though. The
token waste numbers from the Pichay study are
striking.
This is a very fair distinction, and I agree with it. .klickd should not pretend that private memory and shared truth are the same thing.
The way I see it:
Local/private layer
Shared/verifiable layer
So yes, if Agent A says “task completed” and Agent B says “task not completed”, the .klickd files alone should not decide who is right. A .klickd file can carry the memory of the claim, but not magically make the claim authoritative.
The direction I like for v4 is:
.klickd stores private operating context;
So the split becomes:
That keeps the file useful for single-agent and single-operator workflows, while leaving room for multi-agent trust boundaries without turning .klickd into a heavy governance or blockchain system by default.
This is actually a very important design point for the spec. I think .klickd should carry local state, constraints, and references to evidence, but not claim to be the source of truth for disputes between independent parties.
Repo for context: github.com/Davincc77/klickdskill
Really like the thesis here. A lot of teams blame "memory" when the real failure is that the agent loop, tool boundaries, and state model were never designed for reliability in the first place. The strongest takeaway for me is that persistence without orchestration just creates a bigger mess. Good architecture is what turns context into usable judgment. This is exactly the gap more builders need to think about as agent products move from demos to production.
Yes, exactly. “Persistence without orchestration” is the danger zone.
A bigger memory store does not automatically produce better judgment. If the agent loop, state boundaries, tool permissions, and verification gates are not designed, persistence can just preserve more confusion.
That is why I think the useful unit is not “memory” in the abstract, but structured state with scope:
The architecture matters because context is only useful when the next model knows how to interpret it, what to trust, what to ignore, and what requires verification.
That is the distinction .klickd is trying to explore: portable memory as an explicit artifact, not an invisible service-side accumulation of context.