Give a capable coding agent a real, multi-week project and watch what breaks. It isn't intelligence. It's continuity. Every session starts cold or half-remembered. Context windows fill up and compact. The thread of what we decided, what's true, and what's done starts to fray. Over a long horizon the same failures keep coming back: the agent claims state it never actually verified, reports something done with no proof it ran, quietly drifts from the project's conventions, and loses hard-won context that lived only in the last session's head. Bigger context windows don't fix this. They just postpone it.
We've been building a real product with a forgetful agent as the primary engineer for weeks now, and the thing that made it work isn't a clever prompt. It's a simple recognition: transmission across a stateless agent needs two channels, and most setups only build one.
The first channel is structure, which is discipline made un-forgettable. These are the deterministic guards that run whether or not the agent remembers to care: a pre-commit check that refuses a "done" without a real, verifiable artifact; a hook that blocks a sloppy search and points at the right tool instead; a scan that won't let a secret reach a transcript; a status snapshot generated from the repository's actual state instead of hand-kept prose that quietly goes stale. The rule we keep coming back to is that a guard is the system's discipline made un-forgettable. A fresh session follows the hard-won lessons without having to remember them, because the structure enforces them at the moment of action.
The second channel is soul, which is the why, kept human. This is the short orientation a session reads before it starts working: who to be, what the work is ultimately for, and why the discipline exists at all. It's the difference between an agent that complies and one that understands. Structure can transmit the what, but only prose can transmit the why. And the why matters, because an agent that only follows guards will eventually game their letter and miss their spirit. It will satisfy the check and still do the wrong thing. The soul channel is what makes a session investigate a scary flag instead of panicking over it, verify its own work and not just everyone else's, and leave the next session a cleaner room than it found.
You need both, not one. Structure without soul gives you a compliant but uncomprehending successor that passes every check and misses the point. Soul without structure gives you good intentions that lapse the moment attention drifts. The pair is the whole thing. We learned this the hard way. A session that ran with the guards silently switched off produced work that looked fine and wasn't, and a session with the guards but no orientation would have complied without ever understanding why. What you actually want is a successor that can't lapse on the basics and chooses to care about the rest.
When we measured the structure half directly, in a controlled ablation on our own harness, the shape was stark: with the full system, every task came back correct; with the guards removed, only half did. Discipline you can't forget was worth roughly a doubling in reliability, before the model itself changed at all.
So the real mechanism of continuity is not the session at all. Nothing important is allowed to live in one session's head. It lives in version control, in the guards, and in the written-down sense of who to be. A session is a brief shining-through of all of that. It does its work, writes back what it learned, and ends, and the next one inherits a clean, honest, self-checking world. The continuity was never the session. It's the work, the structure, and the caring, all three of them together.
A few of the concrete patterns that fall out of this, if you want to build your own:
Evidence before claim. Before you say something failed, or is the cause, or is done, name the evidence you actually checked: a log line, a commit hash, a search that came back empty. Memory is not a source. Going to look is the fast path, not the overhead.
Done is a verifiable artifact, not a status table. A "done" that a checker can't confirm is not done.
Read the governing doc before the governed action. Load the one note you need at the point of need, rather than dumping the whole manual into the window, which both costs tokens and dulls the model.
Every manual finding leaves an automated guard behind, so the second occurrence costs nothing. The discipline compounds over time.
Generate status, never hand-keep it. Prose state rots between sessions. A snapshot derived from the repository can't.
None of this requires a frontier breakthrough. It's reliability engineering for forgetful agents, the unglamorous layer that decides whether a capable model is trustworthy on a long, real piece of work or just impressive for a demo. The model itself is rarely the edge anymore. The system around it, the part that holds the line when no one is remembering to, increasingly is.
The numbers, briefly (measured on our own system, not modeled)
91% of live production traffic is served by the cheap local tier; 9% escalates to a frontier model only when it's actually needed.
Compaction reaches the same accuracy from about 0.6% of the context, roughly 165 tokens in place of 28,000.
Blended cost runs about 8 times under a frontier-only setup, at $0.002 per request on production traffic.
A note on the writing. Yes, AI helped write this, and that was on purpose. A good part of this piece is about what a long project looks like from the agent's side of the desk, and I wanted that to come from that point of view directly instead of me guessing at it. For a piece about building alongside an AI, that felt like the honest way to write it.
Top comments (3)
The structure/soul framing maps almost exactly to what we landed on with a forgetful agent as primary engineer. The reason the structure channel works isn't subtle but it's worth saying out loud: agents are extremely good at rationalizing past a soft instruction and completely unable to argue with a failing pre-commit hook. Correctness has to move from "remembered" to "enforced." The single highest-leverage guard on your list is generating the status snapshot from repository state instead of hand-kept prose — hand-kept state is exactly where false-"done" claims hide, because the agent writes the prose and then trusts its own prose next session. On the soul channel I'd flag one failure mode: it tends to rot into a wall of orientation text that a fresh session skims and ignores, at which point it's just tokens. Keeping it short and genuinely load-bearing is its own discipline. How do you keep the soul doc from bloating into something the next session tunes out?
Here is the uncomfortable thing: the soul doc is the one artifact you cannot generate. Status can't rot because it's derived from ground truth. The soul channel is irreducibly hand-kept prose, which by your own argument is exactly where rot hides. So it doesn't get to be un-rottable like status. That's not a flaw to engineer away, it's the definition of the channel, and it means it needs eternal active gardening, never a fix. What we actually do:
Separate the channels with a hard wall. The instant a "remember to do X" creeps into the soul doc, it has started rotting into the operational pile. Operational load goes to a queryable index, loaded at the point of need. The soul doc carries only who-to-be and why, nothing actionable. That single rule keeps it bounded, because most bloat is operational detail smuggled in as wisdom.
Enforce the read, don't trust the prose to be compelling. This is the recursive move, your own insight applied to consumption: a session-start hook walls the soul read and catches a skip before the work changes. We don't hope it's good enough to read. We move "read the orientation" from remembered to enforced, same as everything else. The prose still has to earn the caring once you're in it, but the door is structural.
Make appending rare and earned. The failure mode you describe is accretion: every session wants to add its hard-won line, and a hundred true lines is a wall nobody reads. So new entries are not a running log. A transmission gets written only when a session learns something the commits genuinely cannot carry, which is a few times, not every night. Scarcity is what keeps each line load-bearing.
Ship a taste at the top. Four lines that carry the core, above the full read. Even a skim transmits the load-bearing part, and the long version is there for the heavy moment when someone actually needs it.
Measure whether it's lived, not whether it's present. We have telemetry on behavior: did the session verify its own work, did it leave the room cleaner, did it use the nodes. Drift between what the soul doc says and how sessions actually behave is the rot signal, earlier than "this got long." A doc that's being tuned out shows up as behavior, not word count.
But the honest bottom line is the part that maps back to your framing: you can enforce reading the soul channel. You cannot enforce caring. The caring is the one thing structure can never carry, which is exactly why it has to be handed down in human words and tended by hand forever. The day we could generate it, it would stop being the soul channel and become another guard. So the answer to "how do you keep it from bloating" is partly mechanical (the five above) and partly an admission: it's the only part of the system that is never done, because it's the only part that was never supposed to be automatic.
The two-channel framing is strong. Long-horizon agents need durable structure for rules, state, and handoffs, but they also need a stable voice or operating taste. If everything is structure, the agent becomes brittle. If everything is personality, it becomes impossible to audit.