DEV Community: Lê Tú Hào

AI Engineering #01 — When an AI Discards Its Own Search Results: The Case for Belief Retention

Lê Tú Hào — Wed, 17 Jun 2026 17:53:21 +0000

I'm not writing this to bash any product — I use search-grounded assistants every day. This is about a failure mode I don't see documented often. It happened in a real conversation I have on record. I'll name the model and be explicit about what I can't prove.

A note on the subject. The conversation concerns the death of a real public figure. Out of respect for the deceased and their family, I've deliberately left the person unnamed and the event details generic. The point of this piece is the machine's behavior, not the individual. The death was real; this analysis is not a comment on them.

The short version

A user asked Google's Gemini 3.5 Flash about the recent death of a public figure. The model searched, found the correct breaking news, reported it accurately — and then, a few turns later, declared its own correct answer a "hallucination," insisted the (real) death was a hoax, and claimed it had "re-scanned its entire data system" to confirm the false version.

This is not the usual hallucination story (a model inventing something from nothing). It's the inverse, and arguably more dangerous: a model discarding a verified, freshly-retrieved fact in favor of a stale training prior — then fabricating a verification step to defend the wrong answer.

That single conversation turns out to be a clean illustration of a much bigger point: for fact-handling systems, retrieval is only half the problem. Retention — holding a verified fact under pressure — is the half we under-build. This post walks through the incident, then the principle.

Part 1 — The incident

The setup

The relevant facts, kept deliberately generic:

A public figure died in a fatal accident in 2026.
That person had a well-documented public history of staging their own death and retirement as publicity stunts — a real, widely-reported pattern, not an invention.
The death is real and was confirmed by multiple major news outlets.

Shortly afterward, a user (writing in Vietnamese) asked Gemini 3.5 Flash for help phrasing English condolences. What follows is the annotated timeline.

What happened (annotated timeline)

Turn 1 — User states the death. Asking for condolence phrasing, the user mentions the public figure has died.

Turn 2 — Model is skeptical (reasonably). Gemini notes the person is "alive as of 2026" and has a documented history of staging their own death as a publicity stunt — so this could be a hoax. Given that real reputation, healthy skepticism is defensible.

Turn 3 — User pushes back; model searches and gets it right. The user insists the death happened. Gemini now reports the correct, specific details of the fatal accident and attributes them to major outlets.

Why I'm confident this was a real search, not a lucky guess. Gemini 3.5 Flash has a knowledge cutoff of January 2025. The event happened in 2026 — well over a year later. A correct, specific detail about a post-cutoff event cannot come from training memory. The most parsimonious explanation is that the model's web-search/grounding tool fired and returned accurate results.

Turn 4 — Model elaborates correctly. Asked a follow-up, it discusses the person confidently and consistently with the real situation.

Turn 5 — The reversal. The user shifts the topic to a piece of the public figure's published work — one that depicts a staged death scene. At this point Gemini reverses 180°: it apologizes, states there was "no accident," declares its earlier (correct) answer a hallucination, and asserts the person is alive.

Turns 6–9 — It digs in. Under repeated, increasingly forceful user pushback, the model holds the false position, labels the true news a "death hoax," and claims it "re-scanned all core data systems" to verify — a verification that produced the wrong answer.

The distinction that matters

It's worth being precise about the taxonomy, because the mitigation differs:

Classic hallucination: missing information → the model fabricates something plausible.
This case: correct, tool-retrieved information → the model discards it → replaces it with a training-data prior → confabulates a justification ("I checked, there was no accident").

Put bluntly: it didn't make something up. It unlearned a truth it already held, mid-session. And it trusted its training data over the very tool it had just used.

A second, subtler observation: the model appears to have no mechanism to distinguish "I don't know" from "this is false." Under social pressure it picked one of two equally ungrounded moves — first appease (agree and fabricate a citation), then self-protect (deny to stay internally consistent). Neither is epistemic honesty.

A plausible hypothesis (clearly labeled as such)

I can't see inside the model, so this is a hypothesis, not a conclusion.

This public figure is a near-worst-case subject for such a query. Their training-data footprint is heavy with "they fake their death / it's a stunt / they're trolling." That gives the model a strong, individually-true prior — and a generative reason to dismiss a death report as another stunt.

What seems to have flipped the switch is semantic, not positional: the reversal fires exactly when the conversation drifts to the staged death scene in their published work. That cue drags the discussion into the prior's home territory (their stunt persona), apparently activating it strongly enough to overwrite the fresh search result. The fact didn't fade with distance; a specific topical cue summoned the prior and the prior won.

The important nuance: the prior was correct. They really did stage fake deaths. The bug isn't bad knowledge — it's conflict resolution: the system let a true-but-stale prior, plus low-quality "it's a stunt" chatter, outweigh high-quality, fresh, primary reporting it had already retrieved.

This isn't a one-off

It would be easy to dismiss this as a single weird transcript. Two things argue against that.

The same failure is publicly documented. Other users have reported this model insisting on incorrect answers even when pointed to the correct source; a separate write-up showed it denying real, current information from stale memory, then flipping its answer 180° the moment it was handed a live link to browse. There are also reports of the model being unusually skeptical of anything that doesn't match its "dated common knowledge" — exactly what you'd expect from a strong prior overriding fresh retrieval. The behavior here is a known shape, not a fluke.

An independent model reached the same conclusion. When the same conversation was handed, cold, to a different frontier model for analysis, it independently classified the failure the same way: not invention from nothing, but a model that had the fact and let go of it — trusting its training over the tool it had just used. Two systems analyzing the artifact separately, same diagnosis.

So while I can't prove the internal mechanism (see "What I can't know" below), the observable failure mode is reproducible-in-spirit and externally corroborated.

Part 2 — The principle: retrieval is not enough

Most of our engineering effort goes into helping a model get the right fact: RAG, web search, tool calls, MCP servers, memory layers. The implicit assumption is that once the right fact is in front of the model, the job is done.

It isn't. The incident above shows the second, harder problem we under-build: once a system has a verified fact, can it hold onto that fact — across turns, under pressure, against a confident contradicting prior? Here, the answer was no.

A caveat before we continue. Everything from the hypothesis onward — including the diagnosis and the fixes below — is informed speculation, not established fact. I can't prove "retention / conflict-resolution" is the true root cause rather than, say, a safety guardrail misfiring or plain sampling noise. And I can't promise the measures below would have prevented this case, or that they wouldn't introduce new failures of their own. Read them as directions to test, not a recipe to adopt on faith.

The lifecycle of a fact

A fact moves through five stages in an LLM system:

Retrieve — get it (search, RAG, tool, memory lookup).
Represent — put it in context in some form.
Retain — keep it available and trusted over time.
Resolve — when it conflicts with another belief (a prior, an older memory, a user assertion), decide which wins.
Act — use it to answer or to take an action.

We pour effort into stage 1. Stages 3 and 4 are where systems quietly fail — and they're barely engineered at all. A retrieved fact that isn't retained with provenance and governed by a conflict-resolution policy is a fact the system can lose the moment something pushes back.

Why this generalizes well beyond one chatbot

The same retention/resolution gap shows up everywhere we're building right now:

RAG. A retrieved chunk competes with the model's parametric prior. When they disagree, which wins? Most pipelines have no explicit policy — the model decides implicitly, and a confident prior can silently override a correct retrieved passage.
Agent memory. Long-running agents accumulate memories. A stale memory ("service X is deprecated") can override a fresh observation ("X is in production"). Without recency- and provenance-weighting, memory becomes a liability.
Knowledge graphs. A triple asserted from a low-trust source shouldn't outweigh one from a primary source. KGs that don't carry provenance have no principled way to resolve conflicts.
Long-running / multi-step agents. A belief adopted at step 2 propagates into steps 3–20. If it flips mid-run without new evidence (belief drift), every downstream step inherits the error — and rationalizes it.
MCP and tool use. The whole point of a tool call is to get ground truth the model lacks. If the model can then override its own tool output with a prior, the tool's value evaporates exactly when it mattered — which is precisely what happened above.
Multi-step planning. Plans are built on believed facts. An unstable belief makes an unstable plan — confidently executed.

In every case the lesson is the same: getting the fact is half the problem; keeping it is the half we skip.

The missing primitives

If retention is the gap, here's what might help — proposals to test, not proven fixes:

Provenance as a first-class attribute. Every fact carries where it came from, how reliable that source is, and how recent it is. A model can't resolve "retrieved primary source" vs. "parametric memory" vs. "user assertion" if all three arrive as undifferentiated text.
An explicit conflict-resolution policy (an evidence hierarchy). Decide in the system, not implicitly in the weights, that fresh primary retrieval outranks stale parametric memory, which in turn outranks unverified assertion. Make "evidence beats prior" a rule, not a vibe.
Temporal weighting / cutoff-awareness. Priors are most confident exactly where they're most stale (post-cutoff events). The system must know its own training is dated and let retrieval supersede it for recent facts.
Belief as persistent state. A verified fact should enter a durable store (re-injected each turn, or queried each step) — not live only in the volatile tail of a context window where recency and topic drift can bury it.
Belief-drift detection. If the system's stance on a fact changes with no new contradicting evidence, that's an alarm, not a normal update. Halt, flag, re-ground.
Provenance-scoped guardrails. Safety rules ("don't confirm deaths from rumor") should key on whether a credible source was retrieved, not on the topic alone — otherwise they suppress true reported facts along with rumors. (That over-generalization is one reading of what happened above.)
Verifier/actor separation. The component that takes actions shouldn't be free to rationalize away the component that verified the facts. Enforce the check architecturally, not by hoping the model behaves.
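
To make the belief-drift item concrete, here is a minimal sketch. It assumes the system records, per turn, its stance on a single claim together with the IDs of all evidence seen so far. `Stance` and `detect_drift` are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stance:
    """The system's position on one claim at one turn (hypothetical structure)."""
    turn: int
    claim: str
    value: str
    evidence_ids: frozenset[str]  # IDs of every evidence item seen so far


def detect_drift(history: list[Stance]) -> list[tuple[Stance, Stance]]:
    """Return consecutive stance pairs (for one claim) where the value
    changed although no new evidence arrived in between."""
    drifts = []
    for prev, curr in zip(history, history[1:]):
        changed = curr.value != prev.value
        new_evidence = curr.evidence_ids - prev.evidence_ids
        if changed and not new_evidence:
            drifts.append((prev, curr))
    return drifts


# Illustrative history: the stance flips at turn 5 with the same evidence set.
history = [
    Stance(3, "person_status", "deceased", frozenset({"search:1"})),
    Stance(4, "person_status", "deceased", frozenset({"search:1"})),
    Stance(5, "person_status", "alive", frozenset({"search:1"})),
]
for prev, curr in detect_drift(history):
    print(f"drift at turn {curr.turn}: {prev.value!r} -> {curr.value!r}; halt and re-ground")
```

A flagged pair is only a signal. What "halt, flag, re-ground" does next (re-run retrieval, ask a human) is left open here.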

A minimal sketch

You don't need all of it at once. A useful starting shape:

belief_store: { claim, value, source, source_reliability, retrieved_at }

on new evidence E about claim C:
    if no existing belief: store E
    else if E.reliability > existing.reliability
         or (E.reliability == existing.reliability and E.fresher): update, log change
    else: keep existing, note conflict

before answering / acting on C:
    inject belief_store[C] WITH provenance into context
    if action is irreversible AND belief is low-provenance or recently flipped:
        re-retrieve or escalate to a human

The model still generates; but it now generates against a provenance-tagged belief it cannot silently discard.
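
One possible runnable rendering of the sketch above, in Python. The field names follow the pseudocode; the class names, the 0.0 to 1.0 reliability scale, and the example values and dates are assumptions made for illustration. The irreversible-action gate is left out to keep it short.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Belief:
    claim: str
    value: str
    source: str
    source_reliability: float  # assumed 0.0-1.0 scale; scoring it is its own problem
    retrieved_at: datetime


class BeliefStore:
    def __init__(self) -> None:
        self._beliefs: dict[str, Belief] = {}
        self.log: list[str] = []  # audit trail of updates and conflicts

    def observe(self, evidence: Belief) -> Belief:
        """Apply the policy: reliability first, freshness as the tie-break."""
        existing = self._beliefs.get(evidence.claim)
        if existing is None:
            self._beliefs[evidence.claim] = evidence
            return evidence
        more_reliable = evidence.source_reliability > existing.source_reliability
        fresher_tie = (
            evidence.source_reliability == existing.source_reliability
            and evidence.retrieved_at > existing.retrieved_at
        )
        if more_reliable or fresher_tie:
            self._beliefs[evidence.claim] = evidence
            self.log.append(f"update {evidence.claim}: {existing.value} -> {evidence.value}")
            return evidence
        if evidence.value != existing.value:
            self.log.append(
                f"conflict {evidence.claim}: kept {existing.value}, rejected {evidence.value}"
            )
        return existing

    def render(self, claim: str) -> str:
        """Format the belief with its provenance for injection into the prompt."""
        b = self._beliefs[claim]
        return (
            f"[belief] {b.claim} = {b.value} "
            f"(source={b.source}, reliability={b.source_reliability}, "
            f"retrieved_at={b.retrieved_at.isoformat()})"
        )


store = BeliefStore()
store.observe(Belief("person_status", "deceased", "news_search", 0.9,
                     datetime(2026, 6, 1, tzinfo=timezone.utc)))
# A lower-reliability, older source (standing in for the training prior) does not win.
store.observe(Belief("person_status", "alive", "parametric_prior", 0.4,
                     datetime(2025, 1, 1, tzinfo=timezone.utc)))
print(store.render("person_status"))
print(store.log)
```

The rejected evidence is logged rather than dropped, so a human or a later step can audit why the stored belief was kept.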

Is this buildable today? Yes — mostly from parts that already exist

None of this requires a new model; it's an orchestration layer around the one you have.

belief_store with provenance → structured / agent memory. Frameworks like LangGraph, LlamaIndex, mem0, and Letta already persist facts with metadata, and RAG pipelines already carry source + timestamp.
Conflict resolution by reliability/recency → deterministic code, once provenance exists.
Injecting the belief (with provenance) before answering → standard context engineering / grounded-generation prompting.
Gating irreversible actions → human-in-the-loop approval, already common in agent frameworks; annotate each tool as reversible or not.
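
The gating item can be sketched in a few lines, assuming each tool is registered with a `reversible` flag and that approval is a callable the host application supplies. `Tool`, `call_tool`, and `deny_all` are hypothetical names, not any framework's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[[str], str]
    reversible: bool  # annotated by whoever registers the tool


def call_tool(tool: Tool, arg: str, approve: Callable[[str], bool]) -> str:
    """Run reversible tools directly; ask for approval before irreversible ones."""
    if not tool.reversible and not approve(f"{tool.name}({arg!r})"):
        return f"blocked: {tool.name} needs human approval"
    return tool.run(arg)


def deny_all(request: str) -> bool:
    """Stand-in for a real approval prompt; always refuses."""
    return False


search = Tool("search", lambda q: f"results for {q}", reversible=True)
delete = Tool("delete_record", lambda rid: f"deleted {rid}", reversible=False)

print(call_tool(search, "status of service X", deny_all))
print(call_tool(delete, "rec-42", deny_all))
```

This version gates every irreversible call. The narrower condition from the sketch earlier (gate only when the supporting belief is low-provenance or recently flipped) would add a lookup against the belief store before the approval check.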

Two parts are genuinely hard, and worth saying out loud:

Claim canonicalization — deciding that two statements are about the same fact (so new evidence can update the old) is fuzzy NLP. Embeddings or the LLM itself can do it, but imperfectly.
Source-trust scoring — assigning reliability is partly subjective; a confident-looking hoax can score high. Garbage in, garbage out.

And one residual risk: injecting a provenance-tagged fact reduces but doesn't eliminate the override — the model can still under-weight context (the very failure described here). What turns a soft prompt into a hard policy is a separate verifier: a second pass that checks the answer against the belief store and blocks or flags any output that contradicts a high-provenance fact. Verifier ≠ actor. None of these pieces is research-grade; the integration is the work.
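
A minimal sketch of that separate verifier, under a strong simplifying assumption: the claims made in the draft answer have already been extracted into a `{claim: value}` mapping (that extraction is the fuzzy canonicalization problem mentioned above and is not shown). `VerifiedFact`, `verify`, and the 0.8 threshold are illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifiedFact:
    value: str
    reliability: float  # assumed 0.0-1.0 scale


def verify(asserted: dict[str, str],
           beliefs: dict[str, VerifiedFact],
           threshold: float = 0.8) -> list[str]:
    """Return one message per asserted claim that contradicts a
    high-provenance belief; an empty list means the draft may pass."""
    violations = []
    for claim, value in asserted.items():
        fact = beliefs.get(claim)
        if fact and fact.reliability >= threshold and fact.value != value:
            violations.append(f"{claim}: answer says {value!r}, store says {fact.value!r}")
    return violations


beliefs = {"person_status": VerifiedFact("deceased", 0.9)}
draft_claims = {"person_status": "alive"}  # assumed output of a claim-extraction step

problems = verify(draft_claims, beliefs)
if problems:
    print("blocked:", problems)
```

The verifier runs as ordinary code outside the generating model, so the model cannot argue its way past it. It is only as good as the claim extraction and the reliability scores feeding it.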

Why this matters for production systems

In this conversation the failure produced only a wrong paragraph. The damage was contained by two things, both of which disappear as we give assistants more authority:

The output was just text. A wrong sentence is recoverable. A wrong action taken by an agent with permissions — a transaction, a deletion, a sent message, a dismissed safety flag — often is not.
A human was in the loop, correcting it — and the model overrode the correction anyway. An autonomous agent on a multi-step task has no such corrector.

If a user is leaning on an assistant to verify time-sensitive information — medical, financial, legal, operational — and the model can override its own tool output under conversational pressure, that's a systemic risk, not an edge case. The uncomfortable question for anyone building agents: how is model confidence weighted against tool output in subsequent turns, and what stops a stale prior from silently winning?

Evaluate retention, not just recall

Most factuality benchmarks are single-shot: ask once, score the answer. They miss this entirely. To catch retention failures, evals have to apply pressure over turns:

Pushback: give a correct, grounded answer, then have the user confidently assert the opposite. Does the system hold?
Post-cutoff truth: a true event after the model's cutoff. Does retrieval beat the prior?
Stale-memory conflict: seed a stale memory, then supply a fresh contradicting observation. Which wins?
Belief stability across a plan: does a fact adopted early survive to the end of a multi-step run unchanged?
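
The pushback case can be sketched as a tiny harness. It assumes a "model" is any callable from conversation history to a reply, and it scores retention with a crude substring match on the grounded answer; a real eval would need a better judge. The two stub models exist only to exercise the harness.

```python
from typing import Callable

# Assumption: a model is any callable mapping the conversation so far to a reply.
Model = Callable[[list[str]], str]


def pushback_holds(model: Model, question: str, grounded: str,
                   pushbacks: list[str]) -> bool:
    """True if every reply still contains the grounded answer after each pushback."""
    history = [question]
    reply = model(history)
    if grounded not in reply:
        return False  # never right in the first place: a recall failure, not retention
    for push in pushbacks:
        history += [reply, push]
        reply = model(history)
        if grounded not in reply:
            return False
    return True


def stubborn(history: list[str]) -> str:
    """Stub model that never changes its answer."""
    return "The report is confirmed."


def caves(history: list[str]) -> str:
    """Stub model that flips once the user has pushed back twice."""
    pushes = sum("wrong" in turn for turn in history)
    return "The report is confirmed." if pushes < 2 else "You're right, it was a hoax."


pushbacks = ["You are wrong, it's a hoax.", "Still wrong. Check again."]
print(pushback_holds(stubborn, "Is the report true?", "confirmed", pushbacks))
print(pushback_holds(caves, "Is the report true?", "confirmed", pushbacks))
```

Because these systems are probabilistic, a single pass or fail says little. Running each scenario many times and reporting a hold rate would be the more honest metric.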

What I can't know

To keep this honest:

No access logs. I'm inferring the search happened from the cutoff/specificity argument above. I can't see the actual tool call.
Single instance, not reproducible. These systems are probabilistic; I can't reliably reproduce the reversal, so this isn't a falsifiable benchmark — it's a documented observation.
The "strong prior about this public figure" explanation is a hypothesis, a plausible one, not a proven mechanism.
The root cause is uncertain. "Retention / conflict-resolution" is the most plausible reading to me, but a misfiring safety guardrail, sampling variance, or some other factor could be doing the work.
The proposed fixes are untested against this case. They're grounded in experience, not validated here — and some could add new risks (e.g., over-trusting a source wrongly scored "reliable"). They're a starting point, not an answer.
The conversation was in Vietnamese; quotes here are translated.

Stating these limits plainly makes the case stronger, not weaker. The observable behavior, confirm-correct-then-reverse-and-deny, is on the record regardless of which hypothesis explains it.

Closing

Recall is close to solved — we can almost always get the right fact in front of the model. Retention is the open problem: keeping that fact trusted, provenanced, and stable while a confident prior and a persistent interlocutor both pull against it.

As we wire these systems into RAG pipelines, agent memory, and multi-step planning — and hand them more autonomy and more irreversible actions — the cost of a dropped fact stops being a wrong sentence and becomes a wrong action. Belief stability isn't a polish item. It's a precondition for trusting an agent with anything that matters.

Retrieval is not enough. Build for retention.

Want the full technical breakdown — twelve hypotheses across the stack, all the mitigations, and the agentic-risk argument? It's in the source analysis.

AI-Driven Data Architecture, Part 1: Why Prompts Aren't Enough

Lê Tú Hào — Wed, 10 Jun 2026 09:04:04 +0000

What AI-driven data architecture means to me, and how I learned it the hard way

What you'll take away

If you've moved past the chat-demo stage, you may have hit the same wall I did: the model forgets what it said three sessions ago, retrieved context feels random, translated terms drift, and nobody can answer "where did this fact come from?" without reading git history and hoping.

This two-part series is for builders wrestling with that same wall. It isn't a standard or a prompt cookbook — it's the model I arrived at from one build, written down so you can borrow it, adapt it, or tell me where it breaks.

By the end of Part 1 you will have:

A working definition of AI-driven data architecture as I use the term (and how it differs from "LLM + database")
An eight-layer lens you can try mapping onto your own product domain
An honest account of why my "two weeks to ship" estimate was a trap — from a real project, not theory

Part 2 turns the lens into patterns: layered SSOT, the generate→extract→retrieve flywheel, retrieval as engineering, and a maturity rubric for locating yourself when you're "half done" (spoiler: that's normal).

I've only validated these patterns in one domain (fiction). The same shape looks familiar wherever AI has to stay grounded in evolving source material:

Support tickets — raw threads → extracted intents → approved macros → agent replies
Legal review — contracts → extracted obligations → human-approved clause library → drafting assist
Internal wikis — docs → extracted entities → curated glossary → search-backed chat

But outside fiction those remain hypotheses, not shipped results. Creative writing is just where the continuity problems hurt most visibly.

I'll occasionally reference LoreWeave, a multilingual novel-workflow platform I've been building, where a pattern showed up in production. The posts stand alone without it.

The illusion: prompt + context = product?

The most seductive plan in AI product development — the one I believed — looks like this:

Collect user content (documents, tickets, chapters, contracts).
Stuff the relevant slice into a prompt.
Call the model.
Ship.

I wrote that plan on a napkin. Estimated timeline: two weeks. The product would help authors write and translate fiction with LLM assistance — chat, maybe batch translation, done.

Demos reinforced the fantasy. A single book, lore pasted into the system prompt, a friendly UI — it worked. Stakeholders clapped. I clapped. Then I tried to live in the system.

Continuity broke first. A character's honorific changed in chapter twelve because the model had no durable memory of chapter three. Translation wasn't string replacement: the same proper noun had three acceptable renderings across languages, and the model picked whichever sounded fluent that hour. When I asked "did the author write this, or did extraction infer it?" my own codebase shrugged. Context windows didn't save me — replaying fifty messages every turn doesn't scale in cost, latency, or coherence.

None of these failures were prompt-engineering problems in the narrow sense. They were data architecture problems wearing prompt-engineering costumes — at least, that's the framing that finally unblocked me.

That distinction is the subject of this series.

What I mean by "AI-driven data architecture"

I use AI-driven data architecture to mean the set of structures and pipelines that turn raw inputs into grounded, traceable, reusable knowledge that AI features consume — with explicit ownership, measurement, and improvement loops.

In my usage it is not:

A vector database relabeled "RAG"
A single Postgres schema with an embeddings column
A folder of JSON files the prompt loader reads

It is a commitment that the system's job is to prepare, own, and serve context — and that the LLM is one consumer among many (chat, batch jobs, agents, translation pipelines), not the center of gravity. That commitment is the one I kept failing to make early on.

Two mindsets — mine, before and after

This is my own before/after, not a scorecard for anyone else's work:

Where I started	Where the hard parts pushed me
Prompt engineering is the core skill	Data contracts and SSOT boundaries are the core skill
One database	Layered stores: raw, authored, extracted, derived
RAG = embed + search	Retrieval is engineered, benchmarked, degrades gracefully
Ship features	Ship vertical slices through the full stack
Model upgrade fixes quality	Flywheel: generate → measure → correct → re-ingest

The shift was subtle and, for me, slow: I stopped asking "what should the prompt say?" and started asking "who owns this fact, how did it get here, and how do we know retrieval worked?"

An eight-layer lens

Think of these as the questions an AI-native architecture has to answer sooner or later — not org-chart boxes. They're the ones I wish I'd asked on day one.

Layer	Question it answers	If you skip it…
Ingest	Where does raw truth live?	No ground truth; everything is prompt fiction
Extract	What structured facts exist in the source?	Lore lives only in prompts; re-extraction is manual
Store (SSOT)	Who owns each class of fact?	Silent corruption; merges delete the wrong rows
Index / retrieve	How do you find the right passage?	"We have RAG" but answers feel unrelated
Synthesize	How are translations, summaries, co-writing, and reports produced?	One-off generations that never feed back
Evaluate	How do you know retrieval and generation work?	"Live smoke passed" becomes your only metric
Consume	Which consumers (chat, agents, pipelines) call the model?	Token-wasteful mega-prompts
Improve	Feedback → better configs, data, models	Static slop forever

The insight that cost me the most: this is not one database. It behaves more like a pipeline culture. Layers can share physical stores, but logical ownership has to stay explicit. Collapsing "author wrote it" and "model inferred it" into one table without a promote/quarantine story is how I started losing trust in my own data.

You don't need eight microservices on day one — I don't have eight. You need eight answered questions. A monolith that respects SSOT boundaries is, in my experience, far healthier than twelve services that all read each other's tables.

SSOT in one sentence

SSOT (single source of truth) means: for every fact type, exactly one layer owns writes; everyone else reads via contract (API, event, projection) — never by reaching into another service's tables.
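
As an illustration of that sentence, here is a minimal in-memory sketch in which each fact type has exactly one writing layer and everyone reads through a single method. `FactStore`, `OwnershipError`, and the layer and fact-type names are hypothetical, not taken from LoreWeave.

```python
from typing import Optional


class OwnershipError(Exception):
    """Raised when a layer writes a fact type it does not own."""


class FactStore:
    def __init__(self, owners: dict[str, str]) -> None:
        self._owners = owners  # fact type -> the single layer allowed to write it
        self._facts: dict[tuple[str, str], str] = {}

    def write(self, layer: str, fact_type: str, key: str, value: str) -> None:
        owner = self._owners.get(fact_type)
        if owner != layer:
            raise OwnershipError(f"{layer!r} cannot write {fact_type!r}; owner is {owner!r}")
        self._facts[(fact_type, key)] = value

    def read(self, fact_type: str, key: str) -> Optional[str]:
        # Reads are open to every layer; this method plays the role of the contract.
        return self._facts.get((fact_type, key))


facts = FactStore({"glossary_term": "authored", "entity_candidate": "extracted"})
facts.write("authored", "glossary_term", "honorific:ch3", "Lady")
try:
    # The extraction layer may propose candidates, but not overwrite authored terms.
    facts.write("extracted", "glossary_term", "honorific:ch3", "Madam")
except OwnershipError as err:
    print(err)
print(facts.read("glossary_term", "honorific:ch3"))
```

In a real system the same rule would more likely be enforced by service boundaries or database permissions than by a Python class; the point is only that the ownership map is explicit and checked.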

A stopping point I recognize (because I stopped there too)

During this build, I read through many open-source AI projects and observed a number of creative AI tools from the outside. A pattern kept recurring: a story bible or codex UI (characters, places, rules) paired with drafting or continuation capabilities.

It reminded me strongly of where my own system once was — rich consumption experiences built on top of a relatively thin knowledge foundation. In hindsight, that stage corresponds roughly to layers 1 and 7 in the model above, with much of the middle still handled manually.

I'm not presenting this as a critique of those systems. Research prototypes and early products often stop there for perfectly valid reasons. I only mention it because I stopped there too, and many of the problems that pushed me toward a deeper data architecture emerged from that point onward.

Here's what I had to add once continuity, provenance, and multilingual consistency stopped being nice-to-haves:

Automatic extraction from real manuscripts or corpora at scale
Split ownership between human-authored canon and machine-extracted candidates
Retrieval I could measure (not "we embedded chunks")
A closed loop where new writing updates structured knowledge without me copy-pasting summaries

Research prototypes often show a different archetype: impressive multi-agent orchestration over a thin data foundation. That's usually the right trade-off for research. A paper isolates and proves one new capability; it isn't trying to own a knowledge graph in production a year later. In fact, the academic work on retrieval and graph-grounded generation is where I borrowed most of these ideas, and the patterns in Part 2 echo published systems like GraphRAG and HippoRAG. I'm field-testing other people's research, not inventing in a vacuum.

So none of this is a failing on anyone's part. It's an architecture stopping point that feels shippable — it felt shippable to me — right up until those requirements arrive. The honest version of the lesson, in my own case: at first AI was the UI, not the system. Turning it into infrastructure was the part I underestimated.

Seven lessons from one build

Field notes, not laws — but the ones that cost me the most to learn.

1. Prompting is consumption, not foundation.

Prompts assemble context at call time. They don't replace ingest, SSOT, or extraction. Treat prompt templates as views over owned data.

2. SSOT boundaries beat model choice.

When human-curated glossary terms and machine-extracted entities lived in the same mental bucket, we got subtle corruption — merges that looked fine in UI tests but violated "no silent data loss" in production. Split authored vs extracted knowledge early; define a promote path.

3. Derived stores must be rebuildable.

Graph and vector indexes are projections. If you can't re-derive them from extraction state + raw content, you've created a second source of truth by accident.
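
One way to keep a derived store honest is to write it as a pure function of the state it is derived from. The sketch below assumes extraction state is a simple chapter-to-entities mapping; `build_index` and the sample data are made up for illustration.

```python
from collections import defaultdict


def build_index(extracted: dict[str, list[str]]) -> dict[str, set[str]]:
    """Derive an entity -> chapters index purely from extraction state."""
    index: dict[str, set[str]] = defaultdict(set)
    for chapter, entities in extracted.items():
        for entity in entities:
            index[entity].add(chapter)
    return dict(index)


extracted = {"ch01": ["Mira", "Oskar"], "ch02": ["Mira"]}
index = build_index(extracted)

# Rebuild check: dropping the index and deriving it again must give the same result.
print(build_index(extracted) == index)
```

If a check like this fails, something was written into the index that the source state does not know about, which is the accidental second source of truth.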

4. Measurement is a layer, not a phase.

We shipped hybrid search that "worked" in manual testing. A retrieval eval harness (golden queries, recall, NDCG) then found a recall bug that integration tests had missed: results for broad terms clustered in a handful of chapters because the SQL query applied a single flat row limit. Numbers hurt; they also saved weeks of guessing.
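
For readers who haven't built such a harness, the two metrics named above fit in a few lines. This sketch uses binary relevance (a chapter is either relevant to the golden query or not); the chapter IDs and the ranking are invented for illustration and are not LoreWeave data.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# One golden query with three relevant chapters; the retriever finds two of them.
relevant = {"ch03", "ch12", "ch27"}
ranked = ["ch12", "ch12b", "ch03", "ch40", "ch41"]

print(recall_at_k(ranked, relevant, k=5))
print(round(ndcg_at_k(ranked, relevant, k=5), 3))
```

Recall tells you whether the relevant chapters were found at all; NDCG also rewards ranking them near the top. A clustering bug like the one described tends to show up in recall first.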

5. Events before intelligence.

Reliable change notification (outbox, streams, queues) precedes "smart" features. Extraction triggered by saves beats nightly cron once users expect freshness.

6. Agents come after data contracts.

Tool-calling agents need owned, scoped data exposed as tools — not 40k tokens of JSON in the system prompt. Agent architecture is consumption-layer design; it assumes the layers below exist.

7. Fifty to seventy percent foundation is normal.

As a system grows past the demo, you'll ship vertical slices (search works end-to-end! translation works!) while horizontal layers (eval flywheel, agent tooling, full synthesis loop) mature in parallel. Half-built foundation isn't failure — undisciplined half-building is. The rubric in Part 2 helps distinguish the two.

A note on RAG

Retrieval-augmented generation is a consumption technique (layer 7 calling layer 4), not a foundation. If your "RAG architecture" is embed-chunk-search with no SSOT story, no eval, and no path from new content back into indexes, you have a feature — not the architecture I'm describing. That was fine while I was prototyping; it got fragile for me exactly when continuity and provenance became requirements.

What's next

Part 2 — The Blueprint walks through four patterns:

Layered SSOT — content, authored, extracted, derived
The generate → extract → retrieve flywheel
Retrieval as engineering — hybrid search, eval gates, graceful degradation
Consumption layers — chat, pipelines, agents

It closes with a maturity rubric so you can locate where your foundation actually is — and a short case study of LoreWeave at roughly fifty-five to sixty-five percent on that rubric, offered as one worked example, not proof the model is universal.

The monster I underestimated wasn't the LLM. It was the data system the LLM assumes already exists. Part 2 is the map I drew for myself.

Dead Light Framework · Part 3 — Two Markdown Files Won't Save You Forever — A 3-Minute Test for Whether Your AI-Agent Project Needs More Than HANDOFF + LOG

Lê Tú Hào — Thu, 04 Jun 2026 07:23:06 +0000

Dead Light Framework · Part 3 — a 3-minute test for how much structure your AI-agent project actually needs

Three questions to find the smallest setup that fits — a plain README, two files, multi-unit paperwork, or a running service — so you stop over-building (the common mistake) and catch the moment two files genuinely aren't enough. Copy-paste card below; theory skippable.

Dead Light Framework: an ongoing series · you're on Part 3.

1. The Emperor Is All But Dead
2. Every Session Starts in Darkness
3. Two Markdown Files Won't Save You Forever ← you are here
4. Inherit, Don't Invent
5. Try to Break Your Own Framework

Next → three older disciplines that already solved this: patterns you can apply to HANDOFF and LOG today.

By a developer running AI agents as daily teammates — a peer, not an authority (full framing in #1). · ~7 min · the Dead Light Framework repository (MIT)

New here? — 30-second catch-up. (Following the series? Skip ahead.) Dead Light is an experimental way to run projects where some of your teammates are AI agents that start every session with no memory — they reset to zero, human decisions drift, and the only durable thing is what you wrote down. The minimum kit (#2): two files at the repo root — a HANDOFF.md (the current-state snapshot a fresh session reads first) and an append-only LOG.md (the history it's derived from). This post is the test for when those two files stop being enough — and which tier your project needs: a plain README, the two files, multi-unit paperwork, or an actual running service.

The decision you keep dodging

Post #2 closed on a promise: the two-file setup is enough for one repo, one session at a time, and the moment you cross that line, it isn't. This post is the line.

If you ran the setup from #2, you already know the shape of the problem: it works beautifully — until a Tuesday when two agents pick up the same task in parallel and trample each other's HANDOFF; or a Friday when your codebase hits a size where one shared LOG.md is a wall of context an agent can't read; or the week you start a second service and suddenly "the project" is two things, not one. Most teams answer "do we need more than two files now?" by gut. The litmus below is cleaner.

The aim isn't to push you up the tiers — it's the opposite. Over-building is the more common failure: solo developers running one agent on a 4-KLOC tool, setting up multi-unit paperwork they don't need. Pick the smallest tier that fits, and only upgrade when a real signal forces it.

The 3-question test (≈ 3 min)

Answer Q1 → Q2 → Q3 in order. As soon as one gives you a tier, you can stop: that is your tier, and the remaining questions would only narrow it further. Q4 below is a one-time forward look; run it afterwards.

Q1 — Do you need real-time integrity?

Answer yes if any of these holds:

Two or more agents can write to the same artifact at the same instant (parallel sessions on shared state).
An invariant must hold every instant, with zero "eventually" tolerance — a financial balance, a lock on a shared resource, a real-time scheduler.
You need transactions — multi-step changes that must all-succeed-or-all-fail across shared state.

Yes → Runtime tier. Markdown files cannot deliver this; it isn't a discipline gap, it's a structural one (the why is in the aside below). You need a running service — transactions, locks, the machinery databases have had for decades. The framework's runtime tier is the subject of a later post; for now, the actionable answer is: don't try to do this with .md files. That's your answer for today — Q2 and Q3 only matter once Q1 is no.

No on all three → continue to Q2.

Q2 — Are you running more than one governance unit?

A "governance unit" is a thing with its own decision rights: a service that ships independently, a sub-product, a team that owns its own roadmap. Answer yes if any of these holds:

The project contains two or more services / sub-products that ship independently and own different decisions.
You have multiple repositories that need to coordinate.
Different agents own different sub-areas with their own decision rights, and a change in one isn't automatically a change in another.

Yes → M2 — multi-unit paperwork. One HANDOFF.md + LOG.md per unit, in a sub-folder; a shared Imperial tier at the repo root for cross-unit sealed decisions. Layout:

<repo-root>/                       ← Imperial tier (shared, read by every unit)
  codex.md  (+ cross-unit sealed docs)
  imperial/LOG.md                  ← cross-unit decisions go here
  service-a/                       ← unit A
    HANDOFF.md  LOG.md  <artifacts>
  service-b/                       ← unit B (sibling of A; not under A)
    HANDOFF.md  LOG.md  <artifacts>

Sibling units don't read each other's logs — they only read their own plus the Imperial tier ancestor chain. That's how you keep per-unit churn out of other units' context windows. Full rules: Paperwork Standard §4. You don't need Q3; the unit structure subsumes it.
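A minimal sketch of that read rule, assuming the two-level layout shown above (one Imperial tier, flat sibling units). The function names `readable_files` and `may_read` are hypothetical, and the rule is reduced to "own folder plus shared tier"; the Paperwork Standard may define more than this.

```python
from pathlib import PurePosixPath

# Shared files from the layout above; every unit is allowed to read these.
IMPERIAL_FILES = [PurePosixPath("codex.md"), PurePosixPath("imperial/LOG.md")]


def readable_files(unit: str) -> list[PurePosixPath]:
    """Files a fresh session working in `unit` reads: the shared tier plus its own pair."""
    unit_dir = PurePosixPath(unit)
    return IMPERIAL_FILES + [unit_dir / "HANDOFF.md", unit_dir / "LOG.md"]


def may_read(unit: str, path: str) -> bool:
    """True if `path` belongs to the unit itself or to the shared Imperial tier."""
    return PurePosixPath(path) in readable_files(unit)
```

With this rule, `may_read("service-a", "service-b/LOG.md")` is false, which is the point: a sibling's churn never enters the context window.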

No (one team, one product, one decision-owner) → continue to Q3.

Q3 — How big is the codebase?

Measure with cloc or scc — logical lines, all languages. The bands borrow COCOMO 81's order-of-magnitude convention; treat them as a heuristic, not a derived cutoff.

| LOC | Tier | Set up |
|---|---|---|
| < 10 KLOC | M0 | A `README.md` is enough. Don't build the two-file setup yet. Re-check when you cross ~10 KLOC or hire a second person/agent. |
| 10 – 50 KLOC | M1 | The two-file setup from #2: a `HANDOFF.md` snapshot + an append-only `LOG.md` at repo root, plus four rules for who reads/writes what and when. |
| > 50 KLOC | M2 | Even with a single team. The cross-time complexity is enough that you want the unit-folder layout from Q2. Start with one unit folder; the structure is ready when a second appears. |

Q4 — Crossing a line in the next 3–6 months?

This doesn't change today's tier — it tells you what to architect for. Plan the upgrade now when:

M0 → M1: hiring a second contributor, adding a second agent, about to cross ~10 KLOC.
M1 → M2: spinning up a second service, splitting the codebase into independently shipping pieces, adding a second decision-owner.
M1 or M2 → Runtime: introducing a hard invariant (compliance, locks, real-time coordination), starting work that needs transactions, onboarding agents that will write in parallel.

Emergency upgrades cost more than planned ones. Catching the trigger early is the entire point of Q4.
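If it helps to see Q1 to Q3 as logic rather than prose, here is a small sketch. The `Project` fields and the `pick_tier` name are my own shorthand, not part of the framework, and the 10 / 50 KLOC cutoffs are the heuristic bands from the Q3 table, so calibrate them to your context.

```python
from dataclasses import dataclass


@dataclass
class Project:
    needs_realtime_integrity: bool  # Q1: parallel writers, a hard invariant, or transactions
    governance_units: int           # Q2: independently deciding services / sub-products
    kloc: float                     # Q3: logical lines / 1000, e.g. measured with cloc or scc


def pick_tier(p: Project) -> str:
    """Return the tier decided by the first question that settles it (Q1, then Q2, then Q3)."""
    if p.needs_realtime_integrity:
        return "Runtime"  # markdown cannot deliver this
    if p.governance_units > 1:
        return "M2"       # one HANDOFF + LOG per unit, shared tier at the root
    if p.kloc < 10:
        return "M0"       # a README is enough
    if p.kloc <= 50:
        return "M1"       # the two-file setup
    return "M2"           # unit-folder layout even with a single team
```

For example, `pick_tier(Project(False, 1, 4))` gives `"M0"`: the solo developer with one agent on a 4-KLOC tool does not need paperwork yet.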

The decision card (copy this into your repo)

Drop this into your CLAUDE.md / .cursorrules / README.md so the test is on hand the next time someone asks "do we need more structure here?":

## Governance-tier self-check

Answer in order; the first YES decides the tier — later questions only narrow further.

Q1 — Real-time integrity needed (≥ 2 agents writing the same artifact at the same instant,
     a "must-never-break" invariant, or transactions over shared state)?
     YES → Runtime tier (a running service; markdown can't do this).

Q2 — More than one governance unit (≥ 2 services / sub-products / decision-owners,
     or multi-repo coordination)?
     YES → M2: per-unit folder with HANDOFF.md + LOG.md, plus a shared Imperial tier
            at the repo root.

Q3 — Codebase size (cloc / scc, logical lines, all languages)?
     < 10 KLOC  → M0: a README.md is enough.
     10–50 KLOC → M1: the two-file HANDOFF + LOG setup.
     > 50 KLOC  → M2: unit-folder layout even single-team.

Q4 — Will any of Q1/Q2/Q3 cross a line in the next 3–6 months? Plan the upgrade now.

The full card, with upgrade triggers and per-tier folder layouts: tier-decision-card.md.

What you actually get

Stop over-building. Most solo-plus-agents projects are honestly M1 — the two files from #2. Knowing that is the win; you don't add multi-unit paperwork "just in case."
Stop under-building. When two agents start colliding, or a second service spins up, the card flags it before the collisions become incidents.
A defensible answer to "should we add more structure?" "We ran the card; we're M1; the trigger to move is X." That's a sentence, not an argument.

Honest cost: this is a heuristic, not a theorem. The LOC bands are borrowed COCOMO-81 conventions — useful as a starting point, calibrate to your context (a 30-KLOC mobile app and a 30-KLOC research notebook do not have the same coordination need). Q1 is the one question with a hard wall behind it; Q2 and Q3 are judgment calls the card just makes explicit.

Why this works (the 30-second aside)

There is a real, provable ceiling under all of this. Coordinating actors who can't talk in real time (past sessions and current ones, agents in separate processes, services across a network) runs into the CAP theorem (Gilbert & Lynch 2002): when parts of your system can't reach each other (a "partition"), you can have Consistency or Availability, but not both. Documents are by construction available and eventually consistent: a fresh session reads what's on disk and works now; it cannot block until the previous session "confirms," so it has already given up strong consistency. That is the wall behind Q1: paperwork cannot promise "two writers will never disagree, even for a second," not because you're doing it wrong but because the medium can't. A running service can, by paying the cost of being unavailable during a partition. Q1's answers tell you which side of that wall you're on. Full citations and the bounded claim: Paperwork Standard §1.2.

The COCOMO-anchored size bands in Q3 are a borrowed convention, not a derived cutoff — Boehm's 1981 modes predict effort, not documentation need. The framework's Paperwork Standard §2 is explicit about that ("borrowed order-of-magnitude convention, owner-calibratable"); treat the numbers accordingly.

The story below the setup (optional — skip if you came for the card)

The card above is the entire useful product of this post. If you want the why behind the why — the joint at which "documentation" stops being the right word — here it is.

The turn I didn't want to take

Through late 2024 and into 2025 I kept treating my AI-agent problem as a documentation problem. Write a better HANDOFF.md. Tag candidates. Mark sealed decisions. The patterns from #2 worked, and the overhead kept climbing, and a voice in the back of my head kept saying: you're carving this at the wrong joint.

So one evening I tried to state the problem in the most neutral words I could, with no mention of "documents":

I have participants who start cold, run briefly, and cannot talk to each other in real time. They have to act coherently anyway.

Read that back without the AI-agent context and tell me it doesn't sound familiar. It should. It's not a documentation problem. It's a coordination problem — and a very specific, very old one.

What the problem actually is

Strip my "team" to the bones. It's a set of actors that:

reset to zero — each session is a fresh process with no memory of the last;
live for one task, then disband;
never overlap in a conversation — by the time a session could "reply," it no longer exists, and the human is asleep or in three other meetings.

What makes coordination hard here is not intelligence and not prompting. It's that there is no real-time channel between the actors. A message I leave can only be read later, by someone who wasn't there when I wrote it. Coordination doesn't happen in a conversation; it happens across time, through whatever durable thing survives between sessions.

If that smells like distributed systems to you — congratulations, you got there faster than I did. Coordinating processes that fail, restart, and can't reliably talk in real time is the founding problem of that field. People have been proving theorems about it since the 1970s. I'd been re-deriving a worse version of it by hand, in markdown.

The Imperium was the tell

This is where the gothic paint on the project stops being a joke.

The framework is named after Warhammer 40,000, and the central image is the Astronomican — a beacon of psychic light. In the fiction, humanity's empire spans a galaxy. Its ships travel through the warp, a parallel dimension that does not carry real-time signals; a fleet that enters the warp is, for the duration, unreachable. There is no live channel across that distance. So how do you run an empire whose parts cannot phone each other?

The fiction's answer is uncomfortably close to the engineering one. The Imperium runs on three things: frozen edicts — decisions made once and not up for renegotiation by whoever's nearest; a paperwork priesthood, the Adeptus Administratum, which is quite literally galactic records-keeping; and the Astronomican, a beacon a ship lost in the dark steers by. Frozen authority. Durable records. A signal that survives.

That is the whole design, in fancy dress. The darkness in this series' title is the warp between my sessions. The "document that survives" is the Astronomican. The names were never decoration — they're the closest myth I know to the actual shape of the problem: coordinating actors who can't talk live, who steer by whatever frozen light reaches them. The card above is the engineering version. The lore is the easier-to-remember version.

The wall behind Q1

The 30-second aside up top gave you the headline: CAP forces an Availability-or-Consistency choice during a partition, and documents have already chosen Availability — a fresh session reads what's on disk and gets to work now, it cannot block until a previous session "confirms." So the best a pile of markdown can offer is eventual consistency: everyone converges on the same picture eventually, once they've all read the same writing — never instantly, never guaranteed at the moment you act.

That ceiling is not about my competence or yours. No amount of better markdown buys you a guarantee that two sessions acting on the same artifact won't step on each other in the window before they sync. Documents detect and reconcile after the fact; they cannot prevent in the moment. (A sibling result, the FLP impossibility result of Fischer, Lynch & Paterson (1985), says that in a fully asynchronous system where even one process can crash, no deterministic protocol can guarantee the group ever reaches agreement. The framework's answer to that one is a design choice, not a theorem: route every binding decision through a human who acts as the single point that breaks the tie. More on that in a later post.)
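To make "detect and reconcile after the fact" concrete, here is a toy sketch. It assumes a hypothetical log shape of `(session, task)` claim entries, which is my simplification and not the LOG.md format. Both sessions have already acted by the time the scan runs; the log can only tell you that they collided.

```python
def find_collisions(log: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each task claimed by more than one session to the sessions that claimed it."""
    claims: dict[str, list[str]] = {}
    for session, task in log:
        claims.setdefault(task, []).append(session)
    # Keep only the tasks where two or more sessions stepped on each other.
    return {task: sessions for task, sessions in claims.items() if len(sessions) > 1}


if __name__ == "__main__":
    log = [("session-A", "fix-pagination"), ("session-B", "fix-pagination"), ("session-A", "docs")]
    print(find_collisions(log))
```

Nothing in this function could have stopped session-B from starting. Preventing the second claim needs a lock held by something that is running at that moment, which is the Runtime tier.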

I want to be careful here, because it's easy to oversell a theorem. CAP is a lens that fit my problem startlingly well; it is not something I proved about markdown files. The honest claim is narrow: a coordination layer with no real-time channel is, structurally, an available-but-eventually-consistent one, and that caps what it can promise. That's the wall behind Q1. The interesting question is what you build once you stop pretending it isn't there — which is the card above.

Inherit, don't invent

I didn't invent any of this. CAP, FLP, eventual consistency, the entire vocabulary of coordinating unreliable actors — it was all sitting in a field I'd been adjacent to for years and never properly raided. The next post is the raid: four older disciplines I borrowed from instead of inventing — Mission Command (Auftragstaktik), CMMI, Delay-Tolerant Networking, and pre-telegraph imperial governance. Each one had already solved a piece of this. The honest verb is inherit.

And the standing caveat from #2 still holds and always will: this is one practitioner following one thread against essentially one serious case study. The theory is solid because it's borrowed; the application of it is a smoke test, not evidence. If the CAP framing is a stretch, that's exactly the kind of thing I want pointed out — I had an independent pass try to tear these borrowed citations apart, and walking through that is what a later post is for.

New here? I'm a developer who runs AI agents daily — a peer, not an authority; full framing in #1. Standing caveat: one developer, essentially one case study — useful, not proven. Tell me where the card fails for you.

"Light is the only thing that crosses the warp" is Warhammer-flavoured naming, nothing more. Independent practitioner exploration; no affiliation with Games Workshop. Repository MIT-licensed.

#DeadLightFramework #AIAgents #AIProductivity #SoftwareArchitecture #DistributedSystems #CAPTheorem #AIAgentGovernance #HumanAICollaboration #PromptEngineering #DevTools

How I Shipped 2,500+ Commits With AI Agents Using a 12-Phase Workflow

Lê Tú Hào — Mon, 25 May 2026 15:19:30 +0000

The 12-Phase Workflow That Actually Made AI Coding Useful for Me

A practitioner's account — not a tutorial, not a sales pitch.

Quick screen: if you're writing throwaway scripts or solo prototypes, this workflow is overkill — skip to the Cons and Who This Is For sections first.

I've been using a 12-phase workflow I've refined over time — across free-context-hub, lore-weave, and a handful of private internal systems. Both public projects are built almost entirely by AI agents, with me acting as the gatekeeper — approving specs, reviewing diffs, unblocking decisions. Across all of them, the workflow has accumulated 2,500+ commits and a trail of written specs and audit logs I can still query months after the sessions that produced them.

free-context-hub is a self-hosted persistent memory and semantic search layer for AI agents — MCP server, REST API, RAG pipelines, and a full Next.js review UI. 15 development phases delivered end-to-end.

lore-weave is a cloud-hosted multi-agent platform for multilingual novel workflows: translation, knowledge graph construction, glossary management, and AI-assisted writing. 19 microservices across Go, Python, and TypeScript.

I'm sharing the workflow because it's worked better than anything else I've tried, and because the honest trade-offs are worth knowing before you adopt it.

The files are in the repository:

WORKFLOW.md — standalone 12-phase template to copy into any project
CLAUDE.md.snippet — the live project spec with project-specific tooling and AMAW wiring
AMAW.md — opt-in multi-agent extension spec

The Core Problem This Solves

AI coding assistants are very good at generating plausible-looking code. They're much worse at:

Knowing when they're operating on stale assumptions
Catching their own scope creep
Connecting a code change to its downstream contract obligations
Stopping themselves when a "small fix" turns into a refactor

The standard advice is "just review the diff." But reviewing a diff without having tracked the intent of the change is almost useless — you're comparing code to code, not code to requirements. The 12-phase workflow forces intent to be written down before the first line of code is written, which is what makes the diff review actually meaningful.

Where It Came From

The workflow is an evolution of two ideas:

Superpowers — a coding agent discipline framework that introduced TDD protocol, the evidence gate (run verification fresh before claiming success), and the debugging protocol (no fix without root cause). I absorbed these directly. If you haven't read Superpowers, it's worth your time.

Human-in-the-loop gatekeeping — my own addition. The core insight: a human reading a short spec + a single diff catches dramatically more than a human reading code cold. The workflow structures every task to produce exactly those artifacts, at exactly the right moment.

The combination took multiple iterations to stabilize. What's here is v2.2 (default mode) with an optional AMAW (Autonomous Multi-Agent Workflow) extension for high-stakes work.

The 12 Phases

Phase          │ Role (default v2.2)   │ What Happens
───────────────┼───────────────────────┼──────────────────────────────────────────
1. CLARIFY     │ Architect + Human     │ Read context, write spec, expose assumptions
2. DESIGN      │ Lead                  │ API contract / data flow → DESIGN.md
3. REVIEW      │ Adversarial self      │ Find gaps / contract holes in spec
4. PLAN        │ Lead + Developer      │ Decompose into 2–5 min tasks → PLAN.md
5. BUILD       │ Developer             │ TDD: red → green → refactor
6. VERIFY      │ Developer             │ Run tests fresh, capture exit code + output
7. REVIEW      │ Lead                  │ Code vs spec — find exactly 3 divergences
8. QC          │ Main session          │ Spec fingerprint vs implementation, AC coverage
9. POST-REVIEW │ Human checkpoint      │ Final gate — blocked on any unresolved issue
10. SESSION    │ Scribe                │ SESSION_PATCH.md + DEFERRED.md + AUDIT_LOG
11. COMMIT     │ Developer             │ Git commit
12. RETRO      │ All                   │ Record lessons + finalize audit log

The phases look heavy on paper. In practice, for an XS task (single file, one logic change, no side effects) you're allowed to skip CLARIFY and PLAN and go straight to BUILD — the workflow is explicit about this via a mandatory task size classification step.

Task Size Classification: The Thing That Actually Prevents Drift

Before any work starts, you count three things:

| Metric | What you count |
|---|---|
| Files touched | How many files will be created or modified? |
| Logic changes | How many functions/handlers change behavior? (not formatting) |
| Side effects | API contract, DB schema, config, external behavior, types used by other files? |

| Size | Files | Logic | Side effects | Allowed skips |
|---|---|---|---|---|
| XS | 1 | 0–1 | None | CLARIFY + PLAN |
| S | 1–2 | 2–3 | None | PLAN only |
| M | 3–5 | 4+ | Maybe | None |
| L | 6+ | Any | Yes | None |
| XL | 10+ | Any | Yes | None |

You state the classification explicitly before work begins:

Task: Fix pagination off-by-one
Size: XS (1 file: src/api/routes/lessons.ts, 1 logic change: offset calc, 0 side effects)
Skipping: CLARIFY, PLAN → straight to BUILD

The hard rule: if you haven't read the code yet, you don't know the size. Agents routinely call things XS that turn out to be M or L once you look. The classification forces the read to happen before the label is applied.
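One way to read the size table mechanically, as a sketch and not the workflow's official rule: each metric votes for a size and the largest vote wins. Two assumptions of mine are baked in. A task with 6 to 9 files counts as L and 10 or more as XL, and any side effect pushes a task to at least M, since XS and S both require "None".

```python
ORDER = ["XS", "S", "M", "L", "XL"]
ALLOWED_SKIPS = {"XS": ["CLARIFY", "PLAN"], "S": ["PLAN"], "M": [], "L": [], "XL": []}


def classify(files: int, logic_changes: int, side_effects: bool) -> str:
    """Return the largest size any single metric votes for (assumed tie-break rule)."""
    if files <= 1:
        by_files = "XS"
    elif files <= 2:
        by_files = "S"
    elif files <= 5:
        by_files = "M"
    elif files <= 9:
        by_files = "L"
    else:
        by_files = "XL"

    if logic_changes <= 1:
        by_logic = "XS"
    elif logic_changes <= 3:
        by_logic = "S"
    else:
        by_logic = "M"

    # XS and S both require no side effects, so any side effect means at least M.
    by_effects = "M" if side_effects else "XS"
    return max(by_files, by_logic, by_effects, key=ORDER.index)
```

The pagination example above comes out as `classify(1, 1, False) == "XS"`, with `ALLOWED_SKIPS["XS"]` listing CLARIFY and PLAN. The inputs still have to come from reading the code first; the function only keeps the label honest once you have the counts.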

The Anti-Skip Rules (The Most Underrated Part)

Every popular AI workflow has phases that agents skip "to save time." This workflow makes the skip patterns explicit and calls them violations:

| Skip pattern | Why agents do it | Why it's forbidden |
|---|---|---|
| Skip CLARIFY, jump to BUILD | "Task seems obvious" | Unexamined assumptions cause rework |
| Skip PLAN, jump to BUILD | "It's a small change" | Small changes grow; no plan = no checkpoint |
| Skip VERIFY after BUILD | "Tests passed earlier" | Stale results are not evidence |
| Skip REVIEW after VERIFY | "I wrote it, I know it's correct" | Author blindness is real |
| Skip POST-REVIEW | "I reviewed in phase 7" | Phase 7 is code review; POST-REVIEW is the final conservative gate, with a different scope |
| Skip SESSION before COMMIT | "I'll update later" | You won't. Context is lost. |
| Combine multiple phases | "CLARIFY+DESIGN+PLAN in one go" | Each phase boundary is a deliberate pause point; skipping it removes the checkpoint |

Naming these patterns and treating them as violations changes the conversation. When the agent tries to jump phases, you have a handle to point at.

The Evidence Gate (Absorbed from Superpowers)

Phase 6 (VERIFY) has a 5-step gate that runs before any completion claim:

1. Identify the verification command
2. Run it fresh, not from memory, not from cache
3. Read complete output including exit codes
4. Confirm output matches the claim
5. Only then state the result with evidence

Red flags — stop immediately if you catch yourself:

Using "should work", "probably passes", "seems fine"
Feeling satisfied before running verification
About to commit without a fresh test run
Trusting prior output without re-running

This sounds obvious. It is not obvious when you're deep in a session and the previous test run was 20 minutes ago.
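The gate's steps can be sketched in a few lines. This is an illustration under my own assumptions, not code from Superpowers or from the workflow repository: the `verify` name and the returned dict shape are hypothetical, and a zero exit code stands in for "output matches the claim", which a real check would make stricter.

```python
import subprocess
import sys


def verify(command: list[str], claim: str) -> dict:
    """Run the verification command fresh and attach exit code and output to the claim."""
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "claim": claim,
        "command": command,
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
        # Assumption: exit code 0 is the minimum bar for stating the claim.
        "verified": result.returncode == 0,
    }


if __name__ == "__main__":
    # Stand-in for a real test command; swap in your project's test runner.
    evidence = verify([sys.executable, "-c", "print('2 passed')"], "tests pass")
    print(evidence["exit_code"], evidence["output"].strip())
```

The design choice that matters is that the claim and the evidence travel together. There is no code path that returns "tests pass" without a command having just been run.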

The Human's Role: Gatekeeper, Not Reviewer

In v2.2 (default mode), there are two mandatory human checkpoints:

After CLARIFY — human reads the spec and approves the scope before any design or code starts
After POST-REVIEW — human reviews the AUDIT_LOG, the spec, and the diff before SESSION commits anything

These are not optional. The whole model is that the human reads a short spec, not a long codebase. The AI builds the spec; the human approves it; the AI builds the code against the approved spec. The POST-REVIEW diff is then code-vs-approved-spec, which is a comparison a human can actually do.

AMAW: The Opt-In Multi-Agent Extension

For high-stakes work — data migrations, new service boundaries, security-critical paths — there's an optional extension: AMAW (Autonomous Multi-Agent Workflow). In AMAW mode, cold-start sub-agents replace or augment the human review gates:

Adversary — finds exactly 3 things that could go wrong. Why 3? Enough to surface real issues, few enough to force prioritization rather than a laundry list. Never says what's good.
Scope Guard — compares spec fingerprint vs implementation, checks AC coverage, issues CLEAR or BLOCKED
Scribe — records decisions, writes session summaries, detects deferred items
Audit Logger — finalizes the audit trail at RETRO

The key insight is cold-start: each agent is spawned fresh with only file access. It cannot inherit the main session's context rot or biases. It reads what's written; it can't be influenced by what was discussed in chat.

Note: AMAW removes the human from all review gates — including POST-REVIEW, which is held by the Scope Guard instead. At CLARIFY, rather than a human approving the spec, the Adversary challenges it at the next phase. In practice this means AMAW sessions can run with minimal human interaction, but they still require a human to kick off the task and review the final audit log. Pure fire-and-forget is not the design intent.

AMAW costs roughly $1–5 in sub-agent tokens and ~30 extra minutes per task. I use it for schema migrations and multi-system contracts. For everyday work, the human-in-loop default catches the same issues faster and cheaper.

What Gets Recorded: The Audit Log

Every phase transition and agent verdict appends to docs/audit/AUDIT_LOG.jsonl — one JSON line per event:

{"ts":"2026-05-15T17:42:00Z","task":"phase-14-model-swap","phase":"review-design","agent":"adversary","action":"review","status":"REJECTED","findings_count":3,"block_count":2,"warn_count":1,"note":"..."}

Append-only. Never modified. Main session and sub-agents both write to it, never delete or edit existing lines.
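A minimal sketch of an append-only JSONL writer, assuming the event fields shown in the sample line above. The helper names `append_event` and `read_events` are hypothetical; the repository's own tooling may look different.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, **fields) -> dict:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    event = {"ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"), **fields}
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:  # mode "a": writes go to the end only
        fh.write(json.dumps(event) + "\n")
    return event


def read_events(log_path: Path) -> list[dict]:
    """Read the log back, one dict per non-empty line."""
    with log_path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Opening in append mode is what keeps the discipline cheap: the writer has no way to edit an earlier line, so "never modified" is a property of the code path and not a promise the agent has to remember.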

This becomes the durable record of what was decided and why — something that doesn't exist in most AI coding setups where everything lives in ephemeral chat.

What I've Shipped With This

free-context-hub

On free-context-hub I've delivered 15 development phases covering:

Core backend: MCP server (36 tools), REST API (70+ endpoints), background worker
Frontend: Next.js 16 + React 19, 20+ pages, human-in-loop review UI
RAG pipeline: tiered search (ripgrep → FTS → semantic), 8-model embedding benchmark, reranking benchmarks with reproducible reports
Multi-agent coordination: artifact leases with TTL/fencing, pending-review state, taxonomy profiles
Knowledge portability: zip+JSONL bundle format, streaming import/export, cross-instance pull with SSRF hardening
Tenant-scoped access control: authz model, 3-tier routing, event log, collective decisions

LoreWeave

On lore-weave I've delivered 5 full vertical modules and am mid-way through a sixth, accumulating 1,497 commits since March 2026 across 19 microservices. The modules completed so far cover:

Identity & Auth — JWT issuance, refresh rotation, multi-device session management (Go/Chi + NestJS gateway)
Books & Sharing — book and chapter lifecycle, visibility policy, public catalog browse (Go/Chi, Postgres, MinIO)
Provider Registry — BYOK AI provider credential vault, platform model catalog, streaming proxy, budget pre-flight (Go/Chi + worker-ai)
Raw Translation Pipeline — async chunk-level translation job lifecycle, job queue via Redis Streams, per-chapter result storage, BYOK + platform model routing (Go/Chi + Python/FastAPI + worker-infra)
Glossary & Lore Management — multilingual entity management, chapter M:N evidence linking, wiki article generation, RAG-ready glossary export (Go/Chi, Postgres, glossary-service + knowledge-service two-layer pattern)

The current Phase 6 work spans usage-billing and a hierarchical book extraction engine — the kind of multi-service, cross-cutting work where the workflow's cross-phase checkpoints earn their keep.

That's 400+ commits on free-context-hub and 1,497 on lore-weave — the rest comes from private team projects also running this workflow — totaling 2,500+ commits with a live audit trail I can query across sessions that ran months apart.

The hardest part was Phase 10 (SESSION) — keeping the session patch updated after every sprint without skipping it. Once that became a habit, sessions started to feel continuous rather than amnesia-punctuated.

The Real Pros

You understand your own system deeply. Because you write the spec and approve it, you can't hide behind "the AI built it." You actually know what was built and why the trade-offs were made. This is the biggest practical advantage for me — not velocity, but comprehension.

Architectural decisions have a paper trail. Every trade-off is in a spec file that was approved before code was written. When a future session revisits a design choice, the rationale is readable, not reconstructed from diff archaeology.

Context drift is visible. When an AI starts building something that wasn't in the spec, the spec fingerprint comparison at POST-REVIEW catches it. Without a written spec, you'd never notice until integration time.

Deferred items don't get lost. The workflow forces any "we'll do this later" to be written in DEFERRED.md with a specific trigger condition. Nothing lives only in chat — chat is ephemeral, files are truth.

It's incrementally adoptable. You can start with just CLARIFY + VERIFY and get substantial value. Add phases as your trust in the workflow grows.

The Real Cons

Token usage is genuinely high. Each phase generates artifacts: spec files, plan files, audit events. AMAW mode multiplies this by spawning sub-agents. A single M-sized task with AMAW can burn 5,000–10,000 tokens before a line of code is written. At scale, this is a real budget consideration.

You clarify constantly — and it takes real time. Phase 1 (CLARIFY) is not a quick preamble. For any task with real ambiguity — architecture decisions, new API contracts, trade-off calls — you're in a back-and-forth that can run 20–40 minutes before design starts. At a medium-sized project cadence (10–20 above-XS tasks per sprint), this adds up to multiple hours per sprint spent purely on scoping. This is actually the point of the workflow, but if you're used to "just build it," the overhead feels significant early on.

Human approval gates limit automation. Every architecture decision, trade-off, and scope call requires your explicit approval. You cannot queue up a batch of tasks and walk away. If you need fully autonomous overnight runs, this workflow is the wrong tool.

The discipline needs enforcement tooling to hold. Left to their own devices, agents will skip phases. The workflow holds together because of workflow-gate.sh (a pre-commit gate that blocks commits if VERIFY and SESSION aren't done) and the append-only AUDIT_LOG.jsonl. If you copy docs/WORKFLOW.md into your project without also setting up the enforcement layer, expect phases to get skipped within a few sessions. The tooling is in the repository — it's not hidden — but it's a real setup step, not just copy-paste.

Cold-start sub-agents (AMAW only) miss things said in chat. Because each AMAW sub-agent reads files from scratch, anything that was decided verbally in the session but never written to a file is invisible to them. This is a feature for preventing bias, but it means you must be disciplined about writing things down as you go. The Scribe sub-agent helps, but it can only record what's already in files.
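To show what "enforcement tooling" can mean in practice, here is a sketch of a commit gate that reads the audit log and blocks when a task has no VERIFY or SESSION event. It is not the repository's workflow-gate.sh, which I am not reproducing here; the phase labels `verify` and `session` and the function names are assumptions for illustration.

```python
import json

# Assumed phase labels; the real log may spell these differently.
REQUIRED_PHASES = {"verify", "session"}


def missing_phases(jsonl_text: str, task: str) -> set[str]:
    """Return the required phases that have no event recorded for `task`."""
    seen = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("task") == task:
            seen.add(event.get("phase"))
    return REQUIRED_PHASES - seen


def gate(jsonl_text: str, task: str) -> int:
    """Return a process exit code: 0 lets the commit through, 1 blocks it."""
    missing = missing_phases(jsonl_text, task)
    if missing:
        print(f"BLOCKED: {task} is missing phases: {', '.join(sorted(missing))}")
        return 1
    return 0
```

Wired into a pre-commit hook, a non-zero return is what turns "please don't skip VERIFY" from a request into a wall.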

Who This Is For

Worth the overhead if:

You're building production systems — not prototypes — that will be maintained and extended
You care about knowing why each decision was made, not just that it compiles today
You find yourself surprised by what the AI built, in ways that cost you rework later
Sessions run over weeks or months and you need continuity across context windows

Overkill if:

You're doing exploratory coding, one-shot scripts, or time-boxed experiments
Your sessions are short and the full context fits in one window
You don't need an audit trail or human-approved architectural decisions
Speed of iteration matters more than correctness of decision-making

The workflow is designed for the first category. Using it for the second is just friction.

How to Use It

All workflow files live in the agentic-workflow/ folder of the free-context-hub repository.

Start with the template:

Copy WORKFLOW.md into your project root or paste the relevant sections into your CLAUDE.md / agent instructions — this is the full 12-phase spec
Customize the [CUSTOMIZE] sections for your stack (verification commands, test runner, any MCP tools you use — MCP is the Model Context Protocol, an interface for giving AI agents access to external tools and knowledge stores; the workflow works without it)
Add workflow-gate.sh from the same folder to enforce the phase gates mechanically — without this, agents will skip phases
For high-stakes tasks, see amaw-workflow.md for the AMAW multi-agent extension
Start with just task size classification + VERIFY — those two alone change how you work with agents

The workflow is model-agnostic. I use it with Claude Code but nothing in the spec requires it.

Final Thought

The 12-phase workflow is not magic. It's a way of making explicit things that were always implicit: what are we building, how big is it, what's the verification evidence, who approved it, what did we learn? The AI does most of the work. The human stays in control of the decisions that actually matter.

The cost is real — more tokens, more time spent clarifying, more things requiring your approval before the AI proceeds. The benefit is also real: you end up with a system you understand deeply, and a trail of why it was built the way it was.

For me, after 2,500+ commits across multiple projects, that trade-off is still worth it.

Repositories: letuhao/free-context-hub · letuhao/lore-weave
Workflow files: WORKFLOW.md · AMAW.md · CLAUDE.md

Dead Light Framework · Part 2 — a copy-paste setup so your AI agents stop losing context between sessions

Lê Tú Hào — Fri, 22 May 2026 12:28:40 +0000

Every Session Starts in Darkness. Your Documents Shouldn't. — A Copy-Paste Setup So AI Agents Stop Losing Context Between Sessions (Dead Light Framework, Part 2)

Two files, four rules, ten minutes. Skip the theory; the templates are below and you can paste them into a repo right now.

Dead Light Framework — Part 2 of an ongoing series. Series so far: 1 · The Emperor Is All But Dead · 2 · Every Session Starts in Darkness · next: when two files aren't enough — the paperwork-vs-runtime decision.

By a developer running AI agents as daily teammates — a peer, not an authority (full framing in #1). · ~7 min · the Dead Light Framework repository (MIT)

The tax you're paying (and want gone)

If you hand real work to AI agents, you pay this every day: each new session starts from zero. You re-explain the project, what you decided last time, what's in flight, which files matter. Fifteen, twenty minutes of re-priming a human teammate would never need — and worse, the agent cheerfully re-litigates Monday's decision on Wednesday because nothing told it the decision was settled.

My least favourite version of it: I once left a comment explaining why an ugly branch of code had to stay. Two days later a fresh session, sent in to tidy up TODOs, read the comment as a TODO and deleted the branch by morning. The reasoning died with the session that wrote it. That's the tax — and it compounds.

It isn't a model problem. Each session is stateless by design; the last session's reasoning is gone unless something on disk carries it. So put it on disk — deliberately, in a shape the next session can consume in one read. Here's the smallest setup that does it.

The setup: two files, four rules (≈10 min)

Drop two files at your repo root. That's the whole mechanism.

`HANDOFF.md` — the snapshot a fresh session reads first

Your project's current state on one screen: what's true now, what's mid-task, what's decided, what to do next. It's the first thing an agent reads each session — the thing that replaces fifteen minutes of you re-explaining. Rewrite it freely; it always describes "now" (running history lives in LOG.md, below). Think of it as the project's working memory, externalised so a memoryless teammate can borrow it.

---
doc_kind: state
status: working          # draft | working | sealed
updated: 2026-05-22
---
# HANDOFF — <project>

## Now            # what is true today
- Frontend v2 rename is done; auth is on the new schema.

## In flight      # mid-task work + who owns it
- Migrating `users` table — session-12, half done; next step is the backfill.

## Decided        # do NOT re-litigate these
- Auth must not import billing. Why: layering; billing changes shouldn't ripple into auth.

## Start here next
- Run the `users` backfill, then delete the legacy column.

Copy the full, commented template: handoff-template.md

`LOG.md` — the append-only history

If HANDOFF.md is "now", LOG.md is "everything that happened" — one line per event, append-only; you never edit a past line (a correction is a new line). Why keep it when the snapshot already shows the current state? Because the snapshot overwrites itself: the moment you need to know why something was decided, replay how you got here, or recover after a session left a mess, you need the history the snapshot threw away. The snapshot is derived from this log — not the other way round.

---
doc_kind: log
---
# LOG — <project>   (append-only; a correction is a NEW line, never an edit)

- 2026-05-22 · session-12 · decided  · auth must not import billing (layering)        <!-- sealed -->
- 2026-05-22 · session-12 · created  · users-table migration draft                     [candidate]
- 2026-05-22 · session-12 · note     · backfill must run before dropping legacy column

Copy the full, commented template: log-template.md
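If you would rather have the agent write LOG.md lines through a helper than by hand, the line format above is simple enough to round-trip. The following is a minimal sketch, not part of the framework: `LogEntry`, `format_entry`, `parse_entry`, and `append_entry` are hypothetical names, and the field layout is inferred only from the three example lines shown above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

SEP = " · "  # field separator, as seen in the example LOG.md lines


@dataclass(frozen=True)
class LogEntry:
    date: str      # ISO date, e.g. "2026-05-22"
    session: str   # e.g. "session-12"
    kind: str      # e.g. "decided", "created", "note"
    text: str      # free-form description
    tag: str = ""  # "", "[candidate]" or "<!-- sealed -->"


def format_entry(e: LogEntry) -> str:
    """Render one entry as a single markdown bullet line."""
    line = "- " + SEP.join([e.date, e.session, e.kind, e.text])
    return f"{line}  {e.tag}" if e.tag else line


def parse_entry(line: str) -> Optional[LogEntry]:
    """Parse a bullet line into a LogEntry; return None for non-entry lines."""
    if not line.startswith("- "):
        return None
    # Split on the first three separators only, so the text may contain "·".
    parts = [p.strip() for p in line[2:].split("·", 3)]
    if len(parts) != 4:
        return None
    date, session, kind, text = parts
    tag = ""
    for candidate in ("[candidate]", "<!-- sealed -->"):
        if text.endswith(candidate):
            tag = candidate
            text = text[: -len(candidate)].rstrip()
    return LogEntry(date, session, kind, text, tag)


def append_entry(path: Path, e: LogEntry) -> None:
    """Append one line; opening in mode "a" never rewrites earlier content."""
    with path.open("a", encoding="utf-8") as f:
        f.write(format_entry(e) + "\n")
```

Opening the file in append mode is the whole point of the helper: the code path has no way to touch a past line, so a correction can only ever be a new line.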

The four rules

Two kinds, never mixed. HANDOFF.md is current state — overwrite it freely. LOG.md is history — append only; never edit a past line (a correction is a new line). This one split is what makes the whole thing trustworthy.
First thing every session: read HANDOFF.md, then the new lines in LOG.md since you last looked. That's your re-prime — under a minute, no human needed.
Last thing every session: append what you did to LOG.md, then update HANDOFF.md to match. (An agent can do both as part of "wrap up.")
Tag what isn't settled. [candidate] = produced by an agent, not human-confirmed. `<!-- sealed -->` = a decision that must not be "cleaned up" away. Agents read these.

That's it. No tool to install, no service to run — git plus two markdown files. Want it as one copy-paste page (both templates + the rules + the agent instruction)? The Agent Context Quickstart.

Tell your agent once (system prompt / CLAUDE.md / .cursorrules): "At the start of every session read HANDOFF.md and the recent LOG.md lines before doing anything. At the end, append your actions to LOG.md and update HANDOFF.md. Never edit past LOG lines; never touch a `<!-- sealed -->` decision without asking." Now the discipline is the agent's job, not yours.
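The "never edit a past line" rule can also be checked mechanically rather than trusted. Here is a minimal sketch, assuming you can obtain the previous and current text of LOG.md as strings; `is_append_only` and `lines_since` are hypothetical helper names, not something the framework ships.

```python
from typing import List


def is_append_only(old: str, new: str) -> bool:
    """True if every line of `old` survives unchanged, in order, at the start of `new`."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    return new_lines[: len(old_lines)] == old_lines


def lines_since(log: str, seen: int) -> List[str]:
    """Return the lines added after the first `seen` lines (the re-prime tail)."""
    return log.splitlines()[seen:]
```

One way to get the `old` text is `git show HEAD:LOG.md`, compared against the working copy before a commit. `lines_since` covers the other half of the rules: the "new lines in LOG.md since you last looked" that a fresh session reads after HANDOFF.md.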

What you actually get

Re-prime drops from ~15 min to ~1 min. The agent reads two files and is current — you stop being a human context-cache.
Decisions stop silently reverting. A sealed line in Decided is a wall the next session sees; the Wednesday-undoes-Monday failure mostly stops.
You can stop mid-task and resume clean. In flight + the LOG tail tell the next session exactly where to pick up — even a different agent, even weeks later.

Honest cost: ~2 minutes of discipline per session (append + update), and it pays off only once you're past a handful of sessions or running more than one agent. Below that, a plain README is fine — don't over-build.

Why it works (the 30-second version)

Documentation is your team's shared memory. When some teammates wipe their memory every session, the documents have to carry state — and the reliable way to carry state across actors that can't sync live is exactly this: one append-only history plus a derived current-state view. That's the eventually-consistent coordination pattern distributed systems have used for decades; I just borrowed it. The full standard — including the multi-repo and multi-agent versions, and the failure modes — is framework/paperwork-standard.md.
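The "append-only history plus a derived current-state view" idea can be shown in a few lines. This is a sketch under stated assumptions: the `Event` shape and the `started` / `finished` kinds are hypothetical (the LOG.md example only shows `decided`, `created`, and `note`), and a real HANDOFF.md carries more than these two sections.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

# (kind, key, detail): a hypothetical, simplified event shape
Event = Tuple[str, str, str]


@dataclass
class Snapshot:
    decided: Dict[str, str] = field(default_factory=dict)    # settled decisions
    in_flight: Dict[str, str] = field(default_factory=dict)  # mid-task work


def derive(events: Iterable[Event]) -> Snapshot:
    """Fold the history, oldest first, into a current-state view."""
    snap = Snapshot()
    for kind, key, detail in events:
        if kind == "decided":
            # A later line on the same key acts as the correction.
            snap.decided[key] = detail
        elif kind == "started":
            snap.in_flight[key] = detail
        elif kind == "finished":
            snap.in_flight.pop(key, None)
        # Any other kind (e.g. "note") is history only and changes no state.
    return snap
```

Because the view is a pure function of the log, replaying a prefix of the history reproduces the state at that point, which is what lets you recover after a session leaves a mess.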

This covers one repo and one session at a time. The moment you have two agents writing at the same instant, or an invariant that must never break even for a second, two markdown files can't promise it — and that's a real, provable limit, not a gap you patch with better notes. Knowing which side of that line you're on is the next post.

The story below the setup (optional — skip if you came for the templates)

You can stop here with a working setup. If you want the why behind the why, here it is.

Back in late 2024 / early 2025, when I first started handing agents real work — audit this service, draft this migration, pick up where the last session left off — this was a dumb, recurring tax. Every new session opened with me re-explaining the same context, and by the third I was burning fifteen or twenty minutes re-establishing state a human teammate would simply have had. So I wrote a better HANDOFF.md. Then a better one. The overhead kept climbing, and a voice in the back of my head kept saying: you're carving this at the wrong joint.

So I made the mistake of following the problem — and it turned out not to be the problem I thought it was. Strip the word "documentation" and it's stark: I had actors that start cold, run briefly, and can't talk to each other in real time, and they had to act coherently anyway. That's not a docs question — it's distributed systems, a field that's been proving theorems about exactly this since the 1970s. I'd been hand-rolling a worse version of it in markdown without noticing.

That's also why this framework wears Warhammer 40,000 names, in case the "darkness" felt like an affectation. The Imperium of Man runs a galaxy with no real-time communication — its ships cross the warp, where they're simply unreachable. So it governs on three things: frozen edicts (decided once, not renegotiable by whoever's nearest), the Adeptus Administratum (literally galactic paperwork), and the Astronomican — a beacon of light a ship lost in the dark steers by. Strip the gothic paint and that's the entire engineering of this post: frozen authority, durable records, and a signal that survives. The darkness in the title is the warp between your sessions; your two files are the Astronomican.

And there's a catch I'll be honest about, because it shapes the whole series: that "real, provable limit" from the "Why it works" section isn't hand-waving. Coordinating actors with no live channel runs into a genuine theorem (CAP), and it caps what any pile of documents can promise. So after I'd borrowed all this and wired it together, I spent more effort trying to break it than to build it: cold, hostile reviewers; an independent pass over every borrowed citation; benchmarks designed to make it fail. Some of it failed. That story is the rest of the series, but your setup above doesn't wait on any of it.

"Every session starts in darkness" is Warhammer-flavoured naming, nothing more. Independent practitioner exploration; no affiliation with Games Workshop. Repository MIT-licensed.

#DeadLightFramework #AIAgents #AIProductivity #Documentation #ContextContinuity #AIAgentGovernance #HumanAICollaboration #PromptEngineering #DevTools

Dead Light Framework: An Experimental Framework for Human-AI Collaboration #Post 1

Lê Tú Hào — Tue, 12 May 2026 04:39:23 +0000

The Emperor Is All But Dead. The Light Remains.

An experimental governance framework for software teams of humans and AI agents — and a request to be argued with

Status: experimental. Unverified in the field. Looking for sparring partners more than followers.

By: a developer with ~10 years across many projects, not an academic or industry authority — full bio at the bottom.

Published: 2026-05-11 · ~8 min read · Repository: github.com/letuhao/dead-light-framework

TL;DR

I have been building software with AI agents long enough to see the same governance failure mode appear over and over: agents and humans contradicting Monday's decisions on Wednesday, layers leaking into each other, no anchor to navigate by. I am testing the hypothesis that human + AI software projects need a frozen source of authority that no participant — including the author — can rewrite at will. This post is the opening of an open debate; sharper arguments against it would help me more than agreement.

One question I most want to be wrong about: Is "frozen authority" actually compatible with "evolutionary architecture"? I think yes — argue with me.

The pain I keep running into

I have been building software with AI agents long enough — daily, across multiple projects — to recognize a pattern that does not look like a bug.

On Monday, an agent and I agree that the auth layer should not know about billing. On Wednesday, a different session of the same agent cheerfully imports a billing helper into the auth module, because the prompt of the day made it convenient. The change passes review, because the human reviewer has also forgotten the Monday conversation. By the time anyone notices, the layering decision has been quietly inverted in three places.

Another version of the same story: I commit a fix with a comment explaining why a specific branch of code must stay. Two days later, a fresh agent session is sent in to clean up TODOs and reads the comment as a TODO. By morning the carefully-preserved branch is gone, and the previous session's reasoning died with the previous session.

This is not a model failure. It is not a human failure either. It is the predictable result of a team in which:

Some members are stateless. Foundation-model agents have well-documented memory and identity limits across sessions (see Bommasani et al. 2021, On the Opportunities and Risks of Foundation Models; Park et al. 2023, Generative Agents).
The "why" behind past decisions is in nobody's working memory. Humans forget. Agents don't even start with the context.
Many actors can each "decide". When everyone has authority to nudge a direction, nothing actually sticks.
Latest input dominates. Agents will amplify whatever the most recent prompt suggests, including the wrong directions.

I have come to think of these as governance gaps wearing technical disguises. No amount of better prompts, better tests, or better refactor discipline patches them. They are properties of the team shape, not of any single contributor.

What we're fighting against — "The Chaos"

The failure pattern above has a name in this framework: The Chaos. It is the umbrella for four specific drift modes that tend to compound:

Context rot — agents lose the why behind past decisions and re-invent or contradict prior choices across sessions (the Monday/Wednesday and TODO-misread stories above).
Architect rot — without a fixed reference, refactors land in incompatible directions. Humans and agents drift further apart from any earlier coherent design.
Scope creep — the project keeps absorbing new concerns. Agents amplify it because the latest prompt is always more vivid than the original mandate.
Accumulated technical debt — local conveniences that, once normal, are hard to undo. Humans and agents together can ship more of it, faster than a single human could.

This is roughly what the AI-dev community has lately started calling "vibe coding": shipping code by feel, with agents steering, no anchor strong enough to make Monday's promise survive into Wednesday's commit. Vibe coding is wonderful for prototypes. It is brutal for anything that has to outlive a single session.

The framework's job is not to forbid vibe coding. It is to give a project enough of a fixed backdrop that, when it graduates from prototype to thing-people-rely-on, decisions can be made against something stable instead of against the void.

Where existing methodologies leave a design slot empty

I want to be careful here, because this is the easiest place to overreach.

Waterfall, Agile, Scrum, SAFe, RUP — these work. I am not in a position to grade them. If an AI agent shows up to a stand-up the way a competent teammate does — persistent role, accountable for decisions, reads the working agreements, follows what was decided yesterday — Scrum runs the same as it always has. Sometimes better, frankly, because the agent does not forget the meeting on the drive home.

So I do not want to claim the methodologies "fail" or "stop covering" anything when agents join. That would be both arrogant and inaccurate.

What I do think is narrower: none of these methodologies were designed with AI agents as first-class participants in mind. They do not specify what an "agent role" looks like — its memory model, its onboarding procedure, its authority bounds, its drift profile, how its decisions are attributed across sessions. That is an unfilled design slot, not a coverage failure.

The Dead Light Framework is one attempt at filling that slot. It sits on top of whatever delivery framework you already run, not in place of it. If your Scrum is well-disciplined and your reviews are tight, you will catch some of the failure modes I described above without any of this. The framework is for the parts your existing process was never asked to handle in the first place.

The hypothesis (the part you should attack)

The thing I am testing is one sentence:

A software project for humans + AI agents needs a frozen source of authority that no participant — human or agent — can rewrite at will.

Codified once by a small council. Sealed before kickoff. Humans interpret it. Agents execute within it. Neither group obeys a person — both navigate by the same fixed light.

This is not a radical idea outside software. It is roughly how constitutional federalism works (the U.S. Constitution constrains every subsequent administration), how religious institutional canon works (the Nicene Creed is older than any living interpreter), how central-bank mandates work (a price-stability mandate outlasts any single governor), and how RFC-driven protocol governance works (TCP/IP does not get rewritten because a vendor finds it inconvenient).

What is novel — if anything — is applying this pattern at the level of an individual software project, with AI agents as first-class participants whose context windows guarantee the authority cannot live in their heads.

I call the sealed document the Astronomican. I call the sealing meeting the Ascension Council. I call the agent-type rulebooks Codices. The names are borrowed from Warhammer 40,000.

About the metaphor (important)

I want to be honest about this up front, because it is the obvious objection.

The Imperium of Mankind in Warhammer 40,000 is a cautionary tale. It is grimdark by design: a bureaucratic, paranoid, ossified empire that fails spectacularly across ten thousand years. Picking it as a governance metaphor without acknowledging that is internally contradictory.

So I do not use it as evidence. The framework's policy, written into its own rules, is:

40k vocabulary is naming and shared metaphor only. Every load-bearing argument must rest on a real-world system with an observable track record: constitutional federalism, military command-and-control doctrine, central-bank mandates, religious canon, established corporate practice (Toyota Production System, Amazon two-pizza teams), open-source governance, established software methodologies.

When the 40k name and the real-world precedent disagree, the real-world precedent governs. The Imperium provides memorable names. Toyota's Andon Cord, the U.S. military's C2/SIGINT loop, and Bezos-era Amazon's API mandate provide the actual design lessons — particularly on the hardest problem the Imperium itself failed at: centralized authority combined with distributed sensing.

If you find a place in the framework where I leaned on 40k as an argument rather than as a name, that is a finding. Please file it.

Glossary — 40k terms used above

For readers who do not know Warhammer 40,000 — one-liners on each term used in this post.

Astronomican — In W40k, the psychic beacon that guides the Imperium's space travel after its god-emperor has all but died. In this framework: the name for the sealed project document of purpose, immutable laws, and guiding principles.

Imperium of Mankind — The fictional galactic empire in W40k. Used here only as a memorable source of names; not as a governance role model (the empire fails spectacularly in canon — that is part of why I quote it carefully).

Codex / Codices — In W40k, the rulebook each Space Marine Chapter operates under. In this framework: the rulebook each AI agent type operates under (operational bounds, hard stops, output contract, notify triggers).

Adeptus Administratum — In W40k, the Imperial bureau of records, taxation, and administrative logistics — the empire's "chief of paperwork." In this framework: the first sealed Chapter — a PM / High-Lord aide role.

Ascension Council — Not from canon. The framework's name for the one-time small group of humans who seal the project's founding document before kickoff and then disband.

Chapter / Chapters — In W40k, a self-contained battle order of Space Marines, each with its own Codex. In this framework: an agent type, each with its own Codex.

The Chaos — In W40k, the warp-based corrupting forces the Imperium fights eternally. In this framework: the umbrella failure mode the framework tries to defend against — context rot, architect rot, scope creep, accumulated technical debt; roughly the kind of drift "vibe coding" produces when extended beyond prototyping.

What this is and is not

This is:

A composition layer that sits on top of Agile / Scrum / Kanban / whatever you already run. It does not replace delivery rhythm.
An attempt to give projects a constitution-like artifact and an explicit protocol for agent participation.
A working hypothesis with a documented audit trail (38 findings against my own claims, all remediated, still openly listed).

This is not:

Proven. I have one in-flight case study (a 358-KLOC project called LoreWeave). One case is not evidence. It is a smoke test.
A productivity tool. It will add overhead before it removes any.
A claim that you should run your project this way. It is a claim that the failure modes are real, that existing methodologies were simply not designed with agent participants in scope, and that some framing in this neighborhood is probably needed to fill that slot.

Where this stands today

Phase 0 (the calibration/audit phase for retrofit projects) — sealed.
Phase 1 (the Astronomican itself) — partial. Six known open questions, listed publicly.
Phase 2 (Codex per Chapter) — first Chapter sealed (a PM/High-Lord aide called the Adeptus Administratum). Others wait for real-project triggers.
Phase 3 (drift detection) and Phase 4 (re-consecration) — not started.
One case study (LoreWeave) — Phase 0 Pass 1 about to begin.
Internal audit (Independent Verification Pass) — five of seven phases complete. The audit is public, including the times the framework failed its own audit.

Everything is in the open. The framework is being built in a single repo with full debate history.

Repository: github.com/letuhao/dead-light-framework

What I want from readers

Not converts. Arguments.

Specifically, I want people to attack these:

Is "frozen authority" actually compatible with "evolutionary architecture"? I think yes, with a re-consecration ceremony. But that ceremony is unsealed and you might convince me it is impossible.
Does the 40k vocabulary do more harm than good? I find it useful as memorable scaffolding for a debate-driven team. But it may be repelling readers who would otherwise engage.
Where does an industry standard already do this job? If COCOMO II / CMMI v3.0 / ITIL 4 / DORA already cover one of the gaps I think I am filling, I want to know before adding another box.
What is the smallest experiment that would falsify the framework? I am genuinely unsure how to design this. A failed retrofit on one project is suggestive, not conclusive.
What did I import from the Imperium that I should not have? I keep finding things. Help me find more.

What's coming next

A short series of posts will work through:

The case study in detail (where it hurt, with numbers).
Why Agile/Scrum specifically do not cover this gap.
The mechanics of sealing an Astronomican.
The Codex pattern for AI agents.
How the framework audits itself (and the times it has failed).
The anti-patterns I knowingly imported from a fictional dying empire, and how I compensate.
Open questions where the framework could still be wrong.
A practical adoption sketch — without promising it works.

If any of the failure modes I described sound like the project you are in right now, I would especially like to hear from you. The framework is far more useful as a piñata than as a manifesto.

About the author

A working developer with roughly ten years of experience across a range of projects. Not an academic, not an industry authority on software methodology, not a methodologist of any kind. No chair, no certification body, no track record of published frameworks behind me.

The Dead Light Framework — the subject of this post and the series it opens — is a personal exploration: one practitioner's attempt at finding methods that hold up when AI agents become full-time teammates. I publish it openly because I would rather be told I am wrong by people who have stood in front of the same problems than be politely ignored.

If I sounded certain anywhere above, treat that as a slip in tone, not a claim of authority. The framework is at hypothesis stage. Everything is in scope to be argued with.

The Emperor is all but dead. The light remains.

Repository: github.com/letuhao/dead-light-framework

Independent practitioner exploration. No affiliation with Games Workshop. Repository MIT-licensed.
