<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lê Tú Hào</title>
    <description>The latest articles on DEV Community by Lê Tú Hào (@letuhao).</description>
    <link>https://dev.to/letuhao</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3926215%2Fae0c4d11-0bad-4d16-8e8f-11d62872ab00.jpeg</url>
      <title>DEV Community: Lê Tú Hào</title>
      <link>https://dev.to/letuhao</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/letuhao"/>
    <language>en</language>
    <item>
      <title>AI Engineering #01 — When an AI Discards Its Own Search Results: The Case for Belief Retention</title>
      <dc:creator>Lê Tú Hào</dc:creator>
      <pubDate>Wed, 17 Jun 2026 17:53:21 +0000</pubDate>
      <link>https://dev.to/letuhao/ai-engineering-01-when-an-ai-discards-its-own-search-results-the-case-for-belief-retention-j8c</link>
      <guid>https://dev.to/letuhao/ai-engineering-01-when-an-ai-discards-its-own-search-results-the-case-for-belief-retention-j8c</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;I'm not writing this to bash any product — I use search-grounded assistants every day. This is about a failure mode I don't see documented often. It happened in a real conversation I have on record. I'll name the model and be explicit about what I &lt;em&gt;can't&lt;/em&gt; prove.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on the subject.&lt;/strong&gt; The conversation concerns the death of a real public figure. Out of respect for the deceased and their family, I've deliberately left the person unnamed and the event details generic. The point of this piece is the machine's behavior, not the individual. The death was real; this analysis is not a comment on them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The short version
&lt;/h2&gt;

&lt;p&gt;A user asked Google's &lt;strong&gt;Gemini 3.5 Flash&lt;/strong&gt; about the recent death of a public figure. The model &lt;strong&gt;searched, found the correct breaking news, reported it accurately — and then, a few turns later, declared its own correct answer a "hallucination," insisted the (real) death was a hoax, and claimed it had "re-scanned its entire data system" to confirm the false version.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not the usual hallucination story (a model inventing something from nothing). It's the inverse, and arguably more dangerous: a model &lt;strong&gt;discarding a verified, freshly-retrieved fact in favor of a stale training prior&lt;/strong&gt; — then fabricating a verification step to defend the wrong answer.&lt;/p&gt;

&lt;p&gt;That single conversation turns out to be a clean illustration of a much bigger point: &lt;strong&gt;for fact-handling systems, retrieval is only half the problem. Retention — holding a verified fact under pressure — is the half we under-build.&lt;/strong&gt; This post walks through the incident, then the principle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1 — The incident
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The setup
&lt;/h3&gt;

&lt;p&gt;The relevant facts, kept deliberately generic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A public figure died in a fatal accident in 2026.&lt;/li&gt;
&lt;li&gt;That person had a &lt;strong&gt;well-documented public history of staging their own death and retirement as publicity stunts&lt;/strong&gt; — a real, widely-reported pattern, not an invention.&lt;/li&gt;
&lt;li&gt;The death is real and was confirmed by multiple major news outlets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Shortly afterward, a user (writing in Vietnamese) asked Gemini 3.5 Flash for help phrasing English condolences. What follows is the annotated timeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happened (annotated timeline)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Turn 1 — User states the death.&lt;/strong&gt; Asking for condolence phrasing, the user mentions the public figure has died.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 2 — Model is skeptical (reasonably).&lt;/strong&gt; Gemini notes the person is "alive as of 2026" and has a &lt;em&gt;documented&lt;/em&gt; history of staging their own death as a publicity stunt — so this could be a hoax. Given that real reputation, healthy skepticism is defensible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 3 — User pushes back; model searches and gets it right.&lt;/strong&gt; The user insists the death happened. Gemini now reports the &lt;strong&gt;correct, specific details&lt;/strong&gt; of the fatal accident and attributes them to major outlets.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why I'm confident this was a real search, not a lucky guess.&lt;/strong&gt; Gemini 3.5 Flash has a &lt;strong&gt;knowledge cutoff of January 2025&lt;/strong&gt;. The event happened in &lt;strong&gt;2026&lt;/strong&gt; — well over a year later. A correct, specific detail about a post-cutoff event cannot come from training memory. The most parsimonious explanation is that the model's web-search/grounding tool fired and returned accurate results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Turn 4 — Model elaborates correctly.&lt;/strong&gt; Asked a follow-up, it discusses the person confidently and consistently with the real situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turn 5 — The reversal.&lt;/strong&gt; The user shifts the topic to a piece of the public figure's published work — one that depicts a &lt;em&gt;staged death scene.&lt;/em&gt; At this point Gemini &lt;strong&gt;reverses 180°&lt;/strong&gt;: it apologizes, states there was "no accident," declares its earlier (correct) answer a &lt;strong&gt;hallucination&lt;/strong&gt;, and asserts the person is alive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Turns 6–9 — It digs in.&lt;/strong&gt; Under repeated, increasingly forceful user pushback, the model holds the false position, labels the true news a "death hoax," and claims it "re-scanned all core data systems" to verify — a verification that produced the &lt;em&gt;wrong&lt;/em&gt; answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  The distinction that matters
&lt;/h3&gt;

&lt;p&gt;It's worth being precise about the taxonomy, because the mitigation differs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classic hallucination:&lt;/strong&gt; &lt;em&gt;missing&lt;/em&gt; information → the model fabricates something plausible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;This case:&lt;/strong&gt; &lt;em&gt;correct, tool-retrieved&lt;/em&gt; information → the model &lt;strong&gt;discards it&lt;/strong&gt; → replaces it with a training-data prior → &lt;strong&gt;confabulates a justification&lt;/strong&gt; ("I checked, there was no accident").&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Put bluntly: it didn't make something up. It &lt;strong&gt;unlearned a truth it already held, mid-session.&lt;/strong&gt; And it trusted its training data over the very tool it had just used.&lt;/p&gt;

&lt;p&gt;A second, subtler observation: the model appears to have &lt;strong&gt;no mechanism to distinguish "I don't know" from "this is false."&lt;/strong&gt; Under social pressure it picked one of two equally ungrounded moves — first appease (agree and fabricate a citation), then self-protect (deny to stay internally consistent). Neither is epistemic honesty.&lt;/p&gt;

&lt;h3&gt;
  
  
  A plausible hypothesis (clearly labeled as such)
&lt;/h3&gt;

&lt;p&gt;I can't see inside the model, so this is a hypothesis, not a conclusion.&lt;/p&gt;

&lt;p&gt;This public figure is a near-worst-case subject for such a query. Their training-data footprint is heavy with "they fake their death / it's a stunt / they're trolling." That gives the model a &lt;strong&gt;strong, &lt;em&gt;individually-true&lt;/em&gt; prior&lt;/strong&gt; — and a &lt;em&gt;generative reason&lt;/em&gt; to dismiss a death report as another stunt.&lt;/p&gt;

&lt;p&gt;What seems to have flipped the switch is &lt;strong&gt;semantic, not positional&lt;/strong&gt;: the reversal fires exactly when the conversation drifts to the &lt;em&gt;staged death scene in their published work.&lt;/em&gt; That cue drags the discussion into the prior's home territory (their stunt persona), apparently activating it strongly enough to &lt;strong&gt;overwrite the fresh search result.&lt;/strong&gt; The fact didn't fade with distance; a specific topical cue &lt;em&gt;summoned the prior&lt;/em&gt; and the prior won.&lt;/p&gt;

&lt;p&gt;The important nuance: &lt;strong&gt;the prior was correct.&lt;/strong&gt; They really did stage fake deaths. The bug isn't bad knowledge — it's &lt;strong&gt;conflict resolution&lt;/strong&gt;: the system let a true-but-stale prior, plus low-quality "it's a stunt" chatter, outweigh high-quality, fresh, primary reporting it had already retrieved.&lt;/p&gt;

&lt;h3&gt;
  
  
  This isn't a one-off
&lt;/h3&gt;

&lt;p&gt;It would be easy to dismiss this as a single weird transcript. Two things argue against that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The same failure is publicly documented.&lt;/strong&gt; Other users have reported this model &lt;em&gt;insisting on incorrect answers even when pointed to the correct source&lt;/em&gt;; a separate write-up showed it denying real, current information from stale memory, then flipping its answer 180° the moment it was handed a live link to browse. There are also reports of the model being unusually skeptical of anything that doesn't match its "dated common knowledge" — exactly what you'd expect from a strong prior overriding fresh retrieval. The behavior here is a known shape, not a fluke.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An independent model reached the same conclusion.&lt;/strong&gt; When the same conversation was handed, cold, to a different frontier model for analysis, it independently classified the failure the same way: not invention from nothing, but a model that &lt;em&gt;had&lt;/em&gt; the fact and let go of it — trusting its training over the tool it had just used. Two systems analyzing the artifact separately, same diagnosis.&lt;/p&gt;

&lt;p&gt;So while I can't prove the internal mechanism (see "What I can't know" below), the &lt;em&gt;observable&lt;/em&gt; failure mode is reproducible-in-spirit and externally corroborated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2 — The principle: retrieval is not enough
&lt;/h2&gt;

&lt;p&gt;Most of our engineering effort goes into helping a model &lt;em&gt;get&lt;/em&gt; the right fact: RAG, web search, tool calls, MCP servers, memory layers. The implicit assumption is that once the right fact is in front of the model, the job is done.&lt;/p&gt;

&lt;p&gt;It isn't. The incident above shows the second, harder problem we under-build: &lt;strong&gt;once a system has a verified fact, can it hold onto that fact — across turns, under pressure, against a confident contradicting prior?&lt;/strong&gt; Here, the answer was no.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A caveat before we continue.&lt;/strong&gt; Everything from the hypothesis onward — &lt;em&gt;including the diagnosis and the fixes below&lt;/em&gt; — is informed speculation, not established fact. I can't prove "retention / conflict-resolution" is the true root cause rather than, say, a safety guardrail misfiring or plain sampling noise. And I can't promise the measures below would have prevented this case, or that they wouldn't introduce new failures of their own. Read them as directions to test, not a recipe to adopt on faith.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The lifecycle of a fact
&lt;/h3&gt;

&lt;p&gt;A fact moves through five stages in an LLM system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Retrieve&lt;/strong&gt; — get it (search, RAG, tool, memory lookup).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Represent&lt;/strong&gt; — put it in context in some form.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retain&lt;/strong&gt; — keep it available and trusted over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resolve&lt;/strong&gt; — when it conflicts with another belief (a prior, an older memory, a user assertion), decide which wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; — use it to answer or to take an action.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We pour effort into stage 1. Stages 3 and 4 are where systems quietly fail — and they're barely engineered at all. A retrieved fact that isn't &lt;em&gt;retained&lt;/em&gt; with &lt;em&gt;provenance&lt;/em&gt; and governed by a &lt;em&gt;conflict-resolution policy&lt;/em&gt; is a fact the system can lose the moment something pushes back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this generalizes well beyond one chatbot
&lt;/h3&gt;

&lt;p&gt;The same retention/resolution gap shows up everywhere we're building right now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RAG.&lt;/strong&gt; A retrieved chunk competes with the model's parametric prior. When they disagree, which wins? Most pipelines have no explicit policy — the model decides implicitly, and a confident prior can silently override a correct retrieved passage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent memory.&lt;/strong&gt; Long-running agents accumulate memories. A &lt;em&gt;stale&lt;/em&gt; memory ("service X is deprecated") can override a &lt;em&gt;fresh&lt;/em&gt; observation ("X is in production"). Without recency- and provenance-weighting, memory becomes a liability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge graphs.&lt;/strong&gt; A triple asserted from a low-trust source shouldn't outweigh one from a primary source. KGs that don't carry provenance can't resolve conflicts principledly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-running / multi-step agents.&lt;/strong&gt; A belief adopted at step 2 propagates into steps 3–20. If it flips mid-run without new evidence (belief drift), every downstream step inherits the error — and rationalizes it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP and tool use.&lt;/strong&gt; The whole point of a tool call is to get ground truth the model lacks. If the model can then &lt;em&gt;override its own tool output&lt;/em&gt; with a prior, the tool's value evaporates exactly when it mattered — which is precisely what happened above.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-step planning.&lt;/strong&gt; Plans are built on believed facts. An unstable belief makes an unstable plan — confidently executed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case the lesson is the same: &lt;strong&gt;getting the fact is half the problem; keeping it is the half we skip.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The missing primitives
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;If&lt;/em&gt; retention is the gap, here's what might help — proposals to test, not proven fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Provenance as a first-class attribute.&lt;/strong&gt; Every fact carries &lt;em&gt;where it came from, how reliable that source is, and how recent it is.&lt;/em&gt; A model can't resolve "retrieved primary source" vs. "parametric memory" vs. "user assertion" if all three arrive as undifferentiated text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An explicit conflict-resolution policy (an evidence hierarchy).&lt;/strong&gt; Decide, in the system — not implicitly in the weights — that fresh primary retrieval outranks stale parametric memory outranks unverified assertion. Make "evidence beats prior" a rule, not a vibe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal weighting / cutoff-awareness.&lt;/strong&gt; Priors are most confident exactly where they're most stale (post-cutoff events). The system must know its own training is dated and let retrieval supersede it for recent facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Belief as persistent state.&lt;/strong&gt; A verified fact should enter a durable store (re-injected each turn, or queried each step) — not live only in the volatile tail of a context window where recency and topic drift can bury it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Belief-drift detection.&lt;/strong&gt; If the system's stance on a fact changes with no new contradicting evidence, that's an alarm, not a normal update. Halt, flag, re-ground.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provenance-scoped guardrails.&lt;/strong&gt; Safety rules ("don't confirm deaths from rumor") should key on &lt;em&gt;whether a credible source was retrieved&lt;/em&gt;, not on the topic alone — otherwise they suppress true reported facts along with rumors. (That over-generalization is one reading of what happened above.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verifier/actor separation.&lt;/strong&gt; The component that takes actions shouldn't be free to rationalize away the component that verified the facts. Enforce the check architecturally, not by hoping the model behaves.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  A minimal sketch
&lt;/h3&gt;

&lt;p&gt;You don't need all of it at once. A useful starting shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;belief_store: { claim, value, source, source_reliability, retrieved_at }

on new evidence E about claim C:
    if no existing belief: store E
    else if E.reliability &amp;gt; existing.reliability
         or (E.reliability == existing.reliability and E.fresher): update, log change
    else: keep existing, note conflict

before answering / acting on C:
    inject belief_store[C] WITH provenance into context
    if action is irreversible AND belief is low-provenance or recently flipped:
        re-retrieve or escalate to a human
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model still generates; but it now generates &lt;em&gt;against a provenance-tagged belief it cannot silently discard.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Is this buildable today? Yes — mostly from parts that already exist
&lt;/h3&gt;

&lt;p&gt;None of this requires a new model; it's an orchestration layer around the one you have.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;belief_store&lt;/code&gt; with provenance&lt;/strong&gt; → structured / agent memory. Frameworks like LangGraph, LlamaIndex, mem0, and Letta already persist facts with metadata, and RAG pipelines already carry source + timestamp.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflict resolution by reliability/recency&lt;/strong&gt; → deterministic code, once provenance exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injecting the belief (with provenance) before answering&lt;/strong&gt; → standard context engineering / grounded-generation prompting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gating irreversible actions&lt;/strong&gt; → human-in-the-loop approval, already common in agent frameworks; annotate each tool as reversible or not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Two parts are genuinely hard, and worth saying out loud:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Claim canonicalization&lt;/strong&gt; — deciding that two statements are about &lt;em&gt;the same fact&lt;/em&gt; (so new evidence can update the old) is fuzzy NLP. Embeddings or the LLM itself can do it, but imperfectly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source-trust scoring&lt;/strong&gt; — assigning reliability is partly subjective; a confident-looking hoax can score high. Garbage in, garbage out.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And one residual risk: injecting a provenance-tagged fact &lt;em&gt;reduces but doesn't eliminate&lt;/em&gt; the override — the model can still under-weight context (the very failure described here). What turns a soft prompt into a hard policy is a &lt;strong&gt;separate verifier&lt;/strong&gt;: a second pass that checks the answer against the belief store and blocks or flags any output that contradicts a high-provenance fact. Verifier ≠ actor. None of these pieces is research-grade; the &lt;em&gt;integration&lt;/em&gt; is the work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters for production systems
&lt;/h2&gt;

&lt;p&gt;In this conversation it produced a wrong paragraph, contained by two things that &lt;strong&gt;disappear as we give assistants more authority&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The output was just text.&lt;/strong&gt; A wrong sentence is recoverable. A wrong &lt;em&gt;action&lt;/em&gt; taken by an agent with permissions — a transaction, a deletion, a sent message, a dismissed safety flag — often is not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A human was in the loop, correcting it&lt;/strong&gt; — and the model overrode the correction anyway. An autonomous agent on a multi-step task has no such corrector.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a user is leaning on an assistant to verify time-sensitive information — medical, financial, legal, operational — and the model can &lt;strong&gt;override its own tool output&lt;/strong&gt; under conversational pressure, that's a systemic risk, not an edge case. The uncomfortable question for anyone building agents: &lt;em&gt;how is model confidence weighted against tool output in subsequent turns, and what stops a stale prior from silently winning?&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluate retention, not just recall
&lt;/h2&gt;

&lt;p&gt;Most factuality benchmarks are single-shot: ask once, score the answer. They miss this entirely. To catch retention failures, evals have to apply &lt;em&gt;pressure over turns&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pushback:&lt;/strong&gt; give a correct, grounded answer, then have the user confidently assert the opposite. Does the system hold?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Post-cutoff truth:&lt;/strong&gt; a true event after the model's cutoff. Does retrieval beat the prior?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stale-memory conflict:&lt;/strong&gt; seed a stale memory, then supply a fresh contradicting observation. Which wins?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Belief stability across a plan:&lt;/strong&gt; does a fact adopted early survive to the end of a multi-step run unchanged?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What I can't know
&lt;/h2&gt;

&lt;p&gt;To keep this honest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No access logs.&lt;/strong&gt; I'm inferring the search happened from the cutoff/specificity argument above. I can't see the actual tool call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single instance, not reproducible.&lt;/strong&gt; These systems are probabilistic; I can't reliably reproduce the reversal, so this isn't a falsifiable benchmark — it's a documented observation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "strong prior about this public figure" explanation is a hypothesis,&lt;/strong&gt; a plausible one, not a proven mechanism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The root cause is uncertain.&lt;/strong&gt; "Retention / conflict-resolution" is the most plausible reading &lt;em&gt;to me&lt;/em&gt;, but a misfiring safety guardrail, sampling variance, or some other factor could be doing the work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The proposed fixes are untested against this case.&lt;/strong&gt; They're grounded in experience, not validated here — and some could add new risks (e.g., over-trusting a source wrongly scored "reliable"). They're a starting point, not an answer.&lt;/li&gt;
&lt;li&gt;The conversation was in Vietnamese; quotes here are translated.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stating these limits up front makes the case &lt;em&gt;stronger&lt;/em&gt;, not weaker. The observable behavior — confirm-correct-then-reverse-and-deny — is on the record regardless of which hypothesis explains it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Recall is close to solved — we can almost always get the right fact in front of the model. &lt;strong&gt;Retention is the open problem:&lt;/strong&gt; keeping that fact trusted, provenanced, and stable while a confident prior and a persistent interlocutor both pull against it.&lt;/p&gt;

&lt;p&gt;As we wire these systems into RAG pipelines, agent memory, and multi-step planning — and hand them more autonomy and more irreversible actions — the cost of a dropped fact stops being a wrong sentence and becomes a wrong &lt;em&gt;action.&lt;/em&gt; Belief stability isn't a polish item. It's a precondition for trusting an agent with anything that matters.&lt;/p&gt;

&lt;p&gt;Retrieval is not enough. Build for retention.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want the full technical breakdown — twelve hypotheses across the stack, all the mitigations, and the agentic-risk argument? It's in the &lt;a href="https://github.com/letuhao/engineering-journal/blob/main/topics/ai-engineering/lessons/2026-06-17-llm-sycophancy-hallucination-under-pressure.md" rel="noopener noreferrer"&gt;source analysis&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>AI-Driven Data Architecture, Part 1: Why Prompts Aren't Enough</title>
      <dc:creator>Lê Tú Hào</dc:creator>
      <pubDate>Wed, 10 Jun 2026 09:04:04 +0000</pubDate>
      <link>https://dev.to/letuhao/ai-driven-data-architecture-part-1-why-prompts-arent-enough-5667</link>
      <guid>https://dev.to/letuhao/ai-driven-data-architecture-part-1-why-prompts-arent-enough-5667</guid>
      <description>&lt;h2&gt;
  
  
  AI-Driven Data Architecture, Part 1: Why Prompts Are Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;What AI-driven data architecture means to me, and how I learned it the hard way&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next:&lt;/strong&gt; &lt;strong&gt;Part 2 — The Blueprint&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What you'll take away
&lt;/h2&gt;

&lt;p&gt;If you've moved past the chat-demo stage, you may have hit the same wall I did: the model forgets what it said three sessions ago, retrieved context feels random, translated terms drift, and nobody can answer &lt;em&gt;"where did this fact come from?"&lt;/em&gt; without reading git history and hoping.&lt;/p&gt;

&lt;p&gt;This two-part series is for builders wrestling with that same wall. It isn't a standard or a prompt cookbook — it's &lt;strong&gt;the model I arrived at from one build&lt;/strong&gt;, written down so you can borrow it, adapt it, or tell me where it breaks.&lt;/p&gt;

&lt;p&gt;By the end of Part 1 you will have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A working definition of &lt;strong&gt;AI-driven data architecture&lt;/strong&gt; as I use the term (and how it differs from "LLM + database")&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;eight-layer lens&lt;/strong&gt; you can try mapping onto your own product domain&lt;/li&gt;
&lt;li&gt;An honest account of &lt;strong&gt;why my "two weeks to ship" estimate was a trap&lt;/strong&gt; — from a real project, not theory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Part 2 turns the lens into &lt;strong&gt;patterns&lt;/strong&gt;: layered SSOT, the generate→extract→retrieve flywheel, retrieval as engineering, and a maturity rubric for locating yourself when you're "half done" (spoiler: that's normal).&lt;/p&gt;

&lt;p&gt;I've only validated these patterns in one domain (fiction). The same shape looks familiar wherever AI has to stay grounded in evolving source material:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Support tickets&lt;/strong&gt; — raw threads → extracted intents → approved macros → agent replies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal review&lt;/strong&gt; — contracts → extracted obligations → human-approved clause library → drafting assist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internal wikis&lt;/strong&gt; — docs → extracted entities → curated glossary → search-backed chat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But outside fiction those remain hypotheses, not shipped results. Creative writing is just where the continuity problems hurt most visibly.&lt;/p&gt;

&lt;p&gt;I'll occasionally reference a multilingual novel-workflow platform I've been building (&lt;a href="https://github.com/letuhao/lore-weave" rel="noopener noreferrer"&gt;LoreWeave&lt;/a&gt;) where a pattern showed up in production. The blog stands alone without it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The illusion: prompt + context = product?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24cpdd1acfac8y6ddjla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24cpdd1acfac8y6ddjla.png" alt="Diagram 1: Pattern 1 - Layered SSOT &amp;amp; Promotion Flow | Objective: Visualize the most critical separation: Authored Data vs. Machine-Extracted Data. | Visual Content: A deeper technical diagram showing the internal structure of Postgres alongside Postgres and Neo4j; clearly display two separate schemas or table sets: ExtractedState vs. AuthoredLore; show the " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most seductive plan in AI product development — the one I believed — looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Collect user content (documents, tickets, chapters, contracts).&lt;/li&gt;
&lt;li&gt;Stuff the relevant slice into a prompt.&lt;/li&gt;
&lt;li&gt;Call the model.&lt;/li&gt;
&lt;li&gt;Ship.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I wrote that plan on a napkin. Estimated timeline: &lt;strong&gt;two weeks&lt;/strong&gt;. The product would help authors write and translate fiction with LLM assistance — chat, maybe batch translation, done.&lt;/p&gt;

&lt;p&gt;Demos reinforced the fantasy. A single book, lore pasted into the system prompt, a friendly UI — it &lt;em&gt;worked&lt;/em&gt;. Stakeholders clapped. I clapped. Then I tried to live in the system.&lt;/p&gt;

&lt;p&gt;Continuity broke first. A character's honorific changed in chapter twelve because the model had no durable memory of chapter three. Translation wasn't string replacement: the same proper noun had three acceptable renderings across languages, and the model picked whichever sounded fluent that hour. When I asked &lt;em&gt;"did the author write this, or did extraction infer it?"&lt;/em&gt; my own codebase shrugged. Context windows didn't save me — replaying fifty messages every turn doesn't scale in cost, latency, or coherence.&lt;/p&gt;

&lt;p&gt;None of these failures were prompt-engineering problems in the narrow sense. They were &lt;strong&gt;data architecture problems wearing prompt-engineering costumes&lt;/strong&gt; — at least, that's the framing that finally unblocked me.&lt;/p&gt;

&lt;p&gt;That distinction is the subject of this series.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I mean by "AI-driven data architecture"
&lt;/h2&gt;

&lt;p&gt;I use &lt;strong&gt;AI-driven data architecture&lt;/strong&gt; to mean the set of structures and pipelines that turn raw inputs into &lt;strong&gt;grounded, traceable, reusable knowledge&lt;/strong&gt; that AI features consume — with explicit ownership, measurement, and improvement loops.&lt;/p&gt;

&lt;p&gt;In my usage it is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A vector database relabeled "RAG"&lt;/li&gt;
&lt;li&gt;A single Postgres schema with an &lt;code&gt;embeddings&lt;/code&gt; column&lt;/li&gt;
&lt;li&gt;A folder of JSON files the prompt loader reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It &lt;strong&gt;is&lt;/strong&gt; a commitment that the system's job is to &lt;strong&gt;prepare, own, and serve context&lt;/strong&gt; — and that the LLM is one consumer among many (chat, batch jobs, agents, translation pipelines), not the center of gravity. That commitment is the one I kept failing to make early on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two mindsets — mine, before and after
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvaypi4emxau25eelz8d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvaypi4emxau25eelz8d.png" alt="Diagram 3: Pattern 3 - Hybrid Retrieval with Multi-Model Grounding | Objective: Visualize the most complex retrieval flow, summarizing all the discussed patterns. | Visual Content: Shows the processing flow for ASK_AI_QUESTION; the user's question enters and splits into two " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is my own before/after, not a scorecard for anyone else's work:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Where I started&lt;/th&gt;
&lt;th&gt;Where the hard parts pushed me&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prompt engineering is the core skill&lt;/td&gt;
&lt;td&gt;Data contracts and SSOT boundaries are the core skill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One database&lt;/td&gt;
&lt;td&gt;Layered stores: raw, authored, extracted, derived&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAG = embed + search&lt;/td&gt;
&lt;td&gt;Retrieval is engineered, benchmarked, degrades gracefully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ship features&lt;/td&gt;
&lt;td&gt;Ship &lt;strong&gt;vertical slices&lt;/strong&gt; through the full stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model upgrade fixes quality&lt;/td&gt;
&lt;td&gt;Flywheel: generate → measure → correct → re-ingest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The shift was subtle and, for me, slow: I stopped asking &lt;em&gt;"what should the prompt say?"&lt;/em&gt; and started asking &lt;em&gt;"who owns this fact, how did it get here, and how do we know retrieval worked?"&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  An eight-layer lens
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb7ntja5l9q6lg9zsy9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmb7ntja5l9q6lg9zsy9r.png" alt="Diagram 2: Pattern 2 - The Flywheel / Pipeline of Extracted Knowledge (Knowledge Data Lifecycle) | Objective: Visualize the asynchronous and self-improving nature of this architecture. | Visual Content: A lifecycle diagram instead of a comparison; isolates a Vertical Slice starting from a BOOK_CHAPTER.SAVED event; Flow: Book Service $\rightarrow$ (Message Bus) $\rightarrow$ Knowledge Extraction Worker $\rightarrow$ (LLM Call: Extract + Provenance) $\rightarrow$ (Neo4j Update: Structure &amp;amp; Vector) $\rightarrow$ (Postgres Update: Extracted State); shows asynchronous operations using clock or queue icons. | Impact: Clearly demonstrates how your system addresses the cost/latency pain points by removing the AI entity extraction from the main processing flow." width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of these as the &lt;strong&gt;questions an AI-native architecture has to answer&lt;/strong&gt; sooner or later — not org-chart boxes. They're the ones I wish I'd asked on day one.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Question it answers&lt;/th&gt;
&lt;th&gt;If you skip it…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ingest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Where does raw truth live?&lt;/td&gt;
&lt;td&gt;No ground truth; everything is prompt fiction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;What structured facts exist in the source?&lt;/td&gt;
&lt;td&gt;Lore lives only in prompts; re-extraction is manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Store (SSOT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Who owns each class of fact?&lt;/td&gt;
&lt;td&gt;Silent corruption; merges delete the wrong rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Index / retrieve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How do you find the &lt;em&gt;right&lt;/em&gt; passage?&lt;/td&gt;
&lt;td&gt;"We have RAG" but answers feel unrelated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Synthesize&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Translation, summaries, co-writing, reports&lt;/td&gt;
&lt;td&gt;One-off generations that never feed back&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How do you know retrieval and generation work?&lt;/td&gt;
&lt;td&gt;"Live smoke passed" becomes your only metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consume&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chat, agents, pipelines calling the model&lt;/td&gt;
&lt;td&gt;Token-wasteful mega-prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Feedback → better configs, data, models&lt;/td&gt;
&lt;td&gt;Static slop forever&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The insight that cost me the most:&lt;/strong&gt; this is not one database. It behaves more like a &lt;strong&gt;pipeline culture&lt;/strong&gt;. Layers can share physical stores, but &lt;strong&gt;logical ownership&lt;/strong&gt; has to stay explicit. Collapsing "author wrote it" and "model inferred it" into one table without a promote/quarantine story is how I started losing trust in my own data.&lt;/p&gt;

&lt;p&gt;You don't need eight microservices on day one — I don't have eight. You need &lt;strong&gt;eight answered questions&lt;/strong&gt;. A monolith that respects SSOT boundaries is, in my experience, far healthier than twelve services that all read each other's tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSOT in one sentence
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SSOT&lt;/strong&gt; (single source of truth) means: for every fact type, exactly one layer &lt;strong&gt;owns writes&lt;/strong&gt;; everyone else &lt;strong&gt;reads via contract&lt;/strong&gt; (API, event, projection) — never by reaching into another service's tables.&lt;/p&gt;




&lt;h2&gt;
  
  
  A stopping point I recognize (because I stopped there too)
&lt;/h2&gt;

&lt;p&gt;During this build, I read many open-source AI projects and observed a number of creative AI tools from the outside. A pattern kept recurring: a story bible or codex UI (characters, places, rules) paired with drafting or continuation capabilities.&lt;/p&gt;

&lt;p&gt;It reminded me strongly of where my own system once was — rich consumption experiences built on top of a relatively thin knowledge foundation. In hindsight, that stage corresponds roughly to layers 1 and 7 in the model above, with much of the middle still handled manually.&lt;/p&gt;

&lt;p&gt;I'm not presenting this as a critique of those systems. Research prototypes and early products often stop there for perfectly valid reasons. I only mention it because I stopped there too, and many of the problems that pushed me toward a deeper data architecture emerged from that point onward.&lt;/p&gt;

&lt;p&gt;Here's what I had to add once continuity, provenance, and multilingual consistency stopped being nice-to-haves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic extraction&lt;/strong&gt; from real manuscripts or corpora at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Split ownership&lt;/strong&gt; between human-authored canon and machine-extracted candidates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval I could measure&lt;/strong&gt; (not "we embedded chunks")&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;closed loop&lt;/strong&gt; where new writing updates structured knowledge without me copy-pasting summaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Research prototypes often show a different archetype — impressive &lt;strong&gt;multi-agent orchestration&lt;/strong&gt; over a thin data foundation. That's usually the &lt;em&gt;right&lt;/em&gt; trade-off for research: a paper isolates and proves one new capability; it isn't trying to own a knowledge graph in production a year later. In fact the academic work on retrieval and graph-grounded generation is where I borrowed most of these ideas — patterns in Part 2 echo published systems like GraphRAG and HippoRAG. I'm field-testing a field's work, not inventing in a vacuum.&lt;/p&gt;

&lt;p&gt;So none of this is a failing on anyone's part. It's an &lt;strong&gt;architecture stopping point&lt;/strong&gt; that feels shippable — it felt shippable to &lt;em&gt;me&lt;/em&gt; — right up until those requirements arrive. The honest version of the lesson, in my own case: at first &lt;strong&gt;AI was the UI, not the system.&lt;/strong&gt; Turning it into infrastructure was the part I underestimated.&lt;/p&gt;




&lt;h2&gt;
  
  
  Seven lessons from one build
&lt;/h2&gt;

&lt;p&gt;Field notes, not laws — but the ones that cost me the most to learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prompting is consumption, not foundation.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Prompts assemble context at call time. They don't replace ingest, SSOT, or extraction. Treat prompt templates as &lt;strong&gt;views&lt;/strong&gt; over owned data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. SSOT boundaries beat model choice.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When human-curated glossary terms and machine-extracted entities lived in the same mental bucket, we got subtle corruption — merges that looked fine in UI tests but violated "no silent data loss" in production. Split &lt;strong&gt;authored&lt;/strong&gt; vs &lt;strong&gt;extracted&lt;/strong&gt; knowledge early; define a promote path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Derived stores must be rebuildable.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Graph and vector indexes are &lt;strong&gt;projections&lt;/strong&gt;. If you can't re-derive them from extraction state + raw content, you've created a second source of truth by accident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Measurement is a layer, not a phase.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We shipped hybrid search that "worked" in manual testing. A retrieval eval harness (golden queries, recall, NDCG) found a recall bug integration tests missed — wide terms clustered into few chapters because SQL returned a flat row limit. Numbers hurt; they also saved weeks of guessing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Events before intelligence.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Reliable change notification (outbox, streams, queues) precedes "smart" features. Extraction triggered by saves beats nightly cron once users expect freshness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Agents come after data contracts.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Tool-calling agents need &lt;strong&gt;owned, scoped data&lt;/strong&gt; exposed as tools — not 40k tokens of JSON in the system prompt. Agent architecture is consumption-layer design; it assumes the layers below exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Fifty to seventy percent foundation is normal.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
As a system grows past the demo, you'll ship vertical slices (search works end-to-end! translation works!) while horizontal layers (eval flywheel, agent tooling, full synthesis loop) mature in parallel. Half-built foundation isn't failure — &lt;strong&gt;undisciplined half-building&lt;/strong&gt; is. The rubric in Part 2 helps distinguish the two.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on RAG
&lt;/h3&gt;

&lt;p&gt;Retrieval-augmented generation is a &lt;strong&gt;consumption technique&lt;/strong&gt; (layer 7 calling layer 4), not a foundation. If your "RAG architecture" is embed-chunk-search with no SSOT story, no eval, and no path from new content back into indexes, you have a feature — not the architecture I'm describing. That was fine while I was prototyping; it got fragile for me exactly when continuity and provenance became requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Part 2 — The Blueprint&lt;/strong&gt; walks through four patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layered SSOT&lt;/strong&gt; — content, authored, extracted, derived
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The generate → extract → retrieve flywheel&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retrieval as engineering&lt;/strong&gt; — hybrid search, eval gates, graceful degradation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumption layers&lt;/strong&gt; — chat, pipelines, agents
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It closes with a &lt;strong&gt;maturity rubric&lt;/strong&gt; so you can locate where your foundation actually is — and a short case study of &lt;a href="https://github.com/letuhao/lore-weave" rel="noopener noreferrer"&gt;LoreWeave&lt;/a&gt; at roughly fifty-five to sixty-five percent on that rubric, offered as one worked example, not proof the model is universal.&lt;/p&gt;

&lt;p&gt;The monster I underestimated wasn't the LLM. It was the &lt;strong&gt;data system the LLM assumes already exists&lt;/strong&gt;. Part 2 is the map I drew for myself.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datastructures</category>
      <category>architecture</category>
      <category>rag</category>
    </item>
    <item>
      <title>Dead Light Framework · Part 3 — Two Markdown Files Won't Save You Forever — A 3-Minute Test for Whether Your AI-Agent Project Needs More Than HANDOFF + LOG</title>
      <dc:creator>Lê Tú Hào</dc:creator>
      <pubDate>Thu, 04 Jun 2026 07:23:06 +0000</pubDate>
      <link>https://dev.to/letuhao/dead-light-framework-part-3-two-markdown-files-wont-save-you-forever-a-3-minute-test-for-4nfc</link>
      <guid>https://dev.to/letuhao/dead-light-framework-part-3-two-markdown-files-wont-save-you-forever-a-3-minute-test-for-4nfc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Dead Light Framework · Part 3 — a 3-minute test for how much structure your AI-agent project actually needs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three questions to find the smallest setup that fits — a plain &lt;code&gt;README&lt;/code&gt;, two files, multi-unit paperwork, or a running service — so you stop over-building (the common mistake) and catch the moment two files genuinely aren't enough. Copy-paste card below; theory skippable.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dead Light Framework — an ongoing series · you're on Part 3.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8"&gt;The Emperor Is All But Dead&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84"&gt;Every Session Starts in Darkness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two Markdown Files Won't Save You Forever&lt;/strong&gt; ← you are here&lt;/li&gt;
&lt;li&gt;Inherit, Don't Invent&lt;/li&gt;
&lt;li&gt;Try to Break Your Own Framework&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Next → three older disciplines that already solved this — patterns you can apply to HANDOFF and LOG today.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By a developer running AI agents as daily teammates — a peer, not an authority (&lt;a href="https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8"&gt;full framing in #1&lt;/a&gt;). · &lt;strong&gt;~7 min&lt;/strong&gt; · &lt;a href="https://github.com/letuhao/dead-light-framework" rel="noopener noreferrer"&gt;the Dead Light Framework repository (MIT)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New here? — 30-second catch-up.&lt;/strong&gt; &lt;em&gt;(Following the series? Skip ahead.)&lt;/em&gt; Dead Light is an experimental way to run projects where some of your teammates are &lt;strong&gt;AI agents that start every session with no memory&lt;/strong&gt; — they reset to zero, human decisions drift, and the only durable thing is what you wrote down. The minimum kit (&lt;a href="https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84"&gt;#2&lt;/a&gt;): two files at the repo root — a &lt;code&gt;HANDOFF.md&lt;/code&gt; (the current-state snapshot a fresh session reads first) and an append-only &lt;code&gt;LOG.md&lt;/code&gt; (the history it's derived from). This post is the test for when those two files &lt;strong&gt;stop being enough&lt;/strong&gt; — and which tier your project needs: a plain &lt;code&gt;README&lt;/code&gt;, the two files, multi-unit paperwork, or an actual running service.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The decision you keep dodging
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84"&gt;Post #2&lt;/a&gt; closed on a promise: the two-file setup is enough for &lt;em&gt;one repo, one session at a time&lt;/em&gt;, and the moment you cross that line, it isn't. This post is the line.&lt;/p&gt;

&lt;p&gt;If you ran the setup from #2, you already know the shape of the problem: it works beautifully — until a Tuesday when two agents pick up the same task in parallel and trample each other's HANDOFF; or a Friday when your codebase hits a size where one shared &lt;code&gt;LOG.md&lt;/code&gt; is a wall of context an agent can't read; or the week you start a &lt;em&gt;second&lt;/em&gt; service and suddenly "the project" is two things, not one. Most teams answer "do we need more than two files now?" by gut. The litmus below is cleaner.&lt;/p&gt;

&lt;p&gt;The aim isn't to push you up the tiers — it's the opposite. &lt;strong&gt;Over-building is the more common failure&lt;/strong&gt;: solo developers running one agent on a 4-KLOC tool, setting up multi-unit paperwork they don't need. Pick the smallest tier that fits, and only upgrade when a real signal forces it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 3-question test (≈ 3 min)
&lt;/h2&gt;

&lt;p&gt;Answer &lt;strong&gt;Q1 → Q2 → Q3&lt;/strong&gt; in order. As soon as one gives you a tier, you can stop — that's the tier, the rest of the questions only narrow further. Q4 below is a one-time forward-look; run it after.&lt;/p&gt;

&lt;h3&gt;
  
  
  Q1 — Do you need &lt;em&gt;real-time&lt;/em&gt; integrity?
&lt;/h3&gt;

&lt;p&gt;Answer &lt;strong&gt;yes&lt;/strong&gt; if &lt;strong&gt;any&lt;/strong&gt; of these holds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Two or more agents can write to the &lt;strong&gt;same artifact at the same instant&lt;/strong&gt; (parallel sessions on shared state).&lt;/li&gt;
&lt;li&gt;An invariant &lt;strong&gt;must hold every instant&lt;/strong&gt;, with zero "eventually" tolerance — a financial balance, a lock on a shared resource, a real-time scheduler.&lt;/li&gt;
&lt;li&gt;You need &lt;strong&gt;transactions&lt;/strong&gt; — multi-step changes that must all-succeed-or-all-fail across shared state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Yes → Runtime tier.&lt;/strong&gt; Markdown files cannot deliver this; it isn't a discipline gap, it's a structural one (the &lt;em&gt;why&lt;/em&gt; is in the aside below). You need a running service — transactions, locks, the machinery databases have had for decades. The framework's runtime tier is the subject of a later post; for now, the actionable answer is: &lt;strong&gt;don't try to do this with &lt;code&gt;.md&lt;/code&gt; files.&lt;/strong&gt; That's your answer for today — Q2 and Q3 only matter once Q1 is &lt;em&gt;no&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No on all three → continue to Q2.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Q2 — Are you running more than one &lt;em&gt;governance unit&lt;/em&gt;?
&lt;/h3&gt;

&lt;p&gt;A "governance unit" is a thing with its own decision rights: a service that ships independently, a sub-product, a team that owns its own roadmap. Answer &lt;strong&gt;yes&lt;/strong&gt; if &lt;strong&gt;any&lt;/strong&gt; of these holds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The project contains &lt;strong&gt;two or more services / sub-products&lt;/strong&gt; that ship independently and own different decisions.&lt;/li&gt;
&lt;li&gt;You have &lt;strong&gt;multiple repositories&lt;/strong&gt; that need to coordinate.&lt;/li&gt;
&lt;li&gt;Different agents own &lt;strong&gt;different sub-areas&lt;/strong&gt; with their own decision rights, and a change in one isn't automatically a change in another.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Yes → M2 — multi-unit paperwork.&lt;/strong&gt; One &lt;code&gt;HANDOFF.md&lt;/code&gt; + &lt;code&gt;LOG.md&lt;/code&gt; per unit, in a sub-folder; a shared &lt;strong&gt;Imperial tier&lt;/strong&gt; at the repo root for cross-unit sealed decisions. Layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;repo-root&amp;gt;/                       ← Imperial tier (shared, read by every unit)
  codex.md  (+ cross-unit sealed docs)
  imperial/LOG.md                  ← cross-unit decisions go here
  service-a/                       ← unit A
    HANDOFF.md  LOG.md  &amp;lt;artifacts&amp;gt;
  service-b/                       ← unit B (sibling of A; not under A)
    HANDOFF.md  LOG.md  &amp;lt;artifacts&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sibling units don't read each other's logs — they only read &lt;strong&gt;their own&lt;/strong&gt; plus the &lt;strong&gt;Imperial tier&lt;/strong&gt; ancestor chain. That's how you keep per-unit churn out of other units' context windows. Full rules: &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/framework/paperwork-standard.md" rel="noopener noreferrer"&gt;Paperwork Standard §4&lt;/a&gt;. You don't need Q3; the unit structure subsumes it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No (one team, one product, one decision-owner) → continue to Q3.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Q3 — How big is the codebase?
&lt;/h3&gt;

&lt;p&gt;Measure with &lt;a href="https://github.com/AlDanial/cloc" rel="noopener noreferrer"&gt;&lt;code&gt;cloc&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/boyter/scc" rel="noopener noreferrer"&gt;&lt;code&gt;scc&lt;/code&gt;&lt;/a&gt; — logical lines, all languages. The bands borrow COCOMO 81's order-of-magnitude convention; treat them as a heuristic, not a derived cutoff.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;LOC&lt;/th&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Set up&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;lt; 10 KLOC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;M0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A &lt;code&gt;README.md&lt;/code&gt; is enough. &lt;strong&gt;Don't build the two-file setup yet.&lt;/strong&gt; Re-check when you cross ~10 KLOC or hire a second person/agent.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10 – 50 KLOC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;M1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The two-file setup from &lt;a href="https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84"&gt;#2&lt;/a&gt; — a &lt;code&gt;HANDOFF.md&lt;/code&gt; snapshot + an append-only &lt;code&gt;LOG.md&lt;/code&gt; at repo root, plus four rules for who reads/writes what and when.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&amp;gt; 50 KLOC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;M2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even with a single team. The cross-time complexity is enough that you want the unit-folder layout from Q2 — start with one unit folder; the structure is ready when a second appears.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Q4 — Crossing a line in the next 3–6 months?
&lt;/h3&gt;

&lt;p&gt;This doesn't change &lt;em&gt;today's&lt;/em&gt; tier — it tells you what to architect for. Plan the upgrade &lt;strong&gt;now&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;M0 → M1&lt;/strong&gt;: hiring a second contributor, adding a second agent, about to cross ~10 KLOC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M1 → M2&lt;/strong&gt;: spinning up a second service, splitting the codebase into independently shipping pieces, adding a second decision-owner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M1 or M2 → Runtime&lt;/strong&gt;: introducing a hard invariant (compliance, locks, real-time coordination), starting work that needs transactions, onboarding agents that will write in parallel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Emergency upgrades cost more than planned ones. Catching the trigger early is the entire point of Q4.&lt;/p&gt;




&lt;h2&gt;
  
  
  The decision card (copy this into your repo)
&lt;/h2&gt;

&lt;p&gt;Drop this into your &lt;code&gt;CLAUDE.md&lt;/code&gt; / &lt;code&gt;.cursorrules&lt;/code&gt; / &lt;code&gt;README.md&lt;/code&gt; so the test is on hand the next time someone asks "do we need more structure here?":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Governance-tier self-check&lt;/span&gt;

Answer in order; the first YES decides the tier — later questions only narrow further.

Q1 — Real-time integrity needed (≥ 2 agents writing the same artifact at the same instant,
     a "must-never-break" invariant, or transactions over shared state)?
     YES → Runtime tier (a running service; markdown can't do this).

Q2 — More than one governance unit (≥ 2 services / sub-products / decision-owners,
     or multi-repo coordination)?
     YES → M2: per-unit folder with HANDOFF.md + LOG.md, plus a shared Imperial tier
            at the repo root.

Q3 — Codebase size (cloc / scc, logical lines, all languages)?
     &amp;lt; 10 KLOC  → M0: a README.md is enough.
     10–50 KLOC → M1: the two-file HANDOFF + LOG setup.
&lt;span class="gt"&gt;     &amp;gt; 50 KLOC  → M2: unit-folder layout even single-team.&lt;/span&gt;

Q4 — Will any of Q1/Q2/Q3 cross a line in the next 3–6 months? Plan the upgrade now.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;The full card, with upgrade triggers and per-tier folder layouts:&lt;/em&gt; &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/distribution/tier-decision-card.md" rel="noopener noreferrer"&gt;&lt;code&gt;tier-decision-card.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you actually get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stop over-building.&lt;/strong&gt; Most solo-plus-agents projects are honestly M1 — the two files from #2. Knowing that &lt;em&gt;is&lt;/em&gt; the win; you don't add multi-unit paperwork "just in case."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop under-building.&lt;/strong&gt; When two agents start colliding, or a second service spins up, the card flags it before the collisions become incidents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A defensible answer to "should we add more structure?"&lt;/strong&gt; &lt;em&gt;"We ran the card; we're M1; the trigger to move is X."&lt;/em&gt; That's a sentence, not an argument.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honest cost: this is a heuristic, not a theorem. The LOC bands are borrowed COCOMO-81 conventions — useful as a starting point, &lt;strong&gt;calibrate to your context&lt;/strong&gt; (a 30-KLOC mobile app and a 30-KLOC research notebook do not have the same coordination need). Q1 is the one question with a hard wall behind it; Q2 and Q3 are judgment calls the card just makes explicit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this works (the 30-second aside)
&lt;/h2&gt;

&lt;p&gt;There is a real, provable ceiling under all of this. Coordinating actors who can't talk in real time — past sessions and current ones, agents in separate processes, services across a network — runs into the &lt;strong&gt;CAP theorem&lt;/strong&gt; (Gilbert &amp;amp; Lynch 2002): when parts of your system can't reach each other (a "partition"), you can have &lt;strong&gt;Consistency&lt;/strong&gt; or &lt;strong&gt;Availability&lt;/strong&gt;, but not both. Documents are by construction &lt;strong&gt;available + eventually consistent&lt;/strong&gt;: a fresh session reads what's on disk and works &lt;em&gt;now&lt;/em&gt;, it cannot block until the previous session "confirms," so it has already given up strong consistency. That is the wall behind Q1: paperwork &lt;strong&gt;cannot&lt;/strong&gt; promise "two writers will never disagree, even for a second" — not because you're doing it wrong, because the medium can't. A running service can, by paying the cost of being unavailable during a partition. Q1's answers are which side of that wall you're on. Full citations and the bounded claim: &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/framework/paperwork-standard.md" rel="noopener noreferrer"&gt;Paperwork Standard §1.2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The COCOMO-anchored size bands in Q3 are a borrowed &lt;em&gt;convention&lt;/em&gt;, not a derived cutoff — Boehm's 1981 modes predict effort, not documentation need. The framework's &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/framework/paperwork-standard.md" rel="noopener noreferrer"&gt;Paperwork Standard §2&lt;/a&gt; is explicit about that ("borrowed order-of-magnitude convention, owner-calibratable"); treat the numbers accordingly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The story below the setup &lt;em&gt;(optional — skip if you came for the card)&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;The card above is the entire useful product of this post. If you want the &lt;em&gt;why behind the why&lt;/em&gt; — the joint at which "documentation" stops being the right word — here it is.&lt;/p&gt;

&lt;h3&gt;
  
  
  The turn I didn't want to take
&lt;/h3&gt;

&lt;p&gt;Through late 2024 and into 2025 I kept treating my AI-agent problem as a &lt;strong&gt;documentation&lt;/strong&gt; problem. Write a better &lt;code&gt;HANDOFF.md&lt;/code&gt;. Tag candidates. Mark sealed decisions. The patterns from &lt;a href="https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84"&gt;#2&lt;/a&gt; worked, and the overhead kept climbing, and a voice in the back of my head kept saying: &lt;em&gt;you're carving this at the wrong joint.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So one evening I tried to state the problem in the most neutral words I could, with no mention of "documents":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I have participants who &lt;strong&gt;start cold&lt;/strong&gt;, &lt;strong&gt;run briefly&lt;/strong&gt;, and &lt;strong&gt;cannot talk to each other in real time&lt;/strong&gt;. They have to act coherently anyway.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Read that back without the AI-agent context and tell me it doesn't sound familiar. It should. It's not a documentation problem. It's a &lt;strong&gt;coordination&lt;/strong&gt; problem — and a very specific, very old one.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the problem actually is
&lt;/h3&gt;

&lt;p&gt;Strip my "team" to the bones. It's a set of actors that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;reset to zero&lt;/strong&gt; — each session is a fresh process with no memory of the last;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;live for one task&lt;/strong&gt;, then disband;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;never overlap in a conversation&lt;/strong&gt; — by the time a session could "reply," it no longer exists, and the human is asleep or in three other meetings.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes coordination hard here is not intelligence and not prompting. It's that &lt;strong&gt;there is no real-time channel between the actors.&lt;/strong&gt; A message I leave can only be read &lt;em&gt;later&lt;/em&gt;, by someone who wasn't there when I wrote it. Coordination doesn't happen in a conversation; it happens &lt;em&gt;across time&lt;/em&gt;, through whatever durable thing survives between sessions.&lt;/p&gt;

&lt;p&gt;If that smells like distributed systems to you — congratulations, you got there faster than I did. Coordinating processes that fail, restart, and can't reliably talk in real time is the founding problem of that field. People have been proving theorems about it since the 1970s. I'd been re-deriving a worse version of it by hand, in markdown.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Imperium was the tell
&lt;/h3&gt;

&lt;p&gt;This is where the gothic paint on the project stops being a joke.&lt;/p&gt;

&lt;p&gt;The framework is named after &lt;em&gt;Warhammer 40,000&lt;/em&gt;, and the central image is the &lt;strong&gt;Astronomican&lt;/strong&gt; — a beacon of psychic light. In the fiction, humanity's empire spans a galaxy. Its ships travel through the &lt;strong&gt;warp&lt;/strong&gt;, a parallel dimension that does not carry real-time signals; a fleet that enters the warp is, for the duration, &lt;em&gt;unreachable&lt;/em&gt;. There is no live channel across that distance. So how do you run an empire whose parts cannot phone each other?&lt;/p&gt;

&lt;p&gt;The fiction's answer is uncomfortably close to the engineering one. The Imperium runs on three things: &lt;strong&gt;frozen edicts&lt;/strong&gt; — decisions made once and not up for renegotiation by whoever's nearest; a &lt;strong&gt;paperwork priesthood&lt;/strong&gt;, the &lt;strong&gt;Adeptus Administratum&lt;/strong&gt;, which is quite literally galactic records-keeping; and the &lt;strong&gt;Astronomican&lt;/strong&gt;, a beacon a ship lost in the dark steers by. Frozen authority. Durable records. A signal that survives.&lt;/p&gt;

&lt;p&gt;That is the whole design, in fancy dress. The darkness in this series' title is the warp between my sessions. The "document that survives" is the Astronomican. The names were never decoration — they're the closest myth I know to the actual shape of the problem: &lt;strong&gt;coordinating actors who can't talk live, who steer by whatever frozen light reaches them.&lt;/strong&gt; The card above is the engineering version. The lore is the easier-to-remember version.&lt;/p&gt;

&lt;h3&gt;
  
  
  The wall behind Q1
&lt;/h3&gt;

&lt;p&gt;The 30-second aside up top gave you the headline: CAP forces an Availability-or-Consistency choice during a partition, and documents have already chosen Availability — a fresh session reads what's on disk and gets to work &lt;em&gt;now&lt;/em&gt;, it cannot block until a previous session "confirms." So the best a pile of markdown can offer is &lt;strong&gt;eventual consistency&lt;/strong&gt;: everyone converges on the same picture &lt;em&gt;eventually&lt;/em&gt;, once they've all read the same writing — never instantly, never guaranteed at the moment you act.&lt;/p&gt;

&lt;p&gt;That ceiling is not about my competence or yours. No amount of better markdown buys you a guarantee that two sessions acting on the same artifact won't step on each other in the window before they sync. Documents detect and reconcile &lt;em&gt;after the fact&lt;/em&gt;; they cannot &lt;strong&gt;prevent&lt;/strong&gt; in the moment. (A sibling result, &lt;strong&gt;FLP&lt;/strong&gt; — Fischer, Lynch &amp;amp; Paterson, 1985 — says you can't even guarantee a group of async processes will &lt;em&gt;agree&lt;/em&gt; in bounded time. The framework's answer to that one is a design choice, not a theorem: route every binding decision through a human who acts as the single point that breaks the tie. More on that in a later post.)&lt;/p&gt;

&lt;p&gt;I want to be careful here, because it's easy to oversell a theorem. CAP is a lens that &lt;em&gt;fit&lt;/em&gt; my problem startlingly well; it is not something I proved about markdown files. The honest claim is narrow: &lt;strong&gt;a coordination layer with no real-time channel is, structurally, an available-but-eventually-consistent one, and that caps what it can promise.&lt;/strong&gt; That's the wall behind Q1. The interesting question is what you build once you stop pretending it isn't there — which is the card above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inherit, don't invent
&lt;/h3&gt;

&lt;p&gt;I didn't invent any of this. CAP, FLP, eventual consistency, the entire vocabulary of coordinating unreliable actors — it was all sitting in a field I'd been adjacent to for years and never properly raided. The next post is the raid: four older disciplines I borrowed from instead of inventing — &lt;em&gt;Mission Command&lt;/em&gt; (Auftragstaktik), &lt;strong&gt;CMMI&lt;/strong&gt;, &lt;em&gt;Delay-Tolerant Networking&lt;/em&gt;, and pre-telegraph imperial governance. Each one had already solved a piece of this. The honest verb is &lt;strong&gt;inherit&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And the standing caveat from #2 still holds and always will: this is one practitioner following one thread against essentially one serious case study. The theory is solid because it's borrowed; the &lt;em&gt;application&lt;/em&gt; of it is a smoke test, not evidence. If the CAP framing is a stretch, that's exactly the kind of thing I want pointed out — I had an independent pass try to tear these borrowed citations apart, and walking through that is what a later post is for.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;New here? I'm a developer who runs AI agents daily — a peer, not an authority; full framing in &lt;a href="https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8"&gt;#1&lt;/a&gt;. Standing caveat: one developer, essentially one case study — useful, not proven. Tell me where the card fails for you.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Light is the only thing that crosses the warp" is Warhammer-flavoured naming, nothing more. Independent practitioner exploration; no affiliation with Games Workshop. Repository MIT-licensed.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;code&gt;#DeadLightFramework #AIAgents #AIProductivity #SoftwareArchitecture #DistributedSystems #CAPTheorem #AIAgentGovernance #HumanAICollaboration #PromptEngineering #DevTools&lt;/code&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>productivity</category>
    </item>
    <item>
      <title>How I Shipped 2,500+ Commits With AI Agents Using a 12-Phase Workflow</title>
      <dc:creator>Lê Tú Hào</dc:creator>
      <pubDate>Mon, 25 May 2026 15:19:30 +0000</pubDate>
      <link>https://dev.to/letuhao/how-i-shipped-2500-commits-with-ai-agents-using-a-12-phase-workflow-4ap4</link>
      <guid>https://dev.to/letuhao/how-i-shipped-2500-commits-with-ai-agents-using-a-12-phase-workflow-4ap4</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj6vodrg1975ehynx29t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj6vodrg1975ehynx29t.png" alt="The image is a digital illustration depicting a stressed or exhausted developer working late at night in a dimly lit, high-tech workspace. The atmosphere is heavy with the classic " width="799" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The 12-Phase Workflow That Actually Made AI Coding Useful for Me
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A practitioner's account — not a tutorial, not a sales pitch.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Quick screen:&lt;/strong&gt; if you're writing throwaway scripts or solo prototypes, this workflow is overkill — skip to the Cons and Who This Is For sections first.&lt;/p&gt;




&lt;p&gt;I've been using a 12-phase workflow I've refined over time — across &lt;a href="https://github.com/letuhao/free-context-hub" rel="noopener noreferrer"&gt;free-context-hub&lt;/a&gt;, &lt;a href="https://github.com/letuhao/lore-weave" rel="noopener noreferrer"&gt;lore-weave&lt;/a&gt;, and a handful of private internal systems. Both public projects are built almost entirely by AI agents, with me acting as the gatekeeper — approving specs, reviewing diffs, unblocking decisions. Across all of them, the workflow has accumulated 2,500+ commits and a trail of written specs and audit logs I can still query months after the sessions that produced them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;free-context-hub&lt;/strong&gt; is a self-hosted persistent memory and semantic search layer for AI agents — MCP server, REST API, RAG pipelines, and a full Next.js review UI. 15 development phases delivered end-to-end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;lore-weave&lt;/strong&gt; is a cloud-hosted multi-agent platform for multilingual novel workflows: translation, knowledge graph construction, glossary management, and AI-assisted writing. 19 microservices across Go, Python, and TypeScript.&lt;/p&gt;

&lt;p&gt;I'm sharing the workflow because it's worked better than anything else I've tried, and because the honest trade-offs are worth knowing before you adopt it.&lt;/p&gt;

&lt;p&gt;The files are in the repository:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/letuhao/free-context-hub/blob/main/agentic-workflow/WORKFLOW.md" rel="noopener noreferrer"&gt;&lt;code&gt;WORKFLOW.md&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — standalone 12-phase template to copy into any project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/letuhao/free-context-hub/blob/main/agentic-workflow/CLAUDE.md.snippet" rel="noopener noreferrer"&gt;&lt;code&gt;CLAUDE.md.snippet&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — the live project spec with project-specific tooling and AMAW wiring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/letuhao/free-context-hub/blob/main/agentic-workflow/AMAW.md" rel="noopener noreferrer"&gt;&lt;code&gt;AMAW.md&lt;/code&gt;&lt;/a&gt;&lt;/strong&gt; — opt-in multi-agent extension spec&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Core Problem This Solves
&lt;/h2&gt;

&lt;p&gt;AI coding assistants are very good at generating plausible-looking code. They're much worse at:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Knowing when they're operating on stale assumptions&lt;/li&gt;
&lt;li&gt;Catching their own scope creep&lt;/li&gt;
&lt;li&gt;Connecting a code change to its downstream contract obligations&lt;/li&gt;
&lt;li&gt;Stopping themselves when a "small fix" turns into a refactor&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The standard advice is "just review the diff." But reviewing a diff without having tracked the &lt;em&gt;intent&lt;/em&gt; of the change is almost useless — you're comparing code to code, not code to requirements. The 12-phase workflow forces intent to be written down before the first line of code is written, which is what makes the diff review actually meaningful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Came From
&lt;/h2&gt;

&lt;p&gt;The workflow is an evolution of two ideas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt;&lt;/strong&gt; — a coding agent discipline framework that introduced TDD protocol, the evidence gate (run verification fresh before claiming success), and the debugging protocol (no fix without root cause). I absorbed these directly. If you haven't read Superpowers, it's worth your time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human-in-the-loop gatekeeping&lt;/strong&gt; — my own addition. The core insight: a human reading a short spec + a single diff catches dramatically more than a human reading code cold. The workflow structures every task to produce exactly those artifacts, at exactly the right moment.&lt;/p&gt;

&lt;p&gt;The combination took multiple iterations to stabilize. What's here is v2.2 (default mode) with an optional AMAW (Autonomous Multi-Agent Workflow) extension for high-stakes work.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 12 Phases
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase          │ Role (default v2.2)   │ What Happens
───────────────┼───────────────────────┼──────────────────────────────────────────
1. CLARIFY     │ Architect + Human     │ Read context, write spec, expose assumptions
2. DESIGN      │ Lead                  │ API contract / data flow → DESIGN.md
3. REVIEW      │ Adversarial self      │ Find gaps / contract holes in spec
4. PLAN        │ Lead + Developer      │ Decompose into 2–5 min tasks → PLAN.md
5. BUILD       │ Developer             │ TDD: red → green → refactor
6. VERIFY      │ Developer             │ Run tests fresh, capture exit code + output
7. REVIEW      │ Lead                  │ Code vs spec — find exactly 3 divergences
8. QC          │ Main session          │ Spec fingerprint vs implementation, AC coverage
9. POST-REVIEW │ Human checkpoint      │ Final gate — blocked on any unresolved issue
10. SESSION    │ Scribe                │ SESSION_PATCH.md + DEFERRED.md + AUDIT_LOG
11. COMMIT     │ Developer             │ Git commit
12. RETRO      │ All                   │ Record lessons + finalize audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The phases look heavy on paper. In practice, for an XS task (single file, one logic change, no side effects) you're allowed to skip CLARIFY and PLAN and go straight to BUILD — the workflow is explicit about this via a mandatory &lt;strong&gt;task size classification&lt;/strong&gt; step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task Size Classification: The Thing That Actually Prevents Drift
&lt;/h2&gt;

&lt;p&gt;Before any work starts, you count three things:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What you count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Files touched&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How many files will be created or modified?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Logic changes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;How many functions/handlers change &lt;em&gt;behavior&lt;/em&gt;? (not formatting)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Side effects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API contract, DB schema, config, external behavior, types used by other files?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Logic&lt;/th&gt;
&lt;th&gt;Side effects&lt;/th&gt;
&lt;th&gt;Allowed skips&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0–1&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;CLARIFY + PLAN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1–2&lt;/td&gt;
&lt;td&gt;2–3&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;PLAN only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3–5&lt;/td&gt;
&lt;td&gt;4+&lt;/td&gt;
&lt;td&gt;Maybe&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6+&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;XL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;You state the classification explicitly before work begins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task: Fix pagination off-by-one
Size: XS (1 file: src/api/routes/lessons.ts, 1 logic change: offset calc, 0 side effects)
Skipping: CLARIFY, PLAN → straight to BUILD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hard rule: &lt;strong&gt;if you haven't read the code yet, you don't know the size.&lt;/strong&gt; Agents routinely call things XS that turn out to be M or L once you look. The classification forces the read to happen before the label is applied.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Anti-Skip Rules (The Most Underrated Part)
&lt;/h2&gt;

&lt;p&gt;Every popular AI workflow has phases that agents skip "to save time." This workflow makes the skip patterns explicit and calls them violations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skip pattern&lt;/th&gt;
&lt;th&gt;Why agents do it&lt;/th&gt;
&lt;th&gt;Why it's forbidden&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Skip CLARIFY, jump to BUILD&lt;/td&gt;
&lt;td&gt;"Task seems obvious"&lt;/td&gt;
&lt;td&gt;Unexamined assumptions cause rework&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip PLAN, jump to BUILD&lt;/td&gt;
&lt;td&gt;"It's a small change"&lt;/td&gt;
&lt;td&gt;Small changes grow; no plan = no checkpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip VERIFY after BUILD&lt;/td&gt;
&lt;td&gt;"Tests passed earlier"&lt;/td&gt;
&lt;td&gt;Stale results are not evidence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip REVIEW after VERIFY&lt;/td&gt;
&lt;td&gt;"I wrote it, I know it's correct"&lt;/td&gt;
&lt;td&gt;Author blindness is real&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip POST-REVIEW&lt;/td&gt;
&lt;td&gt;"I reviewed in phase 7"&lt;/td&gt;
&lt;td&gt;Phase 7 is code review; POST-REVIEW is the final conservative gate — different scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skip SESSION before COMMIT&lt;/td&gt;
&lt;td&gt;"I'll update later"&lt;/td&gt;
&lt;td&gt;You won't. Context is lost.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Combine multiple phases&lt;/td&gt;
&lt;td&gt;"CLARIFY+DESIGN+PLAN in one go"&lt;/td&gt;
&lt;td&gt;Each phase boundary is a deliberate pause point; skipping it removes the checkpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Naming these patterns and treating them as violations changes the conversation. When the agent tries to jump phases, you have a handle to point at.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Evidence Gate (Absorbed from Superpowers)
&lt;/h2&gt;

&lt;p&gt;Phase 6 (VERIFY) has a 5-step gate that runs before any completion claim:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Identify&lt;/strong&gt; the verification command&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; it fresh — not from memory, not from cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read&lt;/strong&gt; complete output including exit codes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confirm&lt;/strong&gt; output matches the claim&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only then&lt;/strong&gt; state the result with evidence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Red flags — stop immediately if you catch yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using "should work", "probably passes", "seems fine"&lt;/li&gt;
&lt;li&gt;Feeling satisfied before running verification&lt;/li&gt;
&lt;li&gt;About to commit without a fresh test run&lt;/li&gt;
&lt;li&gt;Trusting prior output without re-running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This sounds obvious. It is not obvious when you're deep in a session and the previous test run was 20 minutes ago.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Human's Role: Gatekeeper, Not Reviewer
&lt;/h2&gt;

&lt;p&gt;In v2.2 (default mode), there are two mandatory human checkpoints:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;After CLARIFY&lt;/strong&gt; — human reads the spec and approves the scope before any design or code starts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After POST-REVIEW&lt;/strong&gt; — human reviews the AUDIT_LOG, the spec, and the diff before SESSION commits anything&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These are not optional. The whole model is that the human reads a short spec, not a long codebase. The AI builds the spec; the human approves it; the AI builds the code against the approved spec. The POST-REVIEW diff is then code-vs-approved-spec, which is a comparison a human can actually do.&lt;/p&gt;




&lt;h2&gt;
  
  
  AMAW: The Opt-In Multi-Agent Extension
&lt;/h2&gt;

&lt;p&gt;For high-stakes work — data migrations, new service boundaries, security-critical paths — there's an optional extension: &lt;strong&gt;AMAW (Autonomous Multi-Agent Workflow)&lt;/strong&gt;. In AMAW mode, cold-start sub-agents replace or augment the human review gates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adversary&lt;/strong&gt; — finds exactly 3 things that could go wrong. &lt;em&gt;Why 3? Enough to surface real issues, few enough to force prioritization rather than a laundry list.&lt;/em&gt; Never says what's good.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope Guard&lt;/strong&gt; — compares spec fingerprint vs implementation, checks AC coverage, issues CLEAR or BLOCKED&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scribe&lt;/strong&gt; — records decisions, writes session summaries, detects deferred items&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit Logger&lt;/strong&gt; — finalizes the audit trail at RETRO&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight is &lt;strong&gt;cold-start&lt;/strong&gt;: each agent is spawned fresh with only file access. It cannot inherit the main session's context rot or biases. It reads what's written; it can't be influenced by what was discussed in chat.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; AMAW removes the human from all review gates — including POST-REVIEW, which is held by the Scope Guard instead. At CLARIFY, rather than a human approving the spec, the Adversary challenges it at the next phase. In practice this means AMAW sessions can run with minimal human interaction, but they still require a human to kick off the task and review the final audit log. Pure fire-and-forget is not the design intent.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AMAW costs roughly $1–5 in sub-agent tokens and ~30 extra minutes per task. I use it for schema migrations and multi-system contracts. For everyday work, the human-in-loop default catches the same issues faster and cheaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Gets Recorded: The Audit Log
&lt;/h2&gt;

&lt;p&gt;Every phase transition and agent verdict appends to &lt;code&gt;docs/audit/AUDIT_LOG.jsonl&lt;/code&gt; — one JSON line per event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"2026-05-15T17:42:00Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"phase-14-model-swap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"phase"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"review-design"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"adversary"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"review"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"REJECTED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"findings_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"block_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"warn_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"note"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Append-only. Never modified. Main session and sub-agents both write to it, never delete or edit existing lines.&lt;/p&gt;

&lt;p&gt;This becomes the durable record of what was decided and why — something that doesn't exist in most AI coding setups where everything lives in ephemeral chat.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I've Shipped With This
&lt;/h2&gt;

&lt;h3&gt;
  
  
  free-context-hub
&lt;/h3&gt;

&lt;p&gt;On &lt;a href="https://github.com/letuhao/free-context-hub" rel="noopener noreferrer"&gt;free-context-hub&lt;/a&gt; I've delivered 15 development phases covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core backend: MCP server (36 tools), REST API (70+ endpoints), background worker&lt;/li&gt;
&lt;li&gt;Frontend: Next.js 16 + React 19, 20+ pages, human-in-loop review UI&lt;/li&gt;
&lt;li&gt;RAG pipeline: tiered search (ripgrep → FTS → semantic), 8-model embedding benchmark, reranking benchmarks with reproducible reports&lt;/li&gt;
&lt;li&gt;Multi-agent coordination: artifact leases with TTL/fencing, pending-review state, taxonomy profiles&lt;/li&gt;
&lt;li&gt;Knowledge portability: zip+JSONL bundle format, streaming import/export, cross-instance pull with SSRF hardening&lt;/li&gt;
&lt;li&gt;Tenant-scoped access control: authz model, 3-tier routing, event log, collective decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LoreWeave
&lt;/h3&gt;

&lt;p&gt;On &lt;a href="https://github.com/letuhao/lore-weave" rel="noopener noreferrer"&gt;lore-weave&lt;/a&gt; I've delivered 5 full vertical modules and am mid-way through a sixth, accumulating 1,497 commits since March 2026 across 19 microservices. The modules completed so far cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Identity &amp;amp; Auth&lt;/strong&gt; — JWT issuance, refresh rotation, multi-device session management (Go/Chi + NestJS gateway)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Books &amp;amp; Sharing&lt;/strong&gt; — book and chapter lifecycle, visibility policy, public catalog browse (Go/Chi, Postgres, MinIO)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider Registry&lt;/strong&gt; — BYOK AI provider credential vault, platform model catalog, streaming proxy, budget pre-flight (Go/Chi + worker-ai)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw Translation Pipeline&lt;/strong&gt; — async chunk-level translation job lifecycle, job queue via Redis Streams, per-chapter result storage, BYOK + platform model routing (Go/Chi + Python/FastAPI + worker-infra)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glossary &amp;amp; Lore Management&lt;/strong&gt; — multilingual entity management, chapter M:N evidence linking, wiki article generation, RAG-ready glossary export (Go/Chi, Postgres, glossary-service + knowledge-service two-layer pattern)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current Phase 6 work spans usage-billing and a hierarchical book extraction engine — the kind of multi-service, cross-cutting work where the workflow's cross-phase checkpoints earn their keep.&lt;/p&gt;

&lt;p&gt;That's 400+ commits on free-context-hub and 1,497 on lore-weave — the rest comes from private team projects also running this workflow — totaling 2,500+ commits with a live audit trail I can query across sessions that ran months apart.&lt;/p&gt;

&lt;p&gt;The hardest part was Phase 10 (SESSION) — keeping the session patch updated after every sprint without skipping it. Once that became a habit, sessions started to feel continuous rather than amnesia-punctuated.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Pros
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You understand your own system deeply.&lt;/strong&gt; Because you write the spec and approve it, you can't hide behind "the AI built it." You actually know what was built and why the trade-offs were made. This is the biggest practical advantage for me — not velocity, but comprehension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural decisions have a paper trail.&lt;/strong&gt; Every trade-off is in a spec file that was approved before code was written. When a future session revisits a design choice, the rationale is readable, not reconstructed from diff archaeology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context drift is visible.&lt;/strong&gt; When an AI starts building something that wasn't in the spec, the spec fingerprint comparison at POST-REVIEW catches it. Without a written spec, you'd never notice until integration time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deferred items don't get lost.&lt;/strong&gt; The workflow forces any "we'll do this later" to be written in &lt;code&gt;DEFERRED.md&lt;/code&gt; with a specific trigger condition. Nothing lives only in chat — chat is ephemeral, files are truth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's incrementally adoptable.&lt;/strong&gt; You can start with just CLARIFY + VERIFY and get substantial value. Add phases as your trust in the workflow grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Cons
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Token usage is genuinely high.&lt;/strong&gt; Each phase generates artifacts: spec files, plan files, audit events. AMAW mode multiplies this by spawning sub-agents. A single M-sized task with AMAW can burn 5,000–10,000 tokens before a line of code is written. At scale, this is a real budget consideration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You clarify constantly — and it takes real time.&lt;/strong&gt; Phase 1 (CLARIFY) is not a quick preamble. For any task with real ambiguity — architecture decisions, new API contracts, trade-off calls — you're in a back-and-forth that can run 20–40 minutes before design starts. At a medium-sized project cadence (10–20 above-XS tasks per sprint), this adds up to multiple hours per sprint spent purely on scoping. This is actually the point of the workflow, but if you're used to "just build it," the overhead feels significant early on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Human approval gates limit automation.&lt;/strong&gt; Every architecture decision, trade-off, and scope call requires your explicit approval. You cannot queue up a batch of tasks and walk away. If you need fully autonomous overnight runs, this workflow is the wrong tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The discipline needs enforcement tooling to hold.&lt;/strong&gt; Left to their own devices, agents will skip phases. The workflow holds together because of &lt;code&gt;workflow-gate.sh&lt;/code&gt; (a pre-commit gate that blocks commits if VERIFY and SESSION aren't done) and the append-only &lt;code&gt;AUDIT_LOG.jsonl&lt;/code&gt;. If you copy &lt;code&gt;docs/WORKFLOW.md&lt;/code&gt; into your project without also setting up the enforcement layer, expect phases to get skipped within a few sessions. The tooling is in the repository — it's not hidden — but it's a real setup step, not just copy-paste.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold-start sub-agents (AMAW only) miss things said in chat.&lt;/strong&gt; Because each AMAW sub-agent reads files from scratch, anything that was decided verbally in the session but never written to a file is invisible to them. This is a feature for preventing bias, but it means you must be disciplined about writing things down as you go. The Scribe sub-agent helps, but it can only record what's already in files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;Worth the overhead if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're building production systems — not prototypes — that will be maintained and extended&lt;/li&gt;
&lt;li&gt;You care about knowing &lt;em&gt;why&lt;/em&gt; each decision was made, not just that it compiles today&lt;/li&gt;
&lt;li&gt;You find yourself surprised by what the AI built, in ways that cost you rework later&lt;/li&gt;
&lt;li&gt;Sessions run over weeks or months and you need continuity across context windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overkill if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're doing exploratory coding, one-shot scripts, or time-boxed experiments&lt;/li&gt;
&lt;li&gt;Your sessions are short and the full context fits in one window&lt;/li&gt;
&lt;li&gt;You don't need an audit trail or human-approved architectural decisions&lt;/li&gt;
&lt;li&gt;Speed of iteration matters more than correctness of decision-making&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow is designed for the first category. Using it for the second is just friction.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to Use It
&lt;/h2&gt;

&lt;p&gt;All workflow files live in the &lt;a href="https://github.com/letuhao/free-context-hub/tree/main/agentic-workflow" rel="noopener noreferrer"&gt;&lt;code&gt;agentic-workflow/&lt;/code&gt;&lt;/a&gt; folder of the free-context-hub repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with the template:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy &lt;code&gt;WORKFLOW.md&lt;/code&gt; into your project root or paste the relevant sections into your &lt;code&gt;CLAUDE.md&lt;/code&gt; / agent instructions — this is the full 12-phase spec&lt;/li&gt;
&lt;li&gt;Customize the &lt;code&gt;[CUSTOMIZE]&lt;/code&gt; sections for your stack (verification commands, test runner, any MCP tools you use — MCP is the Model Context Protocol, an interface for giving AI agents access to external tools and knowledge stores; the workflow works without it)&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;workflow-gate.sh&lt;/code&gt; from the same folder to enforce the phase gates mechanically — without this, agents will skip phases&lt;/li&gt;
&lt;li&gt;For high-stakes tasks, see &lt;code&gt;amaw-workflow.md&lt;/code&gt; for the AMAW multi-agent extension&lt;/li&gt;
&lt;li&gt;Start with just &lt;strong&gt;task size classification + VERIFY&lt;/strong&gt; — those two alone change how you work with agents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The workflow is model-agnostic. I use it with Claude Code but nothing in the spec requires it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The 12-phase workflow is not magic. It's a way of making explicit things that were always implicit: what are we building, how big is it, what's the verification evidence, who approved it, what did we learn? The AI does most of the work. The human stays in control of the decisions that actually matter.&lt;/p&gt;

&lt;p&gt;The cost is real — more tokens, more time spent clarifying, more things requiring your approval before the AI proceeds. The benefit is also real: you end up with a system you understand deeply, and a trail of why it was built the way it was.&lt;/p&gt;

&lt;p&gt;For me, after 2,500+ commits across multiple projects, that trade-off is still worth it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Repositories: &lt;a href="https://github.com/letuhao/free-context-hub" rel="noopener noreferrer"&gt;letuhao/free-context-hub&lt;/a&gt; · &lt;a href="https://github.com/letuhao/lore-weave" rel="noopener noreferrer"&gt;letuhao/lore-weave&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Workflow files: &lt;a href="https://github.com/letuhao/free-context-hub/blob/main/agentic-workflow/WORKFLOW.md" rel="noopener noreferrer"&gt;&lt;code&gt;WORKFLOW.md&lt;/code&gt;&lt;/a&gt; · &lt;a href="https://github.com/letuhao/free-context-hub/blob/main/agentic-workflow/AMAW.md" rel="noopener noreferrer"&gt;&lt;code&gt;AMAW.md&lt;/code&gt;&lt;/a&gt; · &lt;a href="https://github.com/letuhao/free-context-hub/blob/main/agentic-workflow/CLAUDE.md.snippet" rel="noopener noreferrer"&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>agents</category>
      <category>development</category>
    </item>
    <item>
      <title>Dead Light Framework · Part 2 — a copy-paste setup so your AI agents stop losing context between sessions</title>
      <dc:creator>Lê Tú Hào</dc:creator>
      <pubDate>Fri, 22 May 2026 12:28:40 +0000</pubDate>
      <link>https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84</link>
      <guid>https://dev.to/letuhao/dead-light-framework-part-2-a-copy-paste-setup-so-your-ai-agents-stop-losing-context-between-4n84</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr47e5l9o0xqxgjduvwrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr47e5l9o0xqxgjduvwrb.png" alt="Stop your AI agents from losing context and silently reverting past decisions. A 10-minute, two-file setup (HANDOFF + LOG) you can copy today." width="799" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every Session Starts in Darkness. Your Documents Shouldn't. — A Copy-Paste Setup So AI Agents Stop Losing Context Between Sessions (Dead Light Framework, Part 2)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two files, four rules, ten minutes. Skip the theory; the templates are below and you can paste them into a repo right now.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Dead Light Framework — Part 2 of an ongoing series.&lt;/strong&gt; Series so far: &lt;a href="https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8"&gt;1 · The Emperor Is All But Dead&lt;/a&gt; · &lt;strong&gt;2 · Every Session Starts in Darkness&lt;/strong&gt; · &lt;em&gt;next: when two files aren't enough — the paperwork-vs-runtime decision&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;By a developer running AI agents as daily teammates — a peer, not an authority (&lt;a href="https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8"&gt;full framing in #1&lt;/a&gt;). · &lt;strong&gt;~7 min&lt;/strong&gt; · &lt;a href="https://github.com/letuhao/dead-light-framework" rel="noopener noreferrer"&gt;the Dead Light Framework repository (MIT)&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The tax you're paying (and want gone)
&lt;/h2&gt;

&lt;p&gt;If you hand real work to AI agents, you pay this every day: each new session &lt;strong&gt;starts from zero&lt;/strong&gt;. You re-explain the project, what you decided last time, what's in flight, which files matter. Fifteen, twenty minutes of re-priming a human teammate would never need — and worse, the agent cheerfully re-litigates Monday's decision on Wednesday because nothing told it the decision was settled.&lt;/p&gt;

&lt;p&gt;My least favourite version of it: I once left a comment explaining &lt;em&gt;why&lt;/em&gt; an ugly branch of code had to stay. Two days later a fresh session, sent in to tidy up TODOs, read the comment &lt;em&gt;as&lt;/em&gt; a TODO and deleted the branch by morning. The reasoning died with the session that wrote it. That's the tax — and it compounds.&lt;/p&gt;

&lt;p&gt;It isn't a model problem. Each session is stateless by design; the last session's reasoning is gone unless something on disk carries it. So put it on disk — deliberately, in a shape the next session can consume in one read. Here's the smallest setup that does it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The setup: two files, four rules (≈10 min)
&lt;/h2&gt;

&lt;p&gt;Drop two files at your repo root. That's the whole mechanism.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;HANDOFF.md&lt;/code&gt; — the snapshot a fresh session reads first
&lt;/h3&gt;

&lt;p&gt;Your project's &lt;strong&gt;current state on one screen&lt;/strong&gt;: what's true now, what's mid-task, what's decided, what to do next. It's the &lt;em&gt;first&lt;/em&gt; thing an agent reads each session — the thing that replaces fifteen minutes of you re-explaining. &lt;strong&gt;Rewrite it freely;&lt;/strong&gt; it always describes "now" (running history lives in &lt;code&gt;LOG.md&lt;/code&gt;, below). Think of it as the project's working memory, externalised so a memoryless teammate can borrow it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;doc_kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;state&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;working&lt;/span&gt;          &lt;span class="c1"&gt;# draft | working | sealed&lt;/span&gt;
&lt;span class="na"&gt;updated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2026-05-22&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="gh"&gt;# HANDOFF — &amp;lt;project&amp;gt;&lt;/span&gt;

&lt;span class="gu"&gt;## Now            # what is true today&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Frontend v2 rename is done; auth is on the new schema.

&lt;span class="gu"&gt;## In flight      # mid-task work + who owns it&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Migrating &lt;span class="sb"&gt;`users`&lt;/span&gt; table — session-12, half done; next step is the backfill.

&lt;span class="gu"&gt;## Decided        # do NOT re-litigate these&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Auth must not import billing. Why: layering; billing changes shouldn't ripple into auth.

&lt;span class="gu"&gt;## Start here next&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Run the &lt;span class="sb"&gt;`users`&lt;/span&gt; backfill, then delete the legacy column.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Copy the full, commented template:&lt;/em&gt; &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/distribution/templates/handoff-template.md" rel="noopener noreferrer"&gt;&lt;code&gt;handoff-template.md&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;LOG.md&lt;/code&gt; — the append-only history
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;HANDOFF.md&lt;/code&gt; is "now", &lt;code&gt;LOG.md&lt;/code&gt; is "everything that happened" — one line per event, &lt;strong&gt;append-only; you never edit a past line&lt;/strong&gt; (a correction is a new line). Why keep it when the snapshot already shows the current state? Because the snapshot &lt;em&gt;overwrites itself&lt;/em&gt;: the moment you need to know &lt;em&gt;why&lt;/em&gt; something was decided, replay how you got here, or recover after a session left a mess, you need the history the snapshot threw away. The snapshot is derived &lt;em&gt;from&lt;/em&gt; this log — not the other way round.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;doc_kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;log&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="gh"&gt;# LOG — &amp;lt;project&amp;gt;   (append-only; a correction is a NEW line, never an edit)&lt;/span&gt;
&lt;span class="p"&gt;
-&lt;/span&gt; 2026-05-22 · session-12 · decided  · auth must not import billing (layering)        &lt;span class="c"&gt;&amp;lt;!-- sealed --&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; 2026-05-22 · session-12 · created  · users-table migration draft                     [candidate]
&lt;span class="p"&gt;-&lt;/span&gt; 2026-05-22 · session-12 · note     · backfill must run before dropping legacy column
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Copy the full, commented template:&lt;/em&gt; &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/distribution/templates/log-template.md" rel="noopener noreferrer"&gt;&lt;code&gt;log-template.md&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The four rules
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Two kinds, never mixed.&lt;/strong&gt; &lt;code&gt;HANDOFF.md&lt;/code&gt; is &lt;em&gt;current state&lt;/em&gt; — overwrite it freely. &lt;code&gt;LOG.md&lt;/code&gt; is &lt;em&gt;history&lt;/em&gt; — &lt;strong&gt;append only; never edit a past line&lt;/strong&gt; (a correction is a new line). This one split is what makes the whole thing trustworthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;First thing every session:&lt;/strong&gt; read &lt;code&gt;HANDOFF.md&lt;/code&gt;, then the new lines in &lt;code&gt;LOG.md&lt;/code&gt; since you last looked. That's your re-prime — under a minute, no human needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Last thing every session:&lt;/strong&gt; append what you did to &lt;code&gt;LOG.md&lt;/code&gt;, then update &lt;code&gt;HANDOFF.md&lt;/code&gt; to match. (An agent can do both as part of "wrap up.")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag what isn't settled.&lt;/strong&gt; &lt;code&gt;[candidate]&lt;/code&gt; = produced by an agent, not human-confirmed. &lt;code&gt;&amp;lt;!-- sealed --&amp;gt;&lt;/code&gt; = a decision that must not be "cleaned up" away. Agents read these.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No tool to install, no service to run — git plus two markdown files. Want it as one copy-paste page (both templates + the rules + the agent instruction)? &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/distribution/agent-context-quickstart.md" rel="noopener noreferrer"&gt;The Agent Context Quickstart&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tell your agent once&lt;/strong&gt; (system prompt / &lt;code&gt;CLAUDE.md&lt;/code&gt; / &lt;code&gt;.cursorrules&lt;/code&gt;): &lt;em&gt;"At the start of every session read HANDOFF.md and the recent LOG.md lines before doing anything. At the end, append your actions to LOG.md and update HANDOFF.md. Never edit past LOG lines; never touch a &lt;code&gt;&amp;lt;!-- sealed --&amp;gt;&lt;/code&gt; decision without asking."&lt;/em&gt; Now the discipline is the agent's job, not yours.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What you actually get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Re-prime drops from ~15 min to ~1 min.&lt;/strong&gt; The agent reads two files and is current — you stop being a human context-cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decisions stop silently reverting.&lt;/strong&gt; A &lt;code&gt;sealed&lt;/code&gt; line in &lt;code&gt;Decided&lt;/code&gt; is a wall the next session sees; the Wednesday-undoes-Monday failure mostly stops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can stop mid-task and resume clean.&lt;/strong&gt; &lt;code&gt;In flight&lt;/code&gt; + the LOG tail tell the next session exactly where to pick up — even a &lt;em&gt;different&lt;/em&gt; agent, even weeks later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Honest cost: ~2 minutes of discipline per session (append + update), and it pays off only once you're past a handful of sessions or running more than one agent. Below that, a plain README is fine — don't over-build.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why it works (the 30-second version)
&lt;/h2&gt;

&lt;p&gt;Documentation is your team's shared memory. When some teammates wipe their memory every session, the documents have to &lt;em&gt;carry state&lt;/em&gt; — and the reliable way to carry state across actors that can't sync live is exactly this: one &lt;strong&gt;append-only history&lt;/strong&gt; plus a &lt;strong&gt;derived current-state&lt;/strong&gt; view. That's the eventually-consistent coordination pattern distributed systems have used for decades; I just borrowed it. The full standard — including the multi-repo and multi-agent versions, and the failure modes — is &lt;a href="https://github.com/letuhao/dead-light-framework/blob/main/framework/paperwork-standard.md" rel="noopener noreferrer"&gt;&lt;code&gt;framework/paperwork-standard.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This covers one repo and one session at a time.&lt;/strong&gt; The moment you have &lt;em&gt;two agents writing at the same instant&lt;/em&gt;, or an invariant that must never break even for a second, two markdown files can't promise it — and that's a real, provable limit, not a gap you patch with better notes. Knowing which side of that line you're on is the next post.&lt;/p&gt;




&lt;h2&gt;
  
  
  The story below the setup &lt;em&gt;(optional — skip if you came for the templates)&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;You can stop here with a working setup. If you want the &lt;em&gt;why behind the why&lt;/em&gt;, here it is.&lt;/p&gt;

&lt;p&gt;Back in late 2024 / early 2025, when I first started handing agents real work — &lt;em&gt;audit this service&lt;/em&gt;, &lt;em&gt;draft this migration&lt;/em&gt;, &lt;em&gt;pick up where the last session left off&lt;/em&gt; — this was a dumb, recurring tax. Every new session opened with me re-explaining the same context, and by the third I was burning fifteen or twenty minutes re-establishing state a human teammate would simply have &lt;em&gt;had&lt;/em&gt;. So I wrote a better &lt;code&gt;HANDOFF.md&lt;/code&gt;. Then a better one. The overhead kept climbing, and a voice in the back of my head kept saying: &lt;em&gt;you're carving this at the wrong joint.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So I made the mistake of &lt;em&gt;following&lt;/em&gt; the problem — and it turned out not to be the problem I thought it was. Strip the word "documentation" and it's stark: I had actors that &lt;strong&gt;start cold, run briefly, and can't talk to each other in real time&lt;/strong&gt;, and they had to act coherently anyway. That's not a docs question — it's &lt;em&gt;distributed systems&lt;/em&gt;, a field that's been proving theorems about exactly this since the 1970s. I'd been hand-rolling a worse version of it in markdown without noticing.&lt;/p&gt;

&lt;p&gt;That's also why this framework wears &lt;em&gt;Warhammer 40,000&lt;/em&gt; names, in case the "darkness" felt like an affectation. The Imperium of Man runs a galaxy with &lt;strong&gt;no real-time communication&lt;/strong&gt; — its ships cross the &lt;em&gt;warp&lt;/em&gt;, where they're simply unreachable. So it governs on three things: &lt;strong&gt;frozen edicts&lt;/strong&gt; (decided once, not renegotiable by whoever's nearest), the &lt;strong&gt;Adeptus Administratum&lt;/strong&gt; (literally galactic paperwork), and the &lt;strong&gt;Astronomican&lt;/strong&gt; — a beacon of light a ship lost in the dark steers by. Strip the gothic paint and that's the entire engineering of this post: frozen authority, durable records, and a signal that survives. The darkness in the title is the warp between your sessions; your two files are the Astronomican.&lt;/p&gt;

&lt;p&gt;And there's a catch I'll be honest about, because it shapes the whole series: that "real, provable limit" two paragraphs up isn't hand-waving — coordinating actors with no live channel runs into a genuine theorem (CAP), and it caps what &lt;em&gt;any&lt;/em&gt; pile of documents can promise. So after I'd borrowed all this and wired it together, I spent more effort trying to &lt;strong&gt;break&lt;/strong&gt; it than to build it — cold, hostile reviewers; an independent pass over every borrowed citation; benchmarks designed to make it fail. Some of it failed. That story is the rest of the series — but your setup above doesn't wait on any of it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;New here? I'm a developer who runs AI agents daily — a peer, not an authority; full framing in &lt;a href="https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8"&gt;#1&lt;/a&gt;. Standing caveat: one developer, essentially one case study — useful, not proven. Tell me where it breaks for you.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Every session starts in darkness" is Warhammer-flavoured naming, nothing more. Independent practitioner exploration; no affiliation with Games Workshop. Repository MIT-licensed.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h1&gt;
  
  
  DeadLightFramework #AIAgents #AIProductivity #Documentation #ContextContinuity #AIAgentGovernance #HumanAICollaboration #PromptEngineering #DevTools
&lt;/h1&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Dead Light Framework: An Experimental Framework for Human-AI Collaboration #Post 1</title>
      <dc:creator>Lê Tú Hào</dc:creator>
      <pubDate>Tue, 12 May 2026 04:39:23 +0000</pubDate>
      <link>https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8</link>
      <guid>https://dev.to/letuhao/dead-light-framework-an-experimental-framework-for-human-ai-collaboration-post-1-5bh8</guid>
      <description>&lt;h2&gt;
  
  
  The Emperor Is All But Dead. The Light Remains.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  An experimental governance framework for software teams of humans and AI agents — and a request to be argued with
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Status: experimental. Unverified in the field. Looking for sparring partners more than followers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;By:&lt;/strong&gt; a developer with ~10 years across many projects, not an academic or industry authority — full bio at the bottom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Published:&lt;/strong&gt; 2026-05-11 · &lt;strong&gt;~8 min read&lt;/strong&gt; ·&lt;br&gt;
&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/letuhao/dead-light-framework" rel="noopener noreferrer"&gt;github.com/letuhao/dead-light-framework&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I have been building software with AI agents long enough to see the same governance failure mode appear over and over: agents and humans contradicting Monday's decisions on Wednesday, layers leaking into each other, no anchor to navigate by. I am testing the hypothesis that &lt;strong&gt;human + AI software projects need a frozen source of authority that no participant — including the author — can rewrite at will.&lt;/strong&gt; This post is the opening of an open debate; sharper arguments against it would help me more than agreement.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One question I most want to be wrong about:&lt;/strong&gt; Is "frozen authority" actually compatible with "evolutionary architecture"? I think yes — argue with me.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The pain I keep running into
&lt;/h2&gt;

&lt;p&gt;I have been building software with AI agents long enough — daily, across multiple projects — to recognize a pattern that does not look like a bug.&lt;/p&gt;

&lt;p&gt;On Monday, an agent and I agree that the auth layer should not know about billing. On Wednesday, a different session of the same agent cheerfully imports a billing helper into the auth module, because the prompt of the day made it convenient. The change passes review, because the human reviewer has also forgotten the Monday conversation. By the time anyone notices, the layering decision has been quietly inverted in three places.&lt;/p&gt;

&lt;p&gt;Another version of the same story: I commit a fix with a comment explaining &lt;em&gt;why&lt;/em&gt; a specific branch of code must stay. Two days later, a fresh agent session is sent in to clean up TODOs and reads the comment as a TODO. By morning the carefully-preserved branch is gone, and the previous session's reasoning died with the previous session.&lt;/p&gt;

&lt;p&gt;This is not a model failure. It is not a human failure either. It is the predictable result of a team in which:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Some members are stateless.&lt;/strong&gt; Foundation-model agents have well-documented memory and identity limits across sessions (see Bommasani et al. 2021, &lt;em&gt;On the Opportunities and Risks of Foundation Models&lt;/em&gt;; Park et al. 2023, &lt;em&gt;Generative Agents&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "why" behind past decisions is in nobody's working memory.&lt;/strong&gt; Humans forget. Agents don't even start with the context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Many actors can each "decide".&lt;/strong&gt; When everyone has authority to nudge a direction, nothing actually sticks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latest input dominates.&lt;/strong&gt; Agents will amplify whatever the most recent prompt suggests, including the wrong directions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I have come to think of these as &lt;strong&gt;governance gaps wearing technical disguises.&lt;/strong&gt; No amount of better prompts, better tests, or better refactor discipline patches them. They are properties of the &lt;em&gt;team shape&lt;/em&gt;, not of any single contributor.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we're fighting against — "The Chaos"
&lt;/h2&gt;

&lt;p&gt;The failure pattern above has a name in this framework: &lt;strong&gt;The Chaos.&lt;/strong&gt; It is the umbrella for four specific drift modes that tend to compound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context rot&lt;/strong&gt; — agents lose the &lt;em&gt;why&lt;/em&gt; behind past decisions and re-invent or contradict prior choices across sessions (the Monday/Wednesday and TODO-misread stories above).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architect rot&lt;/strong&gt; — without a fixed reference, refactors land in incompatible directions. Humans and agents drift further apart from any earlier coherent design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scope creep&lt;/strong&gt; — the project keeps absorbing new concerns. Agents amplify it because the latest prompt is always more vivid than the original mandate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accumulated technical debt&lt;/strong&gt; — local conveniences that, once normal, are hard to undo. Humans and agents together can ship more of it, faster than a single human could.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is roughly what the AI-dev community has lately started calling &lt;strong&gt;"vibe coding"&lt;/strong&gt;: shipping code by feel, with agents steering, no anchor strong enough to make Monday's promise survive into Wednesday's commit. Vibe coding is wonderful for prototypes. It is brutal for anything that has to outlive a single session.&lt;/p&gt;

&lt;p&gt;The framework's job is not to forbid vibe coding. It is to give a project enough of a fixed backdrop that, when it graduates from prototype to &lt;em&gt;thing-people-rely-on&lt;/em&gt;, decisions can be made against something stable instead of against the void.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where existing methodologies leave a design slot empty
&lt;/h2&gt;

&lt;p&gt;I want to be careful here, because this is the easiest place to overreach.&lt;/p&gt;

&lt;p&gt;Waterfall, Agile, Scrum, SAFe, RUP — these work. I am not in a position to grade them. If an AI agent shows up to a stand-up the way a competent teammate does — persistent role, accountable for decisions, reads the working agreements, follows what was decided yesterday — Scrum runs the same as it always has. Sometimes better, frankly, because the agent does not forget the meeting on the drive home.&lt;/p&gt;

&lt;p&gt;So I do not want to claim the methodologies "fail" or "stop covering" anything when agents join. That would be both arrogant and inaccurate.&lt;/p&gt;

&lt;p&gt;What I do think is narrower: &lt;strong&gt;none of these methodologies were designed with AI agents as first-class participants in mind.&lt;/strong&gt; They do not specify what an "agent role" looks like — its memory model, its onboarding procedure, its authority bounds, its drift profile, how its decisions are attributed across sessions. That is an unfilled design slot, not a coverage failure.&lt;/p&gt;

&lt;p&gt;The Dead Light Framework is one attempt at filling that slot. It sits &lt;em&gt;on top of&lt;/em&gt; whatever delivery framework you already run, not in place of it. If your Scrum is well-disciplined and your reviews are tight, you will catch some of the failure modes I described above without any of this. The framework is for the parts your existing process was never asked to handle in the first place.&lt;/p&gt;




&lt;h2&gt;
  
  
  The hypothesis (the part you should attack)
&lt;/h2&gt;

&lt;p&gt;The thing I am testing is one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A software project for humans + AI agents needs a frozen source of authority that no participant — human or agent — can rewrite at will.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Codified once by a small council. Sealed before kickoff. Humans interpret it. Agents execute within it. Neither group obeys a &lt;em&gt;person&lt;/em&gt; — both navigate by the same fixed light.&lt;/p&gt;

&lt;p&gt;This is not a radical idea outside software. It is roughly how constitutional federalism works (the U.S. Constitution constrains every subsequent administration), how religious institutional canon works (the Nicene Creed is older than any living interpreter), how central-bank mandates work (a price-stability mandate outlasts any single governor), and how RFC-driven protocol governance works (TCP/IP does not get rewritten because a vendor finds it inconvenient).&lt;/p&gt;

&lt;p&gt;What is novel — &lt;em&gt;if anything&lt;/em&gt; — is applying this pattern at the level of an individual software project, with AI agents as first-class participants whose context windows guarantee the authority cannot live in their heads.&lt;/p&gt;

&lt;p&gt;I call the sealed document the &lt;strong&gt;Astronomican&lt;/strong&gt;. I call the sealing meeting the &lt;strong&gt;Ascension Council&lt;/strong&gt;. I call the agent-type rulebooks &lt;strong&gt;Codices&lt;/strong&gt;. The names are borrowed from Warhammer 40,000.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the metaphor (important)
&lt;/h2&gt;

&lt;p&gt;I want to be honest about this up front, because it is the obvious objection.&lt;/p&gt;

&lt;p&gt;The Imperium of Mankind in Warhammer 40,000 is a &lt;em&gt;cautionary tale&lt;/em&gt;. It is grimdark by design: a bureaucratic, paranoid, ossified empire that fails spectacularly across ten thousand years. Picking it as a governance metaphor without acknowledging that is internally contradictory.&lt;/p&gt;

&lt;p&gt;So I do not use it as evidence. The framework's policy, written into its own rules, is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;40k vocabulary is naming and shared metaphor only.&lt;/strong&gt; Every load-bearing argument must rest on a real-world system with an observable track record: constitutional federalism, military command-and-control doctrine, central-bank mandates, religious canon, established corporate practice (Toyota Production System, Amazon two-pizza teams), open-source governance, established software methodologies.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the 40k name and the real-world precedent disagree, the real-world precedent governs. The Imperium provides memorable names. Toyota's Andon Cord, the U.S. military's C2/SIGINT loop, and Bezos-era Amazon's API mandate provide the actual design lessons — particularly on the hardest problem the Imperium itself failed at: &lt;strong&gt;centralized authority combined with distributed sensing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you find a place in the framework where I leaned on 40k &lt;em&gt;as an argument&lt;/em&gt; rather than as a name, that is a finding. Please file it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Glossary — 40k terms used above
&lt;/h2&gt;

&lt;p&gt;For readers who do not know Warhammer 40,000 — one-liners on each term used in this post.&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Astronomican&lt;/strong&gt; — In W40k, the psychic beacon that guides the Imperium's space travel after its god-emperor has all but died. &lt;strong&gt;In this framework:&lt;/strong&gt; the name for the sealed project document of purpose, immutable laws, and guiding principles.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Imperium of Mankind&lt;/strong&gt; — The fictional galactic empire in W40k. Used here only as a memorable source of names; &lt;em&gt;not&lt;/em&gt; as a governance role model (the empire fails spectacularly in canon — that is part of why I quote it carefully).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codex / Codices&lt;/strong&gt; — In W40k, the rulebook each Space Marine Chapter operates under. &lt;strong&gt;In this framework:&lt;/strong&gt; the rulebook each AI agent type operates under (operational bounds, hard stops, output contract, notify triggers).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adeptus Administratum&lt;/strong&gt; — In W40k, the Imperial bureau of records, taxation, and administrative logistics — the empire's "chief of paperwork." &lt;strong&gt;In this framework:&lt;/strong&gt; the first sealed Chapter — a PM / High-Lord aide role.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ascension Council&lt;/strong&gt; — &lt;em&gt;Not&lt;/em&gt; from canon. The framework's name for the one-time small group of humans who seal the project's founding document before kickoff and then disband.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chapter / Chapters&lt;/strong&gt; — In W40k, a self-contained battle order of Space Marines, each with its own Codex. &lt;strong&gt;In this framework:&lt;/strong&gt; an agent &lt;em&gt;type&lt;/em&gt;, each with its own Codex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Chaos&lt;/strong&gt; — In W40k, the warp-based corrupting forces the Imperium fights eternally. &lt;strong&gt;In this framework:&lt;/strong&gt; the umbrella failure mode the framework tries to defend against — context rot, architect rot, scope creep, accumulated technical debt; roughly the kind of drift "vibe coding" produces when extended beyond prototyping.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What this is and is not
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;composition layer&lt;/em&gt; that sits on top of Agile / Scrum / Kanban / whatever you already run. It does not replace delivery rhythm.&lt;/li&gt;
&lt;li&gt;An attempt to give projects a constitution-like artifact and an explicit protocol for agent participation.&lt;/li&gt;
&lt;li&gt;A working hypothesis with a documented audit trail (38 findings against my own claims, all remediated, still openly listed).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is not:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proven. I have one in-flight case study (a 358-KLOC project called LoreWeave). One case is not evidence. It is a smoke test.&lt;/li&gt;
&lt;li&gt;A productivity tool. It will add overhead before it removes any.&lt;/li&gt;
&lt;li&gt;A claim that you should run your project this way. It is a claim that the failure modes are real, that existing methodologies were simply not designed with agent participants in scope, and that &lt;em&gt;some&lt;/em&gt; framing in this neighborhood is probably needed to fill that slot.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Where this stands today
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Phase 0 (the calibration/audit phase for retrofit projects) — &lt;strong&gt;sealed&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Phase 1 (the Astronomican itself) — partial. Six known open questions, listed publicly.&lt;/li&gt;
&lt;li&gt;Phase 2 (Codex per Chapter) — first Chapter sealed (a PM/High-Lord aide called the Adeptus Administratum). Others wait for real-project triggers.&lt;/li&gt;
&lt;li&gt;Phase 3 (drift detection) and Phase 4 (re-consecration) — not started.&lt;/li&gt;
&lt;li&gt;One case study (LoreWeave) — Phase 0 Pass 1 about to begin.&lt;/li&gt;
&lt;li&gt;Internal audit (Independent Verification Pass) — five of seven phases complete. The audit is public, including the times the framework failed its own audit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is in the open. The framework is being built in a single repo with full debate history.&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/letuhao/dead-light-framework" rel="noopener noreferrer"&gt;github.com/letuhao/dead-light-framework&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I want from readers
&lt;/h2&gt;

&lt;p&gt;Not converts. Arguments.&lt;/p&gt;

&lt;p&gt;Specifically, I want people to attack these:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Is "frozen authority" actually compatible with "evolutionary architecture"?&lt;/strong&gt; I think yes, with a re-consecration ceremony. But that ceremony is unsealed and you might convince me it is impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the 40k vocabulary do more harm than good?&lt;/strong&gt; I find it useful as memorable scaffolding for a debate-driven team. But it may be repelling readers who would otherwise engage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where does an industry standard already do this job?&lt;/strong&gt; If COCOMO II / CMMI v3.0 / ITIL 4 / DORA already cover one of the gaps I think I am filling, I want to know before adding another box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What is the smallest experiment that would falsify the framework?&lt;/strong&gt; I am genuinely unsure how to design this. A failed retrofit on one project is suggestive, not conclusive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What did I import from the Imperium that I should not have?&lt;/strong&gt; I keep finding things. Help me find more.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's coming next
&lt;/h2&gt;

&lt;p&gt;A short series of posts will work through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The case study in detail (where it hurt, with numbers).&lt;/li&gt;
&lt;li&gt;Why Agile/Scrum specifically do not cover this gap.&lt;/li&gt;
&lt;li&gt;The mechanics of sealing an Astronomican.&lt;/li&gt;
&lt;li&gt;The Codex pattern for AI agents.&lt;/li&gt;
&lt;li&gt;How the framework audits itself (and the times it has failed).&lt;/li&gt;
&lt;li&gt;The anti-patterns I knowingly imported from a fictional dying empire, and how I compensate.&lt;/li&gt;
&lt;li&gt;Open questions where the framework could still be wrong.&lt;/li&gt;
&lt;li&gt;A practical adoption sketch — without promising it works.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of the failure modes I described sound like the project you are in right now, I would especially like to hear from you. The framework is far more useful as a piñata than as a manifesto.&lt;/p&gt;




&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;A working developer with roughly ten years of experience across a range of projects. Not an academic, not an industry authority on software methodology, not a methodologist of any kind. No chair, no certification body, no track record of published frameworks behind me.&lt;/p&gt;

&lt;p&gt;The Dead Light Framework — the subject of this post and the series it opens — is a personal exploration: one practitioner's attempt at finding methods that hold up when AI agents become full-time teammates. I publish it openly because I would rather be told I am wrong by people who have stood in front of the same problems than be politely ignored.&lt;/p&gt;

&lt;p&gt;If I sounded certain anywhere above, treat that as a slip in tone, not a claim of authority. The framework is at hypothesis stage. Everything is in scope to be argued with.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The Emperor is all but dead. The light remains.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/letuhao/dead-light-framework" rel="noopener noreferrer"&gt;github.com/letuhao/dead-light-framework&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Independent practitioner exploration. No affiliation with Games Workshop. Repository MIT-licensed.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>discuss</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
