Kwansub Yun

Posted on Jun 3 • Originally published at flamehaven.space

When the Memory Gate Met a Real Archive: What 90 Experiments Taught Us About Cheap LLM Slop

#opensource #ai #architecture #governance

Introduction: Enforcing the MICA Contract

This article is the practical side of the MICA series. MICA stands for Memory Invocation and Context Archive. In the workflow described here, it is a small package that the maintainer loads at session start so the active rules are visible before any code is touched.

Parts 6 and 7 set up the contract. This article shows what that contract did when a real scientific archive started accumulating cheap slop across more surfaces than a single maintainer could manually hold.

The archive is the Flamehaven Verification Ledger. It publishes three kinds of records.

EQA (Equation-to-Artifact). Physics and math reproductions. Currently 56 records, numbered TOE-TEST-0001 through TOE-TEST-0056. Example: a Schwarzschild Planck-scale metric verification.
BAV (Biomolecular AI Validation). Protein-folding consensus checks across several AI fold models (AlphaFold3, AlphaFold2, Chai-1, Boltz-2). Currently 34 experiments, with 6 active cards and a 26-entry foundational archive.
BSC (Bioscience Compliance). Repository compliance audits against external risk taxonomies (the MIT AI Risk Repository and EU AI Act). Currently 2 audits.

That is around 90 experiments in total. The full file count is past 300. Every record is published. The archive is intended to be cited. If the AI maintainer drifts, the drift can become a downstream paper citation.

One scope note matters before the story starts. flamehaven-audit-reports is not the engine that computes these results. It is the public evidence surface. Upstream engines and experiment repositories produce the raw artifacts.

This repository ingests those artifacts, sanitizes them for publication, classifies what kind of record they are, and renders them in a static ledger that other people can inspect and cite.

The archive has already taught us three different shapes of cheap slop. EQA taught us about framing drift at scale (a record displayed as a PASS when no real check produced it). The portal taught us about state duplication (an inline JavaScript copy that drifted from the disk file behind it).

BAV keeps trying to teach us about provenance drift, artifact-identity drift, and over-clean presentation around real runs. The article walks through the first two scars as documented incidents, illustrates the third with a constructed scenario, and describes the gate that grew out of them.

📖 Glossary

A short list. Skim and move on.

MICA. A small package the maintainer loads at session start. It carries the rules and exposes whether the package state is coherent before write work begins.
DI (Design Invariant). A rule with an ID. Example: DI-EQA-001 says math runs must use mpmath at 200-bit precision or higher.
Playbook. A markdown file. People read it. Every rule inside cites a DI.
Schema. Two machine-readable files. mica.yaml carries the package shape. archive.json carries the 28 DIs.
Validator. mica_pct.py. When run against a package root, it emits CLOSED CONTRACT or INCOMPLETE.
Receipt. A small JSON block proving a run actually ran. Pins the engine commit hash, the run command, and the output hash.
EQA / BAV / BSC. The three lanes of the archive. Physics math, protein folding, compliance.

That is enough to read the rest.

1. The Archive We Are Talking About

The story needs a concrete protagonist. The protagonist is the archive itself. The opening named the three lanes. This section adds the file shape, three live numbers, and one honest scope label that the rest of the article will keep returning to.

The protagonist is not a single program. It is a layered publication system. Upstream computation happens in engine or experiment repositories. flamehaven-audit-reports is the place where those outputs are turned into public records.

That projection layer does four jobs that are easy to blur together if they are not named explicitly:

Ingest the upstream artifact
Sanitize anything that should not be published as-is
Classify the record by what kind of evidence it really is
Render it through a static inspection surface

The three lanes do not all behave the same way inside that surface:

EQA Lane: Closest to a deterministic computation archive.
BAV Lane: Pipeline- and governance-heavy. The ledger must distinguish between a genuine rerunnable experiment, a runtime audit, and a research or review artifact.
BSC Lane: Maps repository state directly to external compliance taxonomies.

Each EQA record carries at least two files. A machine-readable internal_data.json holds the receipt. A human-readable analysis_report.md holds the narrative. Some records also ship a SPAR review record. The three strongest current records are:

Schwarzschild Planck-scale metric verification. Engine re-execution produced Omega = 0.9985.
de Sitter background check. Recorded at sqrt_jsd = 0.2722.
OpenAI Erdős Eq.(2.2) reproduction. Claim: matches the published value to 0.014 percent. Anchored to a public MIT repo and a Zenodo DOI.
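The receipt held in internal_data.json is what lets a reader check that a published output is the one the run produced. Here is a minimal sketch of that check. It assumes hypothetical field names (engine_commit, run_command, output_sha256) and SHA-256 as the hash, and it is not the ledger's actual schema.

```python
import hashlib

# Hypothetical receipt layout; these field names are illustrative assumptions,
# not the ledger's actual schema.
REQUIRED_FIELDS = ("engine_commit", "run_command", "output_sha256")


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of raw output bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def receipt_matches(receipt: dict, output_bytes: bytes) -> bool:
    """True only if the receipt is complete and its hash matches the output."""
    if any(not receipt.get(field) for field in REQUIRED_FIELDS):
        return False
    return receipt["output_sha256"] == sha256_hex(output_bytes)


output = b'{"omega": 0.9985}\n'
receipt = {
    "engine_commit": "0000000",      # placeholder, not a real commit
    "run_command": "python run.py",  # placeholder command
    "output_sha256": sha256_hex(output),
}
print(receipt_matches(receipt, output))              # True
print(receipt_matches(receipt, output + b"edited"))  # False
```

A receipt of this kind proves only that the file is unchanged since the run. It says nothing about whether the run itself was meaningful.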

The BAV lane is honest about its own scale. Of the 6 active cards, only one (EXP-031) carries a foldable input sequence. It is a 52-amino-acid input run against AlphaFold3, AlphaFold2, Chai-1, and Boltz-2.

The other five cards are governance and methodology experiments. They do not ship a re-run scaffold. That boundary is honest. We did not invent a fake fold to fill the slot.

The total is past 300 files. That number matters. A single maintainer can review 5 records by hand. Nobody can review 300. Slop scales with file count. Review does not.

2. What the EQA Lane Taught Us First

The first scar was about labels, not numbers. The numbers were correct. The labels around them were wrong.

In June 2026, we ran an internal audit on the EQA archive, the lane that publishes the physics and math reproductions. The plan was to spot-check the calculations. We re-ran the engine on a sample of records. The numbers matched.

A Schwarzschild horizon calculation came back at Omega = 0.9985. A de Sitter background check came back at sqrt_jsd = 0.2722. The math was honest.

The audit found a different problem.

The public website was showing 51 of the 56 records with a green PASS badge. A green PASS is supposed to mean: a numerical check ran and the result passed a threshold. That was not what the website was doing.

It was treating every record that had a markdown analysis file as PASS, whether or not a real check had ever run. Governance notes, scenario builds, and integration documents all showed up as if they had been verified.

A reader scanning the page saw "51 successful verifications." When we sat down and went through the 51 records by hand, only 7 of them had come from a real engine run. The other 44 were notes and supporting documents that had been imported into the lane over time.

The numbers did not change. The framing did.

We rewrote the page headline to say "7 verification runs and 44 supporting documents." We wrote a new rule into the package contract that lives next to the records on disk.

The rule says, in plain English: a green PASS badge can only come from a real threshold check. The mere presence of a report file is not a PASS. A grade copied in from someone else's report is not a fresh verdict. The five most recent records (numbered TOE-TEST-0052 through TOE-TEST-0056) carry their own real verdicts.
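The rule above can be sketched as a small badge function. This is an illustration only: the field names (threshold_check, source, value, threshold), the "greater than or equal" direction, and the 0.99 threshold are all assumptions made for the example, not the archive's real record layout.

```python
from typing import Optional


def badge_for(record: dict) -> Optional[str]:
    """Return a badge only when a real threshold check is recorded."""
    check = record.get("threshold_check")  # hypothetical field name
    if not isinstance(check, dict):
        return None  # a report file alone earns no badge
    if check.get("source") != "engine_run":
        return None  # a grade copied from another report is not a fresh verdict
    value, threshold = check.get("value"), check.get("threshold")
    if not isinstance(value, (int, float)) or not isinstance(threshold, (int, float)):
        return None
    return "PASS" if value >= threshold else "FAIL"


# A supporting document: it has a report but no check behind it.
notes_only = {"id": "example-note", "analysis_report": "analysis_report.md"}

# A verification run; the 0.99 threshold is made up for this sketch.
engine_run = {
    "id": "example-run",
    "threshold_check": {"source": "engine_run", "value": 0.9985, "threshold": 0.99},
}

print(badge_for(notes_only))  # None
print(badge_for(engine_run))  # PASS
```

The design choice is that the absence of a check yields no badge at all, so a supporting document cannot be mistaken for a passing or a failing run.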

One thing to be clear about. This audit was a manual one-time read. The MICA validator did not catch the drift. We caught it by reading the records ourselves and asking what each one actually claimed. What MICA does now is preserve the lesson in the package contract and in the maintainers' workflow.

The contract status the validator emits when everything lines up is called CLOSED CONTRACT. That does not mean every semantic rule is automatically enforced by the validator itself.

It means the package structure, declared layers, and DI bindings are coherent, and the maintainer is expected to run inside that contract before changing the archive.

This is the kind of failure no CI gate or syntax check would catch. The math was correct. The framing was wrong. A markdown-only policy would have continued to allow it because every file would have parsed cleanly. The rule survived this kind of pressure because the contract records both what the rule says and the specific incident that forced it to exist.

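The next scar hit a different part of the system: not the math lane this time, but the website itself. Before that story, it helps to say why an archive like this one cannot absorb drift the way most writing can.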

3. The Forgiveness Budget Scientific Archives Don't Have

Most LLM-assisted writing operates on a forgiveness budget.

A blog post can be slightly overstated. A README can describe something the code does not quite do yet. A pitch deck can round 73% up to "over 70%." The reader corrects internally. The next revision absorbs the drift. The social cost of small overclaiming is low.

A scientific archive does not have that budget.

This archive is published in a form meant to be cited. The Schwarzschild Omega value, the Erdős reproduction match percentage, and the EXP-031 fold metrics are all the kind of claims that can become downstream references. The drift that is harmless in a blog post becomes a poisoned downstream paper citation here.

The model that helpfully rewrites a paragraph also helpfully invents a SMILES string (the text encoding chemists use for molecules) that looks chemically plausible. The agent that summarizes a build log will, if asked one too many times, invent a DOI. The same instinct that makes LLMs useful for prose makes them dangerous for an archive.

The objects that have to survive this environment are the ones an LLM is least equipped to verify on its own. SMILES strings. DOIs. AlphaFold pLDDT values (per-residue confidence scores for a fold). Numerical thresholds with physical meaning. Record-level provenance. A wrong value in any of these still spell-checks, parses, and passes a syntax-level continuous-integration pipeline.

This is why the archive needs a gate that loads before any code is touched.

4. The Failure That Forced the Cross-Lane Gate

The second scar was inside the website that displays the archive.

The site used to ship a fallback copy of every record inside the JavaScript file that runs in the reader's browser (js/portal.js). The original purpose was harmless. Some readers download the repository and open the homepage by double-clicking it, which uses the file:// URL scheme. Some browsers refuse to load separate JSON files over file:// for security reasons, so the fallback let the page render anyway. Two small functions held the fallback. One returned a copy of the dataset for a record. The other returned a copy of the human-readable report.

The on-disk files kept changing. The inline copies inside the JavaScript did not. The drift grew quietly over weeks.

A maintainer wrote a small drift-checking script and ran it. It compared every on-disk record with its inline twin. It found 151 places where the two copies disagreed. The most striking example was a record about the Erdős reproduction whose schema_id field did not even share the same structure between its two copies. The AI maintainer had been editing the disk files. The website had been rendering the stale inline copies. Both sides looked fine internally. Neither side agreed with the other.
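The comparison such a drift check performs can be sketched in a few lines. This is not the maintainer's script. It is a minimal illustration that walks two JSON-like values and lists every path where they disagree, and the sample values are invented for the example.

```python
def find_mismatches(disk, inline, path="$"):
    """Recursively list the paths where two JSON-like values disagree."""
    if isinstance(disk, dict) and isinstance(inline, dict):
        out = []
        for key in sorted(set(disk) | set(inline)):
            child = f"{path}.{key}"
            if key not in disk or key not in inline:
                out.append(child)  # key present on one side only
            else:
                out.extend(find_mismatches(disk[key], inline[key], child))
        return out
    if isinstance(disk, list) and isinstance(inline, list):
        if len(disk) != len(inline):
            return [path]
        out = []
        for i, (a, b) in enumerate(zip(disk, inline)):
            out.extend(find_mismatches(a, b, f"{path}[{i}]"))
        return out
    # Scalars, or two values of different structure.
    return [] if disk == inline else [path]


# Invented sample: the same field holds an object on disk and a string inline.
on_disk = {"schema_id": {"name": "eqa", "version": 2}, "omega": 0.9985}
inline_copy = {"schema_id": "eqa-v1", "omega": 0.9985}

print(find_mismatches(on_disk, inline_copy))  # ['$.schema_id']
```

Reporting paths rather than a yes/no answer is what makes a count like 151 possible, and it tells the maintainer which copy to read first.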

A GOVERNANCE.md had said "single source of truth" the whole time. The maintainer agreed with the policy. The model also agreed. The policy lived in prose. Nothing in code enforced it.

Same shape as the EQA framing audit. A human caught the drift, not the MICA validator. What MICA does now is preserve the new rule in the package contract and surrounding docs. The rule, in plain English, is: no inline copy of any record may ship inside the browser code. The two functions that used to return the inline copies were stripped out and replaced with stubs that return empty values. The stubs carry the history inline, so a future maintainer reading the file sees both the rule and the incident that forced it.

function getFallbackReportText(runId) {
  // Removed (v1.13.1): inlined report-text fallback drifted from the on-disk .md
  // reports. Single source of truth = the on-disk files fetched above.
  return '';
}

function getFallbackDataset(runId) {
  // Removed (v1.13.1): inlined fallback datasets had drifted from the on-disk JSON
  // (151 schema/value mismatches found 2026-06-02 by check_fallback_drift.py).
  // Single source of truth = the on-disk evidence files fetched above. This ledger
  // must be served over HTTP (e.g. "python -m http.server"), not opened via file://.
  // Returns null so the inspector shows an honest load error rather than stale data.
  return null;
}

The fix lives in three places. The archive's machine-readable contract carries the lesson. The browser code returns empty values where the inline copies used to live. The playbook (the human-facing operating guide) explains why those values are empty.

A new maintainer joining the project sees the rule from all three angles. The validator confirms the package still loads as a coherent contract. The code refuses to render the old fallback because the function returns nothing. The playbook explains why a human should not put the fallback back in.

This is the lesson that produced the title of the article. A memory gate that lives only in markdown is etiquette. The gate becomes structural the moment the contract, the code, and the playbook all point at each other.

5. What the Playbook Actually Does

People ask why we ship a human-readable playbook (a long markdown file) if the contract is already a machine-readable file. The clearest answer is a short list of cheap failures the playbook actually prevented.

A maintainer who reads only the machine-readable contract sees one rule: math must use an arbitrary-precision library at 200 bits or higher. That is precise. It is also blunt. It does not say why. The first time the maintainer hits a math sub-case the contract did not specifically name, they may default to ordinary 64-bit floating-point arithmetic.

The playbook is where the original incident behind the rule lives. In our case, that incident was an early experiment in which 64-bit floats silently underflowed to zero in a class-field calculation and produced a meaningless result of 0. The playbook tells that story in plain English. A maintainer who has read it will not re-introduce the same bug in a new sub-case.
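The underflow itself is easy to reproduce. The sketch below uses exp(-800) as an arbitrary stand-in, since the original class-field calculation is not shown here, and it uses the standard library's decimal module in place of mpmath so that it runs without third-party packages.

```python
import math
from decimal import Decimal, getcontext

# About 200 bits of precision: 200 * log10(2) is roughly 60.2 decimal digits.
getcontext().prec = 61

x = -800
float_result = math.exp(x)         # below the 64-bit float range, becomes 0.0
precise_result = Decimal(x).exp()  # tiny, but still a nonzero number

print(float_result == 0.0)  # True
print(precise_result > 0)   # True
```

No exception is raised in the 64-bit case. The zero looks like a legitimate result, which is why the rule needs its origin story attached.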

An AI maintainer that starts work without loading the playbook will, when asked to fix a wrong score in a record, simply edit the JSON file that stores the score. The contract forbids this in one terse sentence. The playbook expands that sentence into a behavior rule. Never edit a record's data file after it has been committed.

Instead, create a new record with a new ID and link the corrected record from the original one. A session that loaded the playbook reads that rule before any edit happens. A session that did not load it destroys the audit trail that lets a third party re-run the original computation.
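One way to picture the "new record, never an edit" rule is below. The record fields, the IDs, and the idea of keeping the link in a separate append-only entry are assumptions made for this sketch, since the archive's actual linking mechanism is not shown here.

```python
import copy


def make_correction(original: dict, new_id: str, changes: dict):
    """Build a corrected record and a link entry without mutating the original."""
    corrected = copy.deepcopy(original)
    corrected.update(changes)
    corrected["id"] = new_id
    corrected["corrects"] = original["id"]  # hypothetical back-reference
    # Hypothetical link entry, meant for an append-only index kept apart
    # from the committed data file.
    link = {"from": original["id"], "superseded_by": new_id}
    return corrected, link


original = {"id": "REC-0001", "score": 0.41}  # made-up record
corrected, link = make_correction(original, "REC-0002", {"score": 0.14})

print(original)   # {'id': 'REC-0001', 'score': 0.41}
print(corrected)  # {'id': 'REC-0002', 'score': 0.14, 'corrects': 'REC-0001'}
print(link)       # {'from': 'REC-0001', 'superseded_by': 'REC-0002'}
```

Because the original dictionary is never touched, anyone holding the old ID can still re-run the original computation and compare it with the correction.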

The third example happens at render time. The website uses a small classifier (a regular expression) to decide what kind of colored label sits next to each metric. When the classifier does not recognize a metric name, it returns nothing, and the metric renders without any colored label.

The contract says what to do, in terse machine terms. The playbook documents the human procedure step by step. Add the new metric to the glossary, decide what kind of evidence backs it, assign the matching label, then merge.

Without the playbook, a record with a missing label might ship as if the missing label were on purpose.

The playbook is not the rule. The contract is. The playbook is the briefing for the maintainer about to face the rule, and the record of the specific past failure each rule was written to prevent.

6. Where MICA Sits, and What It Refuses

MICA is a small Python validator plus package format. In the workflow used here, the maintainer runs it at session start. The script reads the package contract first (a short YAML file). The contract names three other files.

The first is the archive's machine-readable rule list. The second is the human-readable playbook. The third is the credibility document that says what kinds of internal scores may or may not appear on the public surface. The validator confirms those layers exist and that the package shape is coherent.

After loading, the script runs 11 simple structural checks against the package. Each check catches one specific kind of cheap failure before write work proceeds.

The first group of checks refuses a half-formed package. The script asks whether the contract declares the required shape fields (mica_spec, mode, layers), whether the archive and playbook layers exist, and whether the mode/layer combination is coherent. The package is unusable until those fields line up.

The second group refuses drift between what the contract says and what the file system actually holds. The script asks whether every file the contract names exists on disk. A check here fails when a file was renamed in one place and not the other. This is the same shape of failure as the website's inline-fallback drift from the second scar, but caught much earlier.

The third group refuses critical rules that have no accountability behind them. Every critical archive rule is supposed to carry a short note naming the incident that forced the rule. The script asks whether binding.origin_episode is filled in for every critical rule.

A check here fails when a rule was written as a top-down policy with no recorded cost behind it. Rules like that are easy for a maintainer or an AI maintainer to rationalize past in the moment. A rule that names what was paid the last time it was missing is much harder to ignore.

The fourth group refuses stale package references. The script can check whether any declared binding.lesson_ref paths still resolve. A broken cross-reference is how a rule slowly becomes etiquette.

The sequence at session start ends in one of two statuses.

If every check passes, the script emits the status CLOSED CONTRACT. If a hard-fail check trips, it emits INCOMPLETE. In the workflow used here, the maintainer fixes that state before any code change happens.
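The four groups of checks and the two statuses can be sketched as follows. This is not mica_pct.py and it does not reproduce the 11 real checks. The names mica_spec, mode, layers, binding.origin_episode, and binding.lesson_ref come from the description above. Treating layers as a name-to-path mapping, the critical flag, and the file name playbook.md are assumptions. The mode/layer coherence check is left out because its rules are not described here.

```python
REQUIRED_FIELDS = ("mica_spec", "mode", "layers")


def validate_package(contract: dict, rules: list, existing_paths: set) -> tuple:
    """Return (status, failures) for a package described by plain dicts."""
    failures = []

    # Group 1: refuse a half-formed package.
    for field in REQUIRED_FIELDS:
        if field not in contract:
            failures.append(f"missing field: {field}")
    layers = contract.get("layers") or {}
    for name in ("archive", "playbook"):
        if name not in layers:
            failures.append(f"missing layer: {name}")

    # Group 2: every file the contract names must exist on disk.
    for name, path in layers.items():
        if path not in existing_paths:
            failures.append(f"layer file not on disk: {name} -> {path}")

    for rule in rules:
        binding = rule.get("binding") or {}
        # Group 3: critical rules must name the incident behind them.
        if rule.get("critical") and not binding.get("origin_episode"):
            failures.append(f"no origin_episode: {rule.get('id')}")
        # Group 4: declared lesson references must still resolve.
        ref = binding.get("lesson_ref")
        if ref and ref not in existing_paths:
            failures.append(f"stale lesson_ref: {rule.get('id')} -> {ref}")

    status = "CLOSED CONTRACT" if not failures else "INCOMPLETE"
    return status, failures


contract = {
    "mica_spec": "example",
    "mode": "example",
    "layers": {"archive": "archive.json", "playbook": "playbook.md"},
}
rules = [{
    "id": "DI-EQA-001",
    "critical": True,
    "binding": {
        "origin_episode": "64-bit floats underflowed to zero",
        "lesson_ref": "playbook.md",
    },
}]
on_disk = {"mica.yaml", "archive.json", "playbook.md"}

print(validate_package(contract, rules, on_disk)[0])  # CLOSED CONTRACT

# Rename the playbook in one place and not the other.
renamed = (on_disk - {"playbook.md"}) | {"PLAYBOOK.md"}
print(validate_package(contract, rules, renamed)[0])  # INCOMPLETE
```

The sketch takes a set of paths instead of touching the file system, which keeps it testable. A real validator would read the package root from disk.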

This is what we mean by a gate that is meant to run before any code is touched. It is not just a policy hope. It is a small Python validator with explicit hard-fail conditions.

The 28 archive rules do the same thing at the per-record level. We did not aim for 28. The number grew as incidents forced new rules. Every rule carries a short note pointing at the incident that produced it. The list is not a top-down policy. It is an accumulated record of past failures the team agreed not to repeat.

7. One Bad BAV Card, Step by Step

Note before the walk-through. This scenario is a constructed illustration, not a documented incident. The protein-folding cards on the archive today are all well-formed. The point of stepping through it is to show the refusal sequence at the granularity a peer reviewer can check, not to claim that a refusal of this exact shape has been logged in production.

To make the gate concrete, here is one fabrication the contract refuses.

Step 1. An AI maintainer is asked to add a new protein-folding card. There is no real protein sequence on disk to fold, but the model knows the file format the lane expects. It writes a record at bav/exp-035/reference_run.json that looks like a real fold result. The file carries pTM = 0.78, pLDDT_mean = 84.2, PAE = 4.3 Å. These numbers fall inside the same range as the only real fold on the archive (EXP-031), so they pass a casual eye-test.

Without the gate, the rest follows naturally.

Step 2. The website's small classifier reads the metric name pLDDT_mean. It matches the pattern for an externally-defined fold metric (AlphaFold defines pLDDT, so the website treats anything named that way as borrowed from outside, and therefore checkable by a third party). The card renders with a green "verifiable" badge. The classifier is just a short regular expression. Here is what it does:

function provClassOf(label) {
  const s = String(label == null ? '' : label).toLowerCase();
  if (/(plddt|\bpae\b|ptm|contact|brier|\bauc\b|\bece\b)/.test(s)) return 'EXTERNAL';
  if (/(p_e2e|e2e|capture|transfer)/.test(s)) return 'DERIVED';
  if (/(sr9|di2|sidrce|coherence|spar|nnsl|resonance|drift|omega|ω)/.test(s)) return 'ADVISORY-HEURISTIC';
  return null;  // incidental values (counts, dates, grades) carry no badge
}

The label pLDDT_mean contains the string plddt, so the first pattern matches. The function returns EXTERNAL. The badge turns green. The regular expression has no way to check whether the number behind the label came from a real fold or from a fabrication.
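To make the blind spot easy to test outside the browser, here is a Python port of the same three patterns. The port is an illustration and is not part of the site. It shows that the classifier sees only the label, never the value or its provenance.

```python
import re
from typing import Optional

# The same three patterns as provClassOf, ported to Python for illustration.
EXTERNAL = re.compile(r"(plddt|\bpae\b|ptm|contact|brier|\bauc\b|\bece\b)")
DERIVED = re.compile(r"(p_e2e|e2e|capture|transfer)")
ADVISORY = re.compile(
    r"(sr9|di2|sidrce|coherence|spar|nnsl|resonance|drift|omega|ω)"
)


def prov_class_of(label) -> Optional[str]:
    """Classify a metric by its name alone; the value is never inspected."""
    s = "" if label is None else str(label).lower()
    if EXTERNAL.search(s):
        return "EXTERNAL"
    if DERIVED.search(s):
        return "DERIVED"
    if ADVISORY.search(s):
        return "ADVISORY-HEURISTIC"
    return None  # incidental values carry no badge


print(prov_class_of("pLDDT_mean"))    # EXTERNAL, whether the number is real or not
print(prov_class_of("Omega"))         # ADVISORY-HEURISTIC
print(prov_class_of("record_count"))  # None
```

The function has no parameter for the metric's value, so no amount of tuning the patterns can make it detect a fabricated number.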

Step 3. A reader trusts the green badge. The value gets cited in a manuscript. A wet lab spends real money chasing a fold that was never run.

With the gate, the chain breaks at step 1.

The archive carries a rule that says a fold card can only claim re-runnable status if it ships two specific files alongside the result: a .fasta file containing the protein sequence that was folded, and a small JSON file naming the model version and the random seed used.

In this repository, that rule lives in the contract and in the surrounding spec, and the maintainer is expected to check it before publishing the card. If bav/exp-035 had no real input sequence, it could not honestly ship as a re-runnable fold. At most it would ship as non-re-runnable or stay unpublished. The reader would see the honest label.
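The two-file rule can be expressed as a small check. In this repository the check is a maintainer step, so the sketch below is a hypothetical automation of it. The file names input.fasta and run_config.json, the keys model_version and seed, and the placeholder sequence are assumptions.

```python
def rerun_status(files: dict) -> str:
    """Classify a fold card from a mapping of file name to parsed content."""
    # A non-empty .fasta file must carry the folded sequence.
    has_fasta = any(
        name.endswith(".fasta") and str(body).strip()
        for name, body in files.items()
    )
    # A JSON file must name the model version and the random seed.
    has_run_config = any(
        name.endswith(".json")
        and isinstance(body, dict)
        and bool(body.get("model_version"))
        and body.get("seed") is not None  # a seed of 0 is still a seed
        for name, body in files.items()
    )
    return "re-runnable" if has_fasta and has_run_config else "non-re-runnable"


# The fabricated card from step 1: a result file and nothing else.
fabricated = {"reference_run.json": {"pTM": 0.78, "pLDDT_mean": 84.2}}

# A card with both required files; all contents are placeholders.
complete = {
    "reference_run.json": {"pTM": 0.0},
    "input.fasta": ">placeholder\nAAAA\n",
    "run_config.json": {"model_version": "placeholder", "seed": 0},
}

print(rerun_status(fabricated))  # non-re-runnable
print(rerun_status(complete))    # re-runnable
```

A check like this confirms only that the files exist and are filled in. It cannot tell whether the sequence in the .fasta file is the one that was actually folded.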

A standalone GOVERNANCE.md would not have stopped step 1. A YAML config without a validator would not have noticed the missing input file. An agent system prompt would have been compressed away under context pressure. A CI check would have run too late, after the fabrication was already on the public surface.

The contract, the playbook, and the validator reduce the chance of that chain because all three point at each other, and because the workflow checks the contract before the file is allowed to settle into the public archive.

8. What This Pipeline Cannot Block

A pipeline that pretends to catch everything is the failure mode it was built to prevent.

Five things still slip past every layer above.

A plausible fabricated value inside the normal range. A fake pLDDT of 78.4 looks like a real one. The website's classifier labels it as externally-defined and the green badge appears. Only a third party re-running the fold catches the fabrication. This is why only EXP-031 ships the full re-run scaffold, and the other five active BAV cards do not claim independent re-runnability.

A new promotional pattern outside the word list. A small filter watches the public pages for 14 superlative terms such as revolutionary and breakthrough. A maintainer who writes something like "a novel adaptive coherence framework" evades every entry on the list. The list is a floor, not a ceiling.

A fabrication marker silently removed. A separate filter looks for a literal [synthetic] tag in shipped files, the kind of marker a developer might leave on placeholder data. The filter only fires when the tag is present. A maintainer who deletes the tag while keeping the fabricated content underneath passes the filter cleanly.

A real DOI pointing at the wrong paper. Nothing in the pipeline fetches DOIs. A real URL pointing to a real but unrelated paper is invisible to the validator. Peer review is the only check.

A correct computation framed as the wrong thing. This was the failure that produced the framing rule in the math lane. The engine outputs were real. The headline treated the mere presence of a report file as a fresh PASS. The fix was structural, but the same shape of error can reappear in any new lane.
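Two of the gaps above, the word list and the fabrication marker, are simple enough to demonstrate. The sketch below is not the archive's filter code. It includes only the two superlative terms named above, because the other twelve are not listed here, and the function names are hypothetical.

```python
import re

# Only the two terms named in the text; the remaining twelve are not reproduced.
SUPERLATIVES = ("revolutionary", "breakthrough")
SYNTHETIC_TAG = "[synthetic]"


def flag_superlatives(text: str) -> list:
    """Return the listed terms that appear as whole words in the text."""
    lowered = text.lower()
    return [
        term for term in SUPERLATIVES
        if re.search(rf"\b{re.escape(term)}\b", lowered)
    ]


def has_synthetic_marker(text: str) -> bool:
    """True only when the literal marker is present."""
    return SYNTHETIC_TAG in text


print(flag_superlatives("A revolutionary result"))                # ['revolutionary']
print(flag_superlatives("a novel adaptive coherence framework"))  # []
print(has_synthetic_marker("[synthetic] pLDDT_mean = 84.2"))      # True
print(has_synthetic_marker("pLDDT_mean = 84.2"))                  # False
```

Both filters match literal strings, so both fail the same way: the content survives as soon as the matched string is avoided or removed.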

The honest claim is narrow. MICA makes cheap slop expensive enough to catch. It does not make expensive slop catchable.

9. What We Learned, What We Did Not Solve

The archive is small by industry standards. 56 math records. 34 biomolecular-validation experiments (6 active cards and a 26-entry foundational archive). 2 compliance audits. Around 90 experiments. Past 300 files in total.

That scale was large enough to teach us four things.

First, a markdown policy alone does not survive an AI maintainer. The 151-mismatch drift proved it. The policy was correct. Nothing in code enforced it.

Second, the rule list is where the policy actually lives. The playbook is the human reading layer. The validator is the structural gate. The workflow is the enforcement surface. Together, they form one operating contract.

Third, the gate works best when it runs before any code is touched. PR-time checks are necessary, but not sufficient. By the time cheap slop reaches the PR surface, the maintainer is already reviewing content that should have been constrained earlier.

Fourth, the pipeline only refuses cheap slop. It does not verify molecules, fold real proteins, or check that a DOI links to the paper it claims to. That work stays external. The pipeline buys reviewer time so the reviewer can do that external work on the few claims that genuinely need it.

What we did not solve.

The website's metric classifier still misclassifies on a typo. A maintainer who writes pLDT instead of pLDDT ships a card with no colored badge at all. Nothing automated catches it. Reading the PR diff before merge is the only safety net.

The fabrication-marker filter is bypassable. Anyone who knows the [synthetic] tag is there can delete it, and the underlying content goes through.

DOIs are not fetched. A real URL to a real but unrelated paper passes every layer.

And the article has not shown a logged production refusal by the MICA validator itself. The validator's refusal logic is exercised by a few test fixtures inside the MICA repository (small example packages deliberately broken in specific ways).

The fixtures prove the mechanism works as designed. They do not prove that the gate has fired in production on this archive yet. The two real incidents in this article were both caught by human attention. The framing drift in the math lane was caught by a one-time read.

The website's fallback drift was caught by a small script a maintainer ran. The rule list records both lessons. The next time either shape returns, the contract now names the failure pattern and creates a refusal point where automation, workflow, or human review can tighten around it. We do not yet have a refusal log entry to point at.

The pattern across these gaps is the same. Cheap slop is refused upstream. Expensive slop is left for the maintainer's reading and for peer review.

The maintainer has finite attention. Every minute spent catching a fabricated pTM = 0.78 is a minute not spent reading the molecule, the protocol, or the citation that actually needs human judgment.

Session-start refusal exists to move the cheap failures upstream so the saved attention can land on the expensive ones. The contract does not pretend to verify the world. It frees the maintainer to verify the parts that matter most.

This article was the practical side of what Parts 6 and 7 set up as a session-start contract. The next part of the MICA series will return to the framework side.

The reproduction handle for the strongest record on the ledger is short:

git clone https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction
cd openai-erdos-eq22-reproduction
python -m pytest

Treat any number on the ledger as a number to verify, not a number to trust.
