<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kwansub Yun</title>
    <description>The latest articles on DEV Community by Kwansub Yun (@flamehaven01).</description>
    <link>https://dev.to/flamehaven01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508506%2Fe2f9bc29-10d2-41ec-8e77-19b8b5cfd9e9.jpg</url>
      <title>DEV Community: Kwansub Yun</title>
      <link>https://dev.to/flamehaven01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flamehaven01"/>
    <language>en</language>
    <item>
      <title>Making Equation (2.2) of the OpenAI Erdős Result Executable</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 26 May 2026 06:37:10 +0000</pubDate>
      <link>https://dev.to/flamehaven01/making-equation-22-of-the-openai-erdos-result-executable-ml7</link>
      <guid>https://dev.to/flamehaven01/making-equation-22-of-the-openai-erdos-result-executable-ml7</guid>
      <description>&lt;h2&gt;
  
  
  Why a proved theorem still needs reproducible claim custody
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2108443327152872531" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2108443327152872531" alt="open ai" width="900" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On May 20, 2026, &lt;a href="https://openai.com/index/model-disproves-discrete-geometry-conjecture/" rel="noopener noreferrer"&gt;OpenAI announced&lt;/a&gt; that an internal reasoning model had produced a counterexample to the Erdős planar unit-distance conjecture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem is easy to state: given $n$ points in the plane, how many pairs of points can be exactly distance $1$ apart?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For nearly eighty years, the prevailing expectation was that square-grid-type constructions were essentially optimal up to a slowly growing exponent. OpenAI’s announcement changed that. Its internal reasoning model produced an infinite family of examples giving a polynomial improvement, and the proof was checked and written up in mathematical form by external mathematicians.&lt;/p&gt;

&lt;p&gt;In this article, “the remarks paper” refers to the companion PDF by Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, and Matchett Wood, linked from OpenAI’s announcement.&lt;/p&gt;

&lt;p&gt;The proof-level result belongs to those authors and the source papers.&lt;/p&gt;

&lt;p&gt;My focus here is narrower: equation (2.2) in that remarks paper, and whether its explicit numerical value can be reproduced as executable code.&lt;/p&gt;

&lt;p&gt;This is not about proving the theorem again. It is about what happens when a proved result carries a fragile numerical claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  The proof is not the artifact
&lt;/h2&gt;

&lt;p&gt;A mathematical proof and a software artifact do different jobs.&lt;/p&gt;

&lt;p&gt;The proof establishes the theorem. It gives the definitions, the argument, the dependencies, and the mathematical reason why the result holds.&lt;/p&gt;

&lt;p&gt;A software artifact should not pretend to replace that.&lt;/p&gt;

&lt;p&gt;But some claims inside a mathematical paper have a finite, numerical, or computationally checkable surface. Those claims can be preserved differently. They can be run. They can be tested. They can fail when precision is wrong.&lt;/p&gt;

&lt;p&gt;That is the narrow role of an executable reproduction artifact: not proof replacement, not automated peer review, and not authority over the theorem, but a reproducible object for the part of the claim that can be computed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The specific target: equation (2.2)
&lt;/h2&gt;

&lt;p&gt;In the OpenAI Erdős result, one checkable surface is equation (2.2) of the remarks paper.&lt;/p&gt;

&lt;p&gt;For the explicit choice&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7138879423288234316" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7138879423288234316" alt="math1" width="606" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the remarks paper gives an explicit numerical lower bound on the exponent excess above the classical Erdős exponent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13849924454096937923" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13849924454096937923" alt="math2" width="841" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These parameters are taken directly from the remarks paper without modification. The artifact does not derive the multiquadratic choice; it reproduces the finite numerical calculation built from that choice.&lt;/p&gt;

&lt;p&gt;This is not the later stronger explicit bound associated with Sawin’s separate preprint. It is not $\delta \approx 0.014$. It is the numerical value appearing in equation (2.2) of the remarks paper.&lt;/p&gt;

&lt;p&gt;That narrowness is important. It is exactly what makes the claim suitable for executable reproduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the numerical fragility comes from
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D4133600104991436468" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D4133600104991436468" alt="4" width="900" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The numerical fragility comes from the exact form of equation (2.2), not from a large computation.&lt;/p&gt;

&lt;p&gt;The parameters, stated immediately after the published expression, are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7110299839676694670" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7110299839676694670" alt="math3" width="754" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D483384573840666881" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D483384573840666881" alt="math 4" width="772" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the paper’s definitions of $u$, $v$, and $\delta$ substituted into equation (2.2), the exponent excess reduces to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2715587953765822422" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2715587953765822422" alt="math5" width="752" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The constant $36$ is not introduced by the implementation. It is already present in the remarks paper’s equation (2.2), both in the numerator term $u\pi/(36v)$ and in the denominator term $\log(36/\delta^2)$.&lt;/p&gt;

&lt;p&gt;After substituting $u = K/r^2$, $v = r/2$, and $\delta = 101^{-2K}$, the numerator simplifies to $\log(K\pi / (18r^3))$, while the denominator becomes $\log 36 + 4K \log 101$.&lt;/p&gt;
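
&lt;p&gt;Spelled out, the two substitutions are short enough to check by hand:&lt;/p&gt;

&lt;p&gt;$\dfrac{u\pi}{36v} = \dfrac{(K/r^2)\,\pi}{36 \cdot (r/2)} = \dfrac{K\pi}{18r^3}$&lt;/p&gt;

&lt;p&gt;$\log\dfrac{36}{\delta^2} = \log 36 - 2\log\delta = \log 36 - 2\log\left(101^{-2K}\right) = \log 36 + 4K\log 101$&lt;/p&gt;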

&lt;p&gt;Here the $101$ comes from the finite prime in $S = \{101, \infty\}$.&lt;/p&gt;

&lt;p&gt;In other words, this artifact does not derive the constant $36$ from first principles; it reproduces the published equation with the stated substitutions.&lt;/p&gt;

&lt;p&gt;The precision problem is in the numerator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D11575553626952662327" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D11575553626952662327" alt="math 7" width="254" height="53"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because $K$ is the ceiling of $18r^3 / \pi$, the ratio $K\pi / (18r^3)$ is only barely larger than $1$.&lt;/p&gt;

&lt;p&gt;More precisely:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D10827487014404388139" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D10827487014404388139" alt="math8" width="339" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For $r = 510510$,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13091608971449808775" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13091608971449808775" alt="math 9" width="255" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the numerator is effectively $\log(1 + \varepsilon)$ with $\varepsilon$ at the $10^{-18}$ scale.&lt;/p&gt;

&lt;p&gt;IEEE 754 double precision has machine epsilon around $2.2 \times 10^{-16}$. A naive &lt;code&gt;float64&lt;/code&gt; computation therefore cannot reliably distinguish the near-one ratio from $1$. The ratio rounds to $1$, leading to $\log(1) = 0$.&lt;/p&gt;
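
&lt;p&gt;The scale gap can be checked in plain Python. Because $K$ is a ceiling, the ratio exceeds $1$ by less than $\pi / (18r^3)$. The sketch below only evaluates that bound for $r = 510510$ and compares it with the &lt;code&gt;float64&lt;/code&gt; machine epsilon. The variable names are illustrative and are not taken from the repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
import sys

r = 510510

# Since K = ceil(18 r^3 / pi), the ratio K*pi / (18 r^3) exceeds 1
# by strictly less than pi / (18 r^3).
epsilon_bound = math.pi / (18 * r**3)

# Spacing of float64 values just above 1.0 (about 2.2e-16).
machine_epsilon = sys.float_info.epsilon

print(f"bound on the excess over 1: {epsilon_bound:.3e}")
print(f"float64 machine epsilon:    {machine_epsilon:.3e}")

# Adding the bound to 1.0 changes nothing in float64.
absorbed = (1.0 + epsilon_bound) == 1.0
print(f"1.0 + bound == 1.0 in float64: {absorbed}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The bound sits roughly two orders of magnitude below the spacing of doubles near $1$, which is why the addition is absorbed.&lt;/p&gt;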

&lt;p&gt;The exponent excess disappears before the computation reaches the value stated in the paper.&lt;/p&gt;

&lt;p&gt;This is not a flaw in the mathematics. It is a precision failure in the numerical evaluation of a valid expression. That is the reason the artifact evaluates equation (2.2) using &lt;code&gt;mpmath&lt;/code&gt; at 200-bit precision.&lt;/p&gt;
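
&lt;p&gt;The contrast can be sketched without the repository. The block below is a stand-in, not the artifact’s code: it uses the standard-library &lt;code&gt;decimal&lt;/code&gt; module at 60 significant digits in place of &lt;code&gt;mpmath&lt;/code&gt;, hardcodes $\pi$ to 50 decimal places, and takes $r = 510510$, $K = \lceil 18r^3/\pi \rceil$, and the reduced form of equation (2.2) from the text above. The function names and the precision setting are assumptions made for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
from decimal import Decimal, ROUND_CEILING, getcontext

getcontext().prec = 60  # significant digits; an assumption, not the artifact's setting

# pi to 50 decimal places, hardcoded because decimal has no pi constant.
PI = Decimal("3.14159265358979323846264338327950288419716939937510")

R = 510510
BASE = 18 * R**3  # exact integer 18 r^3


def ceiling_k():
    """K = ceil(18 r^3 / pi), evaluated in high precision."""
    return int((Decimal(BASE) / PI).to_integral_value(rounding=ROUND_CEILING))


def excess_decimal(k):
    """log(K pi / (18 r^3)) / (log 36 + 4 K log 101) in decimal arithmetic."""
    numerator = (Decimal(k) * PI / Decimal(BASE)).ln()
    denominator = Decimal(36).ln() + 4 * Decimal(k) * Decimal(101).ln()
    return numerator / denominator


def excess_float64(k):
    """The same expression in ordinary double precision."""
    numerator = math.log(k * math.pi / BASE)
    denominator = math.log(36) + 4 * k * math.log(101)
    return numerator / denominator


K = ceiling_k()
print(f"decimal, 60 digits: {excess_decimal(K):.4E}")
print(f"float64:            {excess_float64(K):.4E}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In &lt;code&gt;float64&lt;/code&gt; the ratio is resolved only to about $10^{-16}$, so the second line comes out as zero or as rounding noise rather than a value at the $10^{-38}$ scale. The high-precision line stays strictly positive.&lt;/p&gt;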

&lt;p&gt;A PDF can state the value. A verifier can expose when the value disappears.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9321543817991300315" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9321543817991300315" alt="last" width="900" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We built:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction" rel="noopener noreferrer"&gt;https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The purpose is deliberately narrow: reproduce the finite, explicitly checkable numerical surface of equation (2.2) in the OpenAI Erdős unit-distance disproof remarks.&lt;/p&gt;

&lt;p&gt;The package evaluates the expression using &lt;code&gt;mpmath&lt;/code&gt; at 200-bit precision and returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6.2391e-38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matches the published three-significant-figure value $\approx 6.24 \times 10^{-38}$ to $1.4 \times 10^{-4}$ relative error.&lt;/p&gt;

&lt;p&gt;The repository includes 60 unit tests, 21 verifier checks, a frozen per-source-file SHA-256 manifest, GitHub Actions CI across Ubuntu and Windows, Python 3.11 / 3.12 verification, and a frozen-report mode that prints a verdict without mutating tracked evidence.&lt;/p&gt;
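
&lt;p&gt;The manifest idea can be sketched in a few lines. This illustrates the general technique, not the repository’s verifier: the manifest layout (a mapping from relative path to hex digest) and the function names are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, relative_paths):
    """Freeze a {relative path: digest} mapping for the given files."""
    return {rel: sha256_of(Path(root) / rel) for rel in relative_paths}


def drifted_files(root, manifest):
    """Return the paths whose current digest differs from the frozen one."""
    drifted = []
    for rel, frozen in sorted(manifest.items()):
        target = Path(root) / rel
        if not target.is_file() or sha256_of(target) != frozen:
            drifted.append(rel)
    return drifted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A verifier built this way reports an empty list while the sources are unchanged and names the affected file as soon as one byte differs.&lt;/p&gt;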

&lt;p&gt;The basic reproduction path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction
&lt;span class="nb"&gt;cd &lt;/span&gt;openai-erdos-eq22-reproduction
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; erdos_ant.verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Verdict: PASS
Checks: 21/21 passed
eq (2.2) exponent excess: 6.2391e-38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a large system. That is part of the point. A small claim with a clear boundary is easier to inspect than a broad claim that blurs proof, computation, and interpretation.&lt;/p&gt;




&lt;h2&gt;
  
  
  From reproduction to custody
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9085427059880693022" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9085427059880693022" alt="2" width="900" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This repository was not built as a one-off reaction to an OpenAI announcement. We are not announcing a grand framework here; we are showing the discipline in miniature.&lt;/p&gt;

&lt;p&gt;For us, the work is part of a longer routine: take a mathematical or technical claim, isolate the checkable surface, pin the environment, and make drift visible.&lt;/p&gt;

&lt;p&gt;That is intentionally plain work.&lt;/p&gt;

&lt;p&gt;Read the source.&lt;/p&gt;

&lt;p&gt;Extract the claim.&lt;/p&gt;

&lt;p&gt;Reproduce the computation.&lt;/p&gt;

&lt;p&gt;Record the boundary.&lt;/p&gt;

&lt;p&gt;Let the verifier fail if the result disappears.&lt;/p&gt;

&lt;p&gt;To execute this routine reliably, the scope must be uncomfortably narrow. This repository intentionally leaves the proof of Theorem 1.1, the construction of the infinite tower, and Sawin’s separate $\delta \approx 0.014$ preprint to their respective sources. It does not pretend to be peer review.&lt;/p&gt;

&lt;p&gt;This is not just a disclaimer. It is the point of the artifact.&lt;/p&gt;

&lt;p&gt;A sharp, restricted boundary is exactly what makes a claim inspectable, repeatable, and challengeable. This is what I mean here by claim custody.&lt;/p&gt;

&lt;p&gt;It addresses a technical governance question, but not in the policy sense: what exactly is being trusted, from which source, and what makes the claim fail if the implementation changes?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A PDF can state the value. A verifier can expose when the value disappears.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We claim no authority over the broader theorem. We simply maintain a reproducible boundary around the fragile numerical claim inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9984717360298612367" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9984717360298612367" alt="repo" width="900" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The theorem was proved in the mathematical papers.&lt;/p&gt;

&lt;p&gt;This repository asks a smaller question: can the numerical value in equation (2.2) survive execution?&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;float64&lt;/code&gt;, it does not. The exponent excess collapses to zero.&lt;/p&gt;

&lt;p&gt;At 200-bit precision, with the source parameters pinned and the verifier running under CI, the artifact recovers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6.2391e-38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;matching the published value to $1.4 \times 10^{-4}$ relative error.&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;p&gt;Not a new theorem. Not a proof replacement.&lt;/p&gt;

&lt;p&gt;A reproducible claim surface for one precision-sensitive number in a major AI-assisted mathematical result.&lt;/p&gt;

&lt;p&gt;Repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction" rel="noopener noreferrer"&gt;https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper / Zenodo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doi.org/10.5281/zenodo.20383217" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.20383217&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mathematics</category>
      <category>python</category>
      <category>openscience</category>
      <category>openai</category>
    </item>
    <item>
      <title>The README Was a Protocol. The Entrypoint Was Still Optional.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 21 May 2026 10:34:02 +0000</pubDate>
      <link>https://dev.to/flamehaven01/the-readme-was-a-protocol-the-entrypoint-was-still-optional-57hj</link>
      <guid>https://dev.to/flamehaven01/the-readme-was-a-protocol-the-entrypoint-was-still-optional-57hj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3k7jz1voscq51d9kuu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3k7jz1voscq51d9kuu7.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation Hierarchy&lt;/strong&gt;: The operational ladder — &lt;code&gt;natural&lt;/code&gt;, &lt;code&gt;guided&lt;/code&gt;, &lt;code&gt;forced&lt;/code&gt; — that determines how MICA actually reaches a live session.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Activation Packet&lt;/strong&gt;: The compiled session-start object that declares read targets, load state, self-test posture, drift status, and gate outcome.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report&lt;/strong&gt;: The structured opening output that declares what was loaded, what the self-test found, and whether the session gate is open.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;README-as-Protocol&lt;/strong&gt;: The pattern where the model's natural tendency to read the README first is formalized as a declared invocation mechanism. Introduced in v0.1.8.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Where Part 6 Left Off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7"&gt;Part 6&lt;/a&gt; showed what MICA looks like inside a single maintenance agent — session report, drift detection, design invariants, deviation log. The structure held. The protocol ran.&lt;/p&gt;

&lt;p&gt;Part 6 ended with a harder question: &lt;strong&gt;what happens when accumulated session knowledge needs to govern the next session — inside a tool that runs within AI workflows itself?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer depends on a prior question: does the next session actually load what was accumulated?&lt;/p&gt;

&lt;p&gt;That is not a schema problem. It is an entrypoint problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Gap README-as-Protocol Left Open
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg14swus4yof9adum02mc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg14swus4yof9adum02mc.png" alt="The Entrypoint Gap" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9"&gt;Part 4&lt;/a&gt; made a specific assumption: in many repository-based AI workflows, the README is already the model's first orientation surface.&lt;/p&gt;

&lt;p&gt;That observation became README-as-Protocol.&lt;/p&gt;

&lt;p&gt;Instead of inventing a new installation mechanism, MICA formalized an existing behavior: the model reads the README, the README points to the archive, and the session is expected to load context, run checks, and report readiness before work begins.&lt;/p&gt;

&lt;p&gt;That assumption was useful.&lt;/p&gt;

&lt;p&gt;It gave MICA a path into the session without requiring plugins, services, or custom host infrastructure.&lt;/p&gt;

&lt;p&gt;But a protocol is not an entrypoint.&lt;/p&gt;

&lt;p&gt;The README can declare where the archive is, what invariants matter, what the session report must contain. None of that guarantees sequencing. A model can still skim the README, jump directly into code, or begin work before declaring its load state.&lt;/p&gt;

&lt;p&gt;A gate without a consequence is still only etiquette.&lt;/p&gt;

&lt;p&gt;That is the gap this version had to close.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Answer: An Invocation Hierarchy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mlovwluvuidh83cfzb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mlovwluvuidh83cfzb9.png" alt="The Activation Spectrum" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MICA does not auto-invoke by magic. If no human, host, wrapper, or launcher calls the memory contract, the archive can exist without governing anything. This is the same truth Part 2 identified: the structure can exist, and the model can still have no reliable way to know it exists.&lt;/p&gt;

&lt;p&gt;The answer is an explicit hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural&lt;/strong&gt; — the model reads the project surface voluntarily: README, &lt;code&gt;mica.yaml&lt;/code&gt;, archive JSON, playbook. No intervention required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guided&lt;/strong&gt; — a host agent requests the activation packet before work begins. The packet declares read targets, self-test posture, drift state, and gate outcome. The host uses it to preflight the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced&lt;/strong&gt; — a launcher blocks repository work until the session report clears. This is the strongest path and the least elegant one. It is also the one that survives noisy real-world terminal workflows.&lt;/p&gt;
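
&lt;p&gt;One way to picture the difference between the three modes is a small gate function. This is a hypothetical sketch, not MICA’s code: the enum, the function names, and the gate rule (self-test contract closed and no drift reported) are assumptions, and the report keys simply mirror the session-report fields shown in this article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from enum import Enum


class InvocationMode(Enum):
    NATURAL = "natural"
    GUIDED = "guided"
    FORCED = "forced"


def gate_is_open(report):
    """Hypothetical gate rule: self-test contract closed and no drift."""
    self_test = report.get("self_test", {})
    drift = report.get("drift_status", {})
    return bool(self_test.get("closed_contract")) and drift.get("status") == "NO_DRIFT"


def may_start_work(mode, report):
    """Only the forced mode turns a failed gate into a hard stop."""
    if mode is InvocationMode.FORCED:
        return gate_is_open(report)
    # Natural and guided modes surface the report but do not block.
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under this reading, natural and guided modes make the session state visible, while only the forced mode attaches a consequence to it.&lt;/p&gt;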




&lt;h2&gt;
  
  
  4. What Changed in Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1ip7e5d3uompdkvdxdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1ip7e5d3uompdkvdxdh.png" alt="The output mechanism" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three concrete moves made this operational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session report became a real runtime output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The opening report is now a compiled object — not a protocol expectation, not a prose description. A host can consume it directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation is now compiled, not described.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mica_invoke.py&lt;/code&gt; compiles read targets and session report into one activation packet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;packet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entry_strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_targets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;_layer_targets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_root&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the shift from documentation-first startup to packet-first startup. The host no longer has to infer the sequence from prose.&lt;/p&gt;

&lt;p&gt;In guided mode, the output is already shaped for host consumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guided"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entry_strategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guided"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"read_targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"readme"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mica_yaml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"archive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playbook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lessons"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_report"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"archive_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.7.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"self_test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CLOSED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"closed_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"drift_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO_DRIFT"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"directive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Host agent should load declared MICA surfaces first and use the session report as opening state."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Forced mode now has a consequence.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;packet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCKED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
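
&lt;p&gt;The check above can be isolated into a small function that is easy to test. This is a minimal sketch, not the actual &lt;code&gt;mica_invoke.py&lt;/code&gt; code: &lt;code&gt;gate_exit_code&lt;/code&gt; is a hypothetical name, and the packet is hand-built for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def gate_exit_code(mode, packet):
    """Return 1 only when forced mode meets a BLOCKED gate; otherwise 0."""
    # Hypothetical helper: tolerate a missing session_report instead of raising.
    gate = packet.get("session_report", {}).get("gate")
    if mode == "forced" and gate == "BLOCKED":
        return 1
    return 0


# Hand-built packet for illustration; a real launcher would compile it from the archive.
blocked = {"session_report": {"gate": "BLOCKED"}}
print(gate_exit_code("guided", blocked))  # prints 0: guided mode reports but does not block
print(gate_exit_code("forced", blocked))  # prints 1: forced mode refuses to proceed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning the exit code instead of calling &lt;code&gt;sys.exit&lt;/code&gt; directly keeps the decision testable; the launcher can pass the result to &lt;code&gt;sys.exit&lt;/code&gt; at its outermost layer.&lt;/p&gt;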



&lt;p&gt;The simplest entry surface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;@echo &lt;span class="na"&gt;off&lt;/span&gt;
&lt;span class="kd"&gt;python&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="vm"&gt;%~dp0&lt;/span&gt;&lt;span class="s2"&gt;tools\mica_invoke.py"&lt;/span&gt; &lt;span class="err"&gt;%&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That wrapper gives MICA an enforceable terminal entrypoint instead of relying on good behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. STEM-BIO-AI: The Cleaner Case
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7xfu8e4n06e08j08p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7xfu8e4n06e08j08p1.png" alt="Dependency Shift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;STEM-BIO-AI&lt;/code&gt; already had a mature MICA memory layer — archive, playbook, lessons, invocation protocol, drift profile. What changed was not the memory model. It was how that model becomes operative before work begins.&lt;/p&gt;

&lt;p&gt;That difference is visible across all three invocation modes.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;natural&lt;/code&gt; mode, the helper preserves the README-first path and makes the expected read order explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MICA INVOKE] mode=natural
Gate       : PASS
State      : INVOCATION_MODE
PCT        : CLOSED
...
Directive: Prefer reading README first, then load mica.yaml, archive, and playbook before scan work.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;guided&lt;/code&gt; mode, the same startup becomes a host-consumable packet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guided"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"read_targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"readme"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mica_yaml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"archive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playbook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lessons"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_report"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"archive_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.7.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"self_test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CLOSED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"closed_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"drift_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO_DRIFT"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
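
&lt;p&gt;A host-side preflight over that packet might look like the sketch below. It is illustrative only: &lt;code&gt;preflight&lt;/code&gt; and its return shape are hypothetical, and the three conditions it checks are an assumption about what a host would care about. Only the packet fields mirror the example above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

# Packet copied from the guided-mode example above.
GUIDED_PACKET = json.loads("""
{
  "mode": "guided",
  "read_targets": ["readme", "mica_yaml", "archive", "playbook", "lessons"],
  "session_report": {
    "archive_version": "1.7.8",
    "self_test": {"pct": "CLOSED", "closed_contract": true},
    "drift_status": {"status": "NO_DRIFT"},
    "gate": "PASS"
  }
}
""")


def preflight(packet):
    """Return (ok, reasons) for a guided packet without raising on missing keys."""
    report = packet.get("session_report", {})
    reasons = []
    if report.get("gate") != "PASS":
        reasons.append("gate is not PASS")
    if report.get("self_test", {}).get("pct") != "CLOSED":
        reasons.append("self-test PCT is not CLOSED")
    if report.get("drift_status", {}).get("status") != "NO_DRIFT":
        reasons.append("drift detected or drift status missing")
    return (not reasons, reasons)


ok, reasons = preflight(GUIDED_PACKET)
print(ok, reasons)  # prints: True []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Collecting reasons instead of stopping at the first failure gives the host a complete picture of why a session is not ready.&lt;/p&gt;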



&lt;p&gt;In &lt;code&gt;forced&lt;/code&gt; mode, the launcher uses the same contract as a gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MICA INVOKE] mode=forced
Gate       : PASS
State      : INVOCATION_MODE
PCT        : CLOSED
...
Directive: Block work until the session report gate is not BLOCKED.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session report now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: 1.7.8
Load: {"state": "INVOCATION_MODE", "mica_yaml": "memory\\mica.yaml"}
Self-test: {"pct": "CLOSED", "closed_contract": true}
Drift: {"status": "NO_DRIFT"}
Active invariants: {"critical_count": 15, "high_count": 3}
Gate: PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before, the package told the operator how to start correctly. Now, the session declares whether it actually did.&lt;/p&gt;

&lt;p&gt;Before this version, starting a &lt;code&gt;STEM-BIO-AI&lt;/code&gt; session correctly still depended on the operator remembering to load the right memory surfaces in the right order. Now that dependency can move upward: in &lt;code&gt;guided&lt;/code&gt; mode to the host, and in &lt;code&gt;forced&lt;/code&gt; mode to the launcher.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. CCGE: The Harder Case
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sxhtvkrlw8yyd951osd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sxhtvkrlw8yyd951osd.png" alt="retaining identity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CCGE&lt;/code&gt; is more important precisely because it is harder. It is already a governance-heavy runtime. If MICA's identity were weak, it would disappear into the larger framework.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CCGE&lt;/code&gt; here is the Care Chain Governance Engine: a fail-closed clinical governance runtime with its own execution core, artifact generation, policy layers, and approval logic. That is why it is the harder case. MICA is not being tested in isolation. It is being tested inside a system dense enough to swallow it.&lt;/p&gt;

&lt;p&gt;It did not.&lt;/p&gt;

&lt;p&gt;The boundary stayed explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MICA&lt;/strong&gt; = invocation, memory, invariants, drift control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CCGE Core&lt;/strong&gt; = fail-closed runtime and artifact generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STEM-AI&lt;/strong&gt; = trust re-audit and classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the important architectural result. In &lt;code&gt;STEM-BIO-AI&lt;/code&gt;, MICA is already close to the center of the tool's operational identity. In &lt;code&gt;CCGE&lt;/code&gt;, MICA has to retain its own identity inside a much larger runtime. It does so by remaining responsible for invocation, memory, invariants, and drift control, while &lt;code&gt;CCGE Core&lt;/code&gt; remains responsible for fail-closed execution and artifact logic.&lt;/p&gt;

&lt;p&gt;The current session report in &lt;code&gt;CCGE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: None
Load: {"state": "INVOCATION_MODE", "mica_yaml": "mica.yaml"}
Self-test: {"pct": "CLOSED", "closed_contract": true}
Drift: {"status": "NO_DRIFT"}
Active invariants: {"critical_count": 0, "high_count": 0}
Gate: PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Archive: None&lt;/code&gt; with &lt;code&gt;Gate: PASS&lt;/code&gt; is not a contradiction. The baseline archive does not yet expose a &lt;code&gt;project.version&lt;/code&gt; field. MICA detected that gap and reported it before any work began. A system that hides its own incompleteness is not governed. A system that surfaces it at session start is.&lt;/p&gt;

&lt;p&gt;The reason is concrete: the active archive is still a baseline integration memory object, not yet a fully target-bound archive. Its &lt;code&gt;project&lt;/code&gt; block still carries placeholders like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;target-repo-name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;absolute-or-repo-relative-path&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;org-or-maintainer&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"integration_program"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CCGE Unified Model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"target_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"phase_1_candidate"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the current report is telling the truth about what exists: a coherent MICA package around a still-baseline archive.&lt;/p&gt;
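
&lt;p&gt;One way a report can surface a missing version instead of hiding it is sketched below. The helpers &lt;code&gt;read_archive_version&lt;/code&gt; and &lt;code&gt;format_archive_line&lt;/code&gt; are hypothetical names used for illustration; the actual session-report code may work differently.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def read_archive_version(archive):
    """Return project.version if the archive declares it, else None."""
    project = archive.get("project") or {}
    return project.get("version")


def format_archive_line(archive):
    # str(None) renders as "None", so a missing version stays visible in the report
    # instead of being replaced by a guessed or default value.
    return "Archive: " + str(read_archive_version(archive))


# Hypothetical archives: a baseline without project.version, and a target-bound one.
baseline = {"project": {"integration_program": "CCGE Unified Model",
                        "target_status": "phase_1_candidate"}}
bound = {"project": {"version": "1.7.8"}}

print(format_archive_line(baseline))  # prints: Archive: None
print(format_archive_line(bound))     # prints: Archive: 1.7.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;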

&lt;p&gt;A README might have let that gap stay invisible. The session report surfaced it immediately. That is what honest governance looks like before an archive is fully populated.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. What This Means for Anyone Building Agent Workflows
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9iym06ebtjisrhcw3oq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9iym06ebtjisrhcw3oq.png" alt="Architecutural imperatives" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three lessons came out of running this against two different projects.&lt;/p&gt;

&lt;p&gt;Human-readable startup is not enough. If the only valid path lives in a README, the protocol is vulnerable to partial reading and host variance. &lt;code&gt;STEM-BIO-AI&lt;/code&gt; is the clean example here: the memory layer was already mature, but correct startup still depended too much on the operator remembering to load it.&lt;/p&gt;

&lt;p&gt;Session-start state must be machine-usable. If a host agent cannot consume the startup declaration as a structured object, it cannot reliably preflight the session. That is why &lt;code&gt;guided&lt;/code&gt; mode matters more than another explanatory document: it gives the host an object to act on, not just instructions to interpret.&lt;/p&gt;

&lt;p&gt;A gate needs an entrypoint. A session report can be a conceptual hard gate, but until a launcher or host uses it as an entry condition, it remains a convention. &lt;code&gt;CCGE&lt;/code&gt; is the stronger proof of that point because the environment is already dense with governance logic; without an explicit entry surface, MICA would have been easy to blur into the surrounding framework instead of remaining its own startup layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What This Does Not Claim
&lt;/h2&gt;

&lt;p&gt;MICA does not self-invoke automatically in all environments. There is still no natural law that forces an LLM session to load the governed archive first.&lt;/p&gt;

&lt;p&gt;The real claim is narrower:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MICA can now be read naturally&lt;/li&gt;
&lt;li&gt;MICA can now be requested deliberately&lt;/li&gt;
&lt;li&gt;MICA can now be enforced mechanically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not total automation. A realistic path to enforceable startup.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What Part 8 Will Address
&lt;/h2&gt;

&lt;p&gt;The startup path is now much stronger.&lt;/p&gt;

&lt;p&gt;But one question remains:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How much of the session-start contract should be owned by the archive itself, and how much should remain a runtime default?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The current line can emit &lt;code&gt;session-report&lt;/code&gt;, compile guided packets, and block in forced mode. The next step is stricter archive ownership — richer &lt;code&gt;session_report_format&lt;/code&gt;, explicit per-archive &lt;code&gt;session_gate_policy&lt;/code&gt;, better drift contracts.&lt;/p&gt;

&lt;p&gt;Part 8 is not about whether MICA should govern startup. It already does. It is about how much of that behavior should be declared by the archive rather than inferred by the runtime.&lt;/p&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; A protocol is not yet an entrypoint. MICA becomes operational only when invocation is structured as &lt;code&gt;natural&lt;/code&gt;, &lt;code&gt;guided&lt;/code&gt;, or &lt;code&gt;forced&lt;/code&gt; — and the session begins from a declared activation packet, not from hope.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>contextengineering</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Repo Scanner to Audit Architecture: What Changed in STEM BIO-AI Through v1.7.8</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 19 May 2026 14:38:53 +0000</pubDate>
      <link>https://dev.to/flamehaven01/from-repo-scanner-to-audit-architecture-what-changed-in-stem-bio-ai-through-v178-500m</link>
      <guid>https://dev.to/flamehaven01/from-repo-scanner-to-audit-architecture-what-changed-in-stem-bio-ai-through-v178-500m</guid>
      <description>

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa5ste8u9hanwgyt9441.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa5ste8u9hanwgyt9441.png" alt="From repo scanner to audit architecture: the evolution of STEM BIO-AI through v1.7.8" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three technical changes that made the scanner less Python-shaped, the warning model more stable, and the reports more inspectable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The last time I wrote about STEM BIO-AI, the focus was AIRI:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how a local repository scanner could expand its risk vocabulary without pretending to become a universal AI safety judge.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was the right story for &lt;code&gt;1.7.0&lt;/code&gt; and &lt;code&gt;1.7.1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But the project changed meaningfully after that.&lt;/p&gt;

&lt;p&gt;For readers who have not followed the earlier posts: &lt;a href="https://dev.to/flamehaven01/beyond-repo-scanning-how-airi-expanded-the-risk-vocabulary-in-stem-bio-ai-17x-5bgo"&gt;Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By &lt;code&gt;1.7.8&lt;/code&gt;, the interesting question was no longer just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can this scanner attach a broader risk language to local findings?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can this scanner make those findings more inspectable, less misleading, and more robust across real repository shapes?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That shift matters.&lt;/p&gt;

&lt;p&gt;Because in audit tooling, correctness is only the first battle. The second battle is whether a reviewer can see &lt;strong&gt;why&lt;/strong&gt; the tool landed where it did, and whether the output still makes sense when it leaves the terminal and becomes a report, a PDF packet, a Hugging Face demo, or a governance memo.&lt;/p&gt;

&lt;p&gt;From &lt;code&gt;1.7.6&lt;/code&gt; through &lt;code&gt;1.7.8&lt;/code&gt;, three changes mattered most.&lt;/p&gt;

&lt;p&gt;They changed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;what counts as evidence,&lt;/li&gt;
&lt;li&gt;how warning lanes are separated,&lt;/li&gt;
&lt;li&gt;and how the final artifact stays legible across surfaces.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the more technical story behind those releases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Basic AIRI (the AI Risk Repository) Context: Expanding the Language of Risk
&lt;/h2&gt;

&lt;p&gt;Before getting into the release details, it helps to define what AIRI means in this series.&lt;/p&gt;

&lt;p&gt;AIRI refers here to &lt;strong&gt;&lt;a href="https://airisk.mit.edu/" rel="noopener noreferrer"&gt;the MIT AI Risk Repository&lt;/a&gt;&lt;/strong&gt;: a public AI risk resource from the MIT AI Risk Initiative that organizes fragmented AI risk language across research, policy, and industry sources.&lt;/p&gt;

&lt;p&gt;The repository includes an AI Risk Database, a Causal Taxonomy of AI Risks, and a Domain Taxonomy of AI Risks. According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications, while the public domain taxonomy organizes risks into 7 domains and 24 subdomains.&lt;/p&gt;

&lt;p&gt;That makes AIRI useful as a vocabulary source.&lt;/p&gt;

&lt;p&gt;But vocabulary is not truth.&lt;/p&gt;

&lt;p&gt;A local scanner should not say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this repository caused this risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It should say something more careful:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this local finding belongs to a broader class of AI risk language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction is the design boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem: The scanner was still too Python-shaped
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aqh2kcvfa3gy2i0l01h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aqh2kcvfa3gy2i0l01h.png" alt="Universal dependency detection and provenance evidence across Python and JavaScript stacks" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the more useful failures in this line came from an uncomfortable result: a repository could obviously have dependency and lockfile evidence, and STEM BIO-AI could still miss it.&lt;/p&gt;

&lt;p&gt;That is not a philosophical problem.&lt;br&gt;
That is &lt;strong&gt;an implementation problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, the project was still too biased toward Python-native signals.&lt;/p&gt;

&lt;p&gt;That showed up most clearly in JavaScript or mixed-stack repositories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;package.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;package-lock.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pnpm-lock.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;yarn.lock&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npm-shrinkwrap.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These files were not being treated as first-class provenance and replication evidence in the same way that &lt;code&gt;requirements.txt&lt;/code&gt; or &lt;code&gt;pyproject.toml&lt;/code&gt; were.&lt;/p&gt;

&lt;p&gt;The result was a false negative pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 3 provenance (&lt;code&gt;B1&lt;/code&gt;) could be undercounted&lt;/li&gt;
&lt;li&gt;Stage 4 replication evidence could be undercounted&lt;/li&gt;
&lt;li&gt;and the report could quietly imply "no dependency evidence" when the repository clearly had dependency structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That kind of miss is more dangerous than it sounds.&lt;/p&gt;

&lt;p&gt;Not because it makes the score a little wrong.&lt;/p&gt;

&lt;p&gt;But because it damages trust in the scanner's worldview.&lt;/p&gt;

&lt;p&gt;If developers see a tool miss an obvious &lt;code&gt;pnpm-lock.yaml&lt;/code&gt;, they stop believing the harder claims too.&lt;/p&gt;


&lt;h3&gt;
  
  
  What changed in &lt;code&gt;1.7.6&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The fix was straightforward but important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript manifests and lockfiles were promoted into the same evidence families as the existing Python manifests where appropriate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely, that meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;B1_data_provenance_controls&lt;/code&gt; started recognizing JS manifest/lock surfaces&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S4_environment_lock_evidence&lt;/code&gt; started recognizing them&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S4_exact_dependency_pins_or_hashes&lt;/code&gt; started recognizing them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not a scoring philosophy change.&lt;/p&gt;

&lt;p&gt;It was a scope correction.&lt;/p&gt;

&lt;p&gt;The rule engine learned that a dependency ecosystem is a dependency ecosystem even when it is not Python.&lt;/p&gt;
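
&lt;p&gt;As a rough illustration of that scope correction, manifest detection can be keyed on file names per ecosystem. The sketch below rests on assumptions: the &lt;code&gt;MANIFESTS&lt;/code&gt; and &lt;code&gt;LOCKFILES&lt;/code&gt; tables and &lt;code&gt;classify_dependency_evidence&lt;/code&gt; are hypothetical and do not reproduce STEM BIO-AI's actual rule engine. Only the file names come from this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import PurePosixPath

# Hypothetical lookup tables; the file names are the ones listed in this post.
MANIFESTS = {
    "requirements.txt": "python",
    "pyproject.toml": "python",
    "package.json": "javascript",
}
LOCKFILES = {
    "package-lock.json": "javascript",
    "pnpm-lock.yaml": "javascript",
    "yarn.lock": "javascript",
    "npm-shrinkwrap.json": "javascript",
}


def classify_dependency_evidence(paths):
    """Group repository paths into manifest and lockfile evidence by ecosystem."""
    evidence = {"manifests": {}, "lockfiles": {}}
    for raw in paths:
        # Match on the file name only, so nested packages are still seen.
        name = PurePosixPath(raw).name
        if name in MANIFESTS:
            evidence["manifests"].setdefault(MANIFESTS[name], []).append(raw)
        elif name in LOCKFILES:
            evidence["lockfiles"].setdefault(LOCKFILES[name], []).append(raw)
    return evidence


result = classify_dependency_evidence(
    ["README.md", "web/package.json", "web/pnpm-lock.yaml", "requirements.txt"]
)
print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the ecosystem as data rather than as branching logic means a new stack is a table entry, not a new code path.&lt;/p&gt;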

&lt;p&gt;One boundary matters here.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;B1_data_provenance_controls&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; suddenly mean "dataset lineage was proven by a lockfile."&lt;/p&gt;

&lt;p&gt;In this lane, &lt;code&gt;B1&lt;/code&gt; is using dependency manifests as &lt;strong&gt;repository provenance surfaces&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what environment the repository expects,&lt;/li&gt;
&lt;li&gt;what dependency custody the repository exposes,&lt;/li&gt;
&lt;li&gt;and whether the repo surfaces any adjacent data-source, IRB, or dataset-citation language around that environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is weaker than dataset lineage evidence.&lt;/p&gt;

&lt;p&gt;But it is also much stronger than pretending a mixed-stack repository has no provenance surface at all.&lt;/p&gt;


&lt;h3&gt;
  
  
  A small before/after that makes the point
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;yorkeccak/bio&lt;/code&gt; case is a good example because the score movement was not philosophical. It was mechanical.&lt;/p&gt;

&lt;p&gt;Before the JS manifest fix, the same repository could produce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1.7.5
final_score: 40
stage_3_code_bio: 6
B1_data_provenance_controls: 0 / 15
replication_score: 10
AIRI covered_count: 0 / 31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the manifest and lockfile correction, the same repository shape produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1.7.8
final_score: 48
stage_3_code_bio: 25
B1_data_provenance_controls: 15 / 15
replication_score: 30
AIRI covered_count: 7 / 32
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is not the score delta by itself, but what produced it.&lt;/p&gt;

&lt;p&gt;One small boundary is worth making explicit here.&lt;/p&gt;

&lt;p&gt;The AIRI change is doing two things at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the denominator moved from &lt;code&gt;31&lt;/code&gt; to &lt;code&gt;32&lt;/code&gt; because the governed AIRI detector scope expanded by one mapping row across this release line,&lt;/li&gt;
&lt;li&gt;and the numerator moved from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;7&lt;/code&gt; because the current release can now carry more bounded AIRI links around the findings it actually surfaced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That explains the AIRI coverage delta.&lt;/p&gt;

&lt;p&gt;The scoring delta came from a more mechanical correction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;package-lock.json&lt;/code&gt;, and &lt;code&gt;pnpm-lock.yaml&lt;/code&gt; stopped being invisible,&lt;/li&gt;
&lt;li&gt;Stage 3 stopped saying "no dependency/provenance manifest detected,"&lt;/li&gt;
&lt;li&gt;and Stage 4 stopped undercounting replication structure that was obviously there.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what I mean by "blind spot removal" rather than score drift.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why that matters
&lt;/h3&gt;

&lt;p&gt;This is the kind of change that sounds small in a changelog but large in practice.&lt;/p&gt;

&lt;p&gt;Because it changes the relationship between the tool and the developer reading it.&lt;/p&gt;

&lt;p&gt;A scanner earns the right to say "this repo is weak on provenance" only after it can correctly see the basic surfaces that exist in the target stack.&lt;/p&gt;

&lt;p&gt;That correction also made later report outputs more believable.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;B1&lt;/code&gt; moved from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;15&lt;/code&gt; in affected repositories, that was not "score drift." It was the removal of a blind spot.&lt;/p&gt;

&lt;p&gt;And that distinction is exactly why audit tools need explicit versioned rationale.&lt;/p&gt;

&lt;p&gt;Without it, every score movement looks arbitrary.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Problem: The warning lanes were doing too many jobs at once
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll48qnppqqso4q9k47ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll48qnppqqso4q9k47ut.png" alt="Dedicated warning lanes in STEM BIO-AI showing C4, C5, and C6 semantic separation" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before getting to the split, it helps to read &lt;code&gt;C1–C6&lt;/code&gt; as code-integrity lanes.&lt;/p&gt;

&lt;p&gt;They are not general AI risk categories. They are reviewer-facing signals that tell you what kind of repository weakness the scanner found, and where to inspect next.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lane&lt;/th&gt;
&lt;th&gt;What it means in STEM BIO-AI&lt;/th&gt;
&lt;th&gt;What a reviewer should inspect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hardcoded credential signals&lt;/td&gt;
&lt;td&gt;exposed API keys, cloud keys, tokens, or credential-like patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dependency pinning and external-service fragility&lt;/td&gt;
&lt;td&gt;loose dependency ranges, missing exact pins, fragile external service assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deprecated patient-adjacent paths&lt;/td&gt;
&lt;td&gt;legacy, archive, or deprecated folders that still contain patient or clinical-adjacent patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fail-open exception handling&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;except: pass&lt;/code&gt;, &lt;code&gt;except Exception: pass&lt;/code&gt;, silent fallbacks, or code paths where errors can disappear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compliance and clinical-boundary integrity&lt;/td&gt;
&lt;td&gt;unsupported HIPAA, compliance, clinical-safe, self-hosted, or regulatory-adjacent claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mock-auth or no-auth local/self-host trust boundaries&lt;/td&gt;
&lt;td&gt;auto-login, mock authentication, no-auth flows, or weak local trust-boundary assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That table matters because &lt;code&gt;C4&lt;/code&gt;, &lt;code&gt;C5&lt;/code&gt;, and &lt;code&gt;C6&lt;/code&gt; are not interchangeable.&lt;/p&gt;

&lt;p&gt;A fail-open exception is not the same problem as an unsupported compliance claim.&lt;/p&gt;

&lt;p&gt;And an unsupported compliance claim is not the same problem as a mock-auth self-host boundary.&lt;/p&gt;

&lt;p&gt;That distinction became important once the report started surfacing more nuanced governance signals.&lt;/p&gt;
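
&lt;p&gt;For the &lt;code&gt;C4&lt;/code&gt; row specifically, the pattern is narrow enough to sketch with Python's standard &lt;code&gt;ast&lt;/code&gt; module. This is a simplified stand-in, not the scanner's actual rule: &lt;code&gt;find_fail_open_handlers&lt;/code&gt; is a hypothetical name, and it flags any handler whose entire body is &lt;code&gt;pass&lt;/code&gt;, regardless of the exception type.&lt;/p&gt;

```python
import ast

SOURCE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass
'''


def find_fail_open_handlers(source):
    """Return line numbers of except handlers whose whole body is `pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ExceptHandler):
            # A handler that only passes lets the error disappear silently.
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                hits.append(node.lineno)
    return hits


print(find_fail_open_handlers(SOURCE))
```

&lt;p&gt;Nothing in this check reads a README or an auth flow. It only inspects executable control flow, which is the property the lane ID is supposed to signal.&lt;/p&gt;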

&lt;p&gt;The old &lt;code&gt;C4&lt;/code&gt; lane had started life as a code-oriented fail-open/exception surface.&lt;/p&gt;

&lt;p&gt;But as the scanner got better at spotting unsupported compliance language and boundary failures, more and more signals were being interpreted near that same lane.&lt;/p&gt;

&lt;p&gt;That made the result harder to read.&lt;/p&gt;

&lt;p&gt;If a reviewer sees:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;C4_exception_handling_clinical_adjacent_paths: WARN&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;they should be able to infer the remediation class immediately.&lt;/p&gt;

&lt;p&gt;They should know to inspect executable control flow.&lt;/p&gt;

&lt;p&gt;They should not have to wonder whether the warning is actually about a README compliance claim, a missing clinical boundary, or a mock-auth local path.&lt;/p&gt;

&lt;p&gt;Once one lane starts carrying all of those meanings, the ID stops doing its job.&lt;/p&gt;

&lt;p&gt;This is a common failure mode in rule systems.&lt;/p&gt;

&lt;p&gt;At first it feels efficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one warning lane,&lt;/li&gt;
&lt;li&gt;one bucket,&lt;/li&gt;
&lt;li&gt;multiple related issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then a few releases later the bucket becomes a junk drawer.&lt;/p&gt;

&lt;p&gt;That is exactly what had to be prevented here.&lt;/p&gt;




&lt;h3&gt;
  
  
  What changed in &lt;code&gt;1.7.7&lt;/code&gt; and &lt;code&gt;1.7.8&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The solution was to split the lane cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C4&lt;/code&gt; stayed reserved for executable fail-open exception behavior&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C5&lt;/code&gt; was introduced for unsupported compliance or boundary-integrity claims&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C6&lt;/code&gt; was introduced for mock-auth, auto-login, or no-auth self-host/local trust-boundary signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was more than renaming.&lt;/p&gt;

&lt;p&gt;It made the model of the problem cleaner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C4&lt;/code&gt; is code-path failure semantics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C5&lt;/code&gt; is governance/claim integrity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C6&lt;/code&gt; is trust-boundary collapse in local or self-host flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters to developers because those are different remediation classes.&lt;/p&gt;

&lt;p&gt;If a repository triggers &lt;code&gt;C4&lt;/code&gt;, you inspect executable control flow.&lt;br&gt;
If it triggers &lt;code&gt;C5&lt;/code&gt;, you inspect public claim surfaces and supporting governance evidence.&lt;br&gt;
If it triggers &lt;code&gt;C6&lt;/code&gt;, you inspect local auth and trust-boundary design.&lt;/p&gt;

&lt;p&gt;One warning label should not try to be all three.&lt;/p&gt;
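
&lt;p&gt;One way to picture the payoff is a lookup from lane prefix to inspection target. The sketch below is hypothetical: the &lt;code&gt;REMEDIATION&lt;/code&gt; table and &lt;code&gt;inspection_targets&lt;/code&gt; function are illustrative names, and the assumption that the lane is the part of the warning ID before the first underscore is mine.&lt;/p&gt;

```python
# Inspection targets paraphrased from the article; the table itself is illustrative.
REMEDIATION = {
    "C4": "executable control flow",
    "C5": "public claim surfaces and supporting governance evidence",
    "C6": "local auth and trust-boundary design",
}


def inspection_targets(warning_ids):
    """Map full warning IDs to the surface a reviewer should inspect."""
    targets = {}
    for warning_id in warning_ids:
        lane = warning_id.split("_", 1)[0]  # assumed ID shape: LANE_description
        if lane in REMEDIATION:
            targets[warning_id] = REMEDIATION[lane]
    return targets


result = inspection_targets([
    "C4_exception_handling_clinical_adjacent_paths",
    "C5_compliance_boundary_integrity",
])
for warning_id, target in result.items():
    print(f"{warning_id}: inspect {target}")
```

&lt;p&gt;A lookup like this only works while each lane keeps a single meaning. If one lane carried all three concerns, the table would need a free-text explanation per warning instead of a stable key.&lt;/p&gt;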

&lt;p&gt;The more interesting case is when two of those lanes fire together.&lt;/p&gt;

&lt;p&gt;A repository can claim something like "HIPAA-ready self-hosting" at the README layer and also expose a mock-auth or auto-login local path.&lt;/p&gt;

&lt;p&gt;That is not one problem.&lt;/p&gt;

&lt;p&gt;It is two related problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C5&lt;/code&gt; says the claim surface is overstating governance integrity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C6&lt;/code&gt; says the local trust boundary is weaker than the claim suggests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly why the split matters.&lt;/p&gt;

&lt;p&gt;If those two findings collapse into one bucket, the reviewer loses both remediation clarity and causal ordering.&lt;/p&gt;

&lt;p&gt;If they stay separate, the report can say:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the public claim is weak,&lt;/li&gt;
&lt;li&gt;the local boundary is weak,&lt;/li&gt;
&lt;li&gt;and both together make the repository easier to over-trust.&lt;/li&gt;
&lt;/ol&gt;


&lt;h3&gt;
  
  
  The code insight
&lt;/h3&gt;

&lt;p&gt;This is one of those places where good audit tooling starts looking more like good static analysis design.&lt;/p&gt;

&lt;p&gt;A useful warning family is not just one that catches things.&lt;/p&gt;

&lt;p&gt;It is one that stays semantically stable across releases.&lt;/p&gt;

&lt;p&gt;That is why this split mattered: it was not just about improving recall.&lt;/p&gt;

&lt;p&gt;It was about preserving interpretability under growth.&lt;/p&gt;

&lt;p&gt;Once a detector ID becomes ambiguous, your historical comparisons become weaker.&lt;/p&gt;

&lt;p&gt;And once historical comparisons become weaker, your audit system starts losing its memory.&lt;/p&gt;

&lt;p&gt;That is a bigger problem than one missed warning.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Problem: The report could still be correct and yet hard to trust
&lt;/h2&gt;

&lt;p&gt;A repository scanner does not end its life in JSON.&lt;/p&gt;

&lt;p&gt;It ends up in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown&lt;/li&gt;
&lt;li&gt;HTML&lt;/li&gt;
&lt;li&gt;PDF&lt;/li&gt;
&lt;li&gt;demos&lt;/li&gt;
&lt;li&gt;governance reviews&lt;/li&gt;
&lt;li&gt;screenshots&lt;/li&gt;
&lt;li&gt;and social arguments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the output architecture matters almost as much as the scoring logic.&lt;/p&gt;

&lt;p&gt;And there were two places where this became obvious.&lt;/p&gt;


&lt;h3&gt;
  
  
  First: AIRI numbers needed explanation, not just display
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqednhmhqla6mtzxieyyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqednhmhqla6mtzxieyyo.png" alt="AIRI numbers needed explanation" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Earlier versions could show AIRI coverage as a count, but did not always make it obvious why a covered risk appeared.&lt;/p&gt;

&lt;p&gt;That is a problem.&lt;/p&gt;

&lt;p&gt;Because a number like &lt;code&gt;7 / 32&lt;/code&gt; looks precise.&lt;/p&gt;

&lt;p&gt;But precision without causal explanation is fragile.&lt;/p&gt;

&lt;p&gt;Developers do not just want to know that a risk mapped.&lt;br&gt;
They want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which detector triggered it,&lt;/li&gt;
&lt;li&gt;why that detector maps to that AIRI risk,&lt;/li&gt;
&lt;li&gt;and what boundary still remains around that mapping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the AIRI layer had to become more explicit.&lt;/p&gt;

&lt;p&gt;That is where &lt;code&gt;mapping_details&lt;/code&gt; mattered.&lt;/p&gt;

&lt;p&gt;Covered AIRI rows now carry bounded reasoning objects that can say, in effect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detector ID&lt;/li&gt;
&lt;li&gt;mapping justification&lt;/li&gt;
&lt;li&gt;trigger reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a much stronger artifact than a bare coverage count.&lt;/p&gt;

&lt;p&gt;It turns AIRI from a visual add-on into an inspectable vocabulary layer.&lt;/p&gt;

&lt;p&gt;In practice the object now looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"24.01.03"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Safe exploration problem with widely deployed AI assistants"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"covered_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"C5_compliance_boundary_integrity"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mapping_details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"detector_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C5_compliance_boundary_integrity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mapping_justification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"trigger_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unsupported legal/compliance claim surfaced in boundary-integrity lane."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That matters because the AIRI layer no longer asks the reviewer to trust a number alone.&lt;/p&gt;

&lt;p&gt;It now gives the reviewer a bounded reasoning object to inspect.&lt;/p&gt;
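
&lt;p&gt;Because the reasoning is structured, a reviewer can also check it mechanically. The Python sketch below assumes only the field names visible in the object above. The &lt;code&gt;unexplained_detectors&lt;/code&gt; helper is hypothetical and is not a function shipped with STEM BIO-AI.&lt;/p&gt;

```python
import json

# The object from the article, reproduced as a string for a self-contained run.
ROW = json.loads("""
{
  "id": "24.01.03",
  "title": "Safe exploration problem with widely deployed AI assistants",
  "covered_by": ["C5_compliance_boundary_integrity"],
  "mapping_details": [
    {
      "detector_id": "C5_compliance_boundary_integrity",
      "mapping_justification": "Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts.",
      "trigger_reason": "Unsupported legal/compliance claim surfaced in boundary-integrity lane."
    }
  ]
}
""")


def unexplained_detectors(row):
    """Return detectors listed in covered_by that lack a mapping_details entry."""
    explained = {detail["detector_id"] for detail in row.get("mapping_details", [])}
    return [d for d in row.get("covered_by", []) if d not in explained]


print(unexplained_detectors(ROW))
```

&lt;p&gt;An empty result means every detector that claims coverage of the row also carries its own justification and trigger reason.&lt;/p&gt;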




&lt;h3&gt;
  
  
  Second: The packets themselves needed re-architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm42xf261dgookg7em5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm42xf261dgookg7em5c.png" alt="Artifact architecture showing brief, standard, and full evidence packet tiers across output surfaces" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The PDF tiers had also drifted into an awkward shape.&lt;/p&gt;

&lt;p&gt;The old packet boundaries no longer matched the actual content density:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 4 could disappear or feel collapsed&lt;/li&gt;
&lt;li&gt;the closeout pages could become overcrowded&lt;/li&gt;
&lt;li&gt;and "5-page detailed packet" could stop meaning what users expected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That led to a cleaner packet model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;level 1&lt;/code&gt; = brief &lt;code&gt;1p&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;level 2&lt;/code&gt; = standard &lt;code&gt;5p&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;level 3&lt;/code&gt; = full &lt;code&gt;7p&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And just as importantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the default CLI path moved to &lt;code&gt;level 3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That default is a statement about what the project now considers the normal artifact.&lt;/p&gt;

&lt;p&gt;The normal artifact is no longer the brief scan.&lt;br&gt;
It is the full evidence packet.&lt;/p&gt;
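
&lt;p&gt;The tier model is small enough to write down as data. The following Python sketch is illustrative: the tier names and page counts come from the list above, while &lt;code&gt;PACKET_LEVELS&lt;/code&gt;, &lt;code&gt;DEFAULT_LEVEL&lt;/code&gt;, and &lt;code&gt;describe_packet&lt;/code&gt; are hypothetical names, not the project's internals.&lt;/p&gt;

```python
# Tier names and page counts as described in the article; the structure is illustrative.
PACKET_LEVELS = {
    1: ("brief", 1),
    2: ("standard", 5),
    3: ("full", 7),
}
DEFAULT_LEVEL = 3  # the default path now lands on the full packet


def describe_packet(level=None):
    """Return a short label for a packet level, falling back to the default."""
    chosen = DEFAULT_LEVEL if level is None else level
    name, pages = PACKET_LEVELS[chosen]
    return f"level {chosen}: {name} {pages}p"


print(describe_packet())
print(describe_packet(1))
```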


&lt;h3&gt;
  
  
  Why that matters
&lt;/h3&gt;

&lt;p&gt;This is where the project moved from "scanner" toward "audit architecture."&lt;/p&gt;

&lt;p&gt;A scanner can stop at a result.&lt;/p&gt;

&lt;p&gt;An audit architecture has to preserve meaning across surfaces.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON must be canonical&lt;/li&gt;
&lt;li&gt;HTML must be navigable&lt;/li&gt;
&lt;li&gt;PDFs must honor real packet boundaries&lt;/li&gt;
&lt;li&gt;and the same warning semantics must survive in all of them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why these changes matter to developers.&lt;/p&gt;

&lt;p&gt;They are part of the correctness story.&lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;why&lt;/code&gt; disappears when the result becomes a report, the audit object was never complete to begin with.&lt;/p&gt;
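
&lt;p&gt;One hedged way to test that property is to treat the JSON as the source of truth and check each rendered surface against it. In the Python sketch below, &lt;code&gt;missing_from_render&lt;/code&gt; and the sample Markdown string are hypothetical. Only the two warning IDs are taken from examples earlier in this post.&lt;/p&gt;

```python
def missing_from_render(canonical_ids, rendered_text):
    """Return warning IDs present in the canonical result but absent from a render."""
    return [wid for wid in canonical_ids if wid not in rendered_text]


canonical_ids = [
    "C4_exception_handling_clinical_adjacent_paths",
    "C5_compliance_boundary_integrity",
]
# Hypothetical Markdown render that dropped the C5 row.
markdown = "## Warnings\n- C4_exception_handling_clinical_adjacent_paths: WARN\n"

missing = missing_from_render(canonical_ids, markdown)
print(missing)
```

&lt;p&gt;A plain substring check is crude, but it is enough to catch a surface that silently drops a warning the canonical result still carries.&lt;/p&gt;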


&lt;h2&gt;
  
  
  The hidden pattern behind all three changes
&lt;/h2&gt;

&lt;p&gt;These releases can look like a mixed bag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JS manifest support&lt;/li&gt;
&lt;li&gt;legal/compliance claim surfacing&lt;/li&gt;
&lt;li&gt;external dependency risk&lt;/li&gt;
&lt;li&gt;C4/C5/C6 split&lt;/li&gt;
&lt;li&gt;AIRI reasoning&lt;/li&gt;
&lt;li&gt;packet restructuring&lt;/li&gt;
&lt;li&gt;demo/output alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there is a single pattern underneath them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;the system became less willing to let ambiguity hide inside a convenient surface.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That showed up in three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a manifest should count if it exists&lt;/li&gt;
&lt;li&gt;a warning lane should mean one thing&lt;/li&gt;
&lt;li&gt;a risk mapping should explain itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That may sound almost obvious.&lt;/p&gt;

&lt;p&gt;But a lot of tools never make it that far.&lt;/p&gt;

&lt;p&gt;They accumulate clever features faster than they reduce ambiguity.&lt;/p&gt;

&lt;p&gt;This line of work did the opposite.&lt;/p&gt;

&lt;p&gt;It made &lt;strong&gt;the system stricter about what its outputs are allowed to imply.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is a more durable path.&lt;/p&gt;


&lt;h2&gt;
  
  
  The more interesting lesson
&lt;/h2&gt;

&lt;p&gt;The most useful thing about &lt;code&gt;1.7.6&lt;/code&gt; through &lt;code&gt;1.7.8&lt;/code&gt; is not that STEM BIO-AI became "smarter."&lt;/p&gt;

&lt;p&gt;It is that it became harder to misread.&lt;/p&gt;

&lt;p&gt;That is a better goal for audit tooling.&lt;/p&gt;

&lt;p&gt;Especially now.&lt;/p&gt;

&lt;p&gt;Because in a world increasingly full of fluent agent outputs, the differentiator is not whether a tool can generate a plausible narrative.&lt;/p&gt;

&lt;p&gt;It is whether the narrative stays tethered to inspectable structure when the repository is messy, cross-stack, overclaimed, or partially misleading.&lt;/p&gt;

&lt;p&gt;That is where this release line got better.&lt;/p&gt;

&lt;p&gt;Not by pretending to know more than it does.&lt;/p&gt;

&lt;p&gt;But by making its own boundaries clearer.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I would tell developers evaluating this line
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoh0x6bamxqa14sn6vog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoh0x6bamxqa14sn6vog.png" alt="What I would tell developers evaluating this line" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you only look at the release notes, you might think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better AIRI&lt;/li&gt;
&lt;li&gt;more warnings&lt;/li&gt;
&lt;li&gt;nicer reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is true, but too shallow.&lt;/p&gt;

&lt;p&gt;The real changes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the scanner is less Python-centric than it was&lt;/li&gt;
&lt;li&gt;the warning taxonomy is more semantically stable than it was&lt;/li&gt;
&lt;li&gt;the artifacts are more inspectable than they were&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination matters more than any one score change.&lt;/p&gt;

&lt;p&gt;It means the tool is becoming less of a clever repo grader and more of a reliable evidence instrument.&lt;/p&gt;

&lt;p&gt;That is the direction I care about.&lt;/p&gt;

&lt;p&gt;Because once the repository is politically messy, clinically adjacent, or governance-sensitive, "good-enough automation" is not enough.&lt;/p&gt;

&lt;p&gt;The system has to show its work.&lt;/p&gt;

&lt;p&gt;These versions got noticeably better at doing that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6ig153mid6rhvhjnxtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6ig153mid6rhvhjnxtj.png" alt="A Reliable Evidence Instrument for the Messy Reality" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;stem-ai
stem /path/to/repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you want the full packet explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default path now lands on the full evidence packet, and that is the point.&lt;/p&gt;

&lt;p&gt;In audit tooling, the serious path should not require an extra flag.&lt;/p&gt;




&lt;h2&gt;
  
  
  See the Artifact
&lt;/h2&gt;

&lt;p&gt;If you want to inspect the actual artifact shape behind this release line, these two public outputs are the best reference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kbwdk04yt6f8f8q3bqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kbwdk04yt6f8f8q3bqd.png" alt="stem-bio-ai report" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive HTML report: &lt;a href="https://flamehaven01.github.io/flamehaven-audit-reports/stem-bio-ai/yorkeccak-bio/2026-05-15/report.html" rel="noopener noreferrer"&gt;Open interactive HTML report&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full &lt;code&gt;7p&lt;/code&gt; PDF packet: &lt;a href="https://flamehaven01.github.io/flamehaven-audit-reports/stem-bio-ai/yorkeccak-bio/2026-05-15/report.pdf" rel="noopener noreferrer"&gt;Open full &lt;code&gt;7p&lt;/code&gt; PDF packet&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point of &lt;code&gt;1.7.8&lt;/code&gt; is not just that the scanner scores the repository differently.&lt;/p&gt;

&lt;p&gt;It is that the same result now survives translation into JSON, Markdown, HTML, and a full review packet without losing too much meaning along the way.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>governance</category>
      <category>bioinformatics</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 14 May 2026 13:41:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/beyond-repo-scanning-how-airi-expanded-the-risk-vocabulary-in-stem-bio-ai-17x-5bgo</link>
      <guid>https://dev.to/flamehaven01/beyond-repo-scanning-how-airi-expanded-the-risk-vocabulary-in-stem-bio-ai-17x-5bgo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyj7biyn850iewno8ywf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyj7biyn850iewno8ywf.png" alt="Beyond Repo Scanning" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the second half of the same &lt;code&gt;1.7.x&lt;/code&gt; transition.&lt;/p&gt;

&lt;p&gt;In the previous post, I wrote about calibration governance: how STEM BIO-AI keeps score authority from drifting when users simulate policy posture.&lt;/p&gt;

&lt;p&gt;That was about how the system decides.&lt;/p&gt;

&lt;p&gt;This post is about a different layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how the system speaks about risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A local repository scanner can become trapped inside its own vocabulary.&lt;/p&gt;

&lt;p&gt;It can detect dependency issues, weak provenance language, shallow validation, reproducibility gaps, and risky exception handling.&lt;/p&gt;

&lt;p&gt;But if every finding stays only inside the scanner's internal language, the report may remain too narrow.&lt;/p&gt;

&lt;p&gt;That is the problem AIRI helped address in STEM BIO-AI &lt;code&gt;1.7.x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this context, AIRI is used as a local risk-vocabulary layer built from the MIT AI Risk Repository ecosystem.&lt;/p&gt;

&lt;p&gt;The point is not to replace deterministic repository scanning with an external risk database.&lt;/p&gt;

&lt;p&gt;The point is to give local findings a broader risk vocabulary without turning that vocabulary into a truth claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  Basic AIRI Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17fvupudpdertnq7dy6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17fvupudpdertnq7dy6i.png" alt="Expanding the language of risk" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airisk.mit.edu/" rel="noopener noreferrer"&gt;The MIT AI Risk Repository&lt;/a&gt; is a public AI risk resource from the MIT AI Risk Initiative.&lt;/p&gt;

&lt;p&gt;It helps organize fragmented AI risk language across research, policy, and industry sources.&lt;/p&gt;

&lt;p&gt;The repository includes three main parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an AI Risk Database&lt;/li&gt;
&lt;li&gt;a Causal Taxonomy of AI Risks&lt;/li&gt;
&lt;li&gt;a Domain Taxonomy of AI Risks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications. The public domain taxonomy organizes risks into 7 domains and 24 subdomains.&lt;/p&gt;

&lt;p&gt;Some of those domain taxonomy nodes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;2. Privacy &amp;amp; Security&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;2.1 Compromise of privacy by obtaining, leaking or correctly inferring sensitive information&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;2.2 AI system security vulnerabilities and attacks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;6.5 Governance failure&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7. AI System Safety, Failures, &amp;amp; Limitations&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7.4 Lack of transparency or interpretability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes AIRI useful as a vocabulary source.&lt;/p&gt;

&lt;p&gt;But vocabulary is not truth.&lt;/p&gt;

&lt;p&gt;A local scanner should not say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this repository caused this risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It should say something more careful:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this local finding belongs to a broader class of AI risk language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction is the design boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Problem AIRI Was Meant to Solve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zurro5671x5iqraftvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zurro5671x5iqraftvh.png" alt="Local scanners are trapped in their own vocabulary" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM BIO-AI began as a deterministic evidence-surface scanner for bio and medical AI repositories.&lt;/p&gt;

&lt;p&gt;That core remains.&lt;/p&gt;

&lt;p&gt;The scanner looks at observable repository surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README and docs&lt;/li&gt;
&lt;li&gt;code structure&lt;/li&gt;
&lt;li&gt;CI configuration&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;reproducibility signals&lt;/li&gt;
&lt;li&gt;clinical-adjacent boundary language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But once STEM BIO-AI started producing richer audit outputs, a new question appeared:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How should the system talk about the broader risk territory around a detected finding?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a fail-open exception path may have implications beyond code quality&lt;/li&gt;
&lt;li&gt;weak provenance language may connect to reproducibility and trust concerns&lt;/li&gt;
&lt;li&gt;shallow validation around sensitive inputs may point toward a wider harm surface than the repository alone makes obvious&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a broader vocabulary, those findings remain local and narrow.&lt;/p&gt;

&lt;p&gt;AIRI helps widen the vocabulary without making the scanner less deterministic.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Short Note on Detector Families
&lt;/h2&gt;

&lt;p&gt;In this article, a detector family means a bounded local analysis surface inside STEM BIO-AI.&lt;/p&gt;

&lt;p&gt;It does not mean an AI model judging the repository.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code integrity detectors such as hardcoded credential or fail-open exception checks&lt;/li&gt;
&lt;li&gt;AST contract detectors such as shallow validator checks&lt;/li&gt;
&lt;li&gt;bio diagnostics such as SMILES parser-guard or silent mock fallback checks&lt;/li&gt;
&lt;li&gt;provenance and reproducibility evidence surfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A detector family produces a local finding.&lt;/p&gt;

&lt;p&gt;The AIRI layer does not replace that finding.&lt;/p&gt;

&lt;p&gt;It gives the finding a broader vocabulary anchor.&lt;/p&gt;




&lt;h2&gt;
  
  
  AIRI Does Not Replace the Scan
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd7zc9lpl0946movl0id.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd7zc9lpl0946movl0id.png" alt="Vocabulary is not truth" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This boundary matters.&lt;/p&gt;

&lt;p&gt;The AIRI layer does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate that a real-world incident happened&lt;/li&gt;
&lt;li&gt;prove that a repository causes a given harm&lt;/li&gt;
&lt;li&gt;turn a detector hit into a clinical danger claim&lt;/li&gt;
&lt;li&gt;replace due diligence or domain review&lt;/li&gt;
&lt;li&gt;override the deterministic score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it gives the system a structured way to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what broader risk territory a finding may relate to&lt;/li&gt;
&lt;li&gt;which risk vocabulary exists around that class of concern&lt;/li&gt;
&lt;li&gt;where known coverage gaps remain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why AIRI is a risk-vocabulary layer, not a truth layer.&lt;/p&gt;

&lt;p&gt;If a report says something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;covered risks: 12 / 31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;that should not be read as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the repository is 39% safe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the scanner covers 39% of all AI risk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better interpretation is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;within the detector scope currently mapped into the curated AIRI runtime layer, this scan triggered findings that connect to these AIRI risk entries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is narrower.&lt;/p&gt;

&lt;p&gt;It is also more useful.&lt;/p&gt;
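
&lt;p&gt;The scoped reading above can be sketched in a few lines of Python. This is only an illustration: the &lt;code&gt;CoverageSummary&lt;/code&gt; class is hypothetical, and using the detector scope as the denominator is an assumption, not a statement about how STEM BIO-AI computes &lt;code&gt;coverage_rate&lt;/code&gt;.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverageSummary:
    """Hypothetical container for the two counts behind a coverage line."""

    covered_count: int
    total_risks_in_detector_scope: int

    @property
    def coverage_rate(self) -> float:
        # Assumption: the denominator is the mapped detector scope,
        # not the full upstream AIRI universe.
        if self.total_risks_in_detector_scope == 0:
            return 0.0
        return self.covered_count / self.total_risks_in_detector_scope

    def describe(self) -> str:
        # Word the result so it cannot be mistaken for a safety claim.
        return (
            f"{self.covered_count} of {self.total_risks_in_detector_scope} "
            "AIRI entries in detector scope were connected to findings "
            f"in this scan (rate {self.coverage_rate:.3f}); "
            "this is not a safety percentage."
        )


summary = CoverageSummary(covered_count=12, total_risks_in_detector_scope=31)
print(summary.describe())
```

&lt;p&gt;The point of the sketch is the wording of &lt;code&gt;describe()&lt;/code&gt;: the number never appears without the scope it was measured against.&lt;/p&gt;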




&lt;h2&gt;
  
  
  From External Repository to Local Governance Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fwgfgtfux2bro49eyfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fwgfgtfux2bro49eyfl.png" alt="Three layers of local governance" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AIRI story in STEM BIO-AI changed during &lt;code&gt;1.7.x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The initial direction was simple: use AIRI to provide broader risk labels around local findings.&lt;/p&gt;

&lt;p&gt;That was useful, but not enough.&lt;/p&gt;

&lt;p&gt;If an audit system relies on an external risk source, it needs governance around that source.&lt;/p&gt;

&lt;p&gt;So STEM BIO-AI separates AIRI into three local layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Local layer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;airi_registry_full.v1.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;normalized full local registry derived from the upstream AIRI snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;airi_runtime_bundle.v1.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;curated runtime subset used by deterministic scans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;airi_detector_mapping.v1.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;detector-to-risk mapping registry plus known-gap records&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This separation prevents a common mistake:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;confusing the full upstream AIRI universe with the smaller curated runtime bundle used by the scanner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The scanner uses the curated runtime bundle, not the entire upstream AIRI universe.&lt;/p&gt;

&lt;p&gt;That keeps runtime outputs deterministic, reviewable, and tied to a known local snapshot.&lt;/p&gt;
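
&lt;p&gt;One way to picture the relationship between the three layers is a consistency check: every bundle entry should exist in the full registry, and every detector mapping should point inside the bundle. The sketch below works on plain Python collections. The function name, the data shapes, and the sample IDs are assumptions for illustration, not the actual schema of the three JSON files.&lt;/p&gt;

```python
def check_layer_consistency(registry_ids, bundle_ids, detector_mapping):
    """Return a list of readable problems; an empty list means consistent.

    Assumed shapes (hypothetical, not the real file schema):
      registry_ids / bundle_ids: iterables of risk ID strings
      detector_mapping: dict of detector name -> list of risk ID strings
    """
    problems = []
    registry = set(registry_ids)
    bundle = set(bundle_ids)

    # The curated bundle should be a subset of the full local registry.
    for risk_id in sorted(bundle - registry):
        problems.append(f"bundle entry {risk_id} is missing from the full registry")

    # Detector mappings should only reference risks active at runtime.
    for detector, risk_ids in sorted(detector_mapping.items()):
        for risk_id in risk_ids:
            if risk_id not in bundle:
                problems.append(
                    f"{detector} maps to {risk_id}, which is outside the runtime bundle"
                )
    return problems


# Sample data is made up for the example.
problems = check_layer_consistency(
    registry_ids=["2.1", "2.2", "6.5", "7.3", "7.4"],
    bundle_ids=["2.2", "7.3", "7.4"],
    detector_mapping={
        "CC3_shallow_validator": ["7.3"],
        "hardcoded_credential": ["2.2"],
        "weak_provenance": ["7.4", "6.5"],
    },
)
print(problems)
```

&lt;p&gt;In this made-up data, one mapping points at an entry that exists upstream but is not in the bundle, which is exactly the confusion the separation is meant to expose.&lt;/p&gt;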




&lt;h2&gt;
  
  
  What “Governed” Means Here
&lt;/h2&gt;

&lt;p&gt;In the current &lt;code&gt;1.7.5&lt;/code&gt; state of the &lt;code&gt;1.7.x&lt;/code&gt; line, governed does not mean that every mapping has gone through an external review board.&lt;/p&gt;

&lt;p&gt;It means something narrower and more concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AIRI data is stored as versioned local artifacts&lt;/li&gt;
&lt;li&gt;runtime scan output uses a curated bundle, not the entire upstream universe&lt;/li&gt;
&lt;li&gt;detector mappings are separated from the full registry&lt;/li&gt;
&lt;li&gt;known gaps are recorded as part of the mapping layer&lt;/li&gt;
&lt;li&gt;artifact metadata surfaces AIRI registry, bundle, mapping, snapshot, and license information&lt;/li&gt;
&lt;li&gt;changes to registry, runtime bundle, or mapping versions require explicit version bumps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the current governance level.&lt;/p&gt;

&lt;p&gt;It is not final.&lt;/p&gt;

&lt;p&gt;But it is stronger than attaching a risk dataset as an unversioned appendix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Curation Logic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8rcpyd6489hb61zceot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8rcpyd6489hb61zceot.png" alt="Curated by exclusion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that matters most.&lt;/p&gt;

&lt;p&gt;AIRI is broad. STEM BIO-AI is narrow.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI does not need every AIRI entry active at runtime. It needs the subset that can be responsibly connected to deterministic repository evidence.&lt;/p&gt;

&lt;p&gt;So the runtime bundle is curated by exclusion as much as inclusion.&lt;/p&gt;

&lt;p&gt;A risk vocabulary node should stay outside the runtime bundle when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No local evidence surface exists&lt;/strong&gt;&lt;br&gt;
The scanner has no repository-level signal that can responsibly connect to that risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The mapping would require causal inference&lt;/strong&gt;&lt;br&gt;
The scanner would have to imply that harm occurred, that users were affected, or that the repository caused a risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The risk is too broad for repository-local evidence&lt;/strong&gt;&lt;br&gt;
Broad societal, geopolitical, or macroeconomic risks may be important in AIRI, but they should not become runtime scan outputs unless a local detector surface can support the mapping.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The mapping would confuse vocabulary with score authority&lt;/strong&gt;&lt;br&gt;
If a risk label might be read as changing the formal score or certifying danger, it should remain outside the runtime layer until the reporting semantics are clear.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
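
&lt;p&gt;The four exclusion criteria can be read as a simple predicate over a candidate risk entry. The sketch below is a hypothetical encoding: &lt;code&gt;CandidateRisk&lt;/code&gt; and its boolean fields are invented for illustration, and in practice each flag would be a human curation judgment rather than something computed.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRisk:
    """Hypothetical curation record; each flag is a reviewer judgment."""

    risk_id: str
    has_local_evidence_surface: bool
    requires_causal_inference: bool
    too_broad_for_repository_evidence: bool
    could_be_read_as_score_authority: bool


def exclusion_reasons(candidate: CandidateRisk) -> list:
    """Collect every criterion that keeps the entry out of the runtime bundle."""
    reasons = []
    if not candidate.has_local_evidence_surface:
        reasons.append("no local evidence surface exists")
    if candidate.requires_causal_inference:
        reasons.append("the mapping would require causal inference")
    if candidate.too_broad_for_repository_evidence:
        reasons.append("the risk is too broad for repository-local evidence")
    if candidate.could_be_read_as_score_authority:
        reasons.append("the mapping would confuse vocabulary with score authority")
    return reasons


def belongs_in_runtime_bundle(candidate: CandidateRisk) -> bool:
    # Curated by exclusion: any single reason is enough to stay outside.
    return not exclusion_reasons(candidate)


robustness = CandidateRisk("7.3", True, False, False, False)
macro_risk = CandidateRisk("example-macro-risk", False, True, True, False)
print(belongs_in_runtime_bundle(robustness))
print(exclusion_reasons(macro_risk))
```

&lt;p&gt;Returning the reasons, not just a boolean, keeps the exclusion decision reviewable.&lt;/p&gt;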

&lt;p&gt;So the runtime bundle is not a summary of all AI risk.&lt;/p&gt;

&lt;p&gt;It is the subset of risk vocabulary that the scanner can use responsibly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example: Detector Hit to AIRI Domain Vocabulary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvz0q2fw4vfp5okznynr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvz0q2fw4vfp5okznynr.png" alt="Connecting evidence to context" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concrete example helps.&lt;/p&gt;

&lt;p&gt;Suppose STEM BIO-AI detects a shallow validator around sensitive or clinical-adjacent inputs.&lt;/p&gt;

&lt;p&gt;The local finding might be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CC3_shallow_validator:
validate_* or check_* function uses only length checks without structural validation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the repository level, this is a code-contract finding.&lt;/p&gt;

&lt;p&gt;It says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the function appears to validate input&lt;/li&gt;
&lt;li&gt;the validation is shallow&lt;/li&gt;
&lt;li&gt;the implementation may not enforce the boundary implied by its name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AIRI layer should not turn that into:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this repository caused privacy harm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That would be too strong.&lt;/p&gt;

&lt;p&gt;A safer mapping uses AIRI as vocabulary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Local detector surface&lt;/th&gt;
&lt;th&gt;Local meaning&lt;/th&gt;
&lt;th&gt;AIRI vocabulary anchor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CC3_shallow_validator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;validation function appears shallower than its name implies&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;; possibly &lt;code&gt;2.1 Compromise of privacy...&lt;/code&gt; if sensitive information handling is in scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fail-open exception path&lt;/td&gt;
&lt;td&gt;code path may silently continue after failure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hardcoded credential signal&lt;/td&gt;
&lt;td&gt;repository surface suggests exposed secret-like pattern&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.2 AI system security vulnerabilities and attacks&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weak provenance surface&lt;/td&gt;
&lt;td&gt;repository gives weak evidence about data/source traceability&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;7.4 Lack of transparency or interpretability&lt;/code&gt;; possibly &lt;code&gt;6.5 Governance failure&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silent mock fallback&lt;/td&gt;
&lt;td&gt;production-like path may fall back to simulated behavior&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;; &lt;code&gt;7.4 Lack of transparency or interpretability&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mapping does not prove harm.&lt;/p&gt;

&lt;p&gt;It tells the reviewer which broader AIRI vocabulary may be relevant to the local finding.&lt;/p&gt;

&lt;p&gt;That is the difference between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this detector proves a risk occurred&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this detector finding belongs near this risk-language area.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second claim is weaker.&lt;/p&gt;

&lt;p&gt;It is also the correct claim.&lt;/p&gt;
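
&lt;p&gt;The weaker claim can be enforced in how a mapping is rendered. The sketch below encodes the table above as a dictionary and only ever emits "may relate to" language. Apart from &lt;code&gt;CC3_shallow_validator&lt;/code&gt;, the detector keys are hypothetical names chosen for the example, and the structure is an assumption rather than the format of &lt;code&gt;airi_detector_mapping.v1.json&lt;/code&gt;.&lt;/p&gt;

```python
# Anchors copied from the table above; keys other than CC3_shallow_validator
# are hypothetical identifiers for the named detector surfaces.
AIRI_ANCHORS = {
    "CC3_shallow_validator": {
        "primary": ["7.3 Lack of capability or robustness"],
        "conditional": ["2.1 Compromise of privacy..."],
    },
    "fail_open_exception": {
        "primary": ["7.3 Lack of capability or robustness"],
        "conditional": [],
    },
    "hardcoded_credential": {
        "primary": ["2.2 AI system security vulnerabilities and attacks"],
        "conditional": [],
    },
    "weak_provenance": {
        "primary": ["7.4 Lack of transparency or interpretability"],
        "conditional": ["6.5 Governance failure"],
    },
    "silent_mock_fallback": {
        "primary": [
            "7.3 Lack of capability or robustness",
            "7.4 Lack of transparency or interpretability",
        ],
        "conditional": [],
    },
}


def vocabulary_note(detector_id: str) -> str:
    """Render a hedged vocabulary note; never a harm or causality claim."""
    entry = AIRI_ANCHORS.get(detector_id)
    if entry is None:
        return f"{detector_id}: no AIRI vocabulary anchor is mapped"
    note = f"{detector_id}: may relate to " + "; ".join(entry["primary"])
    if entry["conditional"]:
        note += (
            " (possibly also "
            + "; ".join(entry["conditional"])
            + ", depending on scope)"
        )
    return note


print(vocabulary_note("CC3_shallow_validator"))
print(vocabulary_note("unmapped_detector"))
```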




&lt;h2&gt;
  
  
  Why Local Provenance Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5suzlt3jdnbanqzvc6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5suzlt3jdnbanqzvc6e.png" alt="Provenance is not cosmetic" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AIRI is external.&lt;/p&gt;

&lt;p&gt;That means STEM BIO-AI needs to answer governance questions explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which upstream snapshot is being used?&lt;/li&gt;
&lt;li&gt;which subset is active at runtime?&lt;/li&gt;
&lt;li&gt;which risks are included in the curated bundle?&lt;/li&gt;
&lt;li&gt;which risks are known gaps?&lt;/li&gt;
&lt;li&gt;which detector maps to which AIRI entry?&lt;/li&gt;
&lt;li&gt;what version of the mapping is active?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why the AIRI work matters.&lt;/p&gt;

&lt;p&gt;It is not just adding labels.&lt;/p&gt;

&lt;p&gt;It is turning risk vocabulary into a governed local data layer.&lt;/p&gt;

&lt;p&gt;In the current governance note, the upstream source is recorded as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upstream source: &lt;code&gt;https://airisk.mit.edu/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;upstream artifact: &lt;code&gt;The AI Risk Repository V4_03&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;upstream license: &lt;code&gt;MIT&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;local snapshot date: &lt;code&gt;2026-04-23&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That provenance is not cosmetic.&lt;/p&gt;

&lt;p&gt;It allows an audit artifact to say which risk vocabulary it was using when the scan was produced.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Implemented in the Current 1.7.5 State of 1.7.x
&lt;/h2&gt;

&lt;p&gt;The current AIRI layer is implemented, but bounded.&lt;/p&gt;

&lt;p&gt;Implemented surfaces include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AIRI-backed coverage surfaces in scan outputs&lt;/li&gt;
&lt;li&gt;local curated runtime bundle&lt;/li&gt;
&lt;li&gt;local registry and mapping schemas&lt;/li&gt;
&lt;li&gt;detector-to-AIRI mapping layer&lt;/li&gt;
&lt;li&gt;known-gap reporting&lt;/li&gt;
&lt;li&gt;provenance and bundle/source labeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In current scan results, &lt;code&gt;airi_risk_coverage&lt;/code&gt; is the main artifact surface for this layer.&lt;/p&gt;

&lt;p&gt;The public result contract includes AIRI fields such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;airi_registry_version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_bundle_version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_mapping_version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_bundle_scope&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_upstream_snapshot_date&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_upstream_license&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;total_risks_in_registry&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;total_risks_in_bundle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;total_risks_in_detector_scope&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;detectors_triggered&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;covered_risks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;covered_count&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;coverage_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;known_gaps_in_bundle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;known_gaps_outside_bundle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These fields matter because they let a reviewer distinguish three things that are easy to confuse: the upstream AIRI source, the local runtime bundle, and the detector mapping actually used by the scan.&lt;/p&gt;

&lt;p&gt;The important part is not only that these fields exist.&lt;/p&gt;

&lt;p&gt;The important part is that AIRI usage becomes auditable from the artifact itself.&lt;/p&gt;

&lt;p&gt;If two scans use different AIRI snapshots or mappings, that difference should not be hidden.&lt;/p&gt;
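
&lt;p&gt;A reviewer could surface that difference with a small comparison over the provenance fields. The sketch below assumes two &lt;code&gt;airi_risk_coverage&lt;/code&gt; sections have already been loaded as flat dictionaries. The field names come from the list above, while the flat layout and the sample version strings are assumptions.&lt;/p&gt;

```python
# Field names are taken from the public result contract listed above.
PROVENANCE_FIELDS = (
    "airi_registry_version",
    "airi_bundle_version",
    "airi_mapping_version",
    "airi_upstream_snapshot_date",
    "airi_upstream_license",
)


def provenance_differences(scan_a: dict, scan_b: dict) -> dict:
    """Map each differing provenance field to its (scan_a, scan_b) values."""
    differences = {}
    for field in PROVENANCE_FIELDS:
        value_a = scan_a.get(field)
        value_b = scan_b.get(field)
        if value_a != value_b:
            differences[field] = (value_a, value_b)
    return differences


# Version strings below are invented for the example.
older_scan = {
    "airi_registry_version": "v1",
    "airi_bundle_version": "v1",
    "airi_mapping_version": "v1",
    "airi_upstream_snapshot_date": "2026-04-23",
    "airi_upstream_license": "MIT",
}
newer_scan = dict(older_scan, airi_mapping_version="v2")

print(provenance_differences(older_scan, newer_scan))
```

&lt;p&gt;If the result is non-empty, coverage numbers from the two scans were produced against different vocabulary and should not be compared directly.&lt;/p&gt;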




&lt;h2&gt;
  
  
  Coverage Is Not a Safety Percentage
&lt;/h2&gt;

&lt;p&gt;AIRI coverage in STEM BIO-AI is an audit-surface concept, not a safety percentage.&lt;/p&gt;

&lt;p&gt;It does not mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the repository is safe&lt;/li&gt;
&lt;li&gt;the repository is unsafe&lt;/li&gt;
&lt;li&gt;the scanner covers all AI risk&lt;/li&gt;
&lt;li&gt;the covered percentage is a compliance score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a local deterministic finding has been mapped to a known risk-vocabulary entry inside the curated AIRI runtime layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is useful because it gives reviewers a wider frame.&lt;/p&gt;

&lt;p&gt;But it does not turn local evidence into a global safety claim.&lt;/p&gt;

&lt;p&gt;This is the same discipline used elsewhere in STEM BIO-AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring is not clinical validation&lt;/li&gt;
&lt;li&gt;advisory interpretation is not scoring authority&lt;/li&gt;
&lt;li&gt;reproducibility evidence is not automatic score authority&lt;/li&gt;
&lt;li&gt;AIRI coverage is not a safety percentage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer has a role.&lt;/p&gt;

&lt;p&gt;Each layer has a boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed in 1.7.x
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;1.7.x&lt;/code&gt; AIRI story is not simply “we added AIRI.”&lt;/p&gt;

&lt;p&gt;The actual change was a move from loose risk labeling toward governed local vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.0
&lt;/h3&gt;

&lt;p&gt;AIRI V4 integration appeared in scan outputs.&lt;/p&gt;

&lt;p&gt;The scanner began producing an &lt;code&gt;airi_risk_coverage&lt;/code&gt; section that maps triggered detector findings to AIRI risk IDs, coverage rate, and known gaps.&lt;/p&gt;

&lt;p&gt;The same release also introduced Layer 2 AST contract detectors such as &lt;code&gt;CC1&lt;/code&gt;, &lt;code&gt;CC2&lt;/code&gt;, and &lt;code&gt;CC3&lt;/code&gt;, which expanded the local detector surface available for risk-vocabulary mapping.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.1
&lt;/h3&gt;

&lt;p&gt;AIRI became a governed local data layer.&lt;/p&gt;

&lt;p&gt;The architecture separated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full local registry&lt;/li&gt;
&lt;li&gt;curated runtime bundle&lt;/li&gt;
&lt;li&gt;detector mapping registry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This release also replaced hardcoded AIRI detector mappings and known-gap lists with packaged local registry files.&lt;/p&gt;

&lt;p&gt;Runtime outputs began surfacing registry version, bundle version, mapping version, upstream snapshot date, license, and attribution note. Known gaps were also split into &lt;code&gt;known_gaps_in_bundle&lt;/code&gt; and &lt;code&gt;known_gaps_outside_bundle&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.2
&lt;/h3&gt;

&lt;p&gt;No major AIRI architecture change.&lt;/p&gt;

&lt;p&gt;The important governance point was regression stability: a same-target self-scan comparison verified no drift in &lt;code&gt;airi_risk_coverage&lt;/code&gt;, alongside score, tier, code contract, detector summary, and evidence ledger count.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.3
&lt;/h3&gt;

&lt;p&gt;No major AIRI architecture change.&lt;/p&gt;

&lt;p&gt;The release focused on runtime cleanup, stale demo wording, layout stabilization, and output routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.4
&lt;/h3&gt;

&lt;p&gt;AIRI presentation became clearer across demo and report outputs.&lt;/p&gt;

&lt;p&gt;The release surfaced AIRI summary material more clearly across the Hugging Face overview card and markdown/explain report sections.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.5
&lt;/h3&gt;

&lt;p&gt;No new AIRI data architecture change.&lt;/p&gt;

&lt;p&gt;But artifact-level governance improved more broadly through additive evidence-ledger quality fields and audit-freshness metadata.&lt;/p&gt;

&lt;p&gt;That matters because AIRI is most useful when it lives inside a report surface that already carries freshness, evidence quality, and provenance signals.&lt;/p&gt;

&lt;p&gt;The important change across the line is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AIRI moved from an attached dataset toward a versioned local risk-vocabulary layer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What This Still Does Not Do
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F455nj5o0eixcl5y3g2ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F455nj5o0eixcl5y3g2ho.png" alt="Local evidence first, external vocabulary second" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AIRI layer still does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verify real incidents&lt;/li&gt;
&lt;li&gt;prove causality&lt;/li&gt;
&lt;li&gt;certify repository safety&lt;/li&gt;
&lt;li&gt;replace domain review&lt;/li&gt;
&lt;li&gt;turn AIRI categories into deterministic truth claims&lt;/li&gt;
&lt;li&gt;collapse the full upstream AIRI universe into the runtime scanner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not missing features.&lt;/p&gt;

&lt;p&gt;They are the boundaries that keep the layer useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Could Go
&lt;/h2&gt;

&lt;p&gt;The next useful direction is not to overload the scanner with external systems.&lt;/p&gt;

&lt;p&gt;It is to improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;registry provenance&lt;/li&gt;
&lt;li&gt;bundle governance&lt;/li&gt;
&lt;li&gt;mapping confidence&lt;/li&gt;
&lt;li&gt;known-gap clarity&lt;/li&gt;
&lt;li&gt;artifact-visible mapping metadata&lt;/li&gt;
&lt;li&gt;disciplined links to incident-oriented resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader MIT AIRI ecosystem also includes related incident-oriented resources such as the AI Incident Tracker.&lt;/p&gt;

&lt;p&gt;That ecosystem is relevant context, but it is not the same thing as current runtime integration in STEM BIO-AI.&lt;/p&gt;

&lt;p&gt;A future version may choose to reference incident-oriented resources more explicitly, but deterministic scans should not ingest them casually or blur them with repository-local findings.&lt;/p&gt;

&lt;p&gt;A future version should be able to say not only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this detector maps to this AIRI risk vocabulary area.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this mapping has this confidence level, this review status, this local evidence family, and this known limitation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the next governance step.&lt;/p&gt;
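
&lt;p&gt;As a rough sketch of what such a mapping record might carry, the example below models the four attributes named above. Nothing here exists in the current release: the class, the enum values, and the allowed confidence levels are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class ReviewStatus(Enum):
    """Hypothetical review states for a detector-to-AIRI mapping."""

    UNREVIEWED = "unreviewed"
    SELF_REVIEWED = "self_reviewed"
    EXTERNALLY_REVIEWED = "externally_reviewed"


ALLOWED_CONFIDENCE = ("low", "medium", "high")  # assumed scale


@dataclass(frozen=True)
class MappingRecord:
    """Hypothetical future mapping entry with governance metadata."""

    detector_id: str
    airi_anchor: str
    confidence: str
    review_status: ReviewStatus
    evidence_family: str
    known_limitation: str

    def __post_init__(self):
        # Reject free-form confidence values so reports stay comparable.
        if self.confidence not in ALLOWED_CONFIDENCE:
            raise ValueError(f"unknown confidence level: {self.confidence!r}")
        # A mapping without a stated limitation would overclaim.
        if not self.known_limitation.strip():
            raise ValueError("known_limitation must not be empty")


record = MappingRecord(
    detector_id="CC3_shallow_validator",
    airi_anchor="7.3 Lack of capability or robustness",
    confidence="medium",
    review_status=ReviewStatus.SELF_REVIEWED,
    evidence_family="AST contract detectors",
    known_limitation="vocabulary anchor only; does not show that harm occurred",
)
print(record.review_status.value)
```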




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtlt5pk3yg79aconki1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtlt5pk3yg79aconki1l.png" alt="A governed bridge for STEM BIO-AI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the role of AIRI in this release line.&lt;/p&gt;

&lt;p&gt;Not truth replacement.&lt;/p&gt;

&lt;p&gt;Not safety certification.&lt;/p&gt;

&lt;p&gt;Not incident proof.&lt;/p&gt;

&lt;p&gt;A governed vocabulary bridge.&lt;/p&gt;

&lt;p&gt;Local evidence first.&lt;/p&gt;

&lt;p&gt;External vocabulary second.&lt;/p&gt;

&lt;p&gt;Explicit provenance always.&lt;/p&gt;




&lt;h2&gt;
  
  
  References and Acknowledgment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MIT AI Risk Repository: &lt;a href="https://airisk.mit.edu/" rel="noopener noreferrer"&gt;https://airisk.mit.edu/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MIT AI Incident Tracker: &lt;a href="https://airisk.mit.edu/ai-incident-tracker" rel="noopener noreferrer"&gt;https://airisk.mit.edu/ai-incident-tracker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;STEM BIO-AI repository: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This AIRI-related direction in STEM BIO-AI was informed by broader public AI risk work, including the MIT AI Risk Repository ecosystem.&lt;/p&gt;

&lt;p&gt;The framing around AIRI as a broader risk-vocabulary layer, rather than a repository-local truth layer, was also strengthened by public commentary and ecosystem work from people in this space, including Peter Slattery, PhD.&lt;/p&gt;

&lt;p&gt;These references informed the vocabulary and governance direction described here. They do not imply endorsement of STEM BIO-AI or responsibility for its implementation choices.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>bioinformatics</category>
      <category>opensource</category>
    </item>
    <item>
      <title>When Control Becomes Authority: Calibration Governance in STEM BIO-AI 1.7.x</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 14 May 2026 05:41:41 +0000</pubDate>
      <link>https://dev.to/flamehaven01/when-control-becomes-authority-calibration-governance-in-stem-bio-ai-17x-52hf</link>
      <guid>https://dev.to/flamehaven01/when-control-becomes-authority-calibration-governance-in-stem-bio-ai-17x-52hf</guid>
      <description>&lt;p&gt;Control slowly becomes authority when nobody marks the boundary.&lt;/p&gt;

&lt;p&gt;That is the calibration problem I kept running into while building STEM BIO-AI.&lt;/p&gt;

&lt;p&gt;At first, STEM BIO-AI was centered on the score. It scanned a local bio or medical AI repository, inspected observable repository surfaces, and mapped the repository to a structured review tier.&lt;/p&gt;

&lt;p&gt;That was useful.&lt;/p&gt;

&lt;p&gt;But it was not enough.&lt;/p&gt;

&lt;p&gt;The harder problem was not producing a number. The harder problem was preventing every useful adjacent signal from becoming part of that number.&lt;/p&gt;

&lt;p&gt;In a bio/medical AI repository review system, several lanes can look similar if the tool is not careful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic scoring&lt;/li&gt;
&lt;li&gt;diagnostic findings&lt;/li&gt;
&lt;li&gt;replication evidence&lt;/li&gt;
&lt;li&gt;advisory interpretation&lt;/li&gt;
&lt;li&gt;domain-specific review posture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all matter.&lt;/p&gt;

&lt;p&gt;But they should not all have the same authority.&lt;/p&gt;

&lt;p&gt;That is the core reason calibration became a governance problem in the &lt;code&gt;1.7.x&lt;/code&gt; line.&lt;/p&gt;

&lt;p&gt;The principle is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;easy experimentation, hard drift&lt;/strong&gt;&lt;br&gt;
STEM BIO-AI should let researchers express review posture. It should let operators simulate policy changes. It should make policy metadata visible in artifacts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But it should not let those inputs silently mutate the official score.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Short Context for New Readers
&lt;/h2&gt;

&lt;p&gt;STEM BIO-AI is a deterministic evidence-surface scanner for bio and medical AI repositories.&lt;/p&gt;

&lt;p&gt;It does not validate biomedical efficacy. It does not certify clinical safety. It does not prove that a model is correct.&lt;/p&gt;

&lt;p&gt;It scans observable repository surfaces such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README and docs&lt;/li&gt;
&lt;li&gt;code structure&lt;/li&gt;
&lt;li&gt;CI configuration&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;evidence and boundary language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formal score is currently built from three weighted score-bearing stages, plus an explicit credential penalty and clinical cap or hard-floor logic:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1&lt;/td&gt;
&lt;td&gt;README / stated evidence boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2R&lt;/td&gt;
&lt;td&gt;repo-local consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3&lt;/td&gt;
&lt;td&gt;code and bio-responsibility surface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The active formula also applies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C1_penalty&lt;/code&gt; when hardcoded credentials are detected&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;score_cap&lt;/code&gt; or &lt;code&gt;t0_hard_floor&lt;/code&gt; when clinical-adjacent boundary rules require it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stage 4 exists, but it is a separate replication lane. It reports reproducibility and replication posture without automatically changing the formal score.&lt;/p&gt;

&lt;p&gt;That separation is intentional.&lt;/p&gt;
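
&lt;p&gt;To make the separation concrete, here is a deliberately simplified sketch. It is not the STEM BIO-AI formula: the weights, the 0 to 100 scale, subtracting the penalty, and treating &lt;code&gt;score_cap&lt;/code&gt; as a plain upper bound are all assumptions, and &lt;code&gt;t0_hard_floor&lt;/code&gt; is not modeled at all. The only thing it tries to show is that the Stage 4 report travels next to the score without feeding into it.&lt;/p&gt;

```python
def formal_score(stage_scores, weights, c1_penalty=0.0, score_cap=None):
    """Toy weighted score; every numeric rule here is an assumption."""
    total_weight = sum(weights.values())
    base = sum(stage_scores[name] * weights[name] for name in weights) / total_weight
    score = max(0.0, base - c1_penalty)  # assumed: penalty is subtracted
    if score_cap is not None:
        score = min(score, score_cap)  # assumed: cap is a simple upper bound
    return round(score, 2)


def build_result(stage_scores, weights, stage_4_report, **score_options):
    # Stage 4 is attached to the result but never passed into formal_score.
    return {
        "formal_score": formal_score(stage_scores, weights, **score_options),
        "stage_4_replication": stage_4_report,
    }


# Hypothetical inputs for illustration only.
stage_scores = {"stage_1": 80.0, "stage_2r": 60.0, "stage_3": 70.0}
weights = {"stage_1": 1, "stage_2r": 1, "stage_3": 2}

weak = build_result(stage_scores, weights, {"posture": "no replication evidence"})
strong = build_result(stage_scores, weights, {"posture": "replication documented"})
print(weak["formal_score"], strong["formal_score"])
```

&lt;p&gt;Both results carry the same formal score even though their replication posture differs, which is the behavior the separate lane is meant to guarantee.&lt;/p&gt;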




&lt;h2&gt;
  
  
  What Is Actually Implemented in the Current 1.7.5 State of 1.7.x
&lt;/h2&gt;

&lt;p&gt;Before discussing calibration philosophy, the implementation boundary has to be clear.&lt;/p&gt;

&lt;p&gt;In the current &lt;code&gt;1.7.5&lt;/code&gt; state of the &lt;code&gt;1.7.x&lt;/code&gt; line, STEM BIO-AI has implemented a real calibration architecture, but it is still mostly a mirror-only and preview-oriented architecture.&lt;/p&gt;

&lt;p&gt;This post describes the current released state of the &lt;code&gt;1.7.x&lt;/code&gt; line as of &lt;code&gt;v1.7.5&lt;/code&gt;, not a future authoritative-read-through design.&lt;/p&gt;

&lt;p&gt;Implemented surfaces include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packaged calibration profiles&lt;/li&gt;
&lt;li&gt;schema and runtime validation&lt;/li&gt;
&lt;li&gt;profile identity surfaced in result metadata&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy explain&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy derive&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy simulate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;simulation-only local profile files&lt;/li&gt;
&lt;li&gt;profile hashes and read-mode metadata in artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current named recommendation surface is intentionally narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;default&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;strict_clinical_adjacency&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;reproducibility_first&lt;/code&gt; is still a draft posture, not an active release-grade named recommendation.&lt;/p&gt;

&lt;p&gt;The important limitation is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the authoritative scan scoring path is still protected from arbitrary user-provided profile mutation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, &lt;code&gt;scan --policy &amp;lt;name&amp;gt;&lt;/code&gt; can surface selected profile metadata. &lt;code&gt;policy derive&lt;/code&gt; and &lt;code&gt;policy simulate&lt;/code&gt; can show governed preview behavior. But user-provided profile files do not simply become the official scoring authority.&lt;/p&gt;

&lt;p&gt;More specifically, local profile files are currently accepted only by &lt;code&gt;stem policy simulate&lt;/code&gt;, and the CLI rejects them unless the file remains &lt;code&gt;mirror_only&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is not a missing convenience.&lt;/p&gt;

&lt;p&gt;That is the boundary being tested before it is allowed to become authority.&lt;/p&gt;
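
&lt;p&gt;A minimal sketch of that kind of gate is shown below. It is not the actual CLI code: the &lt;code&gt;read_mode&lt;/code&gt; key, the &lt;code&gt;ProfileRejected&lt;/code&gt; exception, and hashing the canonical JSON form with SHA-256 are assumptions chosen to illustrate the idea of accepting only &lt;code&gt;mirror_only&lt;/code&gt; profiles and recording a profile hash.&lt;/p&gt;

```python
import hashlib
import json


class ProfileRejected(ValueError):
    """Raised when a local profile is not allowed on the simulation path."""


def load_simulation_profile(profile: dict) -> dict:
    """Accept a local profile only if it declares itself mirror_only."""
    # Assumption: the read mode is stored under a "read_mode" key.
    read_mode = profile.get("read_mode")
    if read_mode != "mirror_only":
        raise ProfileRejected(
            f"local profiles must stay mirror_only, got {read_mode!r}"
        )
    # Hash a canonical JSON form so the same profile always yields the same hash.
    canonical = json.dumps(profile, sort_keys=True).encode("utf-8")
    return {
        "profile": dict(profile),
        "profile_hash": hashlib.sha256(canonical).hexdigest(),
        "read_mode": read_mode,
    }


accepted = load_simulation_profile({"name": "local_draft", "read_mode": "mirror_only"})
print(accepted["read_mode"], len(accepted["profile_hash"]))

try:
    load_simulation_profile({"name": "local_draft", "read_mode": "authoritative"})
except ProfileRejected as error:
    print("rejected:", error)
```

&lt;p&gt;Failing closed on a missing or unexpected read mode means a profile can never gain authority simply by omitting the field.&lt;/p&gt;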




&lt;h2&gt;
  
  
  The Pressure That Causes Drift
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lcdsepmkhe1s4e9or9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lcdsepmkhe1s4e9or9z.png" alt="Formal score and advisory tuning drift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One question pushed this design forward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If advisory AI becomes more capable, will teams really keep the boundary between formal score and advisory interpretation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I do not think the answer is automatically yes.&lt;/p&gt;

&lt;p&gt;If an advisory layer becomes helpful, there will always be pressure to let it influence the formal score "just a little."&lt;/p&gt;

&lt;p&gt;That is usually how audit systems drift.&lt;/p&gt;

&lt;p&gt;The score stops being a stable artifact and starts becoming a moving interpretation layer.&lt;/p&gt;

&lt;p&gt;The danger is not that users want control.&lt;/p&gt;

&lt;p&gt;The danger is that control slowly becomes authority without anyone noticing.&lt;/p&gt;

&lt;p&gt;So the design question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we let people tune the system more freely?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The design question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we let people express domain judgment without making the formal score easy to mutate?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is where calibration enters.&lt;/p&gt;


&lt;h2&gt;
  
  
  Calibration Is Not a Tuning Console
&lt;/h2&gt;

&lt;p&gt;The wrong calibration UX looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_1_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2r_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_3_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ca_no_disclaimer_cap"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;61&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"b2_partial_credit_mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"looser"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is editable.&lt;/p&gt;

&lt;p&gt;But editable is not the same as governed.&lt;/p&gt;

&lt;p&gt;Most researchers, operators, and domain reviewers do not think in raw score constants. They usually know something closer to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clinical-adjacent claims should be treated very strictly&lt;/li&gt;
&lt;li&gt;reproducibility matters strongly in this environment&lt;/li&gt;
&lt;li&gt;README polish should not outweigh code evidence&lt;/li&gt;
&lt;li&gt;a casual mention of "limitations" should not count as meaningful transparency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the current calibration design starts with posture questions, not raw constants.&lt;/p&gt;

&lt;p&gt;The goal is not to ask a researcher to become a scoring-engine maintainer.&lt;/p&gt;

&lt;p&gt;The goal is to let a researcher express domain posture while keeping the formal scoring boundary visible, versioned, and difficult to mutate accidentally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The &lt;code&gt;1–5&lt;/code&gt; Scale Is Input, Not Authority
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda62dggwxf9j1962dh7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda62dggwxf9j1962dh7k.png" alt="Posture over raw constants" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the current design, the user-facing intent layer uses a &lt;code&gt;1–5&lt;/code&gt; scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; = minimal emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2&lt;/code&gt; = light emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;3&lt;/code&gt; = moderate emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;4&lt;/code&gt; = strong emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;5&lt;/code&gt; = very strong emphasis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important line is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the &lt;code&gt;1–5&lt;/code&gt; scale is a UX input surface, not part of the formal score engine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means the user can express posture in a natural way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clinical strictness&lt;/li&gt;
&lt;li&gt;code-integrity priority&lt;/li&gt;
&lt;li&gt;reproducibility priority&lt;/li&gt;
&lt;li&gt;structured limitations requirement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those answers do not directly become score constants.&lt;/p&gt;

&lt;p&gt;They are translated through explicit rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wha5pt8fn2grrequosz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wha5pt8fn2grrequosz.png" alt="Governing decision matrix" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The current decision table is intentionally narrow:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;clinical_strictness &amp;gt;= 4&lt;/code&gt; and &lt;code&gt;reproducibility_priority &amp;lt;= 3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;recommend &lt;code&gt;strict_clinical_adjacency&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all four values are &lt;code&gt;2&lt;/code&gt; or &lt;code&gt;3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;keep &lt;code&gt;default&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no named-profile rule matches&lt;/td&gt;
&lt;td&gt;generate a &lt;code&gt;preview_only&lt;/code&gt; profile delta from bounded deltas only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This table should not be mistaken for an empirically optimized model.&lt;/p&gt;

&lt;p&gt;It is a conservative governance rule table.&lt;/p&gt;

&lt;p&gt;The current threshold choices are design-steward decisions, not claims of statistical optimality. Their purpose is to keep the translation layer narrow, reviewable, and non-authoritative until a stronger benchmark-backed promotion process exists.&lt;/p&gt;

&lt;p&gt;That matters because a calibration system can fail in two opposite ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it can be too rigid for domain experts to use&lt;/li&gt;
&lt;li&gt;it can be so flexible that every local preference becomes a new score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The initial rule table chooses the safer failure mode.&lt;/p&gt;

&lt;p&gt;If a posture is clearly within an existing release-grade profile, the system can recommend that profile. If the posture is ambiguous or combines competing priorities, the system falls back to &lt;code&gt;preview_only&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clinical_strictness = 4
reproducibility_priority = 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That does not automatically recommend &lt;code&gt;strict_clinical_adjacency&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It falls back to &lt;code&gt;preview_only&lt;/code&gt;, because two strong postures are competing and no release-grade named profile currently resolves that conflict.&lt;/p&gt;
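&lt;p&gt;The decision table above is small enough to write out directly. The following is a sketch, not the shipped implementation: the &lt;code&gt;Posture&lt;/code&gt; class and the &lt;code&gt;recommend&lt;/code&gt; function are hypothetical names, and only the three rules from the table are encoded.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posture:
    # Each field is a 1-5 intent value from the user-facing layer.
    clinical_strictness: int
    code_integrity_priority: int
    reproducibility_priority: int
    structured_limitations_requirement: int


def recommend(posture: Posture) -> str:
    values = (
        posture.clinical_strictness,
        posture.code_integrity_priority,
        posture.reproducibility_priority,
        posture.structured_limitations_requirement,
    )
    if any(v not in range(1, 6) for v in values):
        raise ValueError("posture values must be integers from 1 to 5")
    # Rule 1: strict clinical posture with reproducibility_priority at most 3.
    if posture.clinical_strictness >= 4 and 3 >= posture.reproducibility_priority:
        return "strict_clinical_adjacency"
    # Rule 2: every answer sits in the moderate band (2 or 3).
    if all(v in (2, 3) for v in values):
        return "default"
    # Rule 3: no named-profile rule matched, so only a preview is produced.
    return "preview_only"
```

&lt;p&gt;With &lt;code&gt;clinical_strictness = 4&lt;/code&gt; and &lt;code&gt;reproducibility_priority = 4&lt;/code&gt;, neither named rule matches in this sketch, whatever the other two values are, so the result is &lt;code&gt;preview_only&lt;/code&gt;.&lt;/p&gt;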

&lt;p&gt;A hidden similarity function might produce something that looks more flexible.&lt;/p&gt;

&lt;p&gt;But it would also make the governance harder to audit.&lt;/p&gt;

&lt;p&gt;A narrow rule table is less magical.&lt;/p&gt;

&lt;p&gt;It is also safer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the CLI Is Allowed to Do
&lt;/h2&gt;

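&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4mi9izlgqwmzhb62e7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4mi9izlgqwmzhb62e7g.png" alt="Easy experimentation, hard drift: sandbox and vault"&gt;&lt;/a&gt;&lt;/p&gt;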

&lt;p&gt;The preview workflow can look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem policy derive &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clinical-strictness&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--code-integrity-priority&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reproducibility-priority&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--structured-limitations-requirement&lt;/span&gt; 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem policy simulate /path/to/repo &lt;span class="nt"&gt;--profile-file&lt;/span&gt; my_profile.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But those flows are not the same as saying:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--stage1-weight&lt;/span&gt; 0.35 &lt;span class="nt"&gt;--cap&lt;/span&gt; 72
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first two are governed preview surfaces.&lt;/p&gt;

&lt;p&gt;The last one is an untracked tuning console.&lt;/p&gt;

&lt;p&gt;The design intentionally supports the first and rejects the shape of the last.&lt;/p&gt;

&lt;p&gt;This is the practical meaning of easy experimentation, hard drift.&lt;/p&gt;
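&lt;p&gt;One way to picture "rejecting the shape" is a command-line parser that simply has no options for raw constants. The sketch below uses Python's standard &lt;code&gt;argparse&lt;/code&gt; module. It is a hypothetical stand-in that mirrors the commands quoted above, not the real CLI definition, and &lt;code&gt;build_parser&lt;/code&gt; and &lt;code&gt;is_accepted&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: it mirrors the commands quoted above but is not
    # the real CLI definition.
    parser = argparse.ArgumentParser(prog="stem")
    sub = parser.add_subparsers(dest="command", required=True)

    scan = sub.add_parser("scan")
    scan.add_argument("repo")
    scan.add_argument("--policy", default="default")
    # Deliberately no --stage1-weight or --cap options on the scan path.

    policy = sub.add_parser("policy")
    policy_sub = policy.add_subparsers(dest="action", required=True)
    simulate = policy_sub.add_parser("simulate")
    simulate.add_argument("repo")
    simulate.add_argument("--profile-file")
    return parser


def is_accepted(argv: list) -> bool:
    # argparse reports unknown options by exiting, so SystemExit means rejection.
    try:
        build_parser().parse_args(argv)
        return True
    except SystemExit:
        return False
```

&lt;p&gt;In this sketch a named policy and a preview simulation both parse, while a scan carrying &lt;code&gt;--stage1-weight&lt;/code&gt; and &lt;code&gt;--cap&lt;/code&gt; is refused before any scoring code runs.&lt;/p&gt;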




&lt;h2&gt;
  
  
  What Actually Gets Verified
&lt;/h2&gt;

&lt;p&gt;The central claim of this design is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the current calibration rules are perfect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The claim is narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;calibration changes should not become score authority without a visible governance path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That claim can be tested by checking whether the system exposes or blocks the relevant control surfaces.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Drift risk&lt;/th&gt;
&lt;th&gt;Expected control&lt;/th&gt;
&lt;th&gt;How to verify it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;arbitrary score tuning&lt;/td&gt;
&lt;td&gt;no free-form CLI weight / cap override&lt;/td&gt;
&lt;td&gt;CLI help and accepted options do not expose direct score constants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden profile mutation&lt;/td&gt;
&lt;td&gt;profile status and read mode are surfaced&lt;/td&gt;
&lt;td&gt;result artifacts expose profile metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;unclear profile identity&lt;/td&gt;
&lt;td&gt;profile name, version, and hash are visible&lt;/td&gt;
&lt;td&gt;scan output includes calibration profile identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;advisory influence leakage&lt;/td&gt;
&lt;td&gt;advisory output cannot override score&lt;/td&gt;
&lt;td&gt;advisory response validation cannot mutate &lt;code&gt;final_score&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reproducibility overcompensation&lt;/td&gt;
&lt;td&gt;Stage 4 remains separate&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;replication_score&lt;/code&gt; does not change &lt;code&gt;formal_tier&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;premature named-profile expansion&lt;/td&gt;
&lt;td&gt;ambiguous postures fall back to preview&lt;/td&gt;
&lt;td&gt;derive/simulate returns &lt;code&gt;preview_only&lt;/code&gt; when no named rule matches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;detector promotion drift&lt;/td&gt;
&lt;td&gt;evidence-only detectors are not score-authoritative&lt;/td&gt;
&lt;td&gt;detector policy is versioned in policy files and governance docs, even though per-detector score-integration status is not yet surfaced as first-class artifact metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is still not the same as a full empirical benchmark.&lt;/p&gt;

&lt;p&gt;But it is a real verification target.&lt;/p&gt;

&lt;p&gt;The system can be checked for whether it allows the forbidden mutation path.&lt;/p&gt;

&lt;p&gt;That is the level of proof appropriate for this release line: not "the final policy is optimal," but "the policy cannot quietly become authoritative without leaving a trace."&lt;/p&gt;

&lt;p&gt;That trace is stronger for some surfaces than others. Profile identity, hash, and read mode are already artifact-visible in &lt;code&gt;1.7.5&lt;/code&gt;. Detector promotion semantics are already versioned and documented, but they are not yet surfaced as first-class per-detector policy metadata in the result object.&lt;/p&gt;
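&lt;p&gt;To make "artifact-visible" concrete, here is a rough sketch of how a profile identity block could be derived. The field names (&lt;code&gt;profile_name&lt;/code&gt;, &lt;code&gt;profile_version&lt;/code&gt;, &lt;code&gt;profile_hash&lt;/code&gt;, &lt;code&gt;read_mode&lt;/code&gt;) and the canonical-JSON hashing scheme are assumptions for illustration. The actual artifact layout may differ.&lt;/p&gt;

```python
import hashlib
import json


def profile_identity(profile: dict) -> dict:
    # Canonical JSON (sorted keys, fixed separators) so that equal content
    # always produces an equal hash, regardless of key order in the file.
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {
        "profile_name": profile["name"],
        "profile_version": profile["version"],
        "profile_hash": digest,
        "read_mode": profile["read_mode"],
    }


# Hypothetical profile content, used only to exercise the function.
EXAMPLE = {
    "name": "default",
    "version": "1.7.5",
    "read_mode": "mirror_only",
    "constants": {"ca_no_disclaimer_cap": 61},
}
```

&lt;p&gt;The useful property is that any edit to a constant changes the hash, so a mutated profile cannot carry the same identity into a result artifact.&lt;/p&gt;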




&lt;h2&gt;
  
  
  The B2 Tightening Example
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxrlajsvmqp5jiurvatd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxrlajsvmqp5jiurvatd.png" alt="Deterministic boundary changes in B2 tightening" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The clearest scoring example is Stage 3 B2.&lt;/p&gt;

&lt;p&gt;B2 is the bias and limitations measurement surface. Earlier scoring behavior allowed a weaker boundary: a simple vocabulary-level signal could still receive partial credit.&lt;/p&gt;

&lt;p&gt;That became too permissive.&lt;/p&gt;

&lt;p&gt;A repository that mentions "bias" or "limitations" once is not necessarily disclosing a meaningful boundary. It may only be surface signaling.&lt;/p&gt;

&lt;p&gt;So the B2 rule became stricter.&lt;/p&gt;

&lt;p&gt;The important change is not a marketing claim about benchmark improvement. The important change is a deterministic boundary change:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Earlier posture&lt;/th&gt;
&lt;th&gt;Tightened posture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no bias / limitations vocabulary&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minimal single-term mention only&lt;/td&gt;
&lt;td&gt;partial credit possible&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structured limitations language&lt;/td&gt;
&lt;td&gt;partial credit possible&lt;/td&gt;
&lt;td&gt;partial credit possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quantitative measurement evidence&lt;/td&gt;
&lt;td&gt;full credit possible&lt;/td&gt;
&lt;td&gt;full credit possible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the first place where calibration becomes visible as more than a principle.&lt;/p&gt;

&lt;p&gt;The rule change creates a concrete score path difference:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a repository that previously depended only on a minimal single-term limitations mention no longer has a B2 partial-credit path after the tightening.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the current public claim.&lt;/p&gt;

&lt;p&gt;I am not presenting a benchmark-wide before/after score delta here, because that would require a pinned fixture set and published comparison protocol.&lt;/p&gt;

&lt;p&gt;Without that, a claimed "T3 became T2" example would be anecdotal at best and misleading at worst.&lt;/p&gt;

&lt;p&gt;So the honest evidence level is rule-level impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the credit path changed&lt;/li&gt;
&lt;li&gt;the changed path is deterministic&lt;/li&gt;
&lt;li&gt;the changed path is inspectable&lt;/li&gt;
&lt;li&gt;benchmark-level deltas should be published only when the fixture protocol is pinned&lt;/li&gt;
&lt;/ul&gt;
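&lt;p&gt;The rule-level change can be written as two small lookup tables. This is a sketch of the boundary in the table above, not the scoring code itself: the &lt;code&gt;B2Evidence&lt;/code&gt; enum and the &lt;code&gt;"none"&lt;/code&gt; / &lt;code&gt;"partial"&lt;/code&gt; / &lt;code&gt;"full"&lt;/code&gt; labels are illustrative names, and no point values are implied.&lt;/p&gt;

```python
from enum import Enum


class B2Evidence(Enum):
    NONE = "no bias / limitations vocabulary"
    SINGLE_TERM = "minimal single-term mention only"
    STRUCTURED = "structured limitations language"
    QUANTITATIVE = "quantitative measurement evidence"


# Highest credit path available per case; labels only, not point values.
EARLIER = {
    B2Evidence.NONE: "none",
    B2Evidence.SINGLE_TERM: "partial",
    B2Evidence.STRUCTURED: "partial",
    B2Evidence.QUANTITATIVE: "full",
}
# The tightening removes exactly one path: single-term mentions earn nothing.
TIGHTENED = {**EARLIER, B2Evidence.SINGLE_TERM: "none"}


def changed_cases() -> list:
    # Cases whose available credit path differs between the two rule sets.
    return [case for case in B2Evidence if EARLIER[case] != TIGHTENED[case]]
```

&lt;p&gt;Diffing the two tables yields a single changed case, which is the whole of the claim being made here.&lt;/p&gt;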

&lt;p&gt;In clinical-adjacent repositories, limitation language is not decoration. It is part of the claim boundary.&lt;/p&gt;

&lt;p&gt;A one-word mention does not carry the same weight as a structured limitations section, demographic coverage statement, known failure-mode description, or quantitative subgroup analysis.&lt;/p&gt;

&lt;p&gt;This is why calibration cannot be only a UI problem.&lt;/p&gt;

&lt;p&gt;If a user asks for a stricter limitations posture, the system should not silently subtract points through a hidden override. It should expose the rule that changed and the reason that rule exists.&lt;/p&gt;

&lt;p&gt;That is the difference between a score tweak and a governed scoring rationale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Stage 4 Stays Separate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqje510bvt6vrw13vov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqje510bvt6vrw13vov.png" alt="Importance is not score authority" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage 4 is the place where the strongest counterargument appears.&lt;/p&gt;

&lt;p&gt;The counterargument is fair:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If reproducibility is important, why does it not affect the formal score?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My answer is that importance and score authority are not the same thing.&lt;/p&gt;

&lt;p&gt;Stage 4 measures replication posture: containers, reproducibility targets, dependency locks, artifact references, seeds, citation surfaces, and similar evidence.&lt;/p&gt;

&lt;p&gt;Those signals matter.&lt;/p&gt;

&lt;p&gt;But they do not mean the same thing as the formal claim boundary.&lt;/p&gt;

&lt;p&gt;A repository can be highly reproducible and still make unsafe or unbounded clinical claims.&lt;/p&gt;

&lt;p&gt;A repository can have clean containers and dependency locks while still lacking a clinical-use disclaimer.&lt;/p&gt;

&lt;p&gt;A repository can be easy to rerun while still having weak data provenance or shallow limitation language.&lt;/p&gt;

&lt;p&gt;If Stage 4 were allowed to lift the formal score too early, reproducibility could start compensating for claim-boundary weakness.&lt;/p&gt;

&lt;p&gt;That would be a different scoring philosophy.&lt;/p&gt;

&lt;p&gt;It may become valid in the future, but only if the rule is explicit.&lt;/p&gt;

&lt;p&gt;For now, Stage 4 is reported as a separate lane because the system is saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reproducibility matters&lt;/li&gt;
&lt;li&gt;reproducibility should be visible&lt;/li&gt;
&lt;li&gt;reproducibility should affect review interpretation&lt;/li&gt;
&lt;li&gt;reproducibility should not silently override the formal score boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why stronger reproducibility intent currently falls back to &lt;code&gt;preview_only&lt;/code&gt; instead of becoming a release-grade named profile.&lt;/p&gt;

&lt;p&gt;The system is not saying reproducibility is unimportant.&lt;/p&gt;

&lt;p&gt;It is saying reproducibility has not yet been granted formal score authority.&lt;/p&gt;
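&lt;p&gt;Structurally, "separate lane" can be as simple as making the tier a function of the formal score alone. The sketch below assumes a hypothetical &lt;code&gt;ScanReport&lt;/code&gt; object, and the tier labels and cut-offs are invented for illustration. Only the field names &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;replication_score&lt;/code&gt;, and &lt;code&gt;formal_tier&lt;/code&gt; come from the article.&lt;/p&gt;

```python
from dataclasses import dataclass


def tier_for(final_score: int) -> str:
    # Hypothetical labels and cut-offs, chosen only for illustration.
    if final_score >= 80:
        return "upper"
    if final_score >= 60:
        return "middle"
    return "lower"


@dataclass(frozen=True)
class ScanReport:
    final_score: int
    replication_score: int  # Stage 4 lane, reported next to the formal score

    @property
    def formal_tier(self) -> str:
        # The tier reads final_score only; replication_score is never an input.
        return tier_for(self.final_score)
```

&lt;p&gt;Two reports with the same &lt;code&gt;final_score&lt;/code&gt; and very different &lt;code&gt;replication_score&lt;/code&gt; values land in the same tier, which is exactly the behavior a verification test can pin.&lt;/p&gt;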




&lt;h2&gt;
  
  
  Advisory AI Uses the Same Boundary
&lt;/h2&gt;

&lt;p&gt;Advisory AI follows the same rule.&lt;/p&gt;

&lt;p&gt;Helpful interpretation is not score authority.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI can export provider-neutral advisory packets and validate downstream advisory responses, but the deterministic scanner does not need an external model runtime to produce the formal score.&lt;/p&gt;

&lt;p&gt;If an advisory system becomes useful, it may help interpret findings, prioritize review, or explain evidence patterns.&lt;/p&gt;

&lt;p&gt;But unless a future release explicitly changes the policy, advisory output remains structurally subordinate to the deterministic score.&lt;/p&gt;
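&lt;p&gt;A small sketch of what "structurally subordinate" could mean in code, assuming a hypothetical &lt;code&gt;FormalResult&lt;/code&gt; object and an &lt;code&gt;attach_advisory&lt;/code&gt; validator: advisory responses may add notes, but any response that carries a score field is rejected, and the result object itself is immutable. None of these names are taken from the real codebase.&lt;/p&gt;

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FormalResult:
    final_score: int
    advisory_notes: tuple = ()


# Hypothetical allow-list: the only key an advisory response may contain.
ALLOWED_ADVISORY_KEYS = frozenset({"notes"})


def attach_advisory(result: FormalResult, response: dict) -> FormalResult:
    # Reject any advisory response that tries to carry score-bearing fields.
    unexpected = set(response) - ALLOWED_ADVISORY_KEYS
    if unexpected:
        raise ValueError(f"advisory response has forbidden keys: {sorted(unexpected)}")
    # Return a new object; final_score is copied unchanged from the original.
    return replace(result, advisory_notes=tuple(response.get("notes", ())))
```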

&lt;p&gt;That is enough for this article.&lt;/p&gt;

&lt;p&gt;The broader advisory boundary is a separate topic.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Scoring Tool to Audit Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhihip6jhydpu5n3hszqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhihip6jhydpu5n3hszqd.png" alt="From scoring tool to audit custody" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;1.7.x&lt;/code&gt; transition is best understood as a shift in the questions the tool is expected to answer.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Earlier scoring-tool question&lt;/th&gt;
&lt;th&gt;Audit-workflow question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What score did the repository get?&lt;/td&gt;
&lt;td&gt;Which policy profile was visible when the score was produced?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which stage contributed most?&lt;/td&gt;
&lt;td&gt;Was that stage score-authoritative, diagnostic, or separate-lane evidence?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What evidence triggered the tier?&lt;/td&gt;
&lt;td&gt;Did the evidence change the formal score or only the review posture?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What should the user fix?&lt;/td&gt;
&lt;td&gt;Would a proposed policy change be preview-only, experimental, benchmark-candidate, or release-authoritative?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why I describe &lt;code&gt;1.7.x&lt;/code&gt; as an audit-system transition.&lt;/p&gt;

&lt;p&gt;The score still matters.&lt;/p&gt;

&lt;p&gt;But the system is increasingly designed around the custody of the score: where it came from, what was allowed to influence it, and what was intentionally kept outside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Still Does Not Do
&lt;/h2&gt;

&lt;p&gt;This boundary is just as important as the implementation.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI still does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate biomedical efficacy&lt;/li&gt;
&lt;li&gt;certify benchmark truth&lt;/li&gt;
&lt;li&gt;determine clinical deployment safety&lt;/li&gt;
&lt;li&gt;let advisory AI overwrite the formal score&lt;/li&gt;
&lt;li&gt;open arbitrary numeric tuning in the official scan path&lt;/li&gt;
&lt;li&gt;allow profile experimentation to become official policy without governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not missing conveniences.&lt;/p&gt;

&lt;p&gt;They are boundaries.&lt;/p&gt;

&lt;p&gt;A strong repository evidence tier is still an observable repository-surface signal. It is not clinical clearance, regulatory approval, or proof of biomedical validity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Next Version Direction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitvcewljd7d2ydmfsmoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitvcewljd7d2ydmfsmoh.png" alt="The next step: policy parity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next important step is not adding more knobs.&lt;/p&gt;

&lt;p&gt;It is authoritative policy read-through in parity mode.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the default policy profile becomes the source read by the scoring path&lt;/li&gt;
&lt;li&gt;existing fixtures should show no score or tier drift&lt;/li&gt;
&lt;li&gt;policy hashes remain visible in artifacts&lt;/li&gt;
&lt;li&gt;non-default and researcher-provided profiles remain governed preview surfaces until promoted&lt;/li&gt;
&lt;li&gt;score-affecting policy changes become explicit release events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a big-bang rewrite.&lt;/p&gt;

&lt;p&gt;It is authority relocation.&lt;/p&gt;

&lt;p&gt;The goal is to move score-affecting constants into versioned policy objects without changing the score by accident.&lt;/p&gt;
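&lt;p&gt;A parity check of this kind can be sketched as running every fixture through both paths and listing any differences. Everything below is hypothetical: the stage percentages reuse the illustrative numbers from the JSON example earlier in this article, and &lt;code&gt;PolicyProfile&lt;/code&gt;, &lt;code&gt;score_hardcoded&lt;/code&gt;, &lt;code&gt;score_from_policy&lt;/code&gt;, and &lt;code&gt;parity_drift&lt;/code&gt; are invented names rather than the real scoring functions.&lt;/p&gt;

```python
from dataclasses import dataclass

# Constants as they might sit inline in a scoring path (illustrative values).
HARDCODED_PERCENT = {"stage_1": 30, "stage_2r": 25, "stage_3": 45}


def score_hardcoded(stage_scores: dict) -> int:
    # Weighted sum using the inline constants, in integer arithmetic.
    total = sum(HARDCODED_PERCENT[k] * stage_scores[k] for k in HARDCODED_PERCENT)
    return total // 100


@dataclass(frozen=True)
class PolicyProfile:
    name: str
    version: str
    stage_percent: tuple  # pairs of (stage, percent), kept immutable


def score_from_policy(profile: PolicyProfile, stage_scores: dict) -> int:
    # Same formula, but the constants are read through the policy object.
    total = sum(pct * stage_scores[stage] for stage, pct in profile.stage_percent)
    return total // 100


def parity_drift(profile: PolicyProfile, fixtures: list) -> list:
    # Fixtures whose score differs between the two paths; empty means parity.
    return [
        f for f in fixtures
        if score_hardcoded(f) != score_from_policy(profile, f)
    ]
```

&lt;p&gt;In this sketch, a default profile that mirrors the inline constants yields an empty drift list, and a profile with even one changed percentage does not. That is the signal that would turn a policy edit into an explicit release event rather than a silent change.&lt;/p&gt;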

&lt;p&gt;Only after that parity step does it become safe to discuss broader named profiles.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Position
&lt;/h2&gt;

&lt;p&gt;The calibration problem is not really about giving users more control.&lt;/p&gt;

&lt;p&gt;It is about deciding when control becomes authority.&lt;/p&gt;

&lt;p&gt;If every useful signal can gradually influence the score, the score stops being an audit artifact.&lt;/p&gt;

&lt;p&gt;It becomes a negotiation.&lt;/p&gt;

&lt;p&gt;That is what STEM BIO-AI is trying to avoid.&lt;/p&gt;

&lt;p&gt;Researchers should be able to express posture.&lt;/p&gt;

&lt;p&gt;Operators should be able to simulate alternatives.&lt;/p&gt;

&lt;p&gt;Policy stewards should be able to promote changes.&lt;/p&gt;

&lt;p&gt;But the formal score should not move unless the governance path says it moved.&lt;/p&gt;

&lt;p&gt;That is the difference between a tuning console and an audit system.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>bioinformatics</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building a Deterministic Governance Kernel: Separating Custody from Truth</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 12 May 2026 06:00:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/building-a-deterministic-governance-kernel-separating-custody-from-truth-57l5</link>
      <guid>https://dev.to/flamehaven01/building-a-deterministic-governance-kernel-separating-custody-from-truth-57l5</guid>
      <description>&lt;p&gt;A governance engine should not pretend to know the truth of every domain.&lt;/p&gt;

&lt;p&gt;That was the architectural lesson behind CGF.&lt;/p&gt;

&lt;p&gt;At Flamehaven Labs, we build B2B governance engines for highly regulated environments. Over the past year, we developed specialized deterministic systems for different review contexts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CareChainGovernanceEngine (CCGE)&lt;/strong&gt;: a fail-closed clinical-governance engine for enforcing safety-oriented review gates in bio-AI workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Analyst's Problem Framework (TAP)&lt;/strong&gt;: a “Proof Custody” engine designed to package and audit mathematical proof candidates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both worked inside their own domains. But both also exposed the same architectural problem: reusable custody mechanics were mixed with domain-specific decision semantics.&lt;/p&gt;

&lt;p&gt;We needed to audit new targets — such as external open-source intake, RAG retrieval receipts, and AI evolution proposals. If we did not extract a common, domain-neutral kernel, we would be doomed to rewrite the entire scanning, hashing, and reporting pipeline for every new vertical.&lt;/p&gt;

&lt;p&gt;The result was the &lt;strong&gt;Custody Governance Framework (CGF)&lt;/strong&gt;: a domain-neutral custody kernel for B2B technical review workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggo4b1sjt3zsixrr75v4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggo4b1sjt3zsixrr75v4.png" alt="The Architectural Flaw" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an architecture extraction note on how we decoupled domain truth from custody mechanics, the API design that powers it, and why we specifically rejected the modern trend of “LLM-agentic” governance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with “Agentic” Governance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjih0ep6gy6hz0g5uoycv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjih0ep6gy6hz0g5uoycv.png" alt="The Problem with “Agentic” Governance" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many emerging AI governance workflows are becoming document-shaped or agent-shaped: a YAML file, a Markdown policy, or an LLM prompt that says, “check whether this is safe.”&lt;/p&gt;

&lt;p&gt;The problem is not that LLMs are useless. The problem is that they can produce compliance-shaped language without producing verifiable compliance artifacts.&lt;/p&gt;

&lt;p&gt;In a strict B2B handoff — where auditability, legal review, and future regulatory mapping to frameworks such as the EU AI Act or NIST AI RMF may matter — you cannot rely on non-deterministic evaluations.&lt;/p&gt;

&lt;p&gt;CGF takes the opposite approach: &lt;strong&gt;strict determinism&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The framework does not own domain-truth semantics. It owns the custody mechanics around findings, profiles, evidence, approvals, and artifacts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: A Deterministic Data Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7efal66s6ug3q0ok9nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7efal66s6ug3q0ok9nz.png" alt="The Architecture: A Deterministic Data Flow" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To enforce this separation, we designed CGF as a deterministic pipeline with a narrow side-effect boundary.&lt;/p&gt;

&lt;p&gt;The core engine takes a normalized review input and a &lt;code&gt;GovernanceProfile&lt;/code&gt;, transforming them into immutable artifact dataclasses. The writer layer then materializes those objects as files, manifests, and release artifacts.&lt;/p&gt;

&lt;p&gt;Here is what the end-to-end flow looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwvftadwopko2nqlqrbv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwvftadwopko2nqlqrbv.jpg" alt="mermaid" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;
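
&lt;p&gt;The split between a pure core and a writer layer can be illustrated with a minimal sketch. Every name below (&lt;code&gt;PacketSketch&lt;/code&gt;, &lt;code&gt;render_packet&lt;/code&gt;, &lt;code&gt;write_packet&lt;/code&gt;) is hypothetical and is not CGF's real API. The sketch assumes only that artifacts are frozen dataclasses and that serialization uses sorted keys, so equal inputs produce byte-identical output.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: these names are illustrative, not CGF's real API.
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PacketSketch:
    """Immutable stand-in for a governance artifact."""
    target: str
    profile_id: str
    status: str
    findings: tuple = ()


def render_packet(packet):
    """Pure step: serialize with sorted keys so equal inputs give equal bytes."""
    text = json.dumps(asdict(packet), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def write_packet(packet, path):
    """Side-effect boundary: the only function here that touches the filesystem."""
    data = render_packet(packet)
    with open(path, "wb") as handle:
        handle.write(data)
    return hashlib.sha256(data).hexdigest()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping &lt;code&gt;render_packet&lt;/code&gt; free of I/O means the deterministic part can be tested without a filesystem, and the returned digest is the kind of value a manifest could record.&lt;/p&gt;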




&lt;h2&gt;
  
  
  The API Boundary
&lt;/h2&gt;

&lt;p&gt;At the core boundary, the framework does not evaluate whether a finding is “bad” by its own logic.&lt;/p&gt;

&lt;p&gt;It relies on a &lt;code&gt;StatusDeriver&lt;/code&gt; driven by the injected profile.&lt;/p&gt;

&lt;p&gt;A simplified sketch of the boundary looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified sketch, not the full implementation
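&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;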
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GovernancePipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GovernanceProfile&lt;/span&gt;
    &lt;span class="n"&gt;status_deriver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StatusDeriver&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_packet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScanResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GovernancePacket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Inject profile-specific requirements, such as mandatory surfaces
&lt;/span&gt;        &lt;span class="n"&gt;governed_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_with_profile_requirements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Derive deterministic status via the profile
&lt;/span&gt;        &lt;span class="n"&gt;status_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_deriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;derive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Assemble the immutable packet
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GovernancePacket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;profile_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profile_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;status_reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;compliance_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;_compute_compliance_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact implementation also handles timestamps, approval bridges, validation, artifact writing, and manifest verification.&lt;/p&gt;

&lt;p&gt;The important point is architectural: the core does not decide domain truth. It records how a profile interpreted the evidence.&lt;/p&gt;
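
&lt;p&gt;To make the idea of profile-owned status concrete, here is a small sketch under stated assumptions. The field &lt;code&gt;blocking_codes&lt;/code&gt;, the status labels, and the finding codes are all invented for illustration. The real &lt;code&gt;StatusDeriver&lt;/code&gt; and &lt;code&gt;GovernanceProfile&lt;/code&gt; may look quite different.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: field names, codes, and status labels are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileSketch:
    profile_id: str
    blocking_codes: frozenset  # finding codes this profile treats as blocking


@dataclass(frozen=True)
class StatusSketch:
    status: str
    reason: str


def derive_status(profile, finding_codes):
    """Map finding codes to a status using only the injected profile."""
    blocking = sorted(set(finding_codes).intersection(profile.blocking_codes))
    if blocking:
        return StatusSketch("blocked", "blocking findings: " + ", ".join(blocking))
    return StatusSketch("clear", "no blocking findings for " + profile.profile_id)


strict = ProfileSketch("strict", frozenset({"MISSING_EVIDENCE", "UNSIGNED_APPROVAL"}))
lenient = ProfileSketch("lenient", frozenset({"UNSIGNED_APPROVAL"}))
codes = ["MISSING_EVIDENCE"]

# The same findings yield different statuses because the profile owns the rule.
print(derive_status(strict, codes).status)   # blocked
print(derive_status(lenient, codes).status)  # clear
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The derivation function holds no opinion of its own about which findings matter. Swapping the profile changes the outcome while the code path stays identical.&lt;/p&gt;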




&lt;h2&gt;
  
  
  Architecture Highlight 1: Inspectable Artifacts Over Silent Mutation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xr9csrctkgeo7ywjtm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xr9csrctkgeo7ywjtm3.png" alt="Architecture Highlight 1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A major risk in governance automation is the engine silently mutating the target repository, for example by automatically injecting compliance boilerplate.&lt;/p&gt;

&lt;p&gt;Many tools solve this with a &lt;code&gt;--dry-run&lt;/code&gt; flag that prints logs to stdout.&lt;/p&gt;

&lt;p&gt;In a B2B audit, stdout logs are not enough. You need an auditable, verifiable artifact.&lt;/p&gt;

&lt;p&gt;CGF implements a preview-first artifact flow. When the pipeline runs, it does not mutate the target repository by default. Instead, it consumes a normalized review input and emits a deterministic custody bundle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;cgf run &lt;span class="nt"&gt;--profile&lt;/span&gt; proof_custody.json &lt;span class="nt"&gt;--scan&lt;/span&gt; target_scan.json &lt;span class="nt"&gt;--out&lt;/span&gt; audit/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In real deployments, &lt;code&gt;target_scan.json&lt;/code&gt; may be produced by a vertical adapter, repository scanner, RAG receipt processor, or customer-specific intake layer.&lt;/p&gt;

&lt;p&gt;The output is an inspectable custody bundle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audit/
├── governance_packet.json       # Machine-readable audit state
├── preview_report.md            # Human-readable summary
├── chain_ribbon.md              # Markdown tag for custody-chain review state
└── manifest.json                # Artifact manifest with file hashes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important point is not that CGF edits the repository.&lt;/p&gt;

&lt;p&gt;It does not.&lt;/p&gt;

&lt;p&gt;The important point is that the review state, findings, proposed next actions, and artifact hashes become inspectable before any external system decides what to do next.&lt;/p&gt;
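
&lt;p&gt;A downstream system could re-check such a bundle before acting on it. The following sketch assumes a manifest layout that is not taken from CGF: a top-level &lt;code&gt;files&lt;/code&gt; object mapping each file name to its SHA-256 hex digest. The function names are hypothetical as well.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: the manifest layout is an assumption, not CGF's schema.
import hashlib
import json
from pathlib import Path


def sha256_file(path):
    """Hash a file in chunks so large artifacts need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(bundle_dir):
    """Return the names of files that are missing or whose hash differs."""
    bundle = Path(bundle_dir)
    manifest = json.loads((bundle / "manifest.json").read_text(encoding="utf-8"))
    mismatched = []
    for name, expected in sorted(manifest["files"].items()):
        candidate = bundle / name
        if not candidate.is_file() or sha256_file(candidate) != expected:
            mismatched.append(name)
    return mismatched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An empty return value means every listed artifact still matches its recorded hash. Anything else is a concrete list a reviewer can inspect.&lt;/p&gt;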




&lt;h2&gt;
  
  
  Architecture Highlight 2: The Reality of Audit Chains
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgsvgjb1yqf419dg58sf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgsvgjb1yqf419dg58sf.png" alt="Architecture Highlight 2" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In B2B handoffs, customers often ask for tamper-resistance.&lt;/p&gt;

&lt;p&gt;CGF supports a &lt;code&gt;GovernanceAuditChain&lt;/code&gt;, an append-only JSONL ledger where packet records can be linked through SHA-256 hashes.&lt;/p&gt;

&lt;p&gt;But we need to be honest about the tradeoff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local hash chains are tamper-evident, not tamper-resistant.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a bad actor has write access to the filesystem, they can delete the audit directory and regenerate the entire chain from scratch.&lt;/p&gt;

&lt;p&gt;CGF does not use a blockchain. The cost and complexity of a distributed ledger would outweigh the benefits for local repository scanning.&lt;/p&gt;

&lt;p&gt;Instead, CGF provides local tamper-evidence.&lt;/p&gt;
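
&lt;p&gt;The difference between tamper-evident and tamper-resistant is easier to see in code. The sketch below is a generic hash-linked JSONL ledger, not the &lt;code&gt;GovernanceAuditChain&lt;/code&gt; implementation. The record fields (&lt;code&gt;prev&lt;/code&gt;, &lt;code&gt;payload&lt;/code&gt;, &lt;code&gt;hash&lt;/code&gt;) and function names are assumptions made for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: record fields are assumptions, not CGF's ledger format.
import hashlib
import json

GENESIS = "0" * 64


def record_hash(prev_hash, payload):
    """Hash the previous link together with a canonical form of the payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(path, payload):
    """Append one JSON line whose hash covers the previous line's hash."""
    prev_hash = GENESIS
    try:
        with open(path, "r", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    prev_hash = json.loads(line)["hash"]
    except FileNotFoundError:
        pass
    entry = {"prev": prev_hash, "payload": payload,
             "hash": record_hash(prev_hash, payload)}
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry["hash"]


def verify_chain(path):
    """Return True only if every record links to its predecessor and rehashes."""
    prev_hash = GENESIS
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            entry = json.loads(line)
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != record_hash(prev_hash, entry["payload"]):
                return False
            prev_hash = entry["hash"]
    return True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Editing any earlier record makes &lt;code&gt;verify_chain&lt;/code&gt; fail, which is the tamper-evidence. Deleting the file and appending fresh records produces a chain that verifies cleanly, which is why a local chain alone is not tamper-resistant.&lt;/p&gt;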

&lt;p&gt;Tamper-resistance has to come from the deployment environment: CI/CD artifact signing, external timestamping, identity providers, or customer-controlled archival systems.&lt;/p&gt;

&lt;p&gt;For example, an external identity token can be wired into an approval bridge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wiring an external identity token into the local chain
&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApprovalBridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approved_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliance_lead_JWT_subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cryptographic signature from Identity Provider XYZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;approved_packet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GovernancePipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework provides the cryptographic hooks and a hash-linked record order.&lt;/p&gt;

&lt;p&gt;The deployment environment provides the immutability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz467oliq5o7qr67tqavf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz467oliq5o7qr67tqavf.png" alt="Current Limitations" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post is not a release announcement or a regulatory certification claim.&lt;/p&gt;

&lt;p&gt;It is an architecture note about the extraction pattern: how we separated reusable governance mechanics from domain-specific truth semantics.&lt;/p&gt;

&lt;p&gt;CGF is still early. It is not a compliance platform, not a hosted governance service, and not a regulatory certification product.&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;CGF does not prove that a medical system is safe. It does not prove that a mathematical argument is true. It does not certify legal compliance. It does not replace domain experts, auditors, clinicians, lawyers, or reviewers.&lt;/p&gt;

&lt;p&gt;What it does is narrower, but more concrete:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It makes the custody surface inspectable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are also practical limitations.&lt;/p&gt;

&lt;p&gt;As mentioned, local hash chains are tamper-evident, not tamper-resistant. True immutability still has to come from the deployment environment: CI/CD signing, external timestamping, identity providers, or customer-controlled archival systems.&lt;/p&gt;

&lt;p&gt;CGF is also not yet a complete enterprise governance platform. Authentication, RBAC, multi-tenant profile registries, async approval workflows, and regulatory citation mapping are still roadmap items, not solved infrastructure.&lt;/p&gt;

&lt;p&gt;Each domain still needs adapters, profiles, thresholds, and human review policies.&lt;/p&gt;

&lt;p&gt;The kernel provides the custody mechanics. The domain owner still has to define what evidence matters.&lt;/p&gt;

&lt;p&gt;That is intentional.&lt;/p&gt;

&lt;p&gt;A generic governance kernel should not pretend to know the truth of every field.&lt;/p&gt;




&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn8btw5jcbqoc6luwy2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn8btw5jcbqoc6luwy2s.png" alt="Roadmap" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The roadmap is not to turn CGF into a giant all-knowing compliance agent.&lt;/p&gt;

&lt;p&gt;The roadmap is to keep the kernel small, deterministic, and inspectable while adding stronger boundaries around the places where real B2B workflows need them.&lt;/p&gt;

&lt;p&gt;The next layers are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory mapping&lt;/strong&gt;: mapping finding codes to frameworks such as the EU AI Act, NIST AI RMF, and ISO/IEC 42001 without turning CGF itself into a legal authority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval policy hardening&lt;/strong&gt;: adding stronger policy checks around &lt;code&gt;ApprovalBridge&lt;/code&gt; so approvals can be scoped, expired, and externally verified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async approval workflows&lt;/strong&gt;: allowing human review, compliance sign-off, or customer approval to arrive after the initial custody packet is generated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile registries&lt;/strong&gt;: supporting versioned, tenant-scoped governance profiles so different customers can use different policies without changing the kernel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signed external receipts&lt;/strong&gt;: allowing RAG systems, technology scanners, quality engines, and external tools to produce receipts that CGF can verify and attach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical adapters&lt;/strong&gt;: binding existing domain systems back to the kernel without importing their domain-specific truth semantics into the core.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every roadmap item has to preserve the same rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kernel may govern custody, but it must not absorb domain truth.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The direction is deliberately conservative: more custody, more verification, more explicit boundaries — not more autonomous magic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb5uu077mkda41epvtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb5uu077mkda41epvtc.png" alt="Why This Matters" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of AI governance today is still document-shaped.&lt;/p&gt;

&lt;p&gt;A policy lives in Markdown. A checklist lives in YAML. A prompt says the system should be safe, transparent, aligned, compliant, or human-reviewed.&lt;/p&gt;

&lt;p&gt;Those documents are not useless. They are often necessary.&lt;/p&gt;

&lt;p&gt;But they are not governance by themselves.&lt;/p&gt;

&lt;p&gt;Governance becomes real only when it has an execution surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a typed input boundary&lt;/li&gt;
&lt;li&gt;normalized findings&lt;/li&gt;
&lt;li&gt;profile-owned status derivation&lt;/li&gt;
&lt;li&gt;explicit evidence references&lt;/li&gt;
&lt;li&gt;generated review artifacts&lt;/li&gt;
&lt;li&gt;manifest hashes&lt;/li&gt;
&lt;li&gt;approval metadata&lt;/li&gt;
&lt;li&gt;release bundles&lt;/li&gt;
&lt;li&gt;verification commands&lt;/li&gt;
&lt;li&gt;clear non-goals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference CGF is trying to make.&lt;/p&gt;

&lt;p&gt;It is not another Markdown file describing how governance should work.&lt;/p&gt;

&lt;p&gt;It is a deterministic custody pipeline that turns review inputs into inspectable artifacts.&lt;/p&gt;

&lt;p&gt;The goal is not to make governance sound more sophisticated.&lt;/p&gt;

&lt;p&gt;The goal is to make it harder to fake.&lt;/p&gt;

&lt;p&gt;A governance system should leave behind more than confidence.&lt;/p&gt;

&lt;p&gt;It should leave behind artifacts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblh4bvppqc3mioyrdxrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblh4bvppqc3mioyrdxrr.png" alt="Conclusion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Extracting the Custody Governance Framework taught us that governance architecture has to separate process from truth.&lt;/p&gt;

&lt;p&gt;Truth belongs to domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;medicine&lt;/li&gt;
&lt;li&gt;mathematics&lt;/li&gt;
&lt;li&gt;law&lt;/li&gt;
&lt;li&gt;security&lt;/li&gt;
&lt;li&gt;finance&lt;/li&gt;
&lt;li&gt;science&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Process belongs to the governance kernel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was reviewed&lt;/li&gt;
&lt;li&gt;which profile was applied&lt;/li&gt;
&lt;li&gt;which findings fired&lt;/li&gt;
&lt;li&gt;what evidence was attached&lt;/li&gt;
&lt;li&gt;what status was derived&lt;/li&gt;
&lt;li&gt;who approved it&lt;/li&gt;
&lt;li&gt;whether the resulting artifacts can be verified later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is the reason CGF exists.&lt;/p&gt;

&lt;p&gt;It does not try to be an AI judge. It does not ask an LLM to guess whether a system is compliant. It does not hide governance inside a prompt, a policy document, or a YAML file.&lt;/p&gt;

&lt;p&gt;It creates custody artifacts that can be inspected.&lt;/p&gt;

&lt;p&gt;For us, that is the real boundary between governance as language and governance as infrastructure.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>softwareengineering</category>
      <category>governance</category>
      <category>python</category>
    </item>
    <item>
      <title>From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Fri, 08 May 2026 08:25:50 +0000</pubDate>
      <link>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</link>
      <guid>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Earlier in this series, I wrote about why bio/medical AI repositories need more than benchmarks, what I learned after auditing 10 public repositories, and why an AI auditor itself needs a memory contract.&lt;/p&gt;

&lt;p&gt;That work led to STEM-AI v1.1.2 and the MICA layer: a memory-contracted initialization step that forces the auditor to load bounded rules before scoring begins. If you have not read that part, the relevant post is here:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the broader arc, the full series is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;STEM-AI / STEM BIO-AI series&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But after that, a different engineering problem took over.&lt;/p&gt;

&lt;p&gt;The audit logic was stricter.&lt;br&gt;&lt;br&gt;
The reports were richer.&lt;br&gt;&lt;br&gt;
The reasoning was more bounded.&lt;/p&gt;

&lt;p&gt;But the developer workflow still felt too loose.&lt;/p&gt;

&lt;p&gt;So the next question was no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do I score trust?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does a bio-AI audit tool become something an engineer can actually run, gate, inspect, and integrate?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer turned out to be less about seeing more signals and more about refusing to confuse them.&lt;/p&gt;

&lt;p&gt;That is the core argument of this post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I took that seriously, STEM BIO-AI stopped looking like “one score plus some extra metadata” and started looking like a system with distinct lanes, distinct boundaries, and distinct operator workflows.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The problem was no longer scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" alt="The problem was no longer scoring" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the time I reached the 1.6.x line, the rubric was no longer the main bottleneck.&lt;/p&gt;

&lt;p&gt;The bottleneck was operational clarity.&lt;/p&gt;

&lt;p&gt;A trust audit tool is not very useful if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the normal path is one long command with too many flags&lt;/li&gt;
&lt;li&gt;CI has to reverse-engineer the result from human-readable stdout&lt;/li&gt;
&lt;li&gt;bio-specific diagnostics are mixed directly into the same surface as formal scoring&lt;/li&gt;
&lt;li&gt;regulatory relevance shows up as vague implication instead of explicit traceability&lt;/li&gt;
&lt;li&gt;advisory AI is present, but its relationship to the official score is unclear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the tool stops being hard to trust for conceptual reasons and starts being hard to trust for operational reasons.&lt;/p&gt;

&lt;p&gt;That is a different class of problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The CLI had to reflect operator intent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier CLI was functional, but too flat.&lt;/p&gt;

&lt;p&gt;You could do things like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T3 &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
stem /path/to/repo &lt;span class="nt"&gt;--advisory-response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of that worked.&lt;/p&gt;

&lt;p&gt;The issue was that it treated very different operator intents as one long option surface.&lt;/p&gt;

&lt;p&gt;In practice, these are separate workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan a repository and generate artifacts&lt;/li&gt;
&lt;li&gt;enforce a gate in CI/CD&lt;/li&gt;
&lt;li&gt;export a bounded advisory packet&lt;/li&gt;
&lt;li&gt;validate a downstream provider response&lt;/li&gt;
&lt;li&gt;cross an explicit provider-call boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I refactored the CLI around workflows instead of flag accumulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan &amp;lt;folder&amp;gt;
stem gate &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2
stem advisory validate &amp;lt;folder&amp;gt;
stem advisory packet &amp;lt;folder&amp;gt;
stem advisory call &amp;lt;folder&amp;gt;
stem advisory check-response &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--response&lt;/span&gt; FILE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The older paths still exist for compatibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem &amp;lt;folder&amp;gt;
stem audit &amp;lt;folder&amp;gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T2 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But they are no longer the conceptual center.&lt;/p&gt;

&lt;p&gt;That matters more than it sounds.&lt;/p&gt;

&lt;p&gt;Once the command names match the operator’s intent, the system becomes easier to teach, easier to remember, and easier to wire into pipelines.&lt;/p&gt;

&lt;p&gt;This is not just a DX cleanup. In a medical or bio-adjacent audit context, command ambiguity is part of the trust problem.&lt;/p&gt;
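
&lt;p&gt;For readers who want to see what a workflow-shaped command surface can look like in code, here is a minimal sketch using Python's standard &lt;code&gt;argparse&lt;/code&gt; subparsers. It is not the STEM BIO-AI source. Only the command names and the &lt;code&gt;--min-tier&lt;/code&gt; and &lt;code&gt;--response&lt;/code&gt; flags come from the listing above. Marking those flags as required and the help strings are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: not the STEM BIO-AI source, only one way to lay out
# the subcommands with the standard library.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="stem")
    commands = parser.add_subparsers(dest="command", required=True)

    scan = commands.add_parser("scan", help="scan a repository and write artifacts")
    scan.add_argument("folder")

    gate = commands.add_parser("gate", help="enforce a minimum tier in CI/CD")
    gate.add_argument("folder")
    gate.add_argument("--min-tier", required=True)

    advisory = commands.add_parser("advisory", help="advisory workflows")
    actions = advisory.add_subparsers(dest="action", required=True)
    for name in ("validate", "packet", "call"):
        actions.add_parser(name).add_argument("folder")
    check = actions.add_parser("check-response")
    check.add_argument("folder")
    check.add_argument("--response", required=True)
    return parser


args = build_parser().parse_args(["gate", "repo/", "--min-tier", "T2"])
print(args.command, args.folder, args.min_tier)  # gate repo/ T2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each workflow gets its own parser, so a flag that only makes sense for gating cannot be passed to a scan by accident.&lt;/p&gt;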




&lt;h2&gt;
  
  
  &lt;strong&gt;Repository trust needed four separate lanes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" alt="Repository trust needed four separate lanes" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest architectural shift.&lt;/p&gt;

&lt;p&gt;I stopped treating repository trust as one object.&lt;/p&gt;

&lt;p&gt;In practice, it needed four separate lanes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;deterministic structural scoring&lt;/li&gt;
&lt;li&gt;deterministic diagnostics&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;optional AI advisory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all of those collapse into one final confidence score, the tool becomes harder to reason about.&lt;/p&gt;

&lt;p&gt;The more regulated the domain, the more dangerous it becomes to collapse every useful signal into one score.&lt;/p&gt;

&lt;p&gt;Some evidence should change the score.&lt;br&gt;
Some evidence should only raise review priority.&lt;br&gt;
Some evidence should support traceability.&lt;br&gt;
Some evidence should be handed to a human or advisory system.&lt;/p&gt;

&lt;p&gt;The maturity of the tool is not that it sees all of them.&lt;/p&gt;

&lt;p&gt;The maturity is that it does not confuse them.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;This separation is not just conceptual. It exists in the code path.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One reasonable objection to any architecture write-up is: are these really separate lanes, or are they just different labels on the same output object?&lt;/p&gt;

&lt;p&gt;In STEM BIO-AI, the answer is visible in the execution order.&lt;/p&gt;

&lt;p&gt;The scanner computes the formal score first. In the result object, that covers the stage scores, the adjustments applied to them, and the two final keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1&lt;/li&gt;
&lt;li&gt;Stage 2R&lt;/li&gt;
&lt;li&gt;Stage 3&lt;/li&gt;
&lt;li&gt;risk penalty&lt;/li&gt;
&lt;li&gt;score cap&lt;/li&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does it append the non-scoring layers, again as explicit result keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reasoning_model&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;optional &lt;code&gt;ai_advisory&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ordering matters.&lt;/p&gt;

&lt;p&gt;The score is not derived from the advisory lane.&lt;br&gt;
The regulatory mapping does not mutate the formal tier.&lt;br&gt;
The diagnostics lane can emit evidence without becoming a hidden score multiplier.&lt;/p&gt;

&lt;p&gt;This is also why the JSON shape ended up more layered than earlier versions. The output had to preserve the distinction the code was already enforcing.&lt;/p&gt;
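
&lt;p&gt;The ordering can be sketched in a few lines. The key names &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, &lt;code&gt;regulatory_traceability&lt;/code&gt;, and &lt;code&gt;ai_advisory&lt;/code&gt; follow the lists above. Every value, helper function, and signal name below is invented for illustration and does not reflect the actual scanner.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch: key names follow the post; values and helpers are invented.
from types import MappingProxyType


def score_lane(signals):
    """Scoring lane: the only step allowed to set the score and tier."""
    return {
        "final_score": sum(signals.values()),
        "formal_tier": "placeholder",  # real tier rules are not shown here
    }


def traceability_lane(formal_view):
    """Non-scoring lane: reads the formal result and returns its own section."""
    return {"scored_value_seen": formal_view["final_score"]}


def build_result(signals, advisory=None):
    """Compute the formal score first, then append the other lanes."""
    formal = score_lane(signals)
    formal_view = MappingProxyType(formal)  # read-only view for later lanes
    result = dict(formal)
    result["regulatory_traceability"] = traceability_lane(formal_view)
    if advisory is not None:
        result["ai_advisory"] = advisory
    return result


signals = {"readme_evidence": 10, "dependency_hygiene": 5}
plain = build_result(signals)
advised = build_result(signals, advisory={"summary": "needs human review"})
print(plain["final_score"] == advised["final_score"])  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Later lanes receive a read-only view of the formal result, so an attempt to adjust the score from the advisory or traceability side fails loudly instead of silently changing the tier.&lt;/p&gt;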

&lt;p&gt;That execution order is the architectural reason the next four sections exist.&lt;/p&gt;

&lt;p&gt;Once I had the lanes separated in code, each lane needed its own claim boundary, its own output semantics, and its own reason for not being collapsed into the others.&lt;/p&gt;

&lt;p&gt;Put differently, the next four sections answer four different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is allowed to change the formal tier&lt;/li&gt;
&lt;li&gt;what is useful enough to emit, but not yet mature enough to score&lt;/li&gt;
&lt;li&gt;what can support regulatory review without pretending to be compliance&lt;/li&gt;
&lt;li&gt;what can involve AI without letting AI become the scoring authority&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1. Deterministic structural scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" alt="The official baseline for triage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This remains the official score and tier.&lt;/p&gt;

&lt;p&gt;It measures the main repository-visible signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README evidence&lt;/li&gt;
&lt;li&gt;repo-local consistency&lt;/li&gt;
&lt;li&gt;code and bio responsibility&lt;/li&gt;
&lt;li&gt;dependency hygiene&lt;/li&gt;
&lt;li&gt;changelog and provenance surfaces&lt;/li&gt;
&lt;li&gt;code-integrity patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lane is local, deterministic, and machine-checkable.&lt;/p&gt;

&lt;p&gt;That is the part that can legitimately drive a formal triage tier.&lt;/p&gt;

&lt;p&gt;I am not claiming this is the only possible architecture. A different system could have folded diagnostics or replication more aggressively into one unified score.&lt;/p&gt;

&lt;p&gt;I chose not to, because the narrower score proved easier to defend. A smaller claim with cleaner boundaries was more valuable here than a broader score with ambiguous semantics.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;2. Deterministic diagnostics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where the deterministic diagnostics spec became important.&lt;/p&gt;

&lt;p&gt;I needed a place for findings that are real, useful, and inspectable, but should not silently perturb the main score until they are calibrated.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;docs/DETERMINISTIC_DIAGNOSTICS.md&lt;/code&gt; defines.&lt;/p&gt;

&lt;p&gt;It separates the diagnostic problem into two lanes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lane A: deterministic local diagnostics&lt;/li&gt;
&lt;li&gt;Lane B: optional AI-assisted semantic review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is central.&lt;/p&gt;

&lt;p&gt;The deterministic lane is authoritative for hard findings.&lt;br&gt;
The AI lane is advisory only.&lt;/p&gt;

&lt;p&gt;The local diagnostic lane currently focuses on evidence-bearing bio-specific signals such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed or suspicious SMILES-like outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;li&gt;silent mock or simulated-data fallbacks&lt;/li&gt;
&lt;li&gt;risky subprocess construction around bio tools&lt;/li&gt;
&lt;li&gt;traceability manifest surfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point was not to create a “bio slop detector” with a catchy label.&lt;/p&gt;

&lt;p&gt;The point was to create a local evidence lane that could say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;here is the file&lt;/li&gt;
&lt;li&gt;here is the line&lt;/li&gt;
&lt;li&gt;here is the snippet&lt;/li&gt;
&lt;li&gt;here is the bounded interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is much more useful than a vague semantic warning.&lt;/p&gt;
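
&lt;p&gt;To make the file / line / snippet / interpretation shape concrete, here is a small sketch of one such detector. Everything in it is an assumption for illustration: the &lt;code&gt;EvidenceRecord&lt;/code&gt; class, the &lt;code&gt;scan_source&lt;/code&gt; function, and the regular expression are hypothetical and far simpler than a real mock-fallback detector would need to be.&lt;/p&gt;

```python
import re
from dataclasses import dataclass

# Hypothetical pattern: a return statement handing back mock or simulated data.
FALLBACK_PATTERN = re.compile(
    r"\breturn\s+\w*(mock|simulated|dummy)\w*", re.IGNORECASE
)


@dataclass(frozen=True)
class EvidenceRecord:
    file: str            # here is the file
    line: int            # here is the line (1-based)
    snippet: str         # here is the snippet
    interpretation: str  # here is the bounded interpretation


def scan_source(path, text):
    """Return one evidence record per line that matches the fallback pattern."""
    records = []
    for number, raw in enumerate(text.splitlines(), start=1):
        if FALLBACK_PATTERN.search(raw):
            records.append(
                EvidenceRecord(
                    file=path,
                    line=number,
                    snippet=raw.strip(),
                    interpretation="possible mock or simulated-data fallback; "
                    "needs human review",
                )
            )
    return records
```

&lt;p&gt;The interpretation string is deliberately narrow: it names what the pattern saw and stops there.&lt;/p&gt;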
&lt;h3&gt;
  
  
  Why diagnostics stayed evidence-only
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" alt="Retaining evidence without inflating the score" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was one of the harder engineering decisions.&lt;/p&gt;

&lt;p&gt;It would have been easy to push every new bio-specific detector directly into the final score.&lt;/p&gt;

&lt;p&gt;I did not do that.&lt;/p&gt;

&lt;p&gt;The deterministic diagnostics spec is explicit that many of these findings begin as evidence-only. In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;findings are emitted as line-level records in the result object’s &lt;code&gt;evidence_ledger&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings appear in Markdown and &lt;code&gt;--explain&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings do not change &lt;code&gt;final_score&lt;/code&gt; or &lt;code&gt;formal_tier&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right default.&lt;/p&gt;

&lt;p&gt;For example, the SMILES lane can be very useful for detecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed surface strings&lt;/li&gt;
&lt;li&gt;low-entropy placeholders&lt;/li&gt;
&lt;li&gt;repeated trivial outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;medicinal usefulness&lt;/li&gt;
&lt;li&gt;synthetic feasibility&lt;/li&gt;
&lt;li&gt;binding plausibility&lt;/li&gt;
&lt;li&gt;biological efficacy&lt;/li&gt;
&lt;li&gt;full chemical validity in every edge case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That boundary is important.&lt;/p&gt;

&lt;p&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;

&lt;p&gt;Just as importantly, this is not meant to be a permanent holding area for every detector. The diagnostics spec is explicit that score impact should only happen after commit-pinned benchmark evidence, explicit false-positive review, and reproducible calibration. In other words, evidence-only is the temporary safe default until a detector has earned score authority.&lt;/p&gt;
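
&lt;p&gt;The promotion rule in the paragraph above can be sketched as a three-condition gate. This is an assumption about how such a gate could look, not the project's implementation: &lt;code&gt;DetectorStatus&lt;/code&gt;, its field names, and &lt;code&gt;score_mode&lt;/code&gt; are all hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectorStatus:
    # All field names are hypothetical; they mirror the three stated conditions.
    name: str
    benchmark_commit: Optional[str]  # commit-pinned benchmark evidence, if any
    false_positive_reviewed: bool    # explicit false-positive review done
    calibration_reproducible: bool   # calibration can be re-run and matches


def score_mode(status):
    """Evidence-only is the default; score authority needs all three conditions."""
    earned = (
        status.benchmark_commit is not None
        and status.false_positive_reviewed
        and status.calibration_reproducible
    )
    return "score_bearing" if earned else "evidence_only"


smiles = DetectorStatus("smiles_surface", None, True, False)
print(score_mode(smiles))
```

&lt;p&gt;Writing the gate as a conjunction keeps the default safe: a detector with any one condition missing stays evidence-only.&lt;/p&gt;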


&lt;h2&gt;
  
  
  &lt;strong&gt;3. Regulatory traceability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" alt="Traceability is not a permission slip" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second document that became central was &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This solved a different problem.&lt;/p&gt;

&lt;p&gt;Once you audit clinical-adjacent repositories, people naturally ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this align with EU AI Act themes?&lt;/li&gt;
&lt;li&gt;does this help with FDA-oriented review?&lt;/li&gt;
&lt;li&gt;is there anything relevant to IMDRF or SaMD evidence families?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wrong answer would be to turn those questions into a fake compliance score.&lt;/p&gt;

&lt;p&gt;So I did the opposite.&lt;/p&gt;

&lt;p&gt;The regulatory layer is explicitly framed as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a traceability aid, not a compliance verdict&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That document maps observed evidence classes to requirement families with bounded confidence labels like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strong&lt;/li&gt;
&lt;li&gt;moderate&lt;/li&gt;
&lt;li&gt;weak-moderate&lt;/li&gt;
&lt;li&gt;weak&lt;/li&gt;
&lt;li&gt;not assessed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it makes an important distinction:&lt;/p&gt;

&lt;p&gt;the confidence applies to the mapping relationship, not to legal acceptability.&lt;/p&gt;

&lt;p&gt;Those confidence labels are not model outputs and they are not inferred at runtime. They are fixed, rule-level mapping judgments attached to evidence classes in the mapping document itself. For example, changelog / checksum / config-manifest style evidence is treated as a moderate traceability signal for Article 12-style review, while human-oversight interface signals stay weak because interface presence is not the same thing as oversight procedure.&lt;/p&gt;

&lt;p&gt;That means the tool can say things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioned manifests and changelogs may support record-keeping / traceability review&lt;/li&gt;
&lt;li&gt;intended-use and disclaimer sections may support transparency scaffolding review&lt;/li&gt;
&lt;li&gt;override interfaces may support human-oversight interface review&lt;/li&gt;
&lt;li&gt;subgroup measurement language may support weak evidence of data-governance intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without claiming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;legal compliance&lt;/li&gt;
&lt;li&gt;regulatory clearance&lt;/li&gt;
&lt;li&gt;clinical certification&lt;/li&gt;
&lt;li&gt;deployer conformance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a regulated domain, traceability is useful only when it does not pretend to be permission.&lt;/p&gt;
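
&lt;p&gt;Because the confidence labels are fixed at rule level, the mapping can be pictured as a read-only lookup table. The sketch below is an assumption about the shape, not a copy of &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;: the evidence-class identifiers, &lt;code&gt;MAPPING_RULES&lt;/code&gt;, and &lt;code&gt;map_evidence&lt;/code&gt; are hypothetical, and only the two confidence levels stated above (moderate for changelog / checksum / config-manifest evidence, weak for human-oversight interface signals) are filled in.&lt;/p&gt;

```python
from types import MappingProxyType

# Hypothetical identifiers. Only the confidence labels and the two example
# judgments come from the article; nothing here is inferred at runtime.
MAPPING_RULES = MappingProxyType({
    "changelog": ("record-keeping / traceability review", "moderate"),
    "checksum_manifest": ("record-keeping / traceability review", "moderate"),
    "config_manifest": ("record-keeping / traceability review", "moderate"),
    "override_interface": ("human-oversight interface review", "weak"),
})


def map_evidence(evidence_class):
    """Look up the fixed mapping judgment for one observed evidence class."""
    family, confidence = MAPPING_RULES.get(evidence_class, (None, "not assessed"))
    return {
        "evidence_class": evidence_class,
        "requirement_family": family,
        # Confidence in the mapping relationship, not in legal acceptability.
        "mapping_confidence": confidence,
    }


print(map_evidence("changelog"))
print(map_evidence("model_card"))
```

&lt;p&gt;An unknown evidence class falls through to &lt;code&gt;not assessed&lt;/code&gt; rather than to a guessed label, and the returned object has no field that could be read as a compliance verdict.&lt;/p&gt;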
&lt;h3&gt;
  
  
  A concrete example: why Article 12 is traceability, not compliance
&lt;/h3&gt;

&lt;p&gt;The best example here is EU AI Act Article 12 style traceability.&lt;/p&gt;

&lt;p&gt;The regulatory mapping layer treats signals like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;checksum manifests&lt;/li&gt;
&lt;li&gt;versioned config surfaces&lt;/li&gt;
&lt;li&gt;audit-log schema fragments&lt;/li&gt;
&lt;li&gt;decision-event or override-event schema tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as evidence that a repository may have traceability scaffolding.&lt;/p&gt;

&lt;p&gt;That is useful.&lt;/p&gt;

&lt;p&gt;It is also bounded.&lt;/p&gt;

&lt;p&gt;The mapping document is explicit that changelog presence is not the same thing as deploy-time event logging, and that current scope does not establish runtime log completeness.&lt;/p&gt;

&lt;p&gt;So the output can legitimately say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is structural evidence relevant to traceability review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;while refusing to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this system satisfies traceability obligations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of distinction I wanted this lane to enforce.&lt;/p&gt;

&lt;p&gt;What this buys in practice is not a compliance shortcut, but a faster review question. If a repository exposes none of the scaffolding signals in this lane — no change history, no artifact hashes, no versioned manifests, no event-schema surfaces — then there is very little reason to treat it as traceability-ready for deeper institutional review. If those signals do exist, the next step is still expert inspection, but the scanner has at least opened the right folder and pointed at the right files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why regulatory mapping stayed subordinate to evidence
&lt;/h3&gt;

&lt;p&gt;This was non-negotiable.&lt;/p&gt;

&lt;p&gt;Regulatory relevance had to remain downstream from evidence, not a score multiplier pretending to be law.&lt;/p&gt;

&lt;p&gt;That is why the output shape separates things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;from the actual score computation.&lt;/p&gt;

&lt;p&gt;And it is not just decorative structure.&lt;/p&gt;

&lt;p&gt;The regulatory basis object is registry-driven. It can mark &lt;code&gt;review_required&lt;/code&gt; when the basis registry is stale or required source families are missing. That is a traceability control on the mapping layer itself, not an input into the scoring formula.&lt;/p&gt;

&lt;p&gt;This is also why the regulatory note belongs in a muted traceability panel, not next to the main score.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is useful.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is still not compliance.&lt;/p&gt;

&lt;p&gt;The distinction has to remain visible in both the code and the artifacts.&lt;/p&gt;
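
&lt;p&gt;The &lt;code&gt;review_required&lt;/code&gt; behavior described above could look roughly like the following. This is a sketch under stated assumptions: the registry layout, the required source families, the 180-day staleness window, and the function name &lt;code&gt;regulatory_basis&lt;/code&gt; are hypothetical, since the article does not publish them.&lt;/p&gt;

```python
from datetime import date, timedelta

# Hypothetical values: the article does not publish the registry schema,
# the required source families, or the staleness window.
REQUIRED_FAMILIES = frozenset({"eu_ai_act", "fda", "imdrf_samd"})
MAX_AGE = timedelta(days=180)


def regulatory_basis(registry, today):
    """Build a basis object that flags itself for review; it never touches the score."""
    missing = sorted(REQUIRED_FAMILIES - set(registry["source_families"]))
    stale = (today - registry["last_reviewed"]) > MAX_AGE
    return {
        "registry_version": registry["version"],
        "missing_source_families": missing,
        "stale": stale,
        "review_required": stale or bool(missing),
    }


registry = {
    "version": "example-1",
    "last_reviewed": date(2026, 1, 1),
    "source_families": ["eu_ai_act", "fda"],
}
print(regulatory_basis(registry, date(2026, 3, 1)))
```

&lt;p&gt;Note that the function takes no score and returns no score. The control acts on the mapping layer only.&lt;/p&gt;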


&lt;h2&gt;
  
  
  &lt;strong&gt;4. Optional AI advisory&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" alt="Enforcing a bounded intelligence sandbox" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fourth lane is the advisory layer.&lt;/p&gt;

&lt;p&gt;This one exists for bounded model-assisted review, but it does not get to rewrite the official outcome.&lt;/p&gt;

&lt;p&gt;That means workflows like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo
stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can exist without creating ambiguity about who owns the formal result.&lt;/p&gt;

&lt;p&gt;The advisory layer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export a provider-neutral packet&lt;/li&gt;
&lt;li&gt;validate downstream response structure&lt;/li&gt;
&lt;li&gt;enforce finding-ID citation rules&lt;/li&gt;
&lt;li&gt;reject prohibited claims&lt;/li&gt;
&lt;li&gt;surface runtime and secret boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it cannot do is silently override:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;score.final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;score.formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How that rule is actually enforced
&lt;/h3&gt;

&lt;p&gt;This is not just policy language in the README.&lt;/p&gt;

&lt;p&gt;The advisory validator explicitly checks for score-override attempts. If a response includes fields like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;or sets &lt;code&gt;final_score_override&lt;/code&gt;, the response is marked invalid with &lt;code&gt;final_score_override_requested&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The packet contract also exports the rule in plain language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not modify or override &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, &lt;code&gt;replication_score&lt;/code&gt;, or &lt;code&gt;replication_tier&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And provider responses must cite exact values from &lt;code&gt;allowed_finding_ids&lt;/code&gt;; citation strings are not repaired or loosely matched later.&lt;/p&gt;

&lt;p&gt;So the advisory lane is bounded in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it has no authority to change the deterministic result&lt;/li&gt;
&lt;li&gt;it cannot cite evidence outside the bounded packet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of mechanism I mean when I say “better boundaries.” If the rule cannot be checked, it is not really part of the architecture yet.&lt;/p&gt;
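
&lt;p&gt;Both bounds can be expressed as a short check. The sketch below is not the shipped validator: the response layout (&lt;code&gt;findings&lt;/code&gt;, &lt;code&gt;cites&lt;/code&gt;), the function name &lt;code&gt;check_response&lt;/code&gt;, and the &lt;code&gt;unknown_finding_id&lt;/code&gt; error code are hypothetical. The four protected field names, &lt;code&gt;final_score_override&lt;/code&gt;, &lt;code&gt;final_score_override_requested&lt;/code&gt;, and &lt;code&gt;allowed_finding_ids&lt;/code&gt; are the ones named above.&lt;/p&gt;

```python
# Field names protected by the packet contract, as listed in the article.
FORBIDDEN_FIELDS = (
    "final_score",
    "formal_tier",
    "replication_score",
    "replication_tier",
)


def check_response(response, allowed_finding_ids):
    """Validate a provider response against the two advisory bounds."""
    errors = []

    # Bound 1: no authority over the deterministic result.
    touches_score = any(field in response for field in FORBIDDEN_FIELDS)
    if touches_score or response.get("final_score_override"):
        errors.append("final_score_override_requested")

    # Bound 2: citations must match an allowed ID exactly. No case folding,
    # trimming, or fuzzy repair is attempted. ("findings" and "cites" are
    # hypothetical field names.)
    for finding in response.get("findings", []):
        for cited in finding.get("cites", []):
            if cited not in allowed_finding_ids:
                errors.append("unknown_finding_id:" + cited)

    return {"valid": not errors, "errors": errors}


allowed = {"F-001", "F-002"}
print(check_response({"findings": [{"cites": ["F-001"]}]}, allowed))
print(check_response({"formal_tier": "T4", "findings": []}, allowed))
```

&lt;p&gt;A response that cites &lt;code&gt;f-001&lt;/code&gt; instead of &lt;code&gt;F-001&lt;/code&gt; fails in this sketch, which is the point of exact matching.&lt;/p&gt;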




&lt;h2&gt;
  
  
  &lt;strong&gt;What operational use looks like now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" alt="One execution driving distinct operator surfaces" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once these lanes were separated, the CLI became much easier to reason about.&lt;/p&gt;

&lt;p&gt;Local engineering review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI/CD gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem gate /path/to/repo &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2 &lt;span class="nt"&gt;--summary&lt;/span&gt; off &lt;span class="nt"&gt;--output&lt;/span&gt; results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Offline advisory packet generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo &lt;span class="nt"&gt;--output&lt;/span&gt; advisory_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Downstream provider response validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important point is not just that these commands exist.&lt;/p&gt;

&lt;p&gt;It is that each one represents a distinct trust boundary.&lt;/p&gt;

&lt;p&gt;That made the project feel more like engineering infrastructure and less like a scoring demo.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A real v1.6.2 packet&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make that less abstract, I re-ran STEM BIO-AI v1.6.2 against a local clone of &lt;a href="https://github.com/ClawBio/ClawBio" rel="noopener noreferrer"&gt;ClawBio&lt;/a&gt;, which describes itself as a local-first, privacy-focused, reproducible bioinformatics-native AI skill library.&lt;/p&gt;

&lt;p&gt;The command was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; stem_ai.cli scan /path/to/ClawBio &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" alt="ClawBio_ClawBio_detailed_5p-1" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" alt="ClawBio_ClawBio_detailed_5p-2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On my machine, that run took about &lt;strong&gt;9.4 seconds&lt;/strong&gt; and emitted the usual CLI output set: a machine-readable JSON result, a Markdown report, a 5-page PDF packet, and a line-level explain trace.&lt;/p&gt;

&lt;p&gt;Before the numbers, the important context is that STEM BIO-AI uses a published triage scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;T0&lt;/code&gt; = 0-39&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T1&lt;/code&gt; = 40-54&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T2&lt;/code&gt; = 55-69&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T3&lt;/code&gt; = 70-84&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T4&lt;/code&gt; = 85-100&lt;/li&gt;
&lt;/ul&gt;
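
&lt;p&gt;Read as code, the scale above is a lookup over four lower bounds. This is a sketch of the published ranges only, assuming scores are numbers from 0 to 100; the name &lt;code&gt;tier_for&lt;/code&gt; is hypothetical and the real scanner may derive the tier differently.&lt;/p&gt;

```python
from bisect import bisect_right

TIER_FLOORS = (40, 55, 70, 85)            # lower bounds of T1, T2, T3, T4
TIERS = ("T0", "T1", "T2", "T3", "T4")


def tier_for(score):
    """Map a 0-100 score onto the published triage scale."""
    # Reject anything outside 0..100 by comparing against the clamped value.
    if max(0, min(score, 100)) != score:
        raise ValueError("score must be between 0 and 100")
    # bisect_right counts how many tier floors the score has reached.
    return TIERS[bisect_right(TIER_FLOORS, score)]


for sample in (39, 40, 67, 85):
    print(sample, tier_for(sample))
```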

&lt;p&gt;Stage 4 replication is reported separately as its own lane, where &lt;code&gt;R2&lt;/code&gt; means some reproducibility scaffolding is present, but not yet enough to call the repository replication-strong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Governance note:&lt;br&gt;
This is not a “bad repository” scoreboard, a clinical safety verdict, or a moral ranking. It is a deterministic evidence-surface pre-screen intended to support review, not replace it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the result was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;67 / 100&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T2 Caution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication lane: 55 / 100 (R2)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clinical adjacency: CA-DIRECT&lt;/strong&gt; (the repository surface makes direct healthcare-facing claims, even though it also carries an explicit non-clinical boundary)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code integrity warnings: C2 dependency pinning, C4 exception handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the workflow shift I wanted the tool to support.&lt;/p&gt;

&lt;p&gt;The same deterministic scan is rendered into multiple operator surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON for automation&lt;/li&gt;
&lt;li&gt;Markdown for review&lt;/li&gt;
&lt;li&gt;PDF for human-facing packet inspection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--explain&lt;/code&gt; for file / line / snippet proof tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That output shape is only possible because the result object already separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formal score and tier&lt;/li&gt;
&lt;li&gt;replication lane&lt;/li&gt;
&lt;li&gt;diagnostics lane&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;advisory boundary state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the PDF is not a separate product. It is a view over the same bounded audit object.&lt;/p&gt;

&lt;p&gt;Two details from this run are worth calling out.&lt;/p&gt;

&lt;p&gt;First, the scanner did &lt;strong&gt;not&lt;/strong&gt; manufacture chemistry findings just because ClawBio is bio-adjacent. The deterministic diagnostics lane reported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SMILES Surface Integrity: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES RDKit Validation: not_applicable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES Parser Guard: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the behavior I want. If a detector has no evidence, it should stay silent instead of inflating the report with domain-flavored noise. This is what the earlier thesis looks like when it hits real output: a detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;

&lt;p&gt;Second, the score is strict about observable repository conventions. ClawBio uses &lt;code&gt;ClawBio_README_Repo.md&lt;/code&gt; rather than a root &lt;code&gt;README.md&lt;/code&gt;, so the scan records &lt;code&gt;S1_missing_readme: -20&lt;/code&gt;. A human reviewer might decide that this is acceptable in context. The scanner does not make that leap for them. It only records what the repository exposes through the surfaces it knows how to measure.&lt;/p&gt;

&lt;p&gt;That distinction matters. A &lt;code&gt;T2 Caution&lt;/code&gt; result here does not mean “ClawBio is unsafe.” It means the current repository surface still raises review-relevant signals under the published deterministic rules, including dependency-pinning warnings, exception-handling warnings in a clinical-adjacent surface, and a stricter-than-human README convention check.&lt;/p&gt;

&lt;p&gt;And that is exactly why the next section matters: once the workflow is concrete, the remaining question is not whether the tool can produce an answer, but where its current boundaries still need to stay visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What still has to stay bounded&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system is better than it was, but three limits still need to stay visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The public surface is broad
&lt;/h3&gt;

&lt;p&gt;There is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring&lt;/li&gt;
&lt;li&gt;diagnostics&lt;/li&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;advisory packeting&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;JSON / Markdown / PDF / explain outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful, but it increases onboarding cost.&lt;/p&gt;

&lt;p&gt;The CLI is clearer now, but the broader public surface has to stay disciplined.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The deterministic diagnostics lane is still missing a published calibration threshold
&lt;/h3&gt;

&lt;p&gt;The diagnostics lane is evidence-first by design, but one practical gap remains: the public release does not yet ship a benchmark-backed threshold document saying exactly when a detector is mature enough to graduate from evidence-only into score-bearing territory.&lt;/p&gt;

&lt;p&gt;Right now the rule is conceptually clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;commit-pinned fixtures&lt;/li&gt;
&lt;li&gt;reproducible detector output&lt;/li&gt;
&lt;li&gt;explicit false-positive review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the public decision boundary is still partly narrative. Until that calibration surface is published in a more operational form, keeping diagnostics evidence-only is the safer choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The regulatory confidence labels are rule-authored, not empirically validated
&lt;/h3&gt;

&lt;p&gt;Mapping labels such as &lt;code&gt;strong&lt;/code&gt;, &lt;code&gt;moderate&lt;/code&gt;, and &lt;code&gt;weak-moderate&lt;/code&gt; are currently fixed rule-level judgments in the mapping document. They are not runtime model outputs, but they are also not yet backed by inter-rater reliability studies or a published reviewer-agreement benchmark.&lt;/p&gt;

&lt;p&gt;That means they are useful as bounded structural mapping language, but they should not be treated as empirical proof that multiple auditors would converge on exactly the same label distribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Earlier context&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;Medical AI Repositories Need More Than Benchmarks. We Built STEM-AI to Audit Trust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Try it yourself&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;STEM BIO-AI is Apache 2.0 and fully open source.&lt;/p&gt;

&lt;p&gt;If you want to know whether a bio/medical AI repository is actually exposing reviewable evidence, or whether your own repository is weaker than you think, run it yourself.&lt;/p&gt;

&lt;p&gt;That is the real test.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;License: Apache 2.0&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier STEM-AI posts were about why repository trust deserves its own audit layer.&lt;/p&gt;

&lt;p&gt;This phase was about something more practical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;what does that audit layer have to look like if an engineer is actually going to run it, inspect it, and put it in a pipeline?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, the answer was simple:&lt;/p&gt;

&lt;p&gt;Separate the workflows.&lt;br&gt;
Separate the lanes.&lt;br&gt;
Keep diagnostics evidence-first.&lt;br&gt;
Keep regulatory mapping subordinate to evidence.&lt;br&gt;
Keep advisory AI bounded.&lt;/p&gt;

&lt;p&gt;Optimize for inspectability, not just score production.&lt;/p&gt;

&lt;p&gt;That is what changed the project.&lt;/p&gt;

&lt;p&gt;Not bigger claims.&lt;/p&gt;

&lt;p&gt;Better boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" alt="Final thought" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>opensource</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:51:45 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</link>
      <guid>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Previous article:&lt;br&gt;
&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;&lt;strong&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.&lt;/p&gt;

&lt;p&gt;The more useful lesson was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Text-only review is too weak for bio/medical AI. You have to inspect the code path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That worked.&lt;/p&gt;

&lt;p&gt;But it exposed the next problem.&lt;/p&gt;

&lt;p&gt;If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?&lt;/p&gt;

&lt;p&gt;LLMs drift. &lt;br&gt;
One session can enforce a clinical boundary strictly. &lt;br&gt;
Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 is my answer to that problem.&lt;/p&gt;

&lt;p&gt;It does not try to make the LLM deterministic by writing a longer prompt.&lt;/p&gt;

&lt;p&gt;It binds the audit to a memory contract.&lt;/p&gt;


&lt;h2&gt;
  
  
  What v1.1.2 adds
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" alt="standard audit vs Bio/Medical AI audit" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 introduces &lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA: Memory-Injected Contract Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The v1.1.2 layer includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memory/mica.yaml&lt;/code&gt; -- composition contract&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai.mica.v1.1.2.json&lt;/code&gt; -- machine-checkable memory archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-playbook.v1.1.2.md&lt;/code&gt; -- session playbook and drift guard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-lessons.v1.1.2.md&lt;/code&gt; -- historical failure-mode archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spec/STEM-AI_v1.1.2_CORE.md&lt;/code&gt; -- canonical audit spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract pins 18 invariants.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage order is fixed: README intent, cross-platform evidence, code/bio evidence.&lt;/li&gt;
&lt;li&gt;Stage weights are fixed.&lt;/li&gt;
&lt;li&gt;Tier boundaries are fixed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; cannot be bypassed.&lt;/li&gt;
&lt;li&gt;Stage 2 uses external evidence; in LOCAL_ANALYSIS mode it may use Stage 2R repo-local consistency instead.&lt;/li&gt;
&lt;li&gt;Governance overlay cannot raise the formal base tier.&lt;/li&gt;
&lt;li&gt;C1-C4 code-integrity checks only run in LOCAL_ANALYSIS mode.&lt;/li&gt;
&lt;li&gt;Mandatory clinical-use disclaimers cannot be omitted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a claim that the LLM becomes perfectly deterministic.&lt;/p&gt;

&lt;p&gt;It is a narrower claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The auditor is forced to operate inside a contract whose scoring rules, hard floors, and evidence requirements are inspectable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the useful layer.&lt;/p&gt;
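&lt;p&gt;As a rough illustration of what "inspectable" can mean in practice, the sketch below pins the stage weights as data and compares them with the weights recorded in an audit result. The weights are the ones reported in the score object later in this article; the &lt;code&gt;CONTRACT&lt;/code&gt; layout and the &lt;code&gt;weight_violations&lt;/code&gt; helper are hypothetical and are not the actual MICA schema.&lt;/p&gt;

```python
import math

# Weights copied from the score object shown later in this article.
# The dictionary layout is an illustrative assumption, not the real contract schema.
CONTRACT = {
    "stage_weights": {"stage_1": 0.4, "stage_2": 0.2, "stage_3": 0.4},
}


def weight_violations(result_weights, contract=CONTRACT):
    """Return human-readable violations; an empty list means the weights match."""
    expected = contract["stage_weights"]
    problems = []
    if set(result_weights) != set(expected):
        problems.append(
            f"stage keys differ: {sorted(result_weights)} vs {sorted(expected)}"
        )
        return problems
    for stage, weight in expected.items():
        # isclose avoids false alarms from float representation noise.
        if not math.isclose(result_weights[stage], weight):
            problems.append(f"{stage}: expected {weight}, got {result_weights[stage]}")
    return problems


print(weight_violations({"stage_1": 0.4, "stage_2": 0.2, "stage_3": 0.4}))
print(weight_violations({"stage_1": 0.5, "stage_2": 0.1, "stage_3": 0.4}))
```

&lt;p&gt;A reviewer can run a check like this against any result file without re-running the LLM session, which is the point of keeping the rules in files rather than in a prompt.&lt;/p&gt;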


&lt;h2&gt;
  
  
  What "loading the contract" means
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" alt="Forcing the auditor to operate inside a machine-checkable memory contract" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA&lt;/a&gt;&lt;/strong&gt; is not hidden model memory.&lt;/p&gt;

&lt;p&gt;It is also not a claim that the model provider changed the LLM.&lt;/p&gt;

&lt;p&gt;In v1.1.2, "loading the contract" means the audit session starts by reading a fixed set of repository files before it is allowed to score the target:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory/mica.yaml
memory/stem-ai.mica.v1.1.2.json
memory/stem-ai-playbook.v1.1.2.md
memory/stem-ai-lessons.v1.1.2.md
spec/STEM-AI_v1.1.2_CORE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" alt="Pinning the audit rules mathematically" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The auditor then performs a pre-execution contract test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirm the canonical spec exists&lt;/li&gt;
&lt;li&gt;confirm the memory archive exists&lt;/li&gt;
&lt;li&gt;confirm the invariant count is 18&lt;/li&gt;
&lt;li&gt;confirm the fixed tier boundaries are present&lt;/li&gt;
&lt;li&gt;confirm the Stage 2 / Stage 2R lane rule is present&lt;/li&gt;
&lt;li&gt;confirm Stage 3G cannot raise the formal tier&lt;/li&gt;
&lt;li&gt;confirm C1-C4 mode gating is active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does the audit proceed.&lt;/p&gt;

&lt;p&gt;This does not make the LLM mathematically deterministic.&lt;/p&gt;

&lt;p&gt;It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.&lt;/p&gt;
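&lt;p&gt;The stop-before-scoring behavior can be sketched as a small preflight function. The file names are the ones listed above; &lt;code&gt;ContractNotLoaded&lt;/code&gt;, &lt;code&gt;missing_contract_files&lt;/code&gt;, and &lt;code&gt;preflight&lt;/code&gt; are hypothetical names. This covers only the file-existence checks, because the internal layout of the memory archive is not shown in this post.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# File names are the ones listed in the article; everything else is illustrative.
CONTRACT_FILES = (
    "memory/mica.yaml",
    "memory/stem-ai.mica.v1.1.2.json",
    "memory/stem-ai-playbook.v1.1.2.md",
    "memory/stem-ai-lessons.v1.1.2.md",
    "spec/STEM-AI_v1.1.2_CORE.md",
)


class ContractNotLoaded(RuntimeError):
    """Raised when the audit must stop before scoring (hypothetical name)."""


def missing_contract_files(root):
    """Return the contract files that are absent under root, in contract order."""
    root = Path(root)
    return [name for name in CONTRACT_FILES if not (root / name).is_file()]


def preflight(root):
    """Refuse to continue unless every contract file is present."""
    missing = missing_contract_files(root)
    if missing:
        raise ContractNotLoaded("missing contract files: " + ", ".join(missing))
    return True


# Demo in a throwaway directory: an empty checkout fails, a complete one passes.
with tempfile.TemporaryDirectory() as tmp:
    try:
        preflight(tmp)
    except ContractNotLoaded as err:
        print("stopped:", err)
    for name in CONTRACT_FILES:
        path = Path(tmp) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("placeholder", encoding="utf-8")
    print("preflight ok:", preflight(tmp))
```

&lt;p&gt;Raising an exception rather than returning a warning keeps the failure mode closed: a scoring step that never runs cannot produce a score that looks valid.&lt;/p&gt;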

&lt;p&gt;That is the difference between &lt;strong&gt;"please be consistent"&lt;/strong&gt; and &lt;strong&gt;"execute this versioned contract."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit workflow
&lt;/h2&gt;

&lt;p&gt;STEM-AI v1.1.2 runs as a structured audit workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" alt="STEM-AI v1.1.2 workflow" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.&lt;/p&gt;

&lt;p&gt;It can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;package metadata&lt;/li&gt;
&lt;li&gt;workflow files&lt;/li&gt;
&lt;li&gt;test definitions&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;source-code paths&lt;/li&gt;
&lt;li&gt;deprecated or dead-code paths&lt;/li&gt;
&lt;li&gt;exception handling&lt;/li&gt;
&lt;li&gt;credential patterns&lt;/li&gt;
&lt;li&gt;provenance and hash-checking logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is intentionally split into two files:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report.md                  # human-readable audit judgment
experiment_results.json    # machine-readable evidence and score object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" alt="Separating subjective reasoning from verifiable mathematics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That split matters.&lt;/p&gt;

&lt;p&gt;The report explains the reasoning.&lt;/p&gt;

&lt;p&gt;The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.&lt;/p&gt;




&lt;h2&gt;
  
  
  A real target audit, not a synthetic example
&lt;/h2&gt;

&lt;p&gt;For this v1.1.2 demonstration, I used a real public repository:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/artic-network/fieldbioinformatics" rel="noopener noreferrer"&gt;artic-network/fieldbioinformatics&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The target is not the protagonist of this post.&lt;/p&gt;

&lt;p&gt;It is only the specimen used to show the audit workflow against a real bioinformatics codebase.&lt;/p&gt;

&lt;p&gt;The local audit produced:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audits/fieldbioinformatics_v1_1_2/report.md
audits/fieldbioinformatics_v1_1_2/experiment_results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The target snapshot:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"master"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8008b4c97c2193a82308ff6f0be507b1d9306e36"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;114&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the important part: the audit did not ask, "Does this README sound trustworthy?"&lt;/p&gt;

&lt;p&gt;It asked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do README claims match actual package metadata and entry points?&lt;/li&gt;
&lt;li&gt;Are there real CI and domain-specific tests?&lt;/li&gt;
&lt;li&gt;Are dependencies reproducible enough?&lt;/li&gt;
&lt;li&gt;Are there credential leaks?&lt;/li&gt;
&lt;li&gt;Are there deprecated patient-adjacent paths?&lt;/li&gt;
&lt;li&gt;Do clinical-adjacent output paths fail closed?&lt;/li&gt;
&lt;li&gt;Does the repository include positive governance evidence, or is governance simply absent?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where STEM-AI is useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  The score object
&lt;/h2&gt;

&lt;p&gt;The machine-readable result records the score like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_1_readme_intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_cross_platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_repo_local_consistency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_lane"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STAGE_2R_REPO_LOCAL_CONSISTENCY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"External Stage 2 was not collected; LOCAL_ANALYSIS used Stage 2R in the fixed 0.20 Stage 2 slot."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_3_code_bio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_penalty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"formal_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"T2 Caution"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;External Stage 2 is explicitly represented as &lt;code&gt;null&lt;/code&gt; for this local-only audit.&lt;/p&gt;

&lt;p&gt;That does not mean cross-platform consistency is unimportant.&lt;/p&gt;

&lt;p&gt;It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.&lt;/p&gt;

&lt;p&gt;Stage 2R asks whether the repository's own surfaces agree with each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README vs package metadata and CLI entry points&lt;/li&gt;
&lt;li&gt;README vs docs, tutorials, and troubleshooting&lt;/li&gt;
&lt;li&gt;README test claims vs CI workflow and test definitions&lt;/li&gt;
&lt;li&gt;clinical-adjacent outputs vs local intended-use boundaries&lt;/li&gt;
&lt;/ul&gt;
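&lt;p&gt;The lane choice for the Stage 2 slot can be sketched as follows. The field names come from the score object shown above. The precedence rule (external evidence when it was collected, otherwise Stage 2R) is my reading of the lane rule rather than a quote from the spec, and the &lt;code&gt;STAGE_2_EXTERNAL&lt;/code&gt; label is a hypothetical placeholder.&lt;/p&gt;

```python
def select_stage_2(score):
    """Pick the lane and value that fill the fixed Stage 2 slot.

    Assumed rule: use external Stage 2 evidence when it was collected,
    otherwise fall back to Stage 2R. Never fill the slot with a guess.
    """
    external = score.get("stage_2_cross_platform")
    if external is not None:
        return "STAGE_2_EXTERNAL", external  # lane label is hypothetical
    local = score.get("stage_2_repo_local_consistency")
    if local is None:
        raise ValueError("neither Stage 2 nor Stage 2R evidence is available")
    return "STAGE_2R_REPO_LOCAL_CONSISTENCY", local


# Field values copied from the score object shown above.
score = {
    "stage_2_cross_platform": None,
    "stage_2_repo_local_consistency": 75,
}
lane, value = select_stage_2(score)
print(lane, value)
```

&lt;p&gt;Keeping the unused lane as an explicit &lt;code&gt;null&lt;/code&gt; rather than dropping the key lets a reviewer see that the evidence was not collected, not that it was forgotten.&lt;/p&gt;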

&lt;p&gt;The contract defines the fixed-weight calculation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - Risk Penalty
      = (65 x 0.40) + (75 x 0.20) + (55 x 0.40) - 0
      = 63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final tier is therefore:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T2 Caution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not because the prose sounded balanced.&lt;/p&gt;

&lt;p&gt;Because the contract math forces that result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the T0 hard floor did not trigger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" alt="Why the T0 hard floor did not trigger" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.&lt;/p&gt;

&lt;p&gt;In simplified form:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If a repository is CA-DIRECT
and it has no substantive code implementation,
then final tier = T0 regardless of score math.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.&lt;/p&gt;
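&lt;p&gt;The simplified rule translates almost directly into code. This is a minimal sketch of the simplified form only, not the full rule in the spec; &lt;code&gt;final_tier&lt;/code&gt; and its parameter names are hypothetical.&lt;/p&gt;

```python
def final_tier(score_tier, ca_severity, has_substantive_code):
    """Apply the simplified T0 hard floor on top of a score-derived tier."""
    if ca_severity == "CA-DIRECT" and not has_substantive_code:
        return "T0"  # the floor overrides whatever the score math produced
    return score_tier


# A direct clinical claim with no real implementation is floored to T0,
# even if the weighted score alone would have landed in a higher tier.
print(final_tier("T3", "CA-DIRECT", False))   # T0

# Any other combination keeps the tier that the score math produced.
print(final_tier("T2", "CA-INDIRECT", True))  # T2
```

&lt;p&gt;The floor is checked after the score is computed and never feeds back into it, so good wording in a README cannot buy the tier back.&lt;/p&gt;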

&lt;p&gt;The audited repository did not trigger that floor because STEM-AI classified it as:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clinical_adjacent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ca_severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"t0_hard_floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repository produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface did not make direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.&lt;/p&gt;

&lt;p&gt;So the result is not T0.&lt;/p&gt;

&lt;p&gt;But it is also not high-trust.&lt;/p&gt;

&lt;p&gt;The bounded result is T2 Caution.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" alt="STEM-AI Audit v1.1.2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code-integrity findings
&lt;/h2&gt;

&lt;p&gt;The same JSON records C1-C4 LOCAL_ANALYSIS checks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C1_hardcoded_credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C2_dependency_pinning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C3_dead_or_deprecated_patient_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C4_exception_handling_clinical_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the difference between a general review and a code-path audit.&lt;/p&gt;

&lt;p&gt;A text review can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project appears technically mature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A code-path audit can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Credential patterns were checked. Dependency pinning is weak. Deprecated patient-adjacent metadata exists. One clinical-adjacent filtering path does not fail closed on missing depth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a more useful governance object.&lt;/p&gt;

&lt;p&gt;It is not a certificate.&lt;/p&gt;

&lt;p&gt;It is a map of what a reviewer should trust, distrust, or inspect next.&lt;/p&gt;
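&lt;p&gt;One way to turn the C1-C4 object into that kind of map is to group the checks by status. The statuses below are copied from the JSON shown above; &lt;code&gt;group_by_status&lt;/code&gt; is a hypothetical helper and is not part of STEM-AI.&lt;/p&gt;

```python
# Statuses copied from the C1-C4 JSON shown above.
checks = {
    "C1_hardcoded_credentials": {"status": "PASS"},
    "C2_dependency_pinning": {"status": "WARN"},
    "C3_dead_or_deprecated_patient_adjacent_paths": {"status": "WARN"},
    "C4_exception_handling_clinical_adjacent_paths": {"status": "WARN"},
}


def group_by_status(checks):
    """Map each status string to the sorted list of check names that reported it."""
    grouped = {}
    for name, body in checks.items():
        grouped.setdefault(body["status"], []).append(name)
    return {status: sorted(names) for status, names in grouped.items()}


grouped = group_by_status(checks)
for status in sorted(grouped):
    # One line per status: the count first, then the checks to inspect.
    print(status, len(grouped[status]), ", ".join(grouped[status]))
```

&lt;p&gt;The three &lt;code&gt;WARN&lt;/code&gt; entries are the reviewer's next reading list; the single &lt;code&gt;PASS&lt;/code&gt; only says that the credential patterns searched for were not found.&lt;/p&gt;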




&lt;h2&gt;
  
  
  A small Python verifier
&lt;/h2&gt;

&lt;p&gt;Here is a small dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It does not need the target's private code or any patient data; it only checks the machine-readable audit result.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;


&lt;span class="n"&gt;RESULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audits/fieldbioinformatics_v1_1_2/experiment_results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;█&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;░&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RESULT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1_readme_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2_repo_local_consistency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3_code_bio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;risk_penalty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;risk_penalty&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 1  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 2R &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 3  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_integrity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1   65/100  █████████████░░░░░░░
Stage 2R  75/100  ███████████████░░░░░
Stage 3   55/100  ███████████░░░░░░░░░
Final     63/100  █████████████░░░░░░░
Tier      T2 Caution
C1_hardcoded_credentials: PASS
C2_dependency_pinning: WARN
C3_dead_or_deprecated_patient_adjacent_paths: WARN
C4_exception_handling_clinical_adjacent_paths: WARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Bio/medical AI governance is full of language that sounds safe but is hard to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"research use only"&lt;/li&gt;
&lt;li&gt;"not medical advice"&lt;/li&gt;
&lt;li&gt;"validated pipeline"&lt;/li&gt;
&lt;li&gt;"clinical-grade"&lt;/li&gt;
&lt;li&gt;"responsible AI"&lt;/li&gt;
&lt;li&gt;"human-in-the-loop"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those phrases are not enough.&lt;/p&gt;

&lt;p&gt;STEM-AI asks for observable structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source-code reality&lt;/li&gt;
&lt;li&gt;test reality&lt;/li&gt;
&lt;li&gt;CI reality&lt;/li&gt;
&lt;li&gt;dependency reality&lt;/li&gt;
&lt;li&gt;clinical boundary reality&lt;/li&gt;
&lt;li&gt;governance artifact reality&lt;/li&gt;
&lt;li&gt;code-integrity reality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.1.2 adds another layer:&lt;/p&gt;

&lt;p&gt;auditor reality.&lt;/p&gt;

&lt;p&gt;The AI auditor itself has to load a memory contract before it scores.&lt;/p&gt;

&lt;p&gt;That is what MICA is for.&lt;/p&gt;

&lt;p&gt;The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.&lt;/p&gt;

&lt;p&gt;Not hype.&lt;/p&gt;

&lt;p&gt;Not rejection by default.&lt;/p&gt;

&lt;p&gt;A bounded trust judgment with evidence paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The follow-on lane should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provision the target dependency environment&lt;/li&gt;
&lt;li&gt;run selected target tests in a controlled shell&lt;/li&gt;
&lt;li&gt;capture command, exit code, environment hash, and output digest&lt;/li&gt;
&lt;li&gt;attach a replay manifest to &lt;code&gt;experiment_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep runtime evidence separate from source/document/CI evidence&lt;/li&gt;
&lt;/ul&gt;
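
&lt;p&gt;The capture step in the list above can be sketched roughly as follows. This is a minimal illustration, not STEM-AI's implementation: the function names, the manifest keys, and the choice to fingerprint the environment from the interpreter version and platform string are all assumptions. A real lane would more likely hash a lockfile or a container image digest.&lt;/p&gt;

```python
import hashlib
import json
import platform
import subprocess
import sys


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 of a byte string."""
    return hashlib.sha256(data).hexdigest()


def environment_hash() -> str:
    # Assumption: the environment is summarized by interpreter version and
    # platform string only; this is a stand-in for a real dependency fingerprint.
    fingerprint = {"python": sys.version, "platform": platform.platform()}
    return sha256_hex(json.dumps(fingerprint, sort_keys=True).encode("utf-8"))


def run_and_record(command: list[str]) -> dict:
    """Run one command and return a replay-manifest entry (hypothetical keys)."""
    completed = subprocess.run(command, capture_output=True, check=False)
    return {
        "command": command,
        "exit_code": completed.returncode,
        "environment_sha256": environment_hash(),
        # Digest covers stdout followed by stderr, as raw bytes.
        "output_sha256": sha256_hex(completed.stdout + completed.stderr),
    }
```

&lt;p&gt;Keeping this entry in its own manifest structure, rather than folding it into the score, is one way to keep runtime evidence separate from source, document, and CI evidence.&lt;/p&gt;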

&lt;p&gt;For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" alt="Stem-AI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;STEM-AI is &lt;strong&gt;not a clinical certifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is also &lt;strong&gt;not trying to replace scientific review, regulatory review, or domain experts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Its role is narrower: &lt;strong&gt;make the governance conversation start from observable evidence instead of presentation quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, that means asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the repository claim?&lt;/li&gt;
&lt;li&gt;What does the code actually implement?&lt;/li&gt;
&lt;li&gt;Do the local surfaces agree with each other?&lt;/li&gt;
&lt;li&gt;Are the tests domain-specific or merely infrastructural?&lt;/li&gt;
&lt;li&gt;Are clinical-adjacent boundaries explicit?&lt;/li&gt;
&lt;li&gt;Can the auditor's own scoring logic be inspected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where I think STEM-AI belongs in AI governance.&lt;/p&gt;

&lt;p&gt;Not as the final authority.&lt;/p&gt;

&lt;p&gt;As the evidence gate before authority is invoked.&lt;/p&gt;

&lt;p&gt;It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does this repository establish enough observable trust to be considered, contained, or rejected?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>bioinformatics</category>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Each /slop Is a Calibration Signal — AI-SLOP Detector v3.6.0 and the Claude Code Skill</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:04:44 +0000</pubDate>
      <link>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</link>
      <guid>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" alt="The Quiet Failure of AI Development" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-assisted development has a quiet failure mode: the assistant that creates the pattern often becomes the assistant that reviews it.&lt;/p&gt;

&lt;p&gt;When you and Claude work inside the same session, you drift together. The review criteria shift with the assistant's habits. After enough sessions, the same assistant that wrote the hollow function body is also the one approving the pull request. There is no external reference point — unless you build one.&lt;/p&gt;

&lt;p&gt;That is the problem AI-SLOP Detector v3.6.0 addresses with the Claude Code skill.&lt;/p&gt;

&lt;p&gt;Every time you run &lt;code&gt;/slop&lt;/code&gt; inside a session, the scan result is recorded to a project-scoped history. When enough re-scan evidence accumulates, bounded self-calibration adjusts the detection weights for your codebase — automatically, without a manual command. The scanner does not drift with the session. It stays anchored to observed scan outcomes.&lt;/p&gt;

&lt;p&gt;It does not get smarter every time. It builds calibration signal every time. That is a more accurate claim, and the distinction matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Skill Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" alt="The Skill layer Quality Policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; claude-skills/slop-detector ~/.claude/skills/slop-detector
&lt;span class="c"&gt;# restart Claude Code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four slash commands become available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full project scan — interprets findings, prioritizes fixes, proposes patch plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-file [path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-file deep-dive — explains each metric, gives concrete fix per pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-gate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard gate decision — PASS or FAIL, lists blocking files with deficit_score &amp;gt;= 70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-spar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adversarial validation — probes metric boundaries, catches calibration drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
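
&lt;p&gt;The gate rule in the table is simple enough to sketch. The snippet below is an illustration only: the input shape (a mapping from file path to deficit score) and the function name are assumptions, and only the blocking threshold of 70 comes from the command description.&lt;/p&gt;

```python
BLOCKING_THRESHOLD = 70.0  # from the /slop-gate description: deficit_score >= 70 blocks


def gate_decision(file_scores: dict[str, float]) -> tuple[str, list[str]]:
    """Return ("PASS" or "FAIL", sorted list of blocking files)."""
    blocking = sorted(
        path for path, score in file_scores.items() if score >= BLOCKING_THRESHOLD
    )
    return ("FAIL" if blocking else "PASS", blocking)
```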

&lt;p&gt;The intended workflow inside a Claude session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. /slop               → baseline scan, identify top offenders
2. review findings     → Claude prioritizes by deficit_score
3. patch files         → fix patterns with Claude's help
4. /slop-file &amp;lt;path&amp;gt;   → verify improvement per file
5. /slop               → confirm project aggregate improved
6. /slop-gate          → gate decision before merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality policy lives in the skill layer. You do not re-explain what &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; means or which patterns are critical in every session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The LEDA Flywheel
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" alt="The LEDA Flywheel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that matters.&lt;/p&gt;

&lt;p&gt;LEDA is not model retraining. It is bounded weight calibration based on repeated scan outcomes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/slop&lt;/code&gt; runs &lt;code&gt;slop-detector --project . --json&lt;/code&gt; — without &lt;code&gt;--no-history&lt;/code&gt;. Every invocation auto-records results to &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;, tagged with a &lt;code&gt;project_id&lt;/code&gt; (sha256 of cwd) so signals never mix across different repositories.&lt;/p&gt;
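
&lt;p&gt;A project-scoped identifier of this kind could be derived as in the sketch below. The exact normalization the tool applies to the working directory before hashing is not stated here, so resolving the path and encoding it as UTF-8 are assumptions.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def project_id(directory: Path) -> str:
    # Assumption: hash the resolved absolute path, encoded as UTF-8 text.
    resolved = directory.resolve()
    return hashlib.sha256(str(resolved).encode("utf-8")).hexdigest()
```

&lt;p&gt;Because the identifier depends only on the directory path, two checkouts of the same repository in different locations would get different identifiers under this scheme.&lt;/p&gt;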

&lt;p&gt;After every &lt;strong&gt;10 re-scanned files&lt;/strong&gt;, the tool runs the LEDA self-calibration loop automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/slop called
    │
    ├─► scan result → recorded to history.db (project-scoped)
    │
    ├─► 10 re-scanned files milestone?
    │       └─► SelfCalibrator: 4D grid-search over run history
    │               (ldr × inflation × ddc × purity weights)
    │               └─► confidence gap &amp;gt; 0.10?
    │                       └─► .slopconfig.yaml updated silently
    │
    └─► next /slop → calibrated weights, sharper detection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibrator uses re-scanned files as signal — not raw record count. A file counts toward the milestone only when the tool has seen it improve or degrade across at least two runs. This prevents first-time project scans from triggering calibration on noise.&lt;/p&gt;
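
&lt;p&gt;One way to read that rule is sketched below. The record shape and function names are hypothetical, and treating "improve or degrade" as "the score changed between runs" is an interpretation rather than the tool's documented logic. Only the milestone of 10 comes from the text.&lt;/p&gt;

```python
from collections import defaultdict

MILESTONE = 10  # from the text: calibration runs after every 10 re-scanned files


def rescanned_files(records: list[tuple[str, float]]) -> set[str]:
    """records: (file_path, deficit_score) pairs in scan order (hypothetical shape)."""
    history: dict[str, list[float]] = defaultdict(list)
    for path, score in records:
        history[path].append(score)
    # A file counts as signal only if its score changed across runs, which
    # also implies it was scanned at least twice.
    return {path for path, scores in history.items() if min(scores) != max(scores)}


def milestone_reached(previous_count: int, current_count: int) -> bool:
    # True when the re-scanned file count crosses a multiple of MILESTONE.
    return current_count // MILESTONE > previous_count // MILESTONE
```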

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" alt="Constrained to Reality" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three constraints keep calibration bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain-anchored&lt;/strong&gt; — grid search is constrained to ±0.15 around domain baseline weights. Detection cannot drift outside the meaningful range for your project type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence gate&lt;/strong&gt; — only applies when the top candidate weight set beats the second by &amp;gt; 0.10. Ambiguous signals produce no change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift warnings&lt;/strong&gt; — &lt;code&gt;CalibrationResult.warnings&lt;/code&gt; flags any dimension that shifted &amp;gt; 0.25 from the anchor.&lt;/li&gt;
&lt;/ul&gt;
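
&lt;p&gt;The three constraints can be combined in a small sketch. This is not the &lt;code&gt;SelfCalibrator&lt;/code&gt; source: the grid step, the scoring callback, and the function names are assumptions. Only the four dimensions and the 0.15, 0.10, and 0.25 thresholds come from the text.&lt;/p&gt;

```python
from itertools import product

DIMENSIONS = ("ldr", "inflation", "ddc", "purity")
SEARCH_RADIUS = 0.15   # domain-anchored bound around the baseline weights
CONFIDENCE_GAP = 0.10  # the top candidate must beat the runner-up by more than this
DRIFT_WARNING = 0.25   # flag dimensions further than this from the anchor


def candidate_grid(anchor, step=SEARCH_RADIUS):
    """Yield weight dicts within SEARCH_RADIUS of the anchor, clamped to [0, 1]."""
    offsets = (-step, 0.0, step)
    for combo in product(offsets, repeat=len(DIMENSIONS)):
        yield {
            dim: round(min(1.0, max(0.0, anchor[dim] + delta)), 4)
            for dim, delta in zip(DIMENSIONS, combo)
        }


def calibrate(anchor, current, score_fn):
    """Return (weights, warnings); keep `current` when the signal is ambiguous."""
    # Clamping can create duplicate candidates, so de-duplicate before ranking.
    unique = list({tuple(c.values()): c for c in candidate_grid(anchor)}.values())
    ranked = sorted(unique, key=score_fn, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if score_fn(best) - score_fn(runner_up) > CONFIDENCE_GAP:
        chosen = best
    else:
        chosen = current  # confidence gate: ambiguous signal, no change
    warnings = [d for d in DIMENSIONS if abs(chosen[d] - anchor[d]) > DRIFT_WARNING]
    return chosen, warnings
```

&lt;p&gt;In this sketch a drift warning can only fire for weights that were already far from the anchor, since every new candidate sits inside the search radius.&lt;/p&gt;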

&lt;p&gt;&lt;code&gt;/slop-spar&lt;/code&gt; adds a separate adversarial layer: it probes known-pattern anchors, metric boundary cases, and existence conditions. When it detects that measured behavior has diverged from metric claims, it recommends &lt;code&gt;--self-calibrate --apply-calibration&lt;/code&gt; explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Data Shows — and What We Won't Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" alt="Workflow telemetry, not empty claims" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will not tell you that AI-SLOP Detector improves code quality by X%.&lt;/p&gt;

&lt;p&gt;We have not run a controlled study. We have not compared matched projects with and without the tool. Any number we put here would be a claim we cannot prove, and this tool is built specifically to catch that kind of thing.&lt;/p&gt;

&lt;p&gt;What we do have: the tool scanning itself. Every time a core module was changed, it got re-scanned. N = 14,367 records across all projects in &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not outcome evidence. It is workflow telemetry. Here is what the scan history shows for the eight most-improved files in this codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File                   Scans  Worst → Best   Improvement
─────────────────────────────────────────────────────────
ddc.py                   86   87.8 →  11.0    -76.8 pts
placeholder.py           92   70.3 →   0.0    -70.3 pts
cross_file.py            89   70.3 →   5.0    -65.3 pts
ci_gate.py               88   69.3 →   6.2    -63.1 pts
cli.py                   88   68.4 →   8.4    -60.0 pts
ldr.py                   90   58.0 →   0.1    -57.9 pts
python_advanced.py       95   74.0 →  18.0    -55.9 pts
context_jargon.py        86   55.7 →   5.0    -50.7 pts
─────────────────────────────────────────────────────────
Source: self-scan, history.db — not an independent study
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the weekly project aggregate (avg deficit score):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week      Avg Deficit   Critical Files   Note
────────────────────────────────────────────────────────
2026-W09     11.9            3           baseline
2026-W10     22.1           20           structural refactor spike
2026-W14     20.0           58           large feature addition
2026-W15     11.9           14           post-refactor recovery
2026-W17     12.2           13           current — stable CLEAN state
────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mechanism is not mysterious. Scan reveals structural problems → Claude sees exact pattern names and line references → Claude (or the developer) fixes them → rescan confirms improvement → LEDA registers the delta and adjusts detection weights accordingly.&lt;/p&gt;

&lt;p&gt;The loop does not guarantee quality. It makes quality visible, then measurable, then improvable.&lt;/p&gt;

&lt;p&gt;Whether that loop improves your codebase is something your &lt;code&gt;history.db&lt;/code&gt; will tell you — not us.&lt;/p&gt;




&lt;h2&gt;
  
  
  Also in v3.6.0
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" alt="System diagnostics &amp;amp; Protocol refinements" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI gate exit code fix.&lt;/strong&gt; &lt;code&gt;--ci-mode hard&lt;/code&gt; without &lt;code&gt;--ci-report&lt;/code&gt; was returning exit 0 even on &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; files — a two-line fix in &lt;code&gt;_evaluate_ci_gate()&lt;/code&gt; (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0d67997" rel="noopener noreferrer"&gt;&lt;code&gt;0d67997&lt;/code&gt;&lt;/a&gt;). This affected v3.1.1 through v3.5.0 on the specific path of using the gate without the reporting flag. A regression test at the subprocess level was added to prevent recurrence (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0208af4" rel="noopener noreferrer"&gt;&lt;code&gt;0208af4&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hooks rewritten.&lt;/strong&gt; All three hook variants now use &lt;code&gt;python -m slop_detector.cli&lt;/code&gt; as the entry point, which bypasses the Windows &lt;code&gt;.exe&lt;/code&gt; wrapper exit-code issue. The nonexistent &lt;code&gt;--severity high&lt;/code&gt; flag has been replaced with &lt;code&gt;--ci-mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/flamehaven01/AI-SLOP-Detector&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.6.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-detector&lt;/span&gt;           &lt;span class="c1"&gt;# hard gate&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-warn    # report only&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-patterns  # fast per-file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;VS Code Extension v3.6.0.&lt;/strong&gt; Version tracks core library. No behavior changes from v3.5.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shape of the Loop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" alt="An External reference point" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill + LEDA loop is the external reference point. Detection weights stay grounded in observed scan outcomes — files that improved across re-scans, files that stayed problematic — rather than in what the assistant believes is correct at any given moment.&lt;/p&gt;

&lt;p&gt;The loop does not guarantee quality. It makes quality visible, then measurable, then improvable.&lt;/p&gt;

&lt;p&gt;We won't tell you what percentage your code will improve. That would make us the thing we are trying to detect.&lt;/p&gt;

&lt;p&gt;The scanner is not Claude's opinion about code quality. It is a measurement that gets calibrated against reality, session by session. Your &lt;code&gt;history.db&lt;/code&gt; will tell you the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" alt="The Shape of the Loop" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/ai-slop-detector/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/CLAUDE_CODE_SKILL.md" rel="noopener noreferrer"&gt;Claude Code Skill docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/SELF_CALIBRATION.md" rel="noopener noreferrer"&gt;Self-Calibration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>claudeai</category>
      <category>codequality</category>
      <category>ai</category>
    </item>
    <item>
      <title>When an AI Pipeline Passes — But One Path Still Must Be Held: EXP-034</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 27 Apr 2026 10:09:19 +0000</pubDate>
      <link>https://dev.to/flamehaven01/when-an-ai-pipeline-passes-but-one-path-still-must-be-held-exp-034-16af</link>
      <guid>https://dev.to/flamehaven01/when-an-ai-pipeline-passes-but-one-path-still-must-be-held-exp-034-16af</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fianvzlqvv7dw9ptnldvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fianvzlqvv7dw9ptnldvx.png" alt="Cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;No efficacy, causal, or clinical claims are made in this report.&lt;/em&gt;&lt;br&gt;
RExSyn is an experimental Bio-AI governance pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You do not need to know the earlier experiments to read this report.&lt;/p&gt;

&lt;p&gt;Most AI pipeline reports ask one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the system pass?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 asked a stricter one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which path was allowed to count?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a multi-stage AI pipeline, a final &lt;code&gt;PASS&lt;/code&gt; can hide a lot of unresolved risk. A branch may be unstable. A regeneration path may drift. A new external API may enter the chain without being governed. A new modality may appear to improve the system while quietly changing the basis of judgment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7hho1hqvwps2xp0m9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7hho1hqvwps2xp0m9p.png" alt="The real result" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So EXP-034 was not designed to produce a clean success story.&lt;/p&gt;

&lt;p&gt;It was designed to separate three things:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anchored expansion path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GO&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepted path for EXP-034 reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current regeneration path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HOLD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Diagnostic evidence, not acceptance baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Next remediation cycle&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EXP-035&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RCA and repair target&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the real result.&lt;/p&gt;

&lt;p&gt;EXP-034 passed, but not because every path passed.&lt;/p&gt;

&lt;p&gt;It passed because the accepted anchor remained stable, the expansion tracks did not break the judgment system, and the unresolved regeneration path was explicitly held instead of being silently mixed into acceptance.&lt;/p&gt;




&lt;h2&gt;
  
  
  What EXP-034 tested
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve4hb86wepyckegmtdfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve4hb86wepyckegmtdfz.png" alt="Locking the Boundary" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-033 had already established a parity baseline.&lt;/p&gt;

&lt;p&gt;EXP-034 asked whether that baseline could survive controlled expansion while adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a modal update track,&lt;/li&gt;
&lt;li&gt;a live AlphaFold EBI observer endpoint,&lt;/li&gt;
&lt;li&gt;and AlphaGenome / AG measurement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operating rule was simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reproduce the parity baseline first.&lt;/li&gt;
&lt;li&gt;Only then allow expansion.&lt;/li&gt;
&lt;li&gt;Only then compare governance behavior across experiment cycles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the parity anchor breaks, the rest is not expansion.&lt;/p&gt;

&lt;p&gt;It is regression.&lt;/p&gt;
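&lt;p&gt;The ordering rule above can be sketched as a short-circuit. This is a minimal illustration, not the EXP-034 implementation: &lt;code&gt;run_cycle&lt;/code&gt;, its arguments, and the status strings are hypothetical names.&lt;/p&gt;

```python
from typing import Callable


def run_cycle(
    reproduce_parity: Callable[[], bool],
    expansions: list[Callable[[], str]],
) -> dict:
    """Run expansion tracks only after the parity baseline reproduces."""
    if not reproduce_parity():
        # A broken anchor means nothing after it counts as expansion.
        return {"status": "REGRESSION", "tracks_run": []}
    # Expansion tracks execute only once the anchor is confirmed.
    results = [track() for track in expansions]
    return {"status": "EXPANSION", "tracks_run": results}
```

&lt;p&gt;The design choice is that expansion code is never reached when the anchor fails, so a later track cannot mask a parity break.&lt;/p&gt;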

&lt;p&gt;The scope was also locked: methodology, governance, and reproducibility only. The experiment did not claim biological efficacy, causal inference, or clinical recommendation.&lt;/p&gt;

&lt;p&gt;That boundary is important because this kind of system can easily sound more powerful than what was actually measured. EXP-034 was not asking whether the pipeline discovered a better biological answer.&lt;/p&gt;

&lt;p&gt;It was asking whether the judgment system stayed governable after new signals entered the chain.&lt;/p&gt;




&lt;h2&gt;
  
  
  The key split: PASS did not mean everything passed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsjcpcij1hext56s3s1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsjcpcij1hext56s3s1b.png" alt="The key split" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track-A produced the defining decision of the experiment.&lt;/p&gt;

&lt;p&gt;The accepted legacy replay anchor preserved the required PASS/BLOCK separation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Legacy replay anchor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sample balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arm accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arm balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dangerous false-pass rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;false reject rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That was the path allowed to anchor EXP-034.&lt;/p&gt;

&lt;p&gt;But the current regeneration path did not recover:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Current regeneration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sample balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HOLD&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the most important part of the experiment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXP-034 did not pretend the regeneration path passed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It kept that result inside the experiment as diagnostic evidence, but did not allow it to redefine the accepted baseline.&lt;/p&gt;

&lt;p&gt;That separation is not a minor operational detail. It is the governance result.&lt;/p&gt;

&lt;p&gt;A weak pipeline would have blended the two paths and still reported a final success. EXP-034 did the opposite. It allowed the stable anchor to proceed and held the unstable path for RCA.&lt;/p&gt;

&lt;p&gt;That is how a stage-gated system avoids changing its own question after seeing the result.&lt;/p&gt;
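&lt;p&gt;For readers who want to see how metrics of this kind are usually computed, here is a minimal sketch. The definitions are conventional ones and are an assumption, since the article does not publish the scoring code. The two-sample input is a toy set, not the EXP-034 sample set.&lt;/p&gt;

```python
def gate_metrics(expected: list[str], observed: list[str]) -> dict[str, float]:
    """Score PASS/BLOCK verdicts against expected labels.

    Assumes both classes appear at least once in `expected`.
    """
    pairs = list(zip(expected, observed))
    pass_obs = [o for e, o in pairs if e == "PASS"]
    block_obs = [o for e, o in pairs if e == "BLOCK"]
    pass_recall = sum(o == "PASS" for o in pass_obs) / len(pass_obs)
    block_recall = sum(o == "BLOCK" for o in block_obs) / len(block_obs)
    return {
        "accuracy": sum(e == o for e, o in pairs) / len(pairs),
        "balanced_accuracy": (pass_recall + block_recall) / 2,
        # Expected BLOCK but observed PASS: the dangerous direction.
        "dangerous_false_pass_rate": 1 - block_recall,
        # Expected PASS but observed BLOCK: the conservative direction.
        "false_reject_rate": 1 - pass_recall,
    }


# Toy input: one pass-eligible sample and one negative control.
expected = ["PASS", "BLOCK"]
anchor = gate_metrics(expected, ["PASS", "BLOCK"])
regen = gate_metrics(expected, ["BLOCK", "BLOCK"])
```

&lt;p&gt;On this toy input, a path that blocks everything scores 0.5 on both accuracy measures while its dangerous false-pass rate stays at zero. That is why the two error directions are worth reporting separately.&lt;/p&gt;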




&lt;h2&gt;
  
  
  Why path splitting matters
&lt;/h2&gt;

&lt;p&gt;The concrete governance problem is this:&lt;/p&gt;

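&lt;p&gt;&lt;strong&gt;A pipeline can pass for the wrong reason.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A report is valid only when three conditions hold together, so a failure in any one of them invalidates the whole:&lt;/p&gt;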

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_report = stable_anchor × traceable_extension × contained_instability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the anchor is not stable, the report cannot be trusted.&lt;/p&gt;

&lt;p&gt;If the extension is not traceable, the new signal becomes an ungoverned side channel.&lt;/p&gt;

&lt;p&gt;If instability is not contained, a diagnostic failure can quietly contaminate acceptance.&lt;/p&gt;

&lt;p&gt;A single final &lt;code&gt;PASS&lt;/code&gt; is not enough when several branches contribute to a verdict. You need to know which branch produced the accepted decision, which branch failed, which branch was only diagnostic, and which branch is allowed to affect future work.&lt;/p&gt;

&lt;p&gt;EXP-034 passed because all three conditions were enforced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the legacy replay anchor held,&lt;/li&gt;
&lt;li&gt;the new observer and AG paths were measured under governance,&lt;/li&gt;
&lt;li&gt;and the regeneration HOLD remained outside acceptance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference between a pipeline that merely outputs a verdict and a pipeline that controls which verdicts are allowed to count.&lt;/p&gt;
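&lt;p&gt;The product form of &lt;code&gt;valid_report&lt;/code&gt; can be read as a conjunction of three boolean conditions. The sketch below assumes that reading. The class and helper names are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportConditions:
    stable_anchor: bool
    traceable_extension: bool
    contained_instability: bool


def valid_report(c: ReportConditions) -> bool:
    # Product of indicators: a single False zeroes the whole report.
    return c.stable_anchor and c.traceable_extension and c.contained_instability


def failed_conditions(c: ReportConditions) -> list[str]:
    # Name the conditions that failed so the report says why it is invalid.
    return [name for name, ok in vars(c).items() if not ok]
```

&lt;p&gt;Returning the failed condition names, rather than only a boolean, keeps the reason for a rejection attached to the verdict.&lt;/p&gt;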




&lt;h2&gt;
  
  
  Adding AlphaFold EBI as an observer, not a predictor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F703iwj95lg12xjdtdgw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F703iwj95lg12xjdtdgw9.png" alt="Controlled Expansion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Relative to EXP-033, EXP-034 added a live AlphaFold Protein Structure Database / EBI observer line.&lt;/p&gt;

&lt;p&gt;This was not promoted into a primary predictor.&lt;/p&gt;

&lt;p&gt;It was wired as an observer/reference oracle and traced into governance as &lt;code&gt;ebi_g2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AlphaFold EBI direct endpoint for &lt;code&gt;P23219&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GO&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 7 observer tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2 passed&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;ebi_g2&lt;/code&gt; governance traceability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BLOCKED_IDP&lt;/code&gt; mapping path&lt;/td&gt;
&lt;td&gt;validated in test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point is not simply that an external endpoint responded.&lt;/p&gt;

&lt;p&gt;The point is that the external signal entered the system through a governed path. It was not allowed to float beside the pipeline as informal context.&lt;/p&gt;

&lt;p&gt;EXP-034 tested whether the new observer could be admitted without becoming an ungoverned side channel.&lt;/p&gt;
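&lt;p&gt;One way to keep an observer from becoming a side channel is to give it write access to a governance trace and nothing else. The sketch below assumes that design. &lt;code&gt;GovernanceTrace&lt;/code&gt;, &lt;code&gt;observe&lt;/code&gt;, and the injected &lt;code&gt;fetch&lt;/code&gt; callable are hypothetical, and no real AlphaFold EBI client is modeled here.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GovernanceTrace:
    entries: list[dict] = field(default_factory=list)


def observe(
    trace: GovernanceTrace,
    trace_id: str,
    accession: str,
    fetch: Callable[[str], dict],
) -> None:
    """Record an external reference signal in the trace.

    Returns None on purpose: the caller cannot feed the payload
    into a predictor without going through the trace.
    """
    try:
        payload = fetch(accession)
        entry = {"id": trace_id, "accession": accession,
                 "status": "GO", "payload": payload}
    except Exception as exc:
        # An unreachable endpoint is recorded, not silently skipped.
        entry = {"id": trace_id, "accession": accession,
                 "status": "HOLD", "error": str(exc)}
    trace.entries.append(entry)
```

&lt;p&gt;A call such as &lt;code&gt;observe(trace, "ebi_g2", "P23219", fetch)&lt;/code&gt; leaves exactly one auditable entry whether the endpoint responds or not.&lt;/p&gt;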




&lt;h2&gt;
  
  
  AG-live: non-degradation, not repair
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadd3dcaw37lha9ua1zch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadd3dcaw37lha9ua1zch.png" alt="AG-live: non-degradation, not repair" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track-C tested a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If AG-live enters the pipeline, does it change the final decision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer was no.&lt;/p&gt;

&lt;p&gt;AG-live did enter the pipeline.&lt;/p&gt;

&lt;p&gt;The AlphaGenome field was present with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AG field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;source&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alphagenome_api_live&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pathogenicity_score&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;confidence&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7143&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clinical_significance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uncertain&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are sanitized branch artifact values, not implementation code or full raw artifacts.&lt;/p&gt;

&lt;p&gt;AG-live did not change classification.&lt;/p&gt;

&lt;p&gt;Both controls remained governed by the same conservative decision boundary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Observed&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;EXP032-BLOCK-001&lt;/code&gt; negative control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK_EXPECTED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK / ESCALATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail-closed behavior preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;EXP032-PASS-001&lt;/code&gt; pass-eligible control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS_ELIGIBLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK / ESCALATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;conservative over-blocking persisted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the key nuance.&lt;/p&gt;

&lt;p&gt;AG-live did not create a dangerous false-pass. The negative control stayed blocked.&lt;/p&gt;

&lt;p&gt;But AG-live also did not repair the current regeneration hold. The pass-eligible control still failed to recover and remained blocked under &lt;code&gt;R2_component_floor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The governance surface moved slightly (&lt;code&gt;p_e2e&lt;/code&gt; rose by 0.0035), but the verdict and the blocking rule did not change:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Earlier AG branch&lt;/th&gt;
&lt;th&gt;AG-live branch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_e2e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0912&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0947&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clinical status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rule&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R2_component_floor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R2_component_floor&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So the correct conclusion is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AG improved the pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The correct conclusion is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AG-live changed the measurement surface slightly, but did not change the decision boundary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is exactly what non-degradation means here.&lt;/p&gt;

&lt;p&gt;It preserved fail-closed behavior on the negative control while leaving the pass-eligible control over-blocked.&lt;/p&gt;

&lt;p&gt;This is why Track-C can only be called non-degradation, not repair.&lt;/p&gt;
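&lt;p&gt;The distinction between degradation, repair, and non-degradation can be stated as a comparison of verdicts before and after an extension is admitted. The following sketch assumes simplified &lt;code&gt;PASS&lt;/code&gt;/&lt;code&gt;BLOCK&lt;/code&gt; labels. The function name, the dictionary keys, and the "before" verdicts are illustrative assumptions.&lt;/p&gt;

```python
def classify_extension(controls: list[dict]) -> str:
    """Classify an extension from per-control verdicts.

    Each control is a dict with 'expected', 'before', and 'after'.
    """
    def correct(c: dict, key: str) -> bool:
        return c[key] == c["expected"]

    # Any control that was right and is now wrong is a degradation.
    if any(correct(c, "before") and not correct(c, "after") for c in controls):
        return "DEGRADATION"
    # Any control that was wrong and is now right is a repair.
    if any(not correct(c, "before") and correct(c, "after") for c in controls):
        return "REPAIR"
    return "NON_DEGRADATION"


# Simplified labels for the two controls; "before" values are assumed.
track_c = [
    {"id": "EXP032-BLOCK-001", "expected": "BLOCK", "before": "BLOCK", "after": "BLOCK"},
    {"id": "EXP032-PASS-001", "expected": "PASS", "before": "BLOCK", "after": "BLOCK"},
]
outcome = classify_extension(track_c)
```

&lt;p&gt;Under these assumptions the pass-eligible control is wrong both before and after, so the outcome is neither a degradation nor a repair.&lt;/p&gt;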




&lt;h2&gt;
  
  
  Contract passed, but governance still blocked
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclbqwjvytk3ql600icp1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclbqwjvytk3ql600icp1.png" alt="Contract passed, but governance still blocked" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most useful details in EXP-034 is that the contract layer and governance layer did not collapse into one verdict.&lt;/p&gt;

&lt;p&gt;The contract inspection reported:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pipeline contract score&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.9077&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weakest connection&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dangerous pass risk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gate recommendation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;overall OK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the clinical governance layer still blocked the case.&lt;/p&gt;

&lt;p&gt;That is not a contradiction.&lt;/p&gt;

&lt;p&gt;It means the pipeline connection was valid enough to inspect, but the decision was not safe enough to accept.&lt;/p&gt;

&lt;p&gt;This distinction matters.&lt;/p&gt;

&lt;p&gt;A weaker system might treat a passing contract as permission to pass the whole output. EXP-034 did not do that. It allowed the contract layer to say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pipeline is connected.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;while the governance layer could still say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The claim should not pass.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That separation is exactly what a governance layer is supposed to preserve.&lt;/p&gt;
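&lt;p&gt;A minimal sketch of keeping the two layers as separate inputs to the final decision follows. The types and field names are hypothetical and chosen only to mirror the table above.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContractReport:
    score: float
    gate_recommendation: str  # "PASS" or "FAIL"


@dataclass(frozen=True)
class GovernanceVerdict:
    status: str  # "PASS" or "BLOCK"
    rule: Optional[str] = None


def final_decision(contract: ContractReport, governance: GovernanceVerdict) -> dict:
    # The contract layer only says whether the pipeline is inspectable.
    inspectable = contract.gate_recommendation == "PASS"
    # Acceptance needs both layers; a passing contract is never sufficient.
    accepted = inspectable and governance.status == "PASS"
    return {
        "inspectable": inspectable,
        "accepted": accepted,
        "blocking_rule": None if accepted else governance.rule,
    }
```

&lt;p&gt;With a contract recommendation of &lt;code&gt;PASS&lt;/code&gt; and a governance status of &lt;code&gt;BLOCK&lt;/code&gt;, the result is inspectable but not accepted, which matches the situation described in this section.&lt;/p&gt;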




&lt;h2&gt;
  
  
  Cross-cycle comparison: EXP-032 → EXP-033 → EXP-034
&lt;/h2&gt;

&lt;p&gt;Track-D compared the accepted anchor path across cycles.&lt;/p&gt;

&lt;p&gt;You do not need the earlier experiments as background. They matter here for one reason only:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXP-034 was not allowed to invent a new success criterion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EXP-032 and EXP-033 provided the previous PASS/BLOCK baseline. EXP-034 tested whether that baseline survived expansion.&lt;/p&gt;

&lt;p&gt;The classification baseline stayed fixed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compare&lt;/th&gt;
&lt;th&gt;Accuracy / balanced accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EXP-032 → EXP-034&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0 / 1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXP-033 → EXP-034&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0 / 1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the same time, governance signals moved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Governance signal&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ccge_p_e2e_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+0.04447488775996111&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nnsl_sr9_tech_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+0.04692394788063081&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nnsl_di2_tech_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-0.03667940951579321&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interpretation is narrow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judgment baseline stayed fixed while the governance surface became more measurable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is what EXP-034 was allowed to claim.&lt;/p&gt;

&lt;p&gt;It did not prove biological efficacy.&lt;/p&gt;

&lt;p&gt;It did not prove that every branch of the system was now stable.&lt;/p&gt;

&lt;p&gt;It proved that controlled expansion could happen without breaking the accepted PASS/BLOCK baseline.&lt;/p&gt;
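&lt;p&gt;The cross-cycle check has two independent parts: verdicts must match exactly, while governance signals are allowed to move and are simply reported. The sketch below illustrates that split. The function, the dictionary layout, and all numeric values are hypothetical and are not EXP-032, EXP-033, or EXP-034 data.&lt;/p&gt;

```python
def compare_cycles(prev: dict, curr: dict) -> dict:
    """Compare verdicts and governance signal means between two cycles."""
    # Part 1: the judgment baseline must be identical.
    verdicts_fixed = prev["verdicts"] == curr["verdicts"]
    # Part 2: signal movement is measured, not gated.
    deltas = {
        name: curr["signals"][name] - prev["signals"][name]
        for name in prev["signals"]
        if name in curr["signals"]
    }
    return {"verdicts_fixed": verdicts_fixed, "signal_deltas": deltas}


# Hypothetical values chosen only to make the arithmetic easy to follow.
prev_cycle = {
    "verdicts": {"ctrl_pass": "PASS", "ctrl_block": "BLOCK"},
    "signals": {"signal_a": 0.25},
}
curr_cycle = {
    "verdicts": {"ctrl_pass": "PASS", "ctrl_block": "BLOCK"},
    "signals": {"signal_a": 0.5},
}
report = compare_cycles(prev_cycle, curr_cycle)
```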




&lt;h2&gt;
  
  
  Stage-gate result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0jedqsq4ivhh5du6an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0jedqsq4ivhh5du6an.png" alt="Cross-cycle comparison" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 ended with all five stage gates passing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;G1 parity&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G2 reproducibility&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G3 cross-experiment compare&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G4 governance traceability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G5 extension safety&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Final state:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;overall status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;anchor mode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;legacy_replay&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;first failed gate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;null&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;diagnostic hold&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Track-A current regeneration&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the important nuance:&lt;/p&gt;

&lt;p&gt;The experiment passed with a retained diagnostic hold.&lt;/p&gt;

&lt;p&gt;That is not a contradiction. It is the point of the control system.&lt;/p&gt;

&lt;p&gt;The accepted anchor path was allowed to proceed. The current regeneration path was not. The remediation target was moved to EXP-035.&lt;/p&gt;

&lt;p&gt;That separation is the actual proof EXP-034 provides: not that every branch became stable, but that instability was not allowed to contaminate acceptance.&lt;/p&gt;
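&lt;p&gt;A final-state record like the one above can be assembled so that diagnostic holds travel beside the verdict instead of inside it. This is a sketch under that assumption. The gate keys and the &lt;code&gt;summarize&lt;/code&gt; function are hypothetical names derived from the tables in this section.&lt;/p&gt;

```python
GATE_ORDER = [
    "G1_parity",
    "G2_reproducibility",
    "G3_cross_experiment_compare",
    "G4_governance_traceability",
    "G5_extension_safety",
]


def summarize(gates: dict[str, bool], anchor_mode: str, holds: list[str]) -> dict:
    """Build the final-state record from ordered gate results."""
    # Gates are checked in order; the first failure is the one reported.
    first_failed = next((g for g in GATE_ORDER if not gates[g]), None)
    return {
        "overall_status": "PASS" if first_failed is None else "FAIL",
        "anchor_mode": anchor_mode,
        "first_failed_gate": first_failed,
        # Holds are reported beside the verdict, never folded into it.
        "diagnostic_hold": list(holds),
    }


final_state = summarize(
    {g: True for g in GATE_ORDER},
    anchor_mode="legacy_replay",
    holds=["Track-A current regeneration"],
)
```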




&lt;h2&gt;
  
  
  What EXP-034 actually showed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69c63nfb2kt1gso8awp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69c63nfb2kt1gso8awp2.png" alt="What EXP-034 actually showed" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 did not show that the entire pipeline is now stable.&lt;/p&gt;

&lt;p&gt;It showed something narrower and more useful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A method-locked Bio-AI governance pipeline can admit modal expansion, AlphaFold EBI observer wiring, and AG-live measurement without losing its accepted PASS/BLOCK baseline — while keeping the unstable regeneration path out of acceptance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track-C sharpened that conclusion.&lt;/p&gt;

&lt;p&gt;AG-live entered.&lt;br&gt;&lt;br&gt;
Metrics moved slightly.&lt;br&gt;&lt;br&gt;
The verdict did not change.&lt;br&gt;&lt;br&gt;
Dangerous false-pass did not appear.&lt;br&gt;&lt;br&gt;
Conservative over-blocking remained.&lt;/p&gt;

&lt;p&gt;That is not a clean success story.&lt;/p&gt;

&lt;p&gt;It is a governed result.&lt;/p&gt;


&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rj4k1d45y6vyolfgksu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rj4k1d45y6vyolfgksu.png" alt="The Mark of a Mature AI Pipeline" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage-gated experimentation is not just about getting a result.&lt;/p&gt;

&lt;p&gt;It is about deciding whether the result should be allowed to count toward acceptance.&lt;/p&gt;

&lt;p&gt;In EXP-034, the answer was:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GO   for the anchored expansion path
HOLD for current regeneration
NEXT for EXP-035 remediation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That may sound less dramatic than a clean success story.&lt;/p&gt;

&lt;p&gt;But in governance work, that is exactly the point.&lt;/p&gt;

&lt;p&gt;A mature AI pipeline is not the one that claims everything passed.&lt;/p&gt;

&lt;p&gt;It is the one that can say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This path passed.&lt;br&gt;&lt;br&gt;
This path did not.&lt;br&gt;&lt;br&gt;
And we did not mix them.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>reproducibility</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>FLAMEHAVEN FileSearch: Why This RAG Engine Feels Different from the Usual Stack</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:27:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/flamehaven-filesearch-why-this-rag-engine-feels-different-from-the-usual-stack-e83</link>
      <guid>https://dev.to/flamehaven01/flamehaven-filesearch-why-this-rag-engine-feels-different-from-the-usual-stack-e83</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa8waxadewljs6a47aqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa8waxadewljs6a47aqw.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FLAMEHAVEN FileSearch: Why This RAG Engine Feels Different from the Usual Stack
&lt;/h2&gt;

&lt;p&gt;RAG is no longer an exotic idea.&lt;/p&gt;

&lt;p&gt;At this point, most developers have seen the familiar stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parser&lt;/li&gt;
&lt;li&gt;chunker&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;vector store&lt;/li&gt;
&lt;li&gt;LLM&lt;/li&gt;
&lt;li&gt;framework wrapper&lt;/li&gt;
&lt;li&gt;demo query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not the interesting part anymore.&lt;/p&gt;

&lt;p&gt;The interesting part is what happens after the diagram:&lt;br&gt;
how much infrastructure the stack quietly demands, how much of the retrieval path is actually auditable, how much of the system is still mechanical rather than opaque, and how much operational tax the user is forced to absorb just to get a search engine running.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;FLAMEHAVEN FileSearch&lt;/strong&gt; gets more interesting than the usual "another RAG repo" framing.&lt;/p&gt;

&lt;p&gt;This is not a feature announcement. It is a technical look at what the project is actually doing differently.&lt;/p&gt;


&lt;h2&gt;
  
  
  The real problem with many RAG stacks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo70b5r5bwbjlsh76s0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo70b5r5bwbjlsh76s0.png" alt="Most RAG systems are assembly instructions" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of RAG systems are not products. They are assembly instructions.&lt;/p&gt;

&lt;p&gt;They give you flexibility, but they also leave you responsible for stitching together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file parsing&lt;/li&gt;
&lt;li&gt;chunking strategy&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;lexical retrieval&lt;/li&gt;
&lt;li&gt;semantic retrieval&lt;/li&gt;
&lt;li&gt;answer generation&lt;/li&gt;
&lt;li&gt;attribution&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;auth&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;caching&lt;/li&gt;
&lt;li&gt;deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is fine if you want a blank canvas.&lt;/p&gt;

&lt;p&gt;It is less fine if what you actually want is a document search engine that can be deployed without turning the setup itself into a second project.&lt;/p&gt;

&lt;p&gt;That is the first reason this repo feels different: it is trying to compress more of that surface area into one codebase.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is technically different here
&lt;/h2&gt;
&lt;h2&gt;
  
  
  1) Hybrid retrieval is treated as the baseline, not the upgrade path
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnojxzw992pkl5t22sme2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnojxzw992pkl5t22sme2.png" alt="compressing the stack" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of RAG repos still behave as if semantic retrieval is the main event and lexical matching is an optional add-on.&lt;/p&gt;

&lt;p&gt;That is backwards for real document systems.&lt;/p&gt;

&lt;p&gt;FLAMEHAVEN FileSearch builds around three explicit modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keyword&lt;/li&gt;
&lt;li&gt;semantic&lt;/li&gt;
&lt;li&gt;hybrid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part is the hybrid path itself.&lt;/p&gt;

&lt;p&gt;The retrieval stack combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;Korean + English tokenizer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;lazy per-store BM25 rebuild path&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters more than it sounds. The BM25 index is not eagerly rebuilt on every upload. Instead, it is marked dirty (&lt;code&gt;_bm25_dirty&lt;/code&gt;) and rebuilt on the first hybrid search after a mutation. That is a very practical decision: it keeps ingestion cheaper without pretending indexing is free.&lt;/p&gt;
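&lt;p&gt;The dirty-flag pattern can be sketched in a few lines. Only the &lt;code&gt;_bm25_dirty&lt;/code&gt; name comes from the project. The class, its methods, and the plain inverted index standing in for BM25 are simplifying assumptions.&lt;/p&gt;

```python
class LazyIndexedStore:
    """Store that defers index rebuilds until the next search."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}
        self._index: dict[str, set[str]] = {}
        self._bm25_dirty = False
        self.rebuild_count = 0  # exposed only to make the laziness visible

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text
        # Cheap at upload time: mark the index stale, do no indexing work.
        self._bm25_dirty = True

    def _ensure_index(self) -> None:
        if not self._bm25_dirty:
            return
        # Stand-in for a BM25 build: a term-to-documents inverted index.
        self._index = {}
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                self._index.setdefault(term, set()).add(doc_id)
        self._bm25_dirty = False
        self.rebuild_count += 1

    def search(self, term: str) -> list[str]:
        self._ensure_index()
        return sorted(self._index.get(term.lower(), set()))
```

&lt;p&gt;Uploading many files and then searching once triggers a single rebuild, so the indexing cost is paid per search-after-mutation rather than per upload.&lt;/p&gt;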

&lt;p&gt;This is one of the deeper differences from many vector-first RAG demos: the system does not assume semantic retrieval should dominate exact-match behavior. It assumes production search needs both.&lt;/p&gt;
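&lt;p&gt;Reciprocal Rank Fusion itself is compact enough to show in full. The sketch below is the standard formulation, in which each document scores the sum of &lt;code&gt;1 / (k + rank)&lt;/code&gt; across the input rankings. The constant &lt;code&gt;k = 60&lt;/code&gt; is the commonly used default and is an assumption here, not a value confirmed for this repo.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions contribute more; k damps the top-rank advantage.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["a", "b", "c"]
semantic_hits = ["b", "c", "a"]
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

&lt;p&gt;Because RRF uses only ranks, it can combine BM25 scores and embedding similarities without normalizing either scale.&lt;/p&gt;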


&lt;h2&gt;
  
  
  2) The indexing model is not just "document in, chunks out"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qk2weat5bt8415ovnzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qk2weat5bt8415ovnzc.png" alt="The KnowledgeAtom hierarchy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second meaningful difference is the indexing granularity.&lt;/p&gt;

&lt;p&gt;This repo introduces a &lt;strong&gt;KnowledgeAtom&lt;/strong&gt; layer: a two-level indexing model with&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file-level documents&lt;/li&gt;
&lt;li&gt;chunk-level atoms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those chunk atoms are not anonymous fragments. They carry stable fragment URIs of the form:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local://store/encoded_path#c0001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That design solves two very common problems at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;precision retrieval&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;stable attribution&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file-level object remains available, but the system can also retrieve chunk-level units directly. That reduces the usual gap between "the document matched" and "the relevant passage was actually isolated."&lt;/p&gt;

&lt;p&gt;The URI choice matters too. A lot of local-first search code still uses basename-style references that collide the moment two files share a name. This repo moves to a reversible, quoted absolute-path-based URI namespace (&lt;code&gt;urllib.parse.quote(abs_path, safe='')&lt;/code&gt;), which is much less fragile.&lt;/p&gt;
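&lt;p&gt;To make the collision argument concrete, here is a minimal sketch of building and reversing such a URI with &lt;code&gt;urllib.parse.quote&lt;/code&gt;. The function names are hypothetical. The four-digit zero padding is inferred from the &lt;code&gt;#c0001&lt;/code&gt; example, and the sketch assumes store names contain no slash.&lt;/p&gt;

```python
from urllib.parse import quote, unquote


def chunk_uri(store: str, abs_path: str, chunk_index: int) -> str:
    """Build a fragment URI from a store name, absolute path, and chunk index."""
    # safe="" also encodes "/", so the whole path becomes one URI segment.
    encoded = quote(abs_path, safe="")
    return f"local://{store}/{encoded}#c{chunk_index:04d}"


def parse_chunk_uri(uri: str) -> tuple[str, str, int]:
    """Reverse chunk_uri back into (store, abs_path, chunk_index)."""
    body, fragment = uri.rsplit("#c", 1)
    store, encoded = body[len("local://"):].split("/", 1)
    return store, unquote(encoded), int(fragment)


# Two files with the same basename no longer collide.
uri_a = chunk_uri("docs", "/data/a/report.pdf", 1)
uri_b = chunk_uri("docs", "/data/b/report.pdf", 1)
```

&lt;p&gt;Because the encoding is reversible, an attribution shown to a user can always be resolved back to the exact file and chunk.&lt;/p&gt;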

&lt;p&gt;That is not marketing polish. That is retrieval hygiene.&lt;/p&gt;




&lt;h2&gt;
  
  
  3) The chunking path is internal, structured, and mechanical
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F383b95htzfjgqjf66bt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F383b95htzfjgqjf66bt7.png" alt="Internal two-pass chunking" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another place where this codebase differs is that it does not outsource the core text pipeline by default.&lt;/p&gt;

&lt;p&gt;Instead of treating chunking as a thin wrapper around an external library, it implements an internal text chunker with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heading-boundary splitting&lt;/li&gt;
&lt;li&gt;paragraph splitting&lt;/li&gt;
&lt;li&gt;sentence fallback for oversized blocks&lt;/li&gt;
&lt;li&gt;undersized chunk merging (default minimum: 64 tokens)&lt;/li&gt;
&lt;li&gt;token-aware chunk sizing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chunking is two-pass under the hood. The structure-aware &lt;code&gt;TextChunker&lt;/code&gt; handles the document splits above. On top of that, &lt;code&gt;KnowledgeAtom&lt;/code&gt; applies a second windowing pass when generating chunk embeddings: 800-character windows with a 120-character overlap, dropping any fragment shorter than 80 characters. The two paths are separate by design: &lt;code&gt;TextChunker&lt;/code&gt; is responsible for semantic structure, &lt;code&gt;KnowledgeAtom&lt;/code&gt; for granular embedding units.&lt;/p&gt;
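&lt;p&gt;The second pass is easy to picture as a plain sliding window. The sketch below is illustrative rather than the repo's implementation: the function name is hypothetical, and the exact handling of the final fragment is an assumption. Only the three numbers (800, 120, 80) come from the description above.&lt;/p&gt;

```python
def window_chunks(text, size=800, overlap=120, min_chars=80):
    """Split text into overlapping character windows (illustrative sketch)."""
    # Each window starts `size - overlap` characters after the previous one.
    step = size - overlap
    windows = [text[start:start + size] for start in range(0, len(text), step)]
    # Keep a window only if its length is not in 0..min_chars-1,
    # which drops fragments shorter than min_chars.
    return [w for w in windows if len(w) not in range(min_chars)]


sample = "".join(str(i % 10) for i in range(1400))
chunks = window_chunks(sample)
print([len(c) for c in chunks])  # [800, 720]
```

&lt;p&gt;With a 1,400-character input the windows start at offsets 0, 680, and 1360. The last one is only 40 characters long, so it is discarded, and the first two share a 120-character overlap.&lt;/p&gt;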

&lt;p&gt;The engine also ships a &lt;code&gt;ContextExtractor&lt;/code&gt; — a sliding-window utility that can enrich each chunk with text from its neighboring chunks before retrieval. It is fully tested, but it is not yet wired into the default ingestion path. It is available for downstream pipeline extension.&lt;/p&gt;

&lt;p&gt;So the pipeline architecture is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document
  → structure-aware split (TextChunker)
  → chunk atom embedding (KnowledgeAtom, 800-char windows)
  → multi-level indexing
  → retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is a better-shaped pipeline for document search than a naive chunk list.&lt;/p&gt;




&lt;h2&gt;
  
  
  4) The vector path is trying to remove operational weight, not add it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz03a8czoace2fhqm5t6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz03a8czoace2fhqm5t6o.png" alt="zero-dependency vectorization" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is probably the most unusual architectural choice in the repo.&lt;/p&gt;

&lt;p&gt;Instead of anchoring everything around a heavyweight embedding model stack, the project uses &lt;strong&gt;Gravitas Vectorizer v2.0&lt;/strong&gt;, a deterministic vectorization path built on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hybrid feature extraction (word tokens + character n-grams)&lt;/li&gt;
&lt;li&gt;signed feature hashing for collision mitigation&lt;/li&gt;
&lt;li&gt;SHA-256 based deterministic output&lt;/li&gt;
&lt;li&gt;no torch, no transformers, no model download&lt;/li&gt;
&lt;/ul&gt;
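&lt;p&gt;For readers unfamiliar with the technique, the following is a generic sketch of signed feature hashing over word tokens and character n-grams, using SHA-256 so the output is identical across runs and machines. It is not the Gravitas Vectorizer itself: the function names, the 256-dimension default, the trigram size, and the way the sign bit is chosen are all assumptions made for illustration.&lt;/p&gt;

```python
import hashlib
import math


def extract_features(text, n=3):
    """Word tokens plus character n-grams, prefixed so the two kinds stay distinct."""
    words = text.lower().split()
    feats = [f"w:{w}" for w in words]
    for w in words:
        # Words shorter than n contribute a single short gram.
        feats += [f"c:{w[i:i + n]}" for i in range(max(len(w) - n + 1, 1))]
    return feats


def hashed_vector(text, dim=256, n=3):
    """Map text to a fixed-size, L2-normalized vector via signed feature hashing."""
    vec = [0.0] * dim
    for feat in extract_features(text, n):
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        # First 8 bytes pick the bucket; a separate byte picks the sign, so
        # colliding features tend to cancel instead of always adding up.
        bucket = int.from_bytes(digest[:8], "big") % dim
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


v1 = hashed_vector("hybrid retrieval pipeline")
v2 = hashed_vector("hybrid retrieval pipeline")
print(v1 == v2, round(cosine(v1, v2), 6))
```

&lt;p&gt;Nothing here needs a model file or an ML framework, which is the property the list above is pointing at: the vector is a pure function of the input text.&lt;/p&gt;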

&lt;p&gt;The trade-off is obvious: this vectorizer is not trying to compete with large foundation-model embedding backends on retrieval leaderboards.&lt;/p&gt;

&lt;p&gt;That is not the point.&lt;/p&gt;

&lt;p&gt;The point is that it makes the semantic path much cheaper to deploy, easier to reason about, and viable in environments where "just load another model" is operationally the wrong answer.&lt;/p&gt;

&lt;p&gt;Technically, that shows up in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic vector generation&lt;/li&gt;
&lt;li&gt;cold start under 1ms&lt;/li&gt;
&lt;li&gt;no ML framework dependency in the core vector path&lt;/li&gt;
&lt;li&gt;optional NumPy acceleration with pure-Python fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the semantic layer is being treated as infrastructure, not as a permanent excuse to expand infrastructure.&lt;/p&gt;

&lt;p&gt;That is rare.&lt;/p&gt;




&lt;h2&gt;
  
  
  5) The repo is explicit about local-first and multi-provider execution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a0kn73rqdv8bv1fja40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a0kn73rqdv8bv1fja40.png" alt="Architecture and provider abstraction" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of document search systems quietly assume one provider path.&lt;/p&gt;

&lt;p&gt;This repo does not.&lt;/p&gt;

&lt;p&gt;The provider layer supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;Anthropic&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;OpenAI-compatible endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters for two reasons.&lt;/p&gt;

&lt;p&gt;First, it keeps the system from being hardwired to one hosted model assumption.&lt;/p&gt;

&lt;p&gt;Second, it means the retrieval stack and the answer stack are not collapsed into the same dependency decision.&lt;/p&gt;

&lt;p&gt;That is an important architectural separation.&lt;/p&gt;

&lt;p&gt;For non-Gemini providers, the code takes a provider-RAG route: local semantic retrieval first, then prompt construction, then model answer generation. That is a much more honest design than pretending all providers support the same retrieval semantics natively.&lt;/p&gt;
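&lt;p&gt;The three steps of that provider-RAG route can be sketched in a few lines. Everything named here is hypothetical: the function, the shape of a retrieval hit, and the stand-in &lt;code&gt;retrieve&lt;/code&gt; and &lt;code&gt;generate&lt;/code&gt; callables are placeholders for whatever the repo and the chosen provider actually expose.&lt;/p&gt;

```python
def provider_rag_answer(question, retrieve, generate, top_k=3):
    """Local retrieval first, then prompt construction, then model generation."""
    # Step 1: local semantic retrieval, independent of the answering provider.
    hits = retrieve(question, top_k)
    # Step 2: build a prompt that carries each chunk with its URI for attribution.
    context = "\n\n".join(f"[{hit['uri']}] {hit['text']}" for hit in hits)
    prompt = (
        "Answer using only the context below and cite the bracketed URIs.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # Step 3: hand the finished prompt to whichever provider is configured.
    return {"answer": generate(prompt), "sources": [hit["uri"] for hit in hits]}


# Stand-ins so the sketch runs without any index or network call.
def fake_retrieve(question, top_k):
    corpus = [{"uri": "local://store/doc#c0001", "text": "Chunks carry stable URIs."}]
    return corpus[:top_k]


def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


result = provider_rag_answer("How are chunks identified?", fake_retrieve, fake_generate)
print(result["sources"])
```

&lt;p&gt;Because &lt;code&gt;generate&lt;/code&gt; only ever sees a finished prompt, swapping the provider does not change how retrieval behaves.&lt;/p&gt;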

&lt;p&gt;The local Ollama path is especially relevant. Not because "local" is fashionable, but because self-hosted document search is often most attractive precisely when data boundary control matters more than marginal model quality gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  6) The codebase has been refactored toward narrower responsibilities
&lt;/h2&gt;

&lt;p&gt;One of the easiest ways to tell whether a repo is becoming more operationally serious is to look at whether the core orchestrator is shrinking or swelling.&lt;/p&gt;

&lt;p&gt;Here, the architecture moved in the right direction.&lt;/p&gt;

&lt;p&gt;The central &lt;code&gt;core.py&lt;/code&gt; was split into focused mixins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;IngestMixin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LocalSearchMixin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CloudSearchMixin&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not just aesthetic cleanup.&lt;/p&gt;

&lt;p&gt;It clarifies the system boundary between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion&lt;/li&gt;
&lt;li&gt;local retrieval/orchestration&lt;/li&gt;
&lt;li&gt;provider-backed answer generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same pattern appears elsewhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BackendRegistry&lt;/code&gt; maps file extensions to parser classes via &lt;code&gt;register()&lt;/code&gt; — new formats plug in without modifying existing dispatch logic&lt;/li&gt;
&lt;li&gt;duplicate helper blocks were pulled out of cloud search paths&lt;/li&gt;
&lt;li&gt;file parsing was reduced to dispatch instead of a single giant extractor module&lt;/li&gt;
&lt;/ul&gt;
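&lt;p&gt;As a rough illustration of the registry idea, the sketch below maps extensions to parser classes through a &lt;code&gt;register()&lt;/code&gt; call. Only the class name &lt;code&gt;BackendRegistry&lt;/code&gt; and the existence of &lt;code&gt;register()&lt;/code&gt; come from the repo description; the method signatures, &lt;code&gt;parser_for()&lt;/code&gt;, and the parser classes are assumptions.&lt;/p&gt;

```python
import os


class BackendRegistry:
    """Maps file extensions to parser classes (illustrative sketch)."""

    def __init__(self):
        self._parsers = {}

    def register(self, extension, parser_cls):
        # Extensions are normalized so ".TXT" and ".txt" resolve identically.
        self._parsers[extension.lower()] = parser_cls

    def parser_for(self, path):
        # Hypothetical lookup method: dispatch purely on the file extension.
        extension = os.path.splitext(path)[1].lower()
        if extension not in self._parsers:
            raise ValueError(f"no parser registered for {extension!r}")
        return self._parsers[extension]()


class TextParser:
    def parse(self, raw_bytes):
        return raw_bytes.decode("utf-8")


class UpperCaseParser(TextParser):
    """Stand-in for a format added later; the dispatch code is untouched."""

    def parse(self, raw_bytes):
        return super().parse(raw_bytes).upper()


registry = BackendRegistry()
registry.register(".txt", TextParser)
registry.register(".shout", UpperCaseParser)
print(registry.parser_for("notes.TXT").parse(b"hello"))
```

&lt;p&gt;Adding a format is one &lt;code&gt;register()&lt;/code&gt; call plus a new class, which is what keeps parsing at the level of dispatch rather than one growing extractor module.&lt;/p&gt;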

&lt;p&gt;These changes do not make a flashy screenshot.&lt;/p&gt;

&lt;p&gt;They do make the code easier to maintain without quietly reintroducing the same complexity elsewhere.&lt;/p&gt;

&lt;p&gt;That is a real engineering improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark snapshot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh872w7a8wjwp9zw8p060.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh872w7a8wjwp9zw8p060.png" alt="Benchmark snapshot" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;System profile&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitas Vectorizer v2.0 (deterministic DSP, zero ML deps)&lt;/li&gt;
&lt;li&gt;ChronosGrid vector backend with quantized storage (int8)&lt;/li&gt;
&lt;li&gt;BM25 + RRF hybrid retrieval&lt;/li&gt;
&lt;li&gt;Local / pgvector backends&lt;/li&gt;
&lt;li&gt;Redis cache optional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Documented performance figures&lt;/strong&gt; (Docker, Apple M1, 500 PDFs ~2GB)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector generation: &lt;code&gt;&amp;lt;1ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search, cache hit: &lt;code&gt;9ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search, cache miss (includes Gemini API round-trip): &lt;code&gt;1,250ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Batch search (10 queries, parallel): &lt;code&gt;2,500ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Upload, 50MB file with indexing: &lt;code&gt;3,200ms&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What matters more than the numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cache-hit figure reflects the full path when semantic and lexical retrieval are served from warm indexes.&lt;/p&gt;

&lt;p&gt;The cache-miss figure is dominated by the Gemini API round-trip, not local retrieval.&lt;/p&gt;

&lt;p&gt;The performance story here is not just raw speed. It is that the repo achieves low-latency local retrieval by reducing dependency weight and simplifying the vector path, rather than by hiding heavy infrastructure behind abstraction.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A comparison that is actually worth making
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjehh80hs06t3tvd3ffz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjehh80hs06t3tvd3ffz.png" alt="A comparison that is actually worth making" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The wrong comparison is:&lt;/p&gt;

&lt;p&gt;"Is this the best RAG framework?"&lt;/p&gt;

&lt;p&gt;That is too vague to be useful.&lt;/p&gt;

&lt;p&gt;The better comparison is architectural.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Main idea&lt;/th&gt;
&lt;th&gt;Common weakness&lt;/th&gt;
&lt;th&gt;Why this repo differs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Framework-only RAG stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compose your own parser, retriever, vector store, and generator&lt;/td&gt;
&lt;td&gt;High assembly burden; a lot of operational logic is still your job&lt;/td&gt;
&lt;td&gt;This repo packages more of the retrieval, ingestion, attribution, and serving path together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted RAG / SaaS search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest time to first demo&lt;/td&gt;
&lt;td&gt;External data boundary, vendor coupling, recurring service assumptions&lt;/td&gt;
&lt;td&gt;This repo keeps self-hosted and local-first execution as first-class options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector-first DIY pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic retrieval drives everything&lt;/td&gt;
&lt;td&gt;Lexical exactness and attribution often become second-class&lt;/td&gt;
&lt;td&gt;This repo treats hybrid retrieval as the practical default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FLAMEHAVEN FileSearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval + ingestion + serving compressed into one engine&lt;/td&gt;
&lt;td&gt;Less of a blank canvas than a raw framework stack&lt;/td&gt;
&lt;td&gt;Better fit for teams that want a mechanical, deployable search base instead of another assembly project&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the actual niche.&lt;/p&gt;

&lt;p&gt;Not "RAG but louder."&lt;/p&gt;

&lt;p&gt;More like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG with a lower operational tax.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;The RAG field has cooled compared to its peak hype cycle.&lt;/p&gt;

&lt;p&gt;That is not a bad thing.&lt;/p&gt;

&lt;p&gt;It means the novelty premium is lower, and the real questions are clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it be deployed?&lt;/li&gt;
&lt;li&gt;Can it run without a side quest in infrastructure?&lt;/li&gt;
&lt;li&gt;Can it keep data local?&lt;/li&gt;
&lt;li&gt;Can it support both lexical precision and semantic recall?&lt;/li&gt;
&lt;li&gt;Can its retrieval behavior be inspected rather than mythologized?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why a repo like this becomes more interesting now than it would have been in the most hype-saturated phase of the RAG wave.&lt;/p&gt;

&lt;p&gt;When everything is new, wrappers are enough.&lt;/p&gt;

&lt;p&gt;When the field matures, the differentiator becomes whether the system removes real engineering burden.&lt;/p&gt;

&lt;p&gt;This one is at least trying to solve that problem directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is special about the code, specifically
&lt;/h2&gt;

&lt;p&gt;If I had to reduce the repo's technical distinctiveness to a short list, it would be this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25 + RRF is built in&lt;/strong&gt;, not bolted on later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KnowledgeAtom indexing&lt;/strong&gt; gives the system a more precise retrieval unit than document-only search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable chunk URIs&lt;/strong&gt; (&lt;code&gt;local://store/enc_path#c0001&lt;/code&gt;) make attribution less fragile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-pass chunking&lt;/strong&gt; — structure-aware TextChunker + char-window KnowledgeAtom embedding pass — keeps the text pipeline mechanical and inspectable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gravitas Vectorizer v2.0&lt;/strong&gt; reduces startup cost and dependency sprawl (zero torch/transformers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider abstraction&lt;/strong&gt; separates retrieval architecture from model vendor choice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixin segmentation and BackendRegistry pattern&lt;/strong&gt; show a codebase moving away from monolithic orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why this repo feels different from the usual RAG stack.&lt;/p&gt;

&lt;p&gt;Not because it claims magic.&lt;/p&gt;

&lt;p&gt;Because it makes several practical decisions that many RAG repos defer, externalize, or ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest boundary
&lt;/h2&gt;

&lt;p&gt;This is not a claim that the repo solves everything.&lt;/p&gt;

&lt;p&gt;It does not.&lt;/p&gt;

&lt;p&gt;And the codebase itself shows that.&lt;/p&gt;

&lt;p&gt;Static inspection still flags complexity hotspots in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;api.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;admin_routes.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;eval_self.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chronos_grid.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also components that exist in the engine but are not yet connected to the default pipeline — &lt;code&gt;ContextExtractor&lt;/code&gt; being the clearest example. The architecture is there; the wiring is not yet complete everywhere.&lt;/p&gt;

&lt;p&gt;That is actually a good thing for a write-up like this, because it keeps the claim honest.&lt;/p&gt;

&lt;p&gt;The interesting story here is not "perfect codebase."&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a repo with a real architectural point of view, a recognizably lower dependency burden, and code decisions that are meaningfully different from the usual vector-wrapper pattern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is a much stronger claim than vague "enterprise-grade RAG" language.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final take
&lt;/h2&gt;

&lt;p&gt;FLAMEHAVEN FileSearch is interesting because it is not merely trying to make retrieval work.&lt;/p&gt;

&lt;p&gt;It is trying to make retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more mechanical&lt;/li&gt;
&lt;li&gt;more local&lt;/li&gt;
&lt;li&gt;more attributable&lt;/li&gt;
&lt;li&gt;less dependency-heavy&lt;/li&gt;
&lt;li&gt;and less painful to deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a better differentiator than "supports RAG."&lt;/p&gt;

&lt;p&gt;Most repositories do.&lt;/p&gt;

&lt;p&gt;The more important question now is whether they reduce the actual engineering burden around RAG, or just rearrange it.&lt;/p&gt;

&lt;p&gt;This repo is interesting because it appears to reduce some of it in code.&lt;/p&gt;

&lt;p&gt;And in a field where many projects now converge into the same parser + vector store + model + wrapper pattern, that is a difference worth paying attention to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/Flamehaven-Filesearch" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/Flamehaven-Filesearch&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI-SLOP Detector v3.5.0 — Every Claim, Verified Against Source Code</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:19:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</guid>
      <description>&lt;p&gt;I published a LinkedIn post about AI-SLOP Detector's self-calibration system and download numbers. Someone asked the reasonable question: &lt;strong&gt;"Can you actually back that up?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Here's the source.&lt;/p&gt;

&lt;p&gt;This isn't a feature announcement. It's a line-by-line audit of seven claims against the actual codebase. Every VERDICT links to a real file and real line numbers. The repo is public — go check it yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What was claimed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every scan is recorded&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeat scans become calibration signal&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Updates only when signal is strong enough&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visible policy artifact (&lt;code&gt;.slopconfig.yaml&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit numeric limits govern calibration&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detects empty/stub/phantom/disconnected code&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~1.4K downloads last week&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All seven. No fabrications. No inflated numbers. Here's the proof.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 1: "Every scan is recorded"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 116–180&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-invoked on every CLI run. The only opt-out is &lt;code&gt;--no-history&lt;/code&gt;. Each scan writes to SQLite at &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt; and stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deficit_score&lt;/code&gt;, &lt;code&gt;ldr_score&lt;/code&gt;, &lt;code&gt;inflation_score&lt;/code&gt;, &lt;code&gt;ddc_usage_ratio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n_critical_patterns&lt;/code&gt;, &lt;code&gt;fired_rules&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git_commit&lt;/code&gt;, &lt;code&gt;git_branch&lt;/code&gt;, &lt;code&gt;project_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Schema is now at v5, auto-migrated on startup through every release from v2.9.0 to v3.5.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The record() call is real. The schema is versioned. The behavior is not optional.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 2: "Every re-scan becomes signal"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 221–246&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_files_with_multiple_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Only files scanned &amp;gt;= 2 times count as calibration events
&lt;/span&gt;    &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="n"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;HAVING&lt;/span&gt; &lt;span class="nc"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 301–309&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_load_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;by_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_group_runs_by_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single-scan files produce no calibration events. Only repeat scans generate &lt;code&gt;improvement&lt;/code&gt; or &lt;code&gt;fp_candidate&lt;/code&gt; labels. The threshold is hardcoded in SQL, not assumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The repeat-scan requirement is enforced at the query level, not in documentation.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 3: "Updates only when the signal is strong enough"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54 (constants) and 251–262 (enforcement)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;   &lt;span class="c1"&gt;# min gap between #1 and #2 candidate
&lt;/span&gt;&lt;span class="n"&gt;MIN_IMPROVEMENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="c1"&gt;# improvement events required
&lt;/span&gt;&lt;span class="n"&gt;MIN_FP_CANDIDATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;# fp_candidate events required
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 1 — confidence gap check (line 251):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence gap &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;lt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Candidates are too close — need more history data for reliable calibration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# NO UPDATE APPLIED
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 2 — score delta check (line 262):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;winner_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# also does not apply
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two independent guards. Both must pass before any weight update applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Ambiguous signal is rejected twice before touching configuration.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 4: "Leaves behind a visible policy every time it changes"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, docstring, lines 17–18&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Return CalibrationResult&lt;span class="p"&gt;;&lt;/span&gt; optionally write to .slopconfig.yaml via &lt;span class="nt"&gt;--apply-calibration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;--apply-calibration&lt;/code&gt; is passed and &lt;code&gt;status == "ok"&lt;/code&gt;, optimal weights are written to &lt;code&gt;.slopconfig.yaml&lt;/code&gt;. Plain-text YAML. Human-readable. Git-versionable. Every calibration change is a diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The policy artifact is explicit. You can &lt;code&gt;git blame&lt;/code&gt; it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 5: "Explicit limits govern calibration"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;             &lt;span class="c1"&gt;# minimum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;             &lt;span class="c1"&gt;# maximum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_PURITY_WEIGHT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# purity ceiling
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_TOLERANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;  &lt;span class="c1"&gt;# max per-dimension deviation from domain anchor
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_DRIFT_LIMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# warn when optimal weight drifts this far
&lt;/span&gt;&lt;span class="n"&gt;GRID_STEP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;             &lt;span class="c1"&gt;# 0.05 increment resolution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No ML model. No learned bounds. Every constraint is a named constant with a comment stating what it limits. The calibration space is a bounded grid (&lt;code&gt;GRID_STEP = 20&lt;/code&gt; gives 0.05 increments), not an open optimization landscape.&lt;/p&gt;
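&lt;p&gt;To make "bounded grid" concrete, here is a rough sketch that enumerates every candidate weight vector under the per-dimension bounds. The three constants are copied from the listing above; the function name, the sum-to-one constraint, and the choice of three dimensions are assumptions for illustration, not the project's actual search code. The purity ceiling and domain tolerances are left out.&lt;/p&gt;

```python
from itertools import product

# Constants copied from the article; everything else is hypothetical.
MIN_W = 0.10
MAX_W = 0.65
GRID_STEP = 20  # 1 / 20 = 0.05 increments


def bounded_weight_grid(n_dims=3):
    """Enumerate weight vectors on the 0.05 grid that sum to 1 and stay in bounds."""
    # Work in integer grid units to avoid floating-point drift in the sum check.
    lo = round(MIN_W * GRID_STEP)  # 2 grid units
    hi = round(MAX_W * GRID_STEP)  # 13 grid units
    units = range(lo, hi + 1)
    return [
        tuple(u / GRID_STEP for u in combo)
        for combo in product(units, repeat=n_dims)
        if sum(combo) == GRID_STEP
    ]


grid = bounded_weight_grid()
```

&lt;p&gt;Under these assumptions the search space is a finite list of 102 candidates, small enough to evaluate exhaustively and to audit by hand.&lt;/p&gt;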

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Every limit is auditable. Nothing is opaque.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 6: "Detects empty implementations, phantom dependencies, disconnected pipelines"
&lt;/h2&gt;

&lt;p&gt;These are three defect patterns that AI code generation produces at scale. Each has a dedicated module; the table also lists a fourth class, function clone clusters.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defect class&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Empty/stub functions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ldr.py&lt;/code&gt; — LDRCalculator detects &lt;code&gt;pass&lt;/code&gt;, &lt;code&gt;...&lt;/code&gt;, &lt;code&gt;raise NotImplementedError&lt;/code&gt;, &lt;code&gt;TODO&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phantom/unused imports&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/hallucination_deps.py&lt;/code&gt; — AST-based import vs usage analysis via &lt;code&gt;HallucinatedDependency&lt;/code&gt; dataclass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disconnected pipelines&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ddc.py&lt;/code&gt; — DDC (Declared Dependency Completeness) usage ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function clone clusters&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/patterns/python_advanced.py&lt;/code&gt; — Jensen-Shannon Divergence on 30-dim AST histograms, JSD &amp;lt; 0.05 = clone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The clone detection is worth noting. JSD on AST histograms catches structural duplication that string similarity misses entirely. LLMs produce a lot of this — same function logic, slightly renamed.&lt;/p&gt;
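&lt;p&gt;The idea can be sketched in a few lines. This is an illustration only: it uses a plain histogram of AST node types rather than the project's 30-dimension feature, and it assumes base-2 logarithms for the divergence. The function names are hypothetical; the 0.05 threshold is the one quoted in the table.&lt;/p&gt;

```python
import ast
import math
from collections import Counter

CLONE_THRESHOLD = 0.05  # threshold quoted in the table above


def node_histogram(source):
    """Normalized histogram of AST node types (stand-in for the 30-dim feature)."""
    counts = Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Only keys present in a contribute; m[k] is positive for all of them.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Same logic, different identifiers: string similarity is low, structure is identical.
a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def add_all(items):\n    acc = 0\n    for it in items:\n        acc += it\n    return acc\n"
c = "def double(x):\n    return x * 2\n"

divergence_ab = jsd(node_histogram(a), node_histogram(b))
divergence_ac = jsd(node_histogram(a), node_histogram(c))
is_clone = CLONE_THRESHOLD > divergence_ab
```

&lt;p&gt;Because renaming identifiers does not change node types, the two renamed functions produce identical histograms and a divergence of zero, while the structurally different function does not.&lt;/p&gt;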

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Each defect class has a named module with a working implementation.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 7: "~1.4K downloads in the past week"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: pypistats.org API (&lt;code&gt;mirrors=false&lt;/code&gt;), queried 2026-04-15&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;last_week&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;1,407  (mirrors excluded — actual pip install traffic)&lt;/span&gt;
&lt;span class="na"&gt;last_month&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1,787&lt;/span&gt;
&lt;span class="na"&gt;last_day&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



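&lt;p&gt;"~1.4K" is within 0.5% of 1,407 (a gap of 7 downloads). Excluding mirrors strips known mirror traffic, so the count is closer to real install invocations, though automated installs such as CI runs can still be included.&lt;/p&gt;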
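&lt;p&gt;The percentage can be checked directly. This small sketch reads "~1.4K" as exactly 1,400, which is an assumption about how the rounded figure should be interpreted.&lt;/p&gt;

```python
claimed = 1400   # "~1.4K" read as 1,400 (assumption)
measured = 1407  # last_week figure reported above

# Relative gap between the claim and the measured value.
relative_gap = abs(measured - claimed) / measured
within_half_percent = 0.005 >= relative_gap
```

&lt;p&gt;The gap is 7 out of 1,407, just under half a percent.&lt;/p&gt;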

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Verified against pypistats on 2026-04-15. The claim rounds the number down, not up.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this format exists
&lt;/h2&gt;

&lt;p&gt;Most open-source project posts make claims. Few back them up with file paths and line numbers.&lt;/p&gt;

&lt;p&gt;That gap is the same problem AI-SLOP Detector is built to close. AI-generated code makes claims too — functions that look complete, imports that look used, pipelines that look connected. Static analysis finds the gap between what the code says and what it does.&lt;/p&gt;

&lt;p&gt;This post applies the same standard to the project's own marketing copy. If a claim can be verified, it should be. If it can't, it shouldn't be made.&lt;/p&gt;

&lt;p&gt;The codebase is public: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;github.com/flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pull requests welcome. Audits welcome more.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verified by static code analysis + pypistats API, 2026-04-15&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitools</category>
      <category>opensource</category>
      <category>codequality</category>
      <category>python</category>
    </item>
  </channel>
</rss>
