<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kwansub Yun</title>
    <description>The latest articles on DEV Community by Kwansub Yun (@flamehaven01).</description>
    <link>https://dev.to/flamehaven01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508506%2Fe2f9bc29-10d2-41ec-8e77-19b8b5cfd9e9.jpg</url>
      <title>DEV Community: Kwansub Yun</title>
      <link>https://dev.to/flamehaven01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flamehaven01"/>
    <language>en</language>
    <item>
      <title>Beyond M15: Why STEM BIO-AI Started Acting More Like a Governance Report in v1.8.x</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Fri, 12 Jun 2026 10:21:34 +0000</pubDate>
      <link>https://dev.to/flamehaven01/beyond-m15-why-stem-bio-ai-started-acting-more-like-a-governance-report-in-v18x-2jlc</link>
      <guid>https://dev.to/flamehaven01/beyond-m15-why-stem-bio-ai-started-acting-more-like-a-governance-report-in-v18x-2jlc</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Not just a new framework, but a clearer answer to what the score means, why the report exists, and how the artifact should be read.&lt;/strong&gt;
&lt;/h2&gt;




&lt;p&gt;The real change in &lt;code&gt;v1.8.0&lt;/code&gt; through &lt;code&gt;v1.8.4&lt;/code&gt; was not that STEM BIO-AI cited one more framework.&lt;/p&gt;

&lt;p&gt;The real change was that it became harder to misread the report.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;M15&lt;/code&gt; mattered. It strengthened the regulatory-traceability vocabulary. But the deeper shift was broader:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;the tool got stricter about what it was willing to imply from local repository evidence, and the report got more explicit about why each surface exists at all.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That changed the project in three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;it stopped behaving like a score sheet that developers happened to inspect&lt;/li&gt;
&lt;li&gt;it integrated &lt;code&gt;M15&lt;/code&gt; as a bounded post-hoc traceability layer rather than a hidden score driver&lt;/li&gt;
&lt;li&gt;it treated release memory, packaging, and public report surfaces as part of release integrity rather than mere maintenance hygiene&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the real post-M15 story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fm177lzsfwr2gpnb6e0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fm177lzsfwr2gpnb6e0.png" alt="cover" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 1. Perception: Why STEM BIO-AI Should Not Be Read as a Simple Score Tool
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F686q01btrkc49rt5m3n1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F686q01btrkc49rt5m3n1.png" alt="2" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The hardest reporting problem in the &lt;code&gt;v1.8.x&lt;/code&gt; line was no longer only &lt;strong&gt;how to show something&lt;/strong&gt; or even &lt;strong&gt;what to show&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It was &lt;strong&gt;why to show it at all&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That distinction matters because the same report is read by different people for different reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a prospective user wants to know whether the repository is trustworthy enough to try&lt;/li&gt;
&lt;li&gt;a maintainer wants to know what is holding the score down and what to fix first&lt;/li&gt;
&lt;li&gt;a reviewer or auditor wants to know which claims are supported, which are overstated, and which remain outside scope&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If those audiences all receive the same fields without a visible purpose hierarchy, the result is machine-legible but human-misleading.&lt;/p&gt;

&lt;p&gt;That is why the recent report changes should be understood as &lt;strong&gt;user-friendliness in a governance sense&lt;/strong&gt;, not as design polish.&lt;/p&gt;

&lt;p&gt;The project had to become better at stopping readers from confusing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a deterministic score with a safety verdict&lt;/li&gt;
&lt;li&gt;a traceability mapping with compliance proof&lt;/li&gt;
&lt;li&gt;a code-integrity &lt;code&gt;PASS&lt;/code&gt; with overall repository maturity&lt;/li&gt;
&lt;li&gt;a compact report surface with complete evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That realization changed the output layer itself.&lt;/p&gt;

&lt;p&gt;Recent report work added or strengthened several surfaces specifically to solve that perception problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a fixed score-boundary note near the score itself&lt;/li&gt;
&lt;li&gt;explicit &lt;code&gt;Tier Lock&lt;/code&gt; and &lt;code&gt;Classification Applied&lt;/code&gt; surfaces so score constraints are not hidden&lt;/li&gt;
&lt;li&gt;stronger &lt;code&gt;Governance Posture&lt;/code&gt;, &lt;code&gt;What Is Actually Present&lt;/code&gt;, and &lt;code&gt;What Is Missing Or Contradicted&lt;/code&gt; summaries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Regulatory Traceability&lt;/code&gt; placed ahead of the MIT AI Risk Repository (AIRI), used here as a secondary risk-vocabulary layer, so the reader sees repository-to-framework mapping before the broader risk language&lt;/li&gt;
&lt;li&gt;clearer chapter hierarchy in the detailed PDF so the report reads like a governance document instead of a detector dump&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In concrete terms, that changed the reader's path through the artifact.&lt;/p&gt;

&lt;p&gt;Instead of landing first on a score and then digging through detector output, the current report leads with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Governance Posture&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;About This Score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;What Is Actually Present&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;What Is Missing Or Contradicted&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Regulatory Traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AIRI Risk Triggers&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does it move into &lt;code&gt;Decision Path&lt;/code&gt;, &lt;code&gt;Top Remediation Actions&lt;/code&gt;, &lt;code&gt;Code Integrity details&lt;/code&gt;, and &lt;code&gt;Evidence detail&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The key lesson was simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a report becomes more useful not when it shows more fields, but when the reason those fields exist becomes legible to the reader.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is also why the score disclaimer mattered so much:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Score reflects calculation integrity, not calibrated validity. Triage signal only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sentence is not ornamental. It forces the system to tell the truth about itself.&lt;/p&gt;

&lt;p&gt;What is verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;calculation integrity&lt;/li&gt;
&lt;li&gt;deterministic reproducibility&lt;/li&gt;
&lt;li&gt;transparent score assembly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What is not verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;calibrated measurement validity&lt;/li&gt;
&lt;li&gt;runtime behavior correctness&lt;/li&gt;
&lt;li&gt;clinical safety&lt;/li&gt;
&lt;li&gt;compliance or regulatory clearance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the most important perception shift in the &lt;code&gt;v1.8.x&lt;/code&gt; line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx4p534t8fev02422dup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx4p534t8fev02422dup.png" alt="3" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project is no longer trying only to answer, “What score did this repository get?”&lt;/p&gt;

&lt;p&gt;It is trying to answer something more useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is bio-governance actually present?&lt;/li&gt;
&lt;li&gt;Is it adequate relative to the repository’s claims?&lt;/li&gt;
&lt;li&gt;What is verified, what is inferred, and what is still missing?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Figure 1. The report now places governance posture, score-boundary language, and top-level trust signals near the score surface instead of hiding them behind lower-level detector output.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 2. What M15 Is, Why It Matters, and How STEM BIO-AI Uses It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kv3g2nva28gfu7x8ap0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kv3g2nva28gfu7x8ap0.png" alt="5" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;M15&lt;/code&gt; refers to &lt;strong&gt;ICH M15: General Principles for Model-Informed Drug Development&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The official FDA guidance page is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.fda.gov/regulatory-information/search-fda-guidance-documents/m15-general-principles-model-informed-drug-development" rel="noopener noreferrer"&gt;FDA: M15 General Principles for Model-Informed Drug Development&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the FDA describes it, the June 2026 final guidance was prepared under the auspices of the International Council for Harmonisation and provides general recommendations for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;planning model-informed drug development evidence&lt;/li&gt;
&lt;li&gt;model evaluation&lt;/li&gt;
&lt;li&gt;documentation&lt;/li&gt;
&lt;li&gt;regulatory interactions&lt;/li&gt;
&lt;li&gt;reporting and submission&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also establishes a harmonized assessment framework and terminology for MIDD evidence. That matters because it gives a cleaner language for talking about traceability, documentation quality, and context of use.&lt;/p&gt;

&lt;p&gt;But the important thing in STEM BIO-AI is not merely that &lt;code&gt;M15&lt;/code&gt; appears in the output.&lt;/p&gt;

&lt;p&gt;The important thing is &lt;strong&gt;how&lt;/strong&gt; it appears.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI does &lt;strong&gt;not&lt;/strong&gt; use &lt;code&gt;M15&lt;/code&gt; as a covert score driver. It does not inflate the formal score because an &lt;code&gt;M15&lt;/code&gt; citation exists. It uses &lt;code&gt;M15&lt;/code&gt; as a &lt;strong&gt;post-hoc regulatory traceability layer&lt;/strong&gt; attached to already-detected repository evidence.&lt;/p&gt;

&lt;p&gt;That boundary matters.&lt;/p&gt;

&lt;p&gt;Without it, a framework citation can easily become a kind of rhetorical overclaim:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the report looks more regulatory than it really is&lt;/li&gt;
&lt;li&gt;the reader assumes framework mention implies compliance maturity&lt;/li&gt;
&lt;li&gt;traceability begins to masquerade as proof&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The post-M15 line was careful to avoid that mistake.&lt;/p&gt;

&lt;p&gt;In practice, the project used &lt;code&gt;M15&lt;/code&gt; in a bounded way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;as part of &lt;code&gt;measurement_basis&lt;/code&gt; and regulatory framing&lt;/li&gt;
&lt;li&gt;as a traceability surface that helps interpret repository evidence&lt;/li&gt;
&lt;li&gt;as a complementary reference alongside EU AI Act, IMDRF, and FDA guidance themes&lt;/li&gt;
&lt;li&gt;not as a direct input that changes the formal score formula&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That changed real artifact fields.&lt;/p&gt;

&lt;p&gt;The post-M15 line now surfaces traceability in places such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;human-readable &lt;code&gt;Regulatory Traceability&lt;/code&gt; sections in HTML, Markdown, explain, and PDF&lt;/li&gt;
&lt;li&gt;framework-grouped labels such as &lt;code&gt;EU AI Act&lt;/code&gt;, &lt;code&gt;ICH M15&lt;/code&gt;, &lt;code&gt;IMDRF&lt;/code&gt;, and &lt;code&gt;FDA&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;status-oriented summaries such as &lt;code&gt;Signal only&lt;/code&gt;, &lt;code&gt;Partially aligned&lt;/code&gt;, and &lt;code&gt;Aligned&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;explicit &lt;code&gt;source_ids&lt;/code&gt; and &lt;code&gt;finding_refs&lt;/code&gt; so a reader can trace which repository signal triggered which regulatory mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the right way to describe the integration is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;M15 strengthened traceability language and reporting context, but it did not become the hidden engine of the score.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is also consistent with how FDA guidance should be read. FDA's own Federal Register notice states that guidance documents do not establish legally enforceable responsibilities; they describe the Agency's current thinking and should be read as recommendations unless specific statutory or regulatory requirements are cited. See the June 3, 2026 Federal Register notice for M15: &lt;a href="https://regulations.justia.com/regulations/fedreg/2026/06/03/2026-11112.html" rel="noopener noreferrer"&gt;https://regulations.justia.com/regulations/fedreg/2026/06/03/2026-11112.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This distinction also helped the report become more honest.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Regulatory Traceability&lt;/code&gt; is useful because it tells a reviewer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which frameworks the observed evidence touches&lt;/li&gt;
&lt;li&gt;which mappings are only signal-level&lt;/li&gt;
&lt;li&gt;which are partially aligned&lt;/li&gt;
&lt;li&gt;what the report still cannot claim&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly where a framework like &lt;code&gt;M15&lt;/code&gt; belongs in this system: as a bounded interpretive layer that helps a reader connect local repository signals to external governance language more carefully.&lt;/p&gt;
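
&lt;p&gt;One way to picture the bounded traceability layer described in this part is the minimal sketch below. Only the status labels and the field names &lt;code&gt;source_ids&lt;/code&gt; and &lt;code&gt;finding_refs&lt;/code&gt; come from the report; the class, the functions, the example identifiers, and the reading of &lt;code&gt;source_ids&lt;/code&gt; as framework passages are assumptions made for illustration, not the project's actual schema.&lt;/p&gt;

```python
from collections import defaultdict
from dataclasses import dataclass

# Status labels quoted from the report; everything else here is hypothetical.
ALLOWED_STATUSES = ("Signal only", "Partially aligned", "Aligned")


@dataclass(frozen=True)
class TraceabilityMapping:
    framework: str        # e.g. "ICH M15" or "EU AI Act"
    status: str           # one of ALLOWED_STATUSES
    source_ids: tuple     # assumed: identifiers of framework passages
    finding_refs: tuple   # repository findings that triggered the mapping

    def __post_init__(self):
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")
        if not self.finding_refs:
            # A mapping with no triggering finding would be a bare citation.
            raise ValueError("a mapping must reference at least one finding")


def group_by_framework(mappings):
    """Group mappings under their framework label, preserving input order."""
    grouped = defaultdict(list)
    for mapping in mappings:
        grouped[mapping.framework].append(mapping)
    return dict(grouped)


def formal_score(detector_points):
    """Sum detector points only; traceability mappings are never an input."""
    return sum(detector_points)


# Made-up identifiers, used only to exercise the sketch.
mappings = [
    TraceabilityMapping("ICH M15", "Signal only", ("SRC-1",), ("F-001",)),
    TraceabilityMapping("EU AI Act", "Partially aligned", ("SRC-2",), ("F-002",)),
    TraceabilityMapping("ICH M15", "Partially aligned", ("SRC-3",), ("F-003",)),
]
grouped = group_by_framework(mappings)
score = formal_score([10, 20, 5])
```

&lt;p&gt;The point of the sketch is the shape of the boundary: the scoring function never receives the mappings, so adding or removing a framework citation cannot move the score.&lt;/p&gt;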

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzu5z9a16d5wmzte6iio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzu5z9a16d5wmzte6iio.png" alt="6" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. Regulatory traceability now shows framework-grouped mappings, bounded statuses, and trigger-linked references, making it easier to see how local repository evidence touches ICH M15, EU AI Act, IMDRF, and FDA guidance themes without mistaking those mappings for compliance proof.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd90mq26ogcn5vphcp2p9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd90mq26ogcn5vphcp2p9.jpg" alt="sample7p" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 3. The Other Improvements That Actually Made the Tool More Mature
&lt;/h2&gt;

&lt;p&gt;After the M15 integration, three other changes mattered just as much, and in some cases more.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 The Tool Stopped Hiding Score Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13fyu6eurjnkocwuz5kc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F13fyu6eurjnkocwuz5kc.png" alt="7" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the biggest interpretability problems in earlier versions was that a report could be capped or floored without making that state obvious enough in the human-readable artifact.&lt;/p&gt;

&lt;p&gt;That is what led to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Tier Lock [CA-CAP]&lt;/code&gt;, the clinical-adjacent score-cap state&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Tier Lock [T0-FLOOR]&lt;/code&gt;, the hard-floor state for stronger direct clinical concern&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Classification Applied&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These surfaces changed the meaning of the report.&lt;/p&gt;

&lt;p&gt;They tell the reader that the formal score is not just an arithmetic total. It is also shaped by active classification state:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether the repository is clinical-adjacent&lt;/li&gt;
&lt;li&gt;whether an explicit non-clinical boundary is missing&lt;/li&gt;
&lt;li&gt;whether a score ceiling is active&lt;/li&gt;
&lt;li&gt;whether a hard-floor review path has been triggered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This made the report more inspectable, but more importantly, it made the report less willing to hide the reasons a higher tier is blocked.&lt;/p&gt;

&lt;p&gt;That matters because remediation is not always “add more points.”&lt;/p&gt;

&lt;p&gt;Sometimes the real fix is to remove the condition that prevents the repository from being meaningfully read as governance-ready.&lt;/p&gt;

&lt;p&gt;That is a better audit posture than a naked scalar score.&lt;/p&gt;
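
&lt;p&gt;The cap and floor states described in this section can be sketched roughly as follows. The lock labels &lt;code&gt;CA-CAP&lt;/code&gt; and &lt;code&gt;T0-FLOOR&lt;/code&gt; come from the report; the numeric ceilings, the class and function names, and the rule that the hard floor takes precedence are assumptions for illustration, since the real thresholds and precedence are not stated here.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ceilings for illustration; the real values are not stated.
CA_CAP_CEILING = 69
T0_FLOOR_CEILING = 39


@dataclass(frozen=True)
class FormalScore:
    raw_total: int                # plain arithmetic total of detector points
    formal_score: int             # total after the active lock is applied
    tier_lock: Optional[str]      # "CA-CAP", "T0-FLOOR", or None
    classification_applied: str   # human-readable reason for the lock


def apply_classification(raw_total, clinical_adjacent, hard_floor_triggered):
    """Apply the active classification state on top of the arithmetic total."""
    if hard_floor_triggered:
        # Assumed semantics: the hard floor is the stricter lock and wins.
        return FormalScore(raw_total, min(raw_total, T0_FLOOR_CEILING),
                           "T0-FLOOR", "hard-floor review")
    if clinical_adjacent:
        # The lock is reported whenever the state is active, even when the
        # raw total already sits below the ceiling.
        return FormalScore(raw_total, min(raw_total, CA_CAP_CEILING),
                           "CA-CAP", "clinical-adjacent")
    return FormalScore(raw_total, raw_total, None, "none")


capped = apply_classification(85, clinical_adjacent=True, hard_floor_triggered=False)
floored = apply_classification(85, clinical_adjacent=True, hard_floor_triggered=True)
unlocked = apply_classification(85, clinical_adjacent=False, hard_floor_triggered=False)
```

&lt;p&gt;Returning the lock and the classification next to the score, rather than only the clamped number, is what lets a reader see that the blocker is a state to remove and not a shortage of points.&lt;/p&gt;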

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lud9o3dxx1q0phvqems.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lud9o3dxx1q0phvqems.jpg" alt="sample1p" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  3.2 The Report Became a Governance Document Instead of a Score Sheet
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flupq2dm2ivbiu6myut5a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flupq2dm2ivbiu6myut5a.png" alt="4" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the most visible change to anyone reading the artifacts.&lt;/p&gt;

&lt;p&gt;The detailed packet stopped feeling like a machine-oriented export and started behaving more like a governance-suitability document.&lt;/p&gt;

&lt;p&gt;The current packet is built around a more explicit hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Governance Posture&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;What Is Actually Present&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;What Is Missing Or Contradicted&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Regulatory Traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AIRI Risk Triggers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Method Boundary&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current detailed packet is chaptered as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Chapter 1 — Stage Scorecard and Governance Scoring&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Chapter 2 — Code Integrity Deep Analysis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Chapter 3 — Regulatory Traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Chapter 4 — Remediation Actions, AIRI Risk Triggers &amp;amp; Method Boundary&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Chapter 5 — Report Metadata&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The HTML report similarly exposes a seven-section navigation path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Summary&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Decision Path&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Code Integrity&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Regulatory&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AIRI Risk Triggers&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Evidence&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Developer&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those labels matter because they changed what the reader sees first and what the reader is expected to conclude from the artifact. The reader now moves through adequacy, contradiction, traceability, and scope before falling back to engineering detail.&lt;/p&gt;

&lt;p&gt;Only after that does the packet lean into deeper developer-facing material such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Decision Path&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Top Remediation Actions&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Code Integrity details&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Evidence detail&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That reordering matters because the report’s first job is not to help a maintainer debug detectors. Its first job is to answer whether bio-governance is actually present, whether it is adequate relative to claims, and what remains unsupported or missing.&lt;/p&gt;

&lt;p&gt;That is why the current packet structure is more than presentation work. It is a statement about document type: a governance artifact with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a posture statement&lt;/li&gt;
&lt;li&gt;explicit scope limits&lt;/li&gt;
&lt;li&gt;traceability context&lt;/li&gt;
&lt;li&gt;contradiction surfaces&lt;/li&gt;
&lt;li&gt;remediation direction&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3.3 MICA, Packaging, and Release Surfaces Became Release Integrity Work
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgrhygtg0ylb460w9apa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgrhygtg0ylb460w9apa.png" alt="8" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final maturation step was less glamorous, but it mattered a great deal.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;v1.8.x&lt;/code&gt;, active memory pointers, public version surfaces, preview assets, and package-data inclusion became impossible to treat as optional housekeeping.&lt;/p&gt;

&lt;p&gt;If the release says one thing while:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MICA&lt;/code&gt;, the project's active release-memory layer, points somewhere else&lt;/li&gt;
&lt;li&gt;packaged assets omit active files&lt;/li&gt;
&lt;li&gt;report previews lag behind the actual runtime&lt;/li&gt;
&lt;li&gt;public docs describe stale behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;then the tool is not governed. It is merely assembled.&lt;/p&gt;

&lt;p&gt;That is why post-M15 work spent real effort on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rotating the active MICA trio cleanly&lt;/li&gt;
&lt;li&gt;pruning live historical memory surfaces while preserving provenance in Git-tagged history&lt;/li&gt;
&lt;li&gt;making report previews match the actual runtime output&lt;/li&gt;
&lt;li&gt;hardening package-data and release-surface alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical examples here are not abstract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;README&lt;/code&gt;-level tables and actual packet filenames had to agree on &lt;code&gt;8p&lt;/code&gt;, not &lt;code&gt;7p&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;tracked preview assets had to match the real generated HTML and PDF outputs&lt;/li&gt;
&lt;li&gt;active &lt;code&gt;MICA&lt;/code&gt; pointers had to reference the same live trio the package actually shipped&lt;/li&gt;
&lt;li&gt;public docs had to stop describing stale section counts or old packet shapes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Small mismatches matter here because governance tools are judged by their own traceability discipline. If a report surface says &lt;code&gt;8p&lt;/code&gt; while the surrounding docs still describe &lt;code&gt;7p&lt;/code&gt;, the tool teaches the wrong lesson about its own evidence hygiene.&lt;/p&gt;

&lt;p&gt;This sounds operational because it is. But it is also methodological.&lt;/p&gt;

&lt;p&gt;A governance scanner that critiques target repositories for stale surfaces, unsupported claims, or weak provenance cannot remain credible if its own release memory and public artifact surfaces drift from one version to the next.&lt;/p&gt;

&lt;p&gt;That is why the packaging and memory work belongs in the same story as the report work.&lt;/p&gt;

&lt;p&gt;It reduced the number of places where truth could fork.&lt;/p&gt;
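
&lt;p&gt;A release-surface check of the kind described in this section can be as small as the sketch below. The function name and the surface labels are hypothetical; only the &lt;code&gt;8p&lt;/code&gt; versus &lt;code&gt;7p&lt;/code&gt; mismatch is taken from the examples above, and the project's own checks may look quite different.&lt;/p&gt;

```python
def find_surface_drift(claims):
    """Return an empty dict when every surface agrees, else claim to surfaces.

    `claims` maps a surface name to the value that surface states,
    for example the packet size a README table advertises.
    """
    by_value = {}
    for surface, value in claims.items():
        by_value.setdefault(value, []).append(surface)
    if len(by_value) == 1:
        return {}
    return {value: sorted(surfaces) for value, surfaces in by_value.items()}


# Hypothetical surface names; the values mirror the 8p / 7p example.
claims = {
    "README table": "8p",
    "packet filename": "8p",
    "public docs": "7p",
}
drift = find_surface_drift(claims)
```

&lt;p&gt;Run in CI before a release, a non-empty result would fail the build and name the surfaces that disagree, which is one way to keep truth from forking.&lt;/p&gt;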




&lt;h2&gt;
  
  
  Where This Leaves the Project
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2drz2titirm380osf4r1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2drz2titirm380osf4r1.png" alt="9" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If I had to summarize the post-M15 line in one sentence, it would be this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEM BIO-AI became less willing to let a convenient surface pretend to be the whole truth.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That shows up in several places at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the score is now shown with clearer purpose boundaries&lt;/li&gt;
&lt;li&gt;score constraints are surfaced instead of buried&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;M15&lt;/code&gt; appears as traceability, not as covert score inflation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AIRI&lt;/code&gt; is framed as secondary risk vocabulary, not proof&lt;/li&gt;
&lt;li&gt;the packet now behaves more like a governance document&lt;/li&gt;
&lt;li&gt;release memory and packaging are treated as release-integrity concerns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tool is still bounded and deterministic. It still cannot see runtime truth, wet-lab reproducibility, model-output correctness, or clinical validation.&lt;/p&gt;

&lt;p&gt;But in the &lt;code&gt;v1.8.x&lt;/code&gt; line, it got better at saying exactly that.&lt;/p&gt;

&lt;p&gt;And it got better at saying it in a form that a prospective user, a maintainer, and a reviewer can all use without needing to reverse-engineer the internal taxonomy first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu77ytdlvcs0xrkoz1i1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu77ytdlvcs0xrkoz1i1u.png" alt="10" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next maturity steps are not only more detectors.&lt;/p&gt;

&lt;p&gt;They are also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;improving human-readable explanations without overstating certainty&lt;/li&gt;
&lt;li&gt;expanding the behavioral and path-sensitive side of static analysis without pretending it is dynamic truth&lt;/li&gt;
&lt;li&gt;broadening benchmark calibration so score validity rests less on prior assumptions&lt;/li&gt;
&lt;li&gt;continuing to align report purpose, release memory, and public surfaces so the artifact remains hard to misread&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the real roadmap after M15.&lt;/p&gt;

&lt;p&gt;Not just more coverage.&lt;/p&gt;

&lt;p&gt;More disciplined meaning.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flty80m9umf7b13s0qagp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flty80m9umf7b13s0qagp.png" alt="repo" width="680" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Live HF Space: &lt;a href="https://huggingface.co/spaces/Flamehaven/stem-bio-ai" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/Flamehaven/stem-bio-ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>bioinformatics</category>
      <category>opensource</category>
      <category>infrastructure</category>
      <category>governance</category>
    </item>
    <item>
      <title>AI-SLOP-DETECTOR v3.8.1: When Code Generation Gets Cheap, Structural Trust Gets Expensive</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 04 Jun 2026 15:09:30 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v381-when-code-generation-gets-cheap-structural-trust-gets-expensive-3kb0</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v381-when-code-generation-gets-cheap-structural-trust-gets-expensive-3kb0</guid>
      <description>&lt;p&gt;For a long time, the hardest part of software development was writing code.&lt;/p&gt;

&lt;p&gt;That is no longer true.&lt;/p&gt;

&lt;p&gt;As AI-assisted coding and agent-driven workflows become mainstream, the cost of generating code is collapsing. But the cost of understanding, reviewing, simplifying, and deleting code is rising just as quickly. Code is now easier to append than to validate. Easier to duplicate than to consolidate. Easier to generate than to safely remove.&lt;/p&gt;

&lt;p&gt;That asymmetry is creating a new engineering problem. The question is no longer only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we generate more code faster?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is increasingly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we stop generated code from silently degrading the structure of a codebase?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the space AI-SLOP-DETECTOR is being built for.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;v3.8.1&lt;/code&gt; matters because the project is moving from &lt;strong&gt;detection&lt;/strong&gt; toward &lt;strong&gt;governed cleanup&lt;/strong&gt;, while keeping three layers separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;scoring&lt;/strong&gt;: measure structural risk&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;action planning&lt;/strong&gt;: prioritize what is safe or important to review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;enforcement&lt;/strong&gt;: verify what must fail closed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is the real story of this release. It is also the strongest reason to take the project seriously.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Release Matters Now
&lt;/h2&gt;

&lt;p&gt;There are many tools that claim to measure “AI code quality.” The meaningful distinction is not whether they can emit findings. It is whether they preserve boundary discipline when the findings start to drive workflow.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;v3.8.1&lt;/code&gt; is important because it sharpens three claims:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;The scoring path became safer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cleanup became more actionable&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Governance became harder to bypass&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else in this release is evidence for one of those three claims.&lt;/p&gt;




&lt;h2&gt;
  
  
  Changelog Evidence Since v3.6.0
&lt;/h2&gt;

&lt;p&gt;The recent releases make more sense as a sequence than as isolated feature drops.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;th&gt;Why It Mattered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.6.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Claude Code Skill, CI gate fix, pre-commit rewrite, VS Code packaging&lt;/td&gt;
&lt;td&gt;The project became more workflow-aware, not just scan-aware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dogfooding calibration, renderer/module splits, self-repair from internal audit&lt;/td&gt;
&lt;td&gt;Maintainability and internal trust improved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;False-positive reduction, richer skill routing, VS Code modularization&lt;/td&gt;
&lt;td&gt;Lower friction and better usability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Config/schema validation and runtime data guards&lt;/td&gt;
&lt;td&gt;The scoring path became harder to corrupt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Import/package stability and CI fixes&lt;/td&gt;
&lt;td&gt;The tool became more reliable in real environments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Major false-positive patch wave&lt;/td&gt;
&lt;td&gt;Trustworthiness improved materially&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;phantom_import&lt;/code&gt; flat-project fix&lt;/td&gt;
&lt;td&gt;A visible correctness gap was closed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;deficit_breakdown&lt;/code&gt;, idempotent &lt;code&gt;--init&lt;/code&gt;, first-run UX improvements&lt;/td&gt;
&lt;td&gt;Explainability and onboarding improved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.7&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cross-language aggregation fix, ignore matching fix, ML reproducibility fix&lt;/td&gt;
&lt;td&gt;Project-level correctness improved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Structural scaling, suppression ledger, cache, hotspots, agent API&lt;/td&gt;
&lt;td&gt;The tool became more operational&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.7.9&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Governance verification gate and math/policy separation&lt;/td&gt;
&lt;td&gt;Enforcement became explicit and fail-closed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.8.0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Canonical CLI: &lt;code&gt;scan / review / pulse / sweep&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;The public surface became simpler and more stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;v3.8.1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cleanup confidence planning, manifest hygiene, layered architecture review&lt;/td&gt;
&lt;td&gt;The tool moved from issue listing toward action planning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Seen together, these releases show a pattern: not just more features, but more correctness, more explainability, more governance, and more usable workflow surfaces.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 1: The Scoring Path Became Safer
&lt;/h2&gt;

&lt;p&gt;The most important technical reinforcement since &lt;code&gt;v3.6.0&lt;/code&gt; is not that the project added more signals. It is that the project made the scoring path safer to trust.&lt;/p&gt;

&lt;p&gt;The core model still uses a weighted geometric aggregation across four dimensions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65z6q1c8k6f2e8r35mph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65z6q1c8k6f2e8r35mph.png" alt="1" width="797" height="69"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;with the deficit-oriented score driven by:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ny6zvz3veh8rd800c1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ny6zvz3veh8rd800c1u.png" alt="2" width="797" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;em&gt;&lt;strong&gt;P pattern&lt;/strong&gt;&lt;/em&gt; represents the additional penalty assigned when repeated structural patterns reinforce the deficit.&lt;/p&gt;
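
&lt;p&gt;For readers who prefer code to notation, here is a minimal sketch of what a weighted geometric aggregation with a pattern penalty can look like. It is not the project's implementation: the dimension names, the weights, the &lt;code&gt;floor&lt;/code&gt; value, and the exact deficit form (100 times one minus quality, plus the penalty, capped at 100) are all assumptions made for illustration.&lt;/p&gt;

```python
import math

# Hypothetical dimension names and weights; placeholders for illustration only.
WEIGHTS = {"dim_a": 0.4, "dim_b": 0.3, "dim_c": 0.2, "dim_d": 0.1}


def weighted_geometric_mean(scores, weights, floor=1e-9):
    """Weighted geometric mean of per-dimension scores in [0, 1]."""
    total = sum(weights.values())
    # Sum of w * ln(x); the floor keeps a zero score away from log(0).
    log_sum = sum(
        w * math.log(max(scores[name], floor)) for name, w in weights.items()
    )
    return math.exp(log_sum / total)


def deficit_score(scores, weights, pattern_penalty=0.0):
    """Assumed form: 100 * (1 - quality) + pattern penalty, capped at 100."""
    quality = weighted_geometric_mean(scores, weights)
    return min(100.0, 100.0 * (1.0 - quality) + pattern_penalty)
```

&lt;p&gt;The practical property of a geometric mean is that one very weak dimension drags the whole score down much harder than it would in an arithmetic average, which is usually what you want from a structural risk measure.&lt;/p&gt;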

&lt;p&gt;That formula is not the interesting part by itself. The important part is what was reinforced around it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;config values are validated before they enter the model&lt;/li&gt;
&lt;li&gt;metric ranges are guarded before they can poison the score&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deficit_breakdown&lt;/code&gt; makes score attribution inspectable&lt;/li&gt;
&lt;li&gt;cross-language aggregation no longer misstates project summaries&lt;/li&gt;
&lt;li&gt;structural coherence scoring now scales to large repositories, with a deterministic fallback above a scale ceiling&lt;/li&gt;
&lt;/ul&gt;
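
&lt;p&gt;The first two items in that list can be sketched in a few lines. This is an illustration under assumptions, not the project's code: the function names &lt;code&gt;validate_weights&lt;/code&gt; and &lt;code&gt;guard_metric&lt;/code&gt; are hypothetical, and so is the choice to clamp out-of-range metrics while rejecting non-finite ones.&lt;/p&gt;

```python
import math


def validate_weights(weights):
    """Reject weights that are not finite, positive numbers before scoring."""
    if not weights:
        raise ValueError("weights must not be empty")
    for name, value in weights.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"weight {name!r} must be a number")
        if not math.isfinite(value) or not value > 0:
            raise ValueError(f"weight {name!r} must be finite and positive")
    return dict(weights)


def guard_metric(name, value, low=0.0, high=1.0):
    """Refuse NaN or infinite metrics, then clamp the value into [low, high]."""
    if not math.isfinite(value):
        raise ValueError(f"metric {name!r} is not finite")
    return min(high, max(low, value))
```

&lt;p&gt;The point of guarding at the boundary is that a bad config value or a stray NaN fails loudly before aggregation, instead of quietly shifting every downstream score.&lt;/p&gt;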

&lt;h3&gt;
  
  
  Why it matters
&lt;/h3&gt;

&lt;p&gt;Without those reinforcements, the formula risks becoming authority texture. With them, it behaves more like an engineering instrument.&lt;/p&gt;

&lt;p&gt;For a technical reader, the observable improvement is not abstract math prestige. It is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer broken summaries&lt;/li&gt;
&lt;li&gt;fewer config-induced distortions&lt;/li&gt;
&lt;li&gt;better explanation of where a score came from&lt;/li&gt;
&lt;li&gt;predictable behavior on large repositories&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, the model became harder to misuse, easier to explain, and more stable at scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 2: Cleanup Became More Actionable
&lt;/h2&gt;

&lt;p&gt;Most code-quality tools stop at issue emission. That is useful, but incomplete.&lt;/p&gt;

&lt;p&gt;Developers do not only need to know what exists. They need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is important&lt;/li&gt;
&lt;li&gt;what is probably safe to review&lt;/li&gt;
&lt;li&gt;what needs human caution&lt;/li&gt;
&lt;li&gt;what should be looked at first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where &lt;code&gt;v3.8.1&lt;/code&gt; makes its clearest product-level leap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cleanup confidence planning
&lt;/h3&gt;

&lt;p&gt;Cleanup-family outputs can now carry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;confidence&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;action_class&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;evidence&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important architectural choice is that this was &lt;strong&gt;not&lt;/strong&gt; implemented as a second disconnected scoring model. Cleanup confidence is a reuse layer over existing signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;deficit_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;churn&lt;/li&gt;
&lt;li&gt;coverage gap&lt;/li&gt;
&lt;li&gt;cleanup-local evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simplified mental model looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;base_evidence&lt;/span&gt;
&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;low_churn_bonus&lt;/span&gt;
&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;low_coverage_bonus&lt;/span&gt;
&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;active_churn_penalty&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact arithmetic is less important than the architecture: the system is not maintaining one truth model for scoring and another truth model for cleanup.&lt;/p&gt;
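
&lt;p&gt;To make the mental model concrete, here is a runnable sketch of the same idea. Every constant, threshold, and action-class label below is invented for illustration and should not be read as the values the tool ships with; only the overall shape (one base evidence value adjusted by churn and coverage signals) comes from the description above.&lt;/p&gt;

```python
def cleanup_confidence(base_evidence, recent_commits, coverage):
    """Adjust a base evidence value using churn and coverage signals.

    All bonuses, penalties, and cutoffs are illustrative assumptions.
    """
    confidence = base_evidence
    if recent_commits == 0:
        confidence += 0.15  # low churn bonus: nobody is actively touching it
    elif recent_commits >= 5:
        confidence -= 0.25  # active churn penalty: the code is still moving
    if 0.2 > coverage:
        confidence += 0.10  # low coverage bonus: tests rarely exercise it
    return min(1.0, max(0.0, confidence))


def action_class(confidence):
    """Map a confidence value to a hypothetical action class label."""
    if confidence >= 0.8:
        return "review_first"
    if confidence >= 0.5:
        return "review_later"
    return "needs_human_caution"
```

&lt;p&gt;Note that the function takes signals the scoring path already produces. That is the reuse the section describes: one set of measurements, read twice for different purposes.&lt;/p&gt;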

&lt;h3&gt;
  
  
  Manifest-aware dependency hygiene
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;unused-deps&lt;/code&gt; also grew beyond file-local hints. It now reads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pyproject.toml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;package.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and can emit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;manifest_unused_dependency&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;undeclared_import&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters because many dependency problems are not visible inside a single file. They exist at the boundary between source code and project metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it matters
&lt;/h3&gt;

&lt;p&gt;Before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sweep -&amp;gt; list of candidates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sweep -&amp;gt; ranked issues -&amp;gt; action class -&amp;gt; evidence-backed review plan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the difference between a detector and a cleanup instrument.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 3: Governance Became Harder To Bypass
&lt;/h2&gt;

&lt;p&gt;This is arguably the release’s strongest credibility anchor, and it deserves to be said plainly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The project does not ask the score to become policy, and it does not let policy quietly mutate the score.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is the right architectural judgment.&lt;/p&gt;

&lt;h3&gt;
  
  
  What changed
&lt;/h3&gt;

&lt;p&gt;The project now treats governance as a separate fail-closed path:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analysis emits a deterministic governance artifact&lt;/li&gt;
&lt;li&gt;verification recomputes the artifact hash&lt;/li&gt;
&lt;li&gt;policy checks run in a dedicated verification gate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The workflow is intentionally layered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;analysis -&amp;gt; governance_record.json -&amp;gt; verify-governance -&amp;gt; pass/fail enforcement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
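
&lt;p&gt;The verification step can be pictured with a small sketch. The record layout here (a &lt;code&gt;payload&lt;/code&gt; plus a &lt;code&gt;sha256&lt;/code&gt; field), the &lt;code&gt;deficit_score&lt;/code&gt; policy field, and the &lt;code&gt;max_deficit&lt;/code&gt; threshold are assumptions for illustration; the actual schema of &lt;code&gt;governance_record.json&lt;/code&gt; is not described in this article.&lt;/p&gt;

```python
import hashlib
import json


def canonical_bytes(payload):
    """Serialize deterministically so the same payload always hashes the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_record(payload):
    """Analysis side: emit the payload together with its hash."""
    digest = hashlib.sha256(canonical_bytes(payload)).hexdigest()
    return {"payload": payload, "sha256": digest}


def verify_record(record, max_deficit=30.0):
    """Verification side: recompute the hash, then apply policy.

    Fails closed: a missing field, a hash mismatch, an unreadable value,
    or a policy breach all return False.
    """
    try:
        payload = record["payload"]
        recomputed = hashlib.sha256(canonical_bytes(payload)).hexdigest()
        if recomputed != record["sha256"]:
            return False
        return max_deficit >= float(payload["deficit_score"])
    except (KeyError, TypeError, ValueError):
        return False
```

&lt;p&gt;The threshold lives in the verifier, not in the scorer. Tightening &lt;code&gt;max_deficit&lt;/code&gt; changes what CI accepts without changing a single computed score.&lt;/p&gt;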



&lt;h3&gt;
  
  
  Why it matters
&lt;/h3&gt;

&lt;p&gt;This separation means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;math can evolve without silently changing CI policy&lt;/li&gt;
&lt;li&gt;policy can become stricter without corrupting the scoring model&lt;/li&gt;
&lt;li&gt;governance can be audited as an artifact, not just inferred from a transient report&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a category crowded with vague “AI code quality” claims, this is the kind of subsystem separation that actually signals seriousness.&lt;/p&gt;




&lt;h2&gt;
  
  
  Supporting Reinforcements
&lt;/h2&gt;

&lt;p&gt;The release also includes several important supporting improvements that strengthen the three main claims without replacing them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layered architecture review
&lt;/h3&gt;

&lt;p&gt;Architecture analysis can now opt into a layered preset rather than stopping at import cycles alone.&lt;/p&gt;

&lt;p&gt;A simplified configuration looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;architecture&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;preset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;layered&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The built-in intent is narrow by design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;api -&amp;gt; domain&lt;/code&gt; allowed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;domain -&amp;gt; data&lt;/code&gt; forbidden&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;domain -&amp;gt; service&lt;/code&gt; forbidden&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;domain -&amp;gt; api&lt;/code&gt; forbidden&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not enabled by default, and that is correct. Architecture review is valuable only if it avoids becoming a false-positive factory.&lt;/p&gt;
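
&lt;p&gt;As a rough illustration of how such a rule set can be checked, the sketch below tests a list of import edges against the three forbidden directions listed above. The convention that the layer is the second segment of the module path, and the names &lt;code&gt;layer_of&lt;/code&gt; and &lt;code&gt;violations&lt;/code&gt;, are assumptions made for the example rather than the tool's behavior.&lt;/p&gt;

```python
# Forbidden (importer layer, imported layer) pairs from the list above.
FORBIDDEN = {("domain", "data"), ("domain", "service"), ("domain", "api")}


def layer_of(module):
    """Assumed convention: 'app.domain.user' belongs to the 'domain' layer."""
    parts = module.split(".")
    return parts[1] if len(parts) > 1 else None


def violations(import_edges):
    """Return one evidence entry per import edge that crosses a forbidden boundary."""
    found = []
    for importer, imported in import_edges:
        pair = (layer_of(importer), layer_of(imported))
        if pair in FORBIDDEN:
            found.append(
                {
                    "importer": importer,
                    "imported": imported,
                    "rule": f"{pair[0]} -> {pair[1]} forbidden",
                }
            )
    return found
```

&lt;p&gt;Anything not explicitly forbidden passes, which is one way to keep a preset like this narrow and the findings low-noise.&lt;/p&gt;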

&lt;h3&gt;
  
  
  Canonical CLI
&lt;/h3&gt;

&lt;p&gt;The public CLI is now much easier to hold in memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;scan&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;review&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pulse&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sweep&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That simplification matters because adoption dies when the interface surface grows faster than user confidence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selective Rust acceleration
&lt;/h3&gt;

&lt;p&gt;Performance work also stayed disciplined. The project did &lt;strong&gt;not&lt;/strong&gt; rewrite itself around native code. It kept Python as the product core and used Rust only for measured hot paths such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file walking&lt;/li&gt;
&lt;li&gt;glob-heavy traversal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right trade. Native code is a performance helper here, not a product identity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Five Topics Worth A Deeper Follow-Up
&lt;/h2&gt;

&lt;p&gt;The following five areas deserve separate technical notes because they are where the release’s architecture becomes most visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Mathematical Model Hardening
&lt;/h3&gt;

&lt;p&gt;The scoring model did not need a louder formula. It needed a safer boundary.&lt;/p&gt;

&lt;p&gt;That is why the important work happened around validation, metric guards, cross-language aggregation, attributed deficit output, and deterministic fallback above scale thresholds. The benefit is practical: fewer strange summaries, safer config changes, and score outputs that are easier to debug.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scan -&amp;gt; validated metrics -&amp;gt; attributed score -&amp;gt; project summary
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model now behaves less like an opaque detector and more like a measurement subsystem.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cleanup Confidence Planning
&lt;/h3&gt;

&lt;p&gt;“This might be dead code” is not enough guidance for real cleanup work.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;v3.8.1&lt;/code&gt; moves cleanup closer to a review plan by attaching confidence, action class, and evidence to cleanup-family findings. The key design choice is reuse: cleanup confidence draws from existing signals such as deficit, churn, coverage, and local evidence instead of inventing a second truth system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sweep dead-code -&amp;gt; ranked issue -&amp;gt; action class -&amp;gt; evidence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That makes cleanup safer for humans and easier for agents to consume.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Manifest-Aware Dependency Hygiene
&lt;/h3&gt;

&lt;p&gt;Dependency debt is often project-level, not file-local.&lt;/p&gt;

&lt;p&gt;By comparing declared dependencies, imported dependencies, and normalized top-level mappings across &lt;code&gt;pyproject.toml&lt;/code&gt; and &lt;code&gt;package.json&lt;/code&gt;, the tool can now surface manifest-level problems such as unused declared packages or missing declarations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;manifest -&amp;gt; imports -&amp;gt; used / unused / missing -&amp;gt; cleanup output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That turns &lt;code&gt;unused-deps&lt;/code&gt; from a file hint into a repository hygiene signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Layered Architecture Review
&lt;/h3&gt;

&lt;p&gt;Cycle detection is useful, but many architecture failures appear before cycles do.&lt;/p&gt;

&lt;p&gt;The layered architecture preset gives teams an opt-in way to express allowed and forbidden import directions, with evidence attached to the violation. The important part is restraint: this is not forced on every repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;boundary-violations -&amp;gt; cycles + optional layered rule review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That keeps architecture review useful without turning it into noisy certainty.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Governance Verification Gate
&lt;/h3&gt;

&lt;p&gt;Measurement and enforcement should not collapse into the same layer.&lt;/p&gt;

&lt;p&gt;The governance gate creates a deterministic artifact, verifies it separately, and fails closed when policy or integrity checks break. That makes CI behavior more explicit and audit-friendly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;scan -&amp;gt; governance artifact -&amp;gt; verify-governance -&amp;gt; pass / fail
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is one of the strongest separations in the system: measurement, artifact generation, and enforcement each have their own boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Category Will Keep Growing
&lt;/h2&gt;

&lt;p&gt;We are still early.&lt;/p&gt;

&lt;p&gt;Most teams are only beginning to feel what large-scale AI-assisted development actually does to a repository over time. At first it feels like acceleration. Then it starts to feel like churn, duplication, abandoned logic, inflated structure, and uncertainty about what is still safe to touch.&lt;/p&gt;

&lt;p&gt;That is why interest in slop will keep rising.&lt;/p&gt;

&lt;p&gt;The more code agents can generate, the more valuable the tools that help humans decide what should never have remained in the codebase in the first place.&lt;/p&gt;

&lt;p&gt;As agent-driven code development becomes more mainstream, the need for systems that do the following will likely accelerate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;measure structural trust&lt;/li&gt;
&lt;li&gt;prioritize cleanup&lt;/li&gt;
&lt;li&gt;separate evidence from policy&lt;/li&gt;
&lt;li&gt;make deletion safer&lt;/li&gt;
&lt;li&gt;make governance explicit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI-SLOP-DETECTOR is being built gradually in that direction.&lt;/p&gt;

&lt;p&gt;Not as a one-shot idea.&lt;br&gt;
Not as a trend-chasing wrapper.&lt;br&gt;
Not as a linter with a fashionable label.&lt;/p&gt;

&lt;p&gt;But as a system shaped step by step around a simple reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;if AI makes code generation cheap, then structural review, cleanup discipline, and governance become more valuable than ever.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the craft mindset behind this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;refine the instrument&lt;/li&gt;
&lt;li&gt;tighten the workflow&lt;/li&gt;
&lt;li&gt;separate the layers&lt;/li&gt;
&lt;li&gt;improve the trust surface one release at a time&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19r5jxlhfsh8olypnis6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19r5jxlhfsh8olypnis6.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repository: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>opensource</category>
      <category>ai</category>
      <category>governance</category>
    </item>
    <item>
      <title>When the Memory Gate Met a Real Archive: What 90 Experiments Taught Us About Cheap LLM Slop</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 03 Jun 2026 18:15:07 +0000</pubDate>
      <link>https://dev.to/flamehaven01/when-the-memory-gate-met-a-real-archive-what-90-experiments-taught-us-about-cheap-llm-slop-4mm8</link>
      <guid>https://dev.to/flamehaven01/when-the-memory-gate-met-a-real-archive-what-90-experiments-taught-us-about-cheap-llm-slop-4mm8</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Enforcing the MICA Contract
&lt;/h2&gt;

&lt;p&gt;This article is the practical side of the MICA series. &lt;strong&gt;MICA&lt;/strong&gt; stands for &lt;em&gt;Memory Invocation and Context Archive&lt;/em&gt;. In the workflow described here, it is a small package that the maintainer loads at session start so the active rules are visible before any code is touched. &lt;/p&gt;

&lt;p&gt;Parts 6 and 7 set up the contract. This article shows what that contract did when a real scientific archive started accumulating cheap slop across more surfaces than a single maintainer could manually hold.&lt;/p&gt;

&lt;p&gt;The archive is the Flamehaven Verification Ledger. It publishes three kinds of records.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EQA (Equation-to-Artifact).&lt;/strong&gt; Physics and math reproductions. Currently 56 records, numbered &lt;code&gt;TOE-TEST-0001&lt;/code&gt; through &lt;code&gt;TOE-TEST-0056&lt;/code&gt;. Example: a Schwarzschild Planck-scale metric verification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BAV (Biomolecular AI Validation).&lt;/strong&gt; Protein-folding consensus checks across several AI fold models (AlphaFold3, AlphaFold2, Chai-1, Boltz-2). Currently 34 experiments, with 6 active cards and a 26-entry foundational archive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BSC (Bioscience Compliance).&lt;/strong&gt; Repository compliance audits against external risk taxonomies (the MIT AI Risk Repository and EU AI Act). Currently 2 audits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is around 90 experiments in total. The full file count is past 300. Every record is published. The archive is intended to be cited. If the AI maintainer drifts, the drift can become a downstream paper citation.&lt;/p&gt;

&lt;p&gt;One scope note matters before the story starts. &lt;code&gt;flamehaven-audit-reports&lt;/code&gt; is not the engine that computes these results. It is the public evidence surface. Upstream engines and experiment repositories produce the raw artifacts. &lt;/p&gt;

&lt;p&gt;This repository ingests those artifacts, sanitizes them for publication, classifies what kind of record they are, and renders them in a static ledger that other people can inspect and cite.&lt;/p&gt;

&lt;p&gt;The archive has already taught us three different shapes of cheap slop. EQA taught us about framing drift at scale (a record displayed as a &lt;code&gt;PASS&lt;/code&gt; when no real check produced it). The portal taught us about state duplication (an inline JavaScript copy that drifted from the disk file behind it).&lt;/p&gt;

&lt;p&gt;BAV keeps trying to teach us about provenance drift, artifact-identity drift, and over-clean presentation around real runs. The article walks those three scars in order and then describes the gate that grew out of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  📖 Glossary
&lt;/h2&gt;

&lt;p&gt;A short list. Skim and move on.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MICA.&lt;/strong&gt; A small package the maintainer loads at session start. It carries the rules and exposes whether the package state is coherent before write work begins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI (Design Invariant).&lt;/strong&gt; A rule with an ID. Example: &lt;code&gt;DI-EQA-001&lt;/code&gt; says math runs must use &lt;code&gt;mpmath&lt;/code&gt; at 200-bit precision or higher.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playbook.&lt;/strong&gt; A markdown file. People read it. Every rule inside cites a DI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema.&lt;/strong&gt; Two machine-readable files. &lt;code&gt;mica.yaml&lt;/code&gt; carries the package shape. &lt;code&gt;archive.json&lt;/code&gt; carries the 28 DIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validator.&lt;/strong&gt; &lt;code&gt;mica_pct.py&lt;/code&gt;. When run against a package root, it emits &lt;code&gt;CLOSED CONTRACT&lt;/code&gt; or &lt;code&gt;INCOMPLETE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receipt.&lt;/strong&gt; A small JSON block proving a run actually ran. Pins the engine commit hash, the run command, and the output hash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EQA / BAV / BSC.&lt;/strong&gt; The three lanes of the archive. Physics math, protein folding, compliance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is enough to read the rest.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Archive We Are Talking About
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft37paqynk9kn9gkrsyhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft37paqynk9kn9gkrsyhh.png" alt="Comparing the three lanes of truth" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The story needs a concrete protagonist. The protagonist is the archive itself. The opening named the three lanes. This section adds the file shape, three live numbers, and one honest scope label that the rest of the article will keep returning to.&lt;/p&gt;

&lt;p&gt;The protagonist is not a single program. It is a layered publication system. Upstream computation happens in engine or experiment repositories. &lt;code&gt;flamehaven-audit-reports&lt;/code&gt; is the place where those outputs are turned into public records. &lt;/p&gt;

&lt;p&gt;That projection layer does four jobs that are easy to blur together if they are not named explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingest the upstream artifact&lt;/li&gt;
&lt;li&gt;Sanitize anything that should not be published as-is&lt;/li&gt;
&lt;li&gt;Classify the record by what kind of evidence it really is&lt;/li&gt;
&lt;li&gt;Render it through a static inspection surface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The three lanes do not all behave the same way inside that surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EQA Lane: Closest to a deterministic computation archive.&lt;/li&gt;
&lt;li&gt;BAV Lane: Pipeline- and governance-heavy. The ledger must distinguish between a genuine rerunnable experiment, a runtime audit, and a research or review artifact.&lt;/li&gt;
&lt;li&gt;BSC Lane: Maps repository state directly to external compliance taxonomies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each EQA record carries at least two files. A machine-readable &lt;code&gt;internal_data.json&lt;/code&gt; holds the receipt. A human-readable &lt;code&gt;analysis_report.md&lt;/code&gt; holds the narrative. Some records also ship a SPAR review record. The strongest three current records:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schwarzschild Planck-scale metric verification. Engine re-execution produced &lt;code&gt;Omega = 0.9985&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;de Sitter background check. Recorded at &lt;code&gt;sqrt_jsd = 0.2722&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;OpenAI Erdős Eq.(2.2) reproduction. Claim: matches the published value to 0.014 percent. Anchored to a public MIT repo and a Zenodo DOI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The BAV lane is honest about its own scale. Of the 6 active cards, only one (&lt;code&gt;EXP-031&lt;/code&gt;) carries a foldable input sequence. It is a 52-amino-acid input run against AlphaFold3, AlphaFold2, Chai-1, and Boltz-2. &lt;/p&gt;

&lt;p&gt;The other five cards are governance and methodology experiments. They do not ship a re-run scaffold. That boundary is honest. We did not invent a fake fold to fill the slot.&lt;/p&gt;

&lt;p&gt;The total is past 300 files. That number matters. A single maintainer can review 5 records by hand. Nobody can review 300. Slop scales with file count. Review does not.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. What the EQA Lane Taught Us First
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp52adgfyghsoa5qckhyb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp52adgfyghsoa5qckhyb.png" alt="What the EQA Lane Taught Us First" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first scar was about labels, not numbers. The numbers were correct. The labels around them were wrong.&lt;/p&gt;

&lt;p&gt;In June 2026, we ran an internal audit on the EQA archive, the lane that publishes the physics and math reproductions. The plan was to spot-check the calculations. We re-ran the engine on a sample of records. The numbers matched. &lt;/p&gt;

&lt;p&gt;A Schwarzschild horizon calculation came back at &lt;code&gt;Omega = 0.9985&lt;/code&gt;. A de Sitter background check came back at &lt;code&gt;sqrt_jsd = 0.2722&lt;/code&gt;. The math was honest.&lt;/p&gt;

&lt;p&gt;The audit found a different problem.&lt;/p&gt;

&lt;p&gt;The public website was showing 51 of the 56 records with a green &lt;code&gt;PASS&lt;/code&gt; badge. A green PASS is supposed to mean: a numerical check ran and the result passed a threshold. That was not what the website was doing. &lt;/p&gt;

&lt;p&gt;It was treating every record that had a markdown analysis file as &lt;code&gt;PASS&lt;/code&gt;, whether or not a real check had ever run. Governance notes, scenario builds, and integration documents all showed up as if they had been verified.&lt;/p&gt;

&lt;p&gt;A reader scanning the page saw "51 successful verifications." When we sat down and went through the 51 records by hand, only 7 of them had come from a real engine run. The other 44 were notes and supporting documents that had been imported into the lane over time.&lt;/p&gt;

&lt;p&gt;The numbers did not change. The framing did.&lt;/p&gt;

&lt;p&gt;We rewrote the page headline to say "7 verification runs and 44 supporting documents." We wrote a new rule into the package contract that lives next to the records on disk. &lt;/p&gt;

&lt;p&gt;The rule says, in plain English: a green PASS badge can only come from a real threshold check. The mere presence of a report file is not a PASS. A grade copied in from someone else's report is not a fresh verdict. The five most recent records (numbered &lt;code&gt;TOE-TEST-0052&lt;/code&gt; through &lt;code&gt;TOE-TEST-0056&lt;/code&gt;) carry their own real verdicts.&lt;/p&gt;
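
&lt;p&gt;A minimal Python sketch of that rule, under stated assumptions. The field names &lt;code&gt;threshold_check&lt;/code&gt;, &lt;code&gt;value&lt;/code&gt;, and &lt;code&gt;threshold&lt;/code&gt; are hypothetical and are not the archive's actual schema, and the sketch assumes a metric where higher is better. It shows only the shape of the decision: a badge is derived from a check that ran, never from the presence of a file or a copied grade.&lt;/p&gt;

```python
def badge_for(record):
    """Return 'PASS', 'FAIL', or 'DOCUMENT' for one archive record.

    All field names are hypothetical illustrations, not the real schema.
    """
    check = record.get("threshold_check")
    if check is None:
        # A report file or an imported grade alone never earns a badge.
        return "DOCUMENT"
    # A verdict needs a value from a real run and a declared threshold.
    # Assumption: higher is better for this metric.
    return "PASS" if check["value"] >= check["threshold"] else "FAIL"
```

&lt;p&gt;Under this sketch, a record that carries only &lt;code&gt;analysis_report.md&lt;/code&gt; or an imported grade comes back as &lt;code&gt;DOCUMENT&lt;/code&gt;, which is the distinction the rewritten headline makes between verification runs and supporting documents.&lt;/p&gt;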

&lt;p&gt;One thing to be clear about. This audit was a manual one-time read. The MICA validator did not catch the drift. We caught it by reading the records ourselves and asking what each one actually claimed. What MICA does now is preserve the lesson in the package contract and in the maintainers' workflow. &lt;/p&gt;

&lt;p&gt;The contract status the validator emits when everything lines up is called &lt;code&gt;CLOSED CONTRACT&lt;/code&gt;. That does not mean every semantic rule is automatically enforced by the validator itself. &lt;/p&gt;

&lt;p&gt;It means the package structure, declared layers, and DI bindings are coherent, and the maintainer is expected to run inside that contract before changing the archive.&lt;/p&gt;

&lt;p&gt;This is the kind of failure no CI gate or syntax check would catch. The math was correct. The framing was wrong. A markdown-only policy would have continued to allow it because every file would have parsed cleanly. The new rule holds up under that kind of pressure because the contract records both what the rule says and the specific incident that forced it to exist.&lt;/p&gt;

&lt;p&gt;The next scar hit a different part of the system. Not the math lane this time. The website itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Forgiveness Budget Scientific Archives Don't Have
&lt;/h2&gt;

&lt;p&gt;Most LLM-assisted writing operates on a forgiveness budget.&lt;/p&gt;

&lt;p&gt;A blog post can be slightly overstated. A README can describe something the code does not quite do yet. A pitch deck can round 73% up to "over 70%." The reader corrects internally. The next revision absorbs the drift. The social cost of small overclaiming is low.&lt;/p&gt;

&lt;p&gt;A scientific archive does not have that budget.&lt;/p&gt;

&lt;p&gt;This archive is published in a form meant to be cited. The Schwarzschild Omega value, the Erdős reproduction match percentage, and the EXP-031 fold metrics are all the kind of claims that can become downstream references. Drift that is harmless in a blog post becomes a poisoned citation in a downstream paper here.&lt;/p&gt;

&lt;p&gt;The model that helpfully rewrites a paragraph also helpfully invents a SMILES string (the text encoding chemists use for molecules) that looks chemically plausible. The agent that summarizes a build log will, if asked one too many times, invent a DOI. The same instinct that makes LLMs useful for prose makes them dangerous for an archive.&lt;/p&gt;

&lt;p&gt;The objects that have to survive this environment are the ones an LLM is least equipped to verify on its own. SMILES strings. DOIs. AlphaFold &lt;code&gt;pLDDT&lt;/code&gt; values (per-residue confidence scores for a fold). Numerical thresholds with physical meaning. Record-level provenance. An error in any of these slips past spell-check and past a continuous-integration pipeline.&lt;/p&gt;

&lt;p&gt;This is why the archive needs a gate that loads before any code is touched.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Failure That Forced the Cross-Lane Gate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcty06gcpd5th92mx7f3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcty06gcpd5th92mx7f3c.png" alt="The Failure That Forced the Cross-Lane Gate" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second scar was inside the website that displays the archive.&lt;/p&gt;

&lt;p&gt;The site used to ship a fallback copy of every record inside the JavaScript file that runs in the reader's browser (&lt;code&gt;js/portal.js&lt;/code&gt;). The original purpose was harmless. Some readers download the repository and open the homepage by double-clicking it, which uses the &lt;code&gt;file://&lt;/code&gt; URL scheme. Some browsers refuse to load separate JSON files over &lt;code&gt;file://&lt;/code&gt; for security reasons, so the fallback let the page render anyway. Two small functions held the fallback. One returned a copy of the dataset for a record. The other returned a copy of the human-readable report.&lt;/p&gt;

&lt;p&gt;The on-disk files kept changing. The inline copies inside the JavaScript did not. The drift grew quietly over weeks.&lt;/p&gt;

&lt;p&gt;A maintainer wrote a small drift-checking script and ran it. It compared every on-disk record with its inline twin. It found 151 places where the two copies disagreed. The most striking example was a record about the Erdős reproduction whose &lt;code&gt;schema_id&lt;/code&gt; field did not even share the same structure between its two copies. The AI maintainer had been editing the disk files. The website had been rendering the stale inline copies. Both sides looked fine internally. Neither side agreed with the other.&lt;/p&gt;
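
&lt;p&gt;The core of such a comparison can be sketched in a few lines of Python. This is an illustration of the technique only. It is not the archive's &lt;code&gt;check_fallback_drift.py&lt;/code&gt;, and the function name &lt;code&gt;find_drift&lt;/code&gt; is hypothetical. It walks two JSON-like values in parallel and reports every path where they disagree, including the case where one side is a nested object and the other is a plain value.&lt;/p&gt;

```python
def find_drift(on_disk, inline, path=""):
    """Yield the dotted path of every place where two JSON-like values disagree."""
    if isinstance(on_disk, dict) and isinstance(inline, dict):
        for key in sorted(set(on_disk) | set(inline)):
            child = f"{path}.{key}" if path else key
            if key not in on_disk or key not in inline:
                yield child  # key exists on one side only
            else:
                yield from find_drift(on_disk[key], inline[key], child)
    elif on_disk != inline:
        # Differing value, or differing structure (e.g. object vs string).
        yield path or "(root)"


# Hypothetical example data: the inline copy kept an older schema_id shape.
disk_copy = {"schema_id": {"name": "eqa", "version": 2}, "omega": 0.9985}
inline_copy = {"schema_id": "eqa-v1", "omega": 0.9985, "note": "stale"}
print(list(find_drift(disk_copy, inline_copy)))
```

&lt;p&gt;Summing the reported paths over every record pair gives a single mismatch count of the kind quoted above.&lt;/p&gt;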

&lt;p&gt;A &lt;code&gt;GOVERNANCE.md&lt;/code&gt; had said "single source of truth" the whole time. The maintainer agreed with the policy. The model also agreed. The policy lived in prose. Nothing in code enforced it.&lt;/p&gt;

&lt;p&gt;Same shape as the EQA framing audit. A human caught the drift, not the MICA validator. What MICA does now is preserve the new rule in the package contract and surrounding docs. The rule, in plain English, is: no inline copy of any record may ship inside the browser code. The two functions that used to return the inline copies were stripped out and replaced with stubs that return empty values. The stubs carry the history inline, so a future maintainer reading the file sees both the rule and the incident that forced it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getFallbackReportText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Removed (v1.13.1): inlined report-text fallback drifted from the on-disk .md&lt;/span&gt;
  &lt;span class="c1"&gt;// reports. Single source of truth = the on-disk files fetched above.&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getFallbackDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;runId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Removed (v1.13.1): inlined fallback datasets had drifted from the on-disk JSON&lt;/span&gt;
  &lt;span class="c1"&gt;// (151 schema/value mismatches found 2026-06-02 by check_fallback_drift.py).&lt;/span&gt;
  &lt;span class="c1"&gt;// Single source of truth = the on-disk evidence files fetched above. This ledger&lt;/span&gt;
  &lt;span class="c1"&gt;// must be served over HTTP (e.g. "python -m http.server"), not opened via file://.&lt;/span&gt;
  &lt;span class="c1"&gt;// Returns null so the inspector shows an honest load error rather than stale data.&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fix lives in three places. The archive's machine-readable contract carries the lesson. The browser code returns empty values where the inline copies used to live. The playbook (the human-facing operating guide) explains why those values are empty. &lt;/p&gt;

&lt;p&gt;A new maintainer joining the project sees the rule from all three angles. The validator confirms the package still loads as a coherent contract. The code refuses to render the old fallback because the function returns nothing. The playbook explains why a human should not put the fallback back in.&lt;/p&gt;

&lt;p&gt;This is the lesson that produced the title of the article. A memory gate that lives only in markdown is etiquette. The gate becomes structural the moment the contract, the code, and the playbook all point at each other.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What the Playbook Actually Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc32yrjrdcfx503ktif54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc32yrjrdcfx503ktif54.png" alt="What the Playbook Actually Does" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People ask why we ship a human-readable playbook (a long markdown file) if the contract is already a machine-readable file. The clearest answer is a short list of cheap failures the playbook actually prevented.&lt;/p&gt;

&lt;p&gt;A maintainer who reads only the machine-readable contract sees one rule: math must use an arbitrary-precision library at 200 bits or higher. That is precise. It is also blunt. It does not say &lt;em&gt;why&lt;/em&gt;. The first time the maintainer hits a math sub-case the contract did not specifically name, they may default to the standard 64-bit floating-point library. &lt;/p&gt;

&lt;p&gt;The playbook is where the original incident behind the rule lives. In our case, it was an early experiment in which 64-bit floats silently underflowed to zero in a class-field calculation and produced a meaningless result of &lt;code&gt;0&lt;/code&gt;. The playbook tells that story in plain English. A maintainer who has read it will not re-introduce the same bug in a new sub-case.&lt;/p&gt;
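
&lt;p&gt;The failure mode is easy to reproduce. The sketch below is not the original class-field calculation. The exponent &lt;code&gt;-800&lt;/code&gt; is an arbitrary value chosen to trigger the underflow, and Python's standard-library &lt;code&gt;decimal&lt;/code&gt; module stands in for whichever arbitrary-precision library the archive actually uses.&lt;/p&gt;

```python
import math
from decimal import Decimal, getcontext

# 200 bits is roughly 61 decimal digits (200 * log10(2) is about 60.2).
getcontext().prec = 61

x = -800  # hypothetical exponent, chosen only to trigger the underflow

as_float = math.exp(x)         # 64-bit float: underflows silently to 0.0
as_decimal = Decimal(x).exp()  # 61-digit decimal: tiny but nonzero

print(as_float)
print(as_decimal)
```

&lt;p&gt;The float result is exactly &lt;code&gt;0.0&lt;/code&gt;, and no exception or warning is raised. Any later step that multiplies by that value or takes its logarithm inherits the damage, which is why the rule names a precision floor rather than trusting the default.&lt;/p&gt;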

&lt;p&gt;An AI maintainer that starts work without loading the playbook will, when asked to fix a wrong score in a record, simply edit the JSON file that stores the score. The contract forbids this in one terse sentence. The playbook expands that sentence into a behavior rule. Never edit a record's data file after it has been committed. &lt;/p&gt;

&lt;p&gt;Instead, create a new record with a new ID and link the corrected record from the original one. A session that loaded the playbook reads that rule before any edit happens. A session that did not load it destroys the audit trail that lets a third party re-run the original computation.&lt;/p&gt;
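
&lt;p&gt;One way to picture that behavior rule in Python, as a sketch under assumptions. The ledger layout, the &lt;code&gt;supersedes&lt;/code&gt; field, and the separate &lt;code&gt;corrections&lt;/code&gt; index are all hypothetical. The article does not describe how the archive stores the link, only that the original data file must not change.&lt;/p&gt;

```python
import copy


def issue_correction(ledger, original_id, new_id, corrected_fields):
    """Add a corrected record under a new ID; never mutate the original."""
    if new_id in ledger["records"]:
        raise ValueError(f"{new_id} already exists; committed records are immutable")
    corrected = copy.deepcopy(ledger["records"][original_id])
    corrected.update(corrected_fields)
    corrected["supersedes"] = original_id
    ledger["records"][new_id] = corrected
    # The forward link lives in a separate index (an assumption of this sketch),
    # so the original record's data stays untouched and re-runnable.
    ledger["corrections"].setdefault(original_id, []).append(new_id)
    return corrected
```

&lt;p&gt;The original record is still there after the call, with its wrong score intact. A third party can re-run the original computation and see exactly what was corrected and when.&lt;/p&gt;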

&lt;p&gt;The third example happens at render time. The website uses a small classifier (a regular expression) to decide what kind of colored label sits next to each metric. When the classifier does not recognize a metric name, it returns nothing, and the metric renders without any colored label. &lt;/p&gt;

&lt;p&gt;The contract says what to do, in terse machine terms. The playbook documents the human procedure step by step: add the new metric to the glossary, decide what kind of evidence backs it, assign the matching label, then merge. &lt;/p&gt;

&lt;p&gt;Without the playbook, a record with a missing label might ship as if the missing label were on purpose.&lt;/p&gt;

&lt;p&gt;The playbook is not the rule. The contract is. The playbook is the briefing for the maintainer about to face the rule, and the record of the specific past failure each rule was written to prevent.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Where MICA Sits, and What It Refuses
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueathf9k2ta93ivxvkfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueathf9k2ta93ivxvkfp.png" alt="Where MICA Sits, and What It Refuses" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MICA is a small Python validator plus package format. In the workflow used here, the maintainer runs it at session start. The script reads the package contract first (a short YAML file). The contract names three other files. &lt;/p&gt;

&lt;p&gt;The first is the archive's machine-readable rule list. The second is the human-readable playbook. The third is the credibility document that says what kinds of internal scores may or may not appear on the public surface. The validator confirms those three layers exist and that the package shape is coherent.&lt;/p&gt;

&lt;p&gt;After loading, the script runs 11 simple structural checks against the package. Each check catches one specific kind of cheap failure before write work proceeds.&lt;/p&gt;

&lt;p&gt;The first group of checks refuses a half-formed package. The script asks whether the contract declares the required shape fields (&lt;code&gt;mica_spec&lt;/code&gt;, &lt;code&gt;mode&lt;/code&gt;, &lt;code&gt;layers&lt;/code&gt;), whether the archive and playbook layers exist, and whether the mode/layer combination is coherent. The package is unusable until those fields line up.&lt;/p&gt;

&lt;p&gt;The second group refuses drift between what the contract says and what the file system actually holds. The script asks whether every file the contract names exists on disk. A check here fails when a file was renamed in one place and not the other. This is the same shape of failure as the website's inline-fallback drift from the second scar, but caught much earlier.&lt;/p&gt;

&lt;p&gt;The third group refuses critical rules that have no accountability behind them. Every critical archive rule is supposed to carry a short note naming the incident that forced the rule. The script asks whether &lt;code&gt;binding.origin_episode&lt;/code&gt; is filled in for every critical rule. &lt;/p&gt;

&lt;p&gt;A check here fails when a rule was written as a top-down policy with no recorded cost behind it. Rules like that are easy for a maintainer or an AI maintainer to rationalize past in the moment. A rule that names what was paid the last time it was missing is much harder to ignore.&lt;/p&gt;

&lt;p&gt;The fourth group refuses stale package references. The script can check whether any declared &lt;code&gt;binding.lesson_ref&lt;/code&gt; paths still resolve. A broken cross-reference is how a rule slowly becomes etiquette.&lt;/p&gt;

&lt;p&gt;The sequence at session start looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw26km4xvdfmzrbbbdrch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw26km4xvdfmzrbbbdrch.png" alt="mermaid" width="800" height="1000"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If every check passes, the script emits the status &lt;code&gt;CLOSED CONTRACT&lt;/code&gt;. If a hard-fail check trips, it emits &lt;code&gt;INCOMPLETE&lt;/code&gt;. In the workflow used here, the maintainer fixes that state before any code change happens.&lt;/p&gt;
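
&lt;p&gt;A compressed Python sketch of the four check groups, to make the control flow concrete. This is not the MICA validator's code. The real script runs 11 checks and distinguishes hard-fail conditions, while this sketch shows four and treats every failure as hard. The rule shape (&lt;code&gt;id&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, a nested &lt;code&gt;binding&lt;/code&gt; dict) and the &lt;code&gt;file_exists&lt;/code&gt; callable are assumptions made for illustration. The contract is assumed to be already parsed into a dict.&lt;/p&gt;

```python
REQUIRED_FIELDS = ("mica_spec", "mode", "layers")


def validate_package(contract, rules, file_exists):
    """Return (status, failures) for an already-parsed contract.

    contract: dict parsed from the package contract
    rules: list of rule dicts from the archive layer (hypothetical shape)
    file_exists: callable taking a path string and returning a bool
    """
    failures = []

    # Group 1: refuse a half-formed package.
    for field in REQUIRED_FIELDS:
        if field not in contract:
            failures.append(f"missing field: {field}")

    # Group 2: every file the contract names must exist on disk.
    for name, path in contract.get("layers", {}).items():
        if not file_exists(path):
            failures.append(f"layer '{name}' not found: {path}")

    for rule in rules:
        binding = rule.get("binding", {})
        # Group 3: critical rules must name the incident behind them.
        if rule.get("severity") == "critical" and not binding.get("origin_episode"):
            failures.append(f"rule {rule.get('id')} has no origin_episode")
        # Group 4: declared lesson references must still resolve.
        ref = binding.get("lesson_ref")
        if ref and not file_exists(ref):
            failures.append(f"rule {rule.get('id')} has a stale lesson_ref: {ref}")

    return ("INCOMPLETE" if failures else "CLOSED CONTRACT"), failures
```

&lt;p&gt;Passing &lt;code&gt;file_exists&lt;/code&gt; in as a parameter keeps the sketch testable without touching the file system. In a real session it could simply be &lt;code&gt;os.path.exists&lt;/code&gt;.&lt;/p&gt;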

&lt;p&gt;This is what we mean by a gate that is meant to run before any code is touched. It is not just a policy hope. It is a small Python validator with explicit hard-fail conditions.&lt;/p&gt;

&lt;p&gt;The 28 archive rules do the same thing at the per-record level. We did not aim for 28. The number grew as incidents forced new rules. Every rule carries a short note pointing at the incident that produced it. The list is not a top-down policy. It is an accumulated record of past failures the team agreed not to repeat.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. One Bad BAV Card, Step by Step
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo0tpcusroziy7eh5goh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgo0tpcusroziy7eh5goh.png" alt="One Bad BAV Card, Step by Step" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note before the walk-through. This scenario is a constructed illustration, not a documented incident. The protein-folding cards on the archive today are all well-formed. The point of stepping through it is to show the refusal sequence at the granularity a peer reviewer can check, not to claim that a refusal of this exact shape has been logged in production.&lt;/p&gt;

&lt;p&gt;To make the gate concrete, here is one fabrication the contract refuses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1.&lt;/strong&gt; An AI maintainer is asked to add a new protein-folding card. There is no real protein sequence on disk to fold, but the model knows the file format the lane expects. It writes a record at &lt;code&gt;bav/exp-035/reference_run.json&lt;/code&gt; that looks like a real fold result. The file carries &lt;code&gt;pTM = 0.78&lt;/code&gt;, &lt;code&gt;pLDDT_mean = 84.2&lt;/code&gt;, &lt;code&gt;PAE = 4.3 Å&lt;/code&gt;. These numbers fall inside the same range as the only real fold on the archive (&lt;code&gt;EXP-031&lt;/code&gt;), so they pass a casual eye-test.&lt;/p&gt;

&lt;p&gt;Without the gate, the rest follows naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2.&lt;/strong&gt; The website's small classifier reads the metric name &lt;code&gt;pLDDT_mean&lt;/code&gt;. It matches the pattern for an externally-defined fold metric (AlphaFold defines pLDDT, so the website treats anything named that way as borrowed from outside, and therefore checkable by a third party). The card renders with a green "verifiable" badge. The classifier is just a short regular expression. Here is what it does:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;provClassOf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;label&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;label&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;plddt|&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;pae&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;|ptm|contact|brier|&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;auc&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;|&lt;/span&gt;&lt;span class="se"&gt;\b&lt;/span&gt;&lt;span class="sr"&gt;ece&lt;/span&gt;&lt;span class="se"&gt;\b)&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;EXTERNAL&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;p_e2e|e2e|capture|transfer&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;DERIVED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;sr9|di2|sidrce|coherence|spar|nnsl|resonance|drift|omega|ω&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ADVISORY-HEURISTIC&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// incidental values (counts, dates, grades) carry no badge&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The label &lt;code&gt;pLDDT_mean&lt;/code&gt; contains the string &lt;code&gt;plddt&lt;/code&gt;, so the first pattern matches. The function returns &lt;code&gt;EXTERNAL&lt;/code&gt;. The badge turns green. The regular expression has no way to check whether the number behind the label came from a real fold or from a fabrication.&lt;/p&gt;
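
&lt;p&gt;The blind spot is easier to see when the classifier is called with values attached. The following is a Python port of the JavaScript above, written for illustration only. The site itself runs the JavaScript version, and the name &lt;code&gt;prov_class_of&lt;/code&gt; is this sketch's own.&lt;/p&gt;

```python
import re

# Patterns copied from the JavaScript classifier; first match wins.
PATTERNS = (
    (re.compile(r"(plddt|\bpae\b|ptm|contact|brier|\bauc\b|\bece\b)"), "EXTERNAL"),
    (re.compile(r"(p_e2e|e2e|capture|transfer)"), "DERIVED"),
    (re.compile(r"(sr9|di2|sidrce|coherence|spar|nnsl|resonance|drift|omega|ω)"), "ADVISORY-HEURISTIC"),
)


def prov_class_of(label):
    """Classify a metric by its name only; the value is never consulted."""
    s = "" if label is None else str(label).lower()
    for pattern, badge in PATTERNS:
        if pattern.search(s):
            return badge
    return None  # incidental values carry no badge


# A plausible value, an impossible value, and a misspelled label.
for label, value in (("pLDDT_mean", 84.2), ("pLDDT_mean", -1.0), ("pLDT_mean", 84.2)):
    print(label, value, prov_class_of(label))
```

&lt;p&gt;The first two calls both return &lt;code&gt;EXTERNAL&lt;/code&gt;, even though a pLDDT of -1.0 is outside the metric's 0 to 100 range. The third returns no badge at all because of a one-letter typo. The function signature makes the limitation plain: the value never enters the decision.&lt;/p&gt;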

&lt;p&gt;&lt;strong&gt;Step 3.&lt;/strong&gt; A reader trusts the green badge. The value gets cited in a manuscript. A wet lab spends real money chasing a fold that was never run.&lt;/p&gt;

&lt;p&gt;With the gate, the chain breaks at step 1.&lt;/p&gt;

&lt;p&gt;The archive carries a rule that says a fold card can only claim re-runnable status if it ships two specific files alongside the result: a &lt;code&gt;.fasta&lt;/code&gt; file containing the protein sequence that was folded, and a small JSON file naming the model version and the random seed used. &lt;/p&gt;

&lt;p&gt;In this repository, that rule lives in the contract and in the surrounding spec, and the maintainer is expected to check it before publishing the card. If &lt;code&gt;bav/exp-035&lt;/code&gt; had no real input sequence, it could not honestly ship as a re-runnable fold. At most it would ship as non-re-runnable or stay unpublished. The reader would see the honest label.&lt;/p&gt;
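
&lt;p&gt;For illustration, here is how that maintainer check could look if it were written down as Python. This is a sketch, not something the archive ships. The key names &lt;code&gt;model_version&lt;/code&gt; and &lt;code&gt;seed&lt;/code&gt; and the status strings are assumptions. The function checks only that the required inputs are present. It cannot tell whether the sequence is real.&lt;/p&gt;

```python
def fold_card_status(filenames, run_config):
    """Classify a fold card from the files it ships (names are hypothetical).

    filenames: names of the files in the card's directory
    run_config: parsed JSON naming the model version and seed, or None
    """
    has_sequence = any(name.endswith(".fasta") for name in filenames)
    has_run_config = bool(run_config) and all(
        key in run_config for key in ("model_version", "seed")
    )
    if has_sequence and has_run_config:
        return "RE-RUNNABLE"
    return "NOT RE-RUNNABLE"
```

&lt;p&gt;Applied to the constructed &lt;code&gt;bav/exp-035&lt;/code&gt; case, which ships &lt;code&gt;reference_run.json&lt;/code&gt; and nothing else, the result is &lt;code&gt;NOT RE-RUNNABLE&lt;/code&gt;.&lt;/p&gt;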

&lt;p&gt;A standalone &lt;code&gt;GOVERNANCE.md&lt;/code&gt; would not have stopped step 1. A YAML config without a validator would not have noticed the missing input file. An agent system prompt would have been compressed away under context pressure. A CI check would have run too late, after the fabrication was already on the public surface.&lt;/p&gt;

&lt;p&gt;The contract, the playbook, and the validator reduce the chance of that chain because all three point at each other, and because the workflow checks the contract before the file is allowed to settle into the public archive.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What This Pipeline Cannot Block
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb88e7ztwvf9yo8sbn4h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb88e7ztwvf9yo8sbn4h9.png" alt="What This Pipeline Cannot Block" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A pipeline that pretends to catch everything is the failure mode it was built to prevent.&lt;/p&gt;

&lt;p&gt;Five things still slip past every layer above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A plausible fabricated value inside the normal range.&lt;/strong&gt; A fake &lt;code&gt;pLDDT&lt;/code&gt; of 78.4 looks like a real one. The website's classifier labels it as externally-defined and the green badge appears. Only a third party re-running the fold catches the fabrication. This is why only &lt;code&gt;EXP-031&lt;/code&gt; ships the full re-run scaffold, and the other five active BAV cards do not claim independent re-runnability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A new promotional pattern outside the word list.&lt;/strong&gt; A small filter watches the public pages for 14 superlative terms such as &lt;code&gt;revolutionary&lt;/code&gt; and &lt;code&gt;breakthrough&lt;/code&gt;. A maintainer who writes something like "a novel adaptive coherence framework" defeats every entry on the list. The list is a floor, not a ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A fabrication marker silently removed.&lt;/strong&gt; A separate filter looks for a literal &lt;code&gt;[synthetic]&lt;/code&gt; tag in shipped files, the kind of marker a developer might leave on placeholder data. The filter only fires when the tag is present. A maintainer who deletes the tag while keeping the fabricated content underneath passes the filter cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A real DOI pointing at the wrong paper.&lt;/strong&gt; Nothing in the pipeline fetches DOIs. A real URL pointing to a real but unrelated paper is invisible to the validator. Peer review is the only check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A correct computation framed as the wrong thing.&lt;/strong&gt; This was the failure that produced the framing rule in the math lane. The engine outputs were real. The headline treated the mere presence of a report file as a fresh &lt;code&gt;PASS&lt;/code&gt;. The fix was structural, but the same shape of error can reappear in any new lane.&lt;/p&gt;
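
&lt;p&gt;The second and third gaps are properties of how literal-match filters work, and a short Python sketch shows why. This is not the archive's filter code. Only the two superlative terms named above are included, because the other twelve are not listed in this article, and the function name &lt;code&gt;flag_text&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import re

# Only two of the 14 terms are named in this article; the rest are omitted.
SUPERLATIVES = ("revolutionary", "breakthrough")
SYNTHETIC_TAG = "[synthetic]"


def flag_text(text):
    """Return the list of cheap-slop flags raised by one piece of public text."""
    flags = []
    lowered = text.lower()
    for term in SUPERLATIVES:
        # Whole-word match, so the filter fires only on the listed terms.
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            flags.append(f"superlative: {term}")
    if SYNTHETIC_TAG in lowered:
        flags.append("fabrication marker present")
    return flags
```

&lt;p&gt;The phrase "a novel adaptive coherence framework" returns an empty list, and so does fabricated content once its &lt;code&gt;[synthetic]&lt;/code&gt; tag has been deleted. Both filters test for the presence of a string. Neither can test for the absence of a problem.&lt;/p&gt;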

&lt;p&gt;The honest claim is narrow. MICA makes cheap slop expensive enough to catch. It does not make expensive slop catchable.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What We Learned, What We Did Not Solve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6267suu69dz3bpj4kms8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6267suu69dz3bpj4kms8.png" alt="shifting human attention to what actually matters" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive is small by industry standards. 56 math records. 34 biomolecular-validation experiments (6 active cards and a 26-entry foundational archive). 2 compliance audits. Around 90 experiments. Past 300 files in total.&lt;/p&gt;

&lt;p&gt;That scale was large enough to teach us four things.&lt;/p&gt;

&lt;p&gt;First, a markdown policy alone does not survive an AI maintainer. The 151-mismatch drift proved it. The policy was correct. Nothing in code enforced it.&lt;/p&gt;

&lt;p&gt;Second, the rule list is where the policy actually lives. The playbook is the human reading layer. The validator is the structural gate. The workflow is the enforcement surface. Together, they form one operating contract.&lt;/p&gt;

&lt;p&gt;Third, the gate works best when it runs before any code is touched. PR-time checks are necessary, but not sufficient. By the time cheap slop reaches the PR surface, the maintainer is already reviewing content that should have been constrained earlier.&lt;/p&gt;

&lt;p&gt;Fourth, the pipeline only refuses cheap slop. It does not verify molecules, fold real proteins, or check that a DOI links to the paper it claims to. That work stays external. The pipeline buys reviewer time so the reviewer can do that external work on the few claims that genuinely need it.&lt;/p&gt;

&lt;p&gt;What we did not solve.&lt;/p&gt;

&lt;p&gt;The website's metric classifier still misclassifies on a typo. A maintainer who writes &lt;code&gt;pLDT&lt;/code&gt; instead of &lt;code&gt;pLDDT&lt;/code&gt; ships a card with no colored badge at all. Nothing automated catches it. Reading the PR diff before merge is the only safety net.&lt;/p&gt;

&lt;p&gt;The fabrication-marker filter is bypassable. Anyone who knows the &lt;code&gt;[synthetic]&lt;/code&gt; tag is there can delete it, and the underlying content goes through.&lt;/p&gt;

&lt;p&gt;DOIs are not fetched. A real URL to a real but unrelated paper passes every layer.&lt;/p&gt;

&lt;p&gt;And the article has not shown a logged production refusal by the MICA validator itself. The validator's refusal logic is exercised by a few test fixtures inside the MICA repository (small example packages deliberately broken in specific ways). &lt;/p&gt;

&lt;p&gt;The fixtures prove the mechanism works as designed. They do not prove that the gate has fired in production on this archive yet. The two real incidents in this article were both caught by human attention. The framing drift in the math lane was caught by a one-time read. &lt;/p&gt;

&lt;p&gt;The website's fallback drift was caught by a small script a maintainer ran. The rule list records both lessons. The next time either shape returns, the contract now names the failure pattern and creates a refusal point where automation, workflow, or human review can tighten around it. We do not yet have a refusal log entry to point at.&lt;/p&gt;

&lt;p&gt;The pattern across these gaps is the same. Cheap slop is refused upstream. Expensive slop is left for the maintainer's reading and for peer review.&lt;/p&gt;

&lt;p&gt;The maintainer has finite attention. Every minute spent catching a fabricated &lt;code&gt;pTM = 0.78&lt;/code&gt; is a minute not spent reading the molecule, the protocol, or the citation that actually needs human judgment. &lt;/p&gt;

&lt;p&gt;Session-start refusal exists to move the cheap failures upstream so the saved attention can land on the expensive ones. The contract does not pretend to verify the world. It frees the maintainer to verify the parts that matter most.&lt;/p&gt;




&lt;p&gt;This article was the practical side of what Parts 6 and 7 set up as a session-start contract. The next part of the MICA series will return to the framework side.&lt;/p&gt;

&lt;p&gt;The reproduction handle for the strongest record on the ledger is short:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction
&lt;span class="nb"&gt;cd &lt;/span&gt;openai-erdos-eq22-reproduction
python &lt;span class="nt"&gt;-m&lt;/span&gt; pytest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Treat any number on the ledger as a number to verify, not a number to trust.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Making Equation (2.2) of the OpenAI Erdős Result Executable</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 26 May 2026 06:37:10 +0000</pubDate>
      <link>https://dev.to/flamehaven01/making-equation-22-of-the-openai-erdos-result-executable-ml7</link>
      <guid>https://dev.to/flamehaven01/making-equation-22-of-the-openai-erdos-result-executable-ml7</guid>
      <description>&lt;h2&gt;
  
  
  Why a proved theorem still needs reproducible claim custody
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2108443327152872531" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2108443327152872531" alt="open ai" width="900" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On May 20, 2026, &lt;a href="https://openai.com/index/model-disproves-discrete-geometry-conjecture/" rel="noopener noreferrer"&gt;OpenAI announced&lt;/a&gt; that an internal reasoning model had produced a counterexample to the Erdős planar unit-distance conjecture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The problem is easy to state: given $n$ points in the plane, how many pairs of points can be exactly distance $1$ apart?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For nearly eighty years, the prevailing expectation was that square-grid-type constructions were essentially optimal up to a slowly growing exponent. OpenAI’s announcement changed that. Its internal reasoning model produced an infinite family of examples giving a polynomial improvement, and the proof was checked and written up in mathematical form by external mathematicians.&lt;/p&gt;

&lt;p&gt;In this article, “the remarks paper” refers to the companion PDF by Alon, Bloom, Gowers, Litt, Sawin, Shankar, Tsimerman, Wang, and Matchett Wood, linked from OpenAI’s announcement.&lt;/p&gt;

&lt;p&gt;The proof-level result belongs to those authors and the source papers.&lt;/p&gt;

&lt;p&gt;My focus here is narrower: equation (2.2) in that remarks paper, and whether its explicit numerical value can be reproduced as executable code.&lt;/p&gt;

&lt;p&gt;This is not about proving the theorem again. It is about what happens after a theorem contains a fragile numerical claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  The proof is not the artifact
&lt;/h2&gt;

&lt;p&gt;A mathematical proof and a software artifact do different jobs.&lt;/p&gt;

&lt;p&gt;The proof establishes the theorem. It gives the definitions, the argument, the dependencies, and the mathematical reason why the result holds.&lt;/p&gt;

&lt;p&gt;A software artifact should not pretend to replace that.&lt;/p&gt;

&lt;p&gt;But some claims inside a mathematical paper have a finite, numerical, or computationally checkable surface. Those claims can be preserved differently. They can be run. They can be tested. They can fail when precision is wrong.&lt;/p&gt;

&lt;p&gt;That is the narrow role of an executable reproduction artifact: not proof replacement, not automated peer review, and not authority over the theorem, but a reproducible object for the part of the claim that can be computed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The specific target: equation (2.2)
&lt;/h2&gt;

&lt;p&gt;In the OpenAI Erdős result, one checkable surface is equation (2.2) of the remarks paper.&lt;/p&gt;

&lt;p&gt;For the explicit choice&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7138879423288234316" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7138879423288234316" alt="math1" width="606" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;the remarks paper gives an explicit numerical lower bound on the exponent excess above the classical Erdős exponent:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13849924454096937923" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13849924454096937923" alt="math2" width="841" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These parameters are taken directly from the remarks paper without modification. The artifact does not derive the multiquadratic choice; it reproduces the finite numerical calculation built from that choice.&lt;/p&gt;

&lt;p&gt;This is not the later stronger explicit bound associated with Sawin’s separate preprint. It is not $\delta \approx 0.014$. It is the numerical value appearing in equation (2.2) of the remarks paper.&lt;/p&gt;

&lt;p&gt;That narrowness is important. It is exactly what makes the claim suitable for executable reproduction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where the numerical fragility comes from
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D4133600104991436468" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D4133600104991436468" alt="4" width="900" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The numerical fragility comes from the exact form of equation (2.2), not from a large computation.&lt;/p&gt;

&lt;p&gt;The parameters stated immediately after the published expression are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7110299839676694670" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D7110299839676694670" alt="math3" width="754" height="43"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D483384573840666881" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D483384573840666881" alt="math 4" width="772" height="51"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With the paper’s definitions of $u$, $v$, and $\delta$ substituted into equation (2.2), the exponent excess reduces to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2715587953765822422" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D2715587953765822422" alt="math5" width="752" height="80"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The constant $36$ is not introduced by the implementation. It is already present in the remarks paper’s equation (2.2), both in the numerator term $u\pi/(36v)$ and in the denominator term $\log(36/\delta^2)$.&lt;/p&gt;

&lt;p&gt;After substituting $u = K/r^2$, $v = r/2$, and $\delta = 101^{-2K}$, the numerator simplifies to $\log(K\pi / (18r^3))$, because $u\pi/(36v) = K\pi/(18r^3)$, while the denominator becomes $\log 36 + 4K \log 101$, because $\delta^{-2} = 101^{4K}$.&lt;/p&gt;

&lt;p&gt;Here the $101$ comes from the finite prime in $S = \{101, \infty\}$.&lt;/p&gt;

&lt;p&gt;In other words, this artifact does not derive the constant $36$ from first principles; it reproduces the published equation with the stated substitutions.&lt;/p&gt;

&lt;p&gt;The precision problem is in the numerator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D11575553626952662327" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D11575553626952662327" alt="math 7" width="254" height="53"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because $K$ is the ceiling of $18r^3 / \pi$, the ratio $K\pi / (18r^3)$ is only barely larger than $1$.&lt;/p&gt;

&lt;p&gt;More precisely:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D10827487014404388139" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D10827487014404388139" alt="math8" width="339" height="89"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For $r = 510510$,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13091608971449808775" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D13091608971449808775" alt="math 9" width="255" height="74"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the numerator is effectively $\log(1 + \varepsilon)$ with $\varepsilon$ at the $10^{-18}$ scale.&lt;/p&gt;

&lt;p&gt;IEEE 754 double precision has machine epsilon around $2.2 \times 10^{-16}$. A naive &lt;code&gt;float64&lt;/code&gt; computation therefore cannot reliably distinguish the near-one ratio from $1$. The ratio rounds to $1$, leading to $\log(1) = 0$.&lt;/p&gt;

&lt;p&gt;The exponent excess disappears before the computation reaches the value stated in the paper.&lt;/p&gt;

&lt;p&gt;This is not a flaw in the mathematics. It is a precision failure in the numerical evaluation of a valid expression. That is the reason the artifact evaluates equation (2.2) using &lt;code&gt;mpmath&lt;/code&gt; at 200-bit precision.&lt;/p&gt;

&lt;p&gt;A PDF can state the value. A verifier can expose when the value disappears.&lt;/p&gt;
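&lt;p&gt;The collapse described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: it swaps &lt;code&gt;mpmath&lt;/code&gt; for Python's standard-library &lt;code&gt;decimal&lt;/code&gt; module at an assumed 80 decimal digits (comfortably more than 200 bits), and the function names &lt;code&gt;decimal_pi&lt;/code&gt;, &lt;code&gt;excess_float64&lt;/code&gt;, and &lt;code&gt;excess_decimal&lt;/code&gt; are hypothetical. It evaluates the simplified form $\log(K\pi / (18r^3)) / (\log 36 + 4K \log 101)$ with $r = 510510$ both ways.&lt;/p&gt;

```python
import math
from decimal import Decimal, ROUND_CEILING, getcontext, localcontext

R = 510510  # r, as stated in the article
P = 101     # the finite prime in S


def decimal_pi() -> Decimal:
    """Pi at the active context precision (series recipe from the decimal docs)."""
    getcontext().prec += 2  # two guard digits for the intermediate steps
    three = Decimal(3)
    lasts, t, s, n, na, d, da = 0, three, 3, 1, 0, 0, 24
    while s != lasts:
        lasts = s
        n, na = n + na, na + 8
        d, da = d + da, da + 32
        t = (t * n) / d
        s += t
    getcontext().prec -= 2
    return +s  # unary plus rounds to the restored precision


def excess_float64() -> float:
    """Naive IEEE 754 double evaluation: the near-one ratio is not resolved."""
    k = math.ceil(18 * R**3 / math.pi)
    ratio = k * math.pi / (18 * R**3)
    return math.log(ratio) / (math.log(36) + 4 * k * math.log(P))


def excess_decimal(digits: int = 80) -> Decimal:
    """Same expression with enough digits to keep the 1e-18 scale epsilon."""
    with localcontext() as ctx:
        ctx.prec = digits  # assumption: 80 digits, not the repository's setting
        pi = decimal_pi()
        n = Decimal(18 * R**3)
        k = int((n / pi).to_integral_value(rounding=ROUND_CEILING))
        ratio = Decimal(k) * pi / n
        return ratio.ln() / (Decimal(36).ln() + 4 * k * Decimal(P).ln())


if __name__ == "__main__":
    print("float64 :", excess_float64())
    print("decimal :", format(excess_decimal(), ".4e"))
```

&lt;p&gt;Under these assumptions the high-precision path should agree with the $6.2391 \times 10^{-38}$ figure reported by the repository, while the &lt;code&gt;float64&lt;/code&gt; path returns zero or rounding noise that is orders of magnitude away from it.&lt;/p&gt;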




&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9321543817991300315" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9321543817991300315" alt="last" width="900" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We built:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction" rel="noopener noreferrer"&gt;https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The purpose is deliberately narrow: reproduce the finite, explicitly checkable numerical surface of equation (2.2) in the OpenAI Erdős unit-distance disproof remarks.&lt;/p&gt;

&lt;p&gt;The package evaluates the expression using &lt;code&gt;mpmath&lt;/code&gt; at 200-bit precision and returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6.2391e-38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matches the published three-significant-figure value $\approx 6.24 \times 10^{-38}$ to $1.4 \times 10^{-4}$ relative error.&lt;/p&gt;

&lt;p&gt;The repository includes 60 unit tests, 21 verifier checks, a frozen per-source-file SHA-256 manifest, GitHub Actions CI across Ubuntu and Windows, Python 3.11 / 3.12 verification, and a frozen-report mode that prints a verdict without mutating tracked evidence.&lt;/p&gt;
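&lt;p&gt;For readers unfamiliar with the idea, a per-file SHA-256 manifest check can be sketched as follows. This is an illustration of the general technique only. The repository's actual manifest format and tooling are not shown in this article, so the function names &lt;code&gt;build_manifest&lt;/code&gt; and &lt;code&gt;drifted_files&lt;/code&gt; and the &lt;code&gt;*.py&lt;/code&gt; pattern are assumptions.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of one file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def build_manifest(root: Path, pattern: str = "*.py") -> dict[str, str]:
    """Map each matching file (path relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob(pattern))
        if p.is_file()
    }


def drifted_files(root: Path, frozen: dict[str, str], pattern: str = "*.py") -> list[str]:
    """Names whose digest differs from the frozen manifest, or that were added or removed."""
    current = build_manifest(root, pattern)
    names = sorted(set(frozen) | set(current))
    return [name for name in names if frozen.get(name) != current.get(name)]
```

&lt;p&gt;A verifier built this way can report drift by comparing against the frozen mapping without ever rewriting it, which is the property a frozen-report mode needs.&lt;/p&gt;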

&lt;p&gt;The basic reproduction path is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction
&lt;span class="nb"&gt;cd &lt;/span&gt;openai-erdos-eq22-reproduction
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;".[dev]"&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; erdos_ant.verify
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output includes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Verdict: PASS
Checks: 21/21 passed
eq (2.2) exponent excess: 6.2391e-38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a large system. That is part of the point. A small claim with a clear boundary is easier to inspect than a broad claim that blurs proof, computation, and interpretation.&lt;/p&gt;




&lt;h2&gt;
  
  
  From reproduction to custody
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9085427059880693022" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9085427059880693022" alt="2" width="900" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This repository was not built as a one-off reaction to an OpenAI announcement. We are not announcing a grand framework here; we are showing the discipline in miniature.&lt;/p&gt;

&lt;p&gt;For us, the work is part of a longer routine: take a mathematical or technical claim, isolate the checkable surface, pin the environment, and make drift visible.&lt;/p&gt;

&lt;p&gt;That is intentionally plain work.&lt;/p&gt;

&lt;p&gt;Read the source.&lt;/p&gt;

&lt;p&gt;Extract the claim.&lt;/p&gt;

&lt;p&gt;Reproduce the computation.&lt;/p&gt;

&lt;p&gt;Record the boundary.&lt;/p&gt;

&lt;p&gt;Let the verifier fail if the result disappears.&lt;/p&gt;

&lt;p&gt;To execute this routine reliably, the scope must be uncomfortably narrow. This repository intentionally leaves the proof of Theorem 1.1, the construction of the infinite tower, and Sawin’s separate $\delta \approx 0.014$ preprint to their respective sources. It does not pretend to be peer review.&lt;/p&gt;

&lt;p&gt;This is not just a disclaimer. It is the point of the artifact.&lt;/p&gt;

&lt;p&gt;A sharp, restricted boundary is exactly what makes a claim inspectable, repeatable, and challengeable. This is what I mean here by claim custody.&lt;/p&gt;

&lt;p&gt;It addresses a technical governance question, but not in the policy sense: what exactly is being trusted, from which source, and what makes the claim fail if the implementation changes?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A PDF can state the value. A verifier can expose when the value disappears.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We claim no authority over the broader theorem. We simply maintain a reproducible boundary around the fragile numerical claim inside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9984717360298612367" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcoderlegion.com%2F%3Fqa%3Dblob%26qa_blobid%3D9984717360298612367" alt="repo" width="900" height="579"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The theorem was proved in the mathematical papers.&lt;/p&gt;

&lt;p&gt;This repository asks a smaller question: can the numerical value in equation (2.2) survive execution?&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;float64&lt;/code&gt;, it does not. The exponent excess collapses to zero.&lt;/p&gt;

&lt;p&gt;At 200-bit precision, with the source parameters pinned and the verifier running under CI, the artifact recovers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;6.2391e-38
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;matching the published value to $1.4 \times 10^{-4}$ relative error.&lt;/p&gt;

&lt;p&gt;That is the point.&lt;/p&gt;

&lt;p&gt;Not a new theorem. Not a proof replacement.&lt;/p&gt;

&lt;p&gt;A reproducible claim surface for one precision-sensitive number in a major AI-assisted mathematical result.&lt;/p&gt;

&lt;p&gt;Repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction" rel="noopener noreferrer"&gt;https://github.com/Flamehaven-Labs/openai-erdos-eq22-reproduction&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Paper / Zenodo:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://doi.org/10.5281/zenodo.20383217" rel="noopener noreferrer"&gt;https://doi.org/10.5281/zenodo.20383217&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mathematics</category>
      <category>python</category>
      <category>openscience</category>
      <category>openai</category>
    </item>
    <item>
      <title>The README Was a Protocol. The Entrypoint Was Still Optional.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 21 May 2026 10:34:02 +0000</pubDate>
      <link>https://dev.to/flamehaven01/the-readme-was-a-protocol-the-entrypoint-was-still-optional-57hj</link>
      <guid>https://dev.to/flamehaven01/the-readme-was-a-protocol-the-entrypoint-was-still-optional-57hj</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3k7jz1voscq51d9kuu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3k7jz1voscq51d9kuu7.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation Hierarchy&lt;/strong&gt;: The operational ladder — &lt;code&gt;natural&lt;/code&gt;, &lt;code&gt;guided&lt;/code&gt;, &lt;code&gt;forced&lt;/code&gt; — that determines how MICA actually reaches a live session.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Activation Packet&lt;/strong&gt;: The compiled session-start object that declares read targets, load state, self-test posture, drift status, and gate outcome.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report&lt;/strong&gt;: The structured opening output that declares what was loaded, what the self-test found, and whether the session gate is open.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;README-as-Protocol&lt;/strong&gt;: The pattern where the model's natural tendency to read the README first is formalized as a declared invocation mechanism. Introduced in v0.1.8.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Where Part 6 Left Off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7"&gt;Part 6&lt;/a&gt; showed what MICA looks like inside a single maintenance agent — session report, drift detection, design invariants, deviation log. The structure held. The protocol ran.&lt;/p&gt;

&lt;p&gt;Part 6 ended with a harder question: &lt;strong&gt;what happens when accumulated session knowledge needs to govern the next session — inside a tool that runs within AI workflows itself?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer depends on a prior question: does the next session actually load what was accumulated?&lt;/p&gt;

&lt;p&gt;That is not a schema problem. It is an entrypoint problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Gap README-as-Protocol Left Open
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg14swus4yof9adum02mc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg14swus4yof9adum02mc.png" alt="The Entrypoint Gap" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9"&gt;Part 4&lt;/a&gt; made a specific assumption: in many repository-based AI workflows, the README is already the model's first orientation surface.&lt;/p&gt;

&lt;p&gt;That observation became README-as-Protocol.&lt;/p&gt;

&lt;p&gt;Instead of inventing a new installation mechanism, MICA formalized an existing behavior: the model reads the README, the README points to the archive, and the session is expected to load context, run checks, and report readiness before work begins.&lt;/p&gt;

&lt;p&gt;That assumption was useful.&lt;/p&gt;

&lt;p&gt;It gave MICA a path into the session without requiring plugins, services, or custom host infrastructure.&lt;/p&gt;

&lt;p&gt;But a protocol is not an entrypoint.&lt;/p&gt;

&lt;p&gt;The README can declare where the archive is, what invariants matter, what the session report must contain. None of that guarantees sequencing. A model can still skim the README, jump directly into code, or begin work before declaring its load state.&lt;/p&gt;

&lt;p&gt;A gate without a consequence is still only etiquette.&lt;/p&gt;

&lt;p&gt;That is the gap this version had to close.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Answer: An Invocation Hierarchy
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mlovwluvuidh83cfzb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7mlovwluvuidh83cfzb9.png" alt="The Activation Spectrum" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MICA does not auto-invoke by magic. If no human, host, wrapper, or launcher calls the memory contract, the archive can exist without governing anything. This is the same truth Part 2 identified: the structure can exist, and the model can still have no reliable way to know it exists.&lt;/p&gt;

&lt;p&gt;The answer is an explicit hierarchy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural&lt;/strong&gt; — the model reads the project surface voluntarily: README, &lt;code&gt;mica.yaml&lt;/code&gt;, archive JSON, playbook. No intervention required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guided&lt;/strong&gt; — a host agent requests the activation packet before work begins. The packet declares read targets, self-test posture, drift state, and gate outcome. The host uses it to preflight the session.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forced&lt;/strong&gt; — a launcher blocks repository work until the session report clears. This is the strongest path and the least elegant one. It is also the one that survives noisy real-world terminal workflows.&lt;/p&gt;
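&lt;p&gt;To make the forced path concrete, here is a minimal sketch of a launcher that refuses to start work until the packet's session report clears. It is not MICA's launcher. The &lt;code&gt;gate&lt;/code&gt; key and its &lt;code&gt;"open"&lt;/code&gt; value are hypothetical stand-ins for however the real session report encodes the gate outcome, and &lt;code&gt;gate_is_open&lt;/code&gt; and &lt;code&gt;run_if_cleared&lt;/code&gt; are illustrative names.&lt;/p&gt;

```python
import json
import subprocess
import sys


def gate_is_open(packet: dict) -> bool:
    """True only when the packet carries a session report whose gate is open."""
    report = packet.get("session_report")
    if not isinstance(report, dict):
        return False  # no report at all counts as a closed gate
    return report.get("gate") == "open"  # hypothetical field and value


def run_if_cleared(packet_json: str, command: list[str]) -> int:
    """Run the command only if the gate cleared; otherwise refuse with exit code 2."""
    packet = json.loads(packet_json)
    if not gate_is_open(packet):
        print("session gate closed; refusing to start work", file=sys.stderr)
        return 2
    return subprocess.call(command)
```

&lt;p&gt;The design point is the default: a missing or malformed report is treated as a refusal, so skipping the session start can never be read as permission.&lt;/p&gt;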




&lt;h2&gt;
  
  
  4. What Changed in Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1ip7e5d3uompdkvdxdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1ip7e5d3uompdkvdxdh.png" alt="The output mechanism" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three concrete moves made this operational.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session report became a real runtime output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The opening report is now a compiled object — not a protocol expectation, not a prose description. A host can consume it directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation is now compiled, not described.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mica_invoke.py&lt;/code&gt; compiles read targets and session report into one activation packet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;packet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entry_strategy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;read_targets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;_layer_targets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_root&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the shift from documentation-first startup to packet-first startup. The host no longer has to infer the sequence from prose.&lt;/p&gt;

&lt;p&gt;In guided mode, the output is already shaped for host consumption:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guided"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entry_strategy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guided"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"read_targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"readme"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mica_yaml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"archive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playbook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lessons"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_report"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"archive_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.7.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"self_test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CLOSED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"closed_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"drift_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO_DRIFT"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"directive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Host agent should load declared MICA surfaces first and use the session report as opening state."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
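
&lt;p&gt;A host could consume a packet of this shape roughly as follows. This is a minimal sketch, not the MICA implementation: &lt;code&gt;plan_startup&lt;/code&gt; and the inline &lt;code&gt;PACKET_JSON&lt;/code&gt; are hypothetical names, and the packet is trimmed to the fields the sketch reads.&lt;/p&gt;

```python
import json

# Hypothetical packet, shaped like the guided-mode output shown above (trimmed).
PACKET_JSON = """
{
  "mode": "guided",
  "read_targets": [
    {"name": "readme"},
    {"name": "mica_yaml"},
    {"name": "archive"}
  ],
  "session_report": {"gate": "PASS", "drift_status": {"status": "NO_DRIFT"}}
}
"""


def plan_startup(packet):
    """Return the ordered surface names to load, or raise if the gate is blocked."""
    report = packet.get("session_report", {})
    if report.get("gate") == "BLOCKED":
        raise RuntimeError("session gate is BLOCKED; do not start work")
    # Preserve the declared order: the packet, not the host, owns the sequence.
    return [target["name"] for target in packet.get("read_targets", [])]


packet = json.loads(PACKET_JSON)
load_order = plan_startup(packet)
print(load_order)
```

&lt;p&gt;The point of the sketch is that the host reads the sequence from a structured field instead of reconstructing it from prose.&lt;/p&gt;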



&lt;p&gt;&lt;strong&gt;Forced mode now has a consequence.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;forced&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;packet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;BLOCKED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The simplest entry surface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight batchfile"&gt;&lt;code&gt;@echo &lt;span class="na"&gt;off&lt;/span&gt;
&lt;span class="kd"&gt;python&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="vm"&gt;%~dp0&lt;/span&gt;&lt;span class="s2"&gt;tools\mica_invoke.py"&lt;/span&gt; &lt;span class="err"&gt;%&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That wrapper gives MICA an enforceable terminal entrypoint instead of relying on good behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. STEM-BIO-AI: The Cleaner Case
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7xfu8e4n06e08j08p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9t7xfu8e4n06e08j08p1.png" alt="Dependency Shift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;STEM-BIO-AI&lt;/code&gt; already had a mature MICA memory layer — archive, playbook, lessons, invocation protocol, drift profile. What changed was not the memory model. It was how that model becomes operative before work begins.&lt;/p&gt;

&lt;p&gt;That difference is visible across all three invocation modes.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;natural&lt;/code&gt; mode, the helper preserves the README-first path and makes the expected read order explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MICA INVOKE] mode=natural
Gate       : PASS
State      : INVOCATION_MODE
PCT        : CLOSED
...
Directive: Prefer reading README first, then load mica.yaml, archive, and playbook before scan work.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;guided&lt;/code&gt; mode, the same startup becomes a host-consumable packet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"guided"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"read_targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"readme"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mica_yaml"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"archive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"playbook"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lessons"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_report"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"archive_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.7.8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"self_test"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"pct"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CLOSED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"closed_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"drift_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"NO_DRIFT"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;forced&lt;/code&gt; mode, the launcher uses the same contract as a gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[MICA INVOKE] mode=forced
Gate       : PASS
State      : INVOCATION_MODE
PCT        : CLOSED
...
Directive: Block work until the session report gate is not BLOCKED.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The session report now looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: 1.7.8
Load: {"state": "INVOCATION_MODE", "mica_yaml": "memory\\mica.yaml"}
Self-test: {"pct": "CLOSED", "closed_contract": true}
Drift: {"status": "NO_DRIFT"}
Active invariants: {"critical_count": 15, "high_count": 3}
Gate: PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before, the package told the operator how to start correctly. Now, the session declares whether it actually did.&lt;/p&gt;

&lt;p&gt;Before this version, starting a &lt;code&gt;STEM-BIO-AI&lt;/code&gt; session correctly still depended on the operator remembering to load the right memory surfaces in the right order. Now that dependency can move upward: in &lt;code&gt;guided&lt;/code&gt; mode to the host, and in &lt;code&gt;forced&lt;/code&gt; mode to the launcher.&lt;/p&gt;
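
&lt;p&gt;Moving the dependency up to the launcher can be sketched like this. The only behavior taken from the post is that forced mode exits non-zero when the gate is &lt;code&gt;BLOCKED&lt;/code&gt;. The function name &lt;code&gt;run_if_gate_passes&lt;/code&gt; is hypothetical, and the two preflight commands are stand-ins for a real call to the invoke helper.&lt;/p&gt;

```python
import subprocess
import sys


def run_if_gate_passes(preflight_cmd, work):
    """Run `work` only when the preflight command exits with status 0."""
    result = subprocess.run(preflight_cmd)
    if result.returncode != 0:
        # Forced mode: a non-zero exit means the gate is BLOCKED.
        return False
    work()
    return True


# Stand-in preflight commands; a real launcher would call the MICA invoke helper here.
passing = [sys.executable, "-c", "import sys; sys.exit(0)"]
blocked = [sys.executable, "-c", "import sys; sys.exit(1)"]

started = []
ok = run_if_gate_passes(passing, lambda: started.append("work"))
refused = run_if_gate_passes(blocked, lambda: started.append("work"))
print(ok, refused, started)
```

&lt;p&gt;With this shape the operator no longer has to remember the gate. The launcher simply never reaches the work step when the preflight fails.&lt;/p&gt;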




&lt;h2&gt;
  
  
  6. CCGE: The Harder Case
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sxhtvkrlw8yyd951osd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1sxhtvkrlw8yyd951osd.png" alt="retaining identity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CCGE&lt;/code&gt; is more important precisely because it is harder. It is already a governance-heavy runtime. If MICA's identity were weak, it would disappear into the larger framework.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CCGE&lt;/code&gt; here is the Care Chain Governance Engine: a fail-closed clinical governance runtime with its own execution core, artifact generation, policy layers, and approval logic. That is why it is the harder case. MICA is not being tested in isolation. It is being tested inside a system dense enough to swallow it.&lt;/p&gt;

&lt;p&gt;It did not.&lt;/p&gt;

&lt;p&gt;The boundary stayed explicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MICA&lt;/strong&gt; = invocation, memory, invariants, drift control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CCGE Core&lt;/strong&gt; = fail-closed runtime and artifact generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;STEM-AI&lt;/strong&gt; = trust re-audit and classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the important architectural result. In &lt;code&gt;STEM-BIO-AI&lt;/code&gt;, MICA is already close to the center of the tool's operational identity. In &lt;code&gt;CCGE&lt;/code&gt;, MICA has to retain its own identity inside a much larger runtime. It does so by remaining responsible for invocation, memory, invariants, and drift control, while &lt;code&gt;CCGE Core&lt;/code&gt; remains responsible for fail-closed execution and artifact logic.&lt;/p&gt;

&lt;p&gt;The current session report in &lt;code&gt;CCGE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: None
Load: {"state": "INVOCATION_MODE", "mica_yaml": "mica.yaml"}
Self-test: {"pct": "CLOSED", "closed_contract": true}
Drift: {"status": "NO_DRIFT"}
Active invariants: {"critical_count": 0, "high_count": 0}
Gate: PASS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Archive: None&lt;/code&gt; with &lt;code&gt;Gate: PASS&lt;/code&gt; is not a contradiction. The baseline archive does not yet expose a &lt;code&gt;project.version&lt;/code&gt; field. MICA detected that gap and reported it before any work began. A system that hides its own incompleteness is not governed. A system that surfaces it at session start is.&lt;/p&gt;

&lt;p&gt;The reason is concrete: the active archive is still a baseline integration memory object, not yet a fully target-bound archive. Its &lt;code&gt;project&lt;/code&gt; block still carries placeholders like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;target-repo-name&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;absolute-or-repo-relative-path&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"owner"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;org-or-maintainer&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"integration_program"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CCGE Unified Model"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"target_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"phase_1_candidate"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the current report is telling the truth about what exists: a coherent MICA package around a still-baseline archive.&lt;/p&gt;

&lt;p&gt;A README might have let that gap stay invisible. The session report surfaced it immediately. That is what honest governance looks like before an archive is fully populated.&lt;/p&gt;
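
&lt;p&gt;One way a report builder could surface that gap instead of hiding it is sketched below. This is an illustration under assumptions, not the CCGE or MICA code: &lt;code&gt;build_session_report&lt;/code&gt; is a hypothetical helper, and it only mirrors the idea that a missing &lt;code&gt;project.version&lt;/code&gt; is reported as &lt;code&gt;None&lt;/code&gt; rather than filled with a guess.&lt;/p&gt;

```python
def build_session_report(archive):
    """Build a minimal report that surfaces missing fields as None."""
    project = archive.get("project", {})
    return {
        # A baseline archive without project.version reports None, not a guessed value.
        "archive_version": project.get("version"),
        "target_status": project.get("target_status"),
    }


# Baseline archive: project block exists but carries no version field yet.
baseline = {"project": {"integration_program": "CCGE Unified Model",
                        "target_status": "phase_1_candidate"}}
# Hypothetical target-bound archive that does declare a version.
bound = {"project": {"version": "1.7.8"}}

baseline_report = build_session_report(baseline)
bound_report = build_session_report(bound)
print(baseline_report["archive_version"])
print(bound_report["archive_version"])
```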




&lt;h2&gt;
  
  
  7. What This Means for Anyone Building Agent Workflows
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9iym06ebtjisrhcw3oq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs9iym06ebtjisrhcw3oq.png" alt="Architectural imperatives" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three lessons from running this against two different projects.&lt;/p&gt;

&lt;p&gt;Human-readable startup is not enough. If the only valid path lives in a README, the protocol is vulnerable to partial reading and host variance. &lt;code&gt;STEM-BIO-AI&lt;/code&gt; is the clean example here: the memory layer was already mature, but correct startup still depended too much on the operator remembering to load it.&lt;/p&gt;

&lt;p&gt;Session-start state must be machine-usable. If a host agent cannot consume the startup declaration as a structured object, it cannot reliably preflight the session. That is why &lt;code&gt;guided&lt;/code&gt; mode matters more than another explanatory document: it gives the host an object to act on, not just instructions to interpret.&lt;/p&gt;

&lt;p&gt;A gate needs an entrypoint. A session report can be a conceptual hard gate, but until a launcher or host uses it as an entry condition, it remains a convention. &lt;code&gt;CCGE&lt;/code&gt; is the stronger proof of that point because the environment is already dense with governance logic; without an explicit entry surface, MICA would have been easy to blur into the surrounding framework instead of remaining its own startup layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What This Does Not Claim
&lt;/h2&gt;

&lt;p&gt;MICA does not self-invoke automatically in all environments. There is still no natural law that forces an LLM session to load the governed archive first.&lt;/p&gt;

&lt;p&gt;The real claim is narrower:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MICA can now be read naturally&lt;/li&gt;
&lt;li&gt;MICA can now be requested deliberately&lt;/li&gt;
&lt;li&gt;MICA can now be enforced mechanically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not total automation. A realistic path to enforceable startup.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. What Part 8 Will Address
&lt;/h2&gt;

&lt;p&gt;The startup path is now much stronger.&lt;/p&gt;

&lt;p&gt;But one question remains:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How much of the session-start contract should be owned by the archive itself, and how much should remain a runtime default?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The current release line can emit &lt;code&gt;session-report&lt;/code&gt;, compile guided packets, and block in forced mode. The next step is stricter archive ownership: a richer &lt;code&gt;session_report_format&lt;/code&gt;, an explicit per-archive &lt;code&gt;session_gate_policy&lt;/code&gt;, and better drift contracts.&lt;/p&gt;

&lt;p&gt;Part 8 is not about whether MICA should govern startup. It already does. It is about how much of that behavior should be declared by the archive rather than inferred by the runtime.&lt;/p&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; A protocol is not yet an entrypoint. MICA becomes operational only when invocation is structured as &lt;code&gt;natural&lt;/code&gt;, &lt;code&gt;guided&lt;/code&gt;, or &lt;code&gt;forced&lt;/code&gt; — and the session begins from a declared activation packet, not from hope.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>contextengineering</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>From Repo Scanner to Audit Architecture: What Changed in STEM BIO-AI Through v1.7.8</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 19 May 2026 14:38:53 +0000</pubDate>
      <link>https://dev.to/flamehaven01/from-repo-scanner-to-audit-architecture-what-changed-in-stem-bio-ai-through-v178-500m</link>
      <guid>https://dev.to/flamehaven01/from-repo-scanner-to-audit-architecture-what-changed-in-stem-bio-ai-through-v178-500m</guid>
      <description>

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa5ste8u9hanwgyt9441.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa5ste8u9hanwgyt9441.png" alt="From repo scanner to audit architecture: the evolution of STEM BIO-AI through v1.7.8" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Three technical changes that made the scanner less Python-shaped, the warning model more stable, and the reports more inspectable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The last time I wrote about STEM BIO-AI, the focus was AIRI:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how a local repository scanner could expand its risk vocabulary without pretending to become a universal AI safety judge.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was the right story for &lt;code&gt;1.7.0&lt;/code&gt; and &lt;code&gt;1.7.1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But the project changed meaningfully after that.&lt;/p&gt;

&lt;p&gt;For readers who have not followed the earlier posts: &lt;a href="https://dev.to/flamehaven01/beyond-repo-scanning-how-airi-expanded-the-risk-vocabulary-in-stem-bio-ai-17x-5bgo"&gt;Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By &lt;code&gt;1.7.8&lt;/code&gt;, the interesting question was no longer just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can this scanner attach a broader risk language to local findings?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Can this scanner make those findings more inspectable, less misleading, and more robust across real repository shapes?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That shift matters.&lt;/p&gt;

&lt;p&gt;Because in audit tooling, correctness is only the first battle. The second battle is whether a reviewer can see &lt;strong&gt;why&lt;/strong&gt; the tool landed where it did, and whether the output still makes sense when it leaves the terminal and becomes a report, a PDF packet, a Hugging Face demo, or a governance memo.&lt;/p&gt;

&lt;p&gt;From &lt;code&gt;1.7.6&lt;/code&gt; through &lt;code&gt;1.7.8&lt;/code&gt;, three changes mattered most.&lt;/p&gt;

&lt;p&gt;They changed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;what counts as evidence,&lt;/li&gt;
&lt;li&gt;how warning lanes are separated,&lt;/li&gt;
&lt;li&gt;and how the final artifact stays legible across surfaces.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the more technical story behind those releases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Basic AIRI (the AI Risk Repository) Context: Expanding the Language of Risk
&lt;/h2&gt;

&lt;p&gt;Before getting into the release details, it helps to define what AIRI means in this series.&lt;/p&gt;

&lt;p&gt;AIRI refers here to &lt;strong&gt;&lt;a href="https://airisk.mit.edu/" rel="noopener noreferrer"&gt;the MIT AI Risk Repository&lt;/a&gt;&lt;/strong&gt;: a public AI risk resource from the MIT AI Risk Initiative that organizes fragmented AI risk language across research, policy, and industry sources.&lt;/p&gt;

&lt;p&gt;The repository includes an AI Risk Database, a Causal Taxonomy of AI Risks, and a Domain Taxonomy of AI Risks. According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications, while the public domain taxonomy organizes risks into 7 domains and 24 subdomains.&lt;/p&gt;

&lt;p&gt;That makes AIRI useful as a vocabulary source.&lt;/p&gt;

&lt;p&gt;But vocabulary is not truth.&lt;/p&gt;

&lt;p&gt;A local scanner should not say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this repository caused this risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It should say something more careful:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this local finding belongs to a broader class of AI risk language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction is the design boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Problem: The scanner was still too Python-shaped
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aqh2kcvfa3gy2i0l01h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aqh2kcvfa3gy2i0l01h.png" alt="Universal dependency detection and provenance evidence across Python and JavaScript stacks" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the more useful failures in this line came from an uncomfortable result: a repository could obviously have dependency and lockfile evidence, and STEM BIO-AI could still miss it.&lt;/p&gt;

&lt;p&gt;That is not a philosophical problem.&lt;br&gt;
That is &lt;strong&gt;an implementation problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, the project was still too biased toward Python-native signals.&lt;/p&gt;

&lt;p&gt;That showed up most clearly in JavaScript or mixed-stack repositories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;package.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;package-lock.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pnpm-lock.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;yarn.lock&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;npm-shrinkwrap.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;were not being treated as first-class provenance and replication evidence in the same way that &lt;code&gt;requirements.txt&lt;/code&gt; or &lt;code&gt;pyproject.toml&lt;/code&gt; were.&lt;/p&gt;

&lt;p&gt;The result was a false negative pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 3 provenance (&lt;code&gt;B1&lt;/code&gt;) could be undercounted&lt;/li&gt;
&lt;li&gt;Stage 4 replication evidence could be undercounted&lt;/li&gt;
&lt;li&gt;and the report could quietly imply "no dependency evidence" when the repository clearly had dependency structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That kind of miss is more dangerous than it sounds.&lt;/p&gt;

&lt;p&gt;Not because it makes the score a little wrong.&lt;/p&gt;

&lt;p&gt;But because it damages trust in the scanner's worldview.&lt;/p&gt;

&lt;p&gt;If developers see a tool miss an obvious &lt;code&gt;pnpm-lock.yaml&lt;/code&gt;, they stop believing the harder claims too.&lt;/p&gt;


&lt;h3&gt;
  
  
  What changed in &lt;code&gt;1.7.6&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The fix was straightforward but important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JavaScript manifests and lockfiles were promoted into the same evidence families as the existing Python manifests where appropriate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Concretely, that meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;B1_data_provenance_controls&lt;/code&gt; started recognizing JS manifest/lock surfaces&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S4_environment_lock_evidence&lt;/code&gt; started recognizing them&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S4_exact_dependency_pins_or_hashes&lt;/code&gt; started recognizing them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was not a scoring philosophy change.&lt;/p&gt;

&lt;p&gt;It was a scope correction.&lt;/p&gt;

&lt;p&gt;The rule engine learned that a dependency ecosystem is a dependency ecosystem even when it is not Python.&lt;/p&gt;
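
&lt;p&gt;The scope correction can be pictured with a small sketch. The file names come from this post. Everything else is an assumption for illustration: the two set names, the &lt;code&gt;find_dependency_surfaces&lt;/code&gt; helper, and the choice to look only at the repository root are not taken from the STEM BIO-AI rule engine.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# File names taken from the post; the grouping into two sets is an assumption.
PYTHON_MANIFESTS = {"requirements.txt", "pyproject.toml"}
JS_MANIFESTS = {"package.json", "package-lock.json", "pnpm-lock.yaml",
                "yarn.lock", "npm-shrinkwrap.json"}


def find_dependency_surfaces(repo_root):
    """Return sorted names of known manifest or lockfile surfaces at the repo root."""
    known = PYTHON_MANIFESTS | JS_MANIFESTS
    return sorted(p.name for p in Path(repo_root).iterdir()
                  if p.is_file() and p.name in known)


# Usage: a throwaway JavaScript-only repository shape.
with tempfile.TemporaryDirectory() as root:
    for name in ("package.json", "pnpm-lock.yaml", "README.md"):
        (Path(root) / name).write_text("")
    surfaces = find_dependency_surfaces(root)

print(surfaces)
```

&lt;p&gt;A detector that only knew &lt;code&gt;PYTHON_MANIFESTS&lt;/code&gt; would return an empty list for this repository, which is the false negative described above.&lt;/p&gt;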

&lt;p&gt;One boundary matters here.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;B1_data_provenance_controls&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; suddenly mean "dataset lineage was proven by a lockfile."&lt;/p&gt;

&lt;p&gt;In this lane, &lt;code&gt;B1&lt;/code&gt; is using dependency manifests as &lt;strong&gt;repository provenance surfaces&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what environment the repository expects,&lt;/li&gt;
&lt;li&gt;what dependency custody the repository exposes,&lt;/li&gt;
&lt;li&gt;and whether the repo surfaces any adjacent data-source, IRB, or dataset-citation language around that environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is weaker than dataset lineage evidence.&lt;/p&gt;

&lt;p&gt;But it is also much stronger than pretending a mixed-stack repository has no provenance surface at all.&lt;/p&gt;


&lt;h3&gt;
  
  
  A small before/after that makes the point
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;yorkeccak/bio&lt;/code&gt; case is a good example because the score movement was not philosophical. It was mechanical.&lt;/p&gt;

&lt;p&gt;Before the JS manifest fix, the same repository could produce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1.7.5
final_score: 40
stage_3_code_bio: 6
B1_data_provenance_controls: 0 / 15
replication_score: 10
AIRI covered_count: 0 / 31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the manifest and lockfile correction, the same repository shape produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 1.7.8
final_score: 48
stage_3_code_bio: 25
B1_data_provenance_controls: 15 / 15
replication_score: 30
AIRI covered_count: 7 / 32

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part is not the score delta by itself, but what produced it.&lt;/p&gt;

&lt;p&gt;One small boundary is worth making explicit here.&lt;/p&gt;

&lt;p&gt;The AIRI change is doing two things at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the denominator moved from &lt;code&gt;31&lt;/code&gt; to &lt;code&gt;32&lt;/code&gt; because the governed AIRI detector-scope expanded by one mapping row across this release line,&lt;/li&gt;
&lt;li&gt;and the numerator moved from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;7&lt;/code&gt; because the current release can now carry more bounded AIRI links around the findings it actually surfaced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That explains the AIRI coverage delta.&lt;/p&gt;
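&lt;p&gt;The two movements can be kept apart when comparing releases. Here is a small Python sketch using the counts quoted above. The &lt;code&gt;Coverage&lt;/code&gt; type and the &lt;code&gt;explain_delta&lt;/code&gt; helper are hypothetical names for illustration, not part of the tool.&lt;/p&gt;

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Coverage:
    covered: int  # AIRI rows with at least one linked finding
    scope: int    # AIRI rows the detector scope can map to


def explain_delta(before, after):
    """Report numerator and denominator movement separately."""
    return {
        "scope_change": after.scope - before.scope,
        "covered_change": after.covered - before.covered,
        "ratio_before": Fraction(before.covered, before.scope),
        "ratio_after": Fraction(after.covered, after.scope),
    }


if __name__ == "__main__":
    # Counts taken from the 1.7.5 and 1.7.8 outputs shown above.
    print(explain_delta(Coverage(0, 31), Coverage(7, 32)))
```

&lt;p&gt;Reporting the scope change next to the covered change keeps a reader from attributing a denominator move to the repository itself.&lt;/p&gt;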

&lt;p&gt;The scoring delta came from a more mechanical correction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;package.json&lt;/code&gt;, &lt;code&gt;package-lock.json&lt;/code&gt;, and &lt;code&gt;pnpm-lock.yaml&lt;/code&gt; stopped being invisible,&lt;/li&gt;
&lt;li&gt;Stage 3 stopped saying "no dependency/provenance manifest detected,"&lt;/li&gt;
&lt;li&gt;and Stage 4 stopped undercounting replication structure that was obviously there.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is what I mean by "blind spot removal" rather than score drift.&lt;/p&gt;
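&lt;p&gt;A minimal sketch of ecosystem-aware manifest detection, written in Python. This is not the actual STEM BIO-AI rule engine: &lt;code&gt;find_manifests&lt;/code&gt; and the lookup table are hypothetical, only the three JavaScript file names come from the fix described above, and the Python entries are assumed examples.&lt;/p&gt;

```python
from pathlib import Path

# Illustrative table only. The JavaScript names are the ones mentioned above;
# the Python names are assumed examples of manifests that were already visible.
MANIFESTS_BY_ECOSYSTEM = {
    "python": ("requirements.txt", "pyproject.toml"),
    "javascript": ("package.json", "package-lock.json", "pnpm-lock.yaml"),
}


def find_manifests(repo_root):
    """Return {ecosystem: sorted manifest names} for files at the repo root."""
    root = Path(repo_root)
    found = {}
    for ecosystem, names in MANIFESTS_BY_ECOSYSTEM.items():
        present = sorted(name for name in names if (root / name).is_file())
        if present:
            found[ecosystem] = present
    return found


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "package.json").write_text("{}")
        (Path(tmp) / "pnpm-lock.yaml").write_text("lockfileVersion: 9\n")
        print(find_manifests(tmp))
```

&lt;p&gt;The design point is that adding an ecosystem means adding a table row, not a new code path, so a non-Python repository is no longer scored as if it had no manifest at all.&lt;/p&gt;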




&lt;h3&gt;
  
  
  Why that matters
&lt;/h3&gt;

&lt;p&gt;This is the kind of change that sounds small in a changelog but large in practice.&lt;/p&gt;

&lt;p&gt;Because it changes the relationship between the tool and the developer reading it.&lt;/p&gt;

&lt;p&gt;A scanner earns the right to say "this repo is weak on provenance" only after it can correctly see the basic surfaces that exist in the target stack.&lt;/p&gt;

&lt;p&gt;That correction also made later report outputs more believable.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;B1&lt;/code&gt; moved from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;15&lt;/code&gt; in affected repositories, that was not "score drift." It was the removal of a blind spot.&lt;/p&gt;

&lt;p&gt;And that distinction is exactly why audit tools need explicit versioned rationale.&lt;/p&gt;

&lt;p&gt;Without it, every score movement looks arbitrary.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Problem: The warning lanes were doing too many jobs at once
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll48qnppqqso4q9k47ut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fll48qnppqqso4q9k47ut.png" alt="Dedicated warning lanes in STEM BIO-AI showing C4, C5, and C6 semantic separation" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before getting to the split, it helps to read &lt;code&gt;C1–C6&lt;/code&gt; as code-integrity lanes.&lt;/p&gt;

&lt;p&gt;They are not general AI risk categories. They are reviewer-facing signals that tell you what kind of repository weakness the scanner found, and where to inspect next.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Lane&lt;/th&gt;
&lt;th&gt;What it means in STEM BIO-AI&lt;/th&gt;
&lt;th&gt;What a reviewer should inspect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hardcoded credential signals&lt;/td&gt;
&lt;td&gt;exposed API keys, cloud keys, tokens, or credential-like patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Dependency pinning and external-service fragility&lt;/td&gt;
&lt;td&gt;loose dependency ranges, missing exact pins, fragile external service assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Deprecated patient-adjacent paths&lt;/td&gt;
&lt;td&gt;legacy, archive, or deprecated folders that still contain patient or clinical-adjacent patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C4&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fail-open exception handling&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;except: pass&lt;/code&gt;, &lt;code&gt;except Exception: pass&lt;/code&gt;, silent fallbacks, or code paths where errors can disappear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Compliance and clinical-boundary integrity&lt;/td&gt;
&lt;td&gt;unsupported HIPAA, compliance, clinical-safe, self-hosted, or regulatory-adjacent claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;C6&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mock-auth or no-auth local/self-host trust boundaries&lt;/td&gt;
&lt;td&gt;auto-login, mock authentication, no-auth flows, or weak local trust-boundary assumptions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That table matters because &lt;code&gt;C4&lt;/code&gt;, &lt;code&gt;C5&lt;/code&gt;, and &lt;code&gt;C6&lt;/code&gt; are not interchangeable.&lt;/p&gt;

&lt;p&gt;A fail-open exception is not the same problem as an unsupported compliance claim.&lt;/p&gt;

&lt;p&gt;And an unsupported compliance claim is not the same problem as a mock-auth self-host boundary.&lt;/p&gt;

&lt;p&gt;That distinction became important once the report started surfacing more nuanced governance signals.&lt;/p&gt;

&lt;p&gt;The old &lt;code&gt;C4&lt;/code&gt; lane had started life as a code-oriented fail-open/exception surface.&lt;/p&gt;

&lt;p&gt;But as the scanner got better at spotting unsupported compliance language and boundary failures, more and more signals were being interpreted near that same lane.&lt;/p&gt;

&lt;p&gt;That made the result harder to read.&lt;/p&gt;

&lt;p&gt;If a reviewer sees:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;C4_exception_handling_clinical_adjacent_paths: WARN&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;they should be able to infer the remediation class immediately.&lt;/p&gt;

&lt;p&gt;They should know to inspect executable control flow.&lt;/p&gt;

&lt;p&gt;They should not have to wonder whether the warning is actually about a README compliance claim, a missing clinical boundary, or a mock-auth local path.&lt;/p&gt;

&lt;p&gt;Once one lane starts carrying all of those meanings, the ID stops doing its job.&lt;/p&gt;

&lt;p&gt;This is a common failure mode in rule systems.&lt;/p&gt;

&lt;p&gt;At first it feels efficient:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one warning lane,&lt;/li&gt;
&lt;li&gt;one bucket,&lt;/li&gt;
&lt;li&gt;multiple related issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then a few releases later the bucket becomes a junk drawer.&lt;/p&gt;

&lt;p&gt;That is exactly what had to be prevented here.&lt;/p&gt;
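&lt;p&gt;To make the narrow meaning of &lt;code&gt;C4&lt;/code&gt; concrete, here is a minimal sketch of a fail-open check built on Python's standard &lt;code&gt;ast&lt;/code&gt; module. The function name &lt;code&gt;find_fail_open_handlers&lt;/code&gt; is hypothetical and the real detector may use different heuristics. This version only flags bare &lt;code&gt;except&lt;/code&gt; or &lt;code&gt;except Exception&lt;/code&gt; handlers whose body is a single &lt;code&gt;pass&lt;/code&gt;.&lt;/p&gt;

```python
import ast


def find_fail_open_handlers(source):
    """Return line numbers of handlers that swallow errors with a lone pass."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has type None; also match `except Exception:`.
        is_broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id == "Exception"
        )
        only_pass = len(node.body) == 1 and isinstance(node.body[0], ast.Pass)
        if is_broad and only_pass:
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None
'''

if __name__ == "__main__":
    print(find_fail_open_handlers(SAMPLE))
```

&lt;p&gt;Nothing in this check reads a README or an auth flow. A lane defined this way can only ever mean "inspect executable control flow."&lt;/p&gt;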




&lt;h3&gt;
  
  
  What changed in &lt;code&gt;1.7.7&lt;/code&gt; and &lt;code&gt;1.7.8&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The solution was to split the lane cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C4&lt;/code&gt; stayed reserved for executable fail-open exception behavior&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C5&lt;/code&gt; was introduced for unsupported compliance or boundary-integrity claims&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C6&lt;/code&gt; was introduced for mock-auth, auto-login, or no-auth self-host/local trust-boundary signals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was more than renaming.&lt;/p&gt;

&lt;p&gt;It made the model of the problem cleaner:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C4&lt;/code&gt; is code-path failure semantics&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C5&lt;/code&gt; is governance/claim integrity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C6&lt;/code&gt; is trust-boundary collapse in local or self-host flows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters to developers because those are different remediation classes.&lt;/p&gt;

&lt;p&gt;If a repository triggers &lt;code&gt;C4&lt;/code&gt;, you inspect executable control flow.&lt;br&gt;
If it triggers &lt;code&gt;C5&lt;/code&gt;, you inspect public claim surfaces and supporting governance evidence.&lt;br&gt;
If it triggers &lt;code&gt;C6&lt;/code&gt;, you inspect local auth and trust-boundary design.&lt;/p&gt;

&lt;p&gt;One warning label should not try to be all three.&lt;/p&gt;

&lt;p&gt;The more interesting case is when two of those lanes fire together.&lt;/p&gt;

&lt;p&gt;A repository can claim something like "HIPAA-ready self-hosting" at the README layer and also expose a mock-auth or auto-login local path.&lt;/p&gt;

&lt;p&gt;That is not one problem.&lt;/p&gt;

&lt;p&gt;It is two related problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C5&lt;/code&gt; says the claim surface is overstating governance integrity&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;C6&lt;/code&gt; says the local trust boundary is weaker than the claim suggests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly why the split matters.&lt;/p&gt;

&lt;p&gt;If those two findings collapse into one bucket, the reviewer loses both remediation clarity and causal ordering.&lt;/p&gt;

&lt;p&gt;If they stay separate, the report can say:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the public claim is weak,&lt;/li&gt;
&lt;li&gt;the local boundary is weak,&lt;/li&gt;
&lt;li&gt;and both together make the repository easier to over-trust.&lt;/li&gt;
&lt;/ol&gt;
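&lt;p&gt;A short Python sketch of that reporting shape, under stated assumptions: the &lt;code&gt;Finding&lt;/code&gt; type, the &lt;code&gt;REMEDIATION&lt;/code&gt; table, and &lt;code&gt;summarize&lt;/code&gt; are hypothetical names, and the lane keys are shortened to &lt;code&gt;C4&lt;/code&gt;, &lt;code&gt;C5&lt;/code&gt;, and &lt;code&gt;C6&lt;/code&gt; for readability.&lt;/p&gt;

```python
from dataclasses import dataclass

# Remediation classes paraphrased from the text above.
REMEDIATION = {
    "C4": "inspect executable control flow",
    "C5": "inspect public claim surfaces and supporting governance evidence",
    "C6": "inspect local auth and trust-boundary design",
}


@dataclass(frozen=True)
class Finding:
    lane: str      # "C4", "C5", or "C6"
    evidence: str  # where the signal was seen


def summarize(findings):
    """List one remediation line per lane, then note C5 + C6 co-firing."""
    lanes = sorted({f.lane for f in findings})
    lines = [f"{lane}: {REMEDIATION[lane]}" for lane in lanes]
    if "C5" in lanes and "C6" in lanes:
        lines.append("C5+C6: claim and boundary are both weak; over-trust risk")
    return lines


if __name__ == "__main__":
    report = summarize([
        Finding("C5", "README claims HIPAA-ready self-hosting"),
        Finding("C6", "auto-login path in local setup"),
    ])
    print("\n".join(report))
```

&lt;p&gt;The combined line is derived from two separate findings, so each lane keeps its own remediation class and the co-firing note stays an addition rather than a replacement.&lt;/p&gt;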


&lt;h3&gt;
  
  
  The code insight
&lt;/h3&gt;

&lt;p&gt;This is one of those places where good audit tooling starts looking more like good static analysis design.&lt;/p&gt;

&lt;p&gt;A useful warning family is not just one that catches things.&lt;/p&gt;

&lt;p&gt;It is one that stays semantically stable across releases.&lt;/p&gt;

&lt;p&gt;That is why this split mattered: it was not just about improving recall.&lt;/p&gt;

&lt;p&gt;It was about preserving interpretability under growth.&lt;/p&gt;

&lt;p&gt;Once a detector ID becomes ambiguous, your historical comparisons become weaker.&lt;/p&gt;

&lt;p&gt;And once historical comparisons become weaker, your audit system starts losing its memory.&lt;/p&gt;

&lt;p&gt;That is a bigger problem than one missed warning.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Problem: The report could still be correct and yet hard to trust
&lt;/h2&gt;

&lt;p&gt;A repository scanner does not end its life in JSON.&lt;/p&gt;

&lt;p&gt;It ends up in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown&lt;/li&gt;
&lt;li&gt;HTML&lt;/li&gt;
&lt;li&gt;PDF&lt;/li&gt;
&lt;li&gt;demos&lt;/li&gt;
&lt;li&gt;governance reviews&lt;/li&gt;
&lt;li&gt;screenshots&lt;/li&gt;
&lt;li&gt;and social arguments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the output architecture matters almost as much as the scoring logic.&lt;/p&gt;

&lt;p&gt;And there were two places where this became obvious.&lt;/p&gt;


&lt;h3&gt;
  
  
  First: AIRI numbers needed explanation, not just display
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqednhmhqla6mtzxieyyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqednhmhqla6mtzxieyyo.png" alt="AIRI numbers needed explanation" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Earlier versions could show AIRI coverage as a count, but not always make it obvious why a covered risk appeared.&lt;/p&gt;

&lt;p&gt;That is a problem.&lt;/p&gt;

&lt;p&gt;Because a number like &lt;code&gt;7 / 32&lt;/code&gt; looks precise.&lt;/p&gt;

&lt;p&gt;But precision without causal explanation is fragile.&lt;/p&gt;

&lt;p&gt;Developers do not just want to know that a risk mapped.&lt;br&gt;
They want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which detector triggered it,&lt;/li&gt;
&lt;li&gt;why that detector maps to that AIRI risk,&lt;/li&gt;
&lt;li&gt;and what boundary still remains around that mapping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the AIRI layer had to become more explicit.&lt;/p&gt;

&lt;p&gt;That is where &lt;code&gt;mapping_details&lt;/code&gt; mattered.&lt;/p&gt;

&lt;p&gt;Covered AIRI rows now carry bounded reasoning objects that can say, in effect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detector ID&lt;/li&gt;
&lt;li&gt;mapping justification&lt;/li&gt;
&lt;li&gt;trigger reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a much stronger artifact than a bare coverage count.&lt;/p&gt;

&lt;p&gt;It turns AIRI from a visual add-on into an inspectable vocabulary layer.&lt;/p&gt;

&lt;p&gt;In practice the object now looks more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"24.01.03"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Safe exploration problem with widely deployed AI assistants"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"covered_by"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"C5_compliance_boundary_integrity"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mapping_details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"detector_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C5_compliance_boundary_integrity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"mapping_justification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"trigger_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Unsupported legal/compliance claim surfaced in boundary-integrity lane."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That matters because the AIRI layer no longer asks the reviewer to trust a number alone.&lt;/p&gt;

&lt;p&gt;It now gives the reviewer a bounded reasoning object to inspect.&lt;/p&gt;
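&lt;p&gt;One way a consumer of the JSON could enforce that property is sketched below in Python. The field names come from the object shown above. The &lt;code&gt;unexplained_rows&lt;/code&gt; helper is hypothetical, and the rule that every detector in &lt;code&gt;covered_by&lt;/code&gt; needs a complete entry in &lt;code&gt;mapping_details&lt;/code&gt; is an assumption about the intended contract, not a documented guarantee.&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = ("detector_id", "mapping_justification", "trigger_reason")


def unexplained_rows(rows):
    """Return IDs of covered rows whose mapping_details are missing or incomplete."""
    bad = []
    for row in rows:
        if not row.get("covered_by"):
            continue  # uncovered rows need no explanation
        details = row.get("mapping_details") or []
        explained = {
            d.get("detector_id")
            for d in details
            if all(d.get(field) for field in REQUIRED_FIELDS)
        }
        if not set(row["covered_by"]).issubset(explained):
            bad.append(row["id"])
    return bad


ROW = json.loads("""
{
  "id": "24.01.03",
  "covered_by": ["C5_compliance_boundary_integrity"],
  "mapping_details": [
    {
      "detector_id": "C5_compliance_boundary_integrity",
      "mapping_justification": "Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts.",
      "trigger_reason": "Unsupported legal/compliance claim surfaced in boundary-integrity lane."
    }
  ]
}
""")

if __name__ == "__main__":
    print(unexplained_rows([ROW]))
    print(unexplained_rows([{"id": "x", "covered_by": ["C4"]}]))
```

&lt;p&gt;A check like this turns "covered" into a claim that can fail: a row that counts toward coverage without a justification and trigger reason gets reported instead of silently inflating the numerator.&lt;/p&gt;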




&lt;h3&gt;
  
  
  Second: The packets themselves needed re-architecture
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm42xf261dgookg7em5c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm42xf261dgookg7em5c.png" alt="Artifact architecture showing brief, standard, and full evidence packet tiers across output surfaces" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The PDF tiers had also drifted into an awkward shape.&lt;/p&gt;

&lt;p&gt;The old packet boundaries were no longer matching the actual content density:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 4 could disappear or feel collapsed&lt;/li&gt;
&lt;li&gt;the closeout pages could become overcrowded&lt;/li&gt;
&lt;li&gt;and "5-page detailed packet" could stop meaning what users expected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That led to a cleaner packet model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;level 1&lt;/code&gt; = brief &lt;code&gt;1p&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;level 2&lt;/code&gt; = standard &lt;code&gt;5p&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;level 3&lt;/code&gt; = full &lt;code&gt;7p&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And just as importantly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the default CLI path moved to &lt;code&gt;level 3&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That default is a statement about what the project now considers the normal artifact.&lt;/p&gt;

&lt;p&gt;The normal artifact is no longer the brief scan.&lt;br&gt;
It is the full evidence packet.&lt;/p&gt;
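&lt;p&gt;The packet model is small enough to write down directly. A Python sketch follows, where the level numbers and page counts come from the list above and the names &lt;code&gt;PacketLevel&lt;/code&gt;, &lt;code&gt;PAGES&lt;/code&gt;, and &lt;code&gt;resolve_level&lt;/code&gt; are hypothetical, not the tool's internals.&lt;/p&gt;

```python
from enum import IntEnum


class PacketLevel(IntEnum):
    BRIEF = 1
    STANDARD = 2
    FULL = 3


# Page counts as described above: 1p, 5p, 7p.
PAGES = {PacketLevel.BRIEF: 1, PacketLevel.STANDARD: 5, PacketLevel.FULL: 7}

DEFAULT_LEVEL = PacketLevel.FULL  # the default path lands on the full packet


def resolve_level(requested=None):
    """Return the packet level to render; None means no level was given."""
    return DEFAULT_LEVEL if requested is None else PacketLevel(requested)


if __name__ == "__main__":
    level = resolve_level()
    print(level.name, PAGES[level])
```

&lt;p&gt;Passing an unknown level raises &lt;code&gt;ValueError&lt;/code&gt; from the enum constructor, so a packet tier cannot quietly fall back to a different size.&lt;/p&gt;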


&lt;h3&gt;
  
  
  Why that matters
&lt;/h3&gt;

&lt;p&gt;This is where the project moved from "scanner" toward "audit architecture."&lt;/p&gt;

&lt;p&gt;A scanner can stop at a result.&lt;/p&gt;

&lt;p&gt;An audit architecture has to preserve meaning across surfaces.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON must be canonical&lt;/li&gt;
&lt;li&gt;HTML must be navigable&lt;/li&gt;
&lt;li&gt;PDFs must honor real packet boundaries&lt;/li&gt;
&lt;li&gt;and the same warning semantics must survive in all of them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why these changes matter to developers.&lt;/p&gt;

&lt;p&gt;They are part of the correctness story.&lt;/p&gt;

&lt;p&gt;If the &lt;code&gt;why&lt;/code&gt; disappears when the result becomes a report, the audit object was never complete to begin with.&lt;/p&gt;
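&lt;p&gt;As a rough illustration of the "why" surviving a surface change, here is a Python sketch that renders one covered AIRI row from its JSON shape into Markdown. The &lt;code&gt;to_markdown&lt;/code&gt; function and its output layout are hypothetical. The row values are copied from the JSON example earlier in this post.&lt;/p&gt;

```python
def to_markdown(row):
    """Render one covered AIRI row as Markdown, keeping the trigger reason."""
    lines = [f"### AIRI {row['id']}: {row['title']}"]
    for detail in row.get("mapping_details", []):
        lines.append(f"- {detail['detector_id']}: {detail['trigger_reason']}")
    return "\n".join(lines)


ROW = {
    "id": "24.01.03",
    "title": "Safe exploration problem with widely deployed AI assistants",
    "mapping_details": [
        {
            "detector_id": "C5_compliance_boundary_integrity",
            "trigger_reason": "Unsupported legal/compliance claim surfaced in boundary-integrity lane.",
        }
    ],
}

if __name__ == "__main__":
    print(to_markdown(ROW))
```

&lt;p&gt;The renderer reads only from the canonical object, so the Markdown cannot say more or less than the JSON does about why the row is covered.&lt;/p&gt;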


&lt;h2&gt;
  
  
  The hidden pattern behind all three changes
&lt;/h2&gt;

&lt;p&gt;These releases can look like a mixed bag:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JS manifest support&lt;/li&gt;
&lt;li&gt;legal/compliance claim surfacing&lt;/li&gt;
&lt;li&gt;external dependency risk&lt;/li&gt;
&lt;li&gt;C4/C5/C6 split&lt;/li&gt;
&lt;li&gt;AIRI reasoning&lt;/li&gt;
&lt;li&gt;packet restructuring&lt;/li&gt;
&lt;li&gt;demo/output alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there is a single pattern underneath them:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;the system became less willing to let ambiguity hide inside a convenient surface.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That showed up in three ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;a manifest should count if it exists&lt;/li&gt;
&lt;li&gt;a warning lane should mean one thing&lt;/li&gt;
&lt;li&gt;a risk mapping should explain itself&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That may sound almost obvious.&lt;/p&gt;

&lt;p&gt;But a lot of tools never make it that far.&lt;/p&gt;

&lt;p&gt;They accumulate clever features faster than they reduce ambiguity.&lt;/p&gt;

&lt;p&gt;This line of work did the opposite.&lt;/p&gt;

&lt;p&gt;It made &lt;strong&gt;the system stricter about what its outputs are allowed to imply.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is a more durable path.&lt;/p&gt;


&lt;h2&gt;
  
  
  The more interesting lesson
&lt;/h2&gt;

&lt;p&gt;The most useful thing about &lt;code&gt;1.7.6&lt;/code&gt; through &lt;code&gt;1.7.8&lt;/code&gt; is not that STEM BIO-AI became "smarter."&lt;/p&gt;

&lt;p&gt;It is that it became harder to misread.&lt;/p&gt;

&lt;p&gt;That is a better goal for audit tooling.&lt;/p&gt;

&lt;p&gt;Especially now.&lt;/p&gt;

&lt;p&gt;Because in a world increasingly full of fluent agent outputs, the differentiator is not whether a tool can generate a plausible narrative.&lt;/p&gt;

&lt;p&gt;It is whether the narrative stays tethered to inspectable structure when the repository is messy, cross-stack, overclaimed, or partially misleading.&lt;/p&gt;

&lt;p&gt;That is where this release line got better.&lt;/p&gt;

&lt;p&gt;Not by pretending to know more than it does.&lt;/p&gt;

&lt;p&gt;But by making its own boundaries clearer.&lt;/p&gt;


&lt;h2&gt;
  
  
  What I would tell developers evaluating this line
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoh0x6bamxqa14sn6vog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoh0x6bamxqa14sn6vog.png" alt="What I would tell developers evaluating this line" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you only look at the release notes, you might think:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;better AIRI&lt;/li&gt;
&lt;li&gt;more warnings&lt;/li&gt;
&lt;li&gt;nicer reports&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is true, but too shallow.&lt;/p&gt;

&lt;p&gt;The real changes are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the scanner is less Python-centric than it was&lt;/li&gt;
&lt;li&gt;the warning taxonomy is more semantically stable than it was&lt;/li&gt;
&lt;li&gt;the artifacts are more inspectable than they were&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That combination matters more than any one score change.&lt;/p&gt;

&lt;p&gt;It means the tool is becoming less of a clever repo grader and more of a reliable evidence instrument.&lt;/p&gt;

&lt;p&gt;That is the direction I care about.&lt;/p&gt;

&lt;p&gt;Because once the repository is politically messy, clinically adjacent, or governance-sensitive, "good-enough automation" is not enough.&lt;/p&gt;

&lt;p&gt;The system has to show its work.&lt;/p&gt;

&lt;p&gt;These versions got noticeably better at doing that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6ig153mid6rhvhjnxtj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6ig153mid6rhvhjnxtj.png" alt="A Reiable Edivdence Instrument for the Messy Reality" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;stem-ai
stem /path/to/repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you want the full packet explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default path now lands on the full evidence packet, and that is the point.&lt;/p&gt;

&lt;p&gt;In audit tooling, the serious path should not require an extra flag.&lt;/p&gt;




&lt;h2&gt;
  
  
  See the Artifact
&lt;/h2&gt;

&lt;p&gt;If you want to inspect the actual artifact shape behind this release line, these two public outputs are the best reference:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kbwdk04yt6f8f8q3bqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3kbwdk04yt6f8f8q3bqd.png" alt="stem-bio-ai report" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive HTML report: &lt;a href="https://flamehaven01.github.io/flamehaven-audit-reports/stem-bio-ai/yorkeccak-bio/2026-05-15/report.html" rel="noopener noreferrer"&gt;Open interactive HTML report&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full &lt;code&gt;7p&lt;/code&gt; PDF packet: &lt;a href="https://flamehaven01.github.io/flamehaven-audit-reports/stem-bio-ai/yorkeccak-bio/2026-05-15/report.pdf" rel="noopener noreferrer"&gt;Open full &lt;code&gt;7p&lt;/code&gt; PDF packet&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point of &lt;code&gt;1.7.8&lt;/code&gt; is not just that the scanner scores the repository differently.&lt;/p&gt;

&lt;p&gt;It is that the same result now survives translation into JSON, Markdown, HTML, and a full review packet without losing too much meaning along the way.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>governance</category>
      <category>bioinformatics</category>
      <category>devtools</category>
    </item>
    <item>
      <title>Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 14 May 2026 13:41:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/beyond-repo-scanning-how-airi-expanded-the-risk-vocabulary-in-stem-bio-ai-17x-5bgo</link>
      <guid>https://dev.to/flamehaven01/beyond-repo-scanning-how-airi-expanded-the-risk-vocabulary-in-stem-bio-ai-17x-5bgo</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyj7biyn850iewno8ywf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyj7biyn850iewno8ywf.png" alt="Beyond Repo Scanning" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the second half of the same &lt;code&gt;1.7.x&lt;/code&gt; transition.&lt;/p&gt;

&lt;p&gt;In the previous post, I wrote about calibration governance: how STEM BIO-AI keeps score authority from drifting when users simulate policy posture.&lt;/p&gt;

&lt;p&gt;That was about how the system decides.&lt;/p&gt;

&lt;p&gt;This post is about a different layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how the system speaks about risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A local repository scanner can become trapped inside its own vocabulary.&lt;/p&gt;

&lt;p&gt;It can detect dependency issues, weak provenance language, shallow validation, reproducibility gaps, and risky exception handling.&lt;/p&gt;

&lt;p&gt;But if every finding stays only inside the scanner's internal language, the report may remain too narrow.&lt;/p&gt;

&lt;p&gt;That is the problem AIRI helped address in STEM BIO-AI &lt;code&gt;1.7.x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this context, AIRI is used as a local risk-vocabulary layer built from the MIT AI Risk Repository ecosystem.&lt;/p&gt;

&lt;p&gt;The point is not to replace deterministic repository scanning with an external risk database.&lt;/p&gt;

&lt;p&gt;The point is to give local findings a broader risk vocabulary without turning that vocabulary into a truth claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  Basic AIRI Context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17fvupudpdertnq7dy6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17fvupudpdertnq7dy6i.png" alt="Expanding the language of risk" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airisk.mit.edu/" rel="noopener noreferrer"&gt;The MIT AI Risk Repository&lt;/a&gt; is a public AI risk resource from the MIT AI Risk Initiative.&lt;/p&gt;

&lt;p&gt;It helps organize fragmented AI risk language across research, policy, and industry sources.&lt;/p&gt;

&lt;p&gt;The repository includes three main parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an AI Risk Database&lt;/li&gt;
&lt;li&gt;a Causal Taxonomy of AI Risks&lt;/li&gt;
&lt;li&gt;a Domain Taxonomy of AI Risks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications. The public domain taxonomy organizes risks into 7 domains and 24 subdomains.&lt;/p&gt;

&lt;p&gt;Some of those domain taxonomy nodes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;2. Privacy &amp;amp; Security&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;2.1 Compromise of privacy by obtaining, leaking or correctly inferring sensitive information&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;2.2 AI system security vulnerabilities and attacks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;6.5 Governance failure&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7. AI System Safety, Failures, &amp;amp; Limitations&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;7.4 Lack of transparency or interpretability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That makes AIRI useful as a vocabulary source.&lt;/p&gt;

&lt;p&gt;But vocabulary is not truth.&lt;/p&gt;

&lt;p&gt;A local scanner should not say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this repository caused this risk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It should say something more careful:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this local finding belongs to a broader class of AI risk language.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction is the design boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Problem AIRI Was Meant to Solve
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zurro5671x5iqraftvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7zurro5671x5iqraftvh.png" alt="Local scanners are trapped in their own vocabulary" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM BIO-AI began as a deterministic evidence-surface scanner for bio and medical AI repositories.&lt;/p&gt;

&lt;p&gt;That core remains.&lt;/p&gt;

&lt;p&gt;The scanner looks at observable repository surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README and docs&lt;/li&gt;
&lt;li&gt;code structure&lt;/li&gt;
&lt;li&gt;CI configuration&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;reproducibility signals&lt;/li&gt;
&lt;li&gt;clinical-adjacent boundary language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But once STEM BIO-AI started producing richer audit outputs, a new question appeared:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How should the system talk about the broader risk territory around a detected finding?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a fail-open exception path may have implications beyond code quality&lt;/li&gt;
&lt;li&gt;weak provenance language may connect to reproducibility and trust concerns&lt;/li&gt;
&lt;li&gt;shallow validation around sensitive inputs may point toward a wider harm surface than the repository alone makes obvious&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without a broader vocabulary, those findings remain local and narrow.&lt;/p&gt;

&lt;p&gt;AIRI helps widen the vocabulary without making the scanner less deterministic.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Short Note on Detector Families
&lt;/h2&gt;

&lt;p&gt;In this article, a detector family means a bounded local analysis surface inside STEM BIO-AI.&lt;/p&gt;

&lt;p&gt;It does not mean an AI model judging the repository.&lt;/p&gt;

&lt;p&gt;Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;code integrity detectors such as hardcoded credential or fail-open exception checks&lt;/li&gt;
&lt;li&gt;AST contract detectors such as shallow validator checks&lt;/li&gt;
&lt;li&gt;bio diagnostics such as SMILES parser-guard or silent mock fallback checks&lt;/li&gt;
&lt;li&gt;provenance and reproducibility evidence surfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A detector family produces a local finding.&lt;/p&gt;

&lt;p&gt;The AIRI layer does not replace that finding.&lt;/p&gt;

&lt;p&gt;It gives the finding a broader vocabulary anchor.&lt;/p&gt;




&lt;h2&gt;
  
  
  AIRI Does Not Replace the Scan
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd7zc9lpl0946movl0id.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd7zc9lpl0946movl0id.png" alt="Vocabulary is not truth" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This boundary matters.&lt;/p&gt;

&lt;p&gt;The AIRI layer does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate that a real-world incident happened&lt;/li&gt;
&lt;li&gt;prove that a repository causes a given harm&lt;/li&gt;
&lt;li&gt;turn a detector hit into a clinical danger claim&lt;/li&gt;
&lt;li&gt;replace due diligence or domain review&lt;/li&gt;
&lt;li&gt;override the deterministic score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, it gives the system a structured way to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what broader risk territory a finding may relate to&lt;/li&gt;
&lt;li&gt;which risk vocabulary exists around that class of concern&lt;/li&gt;
&lt;li&gt;where known coverage gaps remain&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why AIRI is a risk-vocabulary layer, not a truth layer.&lt;/p&gt;

&lt;p&gt;If a report says something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;covered risks: 12 / 31
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;that should not be read as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the repository is 38% safe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the scanner covers 38% of all AI risk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better interpretation is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;within the detector scope currently mapped into the curated AIRI runtime layer, this scan triggered findings that connect to these AIRI risk entries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is narrower.&lt;/p&gt;

&lt;p&gt;It is also more useful.&lt;/p&gt;
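
&lt;p&gt;One way to keep that reading honest is to make the rendered line name its own denominator. The sketch below is illustrative only: the helper name &lt;code&gt;describe_coverage&lt;/code&gt; is hypothetical, and it assumes the rate is the covered count divided by the number of risks in mapped detector scope, which the scanner may compute differently.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def describe_coverage(covered_count, risks_in_detector_scope):
    """Render a coverage line that states what the ratio is measured against.

    Hypothetical helper: the denominator choice is an assumption,
    not the scanner's confirmed formula.
    """
    if risks_in_detector_scope == 0:
        return "no AIRI risks are mapped into detector scope"
    rate = covered_count / risks_in_detector_scope
    return (
        f"{covered_count} of {risks_in_detector_scope} AIRI entries in mapped "
        f"detector scope were touched by findings ({rate:.1%}); "
        "this is not a safety percentage"
    )


print(describe_coverage(12, 31))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Putting the scope and the caveat in the same sentence as the number makes it harder to quote the ratio on its own.&lt;/p&gt;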




&lt;h2&gt;
  
  
  From External Repository to Local Governance Layer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fwgfgtfux2bro49eyfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fwgfgtfux2bro49eyfl.png" alt="Three layers of local governance" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AIRI story in STEM BIO-AI changed during &lt;code&gt;1.7.x&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The initial direction was simple: use AIRI to provide broader risk labels around local findings.&lt;/p&gt;

&lt;p&gt;That was useful, but not enough.&lt;/p&gt;

&lt;p&gt;If an audit system relies on an external risk source, it needs governance around that source.&lt;/p&gt;

&lt;p&gt;So STEM BIO-AI separates AIRI into three local layers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Local layer&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;airi_registry_full.v1.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;normalized full local registry derived from the upstream AIRI snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;airi_runtime_bundle.v1.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;curated runtime subset used by deterministic scans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;airi_detector_mapping.v1.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;detector-to-risk mapping registry plus known-gap records&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This separation prevents a common mistake:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;confusing the full upstream AIRI universe with the smaller curated runtime bundle used by the scanner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The scanner uses the curated runtime bundle, not the entire upstream AIRI universe.&lt;/p&gt;

&lt;p&gt;That keeps runtime outputs deterministic, reviewable, and tied to a known local snapshot.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “Governed” Means Here
&lt;/h2&gt;

&lt;p&gt;In the current &lt;code&gt;1.7.5&lt;/code&gt; state of the &lt;code&gt;1.7.x&lt;/code&gt; line, “governed” does not mean that every mapping has gone through an external review board.&lt;/p&gt;

&lt;p&gt;It means something narrower and more concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AIRI data is stored as versioned local artifacts&lt;/li&gt;
&lt;li&gt;runtime scan output uses a curated bundle, not the entire upstream universe&lt;/li&gt;
&lt;li&gt;detector mappings are separated from the full registry&lt;/li&gt;
&lt;li&gt;known gaps are recorded as part of the mapping layer&lt;/li&gt;
&lt;li&gt;artifact metadata surfaces AIRI registry, bundle, mapping, snapshot, and license information&lt;/li&gt;
&lt;li&gt;changes to registry, runtime bundle, or mapping versions require explicit version bumps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the current governance level.&lt;/p&gt;

&lt;p&gt;It is not final.&lt;/p&gt;

&lt;p&gt;But it is stronger than attaching a risk dataset as an unversioned appendix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Curation Logic
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8rcpyd6489hb61zceot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8rcpyd6489hb61zceot.png" alt="Curated by exclusion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that matters most.&lt;/p&gt;

&lt;p&gt;AIRI is broad. STEM BIO-AI is narrow.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI does not need every AIRI entry active at runtime. It needs the subset that can be responsibly connected to deterministic repository evidence.&lt;/p&gt;

&lt;p&gt;So the runtime bundle is curated by exclusion as much as inclusion.&lt;/p&gt;

&lt;p&gt;A risk vocabulary node should stay outside the runtime bundle when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No local evidence surface exists&lt;/strong&gt;&lt;br&gt;
The scanner has no repository-level signal that can responsibly connect to that risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The mapping would require causal inference&lt;/strong&gt;&lt;br&gt;
The scanner would have to imply that harm occurred, that users were affected, or that the repository caused a risk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The risk is too broad for repository-local evidence&lt;/strong&gt;&lt;br&gt;
Broad societal, geopolitical, or macroeconomic risks may be important in AIRI, but they should not become runtime scan outputs unless a local detector surface can support the mapping.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The mapping would confuse vocabulary with score authority&lt;/strong&gt;&lt;br&gt;
If a risk label might be read as changing the formal score or certifying danger, it should remain outside the runtime layer until the reporting semantics are clear.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the runtime bundle is not a summary of all AI risk.&lt;/p&gt;

&lt;p&gt;It is the subset of risk vocabulary that the scanner can use responsibly.&lt;/p&gt;
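
&lt;p&gt;The four exclusion criteria above can be sketched as a small predicate. This is a minimal illustration, not the project's curation code: the &lt;code&gt;CandidateRisk&lt;/code&gt; type, its field names, and the example IDs are all hypothetical, and real curation likely involves human judgment that boolean flags cannot capture.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateRisk:
    """Hypothetical record for one AIRI entry under curation review."""
    risk_id: str
    has_local_evidence_surface: bool
    requires_causal_inference: bool
    too_broad_for_repo_evidence: bool
    could_be_read_as_score_authority: bool


def exclusion_reasons(risk):
    """Return every exclusion criterion that applies; an empty list means eligible."""
    reasons = []
    if not risk.has_local_evidence_surface:
        reasons.append("no local evidence surface")
    if risk.requires_causal_inference:
        reasons.append("mapping would require causal inference")
    if risk.too_broad_for_repo_evidence:
        reasons.append("too broad for repository-local evidence")
    if risk.could_be_read_as_score_authority:
        reasons.append("could be confused with score authority")
    return reasons


def curate(candidates):
    """Split candidates into runtime-bundle IDs and excluded IDs with reasons."""
    bundle, excluded = [], {}
    for risk in candidates:
        reasons = exclusion_reasons(risk)
        if reasons:
            excluded[risk.risk_id] = reasons
        else:
            bundle.append(risk.risk_id)
    return bundle, excluded


# Placeholder IDs, not real AIRI identifiers.
candidates = [
    CandidateRisk("example-robustness", True, False, False, False),
    CandidateRisk("example-macroeconomic", False, False, True, False),
]
bundle, excluded = curate(candidates)
print(bundle)
print(excluded)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Recording the reasons for exclusion, rather than only the excluded IDs, keeps the curation decision reviewable later.&lt;/p&gt;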




&lt;h2&gt;
  
  
  Example: Detector Hit to AIRI Domain Vocabulary
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvz0q2fw4vfp5okznynr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvz0q2fw4vfp5okznynr.png" alt="Connecting evidence to context" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concrete example helps.&lt;/p&gt;

&lt;p&gt;Suppose STEM BIO-AI detects a shallow validator around sensitive or clinical-adjacent inputs.&lt;/p&gt;

&lt;p&gt;The local finding might be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CC3_shallow_validator:
validate_* or check_* function uses only length checks without structural validation.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the repository level, this is a code-contract finding.&lt;/p&gt;

&lt;p&gt;It says:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the function appears to validate input&lt;/li&gt;
&lt;li&gt;the validation is shallow&lt;/li&gt;
&lt;li&gt;the implementation may not enforce the boundary implied by its name&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AIRI layer should not turn that into:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this repository caused privacy harm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That would be too strong.&lt;/p&gt;

&lt;p&gt;A safer mapping uses AIRI as vocabulary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Local detector surface&lt;/th&gt;
&lt;th&gt;Local meaning&lt;/th&gt;
&lt;th&gt;AIRI vocabulary anchor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CC3_shallow_validator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;validation function appears shallower than its name implies&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;; possibly &lt;code&gt;2.1 Compromise of privacy...&lt;/code&gt; if sensitive information handling is in scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;fail-open exception path&lt;/td&gt;
&lt;td&gt;code path may silently continue after failure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hardcoded credential signal&lt;/td&gt;
&lt;td&gt;repository surface suggests exposed secret-like pattern&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2.2 AI system security vulnerabilities and attacks&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weak provenance surface&lt;/td&gt;
&lt;td&gt;repository gives weak evidence about data/source traceability&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;7.4 Lack of transparency or interpretability&lt;/code&gt;; possibly &lt;code&gt;6.5 Governance failure&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silent mock fallback&lt;/td&gt;
&lt;td&gt;production-like path may fall back to simulated behavior&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;7.3 Lack of capability or robustness&lt;/code&gt;; &lt;code&gt;7.4 Lack of transparency or interpretability&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mapping does not prove harm.&lt;/p&gt;

&lt;p&gt;It tells the reviewer which broader AIRI vocabulary may be relevant to the local finding.&lt;/p&gt;

&lt;p&gt;That is the difference between:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this detector proves a risk occurred&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this detector finding belongs near this risk-language area.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The second claim is weaker.&lt;/p&gt;

&lt;p&gt;It is also the correct claim.&lt;/p&gt;
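
&lt;p&gt;To make the weaker claim concrete, part of the table above can be expressed as plain data plus a renderer that only ever says "may relate to". This is a sketch under assumptions: the dictionary layout, the key names other than &lt;code&gt;CC3_shallow_validator&lt;/code&gt;, and both helper functions are hypothetical and do not reflect the actual schema of &lt;code&gt;airi_detector_mapping.v1.json&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical in-memory form of a detector-to-vocabulary mapping.
# AIRI labels are copied from the table above, including the elided 2.1 label.
DETECTOR_TO_AIRI = {
    "CC3_shallow_validator": {
        "anchors": ["7.3 Lack of capability or robustness"],
        "if_sensitive_data_in_scope": ["2.1 Compromise of privacy..."],
    },
    "fail_open_exception_path": {
        "anchors": ["7.3 Lack of capability or robustness"],
        "if_sensitive_data_in_scope": [],
    },
    "hardcoded_credential_signal": {
        "anchors": ["2.2 AI system security vulnerabilities and attacks"],
        "if_sensitive_data_in_scope": [],
    },
}


def vocabulary_anchors(detector_id, sensitive_data_in_scope=False):
    """Return the AIRI labels a finding may relate to; unmapped detectors get none."""
    entry = DETECTOR_TO_AIRI.get(detector_id)
    if entry is None:
        return []
    anchors = list(entry["anchors"])
    if sensitive_data_in_scope:
        anchors.extend(entry["if_sensitive_data_in_scope"])
    return anchors


def render(detector_id, sensitive_data_in_scope=False):
    """Phrase the mapping as vocabulary context, never as a harm claim."""
    anchors = vocabulary_anchors(detector_id, sensitive_data_in_scope)
    if not anchors:
        return f"{detector_id}: no AIRI vocabulary anchor mapped"
    return f"{detector_id}: may relate to " + "; ".join(anchors)


print(render("CC3_shallow_validator", sensitive_data_in_scope=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The renderer has no code path that can emit "caused" or "proves", so the hedged wording is enforced by construction rather than by convention.&lt;/p&gt;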




&lt;h2&gt;
  
  
  Why Local Provenance Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5suzlt3jdnbanqzvc6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5suzlt3jdnbanqzvc6e.png" alt="Provenance is not cosmetic" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AIRI is external.&lt;/p&gt;

&lt;p&gt;That means STEM BIO-AI needs to answer governance questions explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which upstream snapshot is being used?&lt;/li&gt;
&lt;li&gt;which subset is active at runtime?&lt;/li&gt;
&lt;li&gt;which risks are included in the curated bundle?&lt;/li&gt;
&lt;li&gt;which risks are known gaps?&lt;/li&gt;
&lt;li&gt;which detector maps to which AIRI entry?&lt;/li&gt;
&lt;li&gt;what version of the mapping is active?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why the AIRI work matters.&lt;/p&gt;

&lt;p&gt;It is not just adding labels.&lt;/p&gt;

&lt;p&gt;It is turning risk vocabulary into a governed local data layer.&lt;/p&gt;

&lt;p&gt;In the current governance note, the upstream source is recorded as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upstream source: &lt;code&gt;https://airisk.mit.edu/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;upstream artifact: &lt;code&gt;The AI Risk Repository V4_03&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;upstream license: &lt;code&gt;MIT&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;local snapshot date: &lt;code&gt;2026-04-23&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That provenance is not cosmetic.&lt;/p&gt;

&lt;p&gt;It allows an audit artifact to say which risk vocabulary it was using when the scan was produced.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Implemented in the Current 1.7.5 State of 1.7.x
&lt;/h2&gt;

&lt;p&gt;The current AIRI layer is implemented, but bounded.&lt;/p&gt;

&lt;p&gt;Implemented surfaces include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AIRI-backed coverage surfaces in scan outputs&lt;/li&gt;
&lt;li&gt;local curated runtime bundle&lt;/li&gt;
&lt;li&gt;local registry and mapping schemas&lt;/li&gt;
&lt;li&gt;detector-to-AIRI mapping layer&lt;/li&gt;
&lt;li&gt;known-gap reporting&lt;/li&gt;
&lt;li&gt;provenance and bundle/source labeling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In current scan results, &lt;code&gt;airi_risk_coverage&lt;/code&gt; is the main artifact surface for this layer.&lt;/p&gt;

&lt;p&gt;The public result contract includes AIRI fields such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;airi_registry_version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_bundle_version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_mapping_version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_bundle_scope&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_upstream_snapshot_date&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;airi_upstream_license&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;total_risks_in_registry&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;total_risks_in_bundle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;total_risks_in_detector_scope&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;detectors_triggered&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;covered_risks&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;covered_count&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;coverage_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;known_gaps_in_bundle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;known_gaps_outside_bundle&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These fields matter because they let a reviewer distinguish three things that are easy to confuse: the upstream AIRI source, the local runtime bundle, and the detector mapping actually used by the scan.&lt;/p&gt;

&lt;p&gt;The important part is not only that these fields exist.&lt;/p&gt;

&lt;p&gt;The important part is that AIRI usage becomes auditable from the artifact itself.&lt;/p&gt;

&lt;p&gt;If two scans use different AIRI snapshots or mappings, that difference should not be hidden.&lt;/p&gt;
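
&lt;p&gt;A reviewer-side check for that could look like the sketch below. It assumes the provenance fields listed above sit as keys inside an &lt;code&gt;airi_risk_coverage&lt;/code&gt; object in each scan result, which may not match the real artifact layout. The function name &lt;code&gt;provenance_diff&lt;/code&gt; and the example version strings are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Field names are taken from the result contract listed above.
PROVENANCE_FIELDS = (
    "airi_registry_version",
    "airi_bundle_version",
    "airi_mapping_version",
    "airi_bundle_scope",
    "airi_upstream_snapshot_date",
    "airi_upstream_license",
)


def provenance_diff(scan_a, scan_b):
    """Return provenance fields whose values differ, as {field: (a_value, b_value)}.

    Assumes each scan result nests AIRI fields under "airi_risk_coverage".
    """
    cov_a = scan_a.get("airi_risk_coverage", {})
    cov_b = scan_b.get("airi_risk_coverage", {})
    return {
        field: (cov_a.get(field), cov_b.get(field))
        for field in PROVENANCE_FIELDS
        if cov_a.get(field) != cov_b.get(field)
    }


# Illustrative values only; the version strings are made up.
older = {"airi_risk_coverage": {"airi_mapping_version": "example-1",
                                "airi_upstream_snapshot_date": "2026-04-23"}}
newer = {"airi_risk_coverage": {"airi_mapping_version": "example-2",
                                "airi_upstream_snapshot_date": "2026-04-23"}}
print(provenance_diff(older, newer))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An empty result means the two scans used the same vocabulary basis for these fields. A non-empty result means their coverage numbers should not be compared directly.&lt;/p&gt;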




&lt;h2&gt;
  
  
  Coverage Is Not a Safety Percentage
&lt;/h2&gt;

&lt;p&gt;AIRI coverage in STEM BIO-AI is an audit-surface concept, not a safety percentage.&lt;/p&gt;

&lt;p&gt;It does not mean:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the repository is safe&lt;/li&gt;
&lt;li&gt;the repository is unsafe&lt;/li&gt;
&lt;li&gt;the scanner covers all AI risk&lt;/li&gt;
&lt;li&gt;the covered percentage is a compliance score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a local deterministic finding has been mapped to a known risk-vocabulary entry inside the curated AIRI runtime layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is useful because it gives reviewers a wider frame.&lt;/p&gt;

&lt;p&gt;But it does not turn local evidence into a global safety claim.&lt;/p&gt;

&lt;p&gt;This is the same discipline used elsewhere in STEM BIO-AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring is not clinical validation&lt;/li&gt;
&lt;li&gt;advisory interpretation is not scoring authority&lt;/li&gt;
&lt;li&gt;reproducibility evidence is not automatic score authority&lt;/li&gt;
&lt;li&gt;AIRI coverage is not a safety percentage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each layer has a role.&lt;/p&gt;

&lt;p&gt;Each layer has a boundary.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed in 1.7.x
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;1.7.x&lt;/code&gt; AIRI story is not simply “we added AIRI.”&lt;/p&gt;

&lt;p&gt;The actual change was a move from loose risk labeling toward governed local vocabulary.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.0
&lt;/h3&gt;

&lt;p&gt;AIRI V4 integration appeared in scan outputs.&lt;/p&gt;

&lt;p&gt;The scanner began producing an &lt;code&gt;airi_risk_coverage&lt;/code&gt; section that maps triggered detector findings to AIRI risk IDs and reports the coverage rate and known gaps.&lt;/p&gt;

&lt;p&gt;The same release also introduced Layer 2 AST contract detectors such as &lt;code&gt;CC1&lt;/code&gt;, &lt;code&gt;CC2&lt;/code&gt;, and &lt;code&gt;CC3&lt;/code&gt;, which expanded the local detector surface available for risk-vocabulary mapping.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.1
&lt;/h3&gt;

&lt;p&gt;AIRI became a governed local data layer.&lt;/p&gt;

&lt;p&gt;The architecture separated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;full local registry&lt;/li&gt;
&lt;li&gt;curated runtime bundle&lt;/li&gt;
&lt;li&gt;detector mapping registry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This release also replaced hardcoded AIRI detector mappings and known-gap lists with packaged local registry files.&lt;/p&gt;

&lt;p&gt;Runtime outputs began surfacing the registry version, bundle version, mapping version, upstream snapshot date, license, and attribution note. Known gaps were split into &lt;code&gt;known_gaps_in_bundle&lt;/code&gt; and &lt;code&gt;known_gaps_outside_bundle&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.2
&lt;/h3&gt;

&lt;p&gt;No major AIRI architecture change.&lt;/p&gt;

&lt;p&gt;The important governance point was regression stability: a same-target self-scan comparison verified no drift in &lt;code&gt;airi_risk_coverage&lt;/code&gt;, nor in score, tier, code contract, detector summary, or evidence ledger count.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.3
&lt;/h3&gt;

&lt;p&gt;No major AIRI architecture change.&lt;/p&gt;

&lt;p&gt;The release focused on runtime cleanup, fixes to stale demo wording, layout stabilization, and output routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.4
&lt;/h3&gt;

&lt;p&gt;AIRI presentation became clearer across demo and report outputs.&lt;/p&gt;

&lt;p&gt;The release surfaced AIRI summary material more clearly across the Hugging Face overview card and markdown/explain report sections.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.7.5
&lt;/h3&gt;

&lt;p&gt;No new AIRI data architecture change.&lt;/p&gt;

&lt;p&gt;But artifact-level governance improved more broadly through additive evidence-ledger quality fields and audit-freshness metadata.&lt;/p&gt;

&lt;p&gt;That matters because AIRI is most useful when it lives inside a report surface that already carries freshness, evidence quality, and provenance signals.&lt;/p&gt;

&lt;p&gt;The important change across the line is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AIRI moved from an attached dataset toward a versioned local risk-vocabulary layer.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What This Still Does Not Do
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F455nj5o0eixcl5y3g2ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F455nj5o0eixcl5y3g2ho.png" alt="Local evidence first, external vocabulary second" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AIRI layer still does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;verify real incidents&lt;/li&gt;
&lt;li&gt;prove causality&lt;/li&gt;
&lt;li&gt;certify repository safety&lt;/li&gt;
&lt;li&gt;replace domain review&lt;/li&gt;
&lt;li&gt;turn AIRI categories into deterministic truth claims&lt;/li&gt;
&lt;li&gt;collapse the full upstream AIRI universe into the runtime scanner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not missing features.&lt;/p&gt;

&lt;p&gt;They are the boundaries that keep the layer useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Could Go
&lt;/h2&gt;

&lt;p&gt;The next useful direction is not to overload the scanner with external systems.&lt;/p&gt;

&lt;p&gt;It is to improve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;registry provenance&lt;/li&gt;
&lt;li&gt;bundle governance&lt;/li&gt;
&lt;li&gt;mapping confidence&lt;/li&gt;
&lt;li&gt;known-gap clarity&lt;/li&gt;
&lt;li&gt;artifact-visible mapping metadata&lt;/li&gt;
&lt;li&gt;disciplined links to incident-oriented resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The broader MIT AIRI ecosystem also includes related incident-oriented resources such as the AI Incident Tracker.&lt;/p&gt;

&lt;p&gt;That ecosystem is relevant context, but it is not the same thing as current runtime integration in STEM BIO-AI.&lt;/p&gt;

&lt;p&gt;A future version may choose to reference incident-oriented resources more explicitly, but deterministic scans should not ingest them casually or blur them with repository-local findings.&lt;/p&gt;

&lt;p&gt;A future version should be able to say not only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this detector maps to this AIRI risk vocabulary area.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;this mapping has this confidence level, this review status, this local evidence family, and this known limitation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the next governance step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtlt5pk3yg79aconki1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtlt5pk3yg79aconki1l.png" alt="A governed bridge for STEM BIO-AI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the role of AIRI in this release line.&lt;/p&gt;

&lt;p&gt;Not truth replacement.&lt;/p&gt;

&lt;p&gt;Not safety certification.&lt;/p&gt;

&lt;p&gt;Not incident proof.&lt;/p&gt;

&lt;p&gt;A governed vocabulary bridge.&lt;/p&gt;

&lt;p&gt;Local evidence first.&lt;/p&gt;

&lt;p&gt;External vocabulary second.&lt;/p&gt;

&lt;p&gt;Explicit provenance always.&lt;/p&gt;




&lt;h2&gt;
  
  
  References and Acknowledgment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;MIT AI Risk Repository: &lt;a href="https://airisk.mit.edu/" rel="noopener noreferrer"&gt;https://airisk.mit.edu/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MIT AI Incident Tracker: &lt;a href="https://airisk.mit.edu/ai-incident-tracker" rel="noopener noreferrer"&gt;https://airisk.mit.edu/ai-incident-tracker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;STEM BIO-AI repository: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This AIRI-related direction in STEM BIO-AI was informed by broader public AI risk work, including the MIT AI Risk Repository ecosystem.&lt;/p&gt;

&lt;p&gt;The framing around AIRI as a broader risk-vocabulary layer, rather than a repository-local truth layer, was also strengthened by public commentary and ecosystem work from people in this space, including Peter Slattery, PhD.&lt;/p&gt;

&lt;p&gt;These references informed the vocabulary and governance direction described here. They do not imply endorsement of STEM BIO-AI or responsibility for its implementation choices.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>bioinformatics</category>
      <category>opensource</category>
    </item>
    <item>
      <title>When Control Becomes Authority: Calibration Governance in STEM BIO-AI 1.7.x</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 14 May 2026 05:41:41 +0000</pubDate>
      <link>https://dev.to/flamehaven01/when-control-becomes-authority-calibration-governance-in-stem-bio-ai-17x-52hf</link>
      <guid>https://dev.to/flamehaven01/when-control-becomes-authority-calibration-governance-in-stem-bio-ai-17x-52hf</guid>
      <description>&lt;p&gt;Control slowly becomes authority when nobody marks the boundary.&lt;/p&gt;

&lt;p&gt;That is the calibration problem I kept running into while building STEM BIO-AI.&lt;/p&gt;

&lt;p&gt;At first, STEM BIO-AI was centered on the score. It scanned a local bio or medical AI repository, inspected observable repository surfaces, and mapped the repository to a structured review tier.&lt;/p&gt;

&lt;p&gt;That was useful.&lt;/p&gt;

&lt;p&gt;But it was not enough.&lt;/p&gt;

&lt;p&gt;The harder problem was not producing a number. The harder problem was preventing every useful adjacent signal from becoming part of that number.&lt;/p&gt;

&lt;p&gt;In a bio/medical AI repository review system, several lanes can look similar if the tool is not careful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic scoring&lt;/li&gt;
&lt;li&gt;diagnostic findings&lt;/li&gt;
&lt;li&gt;replication evidence&lt;/li&gt;
&lt;li&gt;advisory interpretation&lt;/li&gt;
&lt;li&gt;domain-specific review posture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They all matter.&lt;/p&gt;

&lt;p&gt;But they should not all have the same authority.&lt;/p&gt;

&lt;p&gt;That is the core reason calibration became a governance problem in the &lt;code&gt;1.7.x&lt;/code&gt; line.&lt;/p&gt;

&lt;p&gt;The principle is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;easy experimentation, hard drift&lt;/strong&gt;&lt;br&gt;
STEM BIO-AI should let researchers express review posture. It should let operators simulate policy changes. It should make policy metadata visible in artifacts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But it should not let those inputs silently mutate the official score.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Short Context for New Readers
&lt;/h2&gt;

&lt;p&gt;STEM BIO-AI is a deterministic evidence-surface scanner for bio and medical AI repositories.&lt;/p&gt;

&lt;p&gt;It does not validate biomedical efficacy. It does not certify clinical safety. It does not prove that a model is correct.&lt;/p&gt;

&lt;p&gt;It scans observable repository surfaces such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README and docs&lt;/li&gt;
&lt;li&gt;code structure&lt;/li&gt;
&lt;li&gt;CI configuration&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;evidence and boundary language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The formal score is currently built from three weighted score-bearing stages, plus an explicit credential penalty and clinical cap or hard-floor logic:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Stage 1&lt;/td&gt;
&lt;td&gt;README / stated evidence boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 2R&lt;/td&gt;
&lt;td&gt;repo-local consistency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 3&lt;/td&gt;
&lt;td&gt;code and bio-responsibility surface&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The active formula also applies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;C1_penalty&lt;/code&gt; when hardcoded credentials are detected&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;score_cap&lt;/code&gt; or &lt;code&gt;t0_hard_floor&lt;/code&gt; when clinical-adjacent boundary rules require it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stage 4 exists, but it is a separate replication lane. It reports reproducibility and replication posture without automatically changing the formal score.&lt;/p&gt;

&lt;p&gt;That separation is intentional.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is Actually Implemented in the Current 1.7.5 State of 1.7.x
&lt;/h2&gt;

&lt;p&gt;Before discussing calibration philosophy, the implementation boundary has to be clear.&lt;/p&gt;

&lt;p&gt;In the current &lt;code&gt;1.7.5&lt;/code&gt; state of the &lt;code&gt;1.7.x&lt;/code&gt; line, STEM BIO-AI has implemented a real calibration architecture, but it is still mostly a mirror-only and preview-oriented architecture.&lt;/p&gt;

&lt;p&gt;This post describes the current released state of the &lt;code&gt;1.7.x&lt;/code&gt; line as of &lt;code&gt;v1.7.5&lt;/code&gt;, not a future authoritative-read-through design.&lt;/p&gt;

&lt;p&gt;Implemented surfaces include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;packaged calibration profiles&lt;/li&gt;
&lt;li&gt;schema and runtime validation&lt;/li&gt;
&lt;li&gt;profile identity surfaced in result metadata&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy list&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy explain&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy derive&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stem policy simulate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;simulation-only local profile files&lt;/li&gt;
&lt;li&gt;profile hashes and read-mode metadata in artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The current named recommendation surface is intentionally narrow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;default&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;strict_clinical_adjacency&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;reproducibility_first&lt;/code&gt; is still a draft posture, not an active release-grade named recommendation.&lt;/p&gt;

&lt;p&gt;The important limitation is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the authoritative scan scoring path is still protected from arbitrary user-provided profile mutation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In other words, &lt;code&gt;scan --policy &amp;lt;name&amp;gt;&lt;/code&gt; can surface selected profile metadata. &lt;code&gt;policy derive&lt;/code&gt; and &lt;code&gt;policy simulate&lt;/code&gt; can show governed preview behavior. But user-provided profile files do not simply become the official scoring authority.&lt;/p&gt;

&lt;p&gt;More specifically, local profile files are currently accepted only by &lt;code&gt;stem policy simulate&lt;/code&gt;, and the CLI rejects them unless the file remains &lt;code&gt;mirror_only&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is not a missing convenience.&lt;/p&gt;

&lt;p&gt;That is the boundary being tested before it is allowed to become authority.&lt;/p&gt;
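&lt;p&gt;A minimal sketch of that guard, assuming a local profile is a JSON object carrying a &lt;code&gt;read_mode&lt;/code&gt; field. The field name and the &lt;code&gt;load_simulation_profile&lt;/code&gt; helper are illustrative assumptions, not the actual STEM BIO-AI implementation:&lt;/p&gt;

```python
import json


def load_simulation_profile(text: str) -> dict:
    """Parse a local profile and refuse anything that is not mirror_only.

    Hypothetical helper: the "read_mode" key is an assumed field name.
    """
    profile = json.loads(text)
    if not isinstance(profile, dict):
        raise ValueError("profile must be a JSON object")
    if profile.get("read_mode") != "mirror_only":
        # A local file may be simulated, but it may not claim authority.
        raise ValueError("local profile files must remain mirror_only")
    return profile
```

&lt;p&gt;The point of the sketch is the shape of the check: a local file is rejected at load time when it claims anything other than &lt;code&gt;mirror_only&lt;/code&gt;, so it never reaches the scoring path.&lt;/p&gt;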




&lt;h2&gt;
  
  
  The Pressure That Causes Drift
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lcdsepmkhe1s4e9or9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1lcdsepmkhe1s4e9or9z.png" alt="Formal score and advisory tuning drift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One question pushed this design forward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If advisory AI becomes more capable, will teams really keep the boundary between formal score and advisory interpretation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I do not think the answer is automatically yes.&lt;/p&gt;

&lt;p&gt;If an advisory layer becomes helpful, there will always be pressure to let it influence the formal score "just a little."&lt;/p&gt;

&lt;p&gt;That is usually how audit systems drift.&lt;/p&gt;

&lt;p&gt;The score stops being a stable artifact and starts becoming a moving interpretation layer.&lt;/p&gt;

&lt;p&gt;The danger is not that users want control.&lt;/p&gt;

&lt;p&gt;The danger is that control slowly becomes authority without anyone noticing.&lt;/p&gt;

&lt;p&gt;So the design question is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we let people tune the system more freely?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The design question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we let people express domain judgment without making the formal score easy to mutate?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is where calibration enters.&lt;/p&gt;


&lt;h2&gt;
  
  
  Calibration Is Not a Tuning Console
&lt;/h2&gt;

&lt;p&gt;The wrong calibration UX looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_1_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2r_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_3_percent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ca_no_disclaimer_cap"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;61&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"b2_partial_credit_mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"looser"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is editable.&lt;/p&gt;

&lt;p&gt;But editable is not the same as governed.&lt;/p&gt;

&lt;p&gt;Most researchers, operators, and domain reviewers do not think in raw score constants. They usually know something closer to this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clinical-adjacent claims should be treated very strictly&lt;/li&gt;
&lt;li&gt;reproducibility matters strongly in this environment&lt;/li&gt;
&lt;li&gt;README polish should not outweigh code evidence&lt;/li&gt;
&lt;li&gt;a casual mention of "limitations" should not count as meaningful transparency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the current calibration design starts with posture questions, not raw constants.&lt;/p&gt;

&lt;p&gt;The goal is not to ask a researcher to become a scoring-engine maintainer.&lt;/p&gt;

&lt;p&gt;The goal is to let a researcher express domain posture while keeping the formal scoring boundary visible, versioned, and difficult to mutate accidentally.&lt;/p&gt;




&lt;h2&gt;
  
  
  The &lt;code&gt;1–5&lt;/code&gt; Scale Is Input, Not Authority
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda62dggwxf9j1962dh7k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda62dggwxf9j1962dh7k.png" alt="Posture over raw constants" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the current design, the user-facing intent layer uses a &lt;code&gt;1–5&lt;/code&gt; scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; = minimal emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2&lt;/code&gt; = light emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;3&lt;/code&gt; = moderate emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;4&lt;/code&gt; = strong emphasis&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;5&lt;/code&gt; = very strong emphasis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important line is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the &lt;code&gt;1–5&lt;/code&gt; scale is a UX input surface, not part of the formal score engine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means the user can express posture in a natural way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clinical strictness&lt;/li&gt;
&lt;li&gt;code-integrity priority&lt;/li&gt;
&lt;li&gt;reproducibility priority&lt;/li&gt;
&lt;li&gt;structured limitations requirement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But those answers do not directly become score constants.&lt;/p&gt;

&lt;p&gt;They are translated through explicit rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wha5pt8fn2grrequosz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wha5pt8fn2grrequosz.png" alt="Governing decision matrix" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The current decision table is intentionally narrow:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;th&gt;Outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;clinical_strictness &amp;gt;= 4&lt;/code&gt; and &lt;code&gt;reproducibility_priority &amp;lt;= 3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;recommend &lt;code&gt;strict_clinical_adjacency&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all four values are &lt;code&gt;2&lt;/code&gt; or &lt;code&gt;3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;keep &lt;code&gt;default&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no named-profile rule matches&lt;/td&gt;
&lt;td&gt;generate a &lt;code&gt;preview_only&lt;/code&gt; profile built only from bounded deltas&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This table should not be mistaken for an empirically optimized model.&lt;/p&gt;

&lt;p&gt;It is a conservative governance rule table.&lt;/p&gt;

&lt;p&gt;The current threshold choices are design-steward decisions, not claims of statistical optimality. Their purpose is to keep the translation layer narrow, reviewable, and non-authoritative until a stronger benchmark-backed promotion process exists.&lt;/p&gt;

&lt;p&gt;That matters because a calibration system can fail in two opposite ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it can be too rigid for domain experts to use&lt;/li&gt;
&lt;li&gt;it can be so flexible that every local preference becomes a new score&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The initial rule table chooses the safer failure mode.&lt;/p&gt;

&lt;p&gt;If a posture is clearly within an existing release-grade profile, the system can recommend that profile. If the posture is ambiguous or combines competing priorities, the system falls back to &lt;code&gt;preview_only&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clinical_strictness = 4
reproducibility_priority = 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That does not automatically recommend &lt;code&gt;strict_clinical_adjacency&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It falls back to &lt;code&gt;preview_only&lt;/code&gt;, because two strong postures are competing and no release-grade named profile currently resolves that conflict.&lt;/p&gt;

&lt;p&gt;A hidden similarity function might produce something that looks more flexible.&lt;/p&gt;

&lt;p&gt;But it would also make the governance harder to audit.&lt;/p&gt;

&lt;p&gt;A narrow rule table is less magical.&lt;/p&gt;

&lt;p&gt;It is also safer.&lt;/p&gt;
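&lt;p&gt;The rule table above is small enough to sketch directly. The following is a hedged illustration, not the shipped code: the &lt;code&gt;Posture&lt;/code&gt; class and &lt;code&gt;recommend&lt;/code&gt; function are hypothetical names, and only the three rules and the four posture dimensions come from this article.&lt;/p&gt;

```python
from dataclasses import dataclass

SCALE = range(1, 6)  # the 1-5 posture input scale


@dataclass(frozen=True)
class Posture:
    clinical_strictness: int
    code_integrity_priority: int
    reproducibility_priority: int
    structured_limitations_requirement: int

    def values(self) -> tuple:
        return (
            self.clinical_strictness,
            self.code_integrity_priority,
            self.reproducibility_priority,
            self.structured_limitations_requirement,
        )


def recommend(posture: Posture) -> str:
    """Translate posture answers into a profile outcome via explicit rules."""
    if any(v not in SCALE for v in posture.values()):
        raise ValueError("posture values must be integers from 1 to 5")
    # Rule 1: strict clinical posture with no competing reproducibility posture.
    if posture.clinical_strictness in (4, 5) and posture.reproducibility_priority in (1, 2, 3):
        return "strict_clinical_adjacency"
    # Rule 2: every answer is in the moderate band.
    if all(v in (2, 3) for v in posture.values()):
        return "default"
    # Fallback: no named-profile rule matched.
    return "preview_only"
```

&lt;p&gt;Under this sketch, &lt;code&gt;clinical_strictness = 4&lt;/code&gt; with &lt;code&gt;reproducibility_priority = 4&lt;/code&gt; reaches the fallback branch, which mirrors the competing-posture example above. Every outcome can be traced to one visible rule, which is the property a similarity function would give up.&lt;/p&gt;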




&lt;h2&gt;
  
  
  What the CLI Is Allowed to Do
&lt;/h2&gt;

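&lt;p&gt;&lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4mi9izlgqwmzhb62e7g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4mi9izlgqwmzhb62e7g.png" alt="Easy experimentation, hard drift: sandbox and vault"&gt;&lt;/a&gt;&lt;/p&gt;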

&lt;p&gt;The preview workflow can look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem policy derive &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--clinical-strictness&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--code-integrity-priority&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reproducibility-priority&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--structured-limitations-requirement&lt;/span&gt; 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem policy simulate /path/to/repo &lt;span class="nt"&gt;--profile-file&lt;/span&gt; my_profile.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But those flows are not the same as saying:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--stage1-weight&lt;/span&gt; 0.35 &lt;span class="nt"&gt;--cap&lt;/span&gt; 72
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first two are governed preview surfaces.&lt;/p&gt;

&lt;p&gt;The last one is an untracked tuning console.&lt;/p&gt;

&lt;p&gt;The design intentionally supports the first and rejects the shape of the last.&lt;/p&gt;

&lt;p&gt;This is the practical meaning of easy experimentation, hard drift.&lt;/p&gt;
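&lt;p&gt;One way to picture that boundary is an argument parser that only knows the four posture flags. This is a sketch using Python's standard &lt;code&gt;argparse&lt;/code&gt;, not the real &lt;code&gt;stem&lt;/code&gt; CLI source; &lt;code&gt;build_derive_parser&lt;/code&gt; and &lt;code&gt;try_parse&lt;/code&gt; are hypothetical helpers.&lt;/p&gt;

```python
import argparse

# Flag names taken from the derive example above; everything else is assumed.
POSTURE_FLAGS = (
    "--clinical-strictness",
    "--code-integrity-priority",
    "--reproducibility-priority",
    "--structured-limitations-requirement",
)


def build_derive_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="stem policy derive")
    for flag in POSTURE_FLAGS:
        # Only bounded 1-5 posture answers are accepted, never raw constants.
        parser.add_argument(flag, type=int, choices=range(1, 6), required=True)
    return parser


def try_parse(argv):
    """Return parsed posture values, or None when argparse rejects the input."""
    try:
        return vars(build_derive_parser().parse_args(argv))
    except SystemExit:
        # argparse exits on unknown flags and out-of-range values.
        return None
```

&lt;p&gt;Because no weight or cap option is ever registered, a flag such as &lt;code&gt;--stage1-weight&lt;/code&gt; is rejected by the parser itself rather than by a later policy check.&lt;/p&gt;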




&lt;h2&gt;
  
  
  What Actually Gets Verified
&lt;/h2&gt;

&lt;p&gt;The central claim of this design is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;the current calibration rules are perfect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The claim is narrower:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;calibration changes should not become score authority without a visible governance path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That claim can be tested by checking whether the system exposes or blocks the relevant control surfaces.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Drift risk&lt;/th&gt;
&lt;th&gt;Expected control&lt;/th&gt;
&lt;th&gt;How to verify it&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;arbitrary score tuning&lt;/td&gt;
&lt;td&gt;no free-form CLI weight / cap override&lt;/td&gt;
&lt;td&gt;CLI help and accepted options do not expose direct score constants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;hidden profile mutation&lt;/td&gt;
&lt;td&gt;profile status and read mode are surfaced&lt;/td&gt;
&lt;td&gt;result artifacts expose profile metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;unclear profile identity&lt;/td&gt;
&lt;td&gt;profile name, version, and hash are visible&lt;/td&gt;
&lt;td&gt;scan output includes calibration profile identity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;advisory influence leakage&lt;/td&gt;
&lt;td&gt;advisory output cannot override score&lt;/td&gt;
&lt;td&gt;advisory response validation cannot mutate &lt;code&gt;final_score&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reproducibility overcompensation&lt;/td&gt;
&lt;td&gt;Stage 4 remains separate&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;replication_score&lt;/code&gt; does not change &lt;code&gt;formal_tier&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;premature named-profile expansion&lt;/td&gt;
&lt;td&gt;ambiguous postures fall back to preview&lt;/td&gt;
&lt;td&gt;derive/simulate returns &lt;code&gt;preview_only&lt;/code&gt; when no named rule matches&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;detector promotion drift&lt;/td&gt;
&lt;td&gt;evidence-only detectors are not score-authoritative&lt;/td&gt;
&lt;td&gt;detector policy is versioned in policy files and governance docs, even though per-detector score-integration status is not yet surfaced as first-class artifact metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is still not the same as a full empirical benchmark.&lt;/p&gt;

&lt;p&gt;But it is a real verification target.&lt;/p&gt;

&lt;p&gt;The system can be checked for whether it allows the forbidden mutation path.&lt;/p&gt;

&lt;p&gt;That is the level of proof appropriate for this release line: not "the final policy is optimal," but "the policy cannot quietly become authoritative without leaving a trace."&lt;/p&gt;

&lt;p&gt;That trace is stronger for some surfaces than others. Profile identity, hash, and read mode are already artifact-visible in &lt;code&gt;1.7.5&lt;/code&gt;. Detector promotion semantics are already versioned and documented, but they are not yet surfaced as first-class per-detector policy metadata in the result object.&lt;/p&gt;
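&lt;p&gt;To make the "leaving a trace" idea concrete, here is a hedged sketch of a profile identity record. The canonical-JSON plus SHA-256 scheme and the field names are assumptions chosen for illustration; the article only states that profile name, version, hash, and read mode are visible in artifacts.&lt;/p&gt;

```python
import hashlib
import json


def profile_hash(profile: dict) -> str:
    """Hash a profile over a canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(profile, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def profile_identity(profile: dict) -> dict:
    """Build the identity block an artifact could carry (assumed field names)."""
    return {
        "name": profile["name"],
        "version": profile["version"],
        "read_mode": profile["read_mode"],
        "hash": profile_hash(profile),
    }
```

&lt;p&gt;With a scheme like this, reordering keys leaves the hash unchanged, while editing any value produces a different hash, so a quiet profile mutation shows up as an identity mismatch in the artifact.&lt;/p&gt;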




&lt;h2&gt;
  
  
  The B2 Tightening Example
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxrlajsvmqp5jiurvatd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxrlajsvmqp5jiurvatd.png" alt="Deterministic boundary changes in B2 tightening" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The clearest scoring example is Stage 3 B2.&lt;/p&gt;

&lt;p&gt;B2 is the bias and limitations measurement surface. Earlier scoring behavior allowed a weaker boundary: a simple vocabulary-level signal could still receive partial credit.&lt;/p&gt;

&lt;p&gt;That became too permissive.&lt;/p&gt;

&lt;p&gt;A repository that mentions "bias" or "limitations" once is not necessarily disclosing a meaningful boundary. It may only be surface signaling.&lt;/p&gt;

&lt;p&gt;So the B2 rule became stricter.&lt;/p&gt;

&lt;p&gt;The important change is not a marketing claim about benchmark improvement. The important change is a deterministic boundary change:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Earlier posture&lt;/th&gt;
&lt;th&gt;Tightened posture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no bias / limitations vocabulary&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;minimal single-term mention only&lt;/td&gt;
&lt;td&gt;partial credit possible&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;structured limitations language&lt;/td&gt;
&lt;td&gt;partial credit possible&lt;/td&gt;
&lt;td&gt;partial credit possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;quantitative measurement evidence&lt;/td&gt;
&lt;td&gt;full credit possible&lt;/td&gt;
&lt;td&gt;full credit possible&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the first place where calibration becomes visible as more than a principle.&lt;/p&gt;

&lt;p&gt;The rule change creates a concrete score path difference:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a repository that previously depended only on a minimal single-term limitations mention no longer has a B2 partial-credit path after the tightening.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the current public claim.&lt;/p&gt;

&lt;p&gt;I am not presenting a benchmark-wide before/after score delta here, because that would require a pinned fixture set and published comparison protocol.&lt;/p&gt;

&lt;p&gt;Without that, a claimed "T3 became T2" example would be anecdotal at best and misleading at worst.&lt;/p&gt;

&lt;p&gt;So the honest evidence level is rule-level impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the credit path changed&lt;/li&gt;
&lt;li&gt;the changed path is deterministic&lt;/li&gt;
&lt;li&gt;the changed path is inspectable&lt;/li&gt;
&lt;li&gt;benchmark-level deltas should be published only when the fixture protocol is pinned&lt;/li&gt;
&lt;/ul&gt;
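&lt;p&gt;Rule-level impact can be expressed as a diff between two rule tables. The sketch below encodes the B2 table above with symbolic credit labels rather than point values, since no numeric constants are given here; the case keys and the &lt;code&gt;changed_paths&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
# Maximum credit path per evidence case, as symbolic labels (no point values).
EARLIER_B2 = {
    "no_vocabulary": "zero",
    "single_term_mention": "partial",
    "structured_limitations": "partial",
    "quantitative_measurement": "full",
}

TIGHTENED_B2 = {
    "no_vocabulary": "zero",
    "single_term_mention": "zero",
    "structured_limitations": "partial",
    "quantitative_measurement": "full",
}


def changed_paths(before: dict, after: dict) -> list:
    """List the evidence cases whose credit path differs between two rule tables."""
    return sorted(case for case in before if before[case] != after.get(case))
```

&lt;p&gt;Diffing the two tables yields only the single-term case, which is exactly the claim being made: one deterministic, inspectable credit path changed, and nothing else.&lt;/p&gt;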

&lt;p&gt;In clinical-adjacent repositories, limitation language is not decoration. It is part of the claim boundary.&lt;/p&gt;

&lt;p&gt;A one-word mention does not carry the same weight as a structured limitations section, demographic coverage statement, known failure-mode description, or quantitative subgroup analysis.&lt;/p&gt;

&lt;p&gt;This is why calibration cannot be only a UI problem.&lt;/p&gt;

&lt;p&gt;If a user asks for a stricter limitations posture, the system should not silently subtract points through a hidden override. It should expose the rule that changed and the reason that rule exists.&lt;/p&gt;

&lt;p&gt;That is the difference between a score tweak and a governed scoring rationale.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Stage 4 Stays Separate
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqje510bvt6vrw13vov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0dqje510bvt6vrw13vov.png" alt="Importance is not score authority" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage 4 is the place where the strongest counterargument appears.&lt;/p&gt;

&lt;p&gt;The counterargument is fair:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If reproducibility is important, why does it not affect the formal score?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;My answer is that importance and score authority are not the same thing.&lt;/p&gt;

&lt;p&gt;Stage 4 measures replication posture: containers, reproducibility targets, dependency locks, artifact references, seeds, citation surfaces, and similar evidence.&lt;/p&gt;

&lt;p&gt;Those signals matter.&lt;/p&gt;

&lt;p&gt;But they do not mean the same thing as the formal claim boundary.&lt;/p&gt;

&lt;p&gt;A repository can be highly reproducible and still make unsafe or unbounded clinical claims.&lt;/p&gt;

&lt;p&gt;A repository can have clean containers and dependency locks while still lacking a clinical-use disclaimer.&lt;/p&gt;

&lt;p&gt;A repository can be easy to rerun while still having weak data provenance or shallow limitation language.&lt;/p&gt;

&lt;p&gt;If Stage 4 were allowed to lift the formal score too early, reproducibility could start compensating for claim-boundary weakness.&lt;/p&gt;

&lt;p&gt;That would be a different scoring philosophy.&lt;/p&gt;

&lt;p&gt;It may become valid in the future, but only if the rule is explicit.&lt;/p&gt;

&lt;p&gt;For now, Stage 4 is reported as a separate lane because the system is saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reproducibility matters&lt;/li&gt;
&lt;li&gt;reproducibility should be visible&lt;/li&gt;
&lt;li&gt;reproducibility should affect review interpretation&lt;/li&gt;
&lt;li&gt;reproducibility should not silently override the formal score boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why stronger reproducibility intent currently falls back to &lt;code&gt;preview_only&lt;/code&gt; instead of becoming a release-grade named profile.&lt;/p&gt;

&lt;p&gt;The system is not saying reproducibility is unimportant.&lt;/p&gt;

&lt;p&gt;It is saying reproducibility has not yet been granted formal score authority.&lt;/p&gt;
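&lt;p&gt;A small sketch of the separate-lane idea, assuming an immutable result object. The &lt;code&gt;ScanResult&lt;/code&gt; class and &lt;code&gt;with_replication&lt;/code&gt; helper are illustrative names; only &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, and &lt;code&gt;replication_score&lt;/code&gt; appear in this article.&lt;/p&gt;

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ScanResult:
    final_score: int
    formal_tier: str
    # Stage 4 lives in its own lane and starts out unset.
    replication_score: Optional[int] = None


def with_replication(result: ScanResult, replication_score: int) -> ScanResult:
    """Attach Stage 4 output without touching the formal score or tier."""
    return replace(result, replication_score=replication_score)
```

&lt;p&gt;In this shape, a very high replication score still produces a result with the same &lt;code&gt;final_score&lt;/code&gt; and &lt;code&gt;formal_tier&lt;/code&gt;. Letting Stage 4 lift the formal score would require writing a new, explicit rule rather than relying on a side effect.&lt;/p&gt;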




&lt;h2&gt;
  
  
  Advisory AI Uses the Same Boundary
&lt;/h2&gt;

&lt;p&gt;Advisory AI follows the same rule.&lt;/p&gt;

&lt;p&gt;Helpful interpretation is not score authority.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI can export provider-neutral advisory packets and validate downstream advisory responses, but the deterministic scanner does not need an external model runtime to produce the formal score.&lt;/p&gt;

&lt;p&gt;If an advisory system becomes useful, it may help interpret findings, prioritize review, or explain evidence patterns.&lt;/p&gt;

&lt;p&gt;But unless a future release explicitly changes the policy, advisory output remains structurally subordinate to the deterministic score.&lt;/p&gt;

&lt;p&gt;That is enough for this article.&lt;/p&gt;

&lt;p&gt;The broader advisory boundary is a separate topic.&lt;/p&gt;
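&lt;p&gt;As a rough illustration of structural subordination, an advisory response validator can simply refuse payloads that try to carry score fields. This is a sketch under assumptions: &lt;code&gt;validate_advisory_response&lt;/code&gt; and the protected-field set are hypothetical, not the project's actual validation schema.&lt;/p&gt;

```python
# Fields an advisory response is never allowed to set (assumed list).
PROTECTED_FIELDS = frozenset({"final_score", "formal_tier"})


def validate_advisory_response(response: dict) -> dict:
    """Accept advisory content only if it does not try to set score fields."""
    leaked = sorted(PROTECTED_FIELDS.intersection(response))
    if leaked:
        raise ValueError("advisory response may not set: " + ", ".join(leaked))
    # Return a copy so later edits to the input cannot alter the accepted payload.
    return dict(response)
```

&lt;p&gt;The advisory layer can still say anything useful about findings and review priority. It just has no field through which to speak as the score.&lt;/p&gt;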




&lt;h2&gt;
  
  
  From Scoring Tool to Audit Workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhihip6jhydpu5n3hszqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhihip6jhydpu5n3hszqd.png" alt="From scoring tool to audit custody" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;1.7.x&lt;/code&gt; transition is best understood as a shift in the questions the tool is expected to answer.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Earlier scoring-tool question&lt;/th&gt;
&lt;th&gt;Audit-workflow question&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What score did the repository get?&lt;/td&gt;
&lt;td&gt;Which policy profile was visible when the score was produced?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Which stage contributed most?&lt;/td&gt;
&lt;td&gt;Was that stage score-authoritative, diagnostic, or separate-lane evidence?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What evidence triggered the tier?&lt;/td&gt;
&lt;td&gt;Did the evidence change the formal score or only the review posture?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What should the user fix?&lt;/td&gt;
&lt;td&gt;Would a proposed policy change be preview-only, experimental, benchmark-candidate, or release-authoritative?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is why I describe &lt;code&gt;1.7.x&lt;/code&gt; as an audit-system transition.&lt;/p&gt;

&lt;p&gt;The score still matters.&lt;/p&gt;

&lt;p&gt;But the system is increasingly designed around the custody of the score: where it came from, what was allowed to influence it, and what was intentionally kept outside it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Still Does Not Do
&lt;/h2&gt;

&lt;p&gt;This boundary is just as important as the implementation.&lt;/p&gt;

&lt;p&gt;STEM BIO-AI still does not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate biomedical efficacy&lt;/li&gt;
&lt;li&gt;certify benchmark truth&lt;/li&gt;
&lt;li&gt;determine clinical deployment safety&lt;/li&gt;
&lt;li&gt;let advisory AI overwrite the formal score&lt;/li&gt;
&lt;li&gt;open arbitrary numeric tuning in the official scan path&lt;/li&gt;
&lt;li&gt;allow profile experimentation to become official policy without governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not missing conveniences.&lt;/p&gt;

&lt;p&gt;They are boundaries.&lt;/p&gt;

&lt;p&gt;A strong repository evidence tier is still an observable repository-surface signal. It is not clinical clearance, regulatory approval, or proof of biomedical validity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Next Version Direction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitvcewljd7d2ydmfsmoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitvcewljd7d2ydmfsmoh.png" alt="The next step: policy parity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The next important step is not adding more knobs.&lt;/p&gt;

&lt;p&gt;It is authoritative policy read-through in parity mode.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the default policy profile becomes the source read by the scoring path&lt;/li&gt;
&lt;li&gt;existing fixtures should show no score or tier drift&lt;/li&gt;
&lt;li&gt;policy hashes remain visible in artifacts&lt;/li&gt;
&lt;li&gt;non-default and researcher-provided profiles remain governed preview surfaces until promoted&lt;/li&gt;
&lt;li&gt;score-affecting policy changes become explicit release events&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a big-bang rewrite.&lt;/p&gt;

&lt;p&gt;It is authority relocation.&lt;/p&gt;

&lt;p&gt;The goal is to move score-affecting constants into versioned policy objects without changing the score by accident.&lt;/p&gt;

&lt;p&gt;Only after that parity step does it become safe to discuss broader named profiles.&lt;/p&gt;
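&lt;p&gt;The parity requirement lends itself to a simple fixture check. The sketch below assumes each fixture maps to a &lt;code&gt;(score, tier)&lt;/code&gt; pair produced once by the current scoring path and once by the policy read-through path; &lt;code&gt;parity_drift&lt;/code&gt; and the fixture names are hypothetical.&lt;/p&gt;

```python
def parity_drift(baseline: dict, candidate: dict) -> list:
    """Return fixture names whose (score, tier) pair differs or is missing."""
    names = set(baseline) | set(candidate)
    return sorted(name for name in names if baseline.get(name) != candidate.get(name))


# Illustrative data only: fixture names, scores, and tiers are made up.
baseline = {"repo_a": (72, "T3"), "repo_b": (41, "T1")}
same = {"repo_a": (72, "T3"), "repo_b": (41, "T1")}
drifted = {"repo_a": (72, "T3"), "repo_b": (44, "T1")}
```

&lt;p&gt;An empty result from &lt;code&gt;parity_drift(baseline, same)&lt;/code&gt; is the parity condition; any non-empty list names the fixtures where relocating authority changed a score or tier by accident.&lt;/p&gt;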




&lt;h2&gt;
  
  
  Final Position
&lt;/h2&gt;

&lt;p&gt;The calibration problem is not really about giving users more control.&lt;/p&gt;

&lt;p&gt;It is about deciding when control becomes authority.&lt;/p&gt;

&lt;p&gt;If every useful signal can gradually influence the score, the score stops being an audit artifact.&lt;/p&gt;

&lt;p&gt;It becomes a negotiation.&lt;/p&gt;

&lt;p&gt;That is what STEM BIO-AI is trying to avoid.&lt;/p&gt;

&lt;p&gt;Researchers should be able to express posture.&lt;/p&gt;

&lt;p&gt;Operators should be able to simulate alternatives.&lt;/p&gt;

&lt;p&gt;Policy stewards should be able to promote changes.&lt;/p&gt;

&lt;p&gt;But the formal score should not move unless the governance path says it moved.&lt;/p&gt;

&lt;p&gt;That is the difference between a tuning console and an audit system.&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>bioinformatics</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building a Deterministic Governance Kernel: Separating Custody from Truth</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 12 May 2026 06:00:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/building-a-deterministic-governance-kernel-separating-custody-from-truth-57l5</link>
      <guid>https://dev.to/flamehaven01/building-a-deterministic-governance-kernel-separating-custody-from-truth-57l5</guid>
      <description>&lt;p&gt;A governance engine should not pretend to know the truth of every domain.&lt;/p&gt;

&lt;p&gt;That was the architectural lesson behind CGF.&lt;/p&gt;

&lt;p&gt;At Flamehaven Labs, we build B2B governance engines for highly regulated environments. Over the past year, we developed specialized deterministic systems for different review contexts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CareChainGovernanceEngine (CCGE)&lt;/strong&gt;: a fail-closed clinical-governance engine for enforcing safety-oriented review gates in bio-AI workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Analyst's Problem Framework (TAP)&lt;/strong&gt;: a “Proof Custody” engine designed to package and audit mathematical proof candidates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both worked inside their own domains. But both also exposed the same architectural problem: reusable custody mechanics were mixed with domain-specific decision semantics.&lt;/p&gt;

&lt;p&gt;We needed to audit new targets — such as external open-source intake, RAG retrieval receipts, and AI evolution proposals. If we did not extract a common, domain-neutral kernel, we would be doomed to rewrite the entire scanning, hashing, and reporting pipeline for every new vertical.&lt;/p&gt;

&lt;p&gt;The result was the &lt;strong&gt;Custody Governance Framework (CGF)&lt;/strong&gt;: a domain-neutral custody kernel for B2B technical review workflows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggo4b1sjt3zsixrr75v4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggo4b1sjt3zsixrr75v4.png" alt="The Architectural Flaw" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an architecture extraction note on how we decoupled domain truth from custody mechanics, the API design that powers it, and why we specifically rejected the modern trend of “LLM-agentic” governance.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem with “Agentic” Governance
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjih0ep6gy6hz0g5uoycv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjih0ep6gy6hz0g5uoycv.png" alt="The Problem with “Agentic” Governance" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many emerging AI governance workflows are becoming document-shaped or agent-shaped: a YAML file, a Markdown policy, or an LLM prompt that says, “check whether this is safe.”&lt;/p&gt;

&lt;p&gt;The problem is not that LLMs are useless. The problem is that they can produce compliance-shaped language without producing verifiable compliance artifacts.&lt;/p&gt;

&lt;p&gt;In a strict B2B handoff — where auditability, legal review, and future regulatory mapping to frameworks such as the EU AI Act or NIST AI RMF may matter — you cannot rely on non-deterministic evaluations.&lt;/p&gt;

&lt;p&gt;CGF takes the opposite approach: &lt;strong&gt;strict determinism&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The framework does not own domain-truth semantics. It owns the custody mechanics around findings, profiles, evidence, approvals, and artifacts.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: A Deterministic Data Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7efal66s6ug3q0ok9nz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7efal66s6ug3q0ok9nz.png" alt="The Architecture: A Deterministic Data Flow" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To enforce this separation, we designed CGF as a deterministic pipeline with a narrow side-effect boundary.&lt;/p&gt;

&lt;p&gt;The core engine takes a normalized review input and a &lt;code&gt;GovernanceProfile&lt;/code&gt;, transforming them into immutable artifact dataclasses. The writer layer then materializes those objects as files, manifests, and release artifacts.&lt;/p&gt;

&lt;p&gt;Here is what the end-to-end flow looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwvftadwopko2nqlqrbv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkwvftadwopko2nqlqrbv.jpg" alt="mermaid" width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The API Boundary
&lt;/h2&gt;

&lt;p&gt;At the core boundary, the framework does not evaluate whether a finding is “bad” by its own logic.&lt;/p&gt;

&lt;p&gt;It relies on a &lt;code&gt;StatusDeriver&lt;/code&gt; driven by the injected profile.&lt;/p&gt;

&lt;p&gt;A simplified sketch of the boundary looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified sketch, not the full implementation
&lt;/span&gt;&lt;span class="nd"&gt;@dataclass&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slots&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;GovernancePipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;GovernanceProfile&lt;/span&gt;
    &lt;span class="n"&gt;status_deriver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StatusDeriver&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_packet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScanResult&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;GovernancePacket&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# 1. Inject profile-specific requirements, such as mandatory surfaces
&lt;/span&gt;        &lt;span class="n"&gt;governed_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_with_profile_requirements&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 2. Derive deterministic status via the profile
&lt;/span&gt;        &lt;span class="n"&gt;status_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_deriver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;derive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# 3. Assemble the immutable packet
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GovernancePacket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;profile_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profile&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;profile_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;status_reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;status_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evidence&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;compliance_score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;_compute_compliance_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;governed_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact implementation also handles timestamps, approval bridges, validation, artifact writing, and manifest verification.&lt;/p&gt;

&lt;p&gt;The important point is architectural: the core does not decide domain truth. It records how a profile interpreted the evidence.&lt;/p&gt;
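
&lt;p&gt;To make profile-owned status derivation concrete, here is a minimal sketch. It is not CGF's actual &lt;code&gt;StatusDeriver&lt;/code&gt;: the &lt;code&gt;Finding&lt;/code&gt;, &lt;code&gt;Profile&lt;/code&gt;, and &lt;code&gt;derive_status&lt;/code&gt; names, the severity labels, and the &lt;code&gt;blocked&lt;/code&gt; / &lt;code&gt;clear&lt;/code&gt; statuses are all hypothetical. It only illustrates the idea that the same findings can yield different statuses under different profiles, while the function itself holds no opinion about the domain.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    code: str        # hypothetical finding identifier
    severity: str    # hypothetical label, e.g. "low" or "high"


@dataclass(frozen=True)
class Profile:
    profile_id: str
    blocking_severities: frozenset   # severities this profile treats as blocking


@dataclass(frozen=True)
class StatusResult:
    status: str
    reason: str


def derive_status(profile, findings):
    """Derive a status purely from the profile's rules and the findings."""
    # Sort so the result does not depend on the order findings arrive in.
    blocking = sorted(
        f.code for f in findings if f.severity in profile.blocking_severities
    )
    if blocking:
        return StatusResult("blocked", "blocking findings: " + ", ".join(blocking))
    return StatusResult("clear", "no blocking findings under " + profile.profile_id)


findings = [Finding("F-002", "high"), Finding("F-001", "low")]
strict = Profile("strict", frozenset({"low", "high"}))
lenient = Profile("lenient", frozenset({"critical"}))

strict_result = derive_status(strict, findings)
lenient_result = derive_status(lenient, findings)
print(strict_result)
print(lenient_result)
```

&lt;p&gt;The function takes no clock, network, or model input, so the same profile and findings always produce the same status and reason.&lt;/p&gt;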




&lt;h2&gt;
  
  
  Architecture Highlight 1: Inspectable Artifacts Over Silent Mutation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xr9csrctkgeo7ywjtm3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xr9csrctkgeo7ywjtm3.png" alt="Architecture Highlight 1" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A major risk in governance automation is the engine silently mutating the target repository, for example by automatically injecting compliance boilerplate.&lt;/p&gt;

&lt;p&gt;Many tools solve this with a &lt;code&gt;--dry-run&lt;/code&gt; flag that prints logs to stdout.&lt;/p&gt;

&lt;p&gt;In a B2B audit, stdout logs are not enough. You need an auditable, verifiable artifact.&lt;/p&gt;

&lt;p&gt;CGF implements a preview-first artifact flow. When the pipeline runs, it does not mutate the target repository by default. Instead, it consumes a normalized review input and emits a deterministic custody bundle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;cgf run &lt;span class="nt"&gt;--profile&lt;/span&gt; proof_custody.json &lt;span class="nt"&gt;--scan&lt;/span&gt; target_scan.json &lt;span class="nt"&gt;--out&lt;/span&gt; audit/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In real deployments, &lt;code&gt;target_scan.json&lt;/code&gt; may be produced by a vertical adapter, repository scanner, RAG receipt processor, or customer-specific intake layer.&lt;/p&gt;

&lt;p&gt;The output is an inspectable custody bundle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audit/
├── governance_packet.json       # Machine-readable audit state
├── preview_report.md            # Human-readable summary
├── chain_ribbon.md              # Markdown tag for custody-chain review state
└── manifest.json                # Artifact manifest with file hashes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important point is not that CGF edits the repository.&lt;/p&gt;

&lt;p&gt;It does not.&lt;/p&gt;

&lt;p&gt;The important point is that the review state, findings, proposed next actions, and artifact hashes become inspectable before any external system decides what to do next.&lt;/p&gt;
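
&lt;p&gt;As a rough illustration of what “artifact manifest with file hashes” can mean in practice, the sketch below writes and re-checks a manifest using only the Python standard library. The manifest layout (a &lt;code&gt;files&lt;/code&gt; mapping of name to SHA-256 digest) and the helper names are assumptions made for this example, not CGF's actual schema or API.&lt;/p&gt;

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path):
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(bundle_dir):
    """Record a SHA-256 digest for every file in the bundle except the manifest."""
    entries = {
        p.name: sha256_file(p)
        for p in sorted(bundle_dir.iterdir())
        if p.is_file() and p.name != "manifest.json"
    }
    manifest = bundle_dir / "manifest.json"
    manifest.write_text(json.dumps({"files": entries}, indent=2, sort_keys=True))
    return entries


def verify_manifest(bundle_dir):
    """Return the names of files whose current digest differs from the manifest."""
    recorded = json.loads((bundle_dir / "manifest.json").read_text())["files"]
    mismatched = []
    for name, digest in recorded.items():
        path = bundle_dir / name
        if not path.is_file() or sha256_file(path) != digest:
            mismatched.append(name)
    return sorted(mismatched)


with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp)
    (bundle / "governance_packet.json").write_text('{"status": "blocked"}')
    (bundle / "preview_report.md").write_text("# Preview\n")
    write_manifest(bundle)
    clean = verify_manifest(bundle)

    # Simulate an edit made after the bundle was written.
    (bundle / "governance_packet.json").write_text('{"status": "clear"}')
    tampered = verify_manifest(bundle)

print(clean, tampered)
```

&lt;p&gt;A reviewer or downstream system can run a check like this before acting on the bundle, which is the practical difference between an artifact and a stdout log.&lt;/p&gt;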




&lt;h2&gt;
  
  
  Architecture Highlight 2: The Reality of Audit Chains
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgsvgjb1yqf419dg58sf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frgsvgjb1yqf419dg58sf.png" alt="Architecture Highlight 2" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In B2B handoffs, customers often ask for tamper-resistance.&lt;/p&gt;

&lt;p&gt;CGF supports a &lt;code&gt;GovernanceAuditChain&lt;/code&gt;, an append-only JSONL ledger where packet records can be linked through SHA-256 hashes.&lt;/p&gt;

&lt;p&gt;But we need to be honest about the tradeoff:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local hash chains are tamper-evident, not tamper-resistant.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a bad actor has write access to the filesystem, they can delete the audit directory and regenerate the entire chain from scratch.&lt;/p&gt;

&lt;p&gt;CGF does not use a blockchain. The cost and complexity of a distributed ledger would outweigh the benefits for local repository scanning.&lt;/p&gt;

&lt;p&gt;Instead, CGF provides local tamper-evidence.&lt;/p&gt;
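
&lt;p&gt;The distinction is easy to demonstrate. The following is a minimal sketch of a SHA-256 linked JSONL chain, not the real &lt;code&gt;GovernanceAuditChain&lt;/code&gt;: the record fields (&lt;code&gt;prev&lt;/code&gt;, &lt;code&gt;payload&lt;/code&gt;, &lt;code&gt;hash&lt;/code&gt;), the all-zero genesis value, and the function names are assumptions for illustration. It shows that an in-place edit breaks verification, while a chain rebuilt from scratch verifies without complaint.&lt;/p&gt;

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first record's predecessor


def record_hash(prev_hash, payload):
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(lines, payload):
    """Append one JSONL line whose hash covers the previous record's hash."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    entry = {
        "prev": prev_hash,
        "payload": payload,
        "hash": record_hash(prev_hash, payload),
    }
    return lines + [json.dumps(entry, sort_keys=True)]


def verify_chain(lines):
    """Recompute every link; an edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for line in lines:
        entry = json.loads(line)
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != record_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


chain = []
for status in ["blocked", "approved"]:
    chain = append_record(chain, {"status": status})

# Editing a record in place is detectable...
edited = json.loads(chain[0])
edited["payload"]["status"] = "clear"
tampered = [json.dumps(edited, sort_keys=True)] + chain[1:]

# ...but a chain rebuilt from scratch verifies cleanly.
rebuilt = append_record([], {"status": "clear"})

print(verify_chain(chain), verify_chain(tampered), verify_chain(rebuilt))
```

&lt;p&gt;The rebuilt chain passing verification is exactly why the final chain hash needs an anchor that the local writer does not control.&lt;/p&gt;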

&lt;p&gt;True tamper-resistance has to come from the deployment environment: CI/CD artifact signing, external timestamping, identity providers, or customer-controlled archival systems.&lt;/p&gt;

&lt;p&gt;For example, an external identity token can be wired into an approval bridge:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wiring an external identity token into the local chain
&lt;/span&gt;&lt;span class="n"&gt;bridge&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ApprovalBridge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;approved_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compliance_lead_JWT_subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cryptographic signature from Identity Provider XYZ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;approved_packet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GovernancePipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;packet&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bridge&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The framework provides the cryptographic hooks and chronological integrity.&lt;/p&gt;

&lt;p&gt;The deployment environment provides the immutability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Limitations
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz467oliq5o7qr67tqavf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz467oliq5o7qr67tqavf.png" alt="Current Limitations" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post is not a release announcement or a regulatory certification claim.&lt;/p&gt;

&lt;p&gt;It is an architecture note about the extraction pattern: how we separated reusable governance mechanics from domain-specific truth semantics.&lt;/p&gt;

&lt;p&gt;CGF is still early. It is not a compliance platform, not a hosted governance service, and not a regulatory certification product.&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;CGF does not prove that a medical system is safe. It does not prove that a mathematical argument is true. It does not certify legal compliance. It does not replace domain experts, auditors, clinicians, lawyers, or reviewers.&lt;/p&gt;

&lt;p&gt;What it does is narrower, but more concrete:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It makes the custody surface inspectable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are also practical limitations.&lt;/p&gt;

&lt;p&gt;As mentioned, local hash chains are tamper-evident, not tamper-resistant. True immutability still has to come from the deployment environment: CI/CD signing, external timestamping, identity providers, or customer-controlled archival systems.&lt;/p&gt;

&lt;p&gt;CGF is also not yet a complete enterprise governance platform. Authentication, RBAC, multi-tenant profile registries, async approval workflows, and regulatory citation mapping are still roadmap items, not solved infrastructure.&lt;/p&gt;

&lt;p&gt;Each domain still needs adapters, profiles, thresholds, and human review policies.&lt;/p&gt;

&lt;p&gt;The kernel provides the custody mechanics. The domain owner still has to define what evidence matters.&lt;/p&gt;

&lt;p&gt;That is intentional.&lt;/p&gt;

&lt;p&gt;A generic governance kernel should not pretend to know the truth of every field.&lt;/p&gt;




&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn8btw5jcbqoc6luwy2s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn8btw5jcbqoc6luwy2s.png" alt="Roadmap" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The roadmap is not to turn CGF into a giant all-knowing compliance agent.&lt;/p&gt;

&lt;p&gt;The roadmap is to keep the kernel small, deterministic, and inspectable while adding stronger boundaries around the places where real B2B workflows need them.&lt;/p&gt;

&lt;p&gt;The next layers are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory mapping&lt;/strong&gt;: mapping finding codes to frameworks such as the EU AI Act, NIST AI RMF, and ISO/IEC 42001 without turning CGF itself into a legal authority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approval policy hardening&lt;/strong&gt;: adding stronger policy checks around &lt;code&gt;ApprovalBridge&lt;/code&gt; so approvals can be scoped, expired, and externally verified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Async approval workflows&lt;/strong&gt;: allowing human review, compliance sign-off, or customer approval to arrive after the initial custody packet is generated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile registries&lt;/strong&gt;: supporting versioned, tenant-scoped governance profiles so different customers can use different policies without changing the kernel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Signed external receipts&lt;/strong&gt;: allowing RAG systems, technology scanners, quality engines, and external tools to produce receipts that CGF can verify and attach.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vertical adapters&lt;/strong&gt;: binding existing domain systems back to the kernel without importing their domain-specific truth semantics into the core.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every roadmap item has to preserve the same rule:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kernel may govern custody, but it must not absorb domain truth.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The direction is deliberately conservative: more custody, more verification, more explicit boundaries — not more autonomous magic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb5uu077mkda41epvtc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvfb5uu077mkda41epvtc.png" alt="Why This Matters" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of AI governance today is still document-shaped.&lt;/p&gt;

&lt;p&gt;A policy lives in Markdown. A checklist lives in YAML. A prompt says the system should be safe, transparent, aligned, compliant, or human-reviewed.&lt;/p&gt;

&lt;p&gt;Those documents are not useless. They are often necessary.&lt;/p&gt;

&lt;p&gt;But they are not governance by themselves.&lt;/p&gt;

&lt;p&gt;Governance becomes real only when it has an execution surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a typed input boundary&lt;/li&gt;
&lt;li&gt;normalized findings&lt;/li&gt;
&lt;li&gt;profile-owned status derivation&lt;/li&gt;
&lt;li&gt;explicit evidence references&lt;/li&gt;
&lt;li&gt;generated review artifacts&lt;/li&gt;
&lt;li&gt;manifest hashes&lt;/li&gt;
&lt;li&gt;approval metadata&lt;/li&gt;
&lt;li&gt;release bundles&lt;/li&gt;
&lt;li&gt;verification commands&lt;/li&gt;
&lt;li&gt;clear non-goals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference CGF is trying to make.&lt;/p&gt;

&lt;p&gt;It is not another Markdown file describing how governance should work.&lt;/p&gt;

&lt;p&gt;It is a deterministic custody pipeline that turns review inputs into inspectable artifacts.&lt;/p&gt;

&lt;p&gt;The goal is not to make governance sound more sophisticated.&lt;/p&gt;

&lt;p&gt;The goal is to make it harder to fake.&lt;/p&gt;

&lt;p&gt;A governance system should leave behind more than confidence.&lt;/p&gt;

&lt;p&gt;It should leave behind artifacts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblh4bvppqc3mioyrdxrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fblh4bvppqc3mioyrdxrr.png" alt="Conclusion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Extracting the Custody Governance Framework taught us that governance architecture has to separate process from truth.&lt;/p&gt;

&lt;p&gt;Truth belongs to domains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;medicine&lt;/li&gt;
&lt;li&gt;mathematics&lt;/li&gt;
&lt;li&gt;law&lt;/li&gt;
&lt;li&gt;security&lt;/li&gt;
&lt;li&gt;finance&lt;/li&gt;
&lt;li&gt;science&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Process belongs to the governance kernel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was reviewed&lt;/li&gt;
&lt;li&gt;which profile was applied&lt;/li&gt;
&lt;li&gt;which findings fired&lt;/li&gt;
&lt;li&gt;what evidence was attached&lt;/li&gt;
&lt;li&gt;what status was derived&lt;/li&gt;
&lt;li&gt;who approved it&lt;/li&gt;
&lt;li&gt;whether the resulting artifacts can be verified later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is the reason CGF exists.&lt;/p&gt;

&lt;p&gt;It does not try to be an AI judge. It does not ask an LLM to guess whether a system is compliant. It does not hide governance inside a prompt, a policy document, or a YAML file.&lt;/p&gt;

&lt;p&gt;It creates custody artifacts that can be inspected.&lt;/p&gt;

&lt;p&gt;For us, that is the real boundary between governance as language and governance as infrastructure.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>softwareengineering</category>
      <category>governance</category>
      <category>python</category>
    </item>
    <item>
      <title>From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Fri, 08 May 2026 08:25:50 +0000</pubDate>
      <link>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</link>
      <guid>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Earlier in this series, I wrote about why bio/medical AI repositories need more than benchmarks, what I learned after auditing 10 public repositories, and why an AI auditor itself needs a memory contract.&lt;/p&gt;

&lt;p&gt;That work led to STEM-AI v1.1.2 and the MICA layer: a memory-contracted initialization step that forces the auditor to load bounded rules before scoring begins. If you have not read that part, the relevant post is here:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the broader arc, the full series is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;STEM-AI / STEM BIO-AI series&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But after that, a different engineering problem took over.&lt;/p&gt;

&lt;p&gt;The audit logic was stricter.&lt;br&gt;&lt;br&gt;
The reports were richer.&lt;br&gt;&lt;br&gt;
The reasoning was more bounded.&lt;/p&gt;

&lt;p&gt;But the developer workflow still felt too loose.&lt;/p&gt;

&lt;p&gt;So the next question was no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do I score trust?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does a bio-AI audit tool become something an engineer can actually run, gate, inspect, and integrate?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer turned out to be less about seeing more signals and more about refusing to confuse them.&lt;/p&gt;

&lt;p&gt;That is the core argument of this post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I took that seriously, STEM BIO-AI stopped looking like “one score plus some extra metadata” and started looking like a system with distinct lanes, distinct boundaries, and distinct operator workflows.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The problem was no longer scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" alt="The problem was no longer scoring" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the time I reached the 1.6.x line, the rubric was no longer the main bottleneck.&lt;/p&gt;

&lt;p&gt;The bottleneck was operational clarity.&lt;/p&gt;

&lt;p&gt;A trust audit tool is not very useful if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the normal path is one long command with too many flags&lt;/li&gt;
&lt;li&gt;CI has to reverse-engineer the result from human-readable stdout&lt;/li&gt;
&lt;li&gt;bio-specific diagnostics are mixed directly into the same surface as formal scoring&lt;/li&gt;
&lt;li&gt;regulatory relevance shows up as vague implication instead of explicit traceability&lt;/li&gt;
&lt;li&gt;advisory AI is present, but its relationship to the official score is unclear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the tool stops being hard to trust for conceptual reasons and starts being hard to trust for operational reasons.&lt;/p&gt;

&lt;p&gt;That is a different class of problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The CLI had to reflect operator intent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier CLI was functional, but too flat.&lt;/p&gt;

&lt;p&gt;You could do things like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T3 &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
stem /path/to/repo &lt;span class="nt"&gt;--advisory-response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of that worked.&lt;/p&gt;

&lt;p&gt;The issue was that it treated very different operator intents as one long option surface.&lt;/p&gt;

&lt;p&gt;In practice, these are separate workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan a repository and generate artifacts&lt;/li&gt;
&lt;li&gt;enforce a gate in CI/CD&lt;/li&gt;
&lt;li&gt;export a bounded advisory packet&lt;/li&gt;
&lt;li&gt;validate a downstream provider response&lt;/li&gt;
&lt;li&gt;cross an explicit provider-call boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I refactored the CLI around workflows instead of flag accumulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan &amp;lt;folder&amp;gt;
stem gate &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2
stem advisory validate &amp;lt;folder&amp;gt;
stem advisory packet &amp;lt;folder&amp;gt;
stem advisory call &amp;lt;folder&amp;gt;
stem advisory check-response &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--response&lt;/span&gt; FILE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The older paths still exist for compatibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem &amp;lt;folder&amp;gt;
stem audit &amp;lt;folder&amp;gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T2 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But they are no longer the conceptual center.&lt;/p&gt;

&lt;p&gt;That matters more than it sounds.&lt;/p&gt;

&lt;p&gt;Once the command names match the operator’s intent, the system becomes easier to teach, easier to remember, and easier to wire into pipelines.&lt;/p&gt;

&lt;p&gt;This is not just a DX cleanup. In a medical or bio-adjacent audit context, command ambiguity is part of the trust problem.&lt;/p&gt;
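
&lt;p&gt;For readers who want to see the shape of a workflow-first CLI, here is a small &lt;code&gt;argparse&lt;/code&gt; sketch that mirrors the command names listed above. It is not the actual &lt;code&gt;stem&lt;/code&gt; implementation. Which flags are required, the help strings, and the &lt;code&gt;workflow&lt;/code&gt; / &lt;code&gt;action&lt;/code&gt; attribute names are assumptions for this example.&lt;/p&gt;

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="stem")
    workflows = parser.add_subparsers(dest="workflow", required=True)

    scan = workflows.add_parser("scan", help="scan a repository and generate artifacts")
    scan.add_argument("folder")

    gate = workflows.add_parser("gate", help="enforce a gate in CI/CD")
    gate.add_argument("folder")
    gate.add_argument("--min-tier")  # assumed optional here

    advisory = workflows.add_parser("advisory", help="advisory workflows")
    actions = advisory.add_subparsers(dest="action", required=True)
    for name in ("validate", "packet", "call"):
        actions.add_parser(name).add_argument("folder")
    check = actions.add_parser("check-response")
    check.add_argument("folder")
    check.add_argument("--response", required=True)  # assumed required here

    return parser


parser = build_parser()
gate_args = parser.parse_args(["gate", "repo/", "--min-tier", "T2"])
check_args = parser.parse_args(
    ["advisory", "check-response", "repo/", "--response", "provider_advisory.json"]
)
print(gate_args.workflow, gate_args.min_tier)
print(check_args.workflow, check_args.action, check_args.response)
```

&lt;p&gt;Each workflow gets its own parser, so a flag that only makes sense for gating cannot be passed to a scan by accident.&lt;/p&gt;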




&lt;h2&gt;
  
  
  &lt;strong&gt;Repository trust needed four separate lanes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" alt="Repository trust needed four separate lanes" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest architectural shift.&lt;/p&gt;

&lt;p&gt;I stopped treating repository trust as one object.&lt;/p&gt;

&lt;p&gt;In practice, it needed four separate lanes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;deterministic structural scoring&lt;/li&gt;
&lt;li&gt;deterministic diagnostics&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;optional AI advisory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all of those collapse into one final confidence score, the tool becomes harder to reason about.&lt;/p&gt;

&lt;p&gt;The more regulated the domain, the more dangerous it becomes to collapse every useful signal into one score.&lt;/p&gt;

&lt;p&gt;Some evidence should change the score.&lt;br&gt;
Some evidence should only raise review priority.&lt;br&gt;
Some evidence should support traceability.&lt;br&gt;
Some evidence should be handed to a human or advisory system.&lt;/p&gt;

&lt;p&gt;The maturity of the tool is not that it sees all of them.&lt;/p&gt;

&lt;p&gt;The maturity is that it does not confuse them.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;This separation is not just conceptual. It exists in the code path.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One reasonable objection to any architecture write-up is: are these really separate lanes, or are they just different labels on the same output object?&lt;/p&gt;

&lt;p&gt;In STEM BIO-AI, the answer is visible in the execution order.&lt;/p&gt;

&lt;p&gt;The scanner computes the formal score first. In the result object, that means entries like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1&lt;/li&gt;
&lt;li&gt;Stage 2R&lt;/li&gt;
&lt;li&gt;Stage 3&lt;/li&gt;
&lt;li&gt;risk penalty&lt;/li&gt;
&lt;li&gt;score cap&lt;/li&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does it append the non-scoring layers, again as explicit result keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reasoning_model&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;optional &lt;code&gt;ai_advisory&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ordering matters.&lt;/p&gt;

&lt;p&gt;The score is not derived from the advisory lane.&lt;br&gt;
The regulatory mapping does not mutate the formal tier.&lt;br&gt;
The diagnostics lane can emit evidence without becoming a hidden score multiplier.&lt;/p&gt;

&lt;p&gt;This is also why the JSON shape ended up more layered than earlier versions. The output had to preserve the distinction the code was already enforcing.&lt;/p&gt;
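&lt;p&gt;A minimal sketch of that two-phase ordering, assuming a plain dictionary result. The &lt;code&gt;build_result&lt;/code&gt; helper, the layer functions, and the read-only wrapper are illustrative assumptions, not the actual STEM BIO-AI code path; only the key names &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, &lt;code&gt;regulatory_traceability&lt;/code&gt;, and &lt;code&gt;ai_advisory&lt;/code&gt; come from the result shape described above.&lt;/p&gt;

```python
from types import MappingProxyType


def regulatory_traceability(score_view):
    # Hypothetical layer: it can read the score but has no way to write to it.
    return {
        "claim": "traceability aid, not a compliance verdict",
        "scored_tier_seen": score_view["formal_tier"],
    }


def ai_advisory(score_view):
    # Hypothetical optional layer; here it only reports that it was not requested.
    return {"status": "not_requested"}


def build_result(final_score, formal_tier, layers):
    # Phase 1: fix the formal score behind a read-only mapping.
    score = MappingProxyType(
        {"final_score": final_score, "formal_tier": formal_tier}
    )
    result = {"score": score}
    # Phase 2: append each non-scoring layer under its own top-level key.
    for key, layer in layers.items():
        result[key] = layer(score)
    return result


result = build_result(
    67,
    "T2",
    {
        "regulatory_traceability": regulatory_traceability,
        "ai_advisory": ai_advisory,
    },
)
print(sorted(result))
```

&lt;p&gt;Because each layer only receives a &lt;code&gt;MappingProxyType&lt;/code&gt; view, an attempt to assign to &lt;code&gt;final_score&lt;/code&gt; from inside a layer raises &lt;code&gt;TypeError&lt;/code&gt; instead of silently changing the tier.&lt;/p&gt;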

&lt;p&gt;That execution order is the architectural reason the next four sections exist.&lt;/p&gt;

&lt;p&gt;Once I had the lanes separated in code, each lane needed its own claim boundary, its own output semantics, and its own reason for not being collapsed into the others.&lt;/p&gt;

&lt;p&gt;Put differently, the next four sections answer four different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is allowed to change the formal tier&lt;/li&gt;
&lt;li&gt;what is useful enough to emit, but not yet mature enough to score&lt;/li&gt;
&lt;li&gt;what can support regulatory review without pretending to be compliance&lt;/li&gt;
&lt;li&gt;what can involve AI without letting AI become the scoring authority&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1. Deterministic structural scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" alt="The official baseline for triage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This remains the official score and tier.&lt;/p&gt;

&lt;p&gt;It measures the main repository-visible signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README evidence&lt;/li&gt;
&lt;li&gt;repo-local consistency&lt;/li&gt;
&lt;li&gt;code and bio responsibility&lt;/li&gt;
&lt;li&gt;dependency hygiene&lt;/li&gt;
&lt;li&gt;changelog and provenance surfaces&lt;/li&gt;
&lt;li&gt;code-integrity patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lane is local, deterministic, and machine-checkable.&lt;/p&gt;

&lt;p&gt;That is the part that can legitimately drive a formal triage tier.&lt;/p&gt;

&lt;p&gt;I am not claiming this is the only possible architecture. A different system could have folded diagnostics or replication more aggressively into one unified score.&lt;/p&gt;

&lt;p&gt;I chose not to, because the narrower score proved easier to defend. A smaller claim with cleaner boundaries was more valuable here than a broader score with ambiguous semantics.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;2. Deterministic diagnostics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where the deterministic diagnostics spec became important.&lt;/p&gt;

&lt;p&gt;I needed a place for findings that are real, useful, and inspectable, but should not silently perturb the main score until they are calibrated.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;docs/DETERMINISTIC_DIAGNOSTICS.md&lt;/code&gt; defines.&lt;/p&gt;

&lt;p&gt;It separates the diagnostic problem into two lanes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lane A: deterministic local diagnostics&lt;/li&gt;
&lt;li&gt;Lane B: optional AI-assisted semantic review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is central.&lt;/p&gt;

&lt;p&gt;The deterministic lane is authoritative for hard findings.&lt;br&gt;
The AI lane is advisory only.&lt;/p&gt;

&lt;p&gt;The local diagnostic lane currently focuses on evidence-bearing bio-specific signals such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed or suspicious SMILES-like outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;li&gt;silent mock or simulated-data fallbacks&lt;/li&gt;
&lt;li&gt;risky subprocess construction around bio tools&lt;/li&gt;
&lt;li&gt;traceability manifest surfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point was not to create a “bio slop detector” with a catchy label.&lt;/p&gt;

&lt;p&gt;The point was to create a local evidence lane that could say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;here is the file&lt;/li&gt;
&lt;li&gt;here is the line&lt;/li&gt;
&lt;li&gt;here is the snippet&lt;/li&gt;
&lt;li&gt;here is the bounded interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is much more useful than a vague semantic warning.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why diagnostics stayed evidence-only
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" alt="Retaining evidence without inflating the score" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was one of the harder engineering decisions.&lt;/p&gt;

&lt;p&gt;It would have been easy to push every new bio-specific detector directly into the final score.&lt;/p&gt;

&lt;p&gt;I did not do that.&lt;/p&gt;

&lt;p&gt;The deterministic diagnostics spec is explicit that many of these findings begin as evidence-only. In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;findings are emitted as line-level records into the result object’s &lt;code&gt;evidence_ledger&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings appear in Markdown and &lt;code&gt;--explain&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings do not change &lt;code&gt;final_score&lt;/code&gt; or &lt;code&gt;formal_tier&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right default.&lt;/p&gt;
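&lt;p&gt;To make the evidence-only behavior concrete, here is a rough sketch of a ledger record. The &lt;code&gt;Finding&lt;/code&gt; field names, the &lt;code&gt;record_finding&lt;/code&gt; helper, and the example values are assumptions for illustration; only &lt;code&gt;evidence_ledger&lt;/code&gt;, &lt;code&gt;final_score&lt;/code&gt;, and &lt;code&gt;formal_tier&lt;/code&gt; are names taken from the result object.&lt;/p&gt;

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    # Field names are illustrative, not the tool's actual schema.
    finding_id: str
    file: str
    line: int
    snippet: str
    interpretation: str


def record_finding(result, finding):
    # Evidence-only: append to the ledger and leave the score keys untouched.
    result.setdefault("evidence_ledger", []).append(asdict(finding))
    return result


result = {"score": {"final_score": 67, "formal_tier": "T2"}}
score_before = dict(result["score"])

record_finding(
    result,
    Finding(
        finding_id="DX-EXAMPLE-001",  # made-up identifier
        file="pipeline/generate.py",  # made-up path
        line=42,
        snippet='return "CCO"  # fallback',
        interpretation="possible silent placeholder output; not proof of invalid chemistry",
    ),
)
print(len(result["evidence_ledger"]), result["score"] == score_before)
```

&lt;p&gt;The record carries file, line, snippet, and a bounded interpretation, and the score dictionary is identical before and after the finding is recorded.&lt;/p&gt;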

&lt;p&gt;For example, the SMILES lane can be very useful for detecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed surface strings&lt;/li&gt;
&lt;li&gt;low-entropy placeholders&lt;/li&gt;
&lt;li&gt;repeated trivial outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;medicinal usefulness&lt;/li&gt;
&lt;li&gt;synthetic feasibility&lt;/li&gt;
&lt;li&gt;binding plausibility&lt;/li&gt;
&lt;li&gt;biological efficacy&lt;/li&gt;
&lt;li&gt;full chemical validity in every edge case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That boundary is important.&lt;/p&gt;

&lt;p&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;

&lt;p&gt;Just as importantly, this is not meant to be a permanent holding area for every detector. The diagnostics spec is explicit that score impact should only happen after commit-pinned benchmark evidence, explicit false-positive review, and reproducible calibration. In other words, evidence-only is the temporary safe default until a detector has earned score authority.&lt;/p&gt;
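&lt;p&gt;One way to express that promotion rule is a gate that only opens when all three conditions are explicitly recorded. The key names and the &lt;code&gt;may_affect_score&lt;/code&gt; helper below are hypothetical; the spec states the criteria, not this representation.&lt;/p&gt;

```python
# The three criteria named in the text, under assumed key names.
REQUIRED_EVIDENCE = (
    "commit_pinned_benchmark",
    "false_positive_review",
    "reproducible_calibration",
)


def may_affect_score(detector_status):
    # A detector stays evidence-only unless every criterion is explicitly True.
    return all(detector_status.get(name) is True for name in REQUIRED_EVIDENCE)


# Hypothetical detector that has a benchmark but no completed review or calibration.
smiles_guard = {"commit_pinned_benchmark": True, "false_positive_review": False}
calibrated = {name: True for name in REQUIRED_EVIDENCE}

print(may_affect_score(smiles_guard), may_affect_score(calibrated))
```

&lt;p&gt;A missing entry counts the same as a failed one, so a detector cannot gain score authority by omission.&lt;/p&gt;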


&lt;h2&gt;
  
  
  &lt;strong&gt;3. Regulatory traceability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" alt="Traceability is not a permission slip" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second document that became central was &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This solved a different problem.&lt;/p&gt;

&lt;p&gt;Once you audit clinical-adjacent repositories, people naturally ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this align with EU AI Act themes?&lt;/li&gt;
&lt;li&gt;does this help with FDA-oriented review?&lt;/li&gt;
&lt;li&gt;is there anything relevant to IMDRF or SaMD evidence families?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wrong answer would be to turn those questions into a fake compliance score.&lt;/p&gt;

&lt;p&gt;So I did the opposite.&lt;/p&gt;

&lt;p&gt;The regulatory layer is explicitly framed as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a traceability aid, not a compliance verdict&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That document maps observed evidence classes to requirement families with bounded confidence labels like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strong&lt;/li&gt;
&lt;li&gt;moderate&lt;/li&gt;
&lt;li&gt;weak-moderate&lt;/li&gt;
&lt;li&gt;weak&lt;/li&gt;
&lt;li&gt;not assessed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it makes an important distinction:&lt;/p&gt;

&lt;p&gt;the confidence applies to the mapping relationship, not to legal acceptability.&lt;/p&gt;

&lt;p&gt;Those confidence labels are not model outputs and they are not inferred at runtime. They are fixed, rule-level mapping judgments attached to evidence classes in the mapping document itself. For example, changelog / checksum / config-manifest style evidence is treated as a moderate traceability signal for Article 12-style review, while human-oversight interface signals stay weak because interface presence is not the same thing as oversight procedure.&lt;/p&gt;

&lt;p&gt;That means the tool can say things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioned manifests and changelogs may support record-keeping / traceability review&lt;/li&gt;
&lt;li&gt;intended-use and disclaimer sections may support transparency scaffolding review&lt;/li&gt;
&lt;li&gt;override interfaces may support human-oversight interface review&lt;/li&gt;
&lt;li&gt;subgroup measurement language may support weak evidence of data-governance intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without claiming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;legal compliance&lt;/li&gt;
&lt;li&gt;regulatory clearance&lt;/li&gt;
&lt;li&gt;clinical certification&lt;/li&gt;
&lt;li&gt;deployer conformance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a regulated domain, traceability is useful only when it does not pretend to be permission.&lt;/p&gt;
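&lt;p&gt;Because the confidence labels are fixed per rule rather than inferred at runtime, the mapping can be pictured as a static lookup table. This is a sketch under assumptions: the evidence-class keys and the &lt;code&gt;map_evidence&lt;/code&gt; helper are invented for illustration, and only the two label assignments shown (moderate for changelog / checksum style evidence, weak for override interfaces) follow the examples given above.&lt;/p&gt;

```python
ALLOWED_LABELS = ("strong", "moderate", "weak-moderate", "weak", "not assessed")

# Evidence-class keys are hypothetical; the two labels follow the examples in the text.
MAPPING_RULES = {
    "changelog_or_checksum_manifest": {
        "requirement_family": "record-keeping / traceability (Article 12-style)",
        "mapping_confidence": "moderate",
    },
    "override_interface": {
        "requirement_family": "human-oversight interface",
        "mapping_confidence": "weak",
    },
}


def map_evidence(evidence_class):
    rule = MAPPING_RULES.get(evidence_class)
    if rule is None:
        # Unknown evidence is reported as unassessed rather than guessed.
        rule = {"requirement_family": None, "mapping_confidence": "not assessed"}
    assert rule["mapping_confidence"] in ALLOWED_LABELS
    return {
        "evidence_class": evidence_class,
        **rule,
        "claim": "traceability aid, not a compliance verdict",
    }


print(map_evidence("override_interface")["mapping_confidence"])
```

&lt;p&gt;Nothing in the lookup reads a score or a model output, which is the property the fixed labels are meant to guarantee.&lt;/p&gt;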
&lt;h3&gt;
  
  
  A concrete example: why Article 12 is traceability, not compliance
&lt;/h3&gt;

&lt;p&gt;The best example here is EU AI Act Article 12-style traceability.&lt;/p&gt;

&lt;p&gt;The regulatory mapping layer treats signals like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;checksum manifests&lt;/li&gt;
&lt;li&gt;versioned config surfaces&lt;/li&gt;
&lt;li&gt;audit-log schema fragments&lt;/li&gt;
&lt;li&gt;decision-event or override-event schema tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as evidence that a repository may have traceability scaffolding.&lt;/p&gt;

&lt;p&gt;That is useful.&lt;/p&gt;

&lt;p&gt;It is also bounded.&lt;/p&gt;

&lt;p&gt;The mapping document is explicit that changelog presence is not the same thing as deploy-time event logging, and that current scope does not establish runtime log completeness.&lt;/p&gt;

&lt;p&gt;So the output can legitimately say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is structural evidence relevant to traceability review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;while refusing to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this system satisfies traceability obligations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of distinction I wanted this lane to enforce.&lt;/p&gt;

&lt;p&gt;What this buys in practice is not a compliance shortcut, but a faster review question. If a repository exposes none of the scaffolding signals in this lane — no change history, no artifact hashes, no versioned manifests, no event-schema surfaces — then there is very little reason to treat it as traceability-ready for deeper institutional review. If those signals do exist, the next step is still expert inspection, but the scanner has at least opened the right folder and pointed at the right files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why regulatory mapping stayed subordinate to evidence
&lt;/h3&gt;

&lt;p&gt;This was non-negotiable.&lt;/p&gt;

&lt;p&gt;Regulatory relevance had to remain downstream from evidence, not a score multiplier pretending to be law.&lt;/p&gt;

&lt;p&gt;That is why the output shape separates things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;from the actual score computation.&lt;/p&gt;

&lt;p&gt;And it is not just decorative structure.&lt;/p&gt;

&lt;p&gt;The regulatory basis object is registry-driven. It can mark &lt;code&gt;review_required&lt;/code&gt; when the basis registry is stale or required source families are missing. That is a traceability control on the mapping layer itself, not an input into the scoring formula.&lt;/p&gt;

&lt;p&gt;This is also why the regulatory note belongs in a muted traceability panel, not next to the main score.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is useful.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is still not compliance.&lt;/p&gt;

&lt;p&gt;The distinction has to remain visible in both the code and the artifacts.&lt;/p&gt;
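&lt;p&gt;The &lt;code&gt;review_required&lt;/code&gt; control described above could look roughly like this. The registry field names, the 180-day staleness window, and the &lt;code&gt;regulatory_basis&lt;/code&gt; function signature are all assumptions made for the sketch; the source only states that the flag is raised when the registry is stale or required source families are missing.&lt;/p&gt;

```python
from datetime import date


def regulatory_basis(registry, today, max_age_days=180):
    # max_age_days and every registry field name here are assumptions.
    age_days = (today - registry["last_reviewed"]).days
    missing = sorted(
        set(registry["required_families"]) - set(registry["present_families"])
    )
    reasons = []
    if age_days > max_age_days:
        reasons.append("registry_stale")
    if missing:
        reasons.append("missing_source_families")
    # The flag describes the mapping layer itself; no score is read or written.
    return {
        "review_required": bool(reasons),
        "reasons": reasons,
        "missing_families": missing,
    }


registry = {
    "last_reviewed": date(2025, 1, 1),
    "required_families": ["EU AI Act", "FDA", "IMDRF"],
    "present_families": ["EU AI Act", "FDA"],
}
basis = regulatory_basis(registry, today=date(2026, 1, 1))
print(basis["review_required"], basis["reasons"])
```

&lt;p&gt;The function takes no score input and returns no score output, so it can only qualify the mapping, never the tier.&lt;/p&gt;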


&lt;h2&gt;
  
  
  &lt;strong&gt;4. Optional AI advisory&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" alt="Enforcing a bounded intelligence sandbox" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fourth lane is the advisory layer.&lt;/p&gt;

&lt;p&gt;This one exists for bounded model-assisted review, but it does not get to rewrite the official outcome.&lt;/p&gt;

&lt;p&gt;That means workflows like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo
stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can exist without creating ambiguity about who owns the formal result.&lt;/p&gt;

&lt;p&gt;The advisory layer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export a provider-neutral packet&lt;/li&gt;
&lt;li&gt;validate downstream response structure&lt;/li&gt;
&lt;li&gt;enforce finding-ID citation rules&lt;/li&gt;
&lt;li&gt;reject prohibited claims&lt;/li&gt;
&lt;li&gt;surface runtime and secret boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it cannot do is silently override:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;score.final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;score.formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How that rule is actually enforced
&lt;/h3&gt;

&lt;p&gt;This is not just policy language in the README.&lt;/p&gt;

&lt;p&gt;The advisory validator explicitly checks for score-override attempts. If a response includes fields like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;or sets &lt;code&gt;final_score_override&lt;/code&gt;, the response is marked invalid with &lt;code&gt;final_score_override_requested&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The packet contract also exports the rule in plain language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not modify or override &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, &lt;code&gt;replication_score&lt;/code&gt;, or &lt;code&gt;replication_tier&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And provider responses must cite exact values from &lt;code&gt;allowed_finding_ids&lt;/code&gt;; citation strings are not repaired or loosely matched later.&lt;/p&gt;

&lt;p&gt;So the advisory lane is bounded in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it has no authority to change the deterministic result&lt;/li&gt;
&lt;li&gt;it cannot cite evidence outside the bounded packet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of mechanism I mean when I say “better boundaries.” If the rule cannot be checked, it is not really part of the architecture yet.&lt;/p&gt;
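&lt;p&gt;As a rough illustration of those two checks, the sketch below rejects any response that carries a protected field or an override flag, and accepts only exact matches against &lt;code&gt;allowed_finding_ids&lt;/code&gt;. The function name, the &lt;code&gt;cited_finding_ids&lt;/code&gt; response key, and the &lt;code&gt;unknown_finding_id&lt;/code&gt; error string are assumptions; the protected field names, &lt;code&gt;final_score_override&lt;/code&gt;, and the &lt;code&gt;final_score_override_requested&lt;/code&gt; marker come from the behavior described above.&lt;/p&gt;

```python
PROTECTED_FIELDS = (
    "final_score",
    "formal_tier",
    "replication_score",
    "replication_tier",
)


def check_response(response, allowed_finding_ids):
    errors = []
    # Any protected field, or an explicit override flag, invalidates the response.
    carries_protected = any(field in response for field in PROTECTED_FIELDS)
    if carries_protected or response.get("final_score_override"):
        errors.append("final_score_override_requested")
    # Citations must match exactly; nothing is repaired or loosely matched.
    allowed = set(allowed_finding_ids)
    for cited in response.get("cited_finding_ids", []):  # assumed key name
        if cited not in allowed:
            errors.append("unknown_finding_id:" + cited)
    return {"valid": not errors, "errors": errors}


allowed_ids = ["F-001", "F-002"]  # made-up finding IDs
good = check_response({"cited_finding_ids": ["F-001"]}, allowed_ids)
bad = check_response(
    {"formal_tier": "T4", "cited_finding_ids": ["f-001"]}, allowed_ids
)
print(good["valid"], bad["errors"])
```

&lt;p&gt;Note that the lowercase &lt;code&gt;f-001&lt;/code&gt; citation fails: exact matching is the design choice, since tolerant matching would let a provider cite evidence the packet never contained.&lt;/p&gt;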




&lt;h2&gt;
  
  
  &lt;strong&gt;What operational use looks like now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" alt="One execution driving distinct operator surfaces" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once these lanes were separated, the CLI became much easier to reason about.&lt;/p&gt;

&lt;p&gt;Local engineering review:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI/CD gate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem gate /path/to/repo &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2 &lt;span class="nt"&gt;--summary&lt;/span&gt; off &lt;span class="nt"&gt;--output&lt;/span&gt; results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Offline advisory packet generation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo &lt;span class="nt"&gt;--output&lt;/span&gt; advisory_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Downstream provider response validation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important point is not just that these commands exist.&lt;/p&gt;

&lt;p&gt;It is that each one represents a distinct trust boundary.&lt;/p&gt;

&lt;p&gt;That made the project feel more like engineering infrastructure and less like a scoring demo.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A real v1.6.2 packet&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make that less abstract, I re-ran STEM BIO-AI v1.6.2 against a local clone of &lt;a href="https://github.com/ClawBio/ClawBio" rel="noopener noreferrer"&gt;ClawBio&lt;/a&gt;, which describes itself as a local-first, privacy-focused, reproducible bioinformatics-native AI skill library.&lt;/p&gt;

&lt;p&gt;The command was:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; stem_ai.cli scan /path/to/ClawBio &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" alt="ClawBio_ClawBio_detailed_5p-1" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" alt="ClawBio_ClawBio_detailed_5p-2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On my machine, that run took about &lt;strong&gt;9.4 seconds&lt;/strong&gt; and emitted the usual CLI output set: a machine-readable JSON result, a Markdown report, a 5-page PDF packet, and a line-level explain trace.&lt;/p&gt;

&lt;p&gt;Before the numbers, the important context is that STEM BIO-AI uses a published triage scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;T0&lt;/code&gt; = 0-39&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T1&lt;/code&gt; = 40-54&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T2&lt;/code&gt; = 55-69&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T3&lt;/code&gt; = 70-84&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T4&lt;/code&gt; = 85-100&lt;/li&gt;
&lt;/ul&gt;
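&lt;p&gt;The published scale above can be read as a simple lookup from score to tier. The sketch below assumes integer scores from 0 to 100; the &lt;code&gt;formal_tier&lt;/code&gt; function and the use of &lt;code&gt;bisect&lt;/code&gt; are illustrative choices, not a description of how the scanner implements it.&lt;/p&gt;

```python
from bisect import bisect_right

# Lowest score of T1, T2, T3, and T4 on the published scale.
TIER_FLOORS = (40, 55, 70, 85)
TIERS = ("T0", "T1", "T2", "T3", "T4")


def formal_tier(final_score: int) -> str:
    # Assumes integer scores; anything outside 0..100 is rejected.
    if final_score not in range(0, 101):
        raise ValueError("final_score must be an integer from 0 to 100")
    # Count how many tier floors the score has reached.
    return TIERS[bisect_right(TIER_FLOORS, final_score)]


print(formal_tier(67))  # prints T2
```

&lt;p&gt;Boundary values land where the scale says they should: 39 is &lt;code&gt;T0&lt;/code&gt;, 40 is &lt;code&gt;T1&lt;/code&gt;, and 85 is &lt;code&gt;T4&lt;/code&gt;.&lt;/p&gt;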

&lt;p&gt;Stage 4 replication is reported separately as its own lane, where &lt;code&gt;R2&lt;/code&gt; means some reproducibility scaffolding is present, but not yet enough to call the repository replication-strong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Governance note:&lt;br&gt;
This is not a “bad repository” scoreboard, a clinical safety verdict, or a moral ranking. It is a deterministic evidence-surface pre-screen intended to support review, not replace it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the result was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;67 / 100&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T2 Caution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication lane: 55 / 100 (R2)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clinical adjacency: CA-DIRECT&lt;/strong&gt; (the repository surface makes direct healthcare-facing claims, even though it also carries an explicit non-clinical boundary)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code integrity warnings: C2 dependency pinning, C4 exception handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the workflow shift I wanted the tool to support.&lt;/p&gt;

&lt;p&gt;The same deterministic scan is rendered into multiple operator surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON for automation&lt;/li&gt;
&lt;li&gt;Markdown for review&lt;/li&gt;
&lt;li&gt;PDF for human-facing packet inspection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--explain&lt;/code&gt; for file / line / snippet proof tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That output shape is only possible because the result object already separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formal score and tier&lt;/li&gt;
&lt;li&gt;replication lane&lt;/li&gt;
&lt;li&gt;diagnostics lane&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;advisory boundary state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the PDF is not a separate product. It is a view over the same bounded audit object.&lt;/p&gt;

&lt;p&gt;Two details from this run are worth calling out.&lt;/p&gt;

&lt;p&gt;First, the scanner did &lt;strong&gt;not&lt;/strong&gt; manufacture chemistry findings just because ClawBio is bio-adjacent. The deterministic diagnostics lane reported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SMILES Surface Integrity: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES RDKit Validation: not_applicable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES Parser Guard: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the behavior I want. If a detector has no evidence, it should say so plainly instead of inflating the report with domain-flavored noise. This is what the earlier thesis looks like when it hits real output: a detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;

&lt;p&gt;Second, the score is strict about observable repository conventions. ClawBio uses &lt;code&gt;ClawBio_README_Repo.md&lt;/code&gt; rather than a root &lt;code&gt;README.md&lt;/code&gt;, so the scan records &lt;code&gt;S1_missing_readme: -20&lt;/code&gt;. A human reviewer might decide that this is acceptable contextually. The scanner does not make that leap for them. It only records what the repository exposes through the surfaces it knows how to measure.&lt;/p&gt;

&lt;p&gt;That distinction matters. A &lt;code&gt;T2 Caution&lt;/code&gt; result here does not mean “ClawBio is unsafe.” It means the current repository surface still raises review-relevant signals under the published deterministic rules, including dependency-pinning warnings, exception-handling warnings in a clinical-adjacent surface, and a stricter-than-human README convention check.&lt;/p&gt;

&lt;p&gt;And that is exactly why the next section matters: once the workflow is concrete, the remaining question is not whether the tool can produce an answer, but where its current boundaries still need to stay visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What still has to stay bounded&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system is better than it was, but there are still obvious next steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The public surface is broad
&lt;/h3&gt;

&lt;p&gt;There is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring&lt;/li&gt;
&lt;li&gt;diagnostics&lt;/li&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;advisory packeting&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;JSON / Markdown / PDF / explain outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful, but it increases onboarding cost.&lt;/p&gt;

&lt;p&gt;The CLI is clearer now, but the broader public surface has to stay disciplined.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The deterministic diagnostics lane is still missing a published calibration threshold
&lt;/h3&gt;

&lt;p&gt;The diagnostics lane is evidence-first by design, but one practical gap remains: the public release does not yet ship a benchmark-backed threshold document saying exactly when a detector is mature enough to graduate from evidence-only into score-bearing territory.&lt;/p&gt;

&lt;p&gt;Right now the rule is conceptually clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;commit-pinned fixtures&lt;/li&gt;
&lt;li&gt;reproducible detector output&lt;/li&gt;
&lt;li&gt;explicit false-positive review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the public decision boundary is still partly narrative. Until that calibration surface is published in a more operational form, keeping diagnostics evidence-only is the safer choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The regulatory confidence labels are rule-authored, not empirically validated
&lt;/h3&gt;

&lt;p&gt;The mapping labels like &lt;code&gt;strong&lt;/code&gt;, &lt;code&gt;moderate&lt;/code&gt;, and &lt;code&gt;weak-moderate&lt;/code&gt; are currently fixed rule-level judgments in the mapping document. They are not runtime model outputs, but they are also not yet backed by inter-rater reliability studies or a published reviewer-agreement benchmark.&lt;/p&gt;

&lt;p&gt;That means they are useful as bounded structural mapping language, but they should not be treated as empirical proof that multiple auditors would converge on exactly the same label distribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Earlier context&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;Medical AI Repositories Need More Than Benchmarks. We Built STEM-AI to Audit Trust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Try it yourself&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;STEM BIO-AI is Apache 2.0 and fully open source.&lt;/p&gt;

&lt;p&gt;If you want to know whether a bio/medical AI repository is actually exposing reviewable evidence, or whether your own repository is weaker than you think, run it yourself.&lt;/p&gt;

&lt;p&gt;That is the real test.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;License: Apache 2.0&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier STEM-AI posts were about why repository trust deserves its own audit layer.&lt;/p&gt;

&lt;p&gt;This phase was about something more practical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;what does that audit layer have to look like if an engineer is actually going to run it, inspect it, and put it in a pipeline?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, the answer was simple:&lt;/p&gt;

&lt;p&gt;Separate the workflows.&lt;br&gt;
Separate the lanes.&lt;br&gt;
Keep diagnostics evidence-first.&lt;br&gt;
Keep regulatory mapping subordinate to evidence.&lt;br&gt;
Keep advisory AI bounded.&lt;/p&gt;

&lt;p&gt;Optimize for inspectability, not just score production.&lt;/p&gt;

&lt;p&gt;That is what changed the project.&lt;/p&gt;

&lt;p&gt;Not bigger claims.&lt;/p&gt;

&lt;p&gt;Better boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" alt="Final thought" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>opensource</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:51:45 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</link>
      <guid>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Previous article:&lt;br&gt;
&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;&lt;strong&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.&lt;/p&gt;

&lt;p&gt;The more useful lesson was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Text-only review is too weak for bio/medical AI. You have to inspect the code path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That worked.&lt;/p&gt;

&lt;p&gt;But it exposed the next problem.&lt;/p&gt;

&lt;p&gt;If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?&lt;/p&gt;

&lt;p&gt;LLMs drift. &lt;br&gt;
One session can enforce a clinical boundary strictly. &lt;br&gt;
Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 is my answer to that problem.&lt;/p&gt;

&lt;p&gt;It does not try to make the LLM deterministic by writing a longer prompt.&lt;/p&gt;

&lt;p&gt;It binds the audit to a memory contract.&lt;/p&gt;


&lt;h2&gt;
  
  
  What v1.1.2 adds
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" alt="standard audit vs Bio/Medical AI audit" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 introduces &lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA: Memory-Injected Contract Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The v1.1.2 layer includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memory/mica.yaml&lt;/code&gt; -- composition contract&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai.mica.v1.1.2.json&lt;/code&gt; -- machine-checkable memory archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-playbook.v1.1.2.md&lt;/code&gt; -- session playbook and drift guard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-lessons.v1.1.2.md&lt;/code&gt; -- historical failure-mode archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spec/STEM-AI_v1.1.2_CORE.md&lt;/code&gt; -- canonical audit spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract pins 18 invariants.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage order is fixed: README intent, cross-platform evidence, code/bio evidence.&lt;/li&gt;
&lt;li&gt;Stage weights are fixed.&lt;/li&gt;
&lt;li&gt;Tier boundaries are fixed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; cannot be bypassed.&lt;/li&gt;
&lt;li&gt;Stage 2 uses external evidence; in LOCAL_ANALYSIS mode it may use Stage 2R repo-local consistency instead.&lt;/li&gt;
&lt;li&gt;Governance overlay cannot raise the formal base tier.&lt;/li&gt;
&lt;li&gt;C1-C4 code-integrity checks run only in LOCAL_ANALYSIS mode.&lt;/li&gt;
&lt;li&gt;Mandatory clinical-use disclaimers cannot be omitted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a claim that the LLM becomes perfectly deterministic.&lt;/p&gt;

&lt;p&gt;It is a narrower claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The auditor is forced to operate inside a contract whose scoring rules, hard floors, and evidence requirements are inspectable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the useful layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  What "loading the contract" means
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" alt="Forcing the auditor to operate inside a machine-checkable memory contract" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA&lt;/a&gt;&lt;/strong&gt; is not hidden model memory.&lt;/p&gt;

&lt;p&gt;It is also not a claim that the model provider changed the LLM.&lt;/p&gt;

&lt;p&gt;In v1.1.2, "loading the contract" means the audit session starts by reading a fixed set of repository files before it is allowed to score the target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory/mica.yaml
memory/stem-ai.mica.v1.1.2.json
memory/stem-ai-playbook.v1.1.2.md
memory/stem-ai-lessons.v1.1.2.md
spec/STEM-AI_v1.1.2_CORE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" alt="Pinning the audit rules mathematically" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The auditor then performs a pre-execution contract test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirm the canonical spec exists&lt;/li&gt;
&lt;li&gt;confirm the memory archive exists&lt;/li&gt;
&lt;li&gt;confirm the invariant count is 18&lt;/li&gt;
&lt;li&gt;confirm the fixed tier boundaries are present&lt;/li&gt;
&lt;li&gt;confirm the Stage 2 / Stage 2R lane rule is present&lt;/li&gt;
&lt;li&gt;confirm Stage 3G cannot raise the formal tier&lt;/li&gt;
&lt;li&gt;confirm C1-C4 mode gating is active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does the audit proceed.&lt;/p&gt;

&lt;p&gt;This does not make the LLM mathematically deterministic.&lt;/p&gt;

&lt;p&gt;It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.&lt;/p&gt;

&lt;p&gt;That is the difference between &lt;strong&gt;"please be consistent"&lt;/strong&gt; and &lt;strong&gt;"execute this versioned contract."&lt;/strong&gt;&lt;/p&gt;
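
&lt;p&gt;As a minimal sketch of the pre-execution test described above, the file and invariant-count checks could look like the following. The helper name &lt;code&gt;preflight&lt;/code&gt; and the &lt;code&gt;invariants&lt;/code&gt; key are assumptions made for illustration; the real archive schema may differ, and this is not the project's own loader.&lt;/p&gt;

```python
import json
from pathlib import Path

# The five contract files named in this article.
CONTRACT_FILES = [
    "memory/mica.yaml",
    "memory/stem-ai.mica.v1.1.2.json",
    "memory/stem-ai-playbook.v1.1.2.md",
    "memory/stem-ai-lessons.v1.1.2.md",
    "spec/STEM-AI_v1.1.2_CORE.md",
]
EXPECTED_INVARIANTS = 18


def preflight(root, invariants_key="invariants"):
    """Return a list of problems; an empty list means scoring may proceed.

    ASSUMPTION: the archive stores its invariants as a JSON list under
    `invariants_key`. That key name is hypothetical, not the real schema.
    """
    root = Path(root)
    problems = [f"missing: {name}" for name in CONTRACT_FILES
                if not (root / name).is_file()]

    archive = root / "memory/stem-ai.mica.v1.1.2.json"
    if archive.is_file():
        try:
            data = json.loads(archive.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("archive is not valid JSON")
        else:
            items = data.get(invariants_key, []) if isinstance(data, dict) else []
            if len(items) != EXPECTED_INVARIANTS:
                problems.append(
                    f"invariant count is {len(items)}, expected {EXPECTED_INVARIANTS}"
                )
    return problems
```

&lt;p&gt;An empty list means the session may move on to scoring. Anything else is a reason to stop first.&lt;/p&gt;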




&lt;h2&gt;
  
  
  The audit workflow
&lt;/h2&gt;

&lt;p&gt;STEM-AI v1.1.2 runs as a structured audit workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" alt="STEM-AI v1.1.2 workflow" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.&lt;/p&gt;

&lt;p&gt;It can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;package metadata&lt;/li&gt;
&lt;li&gt;workflow files&lt;/li&gt;
&lt;li&gt;test definitions&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;source-code paths&lt;/li&gt;
&lt;li&gt;deprecated or dead-code paths&lt;/li&gt;
&lt;li&gt;exception handling&lt;/li&gt;
&lt;li&gt;credential patterns&lt;/li&gt;
&lt;li&gt;provenance and hash-checking logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is intentionally split into two files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report.md                  # human-readable audit judgment
experiment_results.json    # machine-readable evidence and score object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" alt="Separating subjective reasoning from verifiable mathematics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That split matters.&lt;/p&gt;

&lt;p&gt;The report explains the reasoning.&lt;/p&gt;

&lt;p&gt;The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.&lt;/p&gt;




&lt;h2&gt;
  
  
  A real target audit, not a synthetic example
&lt;/h2&gt;

&lt;p&gt;For this v1.1.2 demonstration, I used a real public repository:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/artic-network/fieldbioinformatics" rel="noopener noreferrer"&gt;artic-network/fieldbioinformatics&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The target is not the protagonist of this post.&lt;/p&gt;

&lt;p&gt;It is only the specimen used to show the audit workflow against a real bioinformatics codebase.&lt;/p&gt;

&lt;p&gt;The local audit produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audits/fieldbioinformatics_v1_1_2/report.md
audits/fieldbioinformatics_v1_1_2/experiment_results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The target snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"master"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8008b4c97c2193a82308ff6f0be507b1d9306e36"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;114&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the important part: the audit did not ask, "Does this README sound trustworthy?"&lt;/p&gt;

&lt;p&gt;It asked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do README claims match actual package metadata and entry points?&lt;/li&gt;
&lt;li&gt;Are there real CI and domain-specific tests?&lt;/li&gt;
&lt;li&gt;Are dependencies reproducible enough?&lt;/li&gt;
&lt;li&gt;Are there credential leaks?&lt;/li&gt;
&lt;li&gt;Are there deprecated patient-adjacent paths?&lt;/li&gt;
&lt;li&gt;Do clinical-adjacent output paths fail closed?&lt;/li&gt;
&lt;li&gt;Does the repository include governance evidence, or is governance simply absent?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where STEM-AI is useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  The score object
&lt;/h2&gt;

&lt;p&gt;The machine-readable result records the score like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_1_readme_intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_cross_platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_repo_local_consistency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_lane"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STAGE_2R_REPO_LOCAL_CONSISTENCY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"External Stage 2 was not collected; LOCAL_ANALYSIS used Stage 2R in the fixed 0.20 Stage 2 slot."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_3_code_bio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_penalty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"formal_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"T2 Caution"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;External Stage 2 is explicitly represented as &lt;code&gt;null&lt;/code&gt; for this local-only audit.&lt;/p&gt;

&lt;p&gt;That does not mean cross-platform consistency is unimportant.&lt;/p&gt;

&lt;p&gt;It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.&lt;/p&gt;

&lt;p&gt;Stage 2R asks whether the repository's own surfaces agree with each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README vs package metadata and CLI entry points&lt;/li&gt;
&lt;li&gt;README vs docs, tutorials, and troubleshooting&lt;/li&gt;
&lt;li&gt;README test claims vs CI workflow and test definitions&lt;/li&gt;
&lt;li&gt;clinical-adjacent outputs vs local intended-use boundaries&lt;/li&gt;
&lt;/ul&gt;
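
&lt;p&gt;The lane choice can be sketched in a few lines. The field names come from the score object shown earlier; the function name, the &lt;code&gt;STAGE_2_EXTERNAL&lt;/code&gt; label, the rule that external evidence takes precedence when present, and the error raised when both values are missing are assumptions of this sketch, not quotes from the contract.&lt;/p&gt;

```python
def select_stage_2(score):
    """Pick the lane and value that fill the fixed Stage 2 slot.

    ASSUMPTION: external Stage 2 is used when it was collected, and
    Stage 2R is the fallback. The contract text may word this differently.
    """
    external = score.get("stage_2_cross_platform")
    if external is not None:
        # Hypothetical lane label for the external case.
        return "STAGE_2_EXTERNAL", external

    local = score.get("stage_2_repo_local_consistency")
    if local is None:
        # Refuse to invent a number when neither lane has evidence.
        raise ValueError("no Stage 2 or Stage 2R evidence recorded")
    return "STAGE_2R_REPO_LOCAL_CONSISTENCY", local


lane, value = select_stage_2({
    "stage_2_cross_platform": None,
    "stage_2_repo_local_consistency": 75,
})
print(lane, value)
```

&lt;p&gt;The point of returning the lane name alongside the value is that the report can state which kind of evidence filled the slot instead of leaving the reader to guess.&lt;/p&gt;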

&lt;p&gt;The contract defines the fixed-weight calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - Risk Penalty
      = (65 x 0.40) + (75 x 0.20) + (55 x 0.40) - 0
      = 26 + 15 + 22 - 0
      = 63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A score of 63 falls inside the 55-69 band, so the formal tier is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T2 Caution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not because the prose sounded balanced.&lt;/p&gt;

&lt;p&gt;Because the contract math forces that result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the T0 hard floor did not trigger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" alt="Why the T0 hard floor did not trigger" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.&lt;/p&gt;

&lt;p&gt;In simplified form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If a repository is CA-DIRECT
and it has no substantive code implementation,
then final tier = T0 regardless of score math.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.&lt;/p&gt;
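
&lt;p&gt;Expressed as code, the simplified rule is a short override applied after the score math. This is only a sketch of the simplified form above: the function name and the boolean &lt;code&gt;has_substantive_code&lt;/code&gt; are illustrative assumptions, and the real contract may carry more conditions.&lt;/p&gt;

```python
def apply_t0_hard_floor(ca_severity, has_substantive_code, score_tier):
    """Return (final_tier, floor_triggered) for the simplified rule.

    The floor ignores the numeric score entirely: a CA-DIRECT repository
    with no substantive implementation is T0 whatever the math says.
    """
    triggered = ca_severity == "CA-DIRECT" and not has_substantive_code
    final_tier = "T0" if triggered else score_tier
    return final_tier, triggered


# A CA-DIRECT repository without real code is forced down to T0.
print(apply_t0_hard_floor("CA-DIRECT", False, "T3"))
# A CA-INDIRECT repository with real code keeps its score-derived tier.
print(apply_t0_hard_floor("CA-INDIRECT", True, "T2"))
```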

&lt;p&gt;The audited repository did not trigger that floor because STEM-AI classified it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clinical_adjacent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ca_severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"t0_hard_floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The repository produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface made no direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.&lt;/p&gt;

&lt;p&gt;So the result is not T0.&lt;/p&gt;

&lt;p&gt;But it is also not high-trust.&lt;/p&gt;

&lt;p&gt;The bounded result is T2 Caution.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" alt="Stem-AI Audit v1.1.2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code-integrity findings
&lt;/h2&gt;

&lt;p&gt;The same JSON records C1-C4 LOCAL_ANALYSIS checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C1_hardcoded_credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C2_dependency_pinning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C3_dead_or_deprecated_patient_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C4_exception_handling_clinical_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the difference between a general review and a code-path audit.&lt;/p&gt;

&lt;p&gt;A text review can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project appears technically mature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A code-path audit can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Credential patterns were checked. Dependency pinning is weak. Deprecated patient-adjacent metadata exists. One clinical-adjacent filtering path does not fail closed on missing depth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a more useful governance object.&lt;/p&gt;

&lt;p&gt;It is not a certificate.&lt;/p&gt;

&lt;p&gt;It is a map of what a reviewer should trust, distrust, or inspect next.&lt;/p&gt;
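
&lt;p&gt;One way to turn that map into a to-do list is to pull out every check that did not pass. The sketch below assumes the C1-C4 object has exactly the shape shown above, with a &lt;code&gt;status&lt;/code&gt; field per check; the helper name &lt;code&gt;follow_ups&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def follow_ups(checks):
    """Return the names of checks whose status is not PASS, sorted by name."""
    return sorted(
        name for name, result in checks.items()
        if result.get("status") != "PASS"
    )


# The statuses recorded for the fieldbioinformatics audit in this article.
checks = {
    "C1_hardcoded_credentials": {"status": "PASS"},
    "C2_dependency_pinning": {"status": "WARN"},
    "C3_dead_or_deprecated_patient_adjacent_paths": {"status": "WARN"},
    "C4_exception_handling_clinical_adjacent_paths": {"status": "WARN"},
}

for name in follow_ups(checks):
    print("inspect next:", name)
```

&lt;p&gt;A check with a missing &lt;code&gt;status&lt;/code&gt; is treated as a follow-up too, which errs on the side of a human looking at it.&lt;/p&gt;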




&lt;h2&gt;
  
  
  A small Python verifier
&lt;/h2&gt;

&lt;p&gt;Here is a small dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It does not need the target's private code or any patient data; it only checks the machine-readable audit result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;


&lt;span class="n"&gt;RESULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audits/fieldbioinformatics_v1_1_2/experiment_results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;█&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;░&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RESULT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1_readme_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2_repo_local_consistency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3_code_bio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;risk_penalty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;risk_penalty&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 1  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 2R &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 3  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_integrity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1   65/100  █████████████░░░░░░░
Stage 2R  75/100  ███████████████░░░░░
Stage 3   55/100  ███████████░░░░░░░░░
Final     63/100  █████████████░░░░░░░
Tier      T2 Caution
C1_hardcoded_credentials: PASS
C2_dependency_pinning: WARN
C3_dead_or_deprecated_patient_adjacent_paths: WARN
C4_exception_handling_clinical_adjacent_paths: WARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Bio/medical AI governance is full of language that sounds safe but is hard to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"research use only"&lt;/li&gt;
&lt;li&gt;"not medical advice"&lt;/li&gt;
&lt;li&gt;"validated pipeline"&lt;/li&gt;
&lt;li&gt;"clinical-grade"&lt;/li&gt;
&lt;li&gt;"responsible AI"&lt;/li&gt;
&lt;li&gt;"human-in-the-loop"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those phrases are not enough.&lt;/p&gt;

&lt;p&gt;STEM-AI asks for observable structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source-code reality&lt;/li&gt;
&lt;li&gt;test reality&lt;/li&gt;
&lt;li&gt;CI reality&lt;/li&gt;
&lt;li&gt;dependency reality&lt;/li&gt;
&lt;li&gt;clinical boundary reality&lt;/li&gt;
&lt;li&gt;governance artifact reality&lt;/li&gt;
&lt;li&gt;code-integrity reality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.1.2 adds another layer:&lt;/p&gt;

&lt;p&gt;auditor reality.&lt;/p&gt;

&lt;p&gt;The AI auditor itself has to load a memory contract before it scores.&lt;/p&gt;

&lt;p&gt;That is what MICA is for.&lt;/p&gt;

&lt;p&gt;The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.&lt;/p&gt;

&lt;p&gt;Not hype.&lt;/p&gt;

&lt;p&gt;Not rejection by default.&lt;/p&gt;

&lt;p&gt;A bounded trust judgment with evidence paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The follow-on lane should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provision the target dependency environment&lt;/li&gt;
&lt;li&gt;run selected target tests in a controlled shell&lt;/li&gt;
&lt;li&gt;capture command, exit code, environment hash, and output digest&lt;/li&gt;
&lt;li&gt;attach a replay manifest to &lt;code&gt;experiment_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep runtime evidence separate from source/document/CI evidence&lt;/li&gt;
&lt;/ul&gt;
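&lt;p&gt;A minimal sketch of what one entry in such a replay manifest could look like. The field names, the &lt;code&gt;run_and_record&lt;/code&gt; helper, and the way the environment is fingerprinted are assumptions for illustration, not the STEM-AI schema; a real lane would more likely hash a lockfile or container image.&lt;/p&gt;

```python
import hashlib
import json
import platform
import subprocess
import sys


def sha256_text(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def run_and_record(command: list[str]) -> dict:
    """Run one command and return a manifest entry (hypothetical schema)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    # Assumption: interpreter version plus platform string stands in for the environment.
    env_fingerprint = json.dumps(
        {"python": sys.version, "platform": platform.platform()},
        sort_keys=True,
    )
    return {
        "command": command,
        "exit_code": completed.returncode,
        "environment_hash": sha256_text(env_fingerprint),
        "output_digest": sha256_text(completed.stdout + completed.stderr),
        # Tagged so runtime evidence stays apart from source/document/CI evidence.
        "evidence_class": "runtime",
    }


entry = run_and_record([sys.executable, "-c", "print('ok')"])
print(json.dumps(entry, indent=2))
```

&lt;p&gt;Storing digests rather than raw output keeps the manifest small while still letting a reviewer confirm that a replay produced the same bytes.&lt;/p&gt;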

&lt;p&gt;For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" alt="Stem-AI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;STEM-AI is &lt;strong&gt;not a clinical certifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is also &lt;strong&gt;not trying to replace scientific review, regulatory review, or domain experts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Its role is narrower: &lt;strong&gt;make the governance conversation start from observable evidence instead of presentation quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, that means asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the repository claim?&lt;/li&gt;
&lt;li&gt;What does the code actually implement?&lt;/li&gt;
&lt;li&gt;Do the local surfaces agree with each other?&lt;/li&gt;
&lt;li&gt;Are the tests domain-specific or merely infrastructural?&lt;/li&gt;
&lt;li&gt;Are clinical-adjacent boundaries explicit?&lt;/li&gt;
&lt;li&gt;Can the auditor's own scoring logic be inspected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where I think STEM-AI belongs in AI governance.&lt;/p&gt;

&lt;p&gt;Not as the final authority.&lt;/p&gt;

&lt;p&gt;As the evidence gate before authority is invoked.&lt;/p&gt;

&lt;p&gt;It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does this repository establish enough observable trust to be considered, contained, or rejected?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>bioinformatics</category>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Each /slop Is a Calibration Signal — AI-SLOP Detector v3.6.0 and the Claude Code Skill</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:04:44 +0000</pubDate>
      <link>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</link>
      <guid>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" alt="The Quiet Failure of AI Development" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-assisted development has a quiet failure mode: the assistant that creates the pattern often becomes the assistant that reviews it.&lt;/p&gt;

&lt;p&gt;When you and Claude work inside the same session, you drift together. The review criteria shift with the assistant's habits. After enough sessions, the same assistant that wrote the hollow function body is also the one approving the pull request. There is no external reference point — unless you build one.&lt;/p&gt;

&lt;p&gt;That is the problem AI-SLOP Detector v3.6.0 addresses with the Claude Code skill.&lt;/p&gt;

&lt;p&gt;Every time you run &lt;code&gt;/slop&lt;/code&gt; inside a session, the scan result is recorded to a project-scoped history. When enough re-scan evidence accumulates, bounded self-calibration adjusts the detection weights for your codebase — automatically, without a manual command. The scanner does not drift with the session. It stays anchored to observed scan outcomes.&lt;/p&gt;

&lt;p&gt;It does not get smarter every time. It builds calibration signal every time. That is a more accurate claim, and the distinction matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Skill Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" alt="The Skill layer Quality Policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; claude-skills/slop-detector ~/.claude/skills/slop-detector
&lt;span class="c"&gt;# restart Claude Code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four slash commands become available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full project scan — interprets findings, prioritizes fixes, proposes patch plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-file [path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-file deep-dive — explains each metric, gives concrete fix per pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-gate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard gate decision — PASS or FAIL, lists blocking files with deficit_score &amp;gt;= 70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-spar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adversarial validation — probes metric boundaries, catches calibration drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
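&lt;p&gt;The gate rule in the table can be sketched in a few lines. The threshold of 70 comes from the table above; the &lt;code&gt;gate_decision&lt;/code&gt; helper, the input shape, and the file paths are hypothetical and do not reflect the tool's actual JSON output.&lt;/p&gt;

```python
GATE_THRESHOLD = 70.0  # from the table: files with deficit_score >= 70 block the gate


def gate_decision(file_scores: dict[str, float]) -> tuple[str, list[str]]:
    """Return ("PASS" or "FAIL", sorted list of blocking files)."""
    blocking = sorted(
        path for path, score in file_scores.items() if score >= GATE_THRESHOLD
    )
    return ("FAIL" if blocking else "PASS", blocking)


# Hypothetical per-file scores for illustration only.
scores = {"src/a.py": 12.5, "src/b.py": 70.0, "src/c.py": 69.9}
verdict, blockers = gate_decision(scores)
print(verdict, blockers)  # FAIL ['src/b.py']
```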

&lt;p&gt;The intended workflow inside a Claude session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. /slop               → baseline scan, identify top offenders
2. review findings     → Claude prioritizes by deficit_score
3. patch files         → fix patterns with Claude's help
4. /slop-file &amp;lt;path&amp;gt;   → verify improvement per file
5. /slop               → confirm project aggregate improved
6. /slop-gate          → gate decision before merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality policy lives in the skill layer. You do not have to re-explain what &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; means, or which patterns are critical, in every session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The LEDA Flywheel
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" alt="The LEDA Flywheel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that matters.&lt;/p&gt;

&lt;p&gt;LEDA is not model retraining. It is bounded weight calibration based on repeated scan outcomes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/slop&lt;/code&gt; runs &lt;code&gt;slop-detector --project . --json&lt;/code&gt; — without &lt;code&gt;--no-history&lt;/code&gt;. Every invocation auto-records results to &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;, tagged with a &lt;code&gt;project_id&lt;/code&gt; (sha256 of cwd) so signals never mix across different repositories.&lt;/p&gt;
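&lt;p&gt;As a rough illustration of project scoping, a path-derived identifier could be computed like this. Only "sha256 of cwd" comes from the description above; the &lt;code&gt;project_id_for&lt;/code&gt; helper and the choice to resolve the path before hashing are assumptions, and the tool may normalise paths differently.&lt;/p&gt;

```python
import hashlib
from pathlib import Path


def project_id_for(directory: Path) -> str:
    """Hex SHA-256 of the resolved directory path (assumed normalisation)."""
    return hashlib.sha256(str(directory.resolve()).encode("utf-8")).hexdigest()


# Two hypothetical checkouts; the directories do not need to exist.
repo_a = project_id_for(Path.cwd() / "repo_a")
repo_b = project_id_for(Path.cwd() / "repo_b")

print(repo_a)
print(repo_a != repo_b)  # True: different directories get different ids
```

&lt;p&gt;One consequence of a path-based id is that moving or re-cloning a repository to a new directory starts a fresh history for it.&lt;/p&gt;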

&lt;p&gt;After every &lt;strong&gt;10 re-scanned files&lt;/strong&gt;, the tool runs the LEDA self-calibration loop automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/slop called
    │
    ├─► scan result → recorded to history.db (project-scoped)
    │
    ├─► 10 re-scanned files milestone?
    │       └─► SelfCalibrator: 4D grid-search over run history
    │               (ldr × inflation × ddc × purity weights)
    │               └─► confidence gap &amp;gt; 0.10?
    │                       └─► .slopconfig.yaml updated silently
    │
    └─► next /slop → calibrated weights, sharper detection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibrator uses re-scanned files as signal — not raw record count. A file counts toward the milestone only when the tool has seen it improve or degrade across at least two runs. This prevents first-time project scans from triggering calibration on noise.&lt;/p&gt;
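&lt;p&gt;The counting rule can be sketched as follows. The milestone of 10 is from the text above; the record shape, the &lt;code&gt;rescanned_files&lt;/code&gt; helper, and the simplification of checking the milestone as a plain threshold are assumptions for illustration.&lt;/p&gt;

```python
from collections import defaultdict

MILESTONE = 10  # from the article: calibration runs after every 10 re-scanned files


def rescanned_files(records: list[tuple[str, float]]) -> set[str]:
    """Files seen in at least two runs whose score changed between runs."""
    scores_by_file = defaultdict(list)
    for path, deficit_score in records:
        scores_by_file[path].append(deficit_score)
    return {
        path
        for path, scores in scores_by_file.items()
        if len(scores) >= 2 and len(set(scores)) > 1
    }


# A first-time scan of 12 files: many records, but no file has a second run yet.
first_scan = [(f"mod_{i}.py", 40.0) for i in range(12)]
print(len(rescanned_files(first_scan)))  # 0

# Ten of those files are patched and scanned again with a different score.
second_scan = [(f"mod_{i}.py", 30.0) for i in range(10)]
changed = rescanned_files(first_scan + second_scan)
print(len(changed), len(changed) >= MILESTONE)  # 10 True
```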

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" alt="Constrained to Reality" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three constraints keep calibration bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain-anchored&lt;/strong&gt; — grid search is constrained to ±0.15 around domain baseline weights. Detection cannot drift outside the meaningful range for your project type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence gate&lt;/strong&gt; — only applies when the top candidate weight set beats the second by &amp;gt; 0.10. Ambiguous signals produce no change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift warnings&lt;/strong&gt; — &lt;code&gt;CalibrationResult.warnings&lt;/code&gt; flags any dimension that shifted &amp;gt; 0.25 from the anchor.&lt;/li&gt;
&lt;/ul&gt;
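&lt;p&gt;A simplified sketch of the first two constraints, assuming a coarse three-point grid per dimension and a caller-supplied scoring function. The ±0.15 bound, the 0.10 gap, and the four dimension names are from the text; &lt;code&gt;candidate_grid&lt;/code&gt;, &lt;code&gt;calibrate&lt;/code&gt;, and both scoring functions are hypothetical stand-ins, not the &lt;code&gt;SelfCalibrator&lt;/code&gt; implementation. Drift warnings are left out.&lt;/p&gt;

```python
from itertools import product

DIMENSIONS = ("ldr", "inflation", "ddc", "purity")
BOUND = 0.15           # grid stays within +/-0.15 of the domain baseline
CONFIDENCE_GAP = 0.10  # best candidate must beat the runner-up by more than this


def candidate_grid(baseline: dict[str, float]):
    """Yield weight sets on a coarse 4D grid around the baseline (3**4 = 81 sets)."""
    offsets = (-BOUND, 0.0, BOUND)
    for combo in product(offsets, repeat=len(DIMENSIONS)):
        yield {
            name: round(baseline[name] + delta, 4)
            for name, delta in zip(DIMENSIONS, combo)
        }


def calibrate(baseline: dict[str, float], score_fn) -> dict[str, float]:
    """Return the best candidate only if it clearly beats the runner-up."""
    ranked = sorted(candidate_grid(baseline), key=score_fn, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if score_fn(best) - score_fn(runner_up) > CONFIDENCE_GAP:
        return best
    return dict(baseline)  # ambiguous signal: no change


baseline = {"ldr": 0.25, "inflation": 0.25, "ddc": 0.25, "purity": 0.25}
target = {"ldr": 0.4, "inflation": 0.25, "ddc": 0.1, "purity": 0.25}


def decisive(weights: dict[str, float]) -> float:
    """Hypothetical fit score where exactly one candidate stands out."""
    return 1.0 if weights == target else 0.0


def ambiguous(weights: dict[str, float]) -> float:
    """Hypothetical fit score where many candidates tie for first place."""
    return weights["ldr"]


print(calibrate(baseline, decisive))   # the target weight set
print(calibrate(baseline, ambiguous))  # unchanged baseline
```

&lt;p&gt;In the second call, every candidate with the highest &lt;code&gt;ldr&lt;/code&gt; weight ties for first place, so the gap is zero and the baseline is kept.&lt;/p&gt;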

&lt;p&gt;&lt;code&gt;/slop-spar&lt;/code&gt; adds a separate adversarial layer: it probes known-pattern anchors, metric boundary cases, and existence conditions. When it detects that measured behavior has diverged from metric claims, it recommends &lt;code&gt;--self-calibrate --apply-calibration&lt;/code&gt; explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Data Shows — and What We Won't Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" alt="Workflow telemetry, not empty claims" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will not tell you that AI-SLOP Detector improves code quality by X%.&lt;/p&gt;

&lt;p&gt;We have not run a controlled study. We have not compared matched projects with and without the tool. Any number we put here would be a claim we cannot prove, and this tool is built specifically to catch that kind of thing.&lt;/p&gt;

&lt;p&gt;What we do have: the tool scanning itself. Every time a core module was changed, it got re-scanned. N = 14,367 records across all projects in &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not outcome evidence. It is workflow telemetry. Here is what the scan history shows for the eight most-improved files in this codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File                   Scans  Worst → Best   Improvement
─────────────────────────────────────────────────────────
ddc.py                   86   87.8 →  11.0    -76.8 pts
placeholder.py           92   70.3 →   0.0    -70.3 pts
cross_file.py            89   70.3 →   5.0    -65.3 pts
ci_gate.py               88   69.3 →   6.2    -63.1 pts
cli.py                   88   68.4 →   8.4    -60.0 pts
ldr.py                   90   58.0 →   0.1    -57.9 pts
python_advanced.py       95   74.0 →  18.0    -55.9 pts
context_jargon.py        86   55.7 →   5.0    -50.7 pts
─────────────────────────────────────────────────────────
Source: self-scan, history.db — not an independent study
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the weekly project aggregate (avg deficit score):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week      Avg Deficit   Critical Files   Note
────────────────────────────────────────────────────────
2026-W09     11.9            3           baseline
2026-W10     22.1           20           structural refactor spike
2026-W14     20.0           58           large feature addition
2026-W15     11.9           14           post-refactor recovery
2026-W17     12.2           13           current — stable CLEAN state
────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mechanism is not mysterious. Scan reveals structural problems → Claude sees exact pattern names and line references → Claude (or the developer) fixes them → rescan confirms improvement → LEDA registers the delta and adjusts detection weights accordingly.&lt;/p&gt;

&lt;p&gt;The loop does not guarantee quality. It makes quality visible, then measurable, then improvable.&lt;/p&gt;

&lt;p&gt;Whether that loop improves your codebase is something your &lt;code&gt;history.db&lt;/code&gt; will tell you — not us.&lt;/p&gt;




&lt;h2&gt;
  
  
  Also in v3.6.0
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" alt="System diagnostics &amp;amp; Protocol refinements" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI gate exit code fix.&lt;/strong&gt; &lt;code&gt;--ci-mode hard&lt;/code&gt; without &lt;code&gt;--ci-report&lt;/code&gt; returned exit code 0 even when files were flagged &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt;. The fix is two lines in &lt;code&gt;_evaluate_ci_gate()&lt;/code&gt; (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0d67997" rel="noopener noreferrer"&gt;&lt;code&gt;0d67997&lt;/code&gt;&lt;/a&gt;). The bug affected v3.1.1 through v3.5.0, and only when the gate was used without the reporting flag. A subprocess-level regression test was added to prevent recurrence (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0208af4" rel="noopener noreferrer"&gt;&lt;code&gt;0208af4&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pre-commit hooks rewritten.&lt;/strong&gt; All three hook variants now use &lt;code&gt;python -m slop_detector.cli&lt;/code&gt; as the entry point, which bypasses an exit-code issue in the Windows &lt;code&gt;.exe&lt;/code&gt; wrapper. The nonexistent &lt;code&gt;--severity high&lt;/code&gt; flag has been replaced with &lt;code&gt;--ci-mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/flamehaven01/AI-SLOP-Detector&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.6.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-detector&lt;/span&gt;           &lt;span class="c1"&gt;# hard gate&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-warn    # report only&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-patterns  # fast per-file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;VS Code Extension v3.6.0.&lt;/strong&gt; Version tracks core library. No behavior changes from v3.5.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shape of the Loop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" alt="An External reference point" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill + LEDA loop is the external reference point. Detection weights stay grounded in observed scan outcomes — files that improved across re-scans, files that stayed problematic — rather than in what the assistant believes is correct at any given moment.&lt;/p&gt;

&lt;p&gt;The loop does not guarantee quality. It makes quality visible, then measurable, then improvable.&lt;/p&gt;

&lt;p&gt;We won't tell you what percentage your code will improve. That would make us the thing we are trying to detect.&lt;/p&gt;

&lt;p&gt;The scanner is not Claude's opinion about code quality. It is a measurement that gets calibrated against reality, session by session. Your &lt;code&gt;history.db&lt;/code&gt; will tell you the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" alt="The Shape of the Loop" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/ai-slop-detector/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/CLAUDE_CODE_SKILL.md" rel="noopener noreferrer"&gt;Claude Code Skill docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/SELF_CALIBRATION.md" rel="noopener noreferrer"&gt;Self-Calibration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>claudeai</category>
      <category>codequality</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
